Sampling method is the most efficient and popular data validation method, even it has little coverage with lots of risk. Here I would introduce an feature points sampling method. The reason to add "feature points" is to avoid generic random points that may not cover as much as possible for the report validation.
Feature points sampling method takes some special points and generic points to validate a reporting result, and ensure these points have most coverage of the reporting range. Feature points contains less data, and could be validated by human calculation easily. The results of these points can validate the query logic correctness. Once the query logic being validated we could add more complex data combination as generic points and finally compare the query results and report results.
Let's take a real automation report as an example. The data come from the log of BluePrism RPA. The existing ETL process has been tested already. I am going to focus on the report result against local data to verify the report result correctness.
The report includes daily, weekly, monthly, and yearly. In this article I'm gonna use monthly report for validation. Here're the points used in the verification.
|1||2018-10||2018-10||BM||hotspots||feature||specific lob and process_name at certain time point|
|2||2018-10||2018-10||BM||Shipped not Activate||feature||different specific lob and process_name at certain time point|
|3||2018-10||2018-10||BM||[ALL]||generic||same lob and time point, but all process_names|
|4||2018-11||2018-11||BRS||Disconnection||feature||different specific lob and process_name and different time point|
|5||2018-11||2018-11||BRS||Field Work||feature||different specific lob and process_name and different time point|
|6||2018-09||2018-11||[ALL]||[ALL]||generic||all lob and process_name at different time points, more generic|
|7||2017-12||2018-02||[ALL]||[ALL]||generic||cross year, more generic|
You could test more different feature points and generic points, however, with the same logic behind in most situation you would get the same results.
To same space, I only list point 1 and 7 in this article.
select lob,PROCESS_NAME,count(loaded) sendto,count(completed) worked,1.0*count(completed)/count(loaded) [% automated] from tbl_AutomationScorecard where lob='BM' and PROCESS_NAME='hotspots' and month(loaded)=10 and year(loaded)=2018 group by lob,PROCESS_NAME
Point 7 (generic point, cross year)
1,406 total views, 15 views today