The following table lists examples of standard quality checks to perform on the raw data, with a priority, description, and pass criteria for each.
Table 1. Quality assurance checklist for raw MPD
# | Priority | Indicator | Dataset | Description | Pass criteria |
1 | critical | Missing values | Cells, domestic | Share of empty values, out of all records, in each field needed for calculations | Every field has less than 5% missing values. |
2 | critical | Number of records per day | Domestic | Records per day | There should be no illogical lows or peaks on the timeline. |
3 | critical | Number of unique subscribers per day | Domestic | Unique subscribers per day | There should be no illogical lows or peaks on the timeline. |
4 | critical | Geographical distribution of cells | Cells | What to check: (a) how many cells have incorrect coordinates (e.g., outside the country); (b) how cells are distributed across the country and whether any regions have no cells; (c) whether any cell locations are illogical. | On visual inspection, less than 5% of cells should lie outside the country or have clearly incorrect coordinates. |
5 | critical | Cell occupancy | Cells, domestic | How many of the cells have records in the domestic dataset? | Less than 5% of the cells should have 0 records. |
6 | critical | Cell occupancy | Cells, domestic | How many cells referenced in the domestic data are missing from the cells table? | Less than 5% of the cells referenced in the domestic data are missing from the cells table. |
7 | critical | Subscriber presence in data | Domestic | Number of days each domestic subscriber is present, out of all days in the period | In domestic data, subscribers should be present on most days of the period. |
8 | critical | Diurnal distribution of records | Domestic | Average number of records per hour (0-23) | There should be peaks in the morning and afternoon, and no sudden peaks. |
9 | important | Weekly distribution of records | Domestic | Average number of records per hour (0-23) | The chart should show a weekly pattern, with fewer records on weekends. |
10 | critical | Average number of records per day per subscriber | Domestic | Average number of records per day per subscriber | CDR: 3–4; IPDR: 10–50; Signalling: > 50 |
11 | low | Time between subsequent events | Domestic | Time gap between subsequent events | The gaps should follow a folded normal distribution. |
12 | low | Identify time zone | Domestic | Based on the diurnal distribution of records, identify which time zone is used | The data should use a single, identifiable time zone. |
Source: Positium.
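Several of the checks in Table 1 reduce to simple ratios, and a minimal Python sketch of checks 1, 5, 6, and 10 might look like the following. The record layout (dicts with `cell_id`, `subscriber_id`, `timestamp`, `lat`, `lon` keys), the function names, and the sample data are all hypothetical and are not part of the source checklist. Two interpretations are assumed: check 6 counts distinct cell references rather than records, and check 10 averages over the days on which a subscriber has at least one record.

```python
from collections import defaultdict

# Hypothetical sample data; field names are assumptions for illustration.
cells = [
    {"cell_id": "A", "lat": 59.4, "lon": 24.7},
    {"cell_id": "B", "lat": 58.4, "lon": 26.7},
    {"cell_id": "C", "lat": None, "lon": None},  # missing coordinates
]
records = [
    {"subscriber_id": "s1", "cell_id": "A", "timestamp": "2023-01-01T08:00:00"},
    {"subscriber_id": "s1", "cell_id": "A", "timestamp": "2023-01-01T17:30:00"},
    {"subscriber_id": "s1", "cell_id": "B", "timestamp": "2023-01-02T09:15:00"},
    {"subscriber_id": "s2", "cell_id": "X", "timestamp": "2023-01-01T12:00:00"},
]

def is_empty(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""

def missing_share(rows, field):
    """Check 1: share of rows in which `field` is missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if is_empty(r.get(field))) / len(rows)

def unused_cell_share(cells, records):
    """Check 5: share of cells in the cells table with zero domestic records."""
    cell_ids = {c["cell_id"] for c in cells}
    used = {r.get("cell_id") for r in records}
    if not cell_ids:
        return 0.0
    return len(cell_ids - used) / len(cell_ids)

def unknown_cell_share(cells, records):
    """Check 6: share of distinct cell references in the domestic data
    that do not appear in the cells table (assumed interpretation)."""
    known = {c["cell_id"] for c in cells}
    referenced = {r["cell_id"] for r in records if not is_empty(r.get("cell_id"))}
    if not referenced:
        return 0.0
    return len(referenced - known) / len(referenced)

def avg_records_per_subscriber_day(records):
    """Check 10: mean record count per (subscriber, active day) pair."""
    counts = defaultdict(int)
    for r in records:
        day = r["timestamp"][:10]  # date part of an ISO 8601 timestamp
        counts[(r["subscriber_id"], day)] += 1
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)

THRESHOLD = 0.05  # the 5% pass criterion used by checks 1, 5, and 6

report = {
    "missing_lat_ok": missing_share(cells, "lat") < THRESHOLD,
    "unused_cells_ok": unused_cell_share(cells, records) < THRESHOLD,
    "unknown_cells_ok": unknown_cell_share(cells, records) < THRESHOLD,
    "avg_records_per_subscriber_day": avg_records_per_subscriber_day(records),
}
```

On the tiny sample above all three ratio checks fail by design, since one of three cells lacks coordinates, one has no records, and one referenced cell is absent from the cells table. The timeline and distribution checks (2, 3, 8, 9, 11) call for visual inspection and are not covered by this sketch.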