Weighing the quality of big data for SDG monitoring

Big data has raised big expectations among data users. Some users believe that real-time data is readily available in large volumes, while official statistics are sparse and reported with a significant time lag. Against a backdrop of arguments about big data, some leaders of international development agencies see cooperation with big multinational companies engaged in information and digital technology as a much more effective and straightforward solution for SDG data collection.

Producing official statistics with traditional survey methodology, as national statistical institutes do, is indeed time-consuming, costly and burdensome. Additionally, the data flows from national sources to international users pass through a complex and often confusing structure of reporting mechanisms. However, even with these challenges and shortfalls in official statistics, we must still ask whether big data is ready to offer real alternatives to them. This question is of particular importance when we look at the issue of data quality.

I come from Nepal, a least developed country, where the number of Internet users is less than 20 per cent of the population. In India, our region's front runner in IT, only 30 per cent of the population uses the Internet. By contrast, in Austria, where I live and work, that figure is almost 85 per cent. Similar parallels could be presented in terms of the online registration of businesses, credit-card payments, Internet shopping and other sources of big data applicable to industrial statistics. The amount of big data available in Austria is by no means comparable to that available in any least developed country. But Austria also has strong official statistics, which certainly influences whether or not big data sources should be used. The availability of big data on a wide scale generally occurs in countries with higher levels of capacity in official statistics (although no causal relationship is implied). In many developing countries, limited access to and use of modern technology restricts the scope and coverage of big data, and subsequently limits the statistical inference you can draw from it.

The applicability of big data is not the same across the fields of statistics capturing different SDG indicators. There are good practices of using big data in health, tourism, transportation and communications statistics. However, more than 90 per cent of big data is unstructured data, including text, images, and audio and video clips. Even after a highly sophisticated transformation, such data can yield few meaningful statistics. SDG monitoring requires highly disaggregated data to capture inequality and social inclusiveness in relation to a specific social stratum of a region or even a community. Thus, it would not be wise to generalize about the possibility of using big data for the monitoring of every development goal.

How good is the big data whose usefulness for SDG monitoring has generated so much noise? Let us assume that big data sources are much better in terms of timeliness and cost (although this is not yet proven to be true). There are still other attributes we are looking for in SDG data. In this context, I would emphasize the following:

  1. Accuracy -- First and foremost, we need to provide an accurate measure of the indicators to show the progress of nations towards achieving sustainable development. This requires accurate benchmark information for the reference year (2015) to measure change over time. There are many examples of estimating the growth from big data, but few for creating benchmark statistics.
  1. Coherence -- SDG indicators comprise a wide range of statistics that serve the single purpose of measuring sustainable development. We have to ensure that data compiled from different sources do not contradict each other. For example, the sex ratio calculated from Facebook or other social media network user data cannot be accepted if it contradicts the estimates of a national demographic survey.
  1. Comparability -- Data obtained with internationally recognized statistical standards and methodological recommendations are conceptually compatible. Big data is in itself not compatible with any standard. Huge efforts are required to transform big data to meet the requirements of international comparability, which, in turn, would diminish the advantages big data may have of timeliness and cost effectiveness.
  1. Accountability -- Monitoring the SDGs is a nationally owned process and national statistics offices are accountable for the SDG data that they report. If an international agency derives information based on data from a foreign-owned company, then irrespective of the level of technical sophistication used in data transformation, no national institution will accept responsibility for any misreported facts. Therefore, such data would not be fit for the purpose of SDG monitoring.

I strongly believe that we should embrace the opportunities offered by big data and make the best use of them, and I am looking forward to the many sessions and discussions that will take place at the UN World Data Forum 2018 on how to integrate big data into official statistics. In fact, we are already using big data as a complementary source in different fields. However, it would be premature to suggest that big data has already emerged as a better solution for meeting all the data requirements of SDG monitoring. I prefer to take a realistic approach.

Shyam Upadhyaya is Chief Statistician for UNIDO. He started his career with the Central Bureau of Statistics of Nepal, and has worked as an international consultant on capacity building projects for NSOs. He was actively involved in the work of the IAEG-SDG in preparing the SDG indicator framework.