While the statistical community has made good progress in using Big Data, many questions and challenges remain. Statistical offices still need to make further progress in delivering timelier, more frequent and more granular data; the private sector remains a formidable competitor; and more engagement and collaboration with the private sector and other partners is necessary to combine and benefit from each other's strengths.
This means that additional steps have to be taken: the results of the existing Task Teams' work should be made usable for the statistical community; Big Data, administrative data and traditional statistical sources should be treated together in a multi-source approach; the challenge is not to overcome the problems of using Big Data as such, but to achieve trusted data collaboration; and to that end the UN Global Working Group on Big Data needs a collaborative Global Platform for the overall organization of its work.
In short, the strength of the community of official statistics is trusted data, and its real need is close collaboration with the private sector, academia, the research community and civil society.
Data collaboratives present both a new challenge and a new opportunity for the community of official statistics – in relation to Big Data, to the SDGs, and to the sharing of data, services, technologies and know-how.
These and similar questions will be addressed at the 4th UN Conference on Big Data in Bogota, Colombia, from 8 to 10 November, organised jointly by the UN Global Working Group, DANE and the Colombian Ministry for ICT.
The agenda for the Meeting of the GWG on Big Data for Official Statistics is available online.
The Cape Town Global Action Plan for Sustainable Development Data [Annex I] emphasized, among other things, strengthening the innovation and modernization of national statistical systems. This innovation effort calls for a rethinking of the partnerships of the community of official statistics with the private sector, academia and civil society, through an interconnected ecosystem of data and technology collaborations at the national, regional and global levels. In this context, “Trusted Data Collaboratives” are a new way of working together, with a proper definition of the interests of the various stakeholders, a proper definition of responsibilities and access, and appropriate protocols to safeguard confidentiality.
On 25 September 2015, world leaders committed to the 2030 Agenda for Sustainable Development, including many ambitious goals and targets to be achieved by 2030. The statistical community was charged with defining appropriate indicators to monitor progress towards these targets. It was stressed that differentiation by population group, sub-national location and smaller time intervals (“leaving no one behind”) would make the information base more useful for policy decisions. The 2030 Agenda explicitly calls for enhanced capacity building to support national plans to implement the Sustainable Development Goals. Along with the SDG monitoring, the underlying microdata should also become accessible through national, regional and international open data platforms, and discoverable through standard metadata documentation. These platforms should apply open data protocols for the creation and use of interoperable APIs based on ISO standards, allowing for trusted data collaboratives, improved governance, citizen engagement, and inclusive development and innovation (see the Open Data Charter).
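To illustrate what consuming such an interoperable API could look like in practice, the sketch below queries a hypothetical REST endpoint for an SDG indicator series. The base URL, parameter names and response fields are placeholders for illustration, not any platform's documented interface.

```python
import requests

# Hypothetical open-data endpoint serving SDG indicator series as JSON.
# The URL and all field names below are illustrative placeholders,
# not a documented API.
BASE_URL = "https://data.example.org/api/v1/sdg"

def fetch_indicator(indicator_code, country, start_year, end_year):
    """Fetch one SDG indicator series for a country and year range."""
    response = requests.get(
        f"{BASE_URL}/indicator/{indicator_code}/data",
        params={
            "geoAreaCode": country,
            "timePeriodStart": start_year,
            "timePeriodEnd": end_year,
        },
        timeout=30,
    )
    response.raise_for_status()
    # Each observation is assumed to carry a value, a time period and
    # standard metadata, making the series machine-discoverable.
    return response.json()["data"]

# Example: a poverty indicator series for Colombia, 2010-2017.
for obs in fetch_indicator("1.1.1", "COL", 2010, 2017):
    print(obs["timePeriod"], obs["value"])
```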
The World Council on City Data (WCCD) is the global leader in standardized city data, creating smart, sustainable, resilient and prosperous cities. The WCCD hosts a network of innovative cities committed to improving services and quality of life with open city data, and provides a consistent and comprehensive platform for standardized urban metrics. It is a global hub for creative learning partnerships across cities, international organizations, corporate partners and academia to further innovation, envision alternative futures, and build better and more liveable cities. The WCCD is implementing ISO 37120, Sustainable Development of Communities: Indicators for City Services and Quality of Life.
Moderator: Mr. Ivo Havinga, UNSD

The most promising way forward in compiling data for SDG monitoring is to integrate Big Data from new technologies with traditional data, in order to produce relevant, high-quality information with more detail and at higher frequencies to foster and monitor sustainable development. This also implies greater accessibility of data through much more openness and transparency, which should ultimately empower people towards better policies, better decisions and greater participation and accountability, leading to better outcomes for people and planet.
Moderator: Mr. Ronald Jansen, UNSD

Digital services are becoming increasingly important in our lives. Services such as cloud computing, mobile apps and other digital applications are facilitated by data centers, which have become the main enabler of the digital economy by supporting a wide range of activities across government, business and society. Data centers are therefore an important part of the national critical infrastructure. As enablers of the digital economy, data centers play an important role when it comes to trust: data must not only be accessible and available 24/7, but secure data storage and privacy must also be guaranteed. Data centers provide a platform on which organizations can compute, run and store their services and data. In the Netherlands, municipalities join forces with Statistics Netherlands in urban data centers to use data more effectively in local administration.
Moderator: Mr. Bert Kroese, Statistics Netherlands
Moderator: Mr. Misha Lokshin, World Bank

Within the community of official statistics, “trusted data” is defined in terms of compliance with quality standards. There are national quality assurance frameworks, statistical codes of practice and compliance with international standards, such as the System of National Accounts. The IMF developed the Special Data Dissemination Standard and the General Data Dissemination System, which can be used as general measures of “trusted data”. At a technical level, the statistical community has defined protocols for data exchange and the interoperability of data systems, such as SDMX and DDI. More broadly, the private sector has defined security standards and associated protocols for data transmission, data storage and the like, which have been taken up as ISO standards. In the business world, assurance is given through certification of ISO compliance.
What kind of certification do we need for a “trusted data collaborative”?
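As a concrete glimpse of the interoperability protocols mentioned above, the sketch below constructs a data query in the style of the SDMX 2.1 RESTful web-service specification. The base URL and the dataflow and series identifiers are illustrative placeholders, not a real agency endpoint.

```python
import requests

# Placeholder base URL of an SDMX 2.1 RESTful web service; agencies
# publish their own endpoints. The /data/{flow}/{key} path pattern and
# the startPeriod/endPeriod parameters follow the SDMX REST standard.
SDMX_BASE = "https://stats.example.org/sdmx/rest"

def sdmx_data_url(dataflow, key, start, end):
    """Build an SDMX 2.1 REST data query URL.

    `key` is the dot-separated list of dimension values, e.g.
    "A.COL.SI_POV_DAY1" for an annual series for Colombia (the
    dataflow and codes here are invented for illustration).
    """
    return (f"{SDMX_BASE}/data/{dataflow}/{key}"
            f"?startPeriod={start}&endPeriod={end}")

url = sdmx_data_url("DF_SDG", "A.COL.SI_POV_DAY1", "2010", "2017")
# Request the response in SDMX-JSON rather than SDMX-ML.
response = requests.get(
    url, headers={"Accept": "application/vnd.sdmx.data+json"}, timeout=30)
print(response.status_code)
```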
The GWG Task Team on satellite imagery, geospatial data and remote sensing developed a handbook that contains information on sources of Earth observation data, methodologies for producing crop and other statistics from satellite imagery, outlines of pilot projects, and guidance for national statistical offices exploring the use of Earth observation data for the first time. The pilot projects include an application of satellite imagery data in the production of agricultural statistics. This session also discusses a hands-on course teaching methods for using Earth observation data to generate agricultural crop statistics and to monitor the SDGs.
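As a minimal example of the kind of method covered by the handbook, the sketch below computes the normalized difference vegetation index (NDVI) from the red and near-infrared bands of a scene, a standard first step in crop-area estimation. The reflectance values and the vegetation threshold are illustrative; in practice the bands would be read from the imagery with a raster library such as rasterio or GDAL.

```python
import numpy as np

# Stand-ins for the red and near-infrared bands of a satellite scene
# (surface reflectance, values between 0 and 1).
red = np.array([[0.10, 0.12], [0.30, 0.08]])
nir = np.array([[0.55, 0.50], [0.35, 0.60]])

# NDVI = (NIR - Red) / (NIR + Red), ranging from -1 to 1; dense green
# vegetation typically scores high on this index.
ndvi = (nir - red) / (nir + red)

# A simple threshold yields a provisional vegetation/cropland mask.
# 0.4 is an illustrative cut-off; operational work calibrates it
# against ground-truth observations.
vegetation_mask = ndvi > 0.4

print(ndvi.round(2))
print(vegetation_mask)
```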
Moderator: Ms. Sylvie Michaud, Statistics Canada

This session will provide information on how to access online data, select data sources, prepare raw data for use, and classify and process data for use in the CPI. It also discusses different methodologies and describes the status of scanner data implementation in different countries. Further, it gives a status update on scanner and online data integration projects and provides guidance for NSOs considering using this data source for the first time.
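To make these processing steps concrete, the sketch below computes a Jevons elementary price index from a toy scanner-data extract, using unit values (turnover divided by quantity) as the price measure. The products and figures are invented; the Jevons formula, an unweighted geometric mean of price relatives, is one of several elementary index formulas used with scanner data.

```python
import numpy as np
import pandas as pd

# Toy scanner-data extract: one row per product per month, with total
# turnover and quantity sold. Product codes and values are illustrative.
df = pd.DataFrame({
    "month":    ["2017-01"] * 3 + ["2017-02"] * 3,
    "product":  ["A", "B", "C", "A", "B", "C"],
    "turnover": [200.0, 150.0, 90.0, 210.0, 160.0, 85.0],
    "quantity": [100, 50, 30, 100, 52, 28],
})

# Unit values (turnover / quantity) are the usual price measure for
# scanner data, since list prices are not observed directly.
df["unit_value"] = df["turnover"] / df["quantity"]

prices = df.pivot(index="product", columns="month", values="unit_value")

# Jevons index: geometric mean of the price relatives between months.
relatives = prices["2017-02"] / prices["2017-01"]
jevons = np.exp(np.log(relatives).mean())
print(f"Jevons index, 2017-02 vs 2017-01: {jevons:.4f}")
```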
Moderator: Mr. Ivo Havinga, UNSD

This session gives an overview of the data generated by mobile communication technologies and of the choices involved, clarifying the trade-offs between size, complexity and usefulness. The session will discuss the importance of understanding stakeholders and partnership models for mobile data projects, as well as the logical order of steps in the process of data extraction. It will also discuss how to calculate tourism statistics and how to identify tourism indicators, including calibration and inference.
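As a simplified illustration of the extraction step, the sketch below derives a crude inbound-tourism indicator from a toy call-detail-record extract by counting distinct roaming SIMs per month and country of origin. The field names and records are invented; operational estimates would additionally filter out transit and cross-border traffic and calibrate the counts against surveys or border statistics before inferring visitor numbers.

```python
import pandas as pd

# Toy call-detail-record (CDR) extract: each row is one network event
# from a foreign (roaming) SIM, with identifiers already pseudonymised.
cdr = pd.DataFrame({
    "sim_id":       ["x1", "x1", "x2", "x3", "x2", "x3", "x3"],
    "home_country": ["DE", "DE", "US", "DE", "US", "DE", "DE"],
    "timestamp": pd.to_datetime([
        "2017-07-01 10:00", "2017-07-03 09:00", "2017-07-02 12:00",
        "2017-07-10 08:00", "2017-08-01 11:00", "2017-08-02 09:30",
        "2017-08-05 18:00",
    ]),
})

cdr["month"] = cdr["timestamp"].dt.to_period("M")

# Crude indicator: distinct roaming SIMs observed per month, broken
# down by country of origin, as a proxy for inbound visitors.
visitors = (cdr.groupby(["month", "home_country"])["sim_id"]
               .nunique()
               .rename("distinct_sims"))
print(visitors)
```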
Moderator: Mr. Margus Tiru, Positium

Trusted data collaboratives are about the use of big data and its integration with administrative sources, geospatial information, and traditional survey and census data. The use of multi-source data requires collaboration among a number of stakeholders, such as the statistical office, government agencies, research institutes, civil society and the private sector. Assuring the quality of the outcome therefore requires quality assessment at all levels of the collaborative.
Moderator: Mr. Niels Ploug, Statistics Denmark

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.

The term data lake is often associated with Hadoop-oriented object storage: an organization's data is first loaded into the Hadoop platform, and business analytics and data mining tools are then applied to the data where it resides on the Hadoop cluster's nodes of commodity computers. Like big data, the term data lake is sometimes disparaged as simply a marketing label for a product that supports Hadoop. Increasingly, however, it is accepted as a description of any large data pool in which the schema and data requirements are not defined until the data is queried.
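The schema-on-read behaviour described above can be sketched in a few lines of Python: raw objects are ingested in their native format under a unique identifier with metadata tags, and interpretation is deferred until a question is asked. The in-memory dictionary below merely stands in for real object storage such as HDFS or S3, and the tag names are invented for illustration.

```python
import uuid

# Minimal in-memory stand-in for a data lake: raw objects kept as-is,
# each under a unique identifier with extended metadata tags.
lake = {}

def ingest(raw_bytes, **tags):
    """Store a raw object in its native format, tagged with metadata."""
    object_id = str(uuid.uuid4())
    lake[object_id] = {"data": raw_bytes, "tags": tags}
    return object_id

def query(**criteria):
    """Return the subset of objects whose tags match all criteria."""
    return [obj for obj in lake.values()
            if all(obj["tags"].get(k) == v for k, v in criteria.items())]

ingest(b'{"price": 3.2}', source="scanner", country="CO", year=2017)
ingest(b"lat,lon,ndvi\n4.6,-74.1,0.63", source="satellite", year=2017)

# Only when a business question arises is the relevant slice pulled
# out; schema and parsing are deferred until this point.
for obj in query(source="scanner"):
    print(obj["tags"], obj["data"])
```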
Moderator: Mr. Setia Pramana, BPS Indonesia, Jakarta