Overview

The adoption of common data and metadata models (chapter 2), and the use of controlled vocabularies and classifications to standardize their content (chapter 3), are necessary but not sufficient conditions for achieving data interoperability. In addition to the considerations and steps set out in chapters 2 and 3, standardized datasets need to be expressed in formats and made available through means that enable both machine-to-human and machine-to-machine access and use. In other words, once the basic structure of a dataset is in place and its contents are codified using standard classifications and controlled vocabularies, the data needs to be made easily available and accessible to a variety of user groups.

Interoperability is therefore not only about standardized data production, but also about standardized “data logistics” (Walsh and Pollock n.d.), meaning that it requires the use of common patterns and pathways to get data from providers to users in a fast, convenient, effective, and efficient manner. 

This chapter provides an overview of the various approaches that exist to make data discoverable and to present it so that developers and end users can access it in more reliable and straightforward ways. It recommends a set of data formats (among the many that exist) and the development of application programming interfaces (APIs) and user interfaces to support interoperability.

Open data formats

Electronic data files can be created in many ways, and data interoperability is greatly enhanced if data is made available using openly documented, non-proprietary formats. For maximum interoperability, data and metadata files need to be published in human-editable and machine-usable ways, and need to be agnostic to language, technology and infrastructure. A first step is to make the data available through bulk downloads in open data formats. There are various fully documented and widely agreed-upon patterns for the construction of digital data files, such as CSV, JSON, XML, and GeoJSON, among many others. 

CSV is an easy-to-use data format for developers and non-developers alike. The CSV serialization format is probably the most widely supported across different technological platforms, and although it does not incorporate a schema for validation, there are recent alternatives that combine CSV tabular data with additional schema information in containerized data packages (see, for instance, the Frictionless Data Packages described in Figure 15 below).

The use of JSON or XML to structure data in a text format and to exchange data over the internet is particularly useful for data interoperability, since these serializations allow producers and users to encode common data elements and sub-elements in such a way that data and metadata are linked together but clearly distinguishable from each other. It is important to note that while XML and JSON offer a common syntactic format for sharing data among data sources, they alone cannot address semantic integration issues, since it is still possible to share XML or JSON files whose tags are “completely meaningless outside a particular domain of application.” (Halevy and Ordille 2006). 
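To make the difference concrete, the following sketch (using only Python's standard library; the indicator code, country codes and values are purely illustrative) serializes the same small set of observations first as CSV, which is compact but carries no metadata, and then as JSON, where metadata and data travel together but remain clearly distinguishable:

```python
# Illustrative only: the same observations serialized as CSV and as JSON.
import csv
import io
import json

observations = [
    {"country": "KEN", "year": 2019, "value": 78.2},
    {"country": "KEN", "year": 2020, "value": 79.1},
]

# CSV: widely supported and easy to produce, but carries no metadata or schema.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["country", "year", "value"])
writer.writeheader()
writer.writerows(observations)
print(buffer.getvalue())

# JSON: metadata and data are linked together but clearly distinguishable.
message = {
    "metadata": {
        "indicator": "example-indicator-code",  # hypothetical identifier
        "unit": "percent",
        "source": "example source",
    },
    "data": observations,
}
print(json.dumps(message, indent=2))
```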

 

Figure 15: The Data Package Standard

The Data Package standard is a containerization format used to describe and package any collection of data, based on existing practices for publishing open-source software. It provides a "contract for data interoperability" through the specification of a simple wrapper and basic structure for data sharing, integration and automation without requiring major changes to the underlying data being packaged.

Its specification is based on the principles of simplicity and extensibility, and the ability to provide both human-editable and machine-usable metadata. It emphasizes the re-use of existing standard formats for data, and is language-, technology- and infrastructure-agnostic. To create a data package, one only has to create a 'datapackage.json' descriptor file in the top-level directory of a set of data files. This descriptor file contains general metadata (e.g., name of the package, licence, publisher, etc.) as well as a list of all the other files included in the package, along with information about them (e.g., size and schema).
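As an illustration, the following sketch uses Python's standard library to generate a minimal 'datapackage.json' descriptor; the package name, licence and file paths shown are placeholders rather than a prescribed configuration:

```python
# A minimal, illustrative datapackage.json descriptor written with the
# standard library. All field values below are placeholders.
import json

descriptor = {
    "name": "sdg-indicator-sample",            # hypothetical package name
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "indicator-data",
            "path": "data/indicators.csv",     # the packaged CSV file
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "country", "type": "string"},
                    {"name": "year", "type": "integer"},
                    {"name": "value", "type": "number"},
                ]
            },
        }
    ],
}

# Write the descriptor to the top-level directory of the data package.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```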

 


Data serializations in SDMX

SDMX is a data standard that encompasses a data model (the multidimensional data cube), standard vocabularies (content-oriented guidelines), a formal schema (the data structure definition, or DSD), and various data serialization formats for the construction of data files and electronic messages for data exchange.

In the context of SDMX, data providers can choose from several data serialization formats for sharing datasets, including XML, CSV, JSON and even EDIFACT[1]. “The SDMX Roadmap 2020 foresees the promotion of easy-to-use SDMX-compatible file formats such as CSV. The most important thing about these formats is that, despite their compact size, the data structure defined by the SDMX metadata is still complied with.” (Stahl and Staab 2018, p. 97).
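By way of illustration, the sketch below shows how a client application might request data from an SDMX REST web service in the SDMX-CSV format through content negotiation. The base URL, dataflow reference and series key are placeholders, and the media types actually supported vary by provider, so the service's own documentation should always be consulted:

```python
# A hedged sketch of retrieving SDMX-CSV data from an SDMX REST web service.
# The base URL, dataflow reference and series key are placeholders.
import urllib.request

BASE_URL = "https://example.org/sdmx/rest"   # hypothetical endpoint
DATAFLOW = "AGENCY,DATAFLOW_ID,1.0"          # hypothetical dataflow reference
SERIES_KEY = "A.KEN.TOTAL"                   # hypothetical series key

request = urllib.request.Request(
    f"{BASE_URL}/data/{DATAFLOW}/{SERIES_KEY}",
    # Ask the server for SDMX-CSV; supported media types vary by provider.
    headers={"Accept": "application/vnd.sdmx.data+csv;version=1.0.0"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
```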

Application programming interfaces

Client applications help users discover, access, integrate and use data from multiple sources. Once a standard serialization format is in place, data providers can think of more sophisticated ways of making the data available, such as the use of APIs to deliver data resources over the web to multiple groups of users.

APIs are highly reusable pieces of software that enable multiple applications to interact with an information system. They give client applications machine-to-machine access to data services and a standardized means of handling security and errors. When APIs behave in predictable ways, accept requests that follow well-known syntax rules, and yield results that are easy to understand and act upon, it becomes possible to automate data flows involving repetitive and frequent data sharing and exchange operations, avoiding costly and error-prone manual intervention. This allows users to focus on the data rather than spend their time collecting it. APIs provide the building blocks for users to easily pull the data elements they need to build their applications. In a sense, APIs are the virtual highways that allow data to travel back and forth between different websites and platforms.

API documentation is a technical contract between a data provider and its users. As such, it should describe all the options and resources that are available, the parameters that need to be provided by the client to the server, and the content, format and structure of the information sent back to the calling application, including error messages and sample responses. One good practice in the development of APIs is the consistent use of an API description language (e.g., Swagger[2]) to document their functionality. It is also crucial for the documentation of an API to keep track of its different versions (Biehl 2016).
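For illustration, the following sketch expresses a minimal OpenAPI-style description of a single data resource as a Python dictionary (see also Figure 16 below); the resource path, parameters and responses shown are hypothetical:

```python
# A minimal sketch of an OpenAPI-style (Swagger) description of one resource,
# expressed as a Python dictionary for illustration. All names are hypothetical.
import json

api_description = {
    "openapi": "3.0.0",
    "info": {"title": "Example statistical data API", "version": "1.0.0"},
    "paths": {
        "/indicators/{code}/data": {  # hypothetical resource path
            "get": {
                "summary": "Retrieve observations for one indicator",
                "parameters": [
                    {"name": "code", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                    {"name": "year", "in": "query", "required": False,
                     "schema": {"type": "integer"}},
                ],
                "responses": {
                    "200": {"description": "Observations in JSON format"},
                    "404": {"description": "Unknown indicator code"},
                },
            }
        }
    },
}

print(json.dumps(api_description, indent=2))
```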

Web APIs

A web service delivers the result of a data processing task performed by a specialized computer (the server) upon request by other computers. Web APIs serve information resources to client applications over the internet, thus enabling the emergence of modern distributed data processing and analysis systems with loosely coupled components. They enable the implementation of a service-oriented architecture in which information resources are provided to data users upon request, independently and with no need for prior knowledge of the specifications used by the requesting applications.
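The following bare-bones sketch, built with Python's standard library, illustrates the idea: a server exposes a small JSON resource over HTTP and serves it to any client that requests it, without prior knowledge of those clients. The resource path and data values are purely illustrative:

```python
# A bare-bones web service using only the standard library; the server returns
# a small JSON resource on request. Path and values are illustrative only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/data":
            payload = json.dumps({"indicator": "example", "value": 42}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    # Serve requests on http://localhost:8000/data until interrupted.
    HTTPServer(("localhost", 8000), DataHandler).serve_forever()
```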

 

Figure 16: The OpenAPI Specification

The OpenAPI Specification (OAS, formerly known as the Swagger Specification), “defines a standard, language-agnostic interface to RESTful APIs which allows both humans and computers to discover and understand the capabilities of the service”, allowing users to “understand and interact with the remote service with a minimal amount of implementation logic.”

In simpler terms, it establishes a standard format to document all the functionality of a RESTful web API, describing, in a way that is both human- and machine-readable, the resources it provides and the operations that can be performed on each of these resources, as well as the inputs needed for, and outputs provided by, each of these operations. It can also be used to document user authentication methods and to provide additional information such as contact information, license, terms of use, etc.

In the context of microservice and service-oriented architectures, the OpenAPI Specification has emerged as the standard format for defining the contract between client applications and services exposed via APIs, making it easier to orchestrate applications as collections of loosely coupled services, each of which supports self-contained business functions (Vasudevan 2017).

One example of a web API that provides access to data on SDG indicators following the OpenAPI specification is UNSD's Global Sustainable Development Goal Indicators API. This API enables developers to use indicator data in a flexible manner directly within their own applications, dramatically lowering data maintenance costs and ensuring that those applications always contain official and up-to-date indicator data, straight from the source.
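As a simple illustration of how a client application might consume this API, the sketch below retrieves the list of SDG goals using Python's standard library. The resource path and response fields shown are indicative only; the API's own documentation should be consulted for the current set of endpoints:

```python
# Illustrative sketch of consuming the UNSD Global SDG Indicators API.
# The resource path and field names below are assumptions for illustration.
import json
import urllib.request

URL = "https://unstats.un.org/SDGAPI/v1/sdg/Goal/List"  # assumed resource path

with urllib.request.urlopen(URL) as response:
    goals = json.loads(response.read().decode("utf-8"))

# Print the code and title of each goal, if those fields are present.
for goal in goals:
    print(goal.get("code"), goal.get("title"))
```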

 


The use of common standards and design patterns in the implementation of web APIs allows multiple developers to easily consume and integrate data resources over the web, opening up new possibilities for on-the-fly generation of data analysis and visualizations. The adoption of web APIs for the dissemination of data, based on open specifications, is a crucial step towards improved data interoperability.

API interoperability

The explosive growth in the availability of open APIs within and across organizations has led to a new interoperability challenge, namely that of integrating multiple APIs. Since rebuilding integrated systems from scratch would be extremely costly and disruptive, one approach to dealing with legacy systems that do not “talk to each other” is to build a middleware layer that connects one or more client applications with the legacy data systems through common API specifications (Feld and Stoddard 2004).
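The sketch below illustrates this pattern in a highly simplified form: a single middleware function retrieves records from a hypothetical legacy system and re-exposes them in a common structure shared by the organization's other APIs. All names and field mappings are assumptions made for illustration:

```python
# A simplified middleware (facade) sketch: legacy records are translated into
# a common response structure. All names and mappings are hypothetical.
from typing import Any, Dict, List


def fetch_from_legacy_system() -> List[Dict[str, Any]]:
    # Placeholder for a call to a legacy service or database that does not
    # follow the organization's common API conventions.
    return [{"CTRY": "KEN", "YR": "2020", "VAL": "79.1"}]


def get_observations() -> Dict[str, Any]:
    """Translate legacy records into the common, documented response format."""
    legacy_records = fetch_from_legacy_system()
    return {
        "data": [
            {
                "country": record["CTRY"],
                "year": int(record["YR"]),
                "value": float(record["VAL"]),
            }
            for record in legacy_records
        ],
        "metadata": {"source": "legacy system (illustrative)"},
    }


if __name__ == "__main__":
    print(get_observations())
```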

Figure 17: Building Effective APIs

As part of its API Highways initiative, the GPSDD has developed an API Playbook, a concise and accessible guide to building web services that can be seamlessly integrated by the global developer community. The API Playbook includes checklists and key questions, drawn from successful practices in the private sector and government, to help build better digital services in support of the SDGs. It is organized around the following eight “plays”, or recommendations:

  1. Document everything;
  2. Consider the Developer Experience;
  3. Be an upstanding citizen of the web;
  4. Be a part of the community and ask for help;
  5. Build trust;
  6. Consider the future;
  7. Plan for the long tail;
  8. Make it easy to use.

 

The use of standardized APIs across different data platforms allows application developers to quickly “mash up” data from multiple sources. Such APIs function as “middle tier” lenses that allow developers to view individual data assets as building blocks for their applications, which they can then put together in different combinations to address specific user needs. APIs should be designed with the needs of application developers in mind, focusing on helping them create information products that satisfy the requirements of end users.  In this context, APIs need to be well-documented, easy to understand, and easy to integrate with other systems.    

Broadly speaking, it is good practice to manage all of an organization’s APIs as a single product. Efforts should be made to ensure that all component APIs are standardized, have mutually consistent documentation and functionality, and implement common design patterns built from reusable components. Moreover, adopting a common set of basic functionalities and design patterns is crucial to improve interoperability across different APIs. For instance, the error messages from all APIs in an organization’s API portfolio should have the same structure. To facilitate discoverability, the description of the API portfolio should be served by a specific endpoint. This enables the caller to discover each API within the portfolio by downloading and parsing the API portfolio description.
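As an illustration of one such common design pattern, the sketch below defines a single error payload structure that every API in a portfolio could reuse; the field names are illustrative rather than prescribed by any particular standard:

```python
# A hedged sketch of a standardized error payload shared across an API
# portfolio. Field names are illustrative, not prescribed by any standard.
from typing import Any, Dict


def make_error(status: int, code: str, message: str) -> Dict[str, Any]:
    """Build the common error body returned by every API in the portfolio."""
    return {
        "error": {
            "status": status,    # HTTP status code
            "code": code,        # machine-readable error identifier
            "message": message,  # human-readable explanation
        }
    }


# Because every API returns errors with the same shape, client code can handle
# failures from any service in the portfolio in a single, uniform way.
print(make_error(404, "SERIES_NOT_FOUND", "No data series matches the given code."))
print(make_error(400, "INVALID_PARAMETER", "'year' must be an integer."))
```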

 

However, there are often cases in which organizations need to maintain multiple APIs that are not fully standardized, say, because they are legacy services developed in isolation, or because they need to deviate from common patterns to deliver customized services to highly specialized client applications. In those cases, a useful approach to improving the interoperability of APIs is the creation of API aggregation services, or “API mashups”. These are services that draw data from multiple APIs (from one or more providers) and repackage it to create new, integrated data services for end users. Such API aggregation services can greatly improve the developer experience and generate value by eliminating the need to work with multiple APIs. This is the approach followed, for instance, by the API Highways initiative described in Figure 18.
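The following simplified sketch illustrates the aggregation pattern: data is pulled from two hypothetical upstream APIs and repackaged as one integrated response for end users. The URLs and response shapes are assumptions made purely for illustration:

```python
# A minimal API aggregation ("mashup") sketch: data from two hypothetical
# upstream APIs is combined into a single integrated response.
import json
import urllib.request
from typing import Any, Dict


def fetch_json(url: str) -> Any:
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def aggregated_country_profile(country_code: str) -> Dict[str, Any]:
    """Combine indicator data and geographic metadata into one payload."""
    # Both URLs are placeholders for real upstream services.
    indicators = fetch_json(f"https://stats.example.org/api/indicators/{country_code}")
    geography = fetch_json(f"https://geo.example.org/api/countries/{country_code}")
    return {
        "country": country_code,
        "indicators": indicators,  # repackaged output of the first API
        "geography": geography,    # repackaged output of the second API
    }


if __name__ == "__main__":
    print(aggregated_country_profile("KEN"))
```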

Standardized user experience

The user interface of a data dissemination platform is the point at which most lay users first come into contact with data. Standardizing this experience can help promote data use and usability. Interfaces should therefore be designed to behave consistently and in familiar ways, following common design patterns and rules of communication, so users can easily and intuitively interact with them instead of having to invest large amounts of time and effort trying to understand the rules of engagement whenever they are confronted with a new data source.  

As computers’ processing power and internet speeds have increased, it has become much easier to present data online in visually exciting and engaging ways. The experience users have when interacting with data on a website has become a key indicator of a platform’s quality. Graphics, visualizations and other, often interactive, tools contribute to an enhanced user experience. Using the world wide web as a conduit, applications can be programmed to interoperate and share data, so that the same data can be integrated with information hosted on other platforms and presented in numerous different ways depending on user needs.

Adopting and implementing web-based APIs that follow common standards and well documented patterns enables multiple developers to produce interactive data-driven applications. It also creates new possibilities for user engagement. Similarly, using standardized patterns and reusable building blocks when designing human-machine interfaces across different applications can significantly reduce the effort that users have to invest to find and use the data they need.

Building a roadmap: an interoperability rapid assessment

 

The following is the part of the assessment framework produced as part of this Guide that corresponds to this chapter. It focuses on the use of open data formats, standardized APIs and user interfaces to make data easily accessible to both people and machines. It is designed to help inform the development and implementation of data dissemination and interoperability strategies and should be supplemented with the resources identified under the Further Reading heading below, as well as other context-specific materials.

Action areas

Initial Steps

Advanced Steps

 

Using open data formats

Make the data available through bulk downloads, for example as CSV files.

Use XML, JSON, GeoJSON or other widely-available open data formats to encode common data elements and sub-elements in such a way that data and metadata are linked together but clearly distinguishable from each other.

Use a data containerization format such as the Data Package standard format to publish data sets.

 

Using standard APIs

Set up a webpage and document all the functionality of existing web APIs in use, describing the resources being provided and the operations that can be performed, as well as the inputs needed for, and outputs provided by, each operation.

Provide additional information such as contact information, any licences used, terms of use, etc.

Develop RESTful web APIs for data dissemination following the OpenAPI specification and document them fully using an API documentation language such as Swagger.

Expose datasets using SDMX web services.

Serve the description of the API portfolio by a specific end point, to facilitate the discoverability and use of APIs.

 

 

Enhancing user experience

Follow common design patterns and rules of communication, so users can easily and intuitively interact with system interfaces.

Create or subscribe to API aggregation services, or “API mashups” to improve developer experience and generate value by eliminating the need to work with multiple APIs.

 

Common pitfalls when using open data formats and standardized interfaces:

  • Over-customization of an interface can inhibit its accessibility, usability and interoperability with other systems. System interfaces should prioritize interoperability and flexibility over specificity and optimization. In short, a balance must be struck between specific user group needs and broader usability.
  • Notwithstanding the above, the ability to customize and alter interfaces for specific use cases should be planned for as part of a broader data governance policy.
  • All the APIs of an organization should be managed as one product, making sure that all component APIs are standardized and have mutually consistent documentation and functionality, and implement common design patterns built from reusable components.


Further reading on open formats and standard interfaces

 



[2] For further information see: https://swagger.io/specification/
