Transforming Census Results Publications into Analysis-Ready Data: The IPUMS International Historical Geographic Information System (IHGIS)

Tracy Kugler, IPUMS Data Center, University of Minnesota

National Statistical Offices (NSOs) commonly publish large volumes of results from Population and Housing Censuses. These publications contain a wealth of demographic, socioeconomic, and housing information that can be used for tracking progress toward the Sustainable Development Goals and for many other research and policy analysis applications.

Unfortunately, this wealth of data has remained largely inaccessible to researchers and decision makers. The data are typically published in reports as tables summarizing population characteristics. Historically, results were published in bound volumes. In recent decades, many of these reports have been published as PDF documents and made available on NSO websites. Although the reports are available, data in PDF documents, let alone bound volumes, cannot be easily imported into a statistical or GIS package. Furthermore, the table structures are highly heterogeneous, both across countries and even within a single report.

IHGIS Data

The International Historical Geographic Information System (IPUMS IHGIS) is designed to make these data readily available in a form that researchers can use directly for analysis. IHGIS data are available through a web-based data access system, which allows users to filter by dataset, browse available tables, select tables of interest, and download the data. IHGIS extracts include consistently structured data tables in CSV format, ready for use in statistical or GIS software packages. Extracts also include comprehensive metadata in both human- and machine-readable formats.

IHGIS also provides GIS shapefiles delineating the boundaries of the geographic units described in the data tables. Each unit is identified with a unique code in both the data tables and shapefiles, facilitating easy linkage between data and boundaries in a GIS package.
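
The linkage described above amounts to a join on the shared unit code. The following is a minimal sketch in Python, assuming a hypothetical column named unit_code and made-up sample values; actual IHGIS column names and codes may differ, and a plain attribute table stands in for the shapefile so that the example needs only the standard library.

```python
import csv
import io

# Hypothetical extract contents; real IHGIS column names and codes may differ.
DATA_CSV = """unit_code,unit_name,total_population
A01,North Province,1200
A02,South Province,3400
"""

# Stand-in for a shapefile's attribute table (geometry omitted for brevity).
BOUNDARY_CSV = """unit_code,area_sq_km
A01,60
A02,85
A03,40
"""


def read_by_code(text, key="unit_code"):
    """Index the rows of a CSV string by their unit code."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def link(data_text, boundary_text):
    """Attach data values to each boundary unit that has a matching code."""
    data = read_by_code(data_text)
    boundaries = read_by_code(boundary_text)
    linked = {}
    for code, boundary in boundaries.items():
        if code in data:  # units without data (here A03) are left out
            linked[code] = {**boundary, **data[code]}
    return linked


linked = link(DATA_CSV, BOUNDARY_CSV)
for code, row in linked.items():
    # Once linked, boundary and data attributes can be combined per unit.
    density = int(row["total_population"]) / float(row["area_sq_km"])
    print(code, row["unit_name"], round(density, 1))
```

In a GIS package, the same join would typically be performed between the CSV file and the shapefile's attribute table, using the code field as the join key.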

The IHGIS collection currently includes over 1,200 tables from 26 censuses, including both population and housing censuses and agricultural censuses. New data are added several times per year. The current collection is derived from documents and data tables published electronically on NSO websites. IPUMS has recently received funding from the U.S. National Science Foundation to extend our capabilities to include data from print documents using optical character recognition.

The IHGIS Workflow

The core challenge for IHGIS is transforming data tables from the myriad structures in which they are published into a standardized structure. Addressing this challenge requires substantial software infrastructure. However, it is not feasible to completely automate the task of interpreting the contents of any given table. Therefore, the overarching philosophy of IHGIS data processing is to have computers do what computers are good at and have humans do what humans are good at. For example, it is relatively easy for a person to determine whether row headers identify geographic units or categories of marital status or educational attainment. Developing software to make that determination would be a significant challenge. On the other hand, having humans extract state-level totals from a table by copying and pasting is tedious, time-consuming, and error-prone.

The heart of the IHGIS data processing workflow is a table markup framework. Using the markup framework, researchers (mostly undergraduate research assistants) indicate the location of key structural elements within each table. For each table, students extract information such as the universe, time frame, and geographic extent. They then add keyword tags indicating the location of geographic unit headers, headers describing the characteristics summarized in the table, the table title, the extent of the data, and other structural elements.

Figure: Example of markup for a relatively simple table
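
To make the idea of markup concrete, here is a minimal sketch in Python. The tag names (title, geog_headers, char_headers, data), the coordinate scheme, and the sample table are all hypothetical and chosen only for illustration; the actual IHGIS markup framework may represent these elements quite differently. The point is that once a person has marked where each structural element sits, software can read the table mechanically.

```python
# A small source table as a grid of cells, as it might appear in a spreadsheet.
GRID = [
    ["Table 1. Population by sex", "", ""],
    ["", "Male", "Female"],
    ["North Province", "580", "620"],
    ["South Province", "1650", "1750"],
]

# Hypothetical markup: each tag names a structural element and gives its
# location as (first_row, last_row, first_col, last_col), zero-based, inclusive.
MARKUP = {
    "title": (0, 0, 0, 0),
    "geog_headers": (2, 3, 0, 0),
    "char_headers": (1, 1, 1, 2),
    "data": (2, 3, 1, 2),
}


def cells(grid, span):
    """Return the block of cells covered by a markup span."""
    r0, r1, c0, c1 = span
    return [row[c0:c1 + 1] for row in grid[r0:r1 + 1]]


def to_records(grid, markup):
    """Use the markup to turn the grid into one record per data cell."""
    title = cells(grid, markup["title"])[0][0]
    units = [row[0] for row in cells(grid, markup["geog_headers"])]
    chars = cells(grid, markup["char_headers"])[0]
    data = cells(grid, markup["data"])
    records = []
    for unit, values in zip(units, data):
        for char, value in zip(chars, values):
            records.append({
                "table": title,
                "unit": unit,
                "characteristic": char,
                "value": int(value),
            })
    return records


records = to_records(GRID, MARKUP)
for record in records:
    print(record)
```

The division of labor follows the philosophy described earlier: deciding which cells are geographic headers is left to a person, while the repetitive extraction is left to the code.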

The markup serves as a guide for the IHGIS software, enabling ingest into a metadata database. The database organizes all row and column headers, titles, universes, and other metadata elements and their relationships in a consistent way. The database, in turn, enables automated restructuring of the data tables to generate the consistently structured tables in IHGIS extracts. For example, many source tables include nested geographic units at two or more levels (e.g., states and counties). IHGIS pulls the appropriate rows apart to create separate files for each level, enabling easier data linkages in GIS packages.
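
The splitting of nested geographic levels can be sketched as follows. This is an illustrative Python example, not the IHGIS implementation: the field names (level, code, name, population), the sample values, and the output file names are assumptions, and the sketch presumes each row has already been labeled with its geographic level during ingest.

```python
import csv
import io

# Hypothetical rows after ingest; each row carries the geographic level
# assigned to it during markup. States and counties are interleaved, as they
# often are in a published table.
ROWS = [
    {"level": "state", "code": "01", "name": "State A", "population": 500},
    {"level": "county", "code": "01001", "name": "County A1", "population": 300},
    {"level": "county", "code": "01002", "name": "County A2", "population": 200},
    {"level": "state", "code": "02", "name": "State B", "population": 150},
    {"level": "county", "code": "02001", "name": "County B1", "population": 150},
]


def split_by_level(rows):
    """Group rows by geographic level, dropping the level field itself."""
    by_level = {}
    for row in rows:
        record = {k: v for k, v in row.items() if k != "level"}
        by_level.setdefault(row["level"], []).append(record)
    return by_level


def to_csv(records):
    """Serialize a list of same-shaped records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


tables = split_by_level(ROWS)
# One CSV per level, keyed by a hypothetical file name.
files = {f"{level}.csv": to_csv(records) for level, records in tables.items()}
for name, text in files.items():
    print(name)
    print(text)
```

Keeping one level per file means each file can be joined to a single boundary layer without first filtering out rows that belong to other levels.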

By assembling these data and making them readily accessible to researchers, IPUMS IHGIS preserves the world’s statistical heritage. NSOs invest enormous resources and effort in conducting censuses, tabulating responses, and publishing the results. But the data may not be used to their full potential if they remain scattered across thousands of documents on NSO websites or library shelves. Some of the data are even in danger of being lost if they are removed from websites when the next census becomes available, or when libraries deaccession volumes to clear shelf space. IHGIS makes these critical data discoverable and accessible so that researchers and policy analysts can use them to understand population dynamics and socioeconomic systems and apply that knowledge to improve quality of life for people around the world.

About the Author

Tracy Kugler,
Research Scientist,
IPUMS Data Center, University of Minnesota