Symposium 2001/20
25 September 2001
English only
Symposium on Global Review of 2000 Round of Population and Housing Censuses:
Mid-Decade Assessment and Future Prospects
Statistics Division
Department of Economic and Social Affairs
United Nations Secretariat
New York, 7-10 August 2001
Archiving Census Metadata and Microdata: How to Preserve Memory, Increase
Stakeholders, and Enhance the Census as a Public Good*
Wendy L. Thomas and Robert McCaa **
CONTENTS
B. Long-term preservation of data and documentation
C. Determining what to preserve
E. Inventory of available technology, personnel and knowledge
1. The preservation of various types of census materials must be raised early
in the cycle of census activities. Well-preserved data and documentation
contribute to effective data collection, dissemination, planning, and future
use of the population census. The ability to learn from past processes, to
identify strategies that contribute to a successful census, to retain and build
on core activities and structures from previous censuses, and to apply census
data effectively to current and future issues depends on the preservation of
census data and of the materials related to the collection and processing of
those data.
2. In an ideal world with unlimited resources, the questions of what to preserve and
how to preserve it would be easier to address. Unfortunately, this is not the
case and even in the wealthiest of countries the cost of preservation, and
questions surrounding the means of preservation, have a profound impact on what
materials are preserved and in what format. The purpose of this paper is to
look at the types of data and documentation accumulated during the census
process and to explore the benefits of preserving these types of documents in
terms of informing future censuses and data users, ensuring appropriate
preservation formats, and identifying stakeholders who may be an effective
force in lobbying for the preservation of various classes of documents.
3. Classifying materials for preservation in terms of their future impact and
anticipated use is useful for identifying the trade-offs in preservation
decisions for individual countries. By coupling this type of materials list with an
inventory of the available technology, personnel and knowledge within a country
to process materials for preservation, governments will have the information
necessary to enable them to make informed preservation decisions. The use of a
questionnaire to elicit information on the available infrastructure for
preservation within a country may also bring to light options for cooperative
services or a profile of appropriate technologies for a variety of situations.
The ability to determine not only what will be preserved, but also what will
not be preserved, based on an understanding of the long-term impact of the
information contained in the document, is instrumental in developing a
long-term census preservation policy that will meet the needs of future
generations.
1. Definition of long-term preservation
4. Long-term preservation takes on a new meaning with electronic records. “Archiving” is a
term used both by computer/information technology specialists and archivists,
yet it conveys different meanings to these two groups. “Archiving” in the world
of computing refers to inactive or off-line storage. To archivists “archiving”
means to preserve an information record in a format that is independent of its
production environment and to protect that record from loss, alteration or
deterioration.
5. For archivists, well-preserved electronic records have the following
characteristics (Dollar, 2000). They are:
6. It is important to keep this concept of preservation in mind when assessing the
value of preserving particular census records and in determining the costs of
distribution, storage and long-term preservation.
2. The value of preservation
7. Much has been written on the importance of organizing and coordinating the process
of census taking within and between countries (United Nations, 2000). Numerous
intergovernmental and non-governmental agencies provide support and assistance
for this process. Emphasis has been placed on planning, data collection,
methodologies, product preparation and dissemination. The value of a strong
archival program lies not only in preserving the actual data, metadata and data
products for future use, but also in its ability to contribute to future census
and statistical activities.
8. Given the periodic nature of census taking, maintaining records on how specific
activities were performed can inform future census processes within a country,
allowing agencies to learn from past processes and strategies. This is
particularly important in countries that do not have and cannot afford to have
a permanent office for the census. Carefully selected and preserved records can
provide detailed information on the planning process, specifications of
collection, and insight into why certain decisions were made and how effective
particular activities were. In particular, it is these types of
country-specific processes and approaches that can assist in retaining and
building on successful core activities and structures.
9. Preservation and communication of information on data quality and process
evaluation are of value for informing future census activities and are
essential for the informed
use of census data. Communicating information on the reliability, limitations
and strengths of the final data allows users to understand the impact of any
procedural changes on any analysis they may wish to perform. This is the type
of information that should be encapsulated through logical or physical links between
the census data and the procedural metadata in the preservation process.
3. Costs of preservation
10. The cost of preservation is an issue for all countries. Recent discussions
of retention schedules for the 2000 United States census elicited numerous
responses from various stakeholder groups concerning the preservation of both
original forms and intermediary process output. The cost of preserving original
enumeration forms in various formats, together with the associated cost of
making these identifiable for future users, was one of the key factors in
negotiating a final retention schedule.
11. In countries without permanent census offices and/or permanent national archives
structures, the cost of preservation becomes a major issue. By looking at these
costs early and including them in the discussion of the overall costs in
undertaking a census, additional options for allocating funds may be found. For
example, the way in which census data are captured and prepared for
dissemination can reduce the cost of creating a preservation-quality record. In
addition, capturing and retaining procedural information as it is produced and
creating the logical or physical links to emerging data collections increase
the likelihood of preservation while reducing the cost of reconstructing
valuable metadata information.
12. Early discussions of the costs and future value of information preservation allow for
both informed decisions and the opportunity to discuss cooperative long-term
preservation possibilities in a timely fashion.
1. Preserving the products
13. The essential elements of any census in terms of preservation are the resulting
data and basic documentation. How those data are identified and defined varies
by country. Issues of confidentiality and security play a major role in
determining not only who should have access to the microdata and enumeration
forms, but also whether that information should be retained at all. Increasing
the availability of microdata contributes to the likelihood that these data
will be preserved.
14. An increasing number of countries are making microdata available in a
variety of forms: public samples, scientific samples (restricted to a few
carefully screened projects) and data enclaves, where the user works in a
secure site and output is tightly controlled. In the 1990 round of censuses
(1985 through 1994), 134 of the 153 countries with populations of one million
or more conducted enumerations, 94 per cent of the world's population was
counted, and 54 countries provided researchers access to anonymized census
samples of individuals and households. Some countries restricted access to a
single investigator or research facility, but what is remarkable about the
1990s is not only the globalization of the census but also the growing
acceptance of anonymized samples as statistical instruments. These trends are
continuing in the 2000 round of censuses (1995-2004).
15. For example, the approach used in the United States of providing public
samples of sizes ranging from 1 to 15 per cent for various area types supports
a wide range of research at both the local and national levels. In addition,
the release of the data from restricted status after 72 years has resulted in a
number of projects to make these data accessible to the public in digital
format. The most noted of these is the Integrated Public Use Microdata Series
(IPUMS) project. This project, begun in 1992 at the University of Minnesota,
integrates 65 million microdata records for the United States. Conceived by
Steven Ruggles, founding director of the Minnesota Population Center, and
funded by the National Science Foundation and the National Institutes of
Health, IPUMS integrates the decennial censuses of the United States, dating
from 1850 to 1990. The first version of the IPUMS database was released on tape
in 1993 and, by 1995, via the Internet. Thanks to the expansion of the
Internet, the data distribution problem was easily solved by means of a
web-site-driven data-dissemination engine (http://www.ipums.org). The IPUMS
database, distributed free of charge via the Internet, quickly established
itself as one of the three most frequently cited data sources in population
research about the United States.
16. In October 1999, with major funding secured from the National Science
Foundation, a global effort was inaugurated, dubbed IPUMS-International. With
the cooperation of national teams of investigators, the IPUMS-International
consortium proposes to integrate census microdata for more than a dozen
additional countries, with at least one from each continent. Historical census
microdata for Argentina, Canada, Costa Rica, Norway and the United Kingdom will
be included in the database, as well as those for the United States.
Contemporary microdata for Colombia and the United States will be integrated
along with those for Brazil, France, Hungary, Kenya, Mexico, Spain, the United
Kingdom, Viet Nam and others. Based on a prototype developed with the
cooperation of the Colombian National Statistical Office (Departamento
Administrativo Nacional de Estadística, or DANE), country teams of experienced
census data users are being formed to advise on how to harmonize the national
census concepts using international norms.
17. The creation of public use samples is being used by a variety of countries to
increase access to microdata. Software such as the Integrated Microcomputer
Processing System (IMPS and its successor CSPro), a system for data processing
of censuses and surveys, developed by the International Statistical Programs
Center of the U.S. Bureau of the Census, facilitates the dissemination of
microdata samples by providing tools for cross-tabulation, electronic map
production and other basic analysis, thereby reducing the cost of producing
these products for individual countries.
18. Countries that have distributed public-use microdata samples include Viet
Nam, which has released a 3 per cent sample of the 1999 population and housing
census, with the intention of releasing the full 100 per cent file at a later
date. Mexico has released a 10 per cent sample designed to yield valuable
information at the level of municipalities with 100,000 or more inhabitants.
France has released 5 per cent samples for 1962-1990. Likewise, the Central
Bureau of Statistics of Kenya has prepared a mega-sample of the 1999
enumeration (with a maximum density of 20 per cent) to complete its impressive
series of samples for 1969, 1979 and 1989.
19. These collections not only provide data in a preservable format; they include a range
of metadata. The documentation is extraordinarily complete and includes details
on every aspect of the census from earliest preparations to the final
publication of tables. The discussion of sampling is particularly noteworthy.
20. A growing number of countries are offering data in the REDATAM format (developed
by the United Nations Demographic Centre for Latin America and the Caribbean,
CELADE), as a way of storing microdata and making them useful to researchers
and administrators who need small-area statistics.
21. REDATAM (REtrieval of DATa for small Areas by Microcomputer) was originally
conceived as a low-cost data-retrieval computer program and has grown into a
concept that involves a proprietary database format as well as a software
development system. The proprietary format is intended to secure sensitive data
while keeping the invaluable flexibility of microdata access. A web service is
also available; it benefits national organizations that are reluctant to give
away data but are ready to provide public access to data and/or privileged
access to selected users. The program is freely available via the Internet.
REDATAM has been developed over the last two decades, thanks to the financial
support of several international organizations, including ECLAC (United Nations
Economic Commission for Latin America and the Caribbean), the Canadian
government through CIDA (Canadian International Development Agency) and IDRC
(International Development Research Centre), IDB (the Inter-American
Development Bank) and others (see the web site at http://www.cepal.cl/celade).
22. Countries with 1990 round censuses in REDATAM include:
Latin America: Argentina, Brazil, Chile, Colombia, Dominican Republic,
Guatemala, Honduras, Nicaragua, Paraguay, Suriname, Uruguay, Venezuela, and
the English-speaking Caribbean.
Asia: Cambodia* and Democratic People’s Republic of Korea*.
Africa: Benin, Burkina Faso, Burundi, Cameroon, Egypt, Gabon, Ghana,
Kenya, Madagascar, Mali, Nigeria, Rwanda*, Seychelles, and Zimbabwe*.
* = database with 100 per cent of the microdata for the population
23. While these microdata files are not in an archival format in the strict sense, they
have been captured in a way that allows for the authoring agency to output a formatted
ASCII file with complete structural metadata physically encapsulated to ensure
future understandability. It is important that such formats as REDATAM not be
viewed as long-term archival formats. The problem with not creating an archival
copy and maintaining records in a proprietary format is the cost of eventually
having to migrate that information to another format. Proprietary formats soon
become legacy formats, which, due to age, dependency on legacy languages,
systems or hardware, become difficult, costly and sometimes impossible to
migrate.
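To make the idea of a formatted ASCII file with physically encapsulated structural metadata concrete, the following is a minimal sketch in Python. It is not REDATAM's export routine or any agency's actual format: the variable layout, the "# ... cols ..." header convention and the function names are all assumptions made for illustration. The point it shows is that a reader with nothing but the file can recover the column positions from the header and parse the records.

```python
import io

# Hypothetical variable layout: (name, width, description).
# Not taken from any real census file.
LAYOUT = [
    ("HHID", 6, "Household serial number"),
    ("AGE", 3, "Age in completed years"),
    ("SEX", 1, "Sex: 1 = male, 2 = female"),
]

def write_archival_ascii(records, layout, out):
    """Write a metadata header, then zero-padded fixed-width data rows."""
    out.write("# STRUCTURAL METADATA\n")
    start = 1
    for name, width, desc in layout:
        end = start + width - 1
        out.write(f"# {name} cols {start}-{end} {desc}\n")
        start = end + 1
    out.write("# END METADATA\n")
    for rec in records:
        # Assumes every value fits within its declared width.
        row = "".join(str(rec[name]).zfill(width) for name, width, _ in layout)
        out.write(row + "\n")

def read_archival_ascii(inp):
    """Recover the layout from the header alone, then parse the data rows."""
    layout, rows = [], []
    for line in inp:
        line = line.rstrip("\n")
        if line.startswith("# ") and " cols " in line:
            name, _, rest = line[2:].partition(" cols ")
            start, end = (int(x) for x in rest.split(" ", 1)[0].split("-"))
            layout.append((name, start, end))
        elif line and not line.startswith("#"):
            rows.append({n: int(line[s - 1:e]) for n, s, e in layout})
    return rows

records = [{"HHID": 1, "AGE": 34, "SEX": 2}, {"HHID": 1, "AGE": 7, "SEX": 1}]
buf = io.StringIO()
write_archival_ascii(records, LAYOUT, buf)
buf.seek(0)
restored = read_archival_ascii(buf)
```

Because the header and the data travel in one plain-text file, the record layout cannot be separated from the data it describes, which is the property that matters when the originating software is no longer available.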
2. Preserving the process
24. Several manuals and handbooks on performing and managing a national census give
detailed lists of procedures and processes. This type of information and details
of particular approaches and methodologies are needed for accurately
interpreting the resulting data. In addition to this information, consideration
of the types of process information that will be of value in preserving
institutional memory is useful and often missed. This involves recording and
preserving the “why” as well as the “how” of the census process. Capturing this
information as decisions are made is more cost-effective than reconstructing it
at a later date. Attention should be paid to capturing it in a non-proprietary
format to reduce the likelihood that the information will be lost due to
migration costs.
If the complete census cycle consists of four phases (United Nations, 2000):
· Preparation,
· Field operations,
· Data processing and
· Evaluation,
then, for each phase, of particular interest is documentation of the following:
· Reports on procedures and methods;
· Comparison of concepts and procedures with the preceding census and current
international standards;
· Evaluation reports for each cycle of the census and the most important
documents on which the reports are based; and
· Record books (from manager of the census to the enumerator's logbook),
although these may be too raw for general dissemination.
The final step is to document the data disseminated and the associated
documentation and notes.
1. Who are the stakeholders?
25. As is clear from the above discussion, standards for census microdata and microdata
access are still emerging. The major question of preserving and allowing access
to microdata is no longer a technical one but a question of policy. As access
to and use of census data expand, the character and complexity of stakeholder
groups also expand. These stakeholder groups will be found among governmental,
non-governmental, academic, commercial and new user groups. While there will
always be competing interests for what should be retained, common interests
among several groups will help to identify materials for long-term
preservation. Consulting these stakeholder groups early in the process
increases the likelihood of obtaining and maintaining funding for long-term
preservation.
2. Future impact and anticipated use
26. In terms of the future impact and use patterns for census data, preserving and
retaining access to microdata hold the greatest potential. Census data are used
to address specific social, economic and demographic issues that change in
character over time. The ability to create comparable aggregations over time or
for new emerging geographic areas or definitions rests solely on the retention
of microdata files. This is particularly true for small-area statistics that
cannot be derived from larger-area aggregations.
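The dependence of comparable aggregations on retained microdata can be illustrated with a short Python sketch. All of the data, area codes, district names and age bands below are hypothetical. The same person-level records are tabulated twice, once under an older set of boundary and age-group definitions and once under a newer set; published tables built under the old definitions alone could not be regrouped in this way.

```python
from collections import Counter

# Hypothetical microdata: one record per person with a small-area code and age.
persons = [
    {"area": "A01", "age": 4},
    {"area": "A01", "age": 37},
    {"area": "A02", "age": 68},
    {"area": "A03", "age": 15},
    {"area": "A03", "age": 71},
]

# Two hypothetical boundary definitions mapping small areas to districts.
districts_old = {"A01": "North", "A02": "North", "A03": "South"}
districts_new = {"A01": "North", "A02": "South", "A03": "South"}

# Two hypothetical age classifications: (upper bound, label).
bands_old = [(14, "0-14"), (64, "15-64"), (200, "65+")]
bands_new = [(17, "0-17"), (200, "18+")]

def age_group(age, cutoffs):
    """Return the label of the first band whose upper bound is >= age."""
    for upper, label in cutoffs:
        if age <= upper:
            return label
    raise ValueError(f"age {age} is not covered by the classification")

def tabulate(records, area_map, cutoffs):
    """Count persons by (district, age group) under the supplied definitions."""
    return Counter(
        (area_map[r["area"]], age_group(r["age"], cutoffs)) for r in records
    )

table_old = tabulate(persons, districts_old, bands_old)
table_new = tabulate(persons, districts_new, bands_new)
```

Here `table_new` counts two persons aged 18 or over in the redrawn South district, a figure that cannot be derived from `table_old`, because the old table splits ages at 14 and 64 and assigns area A02 to North.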
27. For topical tabulations the difficulties are clear. The timing cycle of the census
and the production of statistical aggregations mean that the questions asked
and the tables created often reflect the concerns and interests from five to
ten years prior to publication. New tabulations may no longer be comparable to
tabulations from previous censuses due to change in cohort or classification
groupings, universe for the table or other changes in definition. Without
access to microdata, however it is secured, researchers and analysts are left
with few options. In addition, microdata lend themselves to research and
scholarly discourse and increase the value derived from an individual
census.
28. Consider the example of Canada. In the 1970s national statistical services
began to disseminate census microdata samples in growing numbers. In Canada,
the 1971 revision to the Statistics Act made possible the public release of
non-confidential microdata (Tambay and White, 2001). Since the 1970s Statistics
Canada, with its series of quinquennial enumerations, has regularly issued
census microdata samples. Until 1996, researchers had to request samples
individually and distribution was highly restricted. In that year a data
liberation initiative was instituted to permit Canadian universities to
disseminate microdata samples to researchers and their students. The result was
an explosion of research. Whereas before liberation five or ten scholars might
acquire microdata samples each year, afterward a single sample at a major
university might be accessed hundreds of times per month. The profusion of
suppliers means that usage statistics are now impossible to compile, whereas
before the agency recorded every user by name. Given the widespread use of
census microdata in the university classroom, Canadian scholars are educating a
younger generation of citizens about the utility of the census and
democratizing access to census data (Lisa Dillon, private communication, 21
April 2001).
29. In the United Kingdom, public-use census samples called SARs (Samples of
Anonymized Records) were first constructed for the 1991 enumeration, with a
sample density of 2.0 per cent for individual records and 0.5 per cent for
households. Administrative units with fewer than 120,000 inhabitants were not
identified. Notwithstanding the small density of the samples and the absence of
geographical detail, there was an explosion of research using the SARs.
Hundreds of studies were published within six years of the initial release of
the data. In anticipation of the 2001 enumeration, disclosure risks were
re-assessed, now taking into account error and coding variability as well as
differences in timing and coding schemes between data sets. With the permission
of the Office for National Statistics, privileged access was granted to attempt
to match survey records against the SARs (Dale and Elliot, in press). The
purpose was to test the practical, as opposed to the theoretical, risks of
identifying individuals by matching two sources. The authors reasoned that
prior assessments of the likelihood of identifying individuals exaggerated the
risks because they neglected to take into account error, differences in timing
and incompatibilities of coding schemes. From this rigorous exercise in
sleuthing, Dale and Elliot conclude:
For a user of an outside database, attempting this sort of match with no
opportunity for verification would prove fruitless. In the first place, the
small degree of expected overlap would be a considerable deterrent to an
intruder. However, if a match between the two files was attempted the large
number of apparent matches would be highly confusing as an intruder would have
no way of checking correct identification.
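The reasoning in the passage quoted above can be sketched in a few lines of Python. This is not the procedure Dale and Elliot used; the key variables, the records and the function names are all hypothetical. The sketch counts, for each record in an outside file, how many records in an anonymized sample agree on every key variable. These are "apparent" matches only.

```python
from collections import Counter

# Hypothetical key variables shared by the two files.
KEYS = ("age", "sex", "occupation")

def key_of(record):
    """Build the tuple of key-variable values used for matching."""
    return tuple(record[k] for k in KEYS)

def apparent_matches(outside_file, census_sample):
    """For each outside record, count sample records agreeing on every key."""
    sample_counts = Counter(key_of(r) for r in census_sample)
    return [sample_counts[key_of(r)] for r in outside_file]

# Hypothetical anonymized sample and outside database.
census_sample = [
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 51, "sex": "M", "occupation": "farmer"},
]
outside_file = [
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 51, "sex": "M", "occupation": "farmer"},
    {"age": 29, "sex": "M", "occupation": "clerk"},
]

counts = apparent_matches(outside_file, census_sample)
```

In this toy case the first outside record has two candidate matches and the third has none. Even the single candidate for the second record proves nothing, since the sample covers only a small fraction of the population and the true individual may not be in it, and since recording error or a difference in timing or coding between the files can produce or hide an agreement.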
3. Informing future censuses
30. The availability of process information specific to a country and/or organizational
structure can inform preparations for future censuses by providing a clear
picture of what took place, how it took place, why it was handled in a
particular manner, and the successes and difficulties that occurred. This type
of information is particularly important for countries with no permanent census
office or with minimal permanent staff that must essentially create a new
system with each census. Providing documentation of why processes and
procedures were followed in a specific way is also helpful to international
technical consultants, providing them with a clear and well-rounded picture of
the previous census activities.
31. Prior to the 1990 census round, the United Nations Statistics Division distributed a
questionnaire concerning general coverage of the census, organizational
structure, cartographic work, house and/or household listing, testing
(pre-tests, pilots, etc.), the census questionnaires, enumerators and supervisors,
enumeration, sampling, data processing, evaluation and analysis, data
dissemination, costs and future activities. In addition to the information
already requested, the following areas of information related to long-term
preservation would be useful in helping countries integrate the discussion of
preservation early in the process. Early recognition of preservation needs and
possibilities will help in making informed preservation decisions.
32. If census microdata are to become widely used, issues of statistical
confidentiality must be resolved to the satisfaction of the national
statistical agencies and the public as well as researchers. Eurostat sponsored
five international conferences on the subject over the past decade. Thanks in
part to these efforts and others, the standard practice is now to prepare
microdata samples for a variety of users. Among the 52 member states in the
International Monetary Fund's General Data Dissemination System, almost three
of every four disseminate census microdata samples, in one guise or another.
The development of international microdata standards will increase further the
availability of census samples, thereby facilitating comparative research, both
in time and space. Everywhere that public dissemination policies have been
adopted, an explosion in research has resulted, without a single instance of a
breach, or even the allegation of a breach, in statistical confidentiality.
33. Understanding and incorporating this concept of preservation is important in that it ensures
that census data will be protected from loss, alteration and deterioration. “In
this regard, the obligation of archivists is to explain to computer
specialists, information technology specialists, and others who are unfamiliar
with archives the importance of a physical or logical space, ‘independent of
the production environment,’ where records are protected from loss, alteration,
and deterioration so that they may be used as trustworthy evidence as far into
the future as is necessary. This is what archiving should be about” (Dollar,
2000).
Dale, Angela, and Mark Elliot (In press). Proposals for 2001 SARS: an
assessment of disclosure risk. Journal of the Royal Statistical Society,
Series A.
Dollar, Charles M. (2000). Authentic Electronic Records: Strategies for
Long-Term Access. Chicago, IL: Cohasset Associates, Inc.
Economic Commission for Europe (2001). Report of the March 2001 Work Session on
Statistical Data Confidentiality. Joint ECE/Eurostat Work Session on
Statistical Data Confidentiality, Skopje, March 2001.
Mexico, Instituto Nacional de Estadística, Geografía e Informática. Contar
2000. Sistema para la consulta de tabulados y base de datos de la muestra: XII
Censo General de Población y Vivienda 2000. Aguascalientes, Mexico.
Ruggles, Steven, Catherine A. Fitch, Patricia Kelly Hall and Matthew Sobek
(2000). IPUMS-USA: Integrated Public Use Microdata Series for the United
States. In Handbook of International Historical Microdata for Population
Research, Patricia Kelly-Hall, Robert McCaa and Gunnar Thorvaldsen, eds.
Minneapolis, MN.
Ruggles, Steven, J. David Hacker, and Matthew Sobek (1995). Order out of chaos:
general design of the Integrated Public Use Microdata Series. Historical
Methods, vol. 28, pp. 33-39.
Tambay, Jean-Louis, and Pamela White (2001). Providing greater accessibility to
survey data for analysis. Paper presented at Joint ECE/Eurostat Work Session on
Statistical Data Confidentiality, Skopje, March 2001.
United Nations (1990). Manual on Population Census Data Processing using
Microcomputers, Studies in Methods, Series F, No. 53. New York: United Nations.
_____ (1991). Emerging Trends and Issues in Population and Housing Censuses,
Studies in Methods, Series F, No. 52. New York: United Nations.
_____ (1992). Handbook of Population and Housing Censuses: Part I, Planning,
Organization and Administration of Population and Housing Censuses, Studies in
Methods, Series F, No. 54. Sales No. E.92.XVII.8.
_____ (1992). Handbook of Population and Housing Censuses: Part II, Demographic
and Social Characteristics, Studies in Methods, Series F, No. 54. Sales No.
E.91.XVII.9.
_____ (2000). Handbook on Census Management for Population and Housing
Censuses, Studies in Methods, Series F, No. 83. Sales No. E.00.XVII.15 Rev. 1.
Viet Nam, General Statistics Office (2000). Data and results from the 3% sample
of The Population and Housing Census. Hanoi: Central Data Processing Centre.