Symposium 2001/06
6 July 2001
English only
Symposium on Global Review of 2000 Round of Population and Housing Censuses:
Mid-Decade Assessment and Future Prospects
Statistics Division
Department of Economic and Social Affairs
United Nations Secretariat
New York, 7-10 August 2001
Adapting new technologies to census operations *
Arij Dekker**
B. Management, communication, logistics and quality assurance
1. Intelligent character recognition (ICR)
3. Outsourcing and decentralization
D. GIS, Remote Sensing and GPS
E. Data processing and storage
1. The Internet for data collection
2. The Internet for data dissemination
G. Data dissemination: other issues
1. Statistical disclosure control
2. High-capacity physical media
3. Structured archives: the statistical data warehouse
H. How to choose appropriate technology
Adapting New Technologies to Census Operations
Even small improvements in census technology can
result in important gains in the quality and cost-effectiveness of the whole
census operation. At present a number of organizations are attempting to help
bring innovation to census and statistical operations. Among the concerns
regarding new technology are these: how to choose appropriate technology; how
to maintain the integrity of existing statistical systems; how to deal with
outsourcing certain tasks; and how to maintain confidentiality of data. Some
technologies, such as mobile telephony, have made person-to-person
communication in the field easier, as have fax and e-mail capabilities.
Bar-code technology has made management of materials more efficient.
In the 2000 round of censuses, intelligent character
recognition (ICR) made a breakthrough in many countries, although illegible
handwritten characters and badly printed questionnaires still led to problems.
In general, countries that planned carefully for the new technology and
conducted pre-tests were more successful in their operations. The next step,
automatic or computer-assisted coding, is also being explored, and some data,
such as geographic names, may lend themselves to such coding. For some census
operations, especially one-time, high-volume tasks such as data entry,
outsourcing may be a good solution. Contractors with the necessary equipment
and skills can supplement the census staff, but outsourcing also raises
questions of overcoming bureaucratic obstacles, managing the contractor and
enforcing confidentiality rules.
Census mapping has made great strides in the last
few decades, from an activity requiring extensive fieldwork and manual drawing to
one using remote sensing and computer-assisted map production. Geographic
information system (GIS) technology is increasingly being used in population
and housing censuses to generate maps for enumeration and for data presentation
purposes. Global positioning systems (GPS) are cheap and available, and they
can be used by cartographic field staff to annotate topographical maps and
satellite photographs to produce excellent maps for enumerators.
Data-processing software for censuses, which was
previously developed and provided by non-profit agencies, is being supplanted
by commercially available software. However, customizing general-purpose
software for census purposes requires considerable programming skills, which
may not always be available in a census organization.
The Internet as a tool for census data collection is
still in its infancy, although several countries did allow some Internet
enumeration in their most recent censuses. Generally, such data were collected
from a small portion of the population on an experimental basis. Problems with
this method include the need for authentication from each household; lack of
coverage of households in many countries at this stage; and the fear that
hackers could compromise the integrity of the census. Moreover, data collected
via the Internet would have to be integrated into other data streams, including
mail-back questionnaires and telephone responses. As a tool for data
dissemination, however, the Internet is quickly becoming the principal medium,
and statistical offices are responding with more electronic publications and
effective web sites. Technology is also under development for the storage of
census data, including data “warehouses”, which would contain all the data and
metadata from a census.
It is impossible to create a single set of
guidelines to help census planners choose the best new technology. Choices
depend on the magnitude of the project, the availability of local skills, the
funding situation, prior experience, time for preparation, and other factors.
Census planners need to be conservative, because their solutions must be right
the first time. New technology should never endanger the continuity of existing
reporting systems and if possible should reinforce it.
1. It is commonly known that the art of population census taking goes back many
centuries. Ever since the end of the nineteenth century, there have been
efforts to take advantage of a succession of newly available technologies to
make such large and costly statistical enquiries more efficient and effective.
A census is labour-intensive, requiring large numbers of temporary staff.
Personnel costs usually are the principal component of census budgets, with
expenditure for information and communication technology coming second.
2. Even small improvements in the methodologies used, or in the effectiveness of the
equipment, can result in important gains in quality and/or cost-effectiveness
of the whole operation. Census budgets depend on national cost levels and the
depth of the enquiry, but generally range from a few dollars per capita in
low-cost countries to as much as 30 dollars per capita in highly developed
environments. A rough estimate of the total expense of the current round of
censuses would put it between 30 and 50 billion dollars. This is certainly an
enticing target for those trying to improve value for money.
3. The name of Herman Hollerith stands out as an early adaptor of modern technology to
census work. He borrowed from the ideas of Joseph-Marie Jacquard, who had
invented punched cards to control looms. Hollerith saw a way to use such cards
in sorting and tabulation. By doing this he not only expedited the release of
the results of the 1890 US census; he started an entire industry.
4. There have been many lesser-known census innovators who have put newly discovered
methods and technology to good use. Information technology has usually been at
the forefront of these efforts. Census data-processing equipment has graduated
from machines just assisting tabulation work, to indispensable tools in
virtually all phases of census work. Computers are used for planning, to
support mapping, in project management, in all stages of data capture,
cleaning, coding, and reporting, and in demographic analysis (Dekker, 1997).
Many of the recent improvements in census taking have been possible thanks to
the ever-growing capabilities of data-processing equipment and communication
networks operating on local, national, and worldwide levels. For the sake of
continuity it is important that the use of newer technology is embedded into,
and builds upon, existing sound methodology (United Nations, 1998).
5. There are presently several important efforts to bring coordination and focus to the
innovation process in official statistics and census taking. One is the Paris
21 initiative: Partnership in Statistics for Development in the 21st
Century. The members of Paris 21—there are several hundred of them—are drawn
from leading national and international statistical agencies, academic
institutions, etc. One of the several issues currently being reviewed by the
experts combining their efforts under the Paris 21 initiative is how census
work can be made more cost-effective (see the web site at http://www.paris21.org
for details).
6. The United Nations Statistics Division (UNSD) has a long history of furthering
sound statistical principles and the sharing of know-how. A web site giving
access to information on good statistical practices has recently been opened
(http://www.esa.un.org/unsd/goodprac). On a regional scale, Eurostat has
conducted series of technical seminars under the names NTTS (New Techniques
and Technologies for Statistics) and ETK (Exchange of Technology and Know-how).
The 2001 meetings on these issues were held in June in combined form on
Crete, Greece.
7. Also noteworthy is the Eurostat web site VIROS (Virtual Institute for Research in
Official Statistics, http://www.europa.eu.int/en/comm/eurostat/research/viros).
VIROS identifies and classifies areas of research where participating
organizations may place the results of their studies and experiences, while
remaining entirely responsible for them. Eurostat acts as a central coordinator,
attempting to integrate the individual elements into a coherent set. The
ultimate goal is to facilitate access to information on research activities and
results. Eurostat is naturally interested in such issues, facing, as it does,
the need to combine many statistical traditions and to overlay them where
possible with state-of-the-art integration technology.
8. When considering the technological options before them, census offices face a number
of questions. Some of these are:
· How to make an informed choice in selecting appropriate technology;
· How to maintain the integrity of the existing statistical and census systems;
· How to deal with the option of outsourcing[1], and with the management of outsourced tasks; and
· How to address confidentiality concerns relating to the preferred solutions.
9. This paper will look briefly at various areas where census work has recently
benefited from new technology and will discuss the issues referred to above.
Definitive answers to the questions raised can be formulated only by individual
census organizations themselves.
10. A nationwide census differs in many respects from day-to-day statistical work. It
lacks the repetitive nature that allows more frequent collections
to be improved gradually. The level of expenditure and number of staff are much
higher than statistical managers are used to. Some governments therefore
establish census offices separate from the national statistical agency. It may
be necessary to recruit professional management, experienced in dealing with
large but temporary organizations. Since a census can be seen as a large
time-critical project, with many interlocking operations, the use of modern
project management software is of vital importance.
11. A census operation requires efficient communication between thousands of persons,
as well as procurement and storage of a large variety of items, most of which
have to be distributed to all corners of the country and then recollected.
12. Recent developments in mobile telephony (cell phones) have made person-to-person
communication easier, even in countries with extensive and reliable fixed-line
networks. But complete mobile coverage has not been accomplished in most
developing countries. Census communication with remote areas continues to be
problematic in some cases. It is still possible that satellite telephone
systems, which function everywhere on earth, will fill this void. Some
ambitious projects in this domain, such as that known as “Iridium,” have not
drawn enough initial subscribers. But with most of the enormous investment
costs now written off, user prices are coming down. The ground stations,
including antennas, are still rather bulky but fully portable.
Operations planners need to be cognizant of all communications options open to
them, including regional differences, and make arrangements accordingly.
13. Where printed or printable communication is required, fax technology is rapidly
giving way to electronic mail. This is true for census operations too, but relying
on e-mail entails vulnerability to Internet service interruptions, computer
illiteracy and virus attacks. It is important always to keep a fax capability
as a backup.
14. Improved computer software and wide availability of personal computers (PCs) have made
managing the movement of goods much easier. Bar-code technology can be a key
element in this. Using bar codes instead of printed numbers has advantages in
avoiding transcription errors and speeding up processing. A combination of the
two can be used where the codes must sometimes also be read by people. Census
managers, who are not logistics professionals, tend to overlook this
established technology.
15. A typical application of bar-code technology is to label all items specific to a
particular enumeration area (maps, enumerator identification, summary sheets,
transport box) with a specific bar code. At the point where the materials are
sent out, the codes will be scanned, allowing automatic update of a database of
items forwarded. The same process can be used to maintain a database of items
retrieved from the field.
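The dispatch-and-return process described above can be sketched as a small ledger keyed by bar code. This is only an illustration under stated assumptions: the class name `MaterialsLedger`, its methods and the bar-code strings are hypothetical, and a real census office would use a proper database rather than in-memory structures.

```python
from dataclasses import dataclass, field


@dataclass
class MaterialsLedger:
    """Tracks bar-coded items sent to and returned from enumeration areas."""
    dispatched: dict[str, str] = field(default_factory=dict)  # bar code -> enumeration area
    returned: set[str] = field(default_factory=set)

    def scan_out(self, barcode: str, area: str) -> None:
        """Record an item scanned at the dispatch point."""
        self.dispatched[barcode] = area

    def scan_in(self, barcode: str) -> None:
        """Record an item scanned on its return from the field."""
        if barcode not in self.dispatched:
            raise KeyError(f"unknown bar code: {barcode}")
        self.returned.add(barcode)

    def outstanding(self) -> dict[str, list[str]]:
        """Return items still in the field, grouped by enumeration area."""
        missing: dict[str, list[str]] = {}
        for barcode, area in self.dispatched.items():
            if barcode not in self.returned:
                missing.setdefault(area, []).append(barcode)
        return missing


# Hypothetical bar codes for two enumeration areas.
ledger = MaterialsLedger()
ledger.scan_out("EA0417-MAP", "EA0417")
ledger.scan_out("EA0417-BOX", "EA0417")
ledger.scan_out("EA0418-MAP", "EA0418")
ledger.scan_in("EA0417-MAP")
ledger.scan_in("EA0418-MAP")
print(ledger.outstanding())  # {'EA0417': ['EA0417-BOX']}
```

Comparing the two sets of scans at any moment shows which items, and which enumeration areas, are still outstanding.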
16. Labeling individual questionnaires with unique codes can also be helpful, although the
resulting administrative overhead is considerable. Such identifiers can protect
against the fairly common problem of entire batches of questionnaires arriving
back erroneously geocoded. Standard retail scanners, as well as most intelligent
character recognition systems (see Section C.1), will read bar codes without
difficulty.
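As a hedged illustration of how unique questionnaire identifiers can expose a wrongly geocoded batch, the sketch below assumes a hypothetical identifier layout of the form `<area code>-<serial number>`. The function names and the layout are assumptions made for this example only; the paper does not prescribe a format.

```python
from collections import Counter


def check_batch_geocode(batch_area: str, questionnaire_ids: list[str]) -> list[str]:
    """Return the IDs whose embedded area code disagrees with the batch geocode.

    Assumes hypothetical IDs of the form '<area>-<serial>'.
    """
    return [qid for qid in questionnaire_ids if qid.split("-", 1)[0] != batch_area]


def likely_area(questionnaire_ids: list[str]) -> str:
    """Return the area code carried by most of the IDs in the batch."""
    counts = Counter(qid.split("-", 1)[0] for qid in questionnaire_ids)
    return counts.most_common(1)[0][0]


ids = ["0417-0001", "0417-0002", "0417-0003"]
# The batch label was keyed with two digits transposed (0471 instead of 0417).
mismatches = check_batch_geocode("0471", ids)
print(len(mismatches), likely_area(ids))  # 3 0417
```

When every questionnaire in a batch disagrees with the batch label, the label rather than the questionnaires is the likely culprit, and the majority area code suggests the correction.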
17.
Quality
assurance, including the use of scientifically sound sampling methods, should be an integrating part of all census
operations. Many of the methods in this field depend on statistical principle and
have been developed by statistical innovators (Deming, 1986). The census office
must strive for a consistent level of assured quality throughout its
operations, and cannot afford to disregard the techniques that help to achieve
and verify it (Statistics Sweden, 2001).
18. It is probably true to say that the current round of censuses has seen the
breakthrough of ICR technology. In the 1985-1994 round only about 20 per cent
of countries undertaking censuses used some form of character or mark
recognition (Dekker, 1994). The large majority still relied on keyboard data
capturing. In the current round nearly all census offices of industrial market
economies—and numerous other ones—apply imaging through scanners, recognition
software and other tools required to partially do away with manual data entry.
19. There is no doubt that recognition technology has made great strides in the last
decade, but it also seems true that the example provided by census “pioneers”
has made switching course easier for those organizations that otherwise might
have hesitated. ICR offers a promise of greater efficiency, but it is
inherently riskier than keyboard data entry. For example: poorly designed or
badly printed questionnaires are a nuisance in manual data entry, but may sink
an anticipated ICR data-capturing operation. The need for elaborate pre-tests,
already so obvious in traditional census taking, is even more apparent when
scanning technology is to be used.
20. The most fundamental remaining problem is that handwritten characters are
often poorly recognized where the writer's hand is not already known to the
recognition system. In censuses that rely on self-enumeration or on a large number of
enumerators, this is obviously the case. To avoid the problem, it is possible
to limit the automatic recognition to marks or numeric digits only. But even
digits cannot always be reliably interpreted, so quite a few manual data-entry
personnel will still be required to fill the gaps.
21. Scattered information suggests that the ICR process does not always proceed as smoothly as
anticipated. Experience gained during the final operations tests induced the
United States Bureau of the Census to move from a one-pass to a two-pass
processing system, where sample data from the long forms will be
computer-stored only during a second capturing operation (Prewitt, 2000). This
change of approach has had no effect on processing deadlines. Some European
countries (for example, Estonia) have reported difficulties in recognizing
handwritten alphabetic characters, requiring them to hire additional staff to
assist the automatic recognition process. A recent meeting in Bangkok (United
Nations, 2001) heard about problems of varying severity in China, Indonesia,
Macao Special Administrative Region of China, the Philippines and Thailand[2].
(For information on the details of the problems experienced, retrieve the
country papers from the web site at
http://www.unescap.org/stat/pop-it/pop-wdt.htm.)
22. In Thailand, earlier plans to establish 15 regional ICR centers for the April 2000
census were cancelled after more sophisticated (and expensive) scanners and
software turned out to be required. A single ICR complex now operates in
Bangkok (Fujitsu 4099 scanners, TeleForm software). Some problems were reported
with poorly written characters and scanner maintenance.
23. The 1 May 2000 census of the Philippines works with four decentralized capturing
centers, using Kodak 3590 scanners and Eyes and Hands software. One of the
biggest problems here is that the print quality of some questionnaires is not
in accordance with specifications, which causes the ICR software to tag them as
unidentifiable. Another difficulty is illegible handwritten entries. The number
of verification licenses, required to manually correct such rejects, had been
underestimated. This has been a learning process. Experience has been sufficiently
positive for ICR to be used again for the upcoming census of agriculture and fisheries.
24. The Macao Special Administrative Region of China reports good results for its pilot
operation for the 2001 Census. The paper contains an interesting table,
obtained from a sample of 150,000 images of digits. The table does not
immediately confirm the effectiveness of ICR as implemented. It would seem useful
to provide training to enumerators on how best to write certain numerals.
| Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
| Recognition rate (%) | 94.83 | 96.83 | 94.92 | 91.11 | 96.00 | 94.95 | 97.29 | 97.72 | 90.43 | 81.74 | 95.64 |
| Reject rate (%) | 5.17 | 3.17 | 5.08 | 8.89 | 4.00 | 5.05 | 2.71 | 2.28 | 9.57 | 18.26 | 4.36 |
| Accuracy rate (%) | 99.38 | 99.89 | 99.78 | 99.73 | 99.89 | 99.41 | 99.79 | 99.59 | 99.12 | 100.00 | 99.72 |
| Error rate (%) | 0.62 | 0.11 | 0.28 | 0.27 | 0.11 | 0.59 | 0.21 | 0.41 | 0.88 | 0.00 | 0.28 |
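The country paper's exact definitions are not reproduced here. The sketch below therefore assumes the reading suggested by the table itself, in which recognition and reject rates are complementary shares of all images, and accuracy and error rates are complementary shares of the accepted images only. The function name and the counts in the example are hypothetical, chosen merely to reproduce the figures in the "All" column.

```python
def icr_rates(total: int, rejected: int, errors: int) -> dict[str, float]:
    """Compute the four rates (in per cent) from raw image counts.

    total    -- digit images submitted for recognition
    rejected -- images the software declined to interpret
    errors   -- accepted images that were interpreted wrongly

    Assumed definitions: recognition and reject rates are taken over all
    images; accuracy and error rates are taken over accepted images only.
    """
    accepted = total - rejected
    if total <= 0 or accepted <= 0:
        raise ValueError("need at least one accepted image")
    recognition = 100.0 * accepted / total
    error = 100.0 * errors / accepted
    return {
        "recognition": recognition,
        "reject": 100.0 - recognition,
        "accuracy": 100.0 - error,
        "error": error,
    }


# Hypothetical counts, not taken from the country paper.
rates = icr_rates(total=10_000, rejected=436, errors=27)
print({name: round(value, 2) for name, value in rates.items()})
# {'recognition': 95.64, 'reject': 4.36, 'accuracy': 99.72, 'error': 0.28}
```

Under this reading, a high accuracy rate says nothing about the rejected images, which still have to be keyed manually.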
25. ICR for the 1 July 2000 Census of Indonesia is handled by 29 processing centers
throughout the country, using Kodak DS 3500 scanners and NCS NestorReader
recognition software embedded in the bureau's own Visual Basic programs. The country
paper reports many troubles that hamper the census ICR operation. These include
sub-standard questionnaire printing (despite elaborate quality controls), poor
writing by enumerators, inadequate document handling in the field resulting in
unusable forms, scanner maintenance problems, and complex file management. The
authors deserve the highest praise for sharing these experiences for others to
learn from. The massive nature of the operation in Indonesia, scattered civil
unrest, financial constraints, and various logistics problems have obviously
all been a factor here. Despite the difficulties, the Central Bureau of
Statistics (CBS) of Indonesia is confident that the data-capture operation will
be completed successfully.
26. The October 2000 Census of Aruba (not reported in Bangkok) used Fujitsu M3079DG
scanners and Eyes and Hands software. All data for this small country of about 100,000
people were captured by April 2001. The operation was quite carefully prepared,
and proceeded smoothly, including the integrated computer-assisted coding work.
There were no cost advantages compared to keyboard data entry.
27. Such problems as are reported can be divided into those that have to do with the
recognition process itself, and all other ones. If the recognition rate is
unacceptably low, this can usually be remedied by reducing the pre-set security
level. But there is a price to pay: error rates will go up. Other problems may
include unreliable paper transport in the scanners, which can have plenty of
causes, including dirt, the use of correction fluid on sheets, and damaged
forms, possibly as a result of bad weather conditions. It is not unheard of
that such difficulties require large numbers of questionnaires to be
transcribed, again increasing error rates.
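The trade-off between recognition rate and error rate can be illustrated with a toy calculation. It assumes that the "security level" acts as a minimum confidence score below which the software rejects a character; this interpretation, the function name and the sample data are all assumptions made for the illustration and do not describe any particular ICR product.

```python
def rates_at_threshold(results: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Return (recognition rate, error rate) in per cent for a given threshold.

    results   -- one (confidence, is_correct) pair per character image
    threshold -- minimum confidence at which the software accepts its own reading
    """
    accepted = [is_correct for confidence, is_correct in results if confidence >= threshold]
    recognition = 100.0 * len(accepted) / len(results)
    error = 100.0 * accepted.count(False) / len(accepted) if accepted else 0.0
    return recognition, error


# Invented sample: readings with low confidence are wrong more often.
sample = [
    (0.99, True), (0.97, True), (0.95, True), (0.90, True), (0.85, True),
    (0.80, False), (0.75, True), (0.70, False), (0.60, True), (0.50, False),
]
strict = rates_at_threshold(sample, 0.90)
lenient = rates_at_threshold(sample, 0.70)
print(strict, lenient)  # (40.0, 0.0) (80.0, 25.0)
```

In this invented sample, lowering the threshold doubles the share of characters accepted automatically, but a quarter of the accepted readings are then wrong.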
28. As a general rule, success is often reported by census offices that went through a
long and careful preparation process, including several pre-tests. Those that
have to cut the groundwork short may become the source of less fortunate
stories. Complete quality assurance management—for example, in the printing
process of the questionnaires—is of the essence here.
29. If recognition of handwritten text is now becoming a more reliable tool, it would
be logical to think of speech recognition as the next step. After all, this is
a more direct method of data collection. Speech recognition has broad economic
potential and is a topic of much research. Some commercial applications of this
technology are appearing, especially in processing verbal instructions received
by telephone, and in the automotive industry. But progress in this area has
been slower than expected. Statistical applications are still rare.
30. Recognizing verbal texts usually has the
purpose of accommodating associated automatic coding. That is, the computer
reads a text—for example, the name of a geographic area—and then selects the
applicable code from an associated file or database.
31. Such solutions, which ideally would allow completely automatic data capture and
coding, depend on two prerequisites: (1) the recognition process must be
sufficiently reliable and (2) the search algorithms must indeed lead from the
recognized term(s) to the appropriate code. A 100-per-cent
character-recognition rate is not required, since the algorithm may still be
successful with incomplete or partially mangled terms.
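A minimal sketch of such tolerant matching is given below, using the approximate string matching in Python's standard `difflib` module. The place names, codes and cutoff value are hypothetical, and this is only one of many possible matching techniques, not the method of any census office mentioned in this paper.

```python
import difflib
from typing import Optional

# Hypothetical codebook: place name (lower case) -> geographic code.
PLACE_CODES = {"northfield": "1001", "riverbend": "1002", "eastport": "1003"}


def auto_code(text: str, codebook: dict[str, str], cutoff: float = 0.8) -> Optional[str]:
    """Return the code of the closest codebook entry, or None if nothing is close enough."""
    key = text.strip().lower()
    matches = difflib.get_close_matches(key, list(codebook), n=1, cutoff=cutoff)
    return codebook[matches[0]] if matches else None


print(auto_code("Northfeld", PLACE_CODES))   # one letter missing       -> 1001
print(auto_code("R?verbend", PLACE_CODES))   # one character unreadable -> 1002
print(auto_code("Xyz", PLACE_CODES))         # no plausible match       -> None
```

The cutoff plays a role similar to the recognition threshold: a lower value codes more terms automatically, at the risk of more wrong codes.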
32. However, there are indeed problems with this process. First there is the recognition
reject rate, as referred to above, which might require an unexpected level of
human intervention. Next comes the difficulty of automatically determining the
applicable codes, the severity of which depends on the nature of the variable
concerned. Geographic terms are usually not too difficult to code
automatically, except perhaps for the lowest level (e.g., village), where
spelling may not be standardized and homonyms occur. Occupation and industry
tend to be more problematic. Despite the efforts by census field staff to
extract full information from respondents, these variables will often be
reported in terms that cannot be easily linked to ISCO, ISIC or NACE codebooks
(see Glossary for terms).
33. The issues of automatic and computer-assisted coding have been the subject of
considerable research (Meyer and Rivière, 1997; Dopita, 1999; Blum, 1997). The
tasks are a challenge to those applying modern methods of artificial
intelligence, neural networks, and fuzzy logic[3].
But however elegant and advanced the matching algorithms are, once reporting
from the field is multi-interpretable, too general, or otherwise inadequate,
there is no easy way out. Many specialists feel that in those situations it is
difficult to conceive automatic solutions that approach in quality the
judgement of an experienced human coder. By letting the computer take care of
the simpler cases, and relaying the remainder to human coders, an efficiency
gain can nevertheless be obtained.
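The division of labour between computer and human coder might look as follows in outline. The sketch assumes a simple rule: a response is coded automatically only if its best match is both strong and clearly better than the runner-up, and is otherwise queued for a human coder. The occupation titles, codes and thresholds are hypothetical and are not drawn from ISCO.

```python
import difflib
from typing import Optional

# Hypothetical occupation codebook: title -> code.
OCCUPATIONS = {
    "primary school teacher": "T01",
    "secondary school teacher": "T02",
    "bus driver": "D01",
    "truck driver": "D02",
}


def route(response: str, codebook: dict[str, str],
          accept: float = 0.85, margin: float = 0.10) -> tuple[str, Optional[str]]:
    """Return ('auto', code) for clear-cut cases and ('manual', None) otherwise."""
    key = response.strip().lower()
    scored = sorted(
        ((difflib.SequenceMatcher(None, key, title).ratio(), title) for title in codebook),
        reverse=True,
    )
    best_score, best_title = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Code automatically only when the best match is strong and unambiguous.
    if best_score >= accept and best_score - runner_up >= margin:
        return ("auto", codebook[best_title])
    return ("manual", None)


print(route("bus drivr", OCCUPATIONS))  # ('auto', 'D01')
print(route("teacher", OCCUPATIONS))    # ('manual', None): too general
print(route("driver", OCCUPATIONS))     # ('manual', None): too general
```

The misspelled but specific response is coded by the computer, while the responses that are too general are passed to the human coders' queue.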
34. As to the coding of industry, it may be noted that this can be improved by using a
register of establishments or enterprises, and their known ISIC or NACE codes.
Respondents may find it easier to report the name of their employer than to
describe the principal economic activity of the company. This approach
obviously requires the existence of a comprehensive national business register.
35. In conclusion: ICR in censuses has certainly not become an off-the-shelf
technology. It requires careful design and extensive testing of questionnaires.
The integration of ICR with associated operations, such as coding, needs ample
prior thought and a clear strategy, again to be tested for effectiveness.
36. Census data entry, through ICR or otherwise, is a potential candidate for outsourcing.
Since it is a one-time high-volume application, there might be contractors that
possess equipment and skills allowing them to offer the census office
conditions that it could not match in an in-house operation. Meanwhile, it
should be noted that outsourcing brings responsibilities of contracting and
monitoring that require resources too. Confidentiality concerns multiply where
outside contractors handle individual data. Quality
assurance, already a major consideration in any event, becomes even more
crucial if outside contractors are involved (see, for example, Whitford and
Reichert, 2001). It would be attractive if the contractor could work within the
census premises. In any event, contractor staff should be subject to confidentiality
rules at least as severe as the ones imposed on temporary census staff.
37. It should be noted that managers with an excellent in-house management record may
still have difficulty controlling outsourced work, which requires different
skills. These include knowledge of the service market, awareness of legal
issues, negotiating skills, and more. In a census situation one easily ends up
in circumstances where the supplier is in control, since the census
organization, even while unhappy with the services provided, cannot afford to
turn away.
38. Sometimes government regulations put barriers in the way of outsourcing tasks that could
better be assigned to specialized providers outside the census office. That situation
obviously should be changed, but most likely the required reforms need to be
implemented at a government level different from the one supervising national
statistical services.
39. Decentralized data capture would allow the census organization to keep matters in its own
hands, but obtain advantages by spreading the work to its regional centers. The
problems are somewhat comparable to those of outsourcing, although more easily managed. Much
depends on the local situation: magnitude of the task at hand, conditions of
the labour market, efficiency of communication and transport and so forth.
Assigning more work outside the capital may also have a social and public
relations benefit. General guidelines in this domain are impossible to
formulate.