Symposium 2001/06

6 July 2001

 

English only

 

Symposium on Global Review of 2000 Round of

Population and Housing Censuses: 

Mid-Decade Assessment and Future Prospects

Statistics Division

Department of Economic and Social Affairs

United Nations Secretariat

New York, 7-10 August 2001

Adapting new technologies to census operations  *

Arij Dekker**



CONTENTS

 

Summary

A. Introduction

B. Management, communication, logistics and quality assurance

C. Data capture

1. Intelligent character recognition (ICR)

2. Automatic coding

3. Outsourcing and decentralization

D. GIS, Remote Sensing and GPS

E. Data processing and storage

1. Census-processing software

2. Data storage

F. Use of the Internet

1. The Internet for data collection

2. The Internet for data dissemination

G. Data dissemination: other issues

1. Statistical disclosure control

2. High-capacity physical media

3. Structured archives: the statistical data warehouse

H. How to choose appropriate technology

I. Future technologies

J. Conclusions

K. Discussion

References

Glossary

 



Summary

Adapting New Technologies to Census Operations

 

Even small improvements in census technology can result in important gains in the quality and cost-effectiveness of the whole census operation. At present a number of organizations are attempting to help bring innovation to census and statistical operations. Among the concerns regarding new technology are these: how to choose appropriate technology; how to maintain the integrity of existing statistical systems; how to deal with outsourcing certain tasks; and how to maintain confidentiality of data. Some technologies, such as mobile telephony, have made person-to-person communication in the field easier, as have fax and e-mail capabilities. Bar-code technology has made management of materials more efficient.

 

In the 2000 round of censuses, intelligent character recognition (ICR) made a breakthrough in many countries, although illegible handwritten characters and badly printed questionnaires still led to problems. In general, countries that planned carefully for the new technology and conducted pre-tests were more successful in their operations. The next step, automatic or computer-assisted coding, is also being explored, and some data, such as geographic names, may lend themselves to such coding. For some census operations, especially one-time, high-volume tasks such as data entry, outsourcing may be a good solution. Contractors with the necessary equipment and skills can supplement the census staff, but outsourcing also raises questions of overcoming bureaucratic obstacles, managing the contractor and enforcing confidentiality rules.

 

Census mapping has made great strides in the last few decades, from an activity requiring extensive fieldwork and manual drawing to one using remote sensing and computer-assisted map production. Geographic information system (GIS) technology is increasingly being used in population and housing censuses to generate maps for enumeration and for data presentation purposes. Global positioning systems (GPS) are cheap and available, and they can be used by cartographic field staff to annotate topographical maps and satellite photographs to produce excellent maps for enumerators. 

 

Data-processing software for censuses, which was previously developed and provided by non-profit agencies, is being supplanted by commercially available software. However, customizing general-purpose software for census purposes requires considerable programming skills, which may not always be available in a census organization.

 

The Internet as a tool for census data collection is still in its infancy, although several countries did allow some Internet enumeration in their most recent censuses. Generally, such data were collected from a small portion of the population on an experimental basis. Problems with this method include the need for authentication from each household; lack of coverage of households in many countries at this stage; and the fear that hackers could compromise the integrity of the census. Moreover, data collected via the Internet would have to be integrated into other data streams, including mail-back questionnaires and telephone responses. As a tool for data dissemination, however, the Internet is quickly becoming the principal medium, and statistical offices are responding with more electronic publications and effective web sites. Technology is also under development for the storage of census data, including data “warehouses”, which would contain all the data and metadata from a census.

 

It is impossible to create a single set of guidelines to help census planners choose the best new technology. Choices depend on the magnitude of the project, the availability of local skills, the funding situation, prior experience, time for preparation, and other factors. Census planners need to be conservative, because their solutions must be right the first time. New technology should never endanger the continuity of existing reporting systems and if possible should reinforce it.


A. Introduction

1.                  It is commonly known that the art of population census taking goes back many centuries. Ever since the end of the nineteenth century, there have been efforts to take advantage of a succession of newly available technologies to make such large and costly statistical enquiries more efficient and effective. A census is labour-intensive, requiring large numbers of temporary staff. Personnel costs usually are the principal component of census budgets, with expenditure for information and communication technology coming second.

 

2.                  Even small improvements in the methodologies used, or in the effectiveness of the equipment, can result in important gains in the quality and cost-effectiveness of the whole operation. Census budgets depend on national cost levels and the depth of the enquiry, but generally range from a few dollars per capita in low-cost countries to as much as 30 dollars per capita in highly developed environments. A rough estimate puts the total expense of the current round of censuses between 30 and 50 billion dollars. This is certainly an enticing target for those trying to improve value for money.

 

3.                  The name of Herman Hollerith stands out as an early adapter of modern technology to census work. He borrowed from the ideas of Joseph-Marie Jacquard, who had invented punched cards to control looms. Hollerith saw a way to use such cards in sorting and tabulation. By doing this he not only expedited the release of the results of the 1890 US census; he started an entire industry.

 

4.                  There have been many lesser-known census innovators who have put newly discovered methods and technology to good use. Information technology has usually been at the forefront of these efforts. Census data-processing equipment has graduated from machines that merely assisted tabulation work to indispensable tools in virtually all phases of census work. Computers are used for planning, to support mapping, in project management, in all stages of data capture, cleaning, coding, and reporting, and in demographic analysis (Dekker, 1997). Many of the recent improvements in census taking have been possible thanks to the ever-growing capabilities of data-processing equipment and communication networks operating on local, national, and worldwide levels. For the sake of continuity it is important that the use of newer technology is embedded into, and builds upon, existing sound methodology (United Nations, 1998).

 

5.                  There are presently several important efforts to bring coordination and focus to the innovation process in official statistics and census taking. One is the Paris 21 initiative: Partnership in Statistics for Development in the 21st Century. The several hundred members of Paris 21 are drawn from leading national and international statistical agencies, academic institutions and similar bodies. One of the issues currently being reviewed by the experts combining their efforts under the Paris 21 initiative is how census work can be made more cost-effective (see the web site at http://www.paris21.org for details).

 

6.                  The United Nations Statistics Division (UNSD) has a long history of furthering sound statistical principles and the sharing of know-how. A web site giving access to information on good statistical practices has recently been opened (http://www.esa.un.org/unsd/goodprac). On a regional scale, Eurostat has conducted a series of technical seminars under the names NTTS (New Techniques and Technologies for Statistics) and ETK (Exchange of Technology and Know-how). The 2001 meetings on these issues were held in combined form in June on Crete, Greece.

 

7.                  Noteworthy also is the Eurostat web site by the name of VIROS (Virtual Institute for Research in Official Statistics, web site http://www.europa.eu.int/en/comm/eurostat/research/viros). VIROS identifies and classifies areas of research where participating organizations may place the results of their studies and experiences, while remaining entirely responsible for them. Eurostat acts as a central coordinator, attempting to integrate the individual elements into a coherent set. The ultimate goal is to facilitate access to information on research activities and results. Eurostat is naturally interested in such issues, facing, as it does, the need to combine many statistical traditions and to overlay them where possible with state-of-the-art integration technology.

 

8.                  When considering the technological options before them, census offices face a number of questions. Some of these are:

·        How to make an informed choice in selecting appropriate technology;

·        How to maintain the integrity of the existing statistical and census systems;

·        How to deal with the option of outsourcing[1], and management of outsourced tasks; and

·        Confidentiality concerns relating to the preferred solutions.

 

9.                  This paper will look briefly at various areas where census work has recently benefited from new technology and will discuss the issues referred to above. Definite answers on the questions raised can be formulated only by individual census organizations themselves.

B. Management, communication, logistics and quality assurance

10.              A nationwide census differs in many respects from day-to-day statistical work. It lacks the repetitive nature that allows more frequent collections to be improved gradually. The level of expenditure and number of staff are much higher than statistical managers are used to. Some governments therefore establish census offices separate from the national statistical agency. It may be necessary to recruit professional management, experienced in dealing with large but temporary organizations. Since a census can be seen as a large time-critical project, with many interlocking operations, the use of modern project management software is of vital importance.

 

11.              A census operation requires efficient communication between thousands of persons, as well as procurement and storage of a large variety of items, most of which have to be distributed to all corners of the country and then recollected.

 

12.              Recent developments in mobile telephony (cell phones) have made person-to-person communication easier, even in countries with extensive and reliable fixed-line networks. But complete mobile coverage has not been accomplished in most developing countries. Census communication with remote areas continues to be problematic in some cases. It is still possible that satellite telephone systems, which function everywhere on earth, will fill this void. Some ambitious projects in this domain, such as that known as “Iridium,” have not drawn enough initial subscribers. But with most of the enormous investment costs now written off, user prices are coming down. The ground stations including antennas are still rather voluminous but completely portable. Operations planners need to be cognizant of all communications options open to them, including regional differences, and make arrangements accordingly.

 

13.              Where printed or printable communication is required, fax technology is rapidly giving way to electronic mail. This is true for census operations too, but relying on e-mail entails vulnerability to Internet service interruptions, computer illiteracy and virus attacks. It is important always to keep a fax capability for backup.

 

14.              Improved computer software and wide availability of personal computers (PCs) have made managing the movement of goods much easier. Bar-code technology can be a key element in this. Using bar codes instead of printed numbers has advantages in avoiding transcription errors and speeding up processing. The two can be printed together where the codes must sometimes also be read by people. Census managers, who are not logistics professionals, tend to overlook this established technology.

 

15.              A typical application of bar-code technology is to label all items specific to a particular enumeration area (maps, enumerator identification, summary sheets, transport box) with a dedicated bar code. At the point where the materials are sent out, the codes are scanned, allowing automatic update of a database of items forwarded. The same process can be used to maintain a database of items retrieved from the field.
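
The dispatch-and-retrieval bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration only: the class name, the bar-code format (enumeration area code, a hyphen, an item label) and the in-memory storage are all assumptions made for the example, and a real operation would use a proper database.

```python
from dataclasses import dataclass, field


@dataclass
class MaterialsLedger:
    """Tracks bar-coded items sent to and retrieved from the field (hypothetical)."""
    # Maps each bar code to the enumeration area (EA) it was labeled for.
    registered: dict = field(default_factory=dict)
    dispatched: set = field(default_factory=set)
    returned: set = field(default_factory=set)

    def register(self, barcode: str, ea_code: str) -> None:
        self.registered[barcode] = ea_code

    def scan_out(self, barcode: str) -> None:
        # Scanning at the dispatch point updates the list of items forwarded.
        if barcode not in self.registered:
            raise KeyError(f"unknown bar code: {barcode}")
        self.dispatched.add(barcode)

    def scan_in(self, barcode: str) -> None:
        # The same scan on return updates the list of items retrieved.
        if barcode not in self.dispatched:
            raise ValueError(f"item was never dispatched: {barcode}")
        self.returned.add(barcode)

    def outstanding(self) -> dict:
        """Return {EA code: sorted bar codes still in the field}."""
        missing = {}
        for barcode in sorted(self.dispatched - self.returned):
            missing.setdefault(self.registered[barcode], []).append(barcode)
        return missing


# Illustrative bar codes, invented for this example.
ledger = MaterialsLedger()
for code in ("EA0417-MAP", "EA0417-BOX", "EA0418-MAP"):
    ledger.register(code, code.split("-")[0])
    ledger.scan_out(code)
ledger.scan_in("EA0417-MAP")
ledger.scan_in("EA0418-MAP")
print(ledger.outstanding())  # {'EA0417': ['EA0417-BOX']}
```

Because every movement is recorded by a scan rather than by a typed number, the list of outstanding items per enumeration area is available at any moment without manual transcription.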

 

16.              Labeling individual questionnaires with unique codes can also be helpful, although the resulting administrative overhead is considerable. Such identifiers can protect against the fairly common problem that entire batches of questionnaires arrive back erroneously geocoded. Standard retail scanners, but also most intelligent character recognition systems (see Section C.1), will read bar codes without difficulty.
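
As a rough sketch of how unique questionnaire identifiers can expose a wrongly geocoded batch, the Python fragment below compares the enumeration area written on each returned batch with the area registered for each questionnaire at dispatch. The function name, the identifier formats and the two dictionaries are hypothetical and serve only to illustrate the check.

```python
def find_misgeocoded(questionnaire_register, returned_batches):
    """List questionnaires whose batch geocode differs from the registered one.

    questionnaire_register: {questionnaire ID: EA code assigned at dispatch}
    returned_batches: {EA code written on the batch: [questionnaire IDs scanned]}
    Returns a list of (questionnaire ID, registered EA, EA on batch) tuples.
    """
    problems = []
    for batch_ea, ids in returned_batches.items():
        for qid in ids:
            expected = questionnaire_register.get(qid)
            if expected != batch_ea:
                problems.append((qid, expected, batch_ea))
    return problems


# Invented identifiers: one questionnaire comes back in a batch
# labeled with the wrong enumeration area.
register = {"Q000101": "EA0417", "Q000102": "EA0417", "Q000201": "EA0418"}
returned = {"EA0417": ["Q000101", "Q000102"], "EA0419": ["Q000201"]}

for qid, expected, found in find_misgeocoded(register, returned):
    print(f"{qid}: registered for {expected}, returned under {found}")
```

The register of questionnaire identifiers is the administrative overhead mentioned above; the check itself is trivial once that register exists.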

 

17.             Quality assurance, including the use of scientifically sound sampling methods, should be an integral part of all census operations. Many of the methods in this field depend on statistical principles and have been developed by statistical innovators (Deming, 1986). The census office must strive for a consistent level of assured quality throughout its operations, and cannot afford to disregard the techniques that help to achieve and verify it (Statistics Sweden, 2001).

C. Data capture

1. Intelligent character recognition (ICR)

18.              It is probably true to say that the current round of censuses has seen the breakthrough of ICR technology. In the 1985-1994 round only about 20 per cent of countries undertaking censuses used some form of character or mark recognition (Dekker, 1994). The large majority still relied on keyboard data capture. In the current round nearly all census offices of industrial market economies, and numerous others, apply imaging through scanners, recognition software and other tools required to partially do away with manual data entry.

 

19.              There is no doubt that recognition technology has made great strides in the last decade, but it seems true also that the example provided by census “pioneers” has made switching course easier for those organizations that otherwise might have hesitated. ICR offers a promise of greater efficiency, but it is inherently riskier than keyboard data entry. For example, poorly designed or badly printed questionnaires are a nuisance in manual data entry, but may sink an anticipated ICR data-capturing operation. The need for elaborate pre-tests, already so obvious in traditional census taking, is even more apparent when scanning technology is to be used.

 

20.              The fundamental problem that remains is that handwritten characters are often poorly recognized when the writer is not already known to the recognition system. In censuses that rely on self-enumeration (auto-response) or on a large number of enumerators, this is obviously the case. To avoid the problem, it is possible to limit the automatic recognition to marks or numeric digits only. But even digits cannot always be reliably interpreted, so quite a few manual data-entry personnel will still be required to fill the gaps.

 

21.              Scattered information suggests that the ICR process does not always proceed as smoothly as anticipated. Experience gained during the final operations tests induced the United States Bureau of the Census to move from a one-pass to a two-pass processing system, where sample data from the long forms will be computer-stored only during a second capturing operation (Prewitt, 2000). This change of approach has had no effect on processing deadlines. Some European countries (for example, Estonia) have reported difficulties in recognizing handwritten alphabetic characters, requiring them to hire additional staff to assist the automatic recognition process. A recent meeting in Bangkok (United Nations, 2001) heard about problems of varying severity in China, Indonesia, Macao Special Administrative Region of China, the Philippines and Thailand[2]. (For details of the problems experienced, retrieve the country papers from the web site at http://www.unescap.org/stat/pop-it/pop-wdt.htm.)

 

22.              In Thailand, earlier plans to establish 15 regional ICR centers for the April 2000 census were cancelled after more sophisticated (and expensive) scanners and software turned out to be required. A single ICR complex now operates in Bangkok (Fujitsu 4099 scanners, TeleForm software). Some problems were reported with poorly written characters and scanner maintenance.

 

23.              The census of the Philippines on 1 May 2000 works with four decentralized capturing centers, using Kodak 3590 scanners and Eyes and Hands software. One of the biggest problems here is that the print quality of some questionnaires is not in accordance with specifications, which causes the ICR software to tag them as unidentifiable. Another difficulty is illegible handwritten entries. The number of verification licenses, required to manually correct such rejects, had been underestimated. This has been a learning process. Experiences are sufficiently positive to use ICR again for the upcoming census of agriculture and fisheries.

 

24.              The Macao Special Administrative Region of China reports good results for its pilot operation for the 2001 Census. The paper contains an interesting table, obtained from a sample of 150,000 images of digits. The table does not immediately confirm the effectiveness of ICR as implemented. It would seem useful to train enumerators in how best to write certain numerals.

 

Digit                       0      1      2      3      4      5      6      7      8      9    All
Recognition rate (%)    94.83  96.83  94.92  91.11  96.00  94.95  97.29  97.72  90.43  81.74  95.64
Reject rate (%)          5.17   3.17   5.08   8.89   4.00   5.05   2.71   2.28   9.57  18.26   4.36
Accuracy rate (%)       99.38  99.89  99.78  99.73  99.89  99.41  99.79  99.59  99.12 100.00  99.72
Error rate (%)           0.62   0.11   0.28   0.27   0.11   0.59   0.21   0.41   0.88   0.00   0.28
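
To read the table, it may help to combine its rows. The short Python sketch below assumes, and this is an interpretation rather than something stated in the country paper, that the accuracy rate is measured only over the digits that were not rejected. Under that assumption, the share of all digit images that were read automatically and correctly is the product of the recognition rate and the accuracy rate.

```python
# Rates (%) copied from the table above: digit -> (recognition, accuracy).
rates = {
    "0": (94.83, 99.38), "1": (96.83, 99.89), "2": (94.92, 99.78),
    "3": (91.11, 99.73), "4": (96.00, 99.89), "5": (94.95, 99.41),
    "6": (97.29, 99.79), "7": (97.72, 99.59), "8": (90.43, 99.12),
    "9": (81.74, 100.00), "All": (95.64, 99.72),
}


def auto_correct_share(recognition, accuracy):
    """Percentage of all digit images read automatically AND correctly.

    Assumes accuracy is measured over recognized (non-rejected) digits only.
    """
    return recognition * accuracy / 100.0


for digit, (recognition, accuracy) in rates.items():
    reject = 100.0 - recognition
    share = auto_correct_share(recognition, accuracy)
    print(f"{digit:>3}: rejected {reject:5.2f}%, auto-read correctly {share:5.2f}%")

# The digit with the lowest recognition rate is the one that sends
# the most work to human operators.
hardest = min((d for d in rates if d != "All"), key=lambda d: rates[d][0])
print("Lowest recognition rate: digit", hardest)
```

On these figures the digit 9 stands out: it is never misread once accepted, but almost one image in five is rejected, which is consistent with the suggestion that enumerators be trained to write certain numerals more carefully.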

 

25.              ICR for the 1 July 2000 Census of Indonesia is handled by 29 processing centers throughout the country, using Kodak DS 3500 scanners and NCS NestorReader recognition software embedded in Visual Basic programs developed in-house. The country paper reports many troubles that hamper the census ICR operation. These include sub-standard questionnaire printing (despite elaborate quality controls), poor writing by enumerators, inadequate document handling in the field resulting in unusable forms, scanner maintenance problems, and complex file management. The authors deserve the highest praise for sharing these experiences for others to learn from. The massive nature of the operation in Indonesia, scattered civil unrest, financial constraints, and various logistics problems have obviously all been a factor here. Despite the difficulties, the Central Bureau of Statistics (CBS) of Indonesia is confident that the data-capture operation will be completed successfully.

 

26.              The October 2000 Census of Aruba (not reported in Bangkok) used Fujitsu M3079DG scanners and Eyes and Hands software. All data for this small country of about 100,000 people were captured by April 2001. The operation was quite carefully prepared, and proceeded smoothly, including the integrated computer-assisted coding work. There were no cost advantages compared to keyboard data entry.

 

27.              Such problems as are reported can be divided into those that have to do with the recognition process itself, and all other ones. If the recognition rate is unacceptably low, this can usually be remedied by reducing the pre-set security level. But there is a price to pay: error rates will go up. Other problems may include unreliable paper transport in the scanners, which can have plenty of causes, including dirt, the use of correction fluid on sheets, and damaged forms, possibly as a result of bad weather conditions. It is not unheard of that such difficulties require large numbers of questionnaires to be transcribed, again increasing error rates.
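
The trade-off between recognition rate and error rate can be illustrated with a small Python sketch. It assumes that the "security level" works as a confidence threshold: the software attaches a confidence score to every character it reads and rejects those scoring below the threshold. The scores and outcomes below are made up for the illustration and do not come from any census.

```python
def rates_at(threshold, reads):
    """Return (recognition rate %, error rate %) at a given security level.

    reads: list of (confidence, is_correct) pairs, one per character image.
    A read is accepted when its confidence meets the threshold; otherwise
    it is rejected and would be passed to a human operator.
    The error rate is computed over accepted reads only.
    """
    accepted = [ok for conf, ok in reads if conf >= threshold]
    recognition = 100.0 * len(accepted) / len(reads)
    errors = accepted.count(False)
    error_rate = 100.0 * errors / len(accepted) if accepted else 0.0
    return recognition, error_rate


# Invented reads: low-confidence results are wrong more often.
reads = [(0.99, True), (0.97, True), (0.95, True), (0.92, True),
         (0.90, True), (0.85, True), (0.80, False), (0.75, True),
         (0.70, False), (0.60, False)]

for threshold in (0.90, 0.75, 0.60):
    recognition, error_rate = rates_at(threshold, reads)
    print(f"security level {threshold:.2f}: "
          f"recognized {recognition:.0f}%, errors {error_rate:.1f}%")
# security level 0.90: recognized 50%, errors 0.0%
# security level 0.75: recognized 80%, errors 12.5%
# security level 0.60: recognized 100%, errors 30.0%
```

Lowering the threshold moves characters from the reject pile into the accepted pile, but the characters gained are precisely the doubtful ones, so errors rise with the recognition rate.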

 

28.              As a general rule, success is often reported by census offices that went through a long and careful preparation process, including several pre-tests. Those that have to cut the groundwork short may become the source of less fortunate stories. Complete quality assurance management (for example, in the printing process of the questionnaires) is of the essence here.

 

29.              If recognition of handwritten text is now becoming a more reliable tool, it would be logical to think of speech recognition as the next step. After all, this is a more direct method of data collection. Speech recognition has broad economic potential and is a topic of much research. Some commercial applications of this technology are appearing, especially in processing verbal instructions received by telephone, and in the automotive industry. But progress in this area has been slower than expected. Statistical applications are still rare.

2. Automatic coding

30.               Recognizing verbal texts usually has the purpose of accommodating associated automatic coding. That is, the computer reads a text—for example, the name of a geographic area—and then selects the applicable code from an associated file or database.

 

31.              Such solutions, which ideally would allow completely automatic data capture and coding, depend on two prerequisites: (1) the recognition process must be sufficiently reliable and (2) the search algorithms must indeed lead from the recognized term(s) to the appropriate code. A 100-per-cent character-recognition rate is not required, since the algorithm may still be successful with incomplete or partially mangled terms.
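
That a search can succeed on a partially mangled term can be shown with approximate string matching. The sketch below uses the `difflib` module from the Python standard library; the place names, the codes and the similarity cutoff are all invented for the example, and a production system would need a far more careful matching strategy.

```python
import difflib

# Hypothetical code file: place name -> geographic code (all invented).
PLACE_CODES = {
    "NORTHFIELD": "0101",
    "SOUTH HARBOUR": "0102",
    "EAST MARKET": "0103",
    "WESTBROOK": "0104",
}


def code_place(recognized_text, cutoff=0.8):
    """Return the code of the closest place name, or None if nothing is close.

    cutoff is the minimum similarity (0 to 1) required to accept a match.
    """
    term = " ".join(recognized_text.upper().split())
    matches = difflib.get_close_matches(term, list(PLACE_CODES), n=1, cutoff=cutoff)
    return PLACE_CODES[matches[0]] if matches else None


print(code_place("Northfield"))   # exact match
print(code_place("NORTHF1ELD"))   # one character misread by the recognizer
print(code_place("XXXXXXXX"))     # nothing similar: left for a human coder
```

The second call still finds the right code although the recognizer has read the letter I as the digit 1, while the third returns no code at all rather than a guess.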

 

32.              However, there are indeed problems with this process. First there is the recognition reject rate, as referred to above, which might require an unexpected level of human intervention. Next comes the difficulty of automatically determining the applicable codes, the severity of which depends on the nature of the variable concerned. Geographic terms are usually not too difficult to code automatically, except perhaps for the lowest level (e.g., village), where spelling may not be standardized and homonyms occur. Occupation and industry tend to be more problematic. Despite the efforts by census field staff to extract full information from respondents, these variables will often be reported in terms that cannot be easily linked to ISCO, ISIC or NACE codebooks (see Glossary for terms).

 

33.              The issues of automatic and computer-assisted coding have been the subject of considerable research (Meyer and Rivière, 1997; Dopita, 1999; Blum, 1997). The tasks are a challenge to those applying modern methods of artificial intelligence, neural networks, and fuzzy logic[3]. But however elegant and advanced the matching algorithms are, once reporting from the field is open to several interpretations, too general, or otherwise inadequate, there is no easy way out. Many specialists feel that in those situations it is difficult to conceive automatic solutions that approach in quality the judgement of an experienced human coder. By letting the computer take care of the simpler cases, and relaying the remainder to human coders, an efficiency gain can nevertheless be obtained.
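
The division of labour between computer and human coder can be sketched as a simple triage rule: a response is coded automatically only when its best match in the index is both strong and clearly better than the runner-up, and is otherwise referred to a person. The Python fragment below is an illustration under stated assumptions. The occupation index, the placeholder codes (which are not real ISCO codes) and both thresholds are invented, and plain string similarity stands in for the more advanced methods discussed above.

```python
import difflib

# Hypothetical index of occupation descriptions. The codes are
# placeholders invented for this example, not real ISCO codes.
OCCUPATION_INDEX = {
    "PRIMARY SCHOOL TEACHER": "T1",
    "SECONDARY SCHOOL TEACHER": "T2",
    "TAXI DRIVER": "D1",
    "TRUCK DRIVER": "D2",
    "REGISTERED NURSE": "N1",
}


def similarity(a, b):
    """Similarity between two strings, from 0 (different) to 1 (identical)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def triage(response, accept=0.85, margin=0.10):
    """Return ("auto", code) for clear cases and ("human", None) otherwise."""
    term = " ".join(response.upper().split())
    scored = sorted(((similarity(term, desc), code)
                     for desc, code in OCCUPATION_INDEX.items()), reverse=True)
    (best, code), (runner_up, _) = scored[0], scored[1]
    # Code automatically only when the best match is strong and unambiguous.
    if best >= accept and best - runner_up >= margin:
        return "auto", code
    return "human", None


print(triage("taxi driver"))  # clear case, coded by the computer
print(triage("teacher"))      # too general: which kind of teacher?
print(triage("driver"))       # too general: which kind of driver?
```

Responses such as "teacher" or "driver" are exactly the overly general reports mentioned above; no threshold setting can resolve them, so the sketch hands them to a human coder instead of forcing a code.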

 

34.              As to the coding of industry, it may be noted that this can be improved by using a register of establishments or enterprises, and their known ISIC or NACE codes. Respondents may find it easier to report the name of their employer than to describe the principal economic activity of the company. This approach obviously requires the existence of a comprehensive national business register.
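
A register-based lookup of this kind might look as follows in Python. The register extract, the activity codes (placeholders, not real ISIC or NACE codes), the list of legal-form words and the normalization rules are all assumptions made for the illustration; a real register would also require handling of trade names, branches and misspellings.

```python
import re

# Hypothetical extract of a business register: employer name -> activity code.
# The codes are placeholders invented for this example.
BUSINESS_REGISTER = {
    "Island Bakery Ltd.": "IND-A",
    "Harbour Freight Services": "IND-B",
    "National Water Company": "IND-C",
}

# Words describing legal form, ignored when comparing names (assumed list).
LEGAL_FORM_WORDS = {"LTD", "LIMITED", "INC", "CO", "COMPANY", "NV"}


def normalize(name):
    """Upper-case, strip punctuation and drop common legal-form words."""
    words = re.sub(r"[^A-Z0-9 ]", " ", name.upper()).split()
    return " ".join(w for w in words if w not in LEGAL_FORM_WORDS)


# Index the register once by normalized name.
REGISTER_BY_KEY = {normalize(name): code for name, code in BUSINESS_REGISTER.items()}


def industry_code(employer_as_reported):
    """Look up the industry code of the reported employer, or None if unknown."""
    return REGISTER_BY_KEY.get(normalize(employer_as_reported))


print(industry_code("island bakery"))       # found despite missing "Ltd."
print(industry_code("National Water Co."))  # found despite abbreviated form
print(industry_code("Corner Shop"))         # not in the register
```

Employers that cannot be found in the register would still have to be coded from the respondent's description of the economic activity.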

 

35.              In conclusion: ICR in censuses has certainly not become an off-the-shelf technology. It requires careful design and extensive testing of questionnaires. The integration of ICR with associated operations, such as coding, needs ample prior thought and a clear strategy, again to be tested for effectiveness.

 

3. Outsourcing and decentralization

36.              Census data entry, through ICR or otherwise, is a potential candidate for outsourcing. Since it is a one-time high-volume application, there might be contractors that possess equipment and skills allowing them to offer the census office conditions that it could not match in an in-house operation. Meanwhile, it should be noted that outsourcing brings responsibilities of contracting and monitoring that require resources too. Confidentiality concerns multiply where outside contractors dealing with individual data are concerned. Quality assurance, already a major consideration in any event, becomes even more crucial if outside contractors are involved (see, for example, Whitford and Reichert, 2001). It would be attractive if the contractor could work within the census premises. In any event, contractor staff should be subject to confidentiality rules at least as severe as the ones imposed on temporary census staff.

 

37.              It should be noted that managers with an excellent in-house management record may still have difficulty controlling outsourced work, which requires different skills. These include knowledge of the service market, awareness of legal issues, negotiating skills, and more. In a census situation one easily ends up in circumstances where the supplier is in control, since the census organization, even while unhappy with the services provided, cannot afford to turn away.

 

38.              Sometimes government regulations put barriers in the way of outsourcing tasks that could better be assigned to specialized providers outside the census office. That situation obviously should be changed, but most likely the required reforms need to be implemented at a government level different from the one supervising national statistical services.

 

39.              Decentralized data capture would allow the census organization to keep matters in its own hands, but obtain advantages by spreading the work to its regional centers. The problems are somewhat comparable to those of outsourcing, although more easily managed. Much depends on the local situation: magnitude of the task at hand, conditions of the labour market, efficiency of communication and transport and so forth. Assigning more work outside the capital may also have a social and public relations benefit. General guidelines in this domain are impossible to formulate.