The Dutch virtual census
Pieter Everaers and Paul van der Laan
Statistics Netherlands, Division for Social and Spatial Statistics
Résumé
In the Netherlands, the traditional population and housing census has been replaced by a combination of administrative registers and household sample surveys (the ‘virtual census’). The population register is the core element of the census programme, and other administrative sources will be used to provide supplementary information. Socio-economic data and housing data will come mainly from the large-scale household surveys carried out by Statistics Netherlands.
Summary
In most western societies there is a strong trend towards more evidence-based decision making, so the demand for impartial and reliable statistical information will certainly continue to increase. At the same time, this trend towards an information-based society has led to an excessive supply of uncoordinated and only partially reliable information. To overcome the problems of uncoordinated statistics and to respond flexibly to the demand for timely and accurate statistics, Statistics Netherlands is developing integrated systems of statistics. The system for social statistics, the Social Statistics Database (SSD), is known more popularly as the "Virtual Census", because it allows Statistics Netherlands to construct a census at any chosen moment. In this system, datasets based on different sources, such as registers, other administrative sources and household surveys, are combined using personal identification codes. The accessibility of the sources as laid down by law, the merging based on personal identification codes and the methods of informed consent are the issues that define the ethics of modern social statistics. The intensive use of personal data requires a society in which government is trusted. Such trust arises only when a statistical institute has shown that it can process and use statistical information with integrity. Therefore, in addition to methodological and technical issues, integrated systems of this kind raise fundamental questions: how far to go, which data to combine and which not, which priorities to set, how to consolidate administrative data into authoritative statistics, and how to settle the legal and ethical issues properly.
Introduction
There are several reasons why an integrated statistical information system is needed. The reputation of statistics suffers most from the publication of contradictory results, the relatively high burden on respondents and the overall cost of statistical information. To improve that reputation, more emphasis has to be given to user needs, the quality of the results, the response burden and the overall costs of the statistical system.
This paper describes the fundamentals of the integrated system for social statistics that Statistics Netherlands is developing, and discusses in more detail the legal and ethical issues related to this system. The aim is to give an example of the rapid development of a modern, IT-based and authoritative integrated statistical system.
Prerequisites for an integrated statistical system
Users of statistical information want relevant data and indicators for monitoring, projecting and evaluating developments in society, the economy and the environment, and for recognising them early. Traditionally, statisticians tried to satisfy users' needs by producing a large number of predefined tables from single statistical sources. Today, however, users no longer want figures from isolated statistical sources; they want relevant, linked information.
Users expect not only more information but also that the data produced by the statistical information system are of high quality. Quality is usually defined in terms of the relevance of statistical concepts, the accuracy of estimates, the timeliness and punctuality of dissemination, the accessibility and clarity of information, the comparability of statistics, and their coherence and completeness.
The increasing importance of statistical information and statistical data in society has led to a situation in which not only statistical agencies collect data but other governmental and private agencies do so as well. This has led to a considerable increase in the administrative burden for establishments as well as households. Nowadays there is therefore strong pressure on statistical agencies to reduce the response burden and to lower the overall costs of the statistical system in general.
Finally, several legal and ethical aspects are central to every national statistical system. The first legal aspect concerns the status of a national statistical agency. Only a scientifically and institutionally independent agency can guarantee that the relevant user needs are identified in an unbiased way and that the required quality characteristics are taken into account.
Although it is part of the government, and formally even part of one of the ministries, Statistics Netherlands has gained the position of an independent agency with substantial scientific expertise. Relations with the academic world and institutional researchers are safeguarded by strict procedures for the use of micro data. The second legal aspect concerns data protection. The confidentiality of data on individual persons and firms must be guaranteed absolutely, both physically and legally. Physically, firewalls and strictly unconnected networks are needed. Legally, access to data is governed by procedures and agreements. Misuse of the data by users, even a small step outside the tolerated margins, is fatal for confidence. Very strict checking of the use and the users is therefore a prerequisite for the functioning of such a system.
The form and function of the integrated statistical system in the Netherlands
In the Netherlands, work on building an integrated system of social statistics started in the early 1990s. In 2001 the stage was reached at which statistics are produced via this integrated system. When fully developed, in about four years, the SSD will contain a wide range of social characteristics for each individual in the Netherlands: demography, geographical information, and information on income, labour, education, social protection and health. In addition to this information, which results from combining register and administrative data, data from sample surveys are merged into these files. These survey data contain more qualitative information on attitudes, behaviour and so on.
Since the early 1980s four steps have been taken to reach this stage: the development of a set of household sample surveys, the development of an accounting system, the harmonisation and integration of the household surveys, and the growing access to and use of register data for statistical purposes. Taken separately, these four steps offered only partial solutions to the fundamental problems society was confronted with. Accounting systems lead to consistent results for central indicators at the macro level; however, they are restricted to only a few indicators, they are not flexible enough, and they cannot fully solve the problem of contradictory results coming from different micro datasets. The integration of the household surveys makes it possible to meet users' needs more flexibly, produces a certain consistency at the micro level and can significantly reduce the response burden and total expenses, but it offers no adequate substitute for the population census with respect to regional data. Finally, the registers contain only a limited amount of information and therefore cannot replace household surveys completely.
Taken together, these steps form the basis of the SSD. In view of the weaknesses of each step on its own, as described above, Statistics Netherlands started work on the SSD. Within the SSD the approaches described above are brought together in an integrated statistical system. Three criteria were central to this development: avoiding the publication of conflicting information, making efficient use of existing resources and reducing fragmentation.
The SSD is based on six essential elements: 1) the intensive use of information from registers, 2) linking information from different registers, 3) integrating different household surveys into a few instruments, 4) linking register data to survey data, 5) harmonising information from different sources at the individual level and 6) applying a weighting model to achieve overall consistency (harmonisation at the output level).
The building of the integrated statistical system: legal aspects
In building the integrated system four aspects are important from the legal viewpoint.
1. Register accessibility: access by Statistics Netherlands (CBS) to the registers and administrative data has to rest on a clear legal basis. In the Netherlands this access is laid down in the Statistical Law and in the fundamental description of the role of Statistics Netherlands in Dutch society.
2. Privacy and confidentiality: access to the data by employees of Statistics Netherlands and by researchers has to be arranged in such a way that each citizen’s information is completely safe. Provisions have to be made so that the data can be used solely for statistical purposes and not in an administrative environment (‘administrative immunity’).
3. Informed consent: individual citizens must be aware that the information they provide in specific surveys, and the information provided to registers via administrative systems, is merged and/or used for specific statistical work.
4. Safe environment: the environment for processing and analysis must be isolated and safeguarded, physically as well as through written legal arrangements, agreements and so on.
The use of the statistical information: ethical aspects
When the statistical information provided via the SSD is used, ethical aspects have to be taken into account: to what extent and for what purposes may the information be used?
1. The scientific world will plead for access to the statistical information. The level of detail of the analyses for which the information is used, and for example the focus on vulnerable groups, depends strongly on the sense of responsibility of these researchers. Through agreements with the research world, with institutes as well as with individual researchers, Statistics Netherlands keeps track of and confirms the type of analysis done with the data. These institutes are given access via on-site facilities or secured micro datasets.
2. Ministries and related departments form another user group, and they are often also related to Statistics Netherlands as providers of one or more registrations. The unique position of Statistics Netherlands, regulated by the Statistical Law, allows it to merge the data and use them for statistical analysis without being forced to give government departments unlimited access to the merged data.
Contents of the Social Statistics Database
The backbone of the SSD is the locally maintained population register. All information that is available in other registers, for instance in the registers of the social security administration, is linked to this backbone. Not only register information but also survey information, which is available for only part of the population, is linked to the backbone. Information from household surveys (i.e. the Labour Force Survey and the Living Conditions Survey) as well as from an establishment survey (i.e. the Survey of Employment and Earnings) is integrated.
A specific methodology was developed for linking the different data sources. The strategy behind this methodology is to minimise invalid linkages. Therefore, wherever possible, the registers are linked using the unique identification number SOFI (social-fiscal number). The SOFI number is included in every register that contains information about persons. As none of the household surveys includes SOFI numbers so far, another approach has to be used to match household survey data with register information. (Footnote 1: Once the SSD is produced on a regular basis, it will also serve as a sampling frame for the household surveys. At that stage the SOFI numbers (or another unique identifier) can be used to link the household surveys to the registers.) In the first step, matching is done with three identifiers: "sex", "date of birth" and "present address". For all records that cannot be matched by these identifiers, a second attempt is made with the variables "sex", "date of birth" and "former address" (for details, see Arts, Bakker and Van Lith 2000).
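The two-step matching of survey records to the register can be sketched as follows. This is only a minimal illustration, not the procedure documented by Arts, Bakker and Van Lith: the record layouts, the field names and the function link_survey_to_register are hypothetical, addresses are assumed to be already standardised, and the rule that ambiguous keys (for example same-sex twins at one address) are left unlinked is an assumption made here in the spirit of minimising invalid linkages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegisterPerson:
    person_id: str                  # hypothetical register key
    sex: str
    date_of_birth: str              # ISO date string, e.g. "1960-01-02"
    present_address: str
    former_address: Optional[str] = None


@dataclass(frozen=True)
class SurveyRespondent:
    respondent_id: str
    sex: str
    date_of_birth: str
    address: str                    # address recorded at the interview


def _unique_index(register, address_of):
    """Map (sex, date of birth, address) to person_id, dropping ambiguous keys."""
    index = {}
    ambiguous = set()
    for person in register:
        address = address_of(person)
        if address is None:
            continue
        key = (person.sex, person.date_of_birth, address)
        if key in index:
            ambiguous.add(key)      # more than one register person shares this key
        index[key] = person.person_id
    for key in ambiguous:
        del index[key]              # assumption: no link is better than a doubtful link
    return index


def link_survey_to_register(survey, register):
    """Return {respondent_id: (person_id or None, matching step or None)}."""
    by_present = _unique_index(register, lambda p: p.present_address)
    by_former = _unique_index(register, lambda p: p.former_address)
    links = {}
    for respondent in survey:
        key = (respondent.sex, respondent.date_of_birth, respondent.address)
        match, step = by_present.get(key), 1          # step 1: present address
        if match is None:
            match, step = by_former.get(key), 2       # step 2: former address
        links[respondent.respondent_id] = (match, step if match is not None else None)
    return links
```

Recording the step at which each link was made keeps it possible to report match rates per step and to inspect the less certain second-step links separately.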
Linking data from registers and surveys by means of personal identifiers requires special data protection measures. Information about identified persons must be prevented by all means from becoming public. Therefore, all employees of Statistics Netherlands must sign a data protection agreement. In addition, a special "two step" coding technique is applied to anonymise the personal identifiers (for details, see Al and Altena 2000).
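How the "two step" coding works is not spelled out here; the details are in Al and Altena (2000). Purely to illustrate the general idea of replacing an identifier in two successive coding steps, the sketch below assumes two keyed hashes (HMAC-SHA256) applied one after the other, each with its own secret key. The function names, the choice of HMAC and the idea that the two keys are held separately are all assumptions made for this example, not a description of the technique used by Statistics Netherlands.

```python
import hashlib
import hmac


def keyed_code(identifier: str, key: bytes) -> str:
    """Replace an identifier by a keyed hash; not reversible without the key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def two_step_code(identifier: str, first_key: bytes, second_key: bytes) -> str:
    """Hypothetical two-step coding: the output of step 1 is coded again in step 2.

    Assumption: the two keys are held by different units, so that nobody
    holding only one key can relate a final code back to an original identifier.
    """
    intermediate = keyed_code(identifier, first_key)    # step 1
    return keyed_code(intermediate, second_key)         # step 2
```

Because the coding is deterministic, the same person receives the same code in every source, so records can still be linked after the original identifier has been removed.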
Harmonisation procedures
After all available register and survey data have been matched, the data have to be harmonised at the individual record level (micro integration):
- The definitions of the statistical units (persons, jobs, households etc.) have to be harmonised and corrected if necessary (with special reference to comparability in space and time);
- Reference periods, reference populations and definitions have to be harmonised and corrected if necessary (two identically defined items of information from different sources must be identical; otherwise the less reliable source has to be corrected);
- Analytical variables have to be calculated;
- The overall consistency has to be checked (do the data meet the requirements imposed by identity relations?) and the data have to be corrected if they do not meet the consistency requirements.
A more detailed description of this harmonisation process can be found in Van der Laan 2000.
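Two of the harmonisation rules above lend themselves to a small illustration: correcting the less reliable source when two identically defined items disagree, and checking an identity relation. The sketch below is only an example under stated assumptions. The reliability ranking, the function names and the income identity (total income equals wage income plus other income) are hypothetical and are not taken from the SSD.

```python
# Hypothetical reliability ranking: a higher number means a more reliable source.
RELIABILITY = {"register": 2, "survey": 1}


def harmonise(values_by_source):
    """Return the harmonised value and the sources that had to be corrected.

    values_by_source maps a source name to the value it reports for one
    identically defined item of one record.
    """
    best_source = max(values_by_source, key=lambda source: RELIABILITY[source])
    value = values_by_source[best_source]
    corrected = sorted(s for s, v in values_by_source.items() if v != value)
    return value, corrected


def violates_identity(record, tolerance=0.0):
    """Check the assumed identity total_income = wage_income + other_income."""
    expected = record["wage_income"] + record["other_income"]
    return abs(record["total_income"] - expected) > tolerance
```

Returning the list of corrected sources, rather than overwriting silently, makes it possible to monitor how often each source had to be adjusted.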
The weighting model
After all data have been matched, harmonised and stored in Statistics Netherlands’ internal micro databases, the survey data must be weighted. The aim of weighting is to obtain reliable and mutually consistent tables from the SSD. The procedure developed by Statistics Netherlands can be seen as a new application of traditional weighting techniques. It involves five steps (see Kooiman, Kroese and Renssen 2000 and Kroese, Renssen and Trijssenaar 2000):
1. Constructing different data sets according to the needed analytical unit: persons, households, jobs, etc.;
2. Constructing rectangular sub-datasets with complete information;
3. Assigning to each rectangular sub-dataset a set of weights that is derived according to some traditional weighting scheme;
4. Estimating as many mutually consistent tables of interest as possible;
5. Repeatedly reweighting the sub-datasets to some minimal reweighting scheme for other population tables of interest.
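As an illustration of step 3 only, the sketch below applies post-stratification, one traditional weighting scheme, to a survey sub-dataset so that its weighted counts reproduce population counts known from the registers. It does not implement the repeated reweighting of step 5. The function names, the stratum labels and the figures in the usage note are invented for the example.

```python
from collections import Counter


def poststratify(sample_strata, population_counts):
    """Return one weight per sample record: population count / sample count of its stratum."""
    sample_counts = Counter(sample_strata)
    missing = set(sample_counts) - set(population_counts)
    if missing:
        raise ValueError(f"no population count for strata: {sorted(missing)}")
    return [population_counts[h] / sample_counts[h] for h in sample_strata]


def weighted_total(values, weights):
    """Estimate a population total from sample values and their weights."""
    return sum(v * w for v, w in zip(values, weights))
```

For instance, with a sample of two men and one woman and register counts of 100 men and 60 women, the men each receive weight 50 and the woman weight 60, so the weights add up to the register total of 160. Every table estimated with these weights is then consistent with the register counts by sex.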
References to relevant literature
AL, P.G., ALTENA, J.W. (2000): ‘Data Security, Privacy and the SSB’, in STATISTICS NETHERLANDS (2000), pp. 47-50.
ARTS, C.H., BAKKER, B.F.M., LITH, F.J. VAN (2000): ‘Linking Administrative Registers and Household Surveys’, in STATISTICS NETHERLANDS (2000), pp. 16-22.
BOCHOVE, C.A. VAN, EVERAERS, P.C.J. (1996): ‘Micro-macro and Micro-micro Linkage in Social Statistics’, in The Future of European Social Statistics: Use of Administrative Registers and Dissemination Strategies; Proceedings of the Mondorf Seminar, Third session, Mondorf-les-Bains, Luxembourg, 25 and 26 January 1996, pp. 205-212, Luxembourg: Office for Official Publications of the European Communities.
BUHMANN, B., LEUNIS, W.P., VUILLE, A., WISMER, K. (2000): ‘Labour Accounts, Principles and Practices, Experiences in Denmark, the Netherlands and Switzerland’, mimeo.
EVERAERS, P.C.J. (1995): ‘The Integration of Household Surveys’, Paper prepared for the Eurostat Seminar ‘The Future of Social Statistics’, Mondorf-les-Bains, Luxembourg, 10-11 March 1995.
KOOIMAN, P., KROESE, A.H., RENSSEN, R.H. (2000): Official Statistics: An Estimation Strategy for the IT-era, Statistics Netherlands, Division for Research and Development, Research Paper No. 0018, Voorburg: Statistics Netherlands.
KROESE, A.H., RENSSEN, R.H., TRIJSSENAAR, M. (2000): ‘Weighting or Imputation: Constructing a Consistent Set of Estimates Based on Data from Different Sources’, in STATISTICS NETHERLANDS (2000), pp. 23-31.
LAAN, P. VAN DER (2000): ‘Integrating Administrative Registers and Household Surveys’, in STATISTICS NETHERLANDS (2000), pp. 7-15.
LEUNIS, W.P. (2000): ‘Employment and Compensation of Employees in the Netherlands According to ESA ’95: The Relation between Labour Force Survey and National Accounts’, Final report, Taskforce on ESA Employment.
LEUNIS, W.P., VERHAGE, C.G. (1986): ‘A Labour Accounting System I’, Bulletin of Labour Statistics, No. 2 (May 1986), pp. xxvi-xxx.
RENSSEN, R.H., NIEUWENBROEK, N.J. (1997): ‘Aligning estimates for common variables in two or more sample surveys’, Journal of the American Statistical Association, Vol. 92 (437), pp. 368-374.
STATISTICS NETHERLANDS (1998): Netherlands Official Statistics, Vol. 13 (Summer 1998), Special Issue, Integration of Household Surveys: Design, Advantages and Methods, ed. B.F.M. BAKKER and J.W. WINKELS.
STATISTICS NETHERLANDS (2000): Netherlands Official Statistics, Vol. 15 (Summer 2000), Special Issue, Integrating Administrative Registers and Household Surveys, ed. P.G. AL and B.F.M. BAKKER.
TUINEN, H.K. VAN (1995): ‘Social Indicators, Social Surveys and Integration of Social Statistics: Strengths, Weaknesses and Future Developments of the Main Approaches in Social Statistics’, Statistical Journal of the United Nations Economic Commission for Europe, Vol. 12 (3/4), pp. 379-394.