- Created by Ann-Kristin Kreutzmann, last modified by Haoyi Chen on Feb 09, 2022
This section explains important data characteristics for SAE models. The differences lie in the kind of data – survey vs. additional data – and in the available level – unit- vs. area-level data.
Data sources - Survey versus additional data
A central idea of small area estimation models is the combination of several data sources. One data source is usually survey data. The other data sources, typically administrative/register or census data, are assumed to be free of uncertainty and measurement error. Special models also allow a second data source with measurement error, e.g., a larger survey. More recently, big data sources have been used as the second data source; these are also subject to uncertainty, even though they are often treated as if they were not.
Overview of potential additional data sources
Additional data source | Examples |
---|---|
Census data | |
Administrative data | |
Survey data | |
Geospatial data and other alternative data sources | |
In general, good additional information is crucial for model-based estimation. Desirable characteristics include:
- Good predictive power for the indicators of interest.
- Available for the domains of interest.
- Available for the same time period as the survey.
Area-level data
Area- or domain-level data contains information for each domain but not for the units themselves. Area-level survey information could be, e.g., reported direct estimates along with their variance estimates. Area-level auxiliary information could be administrative data that is published for specific areas or domains. Official data is often more readily available at an aggregated level since aggregation avoids confidentiality issues. When unit-level information is available, area-level data can usually be generated from the unit-level data.
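The last point, generating area-level data from unit-level data, can be sketched as follows. This is a minimal illustration under stated assumptions: the data frame `unit_survey` and its values are hypothetical, the direct estimate is taken to be a simple domain mean, and the variance formula assumes simple random sampling within domains without a finite population correction. A real survey would normally require the design weights and a design-based variance estimator.

```r
# Hypothetical unit-level survey: two domains with a few sampled units each
unit_survey <- data.frame(
  domain = c(1, 1, 1, 2, 2, 2, 2),
  y      = c(1500, 1000, 1250, 850, 540, 780, 900)
)

# Direct estimate per domain: the sample mean of the variable of interest
direct <- tapply(unit_survey$y, unit_survey$domain, mean)

# Sampling error variance of the mean per domain
# (assumption: simple random sampling within domains, no fpc)
n_d     <- tapply(unit_survey$y, unit_survey$domain, length)
var_dir <- tapply(unit_survey$y, unit_survey$domain, var) / n_d

# Area-level data set: one row per domain
area_survey <- data.frame(
  domain   = as.integer(names(direct)),
  direct   = as.numeric(direct),
  variance = as.numeric(var_dir)
)
area_survey
```

The resulting data frame has one row per domain with a direct estimate and a variance estimate, which is the layout an area-level model expects.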
The following tables show a typical example of the data layout required for an area-level SAE model. The survey data (first table) contains area-level information in the form of the direct estimate, which could be a mean or a total, and an estimate of its sampling error variance. Each row represents a domain, so the number of rows equals the number of domains. The additional data (second table) likewise only needs to be available at the area level. It has one row per domain, and the domains are the same as in the survey data. If not all domains were sampled, the additional data can contain more domains than the survey data, but the domain level is still the same, e.g., municipalities or ethnic groups. The auxiliary variables need to be available in the additional data source and should have the potential to predict the true, unknown population parameters.
Survey area-level data, year 2000
Domain | Direct estimate | Sampling error variance |
---|---|---|
1 | 1500 | 105 |
2 | 1000 | 95 |
... | ... | ... |
44 | 750 | 60 |
45 | 800 | 79 |
Additional area-level data, year 2000
Domain | Pred1 | Pred2 | Pred3 |
---|---|---|---|
1 | 450 | 0.90 | 5 |
2 | 325 | 0.85 | 6 |
... | ... | ... | ... |
44 | 170 | 0.55 | 4 |
45 | 125 | 0.50 | 3 |
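As a rough sketch of how the two area-level tables can be combined, the snippet below merges them by domain. The rows for domains 1, 2, 44 and 45 are copied from the tables above; domain 3 and its values are hypothetical and only added to illustrate a domain that was not sampled. The object and column names are assumptions for illustration.

```r
# Survey area-level data (rows taken from the table above)
survey_area <- data.frame(
  domain   = c(1, 2, 44, 45),
  direct   = c(1500, 1000, 750, 800),
  variance = c(105, 95, 60, 79)
)

# Additional area-level data; domain 3 is a hypothetical out-of-sample domain
aux_area <- data.frame(
  domain = c(1, 2, 3, 44, 45),
  Pred1  = c(450, 325, 300, 170, 125),
  Pred2  = c(0.90, 0.85, 0.80, 0.55, 0.50),
  Pred3  = c(5, 6, 5, 4, 3)
)

# Keep every domain of the additional data; domains that were not sampled
# get NA for the direct estimate and its variance
combined <- merge(survey_area, aux_area, by = "domain", all.y = TRUE)

# Flag the domains for which a direct estimate exists
combined$in_sample <- !is.na(combined$direct)
combined
```

Using `all.y = TRUE` keeps out-of-sample domains in the combined data, so that a model fitted on the sampled domains could later produce predictions for them.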
Unit-level data
Unit-level data contains information for each unit in each domain. Survey data is often available at the unit level. The units can be, e.g., individuals or households. Additional data sources are less likely to contain unit-level information due to confidentiality issues. Sources that could have individual information are census data or specific registers. Sources that are less likely to have individual information are other administrative data or big data sources. However, this differs across countries, and even when data exists at the unit level, confidentiality restrictions may prevent its use.
Unit-level SAE models require that predictors have the same definition in the survey and the other data sources. This means that the additional data sources need
- to be available at the unit-level, or at the aggregated level for the domains,
- to be complete and cover the entire population,
- to contain variables that have the potential to explain the variable of interest and that are also available in the survey data.
The following tables show a typical example of the data layout required for a unit-level SAE model. The survey data (first table) is a random sample of a defined population, i.e., it contains information for some units in the population. A domain variable identifies the domain of interest. The variable of interest and the predictor variables are also in the data set. The additional unit-level data (second table), e.g., a census, contains information for all units in the defined population. It also has a domain variable that identifies the same domains as in the survey data. The variable of interest is not available in the additional data source (otherwise, no estimation would be necessary). The additional data comprises three predictor variables, but only two of them (Pred1 and Pred2) correspond to variables in the survey data set. Thus, the availability of a predictor in the second data source is an additional limitation in the variable selection.
Survey unit level data, year 2000
Unit | Domain | Variable of interest | Pred1 | Pred2 | Pred3 | Pred4 |
---|---|---|---|---|---|---|
1 | 1 | 1500 | 450 | 0 | 3 | 1 |
3 | 1 | 1000 | 400 | 0 | 3 | 1 |
4 | 1 | 1250 | 425 | 1 | 4 | 2 |
... | ... | ... | ... | ... | ... | ... |
200 | 45 | 850 | 135 | 1 | 4 | 2 |
220 | 45 | 540 | 120 | 0 | 2 | 2 |
225 | 45 | 780 | 125 | 1 | 2 | 2 |
Additional unit level data, year 2000
Unit | Domain | Pred1 | Pred2 | Pred6 |
---|---|---|---|---|
1 | 1 | 450 | 0 | 5 |
2 | 1 | 430 | 1 | 6 |
3 | 1 | 425 | 1 | 6 |
... | ... | ... | ... | ... |
223 | 45 | 135 | 1 | 4 |
224 | 45 | 120 | 0 | 4 |
225 | 45 | 125 | 1 | 3 |
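The restriction on variable selection can be checked directly by comparing the column names of the two data sets. The sketch below uses the first rows of the two tables above; the object names (`survey_unit`, `census_unit`) and the column name `y` for the variable of interest are assumptions for illustration.

```r
# First rows of the survey table above (y = variable of interest)
survey_unit <- data.frame(
  unit   = c(1, 3, 4),
  domain = c(1, 1, 1),
  y      = c(1500, 1000, 1250),
  Pred1  = c(450, 400, 425),
  Pred2  = c(0, 0, 1),
  Pred3  = c(3, 3, 4),
  Pred4  = c(1, 1, 2)
)

# First rows of the additional unit-level table above
census_unit <- data.frame(
  unit   = c(1, 2, 3),
  domain = c(1, 1, 1),
  Pred1  = c(450, 430, 425),
  Pred2  = c(0, 1, 1),
  Pred6  = c(5, 6, 6)
)

# Identifier columns are not predictors
id_vars <- c("unit", "domain")

# Predictors available in both sources: candidates for a unit-level model
common_predictors <- setdiff(
  intersect(names(survey_unit), names(census_unit)), id_vars
)
common_predictors

# Survey variables that cannot be used because the census lacks them
survey_only <- setdiff(names(survey_unit), c(names(census_unit), "y"))
survey_only

# All surveyed domains should also appear in the additional data
all(survey_unit$domain %in% census_unit$domain)
```

Matching names are only a first check; the variables also need the same definition and coding in both sources.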
Example data sets
Example data sets were generated for the practical exercises. They are described below and can be downloaded to replicate the examples.
Description of the example data set
The data used in these guidelines is derived from a 10% sample of the 2005 Colombian census that can be obtained from IPUMS. The variables are simplified and deliberately altered relative to the original extract. Furthermore, synthetic variables are added to the data set. The sole purpose of the data set is to explain concepts and demonstrate functions. The results cannot be interpreted in any way.
The idea is to use the example data to show general procedures in small area estimation, independent of country-specific circumstances. Data availability and preparation differ so much across countries that they cannot be covered in this work.
Five different data sets are used to conduct an exemplary analysis: a household survey (syntheticSurvey1.csv), a census with household information (syntheticCensus.csv), a survey at person level for the population in working age (between 15 and 74) (syntheticSurvey2.csv) and two aggregated data sets at different domain levels (auxiliaryAgeGeographic.csv and auxiliarySexSpatial.csv).
Data at household level (syntheticSurvey1.csv and syntheticCensus.csv)
The household survey is a random sample of the census data with 9155 observations and 9 variables. It contains information about the household income and some characteristics of the household head and the household. Furthermore, different geographical dimensions are available. The census has 1007572 observations and contains almost the same information but does not have any information about the household income.
Variables in survey and census:
- age of the household head (age),
- sex of the household head (sex),
- years of schooling of the household head (yrschool),
- employment status of the household head (classwkd),
- geographical area information (geolev2 and geolev1).
Variables only in survey:
- household equivalized income (eqIncome),
- electricity supply (electric),
- urban/rural status (urban).
    ################################################################################
    #
    # Data sets at household level
    #
    ################################################################################

    # Set working directory
    setwd("Add path")

    # Import sample and census at household level
    survey <- read.csv("syntheticSurvey1.csv")
    # The census csv was too large for the upload, thus it is available as RData file
    load("syntheticCensus.RData")

    # Survey -----------------------------------------------------------------------

    # Number of observations and variables
    dim(survey)

    # Variables in the survey
    names(survey)

    # First six rows
    head(survey)

    # Census -----------------------------------------------------------------------

    # Number of observations and variables
    dim(census)

    # Variables in the census
    names(census)

    # First six rows
    head(census)
Data for working population (syntheticSurvey2.csv)
The survey at person level has information about the working population, including the employment status and demographics. It has 21560 observations and 9 variables.
Variables in the survey:
- labor market activity (unemployed),
- geographical area information (geolev1),
- urban/rural status (urban),
- age (age),
- sex (sex),
- disability status (disabled),
- age groups (ageGroup1 and ageGroup2),
- sampling weight (weight).
    ################################################################################
    #
    # Data set for working population
    #
    ################################################################################

    # Set working directory
    setwd("Add path")

    # Import survey
    survey2 <- read.csv("syntheticSurvey2.csv")

    # Survey -----------------------------------------------------------------------

    # Number of observations and variables
    dim(survey2)

    # Variables in the survey
    names(survey2)

    # First six rows
    head(survey2)
Aggregated data at different disaggregation dimensions (auxiliaryAgeGeographic.csv and auxiliarySexSpatial.csv)
The aggregated data sets contain information about the class of worker and the educational attainment for two combined disaggregation dimensions: ageGroup2 by urban/rural (auxiliaryAgeGeographic.csv) and sex by geolev1 (auxiliarySexSpatial.csv). Both data sets contain the same variables, which describe the proportion of people with the following characteristics:
- Class of worker not applicable (classwk_niu),
- Self-employed worker (classwk_self_employed),
- Unknown class of worker (classwk_unknown),
- Unpaid worker (classwk_unpaid_worker),
- Salary worker (classwk_salary_worker),
- Primary school not completed (edattain_less_than_primary_completed),
- Primary school completed (edattain_primary_completed),
- Secondary school completed (edattain_secondary_completed),
- University completed (edattain_university_completed),
- Unknown educational attainment (edattain_unknown).
    ################################################################################
    #
    # Aggregated data at different disaggregation dimensions
    #
    ################################################################################

    # Set working directory
    setwd("Add path")

    # Import aggregated data sets
    auxiliaryAgeGeographic <- read.csv("auxiliaryAgeGeographic.csv")
    auxiliarySexSpatial <- read.csv("auxiliarySexSpatial.csv")

    # Auxiliary for dimensions age group and geographic location -------------------

    # Number of observations and variables
    dim(auxiliaryAgeGeographic)

    # Variables in the data set
    names(auxiliaryAgeGeographic)

    # First six rows
    head(auxiliaryAgeGeographic)

    # Auxiliary for dimensions sex and geolevel1 -----------------------------------

    # Number of observations and variables
    dim(auxiliarySexSpatial)

    # Variables in the data set
    names(auxiliarySexSpatial)

    # First six rows
    head(auxiliarySexSpatial)
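Since the variables are proportions within two groups of categories (class of worker and educational attainment), one plausible sanity check after import is that the proportions of each group add up to approximately one per domain. The sketch below is an assumption about a useful check, not part of the original scripts. It uses the values of the first two rows of auxiliaryAgeGeographic as printed by `head()` in these guidelines, so that it runs without the files. The tolerance `tol` is a hypothetical choice that allows for rounding in the printed values.

```r
# First two rows of auxiliaryAgeGeographic, copied from the printed output
aux_head <- data.frame(
  domain                               = c("15-19.rural", "20-24.rural"),
  classwk_niu                          = c(0.04476600, 0.03706738),
  classwk_self_employed                = c(0.1262039, 0.1535128),
  classwk_unknown                      = c(0.02371693, 0.02428678),
  classwk_unpaid_worker                = c(0.09226769, 0.02421396),
  classwk_salary_worker                = c(0.7130454, 0.7609190),
  edattain_less_than_primary_completed = c(0.3369659, 0.3266517),
  edattain_primary_completed           = c(0.5581732, 0.4485408),
  edattain_secondary_completed         = c(0.09767126, 0.20982395),
  edattain_university_completed        = c(0.0001130454, 0.0083201340),
  edattain_unknown                     = c(0.007076645, 0.006663390)
)

# Select the columns of each group by their name prefix
classwk_cols  <- grep("^classwk_", names(aux_head), value = TRUE)
edattain_cols <- grep("^edattain_", names(aux_head), value = TRUE)

# Row-wise sums of the proportions within each group
classwk_sum  <- rowSums(aux_head[, classwk_cols])
edattain_sum <- rowSums(aux_head[, edattain_cols])

# Hypothetical tolerance for rounding in the printed values
tol <- 1e-6
classwk_ok  <- all(abs(classwk_sum - 1) < tol)
edattain_ok <- all(abs(edattain_sum - 1) < tol)
c(classwk = classwk_ok, edattain = edattain_ok)
```

With the imported files, the same check could be applied to the full data frames by replacing `aux_head` with `auxiliaryAgeGeographic` or `auxiliarySexSpatial`.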
Example data sets
For the replication of the examples shown in these guidelines, the data can be downloaded here.
File | Modified |
---|---|
syntheticSurvey2.csv | Dec 29, 2020 by Ann-Kristin Kreutzmann |
syntheticSurvey1.csv | Dec 29, 2020 by Ann-Kristin Kreutzmann |
syntheticCensus.RData | Dec 29, 2020 by Ann-Kristin Kreutzmann |
auxiliarySexSpatial.csv | Dec 29, 2020 by Ann-Kristin Kreutzmann |
auxiliaryAgeGeographic.csv | Dec 29, 2020 by Ann-Kristin Kreutzmann |
Get data information in R
R code that helps to get first information about the data sets can be downloaded here.
Practical exercise
The practical exercise in these guidelines analyzes three indicators for SDGs 1, 7 and 8 with different input factors and estimation approaches. This part describes the example data used for each indicator. The examples are chosen such that the application can be transferred to a wide range of SDG indicators.
Goal: For the proper planning of social support schemes, it could be of interest to identify where the population below the national poverty line lives.
Indicator of interest: The proportion of the population living below the national poverty line. The proportion describes the fraction of the population whose income (or other welfare measure) is below the poverty line and has a value between 0 and 1.
Disaggregation dimension: Required disaggregation dimensions for the indicator 1.2.1 are sex and age. However, the example only follows a spatial disaggregation by the second administrative level due to the common application of poverty mapping. The number of categories (domains) is 433 in the example.
Data availability
Information about the household income is available in the survey syntheticSurvey1. The survey also contains variables that potentially explain the household income. Furthermore, a second data source, here the census (syntheticCensus), is available that does not contain the household income but the same explanatory variables as the survey. In both data sources, the second administrative level can be identified.
    # Set working directory
    setwd("Add path")

    # Import sample and census at household level
    survey <- read.csv("syntheticSurvey1.csv")
    # The census csv was too large for the upload, thus it is available as RData file
    load("syntheticCensus.RData")

    # Overview of the variables
    head(survey)
    head(census)
    > head(survey)
       eqIncome age sex yrschool              classwkd   geolev2 geolev1 electric urban
    1 11997.722  30   1        5    wage/salary worker 170020016  170054      yes urban
    2 24079.950  54   1        5    wage/salary worker 170070002  170013      yes urban
    3 11737.735  51   1        3 niu (not in universe) 170073006  170073      yes urban
    4 18713.431  25   1       13    wage/salary worker 170005049  170005      yes urban
    5  9296.933  50   1       17    wage/salary worker 170063001  170066      yes urban
    6 23142.577  59   0        0 niu (not in universe) 170005024  170005      yes urban
    > head(census)
      age sex yrschool               classwkd   geolev2 geolev1
    1  56   1       17 working on own account 170005001  170005
    2  45   1       17     wage/salary worker 170005001  170005
    3  47   1        5     wage/salary worker 170005001  170005
    4  69   1       17  niu (not in universe) 170005001  170005
    5  29   0        9        unknown/missing 170005001  170005
    6  45   1       17     wage/salary worker 170005001  170005
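To make the indicator concrete, the following sketch computes the proportion of households below a poverty line from the six survey rows printed above. The poverty line of 12000 is a purely hypothetical value chosen for illustration, the snippet ignores sampling weights, and it groups by geolev1 only to keep the toy example small. It illustrates the direct calculation, not the model-based estimation discussed in these guidelines.

```r
# Six rows of the survey as printed above (income and first admin level)
survey_head <- data.frame(
  eqIncome = c(11997.722, 24079.950, 11737.735, 18713.431, 9296.933, 23142.577),
  geolev1  = c(170054L, 170013L, 170073L, 170005L, 170066L, 170005L)
)

# Hypothetical poverty line, for illustration only
poverty_line <- 12000

# Indicator variable: 1 if the household is below the poverty line, else 0
survey_head$poor <- as.numeric(survey_head$eqIncome < poverty_line)

# Overall proportion below the poverty line (a value between 0 and 1)
hcr_overall <- mean(survey_head$poor)
hcr_overall

# Proportion and sample size per domain
hcr_by_domain <- tapply(survey_head$poor, survey_head$geolev1, mean)
n_by_domain   <- table(survey_head$geolev1)
hcr_by_domain
n_by_domain
```

With only one or two households per domain, the domain proportions are either 0 or 1, which illustrates why direct estimates based on very small samples are unreliable.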
Goal: To assess whether home schooling can work in rural and urban areas, it can be of interest to have information about access to electricity, which is a basic requirement for digital education.
Indicator of interest: The proportion of population with access to electricity. The proportion describes the fraction of the population with the characteristic of having access to electricity and has a value between 0 and 1.
Disaggregation dimension: While the indicator does not have a required disaggregation dimension, the geographical location expressed in the two categories urban and rural is used in the example.
Data availability
The variable describing a household's access to electricity is contained in the household survey (syntheticSurvey1). Furthermore, a variable identifying rural and urban households is available.
    # Set working directory
    setwd("Add path")

    # Import sample at household level
    survey <- read.csv("syntheticSurvey1.csv")

    # First overview of the data set
    head(survey)
    > head(survey)
       eqIncome age sex yrschool              classwkd   geolev2 geolev1 electric urban
    1 11997.722  30   1        5    wage/salary worker 170020016  170054      yes urban
    2 24079.950  54   1        5    wage/salary worker 170070002  170013      yes urban
    3 11737.735  51   1        3 niu (not in universe) 170073006  170073      yes urban
    4 18713.431  25   1       13    wage/salary worker 170005049  170005      yes urban
    5  9296.933  50   1       17    wage/salary worker 170063001  170066      yes urban
    6 23142.577  59   0        0 niu (not in universe) 170005024  170005      yes urban
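The proportion with access to electricity by urban/rural status can be sketched as a grouped mean of an indicator variable. The toy data below is hypothetical: the printed rows only show the values "yes" and "urban", so the values "no" and "rural" are assumptions about the coding, and sampling weights are ignored.

```r
# Hypothetical toy data; the codings "no" and "rural" are assumptions
toy_survey <- data.frame(
  electric = c("yes", "yes", "no", "yes", "no", "yes", "yes", "no"),
  urban    = c("urban", "urban", "urban", "urban",
               "rural", "rural", "rural", "rural")
)

# Indicator: TRUE if the household has access to electricity
has_electricity <- toy_survey$electric == "yes"

# Proportion with access per category of the disaggregation dimension
prop_electric <- tapply(has_electricity, toy_survey$urban, mean)
prop_electric
```

Because the disaggregation has only two categories, each domain typically contains many sampled households, in contrast to a fine spatial disaggregation.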
Goal: Employment is often a key against hunger and extreme poverty. Thus, the identification of groups without employment could be of interest in order to counteract their unemployment with specialized programs.
Indicator of interest: The unemployment rate, defined as the number of unemployed persons divided by the total number of persons in the working age population. The unemployment rate is a proportion describing the fraction of the labor force that is unemployed and has a value between 0 and 1. In the example, the working age is defined as 15 to 74.
Disaggregation dimension: The required disaggregation dimensions are sex, age, geographic location (urban/rural), and disability status. The example will consider the dimensions and show some limitations and challenges.
Data availability
Information about the employment status is available in the survey of the working population (syntheticSurvey2). The data further contains variables for the different disaggregation dimensions. The aggregated data sets contain variables that could help to explain the unemployment rate at different domain levels.
    # Set working directory
    setwd("Add path")

    # Import survey of the working population
    survey2 <- read.csv("syntheticSurvey2.csv")

    # Import aggregated data sets
    auxiliaryAgeGeographic <- read.csv("auxiliaryAgeGeographic.csv")
    auxiliarySexSpatial <- read.csv("auxiliarySexSpatial.csv")

    # First overview of data sets
    head(survey2)
    head(auxiliaryAgeGeographic)
    head(auxiliarySexSpatial)
    > head(survey2)
      unemployed geolev1 urban age    sex         disabled ageGroup1 ageGroup2   weight
    1          0  170005 urban  17   male no, not disabled     15-24     15-19 99.83459
    2          0  170005 urban  58   male    yes, disabled     45-64     55-59 99.83459
    3          0  170005 urban  15   male no, not disabled     15-24     15-19 99.83459
    4          1  170005 urban  33 female no, not disabled     25-44     30-34 99.83459
    5          0  170005 urban  34   male no, not disabled     25-44     30-34 99.83459
    6          1  170005 urban  30   male no, not disabled     25-44     30-34 99.83459
    > head(auxiliaryAgeGeographic)
           domain classwk_niu classwk_self_employed classwk_unknown classwk_unpaid_worker classwk_salary_worker
    1 15-19.rural  0.04476600             0.1262039      0.02371693            0.09226769             0.7130454
    2 20-24.rural  0.03706738             0.1535128      0.02428678            0.02421396             0.7609190
    3 25-29.rural  0.02094564             0.1835557      0.02432163            0.01586351             0.7553135
    4 30-34.rural  0.01368511             0.2085479      0.02479579            0.01252371             0.7404475
    5 35-39.rural  0.01093352             0.2292661      0.02178753            0.01117207             0.7268408
    6 40-44.rural  0.01009656             0.2524360      0.02259601            0.01181606             0.7030554
      edattain_less_than_primary_completed edattain_primary_completed edattain_secondary_completed edattain_university_completed edattain_unknown
    1                            0.3369659                  0.5581732                   0.09767126                  0.0001130454      0.007076645
    2                            0.3266517                  0.4485408                   0.20982395                  0.0083201340      0.006663390
    3                            0.3924313                  0.4100554                   0.17023323                  0.0208367365      0.006443416
    4                            0.4583640                  0.3925709                   0.11648794                  0.0261313925      0.006445743
    5                            0.5042343                  0.3693344                   0.08955550                  0.0296199109      0.007255884
    6                            0.5572726                  0.3349720                   0.07052158                  0.0296283233      0.007605485
    > head(auxiliarySexSpatial)
             domain classwk_niu classwk_self_employed classwk_unknown classwk_unpaid_worker classwk_salary_worker
    1 female.170005  0.01998601             0.1309242      0.05098013           0.009451625             0.7886580
    2   male.170005  0.01475410             0.2249606      0.02876725           0.005483005             0.7260350
    3 female.170008  0.06420981             0.1770292      0.03496872           0.014771271             0.7090210
    4   male.170008  0.07113928             0.3236331      0.02970074           0.009375234             0.5661517
    5 female.170011  0.01891617             0.1630182      0.02206017           0.006888095             0.7891173
    6   male.170011  0.01502477             0.2279818      0.01806746           0.003034830             0.7358912
      edattain_less_than_primary_completed edattain_primary_completed edattain_secondary_completed edattain_university_completed edattain_unknown
    1                            0.1566849                  0.2839548                    0.4182401                    0.13812005      0.003000158
    2                            0.3805409                  0.3500880                    0.2126424                    0.05398722      0.002741502
    3                            0.1140252                  0.2600799                    0.3934735                    0.23023589      0.002185545
    4                            0.2128178                  0.3372084                    0.3350709                    0.11257781      0.002325058
    5                            0.1033345                  0.3216923                    0.3652386                    0.20833877      0.001395883
    6                            0.2118720                  0.3944650                    0.2628273                    0.12954635      0.001289410
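To link the person-level survey to the aggregated data, the survey needs a domain identifier in the same format as the `domain` column of the auxiliary data, e.g., "female.170005" in auxiliarySexSpatial. The sketch below builds such a key for the six printed survey rows and computes a weighted unemployment rate per domain. Treating the weighted mean of `unemployed` as the direct estimate is an assumption for illustration, and the object names are hypothetical.

```r
# Six rows of survey2 as printed above (only the columns needed here)
survey2_head <- data.frame(
  unemployed = c(0, 0, 0, 1, 0, 1),
  geolev1    = rep(170005L, 6),
  sex        = c("male", "male", "male", "female", "male", "male"),
  weight     = rep(99.83459, 6)
)

# Domain key in the format used by auxiliarySexSpatial: "<sex>.<geolev1>"
survey2_head$domain <- paste(survey2_head$sex, survey2_head$geolev1, sep = ".")

# Weighted unemployment rate per domain (weighted mean of the 0/1 variable)
rate_by_domain <- sapply(
  split(survey2_head, survey2_head$domain),
  function(d) weighted.mean(d$unemployed, d$weight)
)
rate_by_domain

# Domains of auxiliarySexSpatial for geolev1 170005, as printed above
aux_domains <- c("female.170005", "male.170005")

# Every survey domain should have a match in the auxiliary data
all(names(rate_by_domain) %in% aux_domains)
```

The domain with a single sampled person yields a rate of exactly 1, which again shows how unstable direct estimates become when the sample within a domain is small.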
No interpretation of results
Please note that none of the results can be interpreted in any way. The data is used solely to explain the methods and how to conduct a study, not for a real analysis.