This section explains the data characteristics that matter for SAE models. Data sets differ in their source (survey vs. additional data) and in the level at which they are available (unit-level vs. area-level data).

Data sources - Survey versus additional data

A central idea of small area estimation models is to combine several data sources. One source is usually survey data. The other sources are typically assumed to be free of uncertainty and measurement error; they can be administrative or register data, or census data. Special models also allow a second data source that is subject to measurement error, e.g., a larger survey. More recently, big data sources have been used as the second data source. These are known to carry uncertainty as well, although they are often treated as if they were error-free.

Overview of potential additional data sources

Additional data source: examples
Census data
  • Full census
  • Microcensus (large-scale survey between two censuses)
Administrative data
  • Birth and death records
  • School records
  • Taxation
  • Customs
Survey data
  • Same or different surveys
Geospatial data and other alternative data sources
  • Mobile phone data
  • Satellite data

In general, good additional information is crucial for model-based estimation. Desirable characteristics include:

  • Good predictive power for the indicators of interest.
  • Available for the domains of interest.
  • Available for the same time period as the survey.

Area-level data

Area- or domain-level data contains information for each domain but not for the units themselves. Area-level survey information could be, e.g., reported direct estimates along with their variance estimates. Area-level auxiliary information could be administrative data that is published for specific areas or domains. Official data is often more readily available at an aggregated level since aggregation avoids confidentiality issues. When unit-level information is available, area-level data can usually be generated from it.
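
As a minimal sketch of how area-level survey data could be generated from unit-level data, the following R code aggregates a small toy data set to one row per domain. The data values and the names unit_data, direct_estimate and sampling_variance are hypothetical. The variance formula assumes simple random sampling within each domain and ignores sampling weights and finite population corrections, which a real survey design would usually require.

```r
# Toy unit-level survey (hypothetical values, not the example data sets)
unit_data <- data.frame(
  domain = c(1, 1, 1, 2, 2, 2, 2),
  y      = c(1500, 1000, 1250, 850, 540, 780, 630)
)

# Direct estimate per domain: the sample mean of the variable of interest
direct <- aggregate(y ~ domain, data = unit_data, FUN = mean)
names(direct)[2] <- "direct_estimate"

# Sample size and sample variance per domain (same domain order as above)
n_d  <- aggregate(y ~ domain, data = unit_data, FUN = length)$y
s2_d <- aggregate(y ~ domain, data = unit_data, FUN = var)$y

# Variance of the mean under simple random sampling (assumption): s^2 / n
direct$sampling_variance <- s2_d / n_d

# One row per domain: direct estimate and its estimated sampling variance
direct
```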

The following tables show a typical example of how the data must be structured to apply an area-level SAE model. The survey data (first table) contains information at the area level in the form of a direct estimate, which could be a mean or a total, and an estimate of the sampling error variance. Each row represents a domain, so the number of rows equals the number of domains. The additional data also only needs to be available at the area level. It has one row per domain, and the domains are the same as in the survey data. If not all domains were sampled, the additional data can have more domains than the survey data, but the domain level is still the same, e.g., municipalities or ethnicity groups. The auxiliary variables need to be available in the additional data source and should have the potential to predict the true unknown population parameters.

Survey area-level data, year 2000

Domain   Direct estimate   Sampling error variance
1        1500              105
2        1000              95
...      ...               ...
44       750               60
45       800               79

Additional area-level data, year 2000

Domain   Pred1   Pred2   Pred3
1        450     0.90    5
2        325     0.85    6
...      ...     ...     ...
44       170     0.55    4
45       125     0.50    3
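
The combination of the two area-level sources can be sketched in R as a merge on the domain identifier. The values below are illustrative only, and domain 3 is a hypothetical unsampled domain added to show how out-of-sample domains appear after the merge. The object names survey_area, aux_area and combined are assumptions made for this sketch.

```r
# Area-level survey data: one row per sampled domain (illustrative values)
survey_area <- data.frame(
  Domain            = c(1, 2, 44, 45),
  direct_estimate   = c(1500, 1000, 750, 800),
  sampling_variance = c(105, 95, 60, 79)
)

# Area-level additional data: one row per domain, including the
# hypothetical unsampled domain 3
aux_area <- data.frame(
  Domain = c(1, 2, 3, 44, 45),
  Pred1  = c(450, 325, 300, 170, 125),
  Pred2  = c(0.90, 0.85, 0.80, 0.55, 0.50),
  Pred3  = c(5, 6, 5, 4, 3)
)

# Keep every domain of the additional data; unsampled domains get NA
# for the direct estimate and its variance
combined <- merge(survey_area, aux_area, by = "Domain", all.y = TRUE)

# Domains without a direct estimate (out-of-sample domains)
out_of_sample <- combined$Domain[is.na(combined$direct_estimate)]

combined
```

Using all.y = TRUE keeps the unsampled domains in the combined data, so that a model fitted on the sampled domains could later be used to predict for them.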

Unit-level data

Unit-level data contains information for each unit in each domain. Survey data is often available at the unit level, where the units can be, e.g., individuals or households. Additional data sources are less likely to contain unit-level information due to confidentiality issues. Sources that may contain individual information are census data or specific registers; other administrative data and big data sources are less likely to do so. However, this differs across countries, and even when unit-level data exists, confidentiality restrictions may prevent its use.

Unit-level SAE models require that predictors have the same definition in the survey and in the other data sources. This means that the additional data sources need

  • to be available at the unit level, or at the aggregated level for the domains,
  • to be complete and cover the entire population,
  • to contain variables that are also available in the survey data and have the potential to explain the variable of interest.

The following tables show a typical example of how the data must be structured to apply a unit-level SAE model. The survey data (first table) is a random sample of a defined population, i.e., it contains information for some units in the population. A domain variable identifies the domain of interest for each unit. The variable of interest and the predictor variables are also in the data set. The additional unit-level data, e.g., a census, contains information for all units in the defined population. It also has a domain variable that identifies the same domains as in the survey data. The variable of interest is not available in the additional data source (otherwise, no estimation would be necessary). It further comprises three predictor variables, but only two of them correspond to variables in the survey data set. Thus, the availability of a predictor in the second data source is an additional restriction on variable selection.

Survey unit-level data, year 2000

Unit   Domain   Variable of interest   Pred1   Pred2   Pred3   Pred4
1      1        1500                   450     0       3       1
3      1        1000                   400     0       3       1
4      1        1250                   425     1       4       2
...    ...      ...                    ...     ...     ...     ...
200    45       850                    135     1       4       2
220    45       540                    120     0       2       2
225    45       780                    125     1       2       2

Additional unit-level data, year 2000

Unit   Domain   Pred1   Pred2   Pred6
1      1        450     0       5
2      1        430     1       6
3      1        425     1       6
...    ...      ...     ...     ...
223    45       135     1       4
224    45       120     0       4
225    45       125     1       3
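
The restriction on variable selection can be checked with a few lines of R. The sketch below uses small hypothetical data frames (survey_unit and census_unit) that mimic the column layout described above; the name y stands in for the variable of interest. Note that a shared column name is only a first check: whether the definitions actually agree in both sources still has to be verified by hand.

```r
# Toy unit-level survey: variable of interest y plus four predictors
survey_unit <- data.frame(
  Unit = c(1, 3, 4), Domain = c(1, 1, 1),
  y = c(1500, 1000, 1250),
  Pred1 = c(450, 400, 425), Pred2 = c(0, 0, 1),
  Pred3 = c(3, 3, 4), Pred4 = c(1, 1, 2)
)

# Toy additional data: no variable of interest, three predictors
census_unit <- data.frame(
  Unit = c(1, 2, 3), Domain = c(1, 1, 1),
  Pred1 = c(450, 430, 425), Pred2 = c(0, 1, 1), Pred6 = c(5, 6, 6)
)

# Candidate predictors: present in both sources, excluding identifiers
id_vars <- c("Unit", "Domain")
common_predictors <- setdiff(
  intersect(names(survey_unit), names(census_unit)), id_vars
)

# Survey predictors that cannot be used because the census lacks them
survey_only <- setdiff(names(survey_unit), c(names(census_unit), "y"))

# Every survey domain should also appear in the additional data
domains_covered <- all(unique(survey_unit$Domain) %in% census_unit$Domain)

common_predictors
survey_only
domains_covered
```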

                                                                                                                                                   

Example data sets

For the implementation of the practical exercises, example data sets are generated. The data sets are described in the following and can be downloaded on the right for the replication of the examples.

Description of the example data set

The data used in these guidelines is derived from a 10% sample of the Colombian census 2005 that can be obtained from IPUMS. The variables are simplified and falsified relative to the original extract. Furthermore, synthetic variables are added to the data set. The purpose of the data set is solely to explain concepts and present functions. The results cannot be interpreted in any way.

The idea is to use the example data to show general procedures in small area estimation independent of country-specific circumstances. Data availability and preparation differ so much across countries that they cannot be covered in this work.

Five different data sets are used to conduct an exemplary analysis: a household survey (syntheticSurvey1.csv), a census with household information (syntheticCensus.csv), a person-level survey of the working-age population (between 15 and 74) (syntheticSurvey2.csv), and two aggregated data sets at different domain levels (auxiliaryAgeGeographic.csv and auxiliarySexSpatial.csv).


Data at household level (syntheticSurvey1.csv and syntheticCensus.csv)

The household survey is a random sample of the census data with 9155 observations and 9 variables. It contains information about the household income and some characteristics of the household head and the household. Furthermore, different geographical dimensions are available. The census has 1007572 observations and contains almost the same information but does not have any information about the household income.

Variables in survey and census:

  • age of the household head (age),
  • sex of the household head (sex),
  • years of schooling of the household head (yrschool),
  • employment status of the household head (classwkd),
  • geographical area information (geolev2 and geolev1).

Variables only in survey:

  • household equivalized income (eqIncome),
  • electricity supply (electric),
  • urban/rural status (urban).


Information about household data
################################################################################
#
# Data sets at household level
#
################################################################################

# Set working directory
setwd("Add path")

# Import sample and census at household level
survey <- read.csv("syntheticSurvey1.csv")
# The census csv was too large for the upload, thus it is available as RData file
load('syntheticCensus.RData')

# Survey -----------------------------------------------------------------------
# Number of observations and variables
dim(survey)
# Variables in the survey
names(survey)
# First six rows
head(survey)

# Census -----------------------------------------------------------------------
# Number of observations and variables
dim(census)
# Variables in the census
names(census)
# First six rows
head(census)




Data for working population (syntheticSurvey2.csv)

The person-level survey has information about the working-age population, including the employment status and demographics. It has 21560 observations and 9 variables.

Variables in the survey:

  • labor market activity (unemployed),
  • geographical area information (geolev1),
  • urban/rural status (urban),
  • age (age),
  • sex (sex),
  • disability status (disabled),
  • age groups (ageGroup1 and ageGroup2),
  • sampling weight (weight).


Information about data of working population
################################################################################
#
# Data set for working population
#
################################################################################

# Set working directory
setwd("Add path")

# Import survey 
survey2 <- read.csv("syntheticSurvey2.csv")


# Survey -----------------------------------------------------------------------
# Number of observations and variables
dim(survey2)
# Variables in the survey
names(survey2)
# First six rows
head(survey2)




Aggregated data at different disaggregation dimensions (auxiliaryAgeGeographic.csv and auxiliarySexSpatial.csv)

The aggregated data sets contain information about the working class and the educational attainment for two combined disaggregation dimensions: ageGroup2 by urban/rural (auxiliaryAgeGeographic.csv) and sex by geolev1 (auxiliarySexSpatial.csv). Both data sets contain the same variables, which describe the proportion of people with the following characteristics:

  • Working class not applicable (classwk_niu),
  • Self-employed worker (classwk_self_employed),
  • Unknown working class (classwk_unknown),
  • Unpaid worker (classwk_unpaid_worker),
  • Salary worker (classwk_salary_worker),
  • Primary school not completed (edattain_less_than_primary_completed),
  • Primary school completed (edattain_primary_completed),
  • Secondary school completed (edattain_secondary_completed),
  • University completed (edattain_university_completed),
  • Unknown educational attainment (edattain_unknown).
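
To link the person-level survey to these aggregated data sets, the survey needs a domain identifier in the same format. The domain labels in the aggregated data appear to combine the two dimensions with a dot, for example 15-19.rural and female.170005. The following sketch builds such keys on a small hypothetical data frame (survey2_toy); the column names domainAgeGeo and domainSexSpatial and the assumption that the labels were built in exactly this way are not taken from the original code.

```r
# Toy person-level data with the disaggregation variables (hypothetical rows)
survey2_toy <- data.frame(
  unemployed = c(0, 0, 1, 0),
  urban      = c("urban", "rural", "urban", "rural"),
  sex        = c("male", "female", "female", "male"),
  geolev1    = c(170005, 170005, 170008, 170011),
  ageGroup2  = c("15-19", "55-59", "30-34", "15-19")
)

# Key for the age group by urban/rural dimension, e.g. "15-19.urban"
survey2_toy$domainAgeGeo <- paste(
  survey2_toy$ageGroup2, survey2_toy$urban, sep = "."
)

# Key for the sex by geolev1 dimension, e.g. "male.170005"
survey2_toy$domainSexSpatial <- paste(
  survey2_toy$sex, survey2_toy$geolev1, sep = "."
)

survey2_toy[, c("domainAgeGeo", "domainSexSpatial")]
```

With such keys in place, the aggregated variables could be attached to survey-based domain estimates with merge() on the domain column.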


Information about aggregated data
################################################################################
#
# Aggregated data at different disaggregation dimensions
#
################################################################################

# Set working directory
setwd("Add path")

# Import aggregated data sets
auxiliaryAgeGeographic <- read.csv("auxiliaryAgeGeographic.csv")
auxiliarySexSpatial <- read.csv("auxiliarySexSpatial.csv")


# Auxiliary for dimensions age group and geographic location -------------------
# Number of observations and variables
dim(auxiliaryAgeGeographic)
# Variables in the aggregated data set
names(auxiliaryAgeGeographic)
# First six rows
head(auxiliaryAgeGeographic)

# Auxiliary for dimensions sex and geolevel1 -----------------------------------
# Number of observations and variables
dim(auxiliarySexSpatial)
# Variables in the aggregated data set
names(auxiliarySexSpatial)
# First six rows
head(auxiliarySexSpatial)

Example data sets

For the replication of the examples shown in these guidelines, the data can be downloaded here.

Get data information in R

R code that helps to get first information about the data sets can be downloaded here.

File: dataInformation.R (modified Dec 29, 2020 by Ann-Kristin Kreutzmann)

Practical exercise

The practical exercise in these guidelines analyzes three indicators for SDGs 1, 7 and 8 with different input factors and estimation approaches. This part describes the example data used. The examples are chosen such that the application can be transferred to a wide range of SDG indicators.

1.1.1/1.2.1 Proportion of the population living below the international/national poverty line

R Code

Goal: For the proper planning of social support schemes, it could be of interest to target where the population below the national poverty line lives.

Indicator of interest: The proportion of the population living below the national poverty line. The proportion describes the fraction of the population with the characteristic of having, e.g., an income, below the poverty line and has a value between 0 and 1.
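
As a rough illustration of the indicator, the following sketch computes the share of households below a poverty line per domain. The toy data, the domain labels A and B, and the poverty line of 1000 are hypothetical. The calculation is unweighted and at household level, so it is only a simplified stand-in for a proper direct estimator of a population proportion.

```r
# Toy household data: equivalized income by domain (hypothetical values)
toy <- data.frame(
  geolev2  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  eqIncome = c(500, 1200, 3000, 800, 2500, 4000, 900, 3500)
)

# Hypothetical national poverty line
poverty_line <- 1000

# Indicator variable: TRUE if the household is below the poverty line
toy$poor <- toy$eqIncome < poverty_line

# Proportion below the poverty line per domain (a value between 0 and 1)
hcr <- aggregate(poor ~ geolev2, data = toy, FUN = mean)

hcr
```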

Disaggregation dimension: Required disaggregation dimensions for the indicator 1.2.1 are sex and age. However, the example only follows a spatial disaggregation by the second administrative level due to the common application of poverty mapping. The number of categories (domains) is 433 in the example.

Data availability

Information about the household income is available in the survey syntheticSurvey1. The survey also contains variables that potentially explain the household income. Furthermore, a second data source, here the census (syntheticCensus), is available that does not contain the household income but the same explanatory variables as the survey. In both data sources, the second administrative level can be identified.

Load data sets
# Set working directory
setwd("Add path")


# Import sample and census at household level
survey <- read.csv("syntheticSurvey1.csv")
# The census csv was too large for the upload, thus it is available as RData file
load("syntheticCensus.RData")

# Overview of the variables
head(survey)
head(census)
First lines of data sets
> head(survey)
   eqIncome age sex yrschool              classwkd   geolev2 geolev1 electric urban
1 11997.722  30   1        5    wage/salary worker 170020016  170054      yes urban
2 24079.950  54   1        5    wage/salary worker 170070002  170013      yes urban
3 11737.735  51   1        3 niu (not in universe) 170073006  170073      yes urban
4 18713.431  25   1       13    wage/salary worker 170005049  170005      yes urban
5  9296.933  50   1       17    wage/salary worker 170063001  170066      yes urban
6 23142.577  59   0        0 niu (not in universe) 170005024  170005      yes urban


> head(census)
  age sex yrschool               classwkd   geolev2 geolev1
1  56   1       17 working on own account 170005001  170005
2  45   1       17     wage/salary worker 170005001  170005
3  47   1        5     wage/salary worker 170005001  170005
4  69   1       17  niu (not in universe) 170005001  170005
5  29   0        9        unknown/missing 170005001  170005
6  45   1       17     wage/salary worker 170005001  170005
7.1.1 Proportion of population with access to electricity

R Code

Goal: To assess whether home schooling can work in rural and urban areas, it can be useful to know about access to electricity, which is a basic requirement for digital education.

Indicator of interest: The proportion of population with access to electricity. The proportion describes the fraction of the population with the characteristic of having access to electricity and has a value between 0 and 1.

Disaggregation dimension: While the indicator does not have a required disaggregation dimension, the geographical location expressed in the two categories urban and rural is used in the example.
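
A simple way to see what this disaggregation produces is an unweighted proportion per category. The sketch below uses a hypothetical toy data frame (toy_survey) with the same coding as the electric and urban variables of the household survey ("yes"/"no" and "urban"/"rural"); it is a household-level share, not necessarily the estimator used later in these guidelines.

```r
# Toy household survey (hypothetical rows)
toy_survey <- data.frame(
  electric = c("yes", "yes", "no", "yes", "no", "no", "yes", "yes"),
  urban    = c("urban", "urban", "urban", "urban",
               "rural", "rural", "rural", "rural")
)

# Share of households with access to electricity in each category
access <- tapply(toy_survey$electric == "yes", toy_survey$urban, mean)

access
```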

Data availability

The variable describing a household's access to electricity is contained in the household survey (syntheticSurvey1). Furthermore, a variable identifying rural and urban households is available.

Load data set
# Set working directory
setwd("Add path")

# Import household survey
survey <- read.csv("syntheticSurvey1.csv")

# First overview of the data set
head(survey)
First lines of data set
> head(survey)
   eqIncome age sex yrschool              classwkd   geolev2 geolev1 electric urban
1 11997.722  30   1        5    wage/salary worker 170020016  170054      yes urban
2 24079.950  54   1        5    wage/salary worker 170070002  170013      yes urban
3 11737.735  51   1        3 niu (not in universe) 170073006  170073      yes urban
4 18713.431  25   1       13    wage/salary worker 170005049  170005      yes urban
5  9296.933  50   1       17    wage/salary worker 170063001  170066      yes urban
6 23142.577  59   0        0 niu (not in universe) 170005024  170005      yes urban
8.5.2 Unemployment rate

R Code

Goal: Employment is often a key against hunger and extreme poverty. Thus, the identification of groups without employment could be of interest in order to counteract their unemployment with specialized programs.

Indicator of interest: The unemployment rate, defined as the number of unemployed persons divided by the total number of persons in the working-age population. The unemployment rate is a proportion describing the fraction of the labor force with the characteristic of being unemployed and has a value between 0 and 1. In the example, the working age is defined as 15 to 74.

Disaggregation dimension: The required disaggregation dimensions are sex, age, geographic location (urban/rural), and disability status. The example will consider the dimensions and show some limitations and challenges.
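
Because the person-level survey carries sampling weights, a group-wise rate would normally be computed as a weighted proportion. The following sketch shows this for the dimension sex on a hypothetical toy data frame (toy_survey2); the helper weighted_rate is an assumption made for illustration and not a function from these guidelines.

```r
# Toy person-level survey with sampling weights (hypothetical values)
toy_survey2 <- data.frame(
  unemployed = c(0, 1, 0, 1, 0, 0),
  sex        = c("male", "male", "male", "female", "female", "female"),
  weight     = c(100, 50, 50, 80, 60, 60)
)

# Weighted proportion: weighted number of unemployed persons divided by
# the weighted number of persons in the group
weighted_rate <- function(d) {
  sum(d$weight * d$unemployed) / sum(d$weight)
}

# Apply the helper to each category of the disaggregation dimension
rate_by_sex <- sapply(split(toy_survey2, toy_survey2$sex), weighted_rate)

rate_by_sex
```

The same pattern extends to the other dimensions by splitting on a different variable, although small group sizes quickly make such direct estimates unreliable.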

Data availability

Information about the employment status is available in the survey of the working population (syntheticSurvey2). The data further contains variables for the different disaggregation dimensions. The aggregated data sets contain variables that could help to explain the unemployment rate at different domain levels.

Load data sets
# Set working directory
setwd("Add path")

# Import person-level survey of the working population
survey2 <- read.csv("syntheticSurvey2.csv")
# Import aggregated data sets
auxiliaryAgeGeographic <- read.csv("auxiliaryAgeGeographic.csv")
auxiliarySexSpatial <- read.csv("auxiliarySexSpatial.csv")


# First overview of data sets
head(survey2)
head(auxiliaryAgeGeographic)
head(auxiliarySexSpatial)
First lines of data sets
> head(survey2)
  unemployed geolev1 urban age    sex         disabled ageGroup1 ageGroup2   weight
1          0  170005 urban  17   male no, not disabled     15-24     15-19 99.83459
2          0  170005 urban  58   male    yes, disabled     45-64     55-59 99.83459
3          0  170005 urban  15   male no, not disabled     15-24     15-19 99.83459
4          1  170005 urban  33 female no, not disabled     25-44     30-34 99.83459
5          0  170005 urban  34   male no, not disabled     25-44     30-34 99.83459
6          1  170005 urban  30   male no, not disabled     25-44     30-34 99.83459

> head(auxiliaryAgeGeographic)
       domain classwk_niu classwk_self_employed classwk_unknown classwk_unpaid_worker classwk_salary_worker
1 15-19.rural  0.04476600             0.1262039      0.02371693            0.09226769             0.7130454
2 20-24.rural  0.03706738             0.1535128      0.02428678            0.02421396             0.7609190
3 25-29.rural  0.02094564             0.1835557      0.02432163            0.01586351             0.7553135
4 30-34.rural  0.01368511             0.2085479      0.02479579            0.01252371             0.7404475
5 35-39.rural  0.01093352             0.2292661      0.02178753            0.01117207             0.7268408
6 40-44.rural  0.01009656             0.2524360      0.02259601            0.01181606             0.7030554
  edattain_less_than_primary_completed edattain_primary_completed edattain_secondary_completed edattain_university_completed edattain_unknown
1                            0.3369659                  0.5581732                   0.09767126                  0.0001130454      0.007076645
2                            0.3266517                  0.4485408                   0.20982395                  0.0083201340      0.006663390
3                            0.3924313                  0.4100554                   0.17023323                  0.0208367365      0.006443416
4                            0.4583640                  0.3925709                   0.11648794                  0.0261313925      0.006445743
5                            0.5042343                  0.3693344                   0.08955550                  0.0296199109      0.007255884
6                            0.5572726                  0.3349720                   0.07052158                  0.0296283233      0.007605485

> head(auxiliarySexSpatial)
         domain classwk_niu classwk_self_employed classwk_unknown classwk_unpaid_worker classwk_salary_worker
1 female.170005  0.01998601             0.1309242      0.05098013           0.009451625             0.7886580
2   male.170005  0.01475410             0.2249606      0.02876725           0.005483005             0.7260350
3 female.170008  0.06420981             0.1770292      0.03496872           0.014771271             0.7090210
4   male.170008  0.07113928             0.3236331      0.02970074           0.009375234             0.5661517
5 female.170011  0.01891617             0.1630182      0.02206017           0.006888095             0.7891173
6   male.170011  0.01502477             0.2279818      0.01806746           0.003034830             0.7358912
  edattain_less_than_primary_completed edattain_primary_completed edattain_secondary_completed edattain_university_completed edattain_unknown
1                            0.1566849                  0.2839548                    0.4182401                    0.13812005      0.003000158
2                            0.3805409                  0.3500880                    0.2126424                    0.05398722      0.002741502
3                            0.1140252                  0.2600799                    0.3934735                    0.23023589      0.002185545
4                            0.2128178                  0.3372084                    0.3350709                    0.11257781      0.002325058
5                            0.1033345                  0.3216923                    0.3652386                    0.20833877      0.001395883
6                            0.2118720                  0.3944650                    0.2628273                    0.12954635      0.001289410





No interpretation of results

Please note that none of the results can be interpreted in any way. The data is solely used to explain the methods and how to conduct a study, not for a real analysis.
