Practical exercise

The practical exercise in these guidelines analyses three indicators for SDGs 1, 7 and 8 with different input factors and estimation approaches. In this part, the evaluation and benchmarking of the produced estimators is also described. The examples are chosen such that the application can be transferred to a wide range of SDG indicators. Panel 

borderColor  #E5243B 

titleColor  white 

borderWidth  3 

titleBGColor  #E5243B 

borderStyle  solid 

title  1.1.1/1.2.1 Proportion of the population living below the international/national poverty line 


Goal: For the proper planning of social support schemes, it could be of interest to target where the population below the national poverty line lives. Indicator of interest: The proportion of the population living below the national poverty line. The proportion describes the fraction of the population with a characteristic, e.g., an income below the poverty line, and takes a value between 0 and 1. Disaggregation dimension: Required disaggregation dimensions for indicator 1.2.1 are sex and age. However, the example only follows a spatial disaggregation by the second administrative level due to the common application of poverty mapping. The number of categories (domains) is 433 in the example. Expand 

 Information about the household income is available in the survey syntheticSurvey1. The survey also contains variables that potentially explain the household income. Furthermore, a second data source, the census (syntheticCensus), is available; it does not contain the household income but contains the same explanatory variables as the survey. In both data sources, the domains defined by the second administrative level can be identified. Code Block 

language  text 

theme  RDark 

firstline  12 

title  Load data sets 

linenumbers  true 

collapse  true 

 # Set working directory
setwd("Add path")
# Import sample and census at household level
survey <- read.csv("syntheticSurvey1.csv")
# The census csv was too large for the upload, thus it is available as RData file
load("syntheticCensus.RData")
# Overview of the variables
head(survey)
head(census)

Code Block 

language  text 

title  First lines of data sets 

collapse  true 

 > head(survey)
eqIncome age sex yrschool classwkd geolev2 geolev1 electric urban
1 11997.722 30 1 5 wage/salary worker 170020016 170054 yes urban
2 24079.950 54 1 5 wage/salary worker 170070002 170013 yes urban
3 11737.735 51 1 3 niu (not in universe) 170073006 170073 yes urban
4 18713.431 25 1 13 wage/salary worker 170005049 170005 yes urban
5 9296.933 50 1 17 wage/salary worker 170063001 170066 yes urban
6 23142.577 59 0 0 niu (not in universe) 170005024 170005 yes urban
> head(census)
age sex yrschool classwkd geolev2 geolev1
1 56 1 17 working on own account 170005001 170005
2 45 1 17 wage/salary worker 170005001 170005
3 47 1 5 wage/salary worker 170005001 170005
4 69 1 17 niu (not in universe) 170005001 170005
5 29 0 9 unknown/missing 170005001 170005
6 45 1 17 wage/salary worker 170005001 170005 

Expand 

 The indicator of interest is a proportion (Mean or total? → No). Note that, technically, a proportion is the mean of a binary variable. This indicator, however, is calculated as a ratio since the number of households is also estimated based on survey data. The flowchart represents model choices related to linear and nonlinear indicators (see SAE approaches). The sample only has information on 152 of the 433 administrative regions, leaving 281 areas unsampled. The sample size in the administrative regions varies between 31 and 1024. Thus, there are regions with small sample sizes and unsampled regions (Small domains? → Yes). The available data is a household survey with the variable of interest, and a second data source at the unit level, the census, is also available (Unit-level data? → Yes). This leads to SAE method group U2 in the figure above, which means that suitable approaches for this example are the ELL and the EBP. In the example, the EBP is used. 
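As a ratio of survey-estimated quantities, the indicator for a domain d can be sketched as follows (the notation is illustrative and not taken from the source):

```latex
\mathrm{HCR}_d \;=\; \frac{\sum_{i \in d} w_i \, I(y_i < z)}{\sum_{i \in d} w_i},
```

where \(y_i\) is the equivalized income of household \(i\), \(w_i\) its survey weight, \(z\) the poverty line, and \(I(\cdot)\) the indicator function. Both numerator and denominator are estimated from the sample, which is why the proportion is treated as a ratio.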
Expand 

title  Analysis & Adaptation 

 To estimate the regional distribution of the proportion of the population living below a poverty line, the specification based on the input factors leads to the EBP. To implement the analysis, a software package needs to be chosen. For this example, the R packages emdi and maptools are used. Please note that the proportion of the population living below a poverty line is defined as the head count ratio (HCR) in the package emdi. Thus, the proportion is referred to as the HCR in the following. Code Block 

theme  RDark 

firstline  8 

title  Load R packages 

linenumbers  true 

collapse  true 

 # Load packages
library(emdi)
library(maptools)

The EBP approach is based on a linear mixed regression model that links the variable of interest, the equivalized income, with covariates and categorical predictors, e.g., age and working class. To use the categorical variables in a statistical model, they must be converted to factor variables in R. This lets the software know that it deals with categorical variables instead of numerical variables. Code Block 

theme  RDark 

firstline  25 

title  Data preparation 

linenumbers  true 

collapse  true 

 # Data preparation 
# Convert categorical variables to factor variables
census$classwkd <- factor(census$classwkd)
census$sex <- factor(census$sex, levels = c(0,1), labels = c("m","f"))
survey$classwkd <- factor(survey$classwkd)
survey$urban <- factor(survey$urban)
survey$electric <- factor(survey$electric)
survey$sex <- factor(survey$sex, levels = c(0,1), labels = c("m","f")) 
The aim of the next step is to fit a suitable model. In the example, two different models are compared. The first model (povEBP_reduced) uses "age", "sex" and "yrschool" as independent variables, whereas the second model additionally uses the variable "classwkd" as an independent variable. It is expected, based on theoretical knowledge, that these covariates are associated with the equivalized household income. The default option for the poverty line in package emdi is 60% of the median of the dependent variable, which is a common definition in the European Union. However, adding an argument for the threshold allows any required poverty line to be used. Code Block 

theme  RDark 

firstline  37 

title  Model building 

linenumbers  true 

collapse  true 

 # Model building 
# Linear mixed model without transformation
povEBP_reduced <- ebp(fixed = eqIncome ~ age + sex + yrschool,
  pop_data = census, pop_domains = "geolev2", smp_data = survey,
  smp_domains = "geolev2", na.rm = TRUE, transformation = 'no', L = 1)
povEBP_full <- ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd,
  pop_data = census, pop_domains = "geolev2", smp_data = survey,
  smp_domains = "geolev2", na.rm = TRUE, transformation = 'no', L = 1)
# Check ICC and R2
summary(povEBP_reduced)
summary(povEBP_full)
# Check coefficients
summary(povEBP_full$model)

With the help of the summary function, sample sizes in the sample and the census are displayed, as well as some model diagnostics, e.g., the R² and the ICC. While there are multiple methods and criteria that can be used to compare different models, the R² is used for this example. Based on the R², the comparison shows that the two models have an almost identical goodness of fit. Because of the slight advantage of the full model over the reduced model, it is used going forward. The ICC indicates that there is unexplained heterogeneity between areas and that the use of a random effects model is suitable. Code Block 

language  text 

title  Summary output 

collapse  true 

 > summary(povEBP_reduced)
Empirical Best Prediction
Call:
ebp(fixed = eqIncome ~ age + sex + yrschool, pop_data = census,
pop_domains = "geolev2", smp_data = survey, smp_domains = "geolev2",
L = 1, transformation = "no", na.rm = TRUE)
Out-of-sample domains: 281
In-sample domains: 152
Sample sizes:
Units in sample: 9155
Units in population: 1007572
Min. 1st Qu. Median Mean 3rd Qu. Max.
Sample_domains 31 35 41 60.23026 50.25 1024
Population_domains 521 1302 1651 2326.95612 2337.00 70499
Explanatory measures:
Marginal_R2 Conditional_R2
0.3880963 0.4395183
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Error 1.1929905 5.365922 NA NA
Random_effect 0.0382906 2.768383 0.9928648 0.6540751
ICC: 0.08403619
Transformation: No transformation
> summary(povEBP_full)
Empirical Best Prediction
Call:
ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd, pop_data = census,
pop_domains = "geolev2", smp_data = survey, smp_domains = "geolev2",
L = 1, transformation = "no", na.rm = TRUE)
Out-of-sample domains: 281
In-sample domains: 152
Sample sizes:
Units in sample: 9155
Units in population: 1007572
Min. 1st Qu. Median Mean 3rd Qu. Max.
Sample_domains 31 35 41 60.23026 50.25 1024
Population_domains 521 1302 1651 2326.95612 2337.00 70499
Explanatory measures:
Marginal_R2 Conditional_R2
0.3892696 0.4407805
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Error 1.18494095 5.355828 NA NA
Random_effect 0.03667253 2.778514 0.992828 0.6497688
ICC: 0.08434322
Transformation: No transformation 
After an appropriate model has been chosen, the model assumptions must be checked. As before, the output of the summary function provides information about the residuals. When looking at the skewness and kurtosis, it is obvious that the normality assumption for the unit-level error terms (Error) is not fulfilled in this example. The skewness of a normal distribution is 0 and its kurtosis is 3. The normality assumption can also be checked with statistical tests like the Shapiro-Wilk test. For the random effects, the test suggests that the hypothesis of normality is not rejected. However, the results are not returned for the unit-level error terms since the corresponding function in R is only defined for sample sizes between 3 and 5,000. If a test of the unit-level errors is of interest, another test like the Kolmogorov-Smirnov test could be chosen. Another option to check normality is to use the plot function on the model object, which returns different plots, e.g., a Q-Q plot of the unit- and area-level residuals. Code Block 

theme  RDark 

firstline  57 

title  Model assumptions 

linenumbers  true 

collapse  true 

 # Model assumptions 
# Check skewness, kurtosis, shapiro_w and shapiro_p test
summary(povEBP_full)
# Check plots
plot(povEBP_full) 
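As mentioned above, a Kolmogorov-Smirnov test could be applied to the unit-level error terms. A minimal sketch, assuming the fitted model is accessible as povEBP_full$model (as in the summary call earlier) and standardizing the residuals before testing against the standard normal:

```r
# Kolmogorov-Smirnov test of the standardized unit-level residuals
# against a standard normal distribution (sketch, not part of the
# original workflow)
res_std <- scale(residuals(povEBP_full$model))
ks.test(res_std, "pnorm")
```

Note that because the mean and variance are estimated from the data, the resulting p-value is only approximate; a Lilliefors-type correction would be more rigorous.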
The plot underlines what the skewness and kurtosis estimates also indicate, namely that the normality assumption for the unit-level error is not fulfilled.
To stay within the same model class and to meet the normality assumption of the EBP method, the log and the data-driven Box-Cox transformations are applied to the dependent variable of the model. As before, the summary and plot functions are used to get information about the distribution of the residuals. Code Block 

theme  RDark 

firstline  67 

title  Violated model assumptions 

linenumbers  true 

collapse  true 

 # Violated model assumptions 
# Linear mixed model with Log transformation
povEBPLog <- ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd,
  pop_data = census, pop_domains = "geolev2", smp_data = survey,
  smp_domains = "geolev2", na.rm = TRUE, transformation = 'log', L = 1)
# Linear mixed model with Box-Cox transformation
povEBPBoxCox <- ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd,
  pop_data = census, pop_domains = "geolev2", smp_data = survey,
  smp_domains = "geolev2", na.rm = TRUE, transformation = 'box.cox', L = 1)
# Compare skewness, kurtosis, shapiro_w and shapiro_p test
summary(povEBP_full)
summary(povEBPLog)
summary(povEBPBoxCox)
# Compare plots
plot(povEBP_full)
plot(povEBPLog)
plot(povEBPBoxCox) 
The estimated skewness and kurtosis of the residuals obtained from the newly fitted models (with log and Box-Cox transformations) indicate that the normality assumption of the EBP method might not be violated anymore. For both models, the skewness of the residuals is close to 0 and the kurtosis is close to 3 (as expected for the normal distribution). Code Block 

language  text 

title  Output summary 

collapse  true 

 > summary(povEBPLog)
Empirical Best Prediction
Call:
ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd, pop_data = census,
pop_domains = "geolev2", smp_data = survey, smp_domains = "geolev2",
L = 1, transformation = "log", na.rm = TRUE)
Out-of-sample domains: 281
In-sample domains: 152
Sample sizes:
Units in sample: 9155
Units in population: 1007572
Min. 1st Qu. Median Mean 3rd Qu. Max.
Sample_domains 31 35 41 60.23026 50.25 1024
Population_domains 521 1302 1651 2326.95612 2337.00 70499
Explanatory measures:
Marginal_R2 Conditional_R2
0.3895767 0.4776992
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Error 0.07490357 2.730150 NA NA
Random_effect 0.28543231 3.246924 0.9921632 0.5728253
ICC: 0.1443628
Transformation:
Transformation Shift_parameter
log 0
> summary(povEBPBoxCox)
Empirical Best Prediction
Call:
ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd, pop_data = census,
pop_domains = "geolev2", smp_data = survey, smp_domains = "geolev2",
L = 1, transformation = "box.cox", na.rm = TRUE)
Out-of-sample domains: 281
In-sample domains: 152
Sample sizes:
Units in sample: 9155
Units in population: 1007572
Min. 1st Qu. Median Mean 3rd Qu. Max.
Sample_domains 31 35 41 60.23026 50.25 1024
Population_domains 521 1302 1651 2326.95612 2337.00 70499
Explanatory measures:
Marginal_R2 Conditional_R2
0.3774885 0.4738446
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Error 0.1368913 2.888930 NA NA
Random_effect 0.3890020 3.410824 0.988862 0.2696318
ICC: 0.1547861
Transformation:
Transformation Method Optimal_lambda Shift_parameter
box.cox reml 0.2009575 0 
Checking the Q-Q plot of the newly fitted models' residuals confirms the prior findings. The next plot shows the results from the log model.
It is now up to the practitioner to decide whether the model with the log or the Box-Cox transformation should be used for producing the final estimates of the HCR, because both seem to fulfill the model assumptions quite well. In the next step, the final model (with log transformation) is fitted. The parameter L controls the number of Monte Carlo repetitions used in the EBP method. In the example, L is set to 100. Molina and Rao (2010) state that L = 50 gives fairly accurate results, but for practical applications they propose values larger than 200. Code Block 

theme  RDark 

firstline  92 

title  Fit final model 

linenumbers  true 

collapse  true 

 # Fit final model 
povEBP_final <- ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd,
  pop_data = census, pop_domains = "geolev2", smp_data = survey,
  smp_domains = "geolev2", na.rm = TRUE, transformation = 'log', L = 100)
# Results per area 
# Table
estimators(povEBP_final, indicator = 'Head_Count', MSE = FALSE, CV = FALSE)
# Map
# Import shape file
geo2Shape <- readShapePoly("Add shapefile", delete_null_obj = TRUE)
map_plot(object = povEBP_final, MSE = FALSE, CV = FALSE, map_obj = geo2Shape,
  indicator = "Head_Count", map_dom_id = "GEOLEVEL2") 
The estimated HCR per area can be printed in a table and plotted on a map. For the latter, a shape file needs to be available that contains the same regions for which the indicator is estimated. Unfortunately, the shape file used in the example is too large to upload, so it is not available to the user. In the next figure, the produced map is plotted. Plotting the spatial distribution of the HCR enables policy makers to target political actions. 
Evaluation & Benchmarking

To evaluate the domain indicators, the model is fitted and the MSE and the CV are estimated. The estimation of the MSE and CV is triggered by setting the parameter MSE to "TRUE". In emdi, the MSE and CV are estimated using bootstrap methods. The parameter B controls the number of bootstrap iterations. It is advisable to set B to a minimum of 100 in order to obtain reliable MSE estimates. Code Block 

theme  RDark 

firstline  111 

title  Precision, accuracy and reliability 

linenumbers  true 

collapse  true 

 # Precision, accuracy and reliability 
# Fit model and bootstrap the MSE
povEBP_final <- ebp(fixed = eqIncome ~ age + sex + yrschool + classwkd,
  pop_data = census, pop_domains = "geolev2", smp_data = survey,
  smp_domains = "geolev2", na.rm = TRUE, transformation = 'log',
  L = 100, MSE = TRUE, B = 100)
# Print disaggregated indicators with MSE and CV
head(estimators(povEBP_final, indicator = 'Head_Count', MSE = TRUE, CV = TRUE))
# Plot disaggregated CV
map_plot(object = povEBP_final, MSE = FALSE, CV = TRUE, map_obj = geo2Shape,
indicator = "Head_Count", map_dom_id = "GEOLEVEL2") 
The estimated regional indicators (the HCR in this example) with their MSE and CV can be obtained in the form of a table. Code Block 

language  text 

title  MSE and CV per area 

collapse  true 

 > head(estimators(povEBP_final, indicator = 'Head_Count', MSE = TRUE, CV = TRUE))
Domain Head_Count Head_Count_MSE Head_Count_CV
1 170005001 0.07658444 8.710772e-05 0.12186747
2 170005002 0.21080510 3.089281e-04 0.08337725
3 170005003 0.14358329 4.427159e-03 0.46340280
4 170005004 0.13571429 4.966465e-03 0.51927584
5 170005005 0.15856915 4.544493e-03 0.42513223
6 170005006 0.05288569 1.137584e-03 0.63775437 
Additionally, the MSE and the CV can be plotted using the function map_plot. The results show that for some areas the estimated CV is considerably higher than 20% (a threshold that is often used to argue that an estimator is reliable). This is a phenomenon often observed when working with ratio indicators. In the formula for the CV (given above), the root MSE is divided by the estimator (the estimated HCR). Hence, when the HCR takes on small values, the CV easily gets large.
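The relationship described above can be sketched as follows (notation illustrative):

```latex
\mathrm{CV}_d \;=\; \frac{\sqrt{\widehat{\mathrm{MSE}}_d}}{\widehat{\mathrm{HCR}}_d},
```

so for a given estimated MSE, a small estimated HCR in the denominator inflates the CV.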
To assess whether the model-based HCR estimates are more reliable than the direct estimates, various comparisons can be made. For sampled domains, the direct estimates and their CVs can be compared to the model-based estimates. To this end, the direct estimates together with their variances are computed. Afterwards, the CVs of the direct estimates are compared to the CVs of the model-based estimates. Code Block 

theme  RDark 

firstline  127 

title  Comparison with direct estimates 

linenumbers  true 

collapse  true 

 # Comparison with direct estimates 
# Estimate direct estimators for sampled domains
povDirect <- direct(y = "eqIncome", smp_data = survey,
  smp_domains = "geolev2", var = TRUE, na.rm = TRUE, B = 100)
# Compare the CV between direct and model-based estimates
compare_plot(povEBP_final, povDirect, indicator = 'Head_Count', MSE = FALSE,
  CV = TRUE) 
The first plot shows a boxplot of the CVs of the HCR over the sampled areas for the model-based and direct estimates. The plot clearly indicates that the model-based estimates are considerably more reliable than the direct estimates. Most of the model-based estimates have a CV of under 25%. The second plot presents the estimated CVs ordered by sample size (from large to small). As expected, for large areas the estimated CVs are comparable. With decreasing sample size, the advantage of the model-based method, in terms of lower CVs, becomes visible.
Additionally, the model-based estimated HCR can be compared to its direct counterpart. As expected, they are not identical but there is a strong linear relationship between them.
Before communicating the estimated indicators to policy makers, it can be desirable to benchmark the small area estimates to a direct national estimate. For the direct national estimate, the whole sample is used. Afterwards, the popular but simple ratio adjustment method is applied to each model-based HCR estimate. Code Block 

theme  RDark 

firstline  138 

title  Benchmarking 

linenumbers  true 

collapse  true 

 # Benchmarking 
# Estimate national direct estimate using sample data
survey$national <- 1
direct_national <- direct(y = "eqIncome", smp_data = survey,
  smp_domains = "national", var = FALSE, na.rm = TRUE)
hcr_national <- direct_national$ind$Head_Count
# Model-based estimates
hcr_domain_model <- povEBP_final$ind$Head_Count
# Population share per domain (counts of pop_domains_vec), used as weights
domain_sample_size_proportion <- unname(
  table(povEBP_final$framework$pop_domains_vec)) /
  sum(unname(table(povEBP_final$framework$pop_domains_vec)))
# Benchmarked model-based estimates
hcr_domain_model_bench <- hcr_domain_model * (
  hcr_national / sum(domain_sample_size_proportion * hcr_domain_model))
df_benchmarked <- data.frame(hcr_domain_model, hcr_domain_model_bench)
colnames(df_benchmarked) <- c("Model based HCR", "Benchmarked model based HCR")
# Print results
head(df_benchmarked) 
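The ratio adjustment computed in the code above can be summarized as follows (notation illustrative):

```latex
\widehat{\mathrm{HCR}}^{\text{bench}}_d
  \;=\; \widehat{\mathrm{HCR}}_d \cdot
  \frac{\widehat{\mathrm{HCR}}^{\text{nat}}}{\sum_{d} \gamma_d \,\widehat{\mathrm{HCR}}_d},
\qquad \gamma_d = \frac{N_d}{N},
```

where \(\gamma_d\) is the population share of domain \(d\). Because the correction factor does not depend on \(d\), every domain estimate is scaled by the same constant.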
The results show that the benchmarked model-based HCR estimates are all shrunk by the same factor of 0.88. Code Block 

language  text 

title  Benchmarked estimators 

collapse  true 

 > head(df_benchmarked)
Model based HCR Benchmarked model based HCR
1 0.07658444 0.06754125
2 0.21080510 0.18591299
3 0.14358329 0.12662880
4 0.13571429 0.11968898
5 0.15856915 0.13984511
6 0.05288569 0.04664089 

Panel 

borderColor  #FCC30B 

titleColor  white 

borderWidth  3 

titleBGColor  #FCC30B 

borderStyle  solid 

title  7.1.1 Proportion of population with access to electricity 

 R Code Expand 

 Goal: In order to get an idea whether home schooling can work in rural and urban areas, it can be of interest to have information about access to electricity, which is a basic requirement for digital education. Indicator of interest: The proportion of the population with access to electricity. The proportion describes the fraction of the population with the characteristic of having access to electricity and takes a value between 0 and 1. Disaggregation dimension: While the indicator does not have a required disaggregation dimension, the geographic location expressed in the two categories urban and rural is used in the example. 
Expand 

 The variable describing households' access to electricity is contained in the household survey (syntheticSurvey1). Furthermore, a variable identifying rural and urban households is available. Code Block 

theme  RDark 

firstline  9 

title  Load data set 

linenumbers  true 

collapse  true 

 # Set working directory
setwd("Add path")
# Import sample and census at household level
survey <- read.csv("syntheticSurvey1.csv")
# First overview of data sets
head(survey)

Code Block 

language  text 

title  First lines of data set 

collapse  true 

 > head(survey)
eqIncome age sex yrschool classwkd geolev2 geolev1 electric urban
1 11997.722 30 1 5 wage/salary worker 170020016 170054 yes urban
2 24079.950 54 1 5 wage/salary worker 170070002 170013 yes urban
3 11737.735 51 1 3 niu (not in universe) 170073006 170073 yes urban
4 18713.431 25 1 13 wage/salary worker 170005049 170005 yes urban
5 9296.933 50 1 17 wage/salary worker 170063001 170066 yes urban
6 23142.577 59 0 0 niu (not in universe) 170005024 170005 yes urban


Expand 

 The indicator of interest is a proportion (Mean or total? → No). This indicator is calculated as a ratio since the number of households is also estimated based on survey data. The goal is to distinguish between urban and rural households. The sample size for the two categories is 3,125 for rural and 6,030 for urban (Small domains? → No). This leads to SAE method group D, which suggests that direct estimation approaches can be sufficient. The sample sizes are large enough for the disaggregation requirement, even when additionally considering whether the household has access to electricity. Code Block 

theme  RDark 

firstline  18 

title  Check sample sizes in domains 

linenumbers  true 

collapse  true 

 # How small are the sample sizes for required disaggregation dimensions?
table(survey$urban)
table(survey$electric, survey$urban)

Code Block 

language  text 

title  Sample sizes for disaggregation dimensions 

collapse  true 

 > table(survey$urban)
rural urban
3125 6030
> table(survey$electric, survey$urban)
rural urban
no 679 117
yes 2446 5913


Expand 

title  Analysis & Adaptation 

 To obtain estimates of the proportion of the population with access to electricity, the specification based on the input factors leads to direct estimation. To implement the analysis, a software package needs to be chosen. For this example, the R package survey is used, since it is especially powerful for the analysis of survey data, also under complex survey designs. The survey data set is a sample from the census drawn via simple random sampling. Thus, there is no complex sampling design that needs to be specified; however, in most real household surveys more complex designs are used. In a first step, the survey design is specified with the function svydesign. Code Block 

theme  RDark 

firstline  22 

title  Specify the sampling design 

linenumbers  true 

collapse  true 

 # The direct estimation is conducted with the survey package
library(survey)
# Specify the survey design (here the sampling was simple random sampling such
# that no specific design needs to be specified but it is most likely the case
# in a real household survey)
samplingDesign <- svydesign(ids = ~1, data = survey)

The variable electric is a string variable with the options "yes" for electricity access and "no" if electricity access is not available. The function svymean automatically recognizes that the options are categories of a categorical variable and returns the proportions for both categories. With the help of function svyby, the proportions can be obtained for the geographic locations. Code Block 

theme  RDark 

firstline  30 

title  Get direct estimates 

linenumbers  true 

collapse  true 

 # The function notices that the string means two different categories
svyby(~electric, ~urban, samplingDesign, svymean)

The output contains the point estimates as well as the corresponding standard errors. In urban regions, it is very unlikely not to have access to electricity, but in rural areas the proportion without access is about 22%. Code Block 

language  text 

title  Output 

collapse  true 

 > svyby(~electric, ~urban, samplingDesign, svymean)
urban electricno electricyes se.electricno se.electricyes
rural rural 0.21728000 0.782720 0.007377544 0.007377544
urban urban 0.01940299 0.980597 0.001776416 0.001776416


Evaluation & Benchmarking

To evaluate the domain indicators, the estimation is conducted and the standard error and the CV are calculated as measures for the uncertainty of the estimates. The standard errors are automatically returned along with the point estimates. For the estimation of the CV, the function cv can be used. The CVs are far below the threshold of 20%. Thus, the direct estimation is sufficient at this disaggregation dimension. Code Block 

theme  RDark 

firstline  33 

title  Precision, accuracy and reliability 

linenumbers  true 

collapse  true 

 cv(svyby(~electric, ~urban, samplingDesign, svymean))

Code Block 

language  text 

title  Variance and CV estimates 

collapse  true 

 > cv(svyby(~electric, ~urban, samplingDesign, svymean))
se.electricno se.electricyes
rural 0.03395409 0.009425521
urban 0.09155374 0.001811566

Benchmarking: Direct estimates have the property of aggregating to the population-level estimate, such that benchmarking is not necessary. 
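This additivity can be checked numerically. A quick sketch reusing samplingDesign from above, under the simple random sampling assumed here (the column name electricyes matches the svyby output shown earlier):

```r
# National direct estimate vs. the sample-share-weighted average of the
# urban/rural domain estimates; under simple random sampling these agree
national <- svymean(~electric, samplingDesign)
byUrban <- svyby(~electric, ~urban, samplingDesign, svymean)
shares <- table(survey$urban) / nrow(survey)
sum(shares * byUrban$electricyes)  # should equal coef(national)["electricyes"]
```

The agreement is exact here because the unweighted national mean is the sample-size-weighted average of the domain means.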
Panel 

borderColor  #A21942 

titleColor  white 

borderWidth  3 

titleBGColor  #A21942 

borderStyle  solid 

title  8.5.2 Unemployment rate 

 R Code Expand 

 Goal: Employment is often key to combating hunger and extreme poverty. Thus, the identification of groups without employment could be of interest in order to counteract their unemployment with specialized programs. Indicator of interest: The unemployment rate, defined as the number of unemployed persons divided by the total number of working-age persons in the labour force. The unemployment rate is a proportion describing the fraction of the labour force with the characteristic of being unemployed and takes a value between 0 and 1. In the example, the working age is defined as 15 to 74 years. Disaggregation dimension: The required disaggregation dimensions are sex, age, geographic location (urban/rural), and disability status. The example will consider these dimensions and show some limitations and challenges. 
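The definition above corresponds to:

```latex
u \;=\; \frac{U}{E + U},
```

where \(U\) is the number of unemployed and \(E\) the number of employed working-age (15 to 74) persons, so that the denominator \(E + U\) is the labour force.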
Expand 

 Information about the employment status is available in the survey of the working population (syntheticSurvey2). The data further contains variables for the different disaggregation dimensions. The aggregated data sets contain variables that could help to explain the unemployment rate at different domain levels. Code Block 

theme  RDark 

firstline  8 

title  Load data sets 

linenumbers  true 

collapse  true 

 # Set working directory
setwd("Add path")
# Import sample and census at household level
survey2 <- read.csv("syntheticSurvey2.csv")
# Import aggregated data sets
auxiliaryAgeGeographic <- read.csv("auxiliaryAgeGeographic.csv")
auxiliarySexSpatial <- read.csv("auxiliarySexSpatial.csv")
# First overview of data sets
head(survey2)
head(auxiliaryAgeGeographic)
head(auxiliarySexSpatial)

Code Block 

language  text 

title  First lines of data sets 

collapse  true 

 > head(survey2)
unemployed geolev1 urban age sex disabled ageGroup1 ageGroup2 weight
1 0 170005 urban 17 male no, not disabled 1524 1519 99.83459
2 0 170005 urban 58 male yes, disabled 4564 5559 99.83459
3 0 170005 urban 15 male no, not disabled 1524 1519 99.83459
4 1 170005 urban 33 female no, not disabled 2544 3034 99.83459
5 0 170005 urban 34 male no, not disabled 2544 3034 99.83459
6 1 170005 urban 30 male no, not disabled 2544 3034 99.83459
> head(auxiliaryAgeGeographic)
domain classwk_niu classwk_self_employed classwk_unknown classwk_unpaid_worker classwk_salary_worker
1 1519.rural 0.04476600 0.1262039 0.02371693 0.09226769 0.7130454
2 2024.rural 0.03706738 0.1535128 0.02428678 0.02421396 0.7609190
3 2529.rural 0.02094564 0.1835557 0.02432163 0.01586351 0.7553135
4 3034.rural 0.01368511 0.2085479 0.02479579 0.01252371 0.7404475
5 3539.rural 0.01093352 0.2292661 0.02178753 0.01117207 0.7268408
6 4044.rural 0.01009656 0.2524360 0.02259601 0.01181606 0.7030554
edattain_less_than_primary_completed edattain_primary_completed edattain_secondary_completed edattain_university_completed edattain_unknown
1 0.3369659 0.5581732 0.09767126 0.0001130454 0.007076645
2 0.3266517 0.4485408 0.20982395 0.0083201340 0.006663390
3 0.3924313 0.4100554 0.17023323 0.0208367365 0.006443416
4 0.4583640 0.3925709 0.11648794 0.0261313925 0.006445743
5 0.5042343 0.3693344 0.08955550 0.0296199109 0.007255884
6 0.5572726 0.3349720 0.07052158 0.0296283233 0.007605485
> head(auxiliarySexSpatial)
domain classwk_niu classwk_self_employed classwk_unknown classwk_unpaid_worker classwk_salary_worker
1 female.170005 0.01998601 0.1309242 0.05098013 0.009451625 0.7886580
2 male.170005 0.01475410 0.2249606 0.02876725 0.005483005 0.7260350
3 female.170008 0.06420981 0.1770292 0.03496872 0.014771271 0.7090210
4 male.170008 0.07113928 0.3236331 0.02970074 0.009375234 0.5661517
5 female.170011 0.01891617 0.1630182 0.02206017 0.006888095 0.7891173
6 male.170011 0.01502477 0.2279818 0.01806746 0.003034830 0.7358912
edattain_less_than_primary_completed edattain_primary_completed edattain_secondary_completed edattain_university_completed edattain_unknown
1 0.1566849 0.2839548 0.4182401 0.13812005 0.003000158
2 0.3805409 0.3500880 0.2126424 0.05398722 0.002741502
3 0.1140252 0.2600799 0.3934735 0.23023589 0.002185545
4 0.2128178 0.3372084 0.3350709 0.11257781 0.002325058
5 0.1033345 0.3216923 0.3652386 0.20833877 0.001395883
6 0.2118720 0.3944650 0.2628273 0.12954635 0.001289410


Expand 

 The indicator of interest is a proportion (Mean or total? → No). This indicator is calculated as a ratio, since the working-age population in the labor force is also estimated based on survey data. The goal is to try different disaggregation dimensions. For the disaggregation dimensions sex, age group, geographic location (urban/rural), and disability status, direct estimates are obtained, assuming that the sample sizes are large enough (Small domains? → No). For a combination of the dimensions age group and geographic location, as well as sex and administrative region, area-level models are applied, assuming that the precision could be improved compared to the direct estimates (Small domains? → Yes) and considering that only aggregated auxiliary data is available (Unit-level data? → No). The sample sizes are relatively large for all single disaggregation dimensions (except for unknown disability status). The cross-classification of the disaggregation dimensions partly leads to smaller sample sizes. Code Block 

theme  RDark 

firstline  23 

title  Check sample sizes 

linenumbers  true 

collapse  true 

 # How small are the sample sizes for required disaggregation dimensions?
table(survey2$sex)
table(survey2$ageGroup1)
table(survey2$ageGroup2)
table(survey2$urban)
table(survey2$disabled)
table(survey2$sex, survey2$geolev1)
table(survey2$ageGroup2, survey2$urban)
table(survey2$unemployed, survey2$ageGroup2, survey2$urban)

Code Block 

language  text 

title  Sample sizes for disaggregation dimensions 

collapse  true 

 > table(survey2$sex)
female male
5747 15813
> table(survey2$ageGroup1)
15-24 25-44 45-64 65+
4234 11131 5420 775
> table(survey2$ageGroup2)
15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65+
1544 2690 2908 2942 2742 2539 2076 1560 1097 687 775
> table(survey2$urban)
rural urban
8806 12754
> table(survey2$disabled)
no, not disabled unknown yes, disabled
20143 17 1400
> table(survey2$ageGroup2, survey2$urban)
rural urban
15-19 865 679
20-24 1134 1556
25-29 1095 1813
30-34 1149 1793
35-39 987 1755
40-44 946 1593
45-49 767 1309
50-54 598 962
55-59 487 610
60-64 339 348
65+ 439 336
> table(survey2$sex, survey2$geolev1)
170005 170008 170011 170013 170015 170018 170019 170023 170027 170041 170044 170050 170052 170054 170066 170068 170073 170076 170081
female 666 215 1004 325 387 83 318 141 68 131 86 123 302 351 399 288 217 443 47
male 1872 508 1963 1332 1124 231 926 599 172 635 183 323 870 1142 1049 849 667 987 105
170086 170088 170095
female 62 16 75
male 122 34 120

Tabulating the employed and unemployed separately highlights the low number of unemployed respondents in some domains (shown here for the combination of age group and geographic location). Code Block 

language  text 

title  Sample sizes for disaggregation dimensions by category 

collapse  true 

 > table(survey2$unemployed, survey2$ageGroup2, survey2$urban)
, , = rural
15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65+
0 815 1051 1049 1111 961 918 745 582 479 331 434
1 50 83 46 38 26 28 22 16 8 8 5
, , = urban
15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65+
0 581 1365 1697 1681 1660 1524 1248 892 589 334 315
1 98 191 116 112 95 69 61 70 21 14 21


Expand 

title  Analysis & Adaptation 

 The goal of the analysis is to obtain estimates of the unemployment rate for different disaggregation dimensions. The following descriptions focus on the combination of the dimensions age group and geographic location. To estimate the unemployment rate for the different disaggregation dimensions, the specification based on the input factors leads to direct estimation and an area-level model. To implement the analysis, a software package needs to be chosen. For this example, the R packages survey and emdi are used (for the data management, the package reshape2 is also loaded). The survey data set contains sampling weights to correct for differences in selection probabilities. One popular R package for the analysis of survey data, especially with complex survey designs, is the survey package. The survey design can be specified with the function svydesign. In this example, the survey design is specified with the sampling weights. Code Block 

theme  RDark 

firstline  36 

title  Specify the sampling design 

linenumbers  true 

collapse  true 

 # For a first analysis, direct estimation is conducted with the survey package
library(survey)
# Specify the survey design
samplingDesign <- svydesign(ids = ~1, weights = ~weight, data = survey2)

The variable unemployed is 1 when a person is unemployed and 0 when the person is employed. Thus, the unemployment rate can be estimated directly by taking the mean within the required disaggregation dimension. The function svyby applies the function svymean, which estimates the mean and its standard error, by the required disaggregation dimension while accounting for the sampling design. Code Block 

theme  RDark 

firstline  42 

title  Get direct estimates 

linenumbers  true 

collapse  true 

 # Unemployment rate by combination of age groups and geographic location
svyby(~unemployed, ~ageGroup2 + urban, samplingDesign, svymean)
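Conceptually, the weighted estimation performed by svymean can be illustrated in a few lines of base R. This is a sketch with made-up toy numbers, not the survey package's internal implementation:

```r
# A survey-weighted proportion is the weighted mean of a 0/1 indicator:
# p_hat = sum(w * y) / sum(w). Toy numbers, for illustration only.
weightedProportion <- function(y, w) sum(w * y) / sum(w)

y <- c(1, 0, 0, 1, 0)   # hypothetical 0/1 unemployment indicator
w <- c(2, 1, 1, 3, 1)   # hypothetical sampling weights
p_hat <- weightedProportion(y, w)
p_hat   # 0.625
```

The standard errors reported by svymean additionally account for the sampling design, which this sketch does not attempt to reproduce.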

For the combination of age groups in five-year intervals and the geographic location, it is apparent that the unemployment rate is especially high for the young urban population, at 14% in the age group 15-19. Code Block 

language  text 

title  Output 

collapse  true 

 > svyby(~unemployed, ~ageGroup2 + urban, samplingDesign, svymean)
ageGroup2 urban unemployed se
15-19.rural 15-19 rural 0.06056085 0.009828506
20-24.rural 20-24 rural 0.07150598 0.008691182
25-29.rural 25-29 rural 0.03741252 0.006284867
30-34.rural 30-34 rural 0.03192538 0.006060037
35-39.rural 35-39 rural 0.02243432 0.005381977
40-44.rural 40-44 rural 0.02927986 0.006382300
45-49.rural 45-49 rural 0.02200518 0.005277605
50-54.rural 50-54 rural 0.01985975 0.005739776
55-59.rural 55-59 rural 0.01898093 0.007267124
60-64.rural 60-64 rural 0.02538600 0.009504805
65+.rural 65+ rural 0.01107436 0.005740872
15-19.urban 15-19 urban 0.14233478 0.015490624
20-24.urban 20-24 urban 0.12001686 0.009406144
25-29.urban 25-29 urban 0.05580399 0.005981996
30-34.urban 30-34 urban 0.06480648 0.006693129
35-39.urban 35-39 urban 0.05163059 0.005949129
40-44.urban 40-44 urban 0.04330892 0.005920045
45-49.urban 45-49 urban 0.05145245 0.007122834
50-54.urban 50-54 urban 0.08046659 0.010350335
55-59.urban 55-59 urban 0.03489428 0.008645406
60-64.urban 60-64 urban 0.03867333 0.011588445
65+.urban 65+ urban 0.07394510 0.017770256

While the direct estimation does not require any model building or checking, it is the starting point for the area-level model. The first set of inputs required for a basic area-level model are the direct estimates and their sampling error variances. While the sampling error variance is assumed to be known, it usually needs to be estimated from unit-level data (see Estimation of the sampling error variance for more information). In this application, the indicator of interest is a proportion defined between 0 and 1. Thus, it is recommended to use either a different model specification for variables between 0 and 1 or the arcsin transformation. The latter option is chosen since it allows staying within the same model class. To apply the transformed area-level model, an approximation for the sampling error variance is needed, and for this the effective sample size has to be calculated. The design effect can be estimated with the svymean function from the survey package. The effective sample size is then obtained by dividing the sample size by the design effect (Casas-Cordero, Encina and Lahiri, 2016). Code Block 

theme  RDark 

firstline  49 

title  Data preparation for area-level model application 

linenumbers  true 

collapse  true 

 # Get the direct estimate and the design effect
directEstim <- svyby(~unemployed, ~ageGroup2 + urban, samplingDesign, svymean, deff = TRUE)
directEstim <- directEstim[, c('unemployed', 'se', 'DEff.unemployed')]
directEstim$domain <- row.names(directEstim)
# Calculate the sample size for the domains
library(reshape2)
sampleSize <- melt(table(survey2$ageGroup2, survey2$urban))
sampleSize$domain <- paste0(sampleSize$Var1, '.', sampleSize$Var2)
# Merge the direct estimate and the sample size
directEstim <- merge(directEstim, sampleSize[, c('domain', 'value')], by = 'domain')
# Calculate the effective sample size
directEstim$effSampleSize <- directEstim$value/directEstim$DEff.unemployed
# Calculate the variance
directEstim$directVar <- directEstim$se^2
# Reduce to relevant variables
directEstim <- directEstim[, c('domain', 'unemployed', 'effSampleSize', 'directVar')]

Code Block 

language  text 

title  Output 

collapse  true 

 > directEstim
domain unemployed effSampleSize directVar
1 15-19.rural 0.06056085 577.7407 9.659952e-05
2 15-19.urban 0.14233478 500.7557 2.399594e-04
3 20-24.rural 0.07150598 861.5977 7.553665e-05
4 20-24.urban 0.12001686 1174.5608 8.847555e-05
5 25-29.rural 0.03741252 893.9596 3.949955e-05
6 25-29.urban 0.05580399 1448.8465 3.578428e-05
7 30-34.rural 0.03192538 824.8754 3.672405e-05
8 30-34.urban 0.06480648 1330.6919 4.479797e-05
9 35-39.rural 0.02243432 742.4734 2.896567e-05
10 35-39.urban 0.05163059 1360.5912 3.539214e-05
11 40-44.rural 0.02927986 684.2333 4.073376e-05
12 40-44.urban 0.04330892 1162.5466 3.504693e-05
13 45-49.rural 0.02200518 757.5538 2.785311e-05
14 45-49.urban 0.05145245 946.6877 5.073477e-05
15 50-54.rural 0.01985975 579.5396 3.294503e-05
16 50-54.urban 0.08046659 679.5051 1.071294e-04
17 55-59.rural 0.01898093 346.3156 5.281110e-05
18 55-59.urban 0.03489428 443.4726 7.474305e-05
19 60-64.rural 0.02538600 268.9703 9.034132e-05
20 60-64.urban 0.03867333 272.8263 1.342920e-04
21 65+.rural 0.01107436 326.1544 3.295761e-05
22 65+.urban 0.07394510 213.4372 3.157820e-04
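As a plausibility check, the columns of the table can be related to each other. The implied design effect is the sample size divided by the effective sample size, and on the arcsin scale the sampling variance is later approximated by 1/(4 × effective sample size). A base-R sketch using the first row (domain 15-19.rural, with n = 865 from the sample-size table above):

```r
# First row of directEstim: domain 15-19.rural
n <- 865                      # sample size from table(survey2$ageGroup2, survey2$urban)
effSampleSize <- 577.7407     # effective sample size, n divided by the design effect
deff <- n / effSampleSize     # implied design effect, about 1.5

# Approximate sampling variance of arcsin(sqrt(p_hat)): 1 / (4 * n_eff)
approxVarArcsin <- 1 / (4 * effSampleSize)
```

This approximate variance on the transformed scale is what the arcsin-transformed area-level model works with internally.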

The second set of inputs is the aggregated auxiliary information at the domain level. To use the package emdi for the analysis, the two data sets need to be combined into one. While the data preparation steps above can differ depending on the data set used, the data set that is given to the function fh in the package emdi needs to look like the combined data set below, containing the direct estimate, the sampling error variance (here estimated by the direct variance), the effective sample size (if the arcsin transformation is applied), and the auxiliary variables. Code Block 

theme  RDark 

firstline  71 

title  Combine data sets 

linenumbers  true 

collapse  true 

 # Combine the survey data with the additional data source
combined_data <- merge(directEstim, auxiliaryAgeGeographic, by = 'domain')
head(combined_data)

Code Block 

language  text 

title  First lines of combined data set 

collapse  true 

 > head(combined_data)
domain unemployed effSampleSize directVar classwk_niu classwk_self_employed classwk_unknown classwk_unpaid_worker
1 15-19.rural 0.06056085 577.7407 9.659952e-05 0.04476600 0.12620393 0.02371693 0.092267692
2 15-19.urban 0.14233478 500.7557 2.399594e-04 0.10781630 0.08427786 0.03029295 0.048356147
3 20-24.rural 0.07150598 861.5977 7.553665e-05 0.03706738 0.15351284 0.02428678 0.024213957
4 20-24.urban 0.12001686 1174.5608 8.847555e-05 0.06567831 0.10625758 0.02763255 0.008663353
5 25-29.rural 0.03741252 893.9596 3.949955e-05 0.02094564 0.18355568 0.02432163 0.015863508
6 25-29.urban 0.05580399 1448.8465 3.578428e-05 0.03302284 0.15582710 0.03055306 0.004865550
classwk_salary_worker edattain_less_than_primary_completed edattain_primary_completed edattain_secondary_completed
1 0.7130454 0.33696586 0.5581732 0.09767126
2 0.7292567 0.14210055 0.4969170 0.35804017
3 0.7609190 0.32665174 0.4485408 0.20982395
4 0.7917682 0.08839411 0.3060407 0.53568936
5 0.7553135 0.39243126 0.4100554 0.17023323
6 0.7757315 0.10186204 0.2872987 0.44789884
edattain_university_completed edattain_unknown
1 0.0001130454 0.007076645
2 0.0012025074 0.001739798
3 0.0083201340 0.006663390
4 0.0683943275 0.001481466
5 0.0208367365 0.006443416
6 0.1617471579 0.001193262

The aim of the next step is to fit a suitable model. The auxiliary data contains information on working class and educational attainment. Both categorical predictors are potentially associated with employment status. Thus, two models are built and compared. For fitting the model, the R package emdi is used since it provides the option of using the arcsin transformation. The bias-corrected back-transformation (bc) is chosen instead of a naive back-transformation. Code Block 

theme  RDark 

firstline  76 

title  Model building 

linenumbers  true 

collapse  true 

 # Model building 
# Load package emdi for model fitting
library(emdi)
# Model with educational attainment
unemplFH1 <- fh(fixed = unemployed ~ edattain_less_than_primary_completed +
edattain_primary_completed + edattain_secondary_completed +
edattain_university_completed,
vardir = "directVar", combined_data = combined_data, domains = "domain",
method = 'ml', transformation = "arcsin", backtransformation = "bc",
eff_smpsize = "effSampleSize")
summary(unemplFH1)
# Model with working class
unemplFH2 <- fh(fixed = unemployed ~ classwk_unpaid_worker + classwk_salary_worker +
classwk_niu + classwk_self_employed,
vardir = "directVar", combined_data = combined_data, domains = "domain",
method = 'ml', transformation = "arcsin", backtransformation = "bc",
eff_smpsize = "effSampleSize")
summary(unemplFH2)

With the help of the summary function, the practitioner gets information about the number of domains as well as some model diagnostics, e.g., the information criteria and the R². While there are multiple methods and criteria that can be used to compare different models, the focus in this example is on the information criteria and the adjusted R². Since the information criteria AIC and BIC are smaller for unemplFH2 and the adjusted R² is larger for this model, it will be used going forward. Please note that the estimation method for the parameters needs to be set to maximum likelihood if the information criteria are used for model selection. Code Block 

language  text 

title  Output summary 

collapse  true 

 > summary(unemplFH1)
Call:
fh(fixed = unemployed ~ edattain_less_than_primary_completed +
edattain_primary_completed + edattain_secondary_completed +
edattain_university_completed, vardir = "directVar",
combined_data = combined_data, domains = "domain",
method = "ml", transformation = "arcsin", backtransformation = "bc",
eff_smpsize = "effSampleSize")
Out-of-sample domains: 0
In-sample domains: 22
Variance and MSE estimation:
Variance estimation method: ml
Estimated variance component(s): 0.0006418727
MSE method: no mse estimated
Coefficients:
coefficients std.error t.value p.value
(Intercept) -15.95450 6.017710 -2.651257 0.008019284
edattain_less_than_primary_completed 16.24525 6.093368 2.666054 0.007674728
edattain_primary_completed 16.30899 6.033234 2.703192 0.006867715
edattain_secondary_completed 16.41763 6.030143 2.722594 0.006477162
edattain_university_completed 15.64274 5.932548 2.636765 0.008370069
Explanatory measures:
loglike AIC BIC R2 AdjR2
1 43.8617 -75.7234 -69.17714 0.7071698 0.7725552
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Standardized_Residuals 0.8565795 3.109786 0.9275953 0.1093302
Random_effects 0.7580138 3.594312 0.9276477 0.1096054
Transformation:
Transformation Back_transformation
arcsin bc
> summary(unemplFH2)
Call:
fh(fixed = unemployed ~ classwk_unpaid_worker + classwk_salary_worker +
classwk_niu + classwk_self_employed, vardir = "directVar",
combined_data = combined_data, domains = "domain",
method = "ml", transformation = "arcsin", backtransformation = "bc",
eff_smpsize = "effSampleSize")
Out-of-sample domains: 0
In-sample domains: 22
Variance and MSE estimation:
Variance estimation method: ml
Estimated variance component(s): 0.000322217
MSE method: no mse estimated
Coefficients:
coefficients std.error t.value p.value
(Intercept) 5.220890 1.241659 4.204771 2.613461e-05
classwk_unpaid_worker -5.487824 1.180430 -4.649004 3.335425e-06
classwk_salary_worker -5.126887 1.258385 -4.074181 4.617651e-05
classwk_niu -3.299256 1.548756 -2.130262 3.314998e-02
classwk_self_employed -5.349499 1.321804 -4.047119 5.185180e-05
Explanatory measures:
loglike AIC BIC R2 AdjR2
1 48.56608 -85.13215 -78.5859 0.8139054 0.8835492
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Standardized_Residuals 0.7114812 3.428400 0.9479491 0.2874469
Random_effects 0.5690609 3.584361 0.9606795 0.5031370
Transformation:
Transformation Back_transformation
arcsin bc 
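The information criteria in the summaries follow the standard definitions and can be recomputed from the reported log-likelihood. A base-R check for unemplFH2, with p = 6 estimated parameters (five regression coefficients plus one variance component) and m = 22 in-sample domains:

```r
# AIC = -2*loglik + 2p, BIC = -2*loglik + p*log(m)
loglike <- 48.56608   # log-likelihood reported for unemplFH2
p <- 6                # 5 fixed-effect coefficients + 1 variance component
m <- 22               # number of in-sample domains

aic <- -2 * loglike + 2 * p        # about -85.13
bic <- -2 * loglike + p * log(m)   # about -78.59
```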
After an appropriate model is chosen, the model assumptions must be checked. As before, the summary function provides information about the residuals and the random effects. Looking at the skewness and kurtosis, the kurtosis estimates for the distributions of the residuals and random effects are compatible with a normal distribution, but the data is slightly skewed. The skewness of a normal distribution is 0 and its kurtosis is 3. The Shapiro-Wilk test, which tests the hypothesis that the data follows a normal distribution, does not reject the normality of the residuals and random effects. Another option to check the normality assumption is via plots, e.g., a Q-Q plot that can be produced with the plot function. Code Block 

theme  RDark 

firstline  99 

title  Model assumptions 

linenumbers  true 

collapse  true 

 # Model assumptions 
# Check skewness, kurtosis, shapiro_w and shapiro_p test
summary(unemplFH2)
# Check diagnostic plots
plot(unemplFH2) 
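The skewness and kurtosis reported in the summary are moment-based measures. A minimal base-R sketch of such measures (note that exact definitions vary slightly across packages; this version uses the sample standard deviation):

```r
# Moment-based skewness and kurtosis; a normal distribution has
# skewness 0 and kurtosis 3 (definitions vary slightly across packages)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
kurtosis <- function(x) mean((x - mean(x))^4) / sd(x)^4

x <- c(-2, -1, 0, 1, 2)   # perfectly symmetric toy sample
s <- skewness(x)          # 0 for a symmetric sample
```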
The Q-Q plots show a difficulty of this application: the area-level model is fitted with only 22 domains. With a small number of domains, it is more likely that the Shapiro-Wilk test will not reject normality (the null hypothesis). The Q-Q plots may help to confirm or reject the normality assumption, but their interpretation remains somewhat subjective. In the following, the model unemplFH2 is used as the final model, assuming no violation of the model assumptions.
After checking the model assumptions, the final model can be fitted. The estimators function returns a table that allows a quick comparison of the model-based estimates with the direct estimates. For the final model, the default estimation method, restricted maximum likelihood (REML), is used in this example. Code Block 

theme  RDark 

firstline  108 

title  Fit final model 

linenumbers  true 

collapse  true 

 # Fit final model 
unemplFH2 <- fh(fixed = unemployed ~ classwk_unpaid_worker + classwk_salary_worker +
classwk_niu + classwk_self_employed,
vardir = "directVar", combined_data = combined_data, domains = "domain",
transformation = "arcsin", backtransformation = "bc",
eff_smpsize = "effSampleSize")
estimators(unemplFH2) 
The returned table of point estimates shows that the direct and model-based estimates are quite similar across the domains. Code Block 

title  Point estimates 

collapse  true 

 > estimators(unemplFH2)
Indicator/s: All indicators
Domain Direct FH
1 15-19.rural 0.06056085 0.05811014
2 15-19.urban 0.14233478 0.15020506
3 20-24.rural 0.07150598 0.06667850
4 20-24.urban 0.12001686 0.11536616
5 25-29.rural 0.03741252 0.03974987
6 25-29.urban 0.05580399 0.06016464
7 30-34.rural 0.03192538 0.03420204
8 30-34.urban 0.06480648 0.06473852
9 35-39.rural 0.02243432 0.02509315
10 35-39.urban 0.05163059 0.05036153
11 40-44.rural 0.02927986 0.02892317
12 40-44.urban 0.04330892 0.04197867
13 45-49.rural 0.02200518 0.02476735
14 45-49.urban 0.05145245 0.04953178
15 50-54.rural 0.01985975 0.02062209
16 50-54.urban 0.08046659 0.06542743
17 55-59.rural 0.01898093 0.02031048
18 55-59.urban 0.03489428 0.04331501
19 60-64.rural 0.02538600 0.01816915
20 60-64.urban 0.03867333 0.04617507
21 65+.rural 0.01107436 0.01606912
22 65+.urban 0.07394510 0.06862117 
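The similarity of the direct and FH columns reflects the structure of the Fay-Herriot estimator: on the transformed scale it is a convex combination of the direct estimate and the regression-synthetic part, weighted by the shrinkage factor gamma. A sketch with hypothetical numbers (the variables below are illustrative, not model output):

```r
# EBLUP as a shrinkage combination (hypothetical numbers, transformed scale)
eblup <- function(direct, synthetic, gamma) {
  gamma * direct + (1 - gamma) * synthetic
}

direct    <- 0.28  # hypothetical direct estimate on the arcsin scale
synthetic <- 0.24  # hypothetical regression-synthetic part
gamma     <- 0.59  # hypothetical shrinkage factor
est <- eblup(direct, synthetic, gamma)  # lies between the two inputs
```

The larger gamma is, the closer the model-based estimate stays to the direct estimate.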

Evaluation & Benchmarking To evaluate the domain indicators, the model is fitted and the MSE and the CV are calculated as measures of the uncertainty of the estimates. The estimation of the MSE and CV is triggered by setting the parameter MSE to TRUE. For the transformed area-level model with bias-corrected back-transformation, a bootstrap MSE is provided. The parameter B controls the number of bootstrap iterations. It is advisable to set B to at least 100 to obtain reliable MSE estimates. Code Block 

theme  RDark 

firstline  124 

title  Precision, accuracy and reliability 

linenumbers  true 

collapse  true 

 # Evaluation 
unemplFH2 <- fh(fixed = unemployed ~ classwk_unpaid_worker + classwk_salary_worker +
classwk_niu + classwk_self_employed,
vardir = "directVar", combined_data = combined_data, domains = "domain",
transformation = "arcsin", backtransformation = "bc",
eff_smpsize = "effSampleSize", MSE = TRUE, mse_type = 'boot',
B = 100)
estimators(unemplFH2, MSE = TRUE, CV = TRUE)
head(estimators(unemplFH2, MSE = TRUE, CV = TRUE)) 
The estimated regional indicators (the unemployment rate in this example) with their MSE and CV can be obtained in the form of a table. Generally, the CV should be used with caution when the indicator of interest is a ratio, since point estimates close to zero can also be the reason for large CVs. In such cases, it is recommended to focus on the MSE. In this example, the CV of the model-based estimate (FH) is generally lower than that of the direct estimate. However, there are also cases where the CV is slightly larger. One reason could be that the number of bootstrap iterations is too small. Code Block 

title  MSE and CV per domain 

collapse  true 

 > head(estimators(unemplFH2, MSE = TRUE, CV = TRUE))
Domain Direct Direct_MSE Direct_CV FH FH_MSE FH_CV
1 15-19.rural 0.06056085 9.659952e-05 0.16229141 0.05811014 8.992932e-05 0.16319196
2 15-19.urban 0.14233478 2.399594e-04 0.10883232 0.15020506 2.052526e-04 0.09538053
3 20-24.rural 0.07150598 7.553665e-05 0.12154483 0.06667850 3.694515e-05 0.09115761
4 20-24.urban 0.12001686 8.847555e-05 0.07837353 0.11536616 4.120060e-05 0.05563823
5 25-29.rural 0.03741252 3.949955e-05 0.16798831 0.03974987 4.805691e-05 0.17439831
6 25-29.urban 0.05580399 3.578428e-05 0.10719657 0.06016464 4.202419e-05 0.10774778 
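The CV columns are simply the root-MSE relative to the point estimate, which can be verified with the numbers from the first row of the table:

```r
# CV = sqrt(MSE) / point estimate, first row (direct estimate, 15-19.rural)
directEst <- 0.06056085
directMSE <- 9.659952e-05
directCV  <- sqrt(directMSE) / directEst   # about 0.162, i.e., a CV of 16.2%
```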
The model-based estimates are commonly compared with the results of the direct estimation. The function compare_plot in emdi provides some plots for this comparison. Code Block 

theme  RDark 

firstline  136 

title  Comparison with direct estimation 

linenumbers  true 

collapse  true 

 # Comparison with direct estimates
compare_plot(unemplFH2, MSE = TRUE, CV = TRUE) 
Comparing direct with model-based estimates helps to evaluate whether the model-based estimates are more reliable than the direct estimates, measured in terms of the MSE or the CV. The boxplots confirm that the model-based estimates have lower CVs overall. Approximately 75% of the model-based domain estimates show a CV below 20%. It is also apparent that the gain in efficiency is not large. Furthermore, the second plot (on the right) shows that there are also domains where the CV of the model-based estimates is slightly larger than that of the corresponding direct estimates.

When comparing the direct and model-based point estimates, it can be seen that these do not differ strongly from each other.

This could be due to a large weight on the direct estimator. To check this, the estimated shrinkage factor can be inspected. Furthermore, the Brown test can be conducted, and the correlation between the direct estimates and the estimates of the regression-synthetic part can be calculated with the function compare. Code Block 

theme  RDark 

firstline  138 

title  Comparison with direct estimation (continued) 

linenumbers  true 

collapse  true 

 # Summary of the estimated shrinkage factors
summary(unemplFH2$model$gamma$Gamma)
# Brown test and correlation between direct estimates and regression-synthetic part
compare(unemplFH2) 
The results show that the weight on the direct estimate ranges between 0.31 and 0.75, with a median of 0.59. Thus, the weight on the direct estimate is relatively high. Furthermore, the null hypothesis that the model-based estimates do not differ significantly from the direct estimates is not rejected, and the correlation between the direct estimates and the regression-synthetic part is high at 0.93. Code Block 

title  Comparison with direct estimation (continued) 

collapse  true 

 > summary(unemplFH2$model$gamma$Gamma)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3050 0.4845 0.5944 0.5680 0.6574 0.7487
> compare(unemplFH2)
Brown test
Null hypothesis: EBLUP estimates do not differ significantly from the
direct estimates
W.value Df p.value
4.547096 22 0.9999734
Correlation between synthetic part and direct estimator: 0.93 
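In the basic Fay-Herriot model, the shrinkage factor has the closed form gamma_d = sigma_u² / (sigma_u² + var_d). A sketch combining the variance component estimated for unemplFH2 with the approximate arcsin-scale sampling variance of one domain (the variance approximation from the data preparation step is an assumption of this illustration):

```r
# gamma_d = sigma2_u / (sigma2_u + var_d): domains with small sampling
# variance put more weight on their direct estimate
sigma2_u <- 0.000322217       # variance component from summary(unemplFH2)
var_d <- 1 / (4 * 577.7407)   # approx. arcsin-scale sampling variance,
                              # domain 15-19.rural (effective sample size)
gamma_d <- sigma2_u / (sigma2_u + var_d)
gamma_d   # about 0.43, inside the reported range of the Gamma summary
```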
Before communicating the estimated indicators to policy makers, it is common practice to benchmark the small area estimates to a direct population estimate. For the direct population estimate, the whole sample is used. In this example, the aggregate of the model-based domain estimates is about 5.5%, already quite close to the direct population estimate of 5.6%. This is because the model-based domain estimates are close to the direct domain estimates. To ensure consistency with the direct population estimate, the benchmarking following Datta et al. (2011) (option 1) is used in this example. To conduct the benchmarking, the population shares per domain are needed. Download population shares Code Block 

theme  RDark 

firstline  144 

title  Benchmarking 

linenumbers  true 

collapse  true 

 # Benchmarking 
# Calculate population direct estimate
svymean(~unemployed, samplingDesign)
# Load population shares
sharePopulationSize <- read.csv('sharePopulationSize.csv')
# To ensure the same ordering, the data sets are merged
share <- merge(unemplFH2$ind, sharePopulationSize, by.x = 'Domain', by.y = 'domain')
# The model-based estimate almost equals the direct estimate
sum(share$ratio_n * unemplFH2$ind$FH)
# Benchmarking with raking
unemplFH2_benchmarked <- benchmark(unemplFH2, benchmark = 0.056111,
share = share$ratio_n)
# The benchmarked value equals the direct estimate
sum(share$ratio_n * unemplFH2_benchmarked$FH_Bench)
# Benchmarked point estimates
head(unemplFH2_benchmarked) 
The benchmarking changes the estimates only slightly. Code Block 

title  Benchmarked estimates 

collapse  true 

 > head(unemplFH2_benchmarked)
Domain Direct FH FH_Bench Out
1 15-19.rural 0.06056085 0.05811014 0.05865782 0
2 15-19.urban 0.14233478 0.15020506 0.15075274 0
3 20-24.rural 0.07150598 0.06667850 0.06722618 0
4 20-24.urban 0.12001686 0.11536616 0.11591384 0
5 25-29.rural 0.03741252 0.03974987 0.04029754 0
6 25-29.urban 0.05580399 0.06016464 0.06071232 0 
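A constant additive shift is visible in the benchmarked output: FH_Bench minus FH is the same for every domain. A raking-type adjustment with this property can be sketched as follows, using hypothetical point estimates and shares (not the values from this application):

```r
# Raking-type benchmarking (sketch): shift every domain estimate by the same
# constant so the share-weighted aggregate equals the benchmark
benchmarkRaking <- function(fh, share, benchmark) {
  fh + (benchmark - sum(share * fh))
}

fh    <- c(0.058, 0.150, 0.067)   # hypothetical FH point estimates
share <- c(0.5, 0.3, 0.2)         # hypothetical population shares (sum to 1)
fhBench <- benchmarkRaking(fh, share, benchmark = 0.056)
sum(share * fhBench)              # equals the benchmark by construction
```

Because the shares sum to 1, adding the same constant to every domain estimate moves the weighted aggregate exactly onto the benchmark while preserving the differences between domains.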

