Descriptive analysis of data
Modified on 2015/05/26 14:56 by Sean Zheng — Categorized as: Chapter 4 - Analysis and presentation of gender statistics
The degree of data processing and analysis varies according to the types of statistical products prepared by the national statistical offices. (See box IV.1 for types of statistical products that may include gender statistics.) Typically, tables constructed to disseminate data collected in censuses or surveys involve minimum data processing and analysis. Large amounts of data are provided, often as absolute frequencies or counts of observations, making it difficult to discern the main differences between women and men. Additional processing and analysis are developed when more analytical reports or articles focused on specific topics are prepared. In those cases, the differences between women and men may become more visible.
Gender statistics require the cross-tabulation of at least two statistical variables: sex and the main characteristic that is studied, such as educational attainment or labour force participation. Ideally, additional variables are used in further cross-tabulation of data (for example, by age group or geographical area) in three-way or multiple-way tables. Although statistics on individuals have traditionally been disseminated as totals with no further information on women and men, data are increasingly disaggregated by sex in dissemination materials. Still, one limitation in producing gender statistics persists: sex is often used as only one of the breakdown variables for the data presented, rather than in combination with the others. As explained in chapter I and shown in chapter II, gender statistics and a meaningful gender analysis commonly require disaggregation by sex and other characteristics at the same time. For example, gender segregation in the labour market is partially determined by the gender gap in education; therefore, data on occupations should be further disaggregated by the level of educational attainment.
A basic descriptive analysis of data involves calculating, by sex and for each sex, simple measures of the composition and distribution of variables that facilitate straightforward gender-focused comparisons between different population groups. Depending upon the type of data, these measures may be, for example, proportions, rates, ratios or averages. Furthermore, when necessary, such as in the case of sample surveys, measures of association between variables can be used to decide whether the differences observed between women and men are statistically significant.
Percentages, ratios, rates or averages are the basis for the calculation of gender indicators. Indicators, in general, are used to “indicate” how differently one group performs by comparison to a norm or reference group. Gender indicators should show how women perform in comparison to men, and their status relative to men’s status, in areas such as education, formal work, access to resources, health and decision-making. In this regard, gender indicators are important tools for planners and policymakers in monitoring progress towards gender equality.
The sections that follow present the type of data involved in gender statistics, measures of composition and distribution used in gender statistics and the types of gender indicators that can be constructed using those measures.
Box IV.1 Types of statistical products that disseminate gender statistics
Gender statistics are made available by national statistical offices through various types of dissemination products. Some of the dissemination products are part of the regular production of a statistical office and are aimed at making available data collected in censuses, sample surveys or compiled from administrative records. They usually concern one type of data source or one statistical field and are intended for specialists who wish to further analyse the results of censuses or surveys or to carry out research on specific topics. The data disseminated in these types of products can be detailed, organized in large tables and often are presented as absolute values or raw data that give specialists more flexibility in doing their own analysis. A gender perspective can be integrated into these products through the systematic sex-disaggregation of data and the systematic coverage of data needed to address gender issues.
Other dissemination products that may include gender statistics are analytical reports or articles focused on specific topics. Data and other information may be compiled from more than one source and different statistical fields may be covered. Policy concerns are usually taken into account. These publications are intended for a larger audience: not only statisticians but also research and policy specialists in the topic or topics covered. Data disseminated in this type of product are presented in small summary tables and charts and discussed in the accompanying text. Large tables with more detailed data may be provided in annexes. A gender perspective can be integrated into these products using three elements: data-based analysis of gender issues specific to the selected topic; illustrations with gender-sensitive tables and charts; and systematic sex-disaggregation of data presented in the annexes of the publication.
Statistical publications focused on gender issues are one type of analytical report. Typical examples are the “women and men” publications produced by many national statistical offices. These publications contain data from different statistical fields and from different sources, cover multiple policy areas and gender issues and are addressed to a large audience, including persons with limited or no experience in statistics. They are an important tool for non-statisticians, gender specialists, gender advocates and policymakers. Instead of presenting data and letting readers analyse them and draw their own conclusions, these publications focus on presenting the main results of data analysis and their interpretation, including implications for policymaking. They are usually designed to be user friendly and written in easily comprehended language, with simple tables and charts and an attractive presentation.
Lastly, gender statistics are disseminated through dedicated databases or through more comprehensive databases such as those focused on social indicators, development indicators or human development indicators. Data disseminated in this format usually cover several areas of concern and several points in time or time periods. Data are usually presented already processed into indicators that facilitate comparisons over time or between various groups of population. Information on the calculation of indicators included in the database, underlying definitions or concepts used and sources of data used are sometimes made available with the database. This type of dissemination product is usually targeted to specialists interested in analysing statistical information themselves, including for monitoring purposes.
Source: Hedman, Perucci and Sundström, 1996; United Nations, 1997; and United Nations, Economic Commission for Europe, and World Bank Institute, 2010.
Types of data involved in gender statistics: qualitative and quantitative variables
Statistical variables are classified into two broad classes based on their measurement level: qualitative variables, also called categorical variables (for example, sex, marital status, ethnicity and educational attainment); and quantitative variables (for example, age, income and time spent on paid or unpaid activities). Categorical variables are of two major types: nominal variables (such as sex and marital status) and ordinal variables (such as educational attainment). Nominal variables do not imply any continuum or sequence of categories. Typical examples include sex or ethnicity. The categories can be arranged in any order without inconvenience in the analysis. For convenience in presentation, however, they can be arranged alphabetically, in order of their relative size in the population or in order of relative focus of the publication (for example, first women, followed by men). Ordinal variables imply an underlying continuum. When dealing with ordinal variables, the categories must be arranged in the order implied by the continuum to facilitate analysis of the data. A typical example is “level of educational attainment”. The categories can be organized in ascending or descending level of education. For example: no education, primary education, secondary education, post-secondary non-tertiary education and tertiary education. Some continuous variables tend to be coded into a few categories and treated as ordinal variables. For example, age in single years can be recoded in 5-year age groups and displayed from the youngest to the oldest ages.
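The recoding of age in single years into 5-year age groups described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure; the function name and the sample ages are hypothetical.

```python
def five_year_age_group(age):
    """Return the 5-year age group label (for example '15-19') for an age in single years."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // 5) * 5  # lower bound of the 5-year group
    return f"{lower}-{lower + 4}"


# Hypothetical ages in single years
ages = [17, 23, 15, 64, 30]

# Distinct groups, displayed from the youngest to the oldest ages
groups = sorted(
    {five_year_age_group(a) for a in ages},
    key=lambda label: int(label.split("-")[0]),
)
print(groups)
```

Sorting on the lower bound of each group keeps the order implied by the underlying continuum, which is what makes the recoded variable ordinal.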
The distinction between types of variables is important because specific statistical measures can be applied to each type of variable, as shown in the paragraphs that follow.
Measures of composition or distribution for qualitative variables
Computations of proportions, percentages, ratios and rates are basic statistical procedures used in describing the categorical composition or distribution of qualitative variables and serve as useful tools for the standardization of the statistics compared. It is important to keep in mind that the measures of composition or distribution should not be calculated for small numbers of observations. In that case, actual numbers (absolute frequencies) should be preferred.
Proportions and percentages
A proportion is defined as the number of observations in a given category of a variable relative to the total number of observations for that variable. It is calculated as the number of observations in the given category divided by the total number of observations. The sum of the proportions of observations in each category of a variable should equal unity, unless the categories of the variable are not mutually exclusive. Most often, proportions are expressed as percentages. Percentages are obtained by multiplying proportions by 100. Percentages will add up to 100 unless the categories are not mutually exclusive.
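As a minimal sketch of this calculation, the snippet below derives proportions and percentages from category counts. The counts are hypothetical and the function name is an assumption made for illustration.

```python
def proportions(counts):
    """Divide each category count by the total number of observations."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    return {category: n / total for category, n in counts.items()}


# Hypothetical counts of women by economic activity status (mutually exclusive categories)
women = {"employed": 390, "unemployed": 20, "not economically active": 590}

props = proportions(women)
percentages = {category: 100 * p for category, p in props.items()}

for category, pct in percentages.items():
    print(f"{category}: {pct:.1f} per cent")
```

Because the categories are mutually exclusive, the proportions sum to unity and the percentages to 100, apart from floating-point rounding.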
In gender statistics, proportions can be calculated as relative measures of (a) the distribution of each sex by selected characteristics; and (b) the sex distribution within the categories of a characteristic. These two types of proportions are presented in table IV.1. In the first case, the proportions are calculated as relative frequencies of the categories of a characteristic for each sex, with women’s and men’s respective totals used as the denominators. For example, the third column of data in table IV.1 shows that the employed represent 39 per cent of all women. This is calculated as the number of women employed divided by women’s total population in the corresponding age group and multiplied by 100. In comparison, the employed represent 73 per cent of all men, as shown in the fourth column of data. This is calculated as the number of men employed divided by men’s total population in the corresponding age group and multiplied by 100.
In gender-related analysis, proportions calculated as percentage distributions can be used to compare women and men with regard to various social or economic characteristics. A simple measure of the gender gap is the differential prevalence, where per cents in the distribution of a characteristic within the female population are subtracted from corresponding per cents in the distribution of the characteristic within the male population. The resulting percentage-point difference indicates the gender gap in the characteristic considered. In our case, the proportion of women employed is lower than the proportion of men employed by 34 percentage points.
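The differential prevalence can be written as a one-line calculation. The sketch below uses the percentages quoted above (39 per cent of women and 73 per cent of men employed); the function name is hypothetical.

```python
def gender_gap_points(pct_men, pct_women):
    """Percentage-point difference: men's per cent minus women's per cent."""
    return pct_men - pct_women


gap = gender_gap_points(pct_men=73, pct_women=39)
print(f"Gender gap in employment: {gap} percentage points")
```

A positive value means the characteristic is more prevalent among men, a negative value that it is more prevalent among women.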
The percentage distribution of the categories of a characteristic for each sex is the basis of most gender indicators. A few examples include the labour force participation rate, the literacy rate, the school attendance rate and contraceptive use. Based on the proportions calculated in columns 3 and 4 in table IV.1, two indicators of the status of women and men on the labour market can be derived directly. For example, the proportion of women who are employed (39 per cent in our case) is actually the indicator employment-to-population ratio, one of the indicators for the first Millennium Development Goal on the eradication of poverty and hunger. Furthermore, the proportion of women who are employed or unemployed gives the labour force participation rate (in our case, the labour force participation rate for women is 39+2=41 per cent). Based on the data presented in the table, two other indicators can be calculated: the unemployment rate (which is the proportion of unemployed in the total of employed and unemployed); and the employment rate (which is the proportion of employed in the total of employed and unemployed).
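The four labour market indicators named above differ only in their numerators and denominators. The following sketch makes this explicit; the counts are hypothetical, chosen so that the employed and unemployed shares match the 39 and 2 per cent quoted for women, and the function and key names are assumptions.

```python
def labour_indicators(employed, unemployed, population):
    """Return the four indicators, in per cent, for one sex."""
    labour_force = employed + unemployed
    return {
        # Denominator: total population in the age group
        "employment_to_population_ratio": 100 * employed / population,
        "labour_force_participation_rate": 100 * labour_force / population,
        # Denominator: the labour force (employed plus unemployed)
        "unemployment_rate": 100 * unemployed / labour_force,
        "employment_rate": 100 * employed / labour_force,
    }


# Hypothetical counts for women aged 15-64
women = labour_indicators(employed=390, unemployed=20, population=1000)

for name, value in women.items():
    print(f"{name}: {value:.1f} per cent")
```

The unemployment rate and the employment rate share the same denominator, so they add up to 100 per cent.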
Table IV.1
Economic activity status for population aged 15-64, Peru, 2007
[Table body not available; surviving headings: Sex distribution (per cent); Not economically active population]
Source: United Nations, 2012.
Sex distributions within the categories of a characteristic are shown in columns 5 and 6 in table IV.1. In this case the proportions are calculated by rows, as opposed to the previous type of proportions, which are calculated by columns. For example, 36 per cent of the employed are women and the rest, 64 per cent, are men. The share of women among the employed is calculated as the number of women employed divided by the total number of women and men employed and multiplied by 100.
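The difference between the two types of proportions can be illustrated on a small two-way table. This is a sketch only: the counts are hypothetical (they are not the figures of table IV.1, although they give a similar 36/64 split among the employed), and the function names are assumptions.

```python
# Hypothetical counts (in thousands): rows are categories, columns are sexes
counts = {
    "employed": {"women": 36, "men": 64},
    "not employed": {"women": 56, "men": 24},
}


def row_percentages(table):
    """Sex distribution within each category: every row sums to 100."""
    result = {}
    for category, by_sex in table.items():
        row_total = sum(by_sex.values())
        result[category] = {sex: 100 * n / row_total for sex, n in by_sex.items()}
    return result


def column_percentages(table):
    """Distribution of each sex across the categories: every column sums to 100."""
    column_totals = {}
    for by_sex in table.values():
        for sex, n in by_sex.items():
            column_totals[sex] = column_totals.get(sex, 0) + n
    return {
        category: {sex: 100 * n / column_totals[sex] for sex, n in by_sex.items()}
        for category, by_sex in table.items()
    }


by_row = row_percentages(counts)
by_column = column_percentages(counts)

print(f"Share of women among the employed: {by_row['employed']['women']:.0f} per cent")
print(f"Share of women who are employed: {by_column['employed']['women']:.0f} per cent")
```

The same cell count yields two different percentages depending on whether the row total or the column total is used as the denominator.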
Among the gender indicators constructed on the basis of the sex distribution within a category of population are the proportion of seats in parliament held by women, the share of girls among children out of school, the share of women among agricultural workers and the share of women among the older population who are living alone.
This type of indicator is often used for population groups known to have an overrepresentation of women or men. The selected groups are often linked to a policy concern. For example, in many countries women represent a minority of parliament members, ministers, chief executives of corporations, mayors and researchers. Policies based on gender quotas are used by some countries to increase the participation of women in those groups.
The percentage of women and the percentage of men in a group always add up to 100 per cent. Because of that, often only one of the indicators (usually share of women) is presented in tables or graphs.
Particular compositional aspects of a population can be made explicit by use of ratios. A ratio is a single number that expresses the relative size of two numbers. The ratio of one number A to another number B is defined as A divided by B. Ratios can take values greater than unity. Because of the way they are calculated, proportions can be considered a special type of ratio in which the denominator includes the numerator. Ordinarily, however, the term ratio is used to refer to instances in which the numerator (A) and the denominator (B) represent separate and distinct categories. Ratios can be expressed in any base that happens to be convenient; however, the base of 100 is often used.
A well-known example of a ratio based on qualitative variables is the sex ratio: the number of males per 100 females, used to state the degree to which members of one sex outnumber those of the other sex in a population or subgroup of a population. A variation of this indicator is the sex ratio at birth, defined as the number of male live births per 100 female live births.
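A minimal sketch of the sex ratio follows. The counts are hypothetical and the function name is an assumption; the base of 100 follows the definition above.

```python
def sex_ratio(males, females, base=100):
    """Number of males per `base` females."""
    if females <= 0:
        raise ValueError("number of females must be positive")
    return base * males / females


# Hypothetical counts of live births in one year
ratio_at_birth = sex_ratio(males=1050, females=1000)
print(f"Sex ratio at birth: {ratio_at_birth:.0f} male births per 100 female births")
```

Values above 100 indicate that males outnumber females in the group considered, and values below 100 the reverse.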
Other gender indicators based on sex ratios may involve the standardization of the variables used. For example, a gender parity index calculated for participation at various levels of education is intended to reflect the surplus of girls or boys enrolled in school. The indicator can be calculated simply by dividing the number of girls enrolled by the number of boys enrolled. This gives a good estimation of the distribution by sex in enrolment. The indicator gives a poor measure of gender differences in access to education, however, because the differences in the number of girls and number of boys that should be in school (the school-age population) are not taken into account. An alternative calculation of the indicator that controls for the sex composition of the school-age population uses the ratio of net enrolment rates (or gross enrolment ratios) for girls to net enrolment rates (or gross enrolment ratios) for boys.
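The two ways of calculating the gender parity index can give different pictures, as the following sketch suggests. All figures are hypothetical and the function names are assumptions.

```python
def enrolment_rate(enrolled, school_age_population):
    """Enrolment as a percentage of the school-age population."""
    return 100 * enrolled / school_age_population


def gender_parity_index(value_girls, value_boys):
    """Ratio of the girls' value to the boys' value."""
    return value_girls / value_boys


# Hypothetical data: fewer girls than boys in the school-age population
girls_enrolled, girls_school_age = 450, 500
boys_enrolled, boys_school_age = 500, 625

# Simple version: ratio of the numbers enrolled
gpi_counts = gender_parity_index(girls_enrolled, boys_enrolled)

# Version that controls for the sex composition of the school-age population
gpi_rates = gender_parity_index(
    enrolment_rate(girls_enrolled, girls_school_age),
    enrolment_rate(boys_enrolled, boys_school_age),
)

print(f"GPI based on counts: {gpi_counts:.3f}")
print(f"GPI based on enrolment rates: {gpi_rates:.3f}")
```

In this hypothetical case the count-based index is below 1 although girls have the higher enrolment rate, which is why the rate-based calculation is the better measure of access.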
In general, proportions and ratios are useful for analysing the composition of a population or of a set of events. Rates, in contrast, are used to study the dynamics of change. The rates most often used in gender statistics are rates of incidence. A rate of incidence is usually defined as the number of events that occur within a given time interval (usually a year) divided by the number of members of the population who were exposed to the risk of the event during the same time interval. Rates can be considered a special type of ratio, in the sense that they are obtained by dividing one number (of events) by another number (of population exposed to the event). In calculating rates, it is usually assumed that the events are evenly distributed throughout the year, while the population at risk is approximated as the midyear population. Demographic rates such as fertility rates and mortality rates are typical examples of rates calculated in gender statistics. By convention, some ordinary percentage figures showing the composition of a population group are called rates. For example, what is called a literacy rate is actually a simple percentage of the population that is literate.
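A rate of incidence can be sketched as below. The figures are hypothetical, the function names are assumptions, and the midyear population is approximated here as the average of the populations at the start and end of the year, which is one common simplification rather than the only option.

```python
def midyear_population(start_of_year, end_of_year):
    """Approximate the midyear population as the average of the two end points."""
    return (start_of_year + end_of_year) / 2


def incidence_rate(events, population_at_risk, per=1000):
    """Events during the interval per `per` members of the population at risk."""
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    return per * events / population_at_risk


# Hypothetical data for women in one year
deaths = 80
population = midyear_population(start_of_year=9900, end_of_year=10100)

rate = incidence_rate(deaths, population)
print(f"Mortality rate: {rate:.1f} deaths per 1,000 women")
```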
When data on population exposed to risk are not easily available, a close approximation of that population is used as a denominator to summarize the incidence of the events considered. The indicator obtained is not considered a rate anymore, but a ratio. For example, in the case of maternal mortality, when the originating population (the number of pregnant women) is not available, the indicator is calculated on the number of live births, and is more accurately called a maternal mortality ratio.
Data used for the numerator and data used for the denominator in calculating rates sometimes come from different sources. For example, in the case of mortality rates, data on deaths used for the numerator may come from the civil registration system, while data on population used for the denominator may come from population censuses. When data from different sources are to be combined, it is essential to ascertain whether they are comparable in terms of the population groups, geographic areas and time period covered (see box IV.2).
A probability is similar to a rate, with one important difference: the denominator is composed of all those persons in a given population at the beginning of the period of observation. Typical examples are the infant mortality rate and the under-5 mortality rate. The numerators are infant and child deaths, respectively. The denominator used is the number of births, which represents the population at risk of dying at the beginning of the period of observation.
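The infant mortality rate, calculated on the number of births, can be sketched as follows. The figures are hypothetical and the function name is an assumption; expressing the result per 1,000 live births follows the usual convention for this indicator.

```python
def infant_mortality_rate(infant_deaths, live_births, per=1000):
    """Infant deaths per `per` live births (births are the population at risk
    at the beginning of the period of observation)."""
    if live_births <= 0:
        raise ValueError("number of live births must be positive")
    return per * infant_deaths / live_births


# Hypothetical data by sex
imr_girls = infant_mortality_rate(infant_deaths=27, live_births=1000)
imr_boys = infant_mortality_rate(infant_deaths=33, live_births=1000)

print(f"Infant mortality, girls: {imr_girls:.0f} per 1,000 live births")
print(f"Infant mortality, boys: {imr_boys:.0f} per 1,000 live births")
```

Unlike a rate of incidence, the denominator here is not a midyear population but the initial cohort of births.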
Measures of composition or distribution for quantitative variables
In gender statistics, the measures of central tendency and dispersion commonly used to analyse continuous variables are the median and quantiles, the arithmetic mean and the standard deviation.
Medians and quantiles
The median is the value that divides a set of ranked observations into two groups of observations of equal size. Examples of indicators based on the median are the median age of the population and the median income in the population. The concept of the median can be generalized to obtain quantiles, which divide a ranked distribution into groups with an equal number of observations. Examples of quantiles are quartiles, quintiles, deciles and percentiles. Quartiles divide the ranked distribution into 4 equal groups, quintiles into 5 groups, deciles into 10 groups and percentiles into 100 groups. These measures are often used to present the distribution of income or wealth scores.
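The median and quartiles can be obtained with the Python standard library, as in the sketch below. The incomes are hypothetical. Several conventions exist for interpolating quantiles, so other software may return slightly different cut points; the default method of statistics.quantiles is used here.

```python
import statistics

# Hypothetical monthly incomes for a small group of women
incomes = [120, 150, 180, 200, 240, 300, 360, 450, 600]

median = statistics.median(incomes)

# n=4 returns the three cut points that divide the ranked data into quartiles
quartiles = statistics.quantiles(incomes, n=4)

print(f"Median income: {median}")
print(f"Quartile cut points: {quartiles}")
```

The second quartile cut point coincides with the median, which shows how quantiles generalize it.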
Means and standard deviation
The arithmetic mean (or average) is defined as the sum of values recorded for a quantitative variable divided by the total number of observations. Examples of indicators based on arithmetic mean include the average time-use for unpaid work by sex, the average size of land owned by sex of the owner, mean age at first marriage by sex and mean age of the mother at first child. Some gender indicators are calculated as ratios between the averages calculated for women and for men. For example, one of the indicators commonly used to show the gender pay gap is the ratio of female to male earnings in manufacturing. It is calculated by dividing the average earnings gained by women employed in manufacturing by the average earnings gained by men employed in manufacturing.
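The ratio of female to male earnings can be sketched as the ratio of two means. The earnings below are hypothetical, and expressing the ratio in per cent is a presentation choice made here for illustration.

```python
from statistics import mean

# Hypothetical monthly earnings in manufacturing
earnings_women = [800, 900, 1000]
earnings_men = [1000, 1200, 1400]

mean_women = mean(earnings_women)
mean_men = mean(earnings_men)

# Ratio of female to male average earnings, in per cent
earnings_ratio = 100 * mean_women / mean_men

print(f"Average earnings, women: {mean_women:.0f}")
print(f"Average earnings, men: {mean_men:.0f}")
print(f"Female-to-male earnings ratio: {earnings_ratio:.0f} per cent")
```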
Deviations from the mean are differences between the values of each observation for a particular variable and the mean of all values observed for that variable. Values of some observations are greater than the mean, so their deviations from the mean are positive; values of other observations are smaller than the mean, so their deviations from the mean are negative. When the deviations from the mean are squared, all the negative deviations become positive. The sum of all squared deviations divided by the number of observations (or by the number of observations minus 1 in the case of data from sample-based surveys) is called the variance. Variance is a measure of variability in the distribution of a variable. It represents the degree to which individuals differ from the mean value of a variable. The greater the spread of observations, the greater the variance. Because the variance is measured in squared units of the variable, it is difficult to interpret its values. Taking the square root of the variance returns the measure to the original unit of the variable. This measure is called the standard deviation. The size of the standard deviation relative to that of the mean is called the coefficient of variation.
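The steps described above can be followed in a short sketch using the Python standard library. The data are hypothetical daily hours of unpaid work; statistics.pvariance divides by the number of observations and statistics.variance by the number of observations minus 1.

```python
import statistics

# Hypothetical daily hours of unpaid work reported by eight women
hours = [2, 4, 4, 4, 5, 5, 7, 9]

mean_hours = statistics.mean(hours)

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean_hours) ** 2 for x in hours)

variance_all = statistics.pvariance(hours)    # divides by n
variance_sample = statistics.variance(hours)  # divides by n - 1 (sample data)

std_dev = statistics.pstdev(hours)            # square root of the variance
coefficient_of_variation = std_dev / mean_hours

print(f"Mean: {mean_hours}")
print(f"Variance (n): {variance_all}, variance (n - 1): {variance_sample:.3f}")
print(f"Standard deviation: {std_dev}")
print(f"Coefficient of variation: {coefficient_of_variation:.2f}")
```

The standard deviation is expressed in hours, the same unit as the data, whereas the variance is in squared hours.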
Although measures of dispersion such as the standard deviation and the coefficient of variation are not often presented in gender statistics, they have an important role in measuring the degree of association between variables and in making inferences about a population on the basis of data collected from a sample of that population.
Box IV.2 Using data from different sources
When data from different sources are to be combined, it is essential to ascertain whether they are comparable in terms of coverage, time period, definitions and concepts. Statistics from different government sources may differ in arrangement, detail and choice of derived figures. Moreover, what appear to be comparable figures may not be, because of errors or variations in classification or data-processing procedures. Lack of comparability can also be a problem with time-series data if concepts or methods have changed from one period to another.
Checks for consistency and comparability between different sources should be made whenever different sources are to be combined. Obtaining comparable data for the period covered by a study or for completing a time series should be a paramount concern. It is most problematic when different sources are used for the same indicator (say, if missing years require supplementary data). Any variations in concepts from different sources and even different years within the same source should be thoroughly checked.
In most cases these checks can be made by reviewing the source’s documentation. It is also a good idea to consult specialists in different fields who may themselves supply or use the data. These specialists often have additional information on the availability of data (which may not be well publicized). They often understand special considerations of specific types of data and know of existing evaluations.
Source: Excerpt from United Nations, 1997.