Statistics New Zealand Graphics Guidelines

The Graphics Guidelines have been prepared as a supplement to the Protocols for Official Statistics. They will assist with the implementation of Principle 8 of the Protocols.

Principle 8: in analysing and reporting the results of a collection, objectivity and professionalism must be maintained and the data impartially presented in ways which are easy to understand.

1. Introduction

Graphs have two primary uses. Firstly, they can be used to explore and analyse data in order to uncover patterns and relationships. The second use of graphs forms the scope of these guidelines: the communication and display of results.

Graphs are widely used to communicate information. Unfortunately a focus on eye-catching graphic design and a lack of attention to principles for accurate presentation of information can result in graphs which are not clear and are misunderstood. The objective of these guidelines is to provide assistance in the production of graphs which accurately reflect the major story in the data and are presented in the clearest and most consistent possible way.

The design principles that follow are based primarily on recommendations from the following sources.

1. A. Wallgren, B. Wallgren, R. Persson, U. Jorner, and J. Haaland (1996). Graphing Statistics and Data, (Statistics Sweden). 93pp, Sage Publications Inc, Newbury Park.

2. E. Tufte, (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut.

3. N. Fisher, Informative graphics. CSIRO.

4. The Australian Bureau of Statistics graphics standards (1990).

These references (especially the first one) should be consulted where information is required that is beyond the scope of this document.

Producing graphs is an art as well as a skill. While adherence to the points in these guidelines will go a long way towards ensuring that a graph presents data in the best way possible, the process is not complete until at least one other person has reviewed the graph. This final step is vital.

Take particular care when using computer packages to produce graphs. Many default settings, such as vertically written text or large spacing between bars on a bar graph, are not ideal for good graphics.

1.1 When and why use graphics?

Graphs can be more revealing than statistical tables. The objective of a graph should be to convey the major story being revealed by the data in an unambiguous and illuminating form. Graphs should not only emphasise important statistical messages and indicate relative sizes or trends, but also create reader interest in the statistics.

The first step in deciding what to graph is to analyse the statistical output and understand the major elements to be represented. It should then be decided whether a graph is the best way of representing these elements. A table may be better.

How does one choose between use of a graph or a table? There are some fairly simple indicators of situations in which a table will be preferable. These are where the data sets:

1. are very small (perhaps just 3 or 4 values),

2. have several cross-classifications,

3. have comments attached to some of the data points,

4. contain numerical values which are of direct interest, or

5. contain numerical values that are likely to be required for future reference.

However in general a graph is preferable to a table since patterns can more easily be revealed.

Graphs should generally be located as closely as possible to the relevant tabular or descriptive presentation. In some cases, however (for example in small publications, or those where users may wish to compare one graph with another), it may be more appropriate to show all the graphs together.

1.2 Types of variables

Different types of variables require different sorts of graph. A variable will be one of the following:

1. Qualitative ('Words') e.g. Sex, Region

2. Quantitative ('Numbers')

a) Discrete ('Certain Values') e.g. number of rooms, family size.

b) Continuous ('All Values') e.g. an economic index, age, weight, temperature.

Continuous variables are often grouped into classes such as age ranges.

1.3 What should be in a graph?

The following are principles of constructing a graph. The components of a graph are defined in the Appendix.

- The graph should induce the reader to think about the data it contains, not the graphic design.

- Graphs should not give a false impression of the data by exaggerating differences.

- There should be as much white space as possible on a graph. In practice this means that a large proportion of the ink on a graph should be used to present the data itself. Grids, tick marks, labels etc. should be kept to a minimum.

- Graph 'junk' should be minimised, e.g. hatching, stipples, unnecessary labelling and third dimensions.

- Graphs that contain only a few data points should be small in size.

- The interior of a graph, the plot area, is for data. This region should not be cluttered. Labels should be kept to a minimum, and tick marks and scale labels should be outside the data region. When several data series are included in the data region they must be visually distinguishable.

- The amount of text in the graph should be kept to a minimum. Explanatory text should be restricted to the title of the graph (and, where absolutely necessary, footnotes).

- A graph should still be intelligible after black-and-white photocopying or printing, so lines or bars should be distinguished by more than just colour.

Section 4, 'Graphics standards', gives more detail on what should be in a graph.

2. The process of graphing

The process of graphing falls into two stages, the first comprising data analysis and selection of the graph, and the second comprising construction of the graph and critical review of it.

2.1 First stage: Data analysis and graph selection

1. Perform a statistical analysis of the data set to find out what patterns and relationships (if any) it contains.

2. Decide if this information in the data is to be presented in a graph rather than in a table.

3. Decide on the basic, or primary, variables involved.

4. Identify the types of variables: quantitative or qualitative (categorical).

5. Decide on the specific variables and comparisons of interest.

6. Select an appropriate graph (time series, bar graph, dot graph etc) for the type of data. The interpretation should not be prejudiced by the technique of presentation.

2.2 Second stage: Construction of the graph

1. Construct an initial graph.

2. Consider re-ordering the variables.

3. Consider using extra plots.

4. Consider adding or removing zero.

5. Consider allowing for a break in an axis.

6. Consider changing the size of the graph.

7. Re-check against principles of graph construction.

- Is the graph easy to read?

- Can the graph be misinterpreted?

- Does the graph have a good size and shape?

- Is the graph in the right place?

- Does the graph benefit from being in colour?

8. Try the graph out on somebody.

These steps should be repeated until all points are satisfied.

3. When to use which graphs

3.1 Bar graphs

- A bar graph is best for comparison of quantities.

- Use when graphing a continuous variable by a categorical variable or when graphing classes (e.g. age ranges).

- Keep category labels as short as possible.

- It is often best to align the bars horizontally. This means that there is room for longer category labels (although these should be kept as short as possible). Vertical bar graphs with labels which do not fit neatly along the axis and require very large legends are difficult to interpret. The principal exception is when time is involved: the time axis should always be horizontal.

Example 3.1.1: Simple bar graph

Horizontal bars allow space for long category labels thus facilitating reading of the graph. Note that the bars are not touching and are evenly spaced. This indicates the categorical nature of the data.

- The best way to order the bars depends on the intended purpose of the graph. In Example 3.1.1, the categories are geographical regions ordered from north to south, demonstrating the pattern in expected Maori population growth rates from north to south. Where there is no inherent ordering of the categories or the inherent order of the categories is not relevant to the message, order the bars by size. In Example 3.1.2 the categories are also geographic, but are ordered to demonstrate a pattern of increasing life expectancy.

- If more than one graph is displayed with a common set of categories then the bars should appear in the same order in all the graphs. Graphs that may or must be compared should also be the same size.

- The width of the gap between bars should be about 50 per cent of the bar width.
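
The gap guideline above can be turned into bar positions with a little arithmetic. The following is a minimal sketch in Python, not part of the original guidelines; the function name bar_positions and its parameters are hypothetical and chosen only for illustration.

```python
def bar_positions(n_bars, bar_width=1.0, gap_ratio=0.5):
    """Return the starting edge of each bar along the category axis.

    gap_ratio is the gap between adjacent bars as a fraction of the bar
    width; 0.5 corresponds to the 50 per cent guideline.
    """
    step = bar_width * (1 + gap_ratio)  # distance from one bar start to the next
    return [i * step for i in range(n_bars)]


positions = bar_positions(4, bar_width=2.0)
print(positions)  # [0.0, 3.0, 6.0, 9.0]
```

With a bar width of 2 units, each bar starts 3 units after the previous one, which leaves a gap of 1 unit, i.e. half the bar width.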

Where there is a third variable, e.g. time, one option is to use a grouped bar graph. In this case

- Group by either time period or category, depending on the message you wish to convey.

- Use no more than four categories in a group.

Example 3.1.3: Grouped bar graph. Grouping by years.

Here the graph is vertical so that time runs horizontally as is conventional.

3.3 Line graphs

- A line graph is best for showing changes and trends, especially over time.

- Use when graphing a continuous variable by a continuous variable. A common example is a time series, that is a graph where one variable is time.

- Display a maximum of three dependent variables (i.e. lines) on any one graph. Otherwise the graph can become crowded and difficult to read.

- Use a different line style for each variable, even if the lines are also distinguished by colour. This facilitates black-and-white printing and photocopying.

- Where multiple lines overlap such that they are difficult to distinguish, consider using more than one graph to display the data, or perhaps a grouped bar graph.

- In a graph of a time series, time should run horizontally.

- Consider using a vertical bar graph for time series where the series is short and the message relates to comparison of individual quantities, e.g. yearly results, rather than to changes.

- Where there is a visible seasonal component in a time series then at least two years' data should be graphed, or the seasonal component of the variation may be mistaken for a trend.

- Equal intervals (of time, for example) should be equally spaced. It follows that unequal intervals will be unequally spaced. For example, where data are for 1994, 1995, and 1997, the distance between 1995 and 1997 on the time axis should be twice the distance between 1994 and 1995.

- Where there is a discontinuity in the data, for example because of a change in the definition of a variable, do not join the points across the discontinuity. The discontinuity should be explained in the caption.
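
The spacing rule for unequal time intervals can be checked numerically. The sketch below, offered as an illustration only, maps time values to positions on an axis so that distances stay proportional to elapsed time. The function name axis_positions and the axis length of 90 units are assumptions, not part of the guidelines.

```python
def axis_positions(times, axis_length=100.0):
    """Map time values to positions along an axis of the given length,
    keeping distances proportional to the elapsed time."""
    start, end = min(times), max(times)
    span = end - start
    if span == 0:
        # All observations share one time value: place them at the origin.
        return [0.0 for _ in times]
    return [(t - start) * axis_length / span for t in times]


positions = axis_positions([1994, 1995, 1997], axis_length=90.0)
print(positions)  # [0.0, 30.0, 90.0]
```

The gap between 1995 and 1997 is 60 units, twice the 30 units between 1994 and 1995, as the guideline requires.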

Example 3.3.1: Simple line graph

This time series presents a clear message. In particular the proportions of the graph are such that both the overall trend and the local deviations from it are obvious.

3.4 Histograms

Histograms look like vertical bar graphs except with the bars touching.

- Use to graph the frequency distribution (counts) of classes of a continuous variable.

- The area of each bar represents the quantity. Always try to make the intervals for the continuous variable equal so the bar widths will be equal. However, if the intervals must be unequal, the bar widths should be unequal too, and the height of each bar should be adjusted so that its area still represents the quantity.

For example, where the continuous variable is age and the intervals are five years in all but one category where the interval is ten years, the width of the ten-year bar should be twice that of the other bars. To preserve areas, it should also be half the height of a 5-year bar with the same number of counts.
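
The area-preserving adjustment described above amounts to dividing each count by the ratio of its class width to the standard class width. The following minimal sketch shows the arithmetic; the function name bar_heights and the counts used are hypothetical illustration values, not data from the guidelines.

```python
def bar_heights(counts, widths, standard_width):
    """Height of each histogram bar so that bar area is proportional to count.

    A class k times as wide as the standard class is drawn k times as wide
    and 1/k times as high.
    """
    return [count * standard_width / width for count, width in zip(counts, widths)]


# Hypothetical counts: three 5-year age classes and one 10-year class.
counts = [200, 180, 150, 200]
widths = [5, 5, 5, 10]
heights = bar_heights(counts, widths, standard_width=5)
print(heights)  # [200.0, 180.0, 150.0, 100.0]
```

The 10-year class has the same count as the first 5-year class, so it is drawn at half the height and twice the width.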

Example 3.4.1: Histogram: population pyramid.

A population pyramid is a special case of histogram used for demographic data and comprises two back-to-back horizontal histograms, one for men and one for women. Note in this example that the last age class, 90+, is open-ended, therefore it is unclear what width (and therefore, preserving area, what length) the bar should be. Misinterpretation is avoided in this case because there are so few in this category anyway.

4. Graphics standards

The following requirements should be adhered to for any graph.

4.1 Shape and size

- If the nature of the data suggests the shape of the graphic, follow that suggestion.

- Otherwise the frame should be 1.5 times as wide as it is high.

- Small graphs should be used for simple messages, larger graphs for more complex messages.

- Comparison of related graphs should be facilitated by using identical scales of measurement and placing graphs side by side.

4.2 Graph title

- A graph title must be left aligned.

- The title should be informative but as short as possible. Supplement it with a separate caption under the graph if necessary.

- The title should be in mixed upper and lower (title) case, e.g. 'Sex Ratios of Elderly, Urban and Rural Areas'.

4.3 Plot Border

- A graph should have a border around the plot area if the plot area is the same colour as the rest of the page. This helps visually to connect the elements of the graph.

4.4 Scale

- A scale should be chosen which results in a balanced presentation and assists interpolation between labelled tick points. Use 1, 2 or 5 (or 10, 20, 50, etc.) as scale intervals. This gives easily recognisable values (even numbers and multiples of 5) on the scale. For example, avoid a scale such as 30, 60, 90, 120, which misses 100.

- Intervals should be evenly spaced. Non-linear scales (e.g. logarithmic) should only be used where absolutely necessary and where readers will not be misled.

- Use the same scale and format for graphs that are likely to be compared.
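
One way to pick a scale interval of the 1, 2, 5 form is sketched below. This is an illustrative approach rather than a method prescribed by these guidelines; the function name nice_interval and the default limit of six intervals are assumptions.

```python
import math


def nice_interval(data_range, max_intervals=6):
    """Smallest interval of the form 1, 2 or 5 times a power of ten that
    covers data_range in at most max_intervals steps."""
    if data_range <= 0:
        raise ValueError("data_range must be positive")
    raw = data_range / max_intervals            # smallest interval that would fit
    power = 10 ** math.floor(math.log10(raw))   # power of ten at or below raw
    for multiplier in (1, 2, 5, 10):
        if multiplier * power >= raw:
            return multiplier * power
    return 10 * power  # fallback in case of floating-point rounding


print(nice_interval(120))                   # 20
print(nice_interval(120, max_intervals=4))  # 50
```

For a range of 120 and at most four intervals, the sketch returns 50 (giving 0, 50, 100, 150) rather than 30, so the scale does not skip 100.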

4.5 Axes

- Place the axes at the left and bottom of the graph.

- A second right hand axis should be used where the graph spans a whole page.

- Where the vertical axis has positive and negative values, the zero line should be clearly indicated.

- Two different types of vertical axis for different overlaid graphs should in general not be used as this is confusing to the reader. However occasionally this is a useful tool to compare patterns of trends (see Section 4.18).

4.6 Axis labels

- There should be name labels for both axes. These should begin with an initial capital followed by lower case e.g. 'Number never married'. An exception is where the category labels in a bar graph clearly identify what is being plotted on the axis (e.g. years, region names). In this case an axis name label may only add clutter to the graph.

- The unit and scale of measurement should be placed in the axis title and not in the graph title.

- The interval between the two highest y-axis labels should contain data.

4.7 Tick marks

- Tick marks must be outside of the axes.

- The width of the axis and the number of plot points will determine the number of tick points that are labelled. The number of labelled (major) tick marks must be less than 10 for the horizontal axis and less than 8 for the vertical axis.

- Minor ticks should be kept to the minimum necessary for clarity.

- The data should span the tick marks, i.e. the data should begin at the first tick mark and end at the last tick mark.

- The first and last major tick mark along the horizontal axis of a time series graph must be labelled.

- Do not put ticks between bars on graphs. They have no value and are confusing to readers.

4.8 Tick mark labels

- Numeric tick mark labels should have fewer than 4 digits and must have fewer than 6 digits (i.e. preferably 3 or fewer, with a maximum of 5 digits). Use a comma as a thousands separator, as it makes large numbers easier to read.

- The scale factor is the divisor applied to the values labelling the tick marks, e.g. a maximum scale value of 55,000 with a scale factor of 1,000 will display 55 as the maximum figure on the axis. The correct numeric axis label depends on the scale factor, the scale of the data, and the units of measurement (e.g. a label of '$M' where the scale factor is 1,000,000 and the units are dollars).

- The maximum and minimum values of the numeric scale and the interval between tick marks must be selected appropriately so that suitable values appear for the axis tick labels. The value (maximum minus minimum) must be divisible by the specified interval value with no remainder.

- The tick mark labels should always be written under the plot area, not under the zero line.
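
The scale factor, thousands separator and divisibility rules above can be combined in a short routine. The sketch below is an illustration under stated assumptions: the function name scaled_tick_labels is hypothetical, and all arguments are taken to be integers in the original units of measurement.

```python
def scaled_tick_labels(minimum, maximum, interval, scale_factor=1):
    """Return tick labels from minimum to maximum in steps of interval,
    divided by scale_factor and written with a comma thousands separator."""
    if (maximum - minimum) % interval != 0:
        raise ValueError("(maximum - minimum) must be a multiple of interval")
    labels = []
    for value in range(minimum, maximum + 1, interval):
        scaled = value / scale_factor
        # Drop the decimal part when the scaled value is a whole number.
        if scaled == int(scaled):
            labels.append(f"{int(scaled):,}")
        else:
            labels.append(f"{scaled:,}")
    return labels


print(scaled_tick_labels(0, 60000, 10000, scale_factor=1000))
# ['0', '10', '20', '30', '40', '50', '60']
print(scaled_tick_labels(0, 6000, 2000))
# ['0', '2,000', '4,000', '6,000']
```

A scale running from 0 to 55,000 in steps of 10,000 would be rejected, because 55,000 is not a whole number of intervals.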

4.9 Category labels

- Labels for categories of variables should be as short as possible consistent with interpretability.

4.10 Label alignment

- All labels should run horizontally.

4.11 Data point labels

It may occasionally be necessary to identify specific data points with labels. These should be

- inside the plot border, and

- with a line joining a label to its corresponding point.

4.12 Line styles

- Use different line styles where lines cross or touch each other.

4.13 Legend

- Each line in a graph should be individually labelled if space permits. These labels must be clearly associated with the correct line only.

- Otherwise a legend (see definition of legend in the appendix) should be shown outside the graph area, preferably next to the lines, and to the right.

- Each column or group of columns in a column graph should be individually labelled when clearly possible, in preference to a legend.

4.14 Colours

- When colour is available use it sparingly, and normally only one colour, in soft tones and in a limited range of shades.

- Ensure that lines or bars can still be distinguished when reproduced in black and white, e.g. by using different line styles.

4.15 Fonts

- Use a sans serif font, preferably Arial, for any text. On axes Arial Narrow can be used. Sources should be written in a font such as Goudy Old Style Italic, 7 point. In general the size depends on the publication. As a guideline, the Publication Section of Statistics New Zealand uses 12 point Arial for headings in analytical reports.

4.16 Bars

- Bars must be filled with a solid shade, not hatching.

- Bars in a bar graph should not normally have borders (or the border must be the same shade as the bar itself).

- In grouped bar graphs in monochrome, the bars should be shaded in tints with the first bar shaded 80%, the second 60%, the third (where it exists) 40%, and the fourth (where it exists) 20%. If the background is grey, then the 20% bar may not show up very well. In this case a border is permitted around the bar.

- Where colour is used, it should be muted and preferably only one colour should be used. A colour should not be especially prominent as this could give false emphasis on the category (e.g. black, red and green: the red category is likely to be overemphasised). Recommended colours are blue, (bluish) green and purple.
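
The monochrome tints for grouped bars given above (80%, 60%, 40%, 20%) can be converted to grey levels for use in graphing software. The sketch below assumes that a tint percentage means the proportion of black ink, so an 80% tint is a dark grey; this interpretation, and the names TINTS, tint_to_hex and group_shades, are assumptions made for illustration.

```python
# Tint percentages for the first to fourth bar in a group.
TINTS = [80, 60, 40, 20]


def tint_to_hex(tint_percent):
    """Convert a tint (assumed to be per cent of black) to a grey hex colour."""
    level = round(255 * (100 - tint_percent) / 100)  # 0 is black, 255 is white
    return f"#{level:02x}{level:02x}{level:02x}"


def group_shades(n_bars):
    """Grey shades for a group of up to four bars, darkest first."""
    if not 1 <= n_bars <= len(TINTS):
        raise ValueError("a group should contain between one and four bars")
    return [tint_to_hex(tint) for tint in TINTS[:n_bars]]


print(group_shades(4))  # ['#333333', '#666666', '#999999', '#cccccc']
```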

4.17 Grids

- Subtle (grey-on-white or white-on-grey) grids should be used where possible to facilitate accurate value judgements.

4.18 Multiple scales

- Care should be taken with graphs with two different vertical scales as they can be difficult to interpret. Usually it is better to present the data on separate graphs (see Section 4.5).

- They are easiest to interpret where changes in pattern, not in levels, are of interest.

Example 4.18.1: Poor example: graph with two different vertical scales.

There is no relationship between the two scales so they only confuse the reader. In addition there is a broken axis which is not clearly indicated. This graph would be better as two separate graphs.

4.19 Use of symbols and abbreviations

- To save space, text in axis titles, tick mark labels and legends should use symbols and abbreviations as long as these are generally well understood. If in doubt, avoid their use.

- The use of symbols and abbreviations in graphs depends on the space available, but should be used only where necessary or where the abbreviation/symbol is generally well understood.

- Some examples of numeric axis titles are:

index - for index numbers

per cent - for per cent

ratio - for ratios that are not per cent

(000) - for thousands

million - for millions

$ - for dollars

$(000) - for thousands of dollars

$(million) - for millions of dollars

- For labels with months, in small graphs abbreviate to the first letter only without a full stop, i.e. J F M A M J J A S O N D.

4.20 Broken axes

Often the zero level for a continuous variable needs to be shown to indicate the level of the data and to prevent a misleading picture, but the scale required to do so would visually suppress small but interesting differences between the data points. In this case a percentage change graph should be considered instead. If this is not possible, the solution is to use a broken axis.

- When a broken axis is used it should be indicated clearly. For both bar and line graphs the preferred way of showing this is with a clear break across all the bars and the axis as shown in Example 4.20.1. A double slash at the y-axis may also be required. No indication, or only subtle indication, of a broken axis might cause the reader to have an exaggerated impression of the importance of small differences between the bars or data points.
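
The percentage change graph suggested above as an alternative to a broken axis requires the series to be converted to period-on-period changes first. The following is a minimal sketch of that conversion; the function name percentage_changes and the values used are hypothetical.

```python
def percentage_changes(values):
    """Percentage change from each value to the next in the series."""
    changes = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            raise ValueError("percentage change is undefined from a zero value")
        changes.append((current - previous) * 100 / previous)
    return changes


# Hypothetical series with small differences relative to its level.
series = [200, 210, 231]
print(percentage_changes(series))  # [5.0, 10.0]
```

Plotted from zero, the original series would show little visible variation, whereas the percentage changes can be graphed on a scale that includes zero without suppressing the differences.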

Example 4.20.1: Broken axes.

The following graphs have clearly indicated broken axes.