General principles of graphic display
A graphical chart provides a visual display of data that otherwise would be presented in a table; a table, in turn, displays data that would otherwise be presented in text. Ideally, a chart should convey ideas about the data that would not be readily apparent if they were displayed in a table or as text.
The three standards for tabular display of data – the efficient display of meaningful and unambiguous data – apply to charts as well. As with tables, it is crucial to good charting to choose meaningful data, to clearly define what the numbers represent, and to present the data in a manner that allows the reader to quickly grasp what the data mean. As with tabular display, data ambiguity in charts arises from the failure to precisely define just what the data represent. Every dot on a scatter plot, every point on a time series line, every bar on a bar chart represents a number (actually, in the case of a scatter plot, two numbers). It is the job of the chart’s text to tell the reader just what each of those numbers represents.
Designing good charts, however, presents more challenges than tabular display as it draws on the talents of both the scientist and the artist. You have to know and understand your data, but you also need a good sense of how the reader will visualize the chart’s graphical elements.
Two problems arise in charting that are less common when data are displayed in tables. Poor choices, or deliberately deceptive choices, in graphic display can provide a distorted picture of the numbers and the relationships they represent. A more common problem is that charts are often designed in ways that hide what the data might tell us, or that distract the reader from quickly discerning the meaning of the evidence presented in the chart. Each of these problems is illustrated in the two classic texts on data presentation: Darrell Huff’s How to Lie with Statistics (1994) and Edward Tufte’s The Visual Display of Quantitative Information (1983).
The components of a chart
There are three basic components to most charts:
- the labeling that defines the data: the title, axis titles and labels, legends defining separate data series, and notes (often, to indicate the data source),
- scales defining the range of the Y (and sometimes the X) axis, and
- the graphical elements that represent the data: the bars in bar charts, the lines in time series plots, the points in scatter plots, or the slices of a pie chart.
Titles. In journalistic writing a chart title will sometimes state the conclusion the writer would have the reader draw from the chart. If figure 2 were used in a Governors State University press release, the title “Tuition and Fees Lowest at GSU” might be appropriate. In academic writing, the title should be used to define the data series, as is shown in figure 2, without imposing a data interpretation on the reader. Often, the units of measurement are specified at the end of the title after a colon or in parentheses in a subtitle (e.g. “constant dollars”, “% of GDP”, or “billions of US dollars”).
Axis titles. Axis titles should be brief and should not be used at all if the information merely repeats what is clear from the title and axis labels. It would be redundant to repeat the phrase “Tuition and Fees” in the Y axis in figure 2, and the x-axis title, “University”, is completely unnecessary. If the title of the chart has the subtitle “% of GDP”, it is not necessary to repeat either the phrase or the word “percent” in the axis title.
Axis scale and data labels. The value or magnitude of the main graphical elements of the chart is defined by either or both the axis scale and individual data labels. Avoid using too many numbers to define the data points. A chart that labels the value of each individual data point does not need labeling on the y-axis. If it seems necessary to label every value on a chart, consider that a table is probably a more efficient way of presenting the data.
Legends. Legends are used in charts with more than one data series. They should not be placed on the outside of the chart in a way that reduces the plot area, the amount of space given to represent the data. In figure 2, the legend is placed inside the chart (although some think that detracts from the main graphical elements); it could also be placed at the bottom of the chart (where the unnecessary “University” now stands).
Gridlines. If used at all, gridlines should use as little ink as possible so as to not overwhelm the main graphical elements of the chart.
The source. Specifying the source of the data is important for proper academic citation, but it can also give knowledgeable readers, who are often familiar with common data sources, important insights into the reliability and validity of the data. For example, knowing that crime statistics come from the FBI rather than the National Crime Victimization Survey can be a crucial bit of information.
Other chart elements. The amount of ink given over to the non-data elements of a chart that are not necessary for defining the meaning and values of the data should be kept to an absolute minimum. Plot area borders and plot area shading are unnecessary. Keep the shading of the graphical elements simple and always avoid using unnecessary 3D effects. In most of the charts that follow, even the vertical line defining the y-axis has been removed, following the commendable charting standards of The Economist magazine.
When graphic design goes badly
The most general standards of charting data are thus the following:
- Present meaningful data.
- Define the data unambiguously.
- Do not distort the data.
- Present the data efficiently.
To see what happens when these rules are violated, consider figure 3, taken from Robert Putnam’s Bowling Alone (where it is labeled figure 47), a work that contains many good and bad examples of graphical data display (and, unfortunately, no tables at all). In just one chart, Putnam violates three of these four fundamental rules of data presentation: the chart does not depict meaningful data; the data it does depict are ambiguous; and the chart design is seriously inefficient. One can’t accuse Putnam of distorting the data only because his main conclusions are not derived from the data presented in the chart.
Of these, let’s consider the inefficiency first: the first thing you notice about the chart is that the graphical elements are represented in three dimensions. On both efficiency and truthfulness this is unfortunate; the 3D effect is entirely unnecessary and in this case serves to distort the visual representation of the data. Had not the data labels been shown on the top of each bar, it would not be readily apparent that column A is in fact bigger than column F, or that column C is the same size as B. In addition the chart suffers from what might be called “numbering inefficiency”: Putnam uses 13 numbers to represent 6 data points. Eliminating the 3D, as shown in figure 4, offers a more exact representation of the data with a lot less ink.
There are two problems of ambiguous data in the chart. The first is partly resolved in Putnam’s text, where it is explained that bar E is the percentage of women who are homemakers out of concern for their kids while bar A is the percentage of women who are working full-time because they need the money. It’s not quite clear what the numbers for those who work part-time mean. In the case of bar C, for example, are the women working only part-time because of the kids, or are they not full-time homemakers because of the money?
The other ambiguity, however, is not for the lack of proper labeling. If one looks at the chart quickly, the first impression is that only 11% of women who work full-time do so for reasons of personal satisfaction. But that is not the case. Look at the y-axis title. Or notice that all of the percentages add up to 100. Of all the women in the survey, 11% were in the single category of “employed full-time for reasons of personal satisfaction.” This is not what one expects in a bar chart, but given the data Putnam has decided to display, there isn’t a whole lot that can be done with the chart to fix it.
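The difference between the two readings is a difference in the denominator. The following Python sketch illustrates it with hypothetical survey counts: only the 11% share for the full-time, personal-satisfaction cell is chosen to echo the text, and every other number, along with the function names, is an assumption made for illustration.

```python
# Hypothetical survey counts; only the 11% share for the
# full-time / satisfaction cell is chosen to echo the text.
counts = {
    ("full-time", "necessity"): 310,
    ("full-time", "satisfaction"): 110,
    ("part-time", "necessity"): 100,
    ("part-time", "satisfaction"): 100,
    ("homemaker", "necessity"): 160,
    ("homemaker", "satisfaction"): 220,
}

total = sum(counts.values())

def share_of_all(cell):
    """Percentage of every respondent who falls in this cell."""
    return counts[cell] / total * 100

def share_within_status(cell):
    """Percentage of respondents with the same work status who fall in this cell."""
    status = cell[0]
    group = sum(n for (s, _), n in counts.items() if s == status)
    return counts[cell] / group * 100

cell = ("full-time", "satisfaction")
print(f"{share_of_all(cell):.0f}% of all women, "
      f"{share_within_status(cell):.0f}% of women working full-time")
```

With these made-up counts the same cell is a much larger share of full-time workers than of all respondents, which is why the chart’s text must say which denominator is being used.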
Still, we have to ask, “what does this chart mean?” In particular, what data do the arrows on the bars represent?
A critical standard of good charting is that the chart should be self-explanatory. That there are problems with this chart becomes apparent as soon as the reader encounters Putnam’s page and a half of accompanying text devoted not to explaining the significance of the data but to explaining what the elements of the chart represent. A careful reading of the text tells us that there are basically three conclusions Putnam would have us draw from the chart:
- Over time (the 1980s and 90s), more women are working.
- They are doing so less for reasons of personal satisfaction and more out of necessity (i.e. to earn money).
- Correspondingly, there has been a significant decline in the number of women who choose to be homemakers for reasons of personal satisfaction.
These three conclusions are directly relevant to Putnam’s general thesis: that over time there has been a decline in social capital (adults are spending less time raising children and developing the social capital of future generations) driven in part by the demands of the expanding work force.
Based on the textual discussion that Putnam offers, it becomes clear that the most meaningful data are represented in the chart not by the height of the bars but by the direction of the arrows on the bars. Recall that, as a general rule, data presentations that include more than one time point provide for much more meaningful analyses than cross-sectional or single-time-point presentations. Although most of the data analysis in Bowling Alone uses time series data, in this case Putnam averages 21 years of data down to single data points represented by the chart’s bars, with the time series change represented by directional arrows. Thus, the most meaningful comparison in the chart – the comparison that supports the conclusion that Putnam seeks to draw from the data – is not that bar A is higher than bar B or F, but that the arrow for bar A is going up while the arrow for bar F is going down.
The crucial comparison is made directly in figure 5, based on the data presented in the textual discussion. Moreover, it directly illustrates several points that neither the text nor the original chart makes clear: in 1978, a plurality of women were homemakers for reasons of personal satisfaction; in 1999, women who worked full time for financial reasons were the plurality.
Note also that figure 5 simplifies the data presentation by eliminating the ambiguous part-time category: for part-timers, is “personal satisfaction” the reason for not staying at home or the reason for not working full time? And it clarifies that the “necessity” refers to “kids” in the case of homemakers and to “money” in the case of full time workers.
Choose the right type of chart
Most charts are a variation of one of four basic types: pie charts, bar charts, time series charts and scatter plots. Choosing the right type of chart depends on the characteristics of the data and the relationships you want displayed.
Pie charts are used to represent the distribution of the categorical components of a single variable. Note that as a general rule, multivariate comparisons provide for more meaningful analyses than do single variable distributions and for this and other reasons pie charts should be rarely used, if at all.
Rules for pie charts:
- Avoid using pie charts.
- Use pie charts only for data that add up to some meaningful total.
- Never ever use 3D pie charts; they are even worse than 2D pies.
- Avoid forcing comparisons across more than one pie chart.
Pie charts should rarely be used. Pie charts usually contain more ink than is necessary to display the data and the slices provide for a poor representation of the magnitude of data points. Do you remember as a kid trying to decide which slice of your birthday cake was the largest? It is more difficult for the eye to discern the relative size of pie slices than it is to assess relative bar length. Forcing the reader to draw comparisons across the two pie charts shown in figure 6 is also a bad idea: without looking at the data label percentages in the above figures one cannot easily determine whether the FY 2000 slices are larger or smaller than the corresponding FY 2007 slices.
3D pie charts are even worse, as they also add a visual distortion of the data (in figure 7, the thick 3D band exaggerates the size of the corporate income tax slice).
All the information in the pie charts above can be conveyed more precisely and with far less ink in the simple bar chart shown in figure 8.
Nevertheless, people like pie charts. Readers expect to see one or two pie charts similar to those in figure 6 at the very beginning of an annual agency budget report. But it would be a big mistake to rely on several pie charts for the primary data analysis in a report.
For those who would ignore all the advice given here and insist that good charts must look pretty, the most recent version of the Microsoft Excel charting software (in Office 2007, beta) will satisfy all your foolish desires: 3D pie charts that gleam and glisten like Christmas tree ornaments, to say nothing about what you can do with the 3D pie chart’s cousins, the donut, cylinder, cone, radar and pyramid charts.
As a general rule, 3D charts are not a good idea even when the data are 3D. In theory they provide for a precise representation of data, but it is rare that they provide a basis for drawing a simple conclusion.
Bar charts typically display the relationship between one or more categorical variables with one or more quantitative variables represented by the length of the bars. The categorical variables are usually defined by the categories displayed on the x-axis and, if there is more than one data series, by the legend.
Rules for bar charts:
- Minimize the ink; do not use 3D effects.
- Sort the data on the most significant variable.
- Use rotated bar charts if there are more than 8 to 10 categories.
- Place legends inside or below the plot area.
- With more than one data series, beware of scaling distortions.
Bar charts often contain little data, a lot of ink, and rarely reveal ideas that cannot be presented much more simply in a table. Minimizing the ink-to-data ratio is especially important in the case of bar charts. Never use a 3D bar chart. Keep the gridlines faint. Display no more than seven numbers on the y-axis scale. If there are fewer than five bars, consider using data labels rather than a y-axis scale; it doesn’t make sense to use a five-numbered scale when the exact values can be shown with four numbers.
Look at figure 9 and you can quickly grasp the main point – the United States has the highest child poverty rate among developed nations – but then spend some time with it and you’ll discover other interesting things. Note, for example, the differences in child and elderly poverty across nations, or that the three countries at the top, with the lowest child poverty rates, are Scandinavian countries, while five of the seven countries with the highest child poverty are English-speaking countries.
As with tables, sorting the data on the most significant variable greatly eases the interpretation of the data. The data in figure 9 are sorted on the child rather than the elderly poverty rates only because most of the research on the topic has focused on child poverty. Note also that if the sorted variable represents time, time should always go from left to right and on the x-axis.
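As a minimal sketch of these two rules, the following Python fragment sorts a small data set on the variable chosen as most significant and applies the rule of thumb for rotating the bars. The country names, the rates, and the function names are hypothetical and are not the values behind figure 9.

```python
# Hypothetical poverty rates (percent); not the values behind figure 9.
rates = [
    {"country": "A", "child": 12.0, "elderly": 9.5},
    {"country": "B", "child": 3.1, "elderly": 6.0},
    {"country": "C", "child": 21.4, "elderly": 18.2},
    {"country": "D", "child": 7.8, "elderly": 11.3},
]

def sort_for_bar_chart(rows, key):
    """Order the categories by the variable the reader should compare first."""
    return sorted(rows, key=lambda row: row[key])

def orientation(rows, max_vertical=8):
    """Rule of thumb from the text: rotate the bars once categories exceed 8 to 10."""
    return "horizontal" if len(rows) > max_vertical else "vertical"

ordered = sort_for_bar_chart(rates, "child")
print([row["country"] for row in ordered], orientation(ordered))
```

Sorting on "elderly" instead would reorder the same bars, so the choice of sort key is itself a statement about which comparison matters most.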
One variation of the bar chart, the stacked bar chart, should be used with caution, especially when there is no implicit order to the categories (i.e. when the categorical variable is nominal rather than ordinal) that make up the bar, as is the case in figure 10. Note how difficult it is to discern the differences in the size of the components on the upper parts of the bar. The same difficulty occurs with the stacked line and area charts.
The stacked bar chart works best when the primary comparisons are to be made across the data series represented at the bottom of the bar. Thus, placing the “teachers” data series at the bottom of the bars in figure 11 (and sorting the data on that series) forces the reader’s attention on the crucial comparison and the obvious conclusion: American teachers are fortunate to have such a large supervisory and support staff.
One common bar charting mistake is including the legend on the right-hand side of the plot area (shown in figure 12). Placing the legend inside the plot area, as in figure 9, or horizontally under the chart title (as in figure 11) maximizes the size of the area given over to displaying the data.
Scaling effects occur when a bar chart (or a line chart, as we will see) displays two data series with numbers of substantially different magnitude: the variation in the data series containing the smaller numbers is obscured. Figure 12, for example, depicts the increase in the labor force participation rate (the percent of the adult population in the labor force) from 60% in 1970 to 67% in 2000, and the increase in the unemployment rate from 5.3% to 7.1%. The immediate visual impression the chart gives is that the labor force participation rate is larger than the unemployment rate (a relatively meaningless comparison), while the important variation in the unemployment rate (an increase of about 34%) is hardly noticeable. Including an additional bar representing the sum of the other bars in a chart (as shown in figure 13) has the same effect of reducing the variation in the main graphical elements.
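The arithmetic behind this scaling effect can be checked directly. The following Python sketch uses the 1970 and 2000 figures quoted above; the helper name percent_change is an illustrative assumption, not part of any charting tool.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100

# Values quoted in the text for 1970 and 2000 (both in percent).
participation = (60.0, 67.0)
unemployment = (5.3, 7.1)

for name, (start, end) in [("participation", participation),
                           ("unemployment", unemployment)]:
    print(f"{name}: {end - start:+.1f} points, "
          f"{percent_change(start, end):+.1f}% relative change")
```

The participation bars move by many more percentage points, yet the relative change in unemployment is roughly three times as large, and that is the variation a shared axis hides.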
To see what happens when most of the bar charting rules are violated, consider the example in figure 14, produced by the Illinois Board of Higher Education (IBHE).
It’s not just the 3D. Look carefully at the x-axis. Using comparable data (the only available data: Fall headcounts rather than 12-month headcounts), eliminating the 3D effects, sorting time from left to right, removing the community college data series, and adjusting the bottom of the scale, we see something in figure 15 that the IBHE chart obscured: private institution enrollments are responding to public demand for higher education; public universities are not.
Some would object to the lack of a zero base for the y-axis scale in figure 15, but I don’t think that the depiction is all that unfair. It is fair to say, I think, that private institutions have accounted for most of the growth in university and college enrollments in the state, a disparity that would appear even more dramatic if an annual change measure were depicted, as in figure 16, with a zero base.
Time series line charts
The time series chart is one of the most efficient means of displaying large amounts of data in ways that provide for meaningful analyses. The typical time series line chart is a scatter plot chart with time represented on the x-axis and the lines connecting the data points.
Rules for time series (line) charts:
- Time is almost always displayed on the x-axis from left to right.
- Display as much data with as little ink as possible.
- Make sure the reader can clearly distinguish the lines for separate data series.
- Beware of scaling effects.
- When displaying fiscal or monetary data over time, it is often best to use deflated data (e.g. inflation-adjusted or % of GDP).
Scaling effects. When two variables with numbers of different magnitudes are graphed on the same chart, the variable with the larger scale will generally appear to have a greater degree of variation; the smaller-scale variable will appear relatively “flat” even though the percentage change is the same. In figure 18, ABC Corp’s stock seems to be growing much faster than XYZ Com’s, yet the rate of increase is identical.
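A small numerical sketch makes the point concrete. The prices and the 10% growth rate below are hypothetical and are not the values plotted in figure 18; the function names are likewise assumptions chosen for illustration.

```python
# Hypothetical prices: both stocks grow 10% per year.
GROWTH = 1.10
YEARS = 5

def grow(start_price, rate=GROWTH, years=YEARS):
    """Price at the end of each year, starting from start_price."""
    return [start_price * rate ** t for t in range(years + 1)]

abc = grow(100.0)   # large-magnitude series
xyz = grow(10.0)    # small-magnitude series

def absolute_rise(series):
    """Change in price units, which is what a shared axis displays."""
    return series[-1] - series[0]

def relative_rise(series):
    """Change as a fraction of the starting value."""
    return series[-1] / series[0] - 1

print(f"ABC: +{absolute_rise(abc):.2f} dollars, {relative_rise(abc):.1%}")
print(f"XYZ: +{absolute_rise(xyz):.2f} dollars, {relative_rise(xyz):.1%}")
```

On a shared y-axis the eye reads the absolute rise, which is ten times larger for the higher-priced stock, while the relative rise of the two series is identical.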
When the differences in scale are so great as to eliminate most of the perceived variation in the smaller-scale variable, using a second scale, displayed on the right-hand side as in figure 19, is sometimes preferable, although this may make the interpretation of the graph more complicated.
Many who have written about graphical distortion condemn the use of two-scale charts because the relative sizes of the two scales are completely arbitrary. This is true: had job approval and unemployment been plotted on the same 0 to 90 y-axis scale, the unemployment rate would be an almost flat line at the bottom of the chart.
One solution to trendlines of different magnitudes is to rescale the variables, calculating the percentage change from a base year – but note that the selection of the base year can produce dramatically different results.
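The sensitivity to the base year is easy to demonstrate. In the sketch below, two hypothetical series (invented for illustration, as is the function name index_to_base) are rescaled as percentage change from a base year, first from 2000 and then from 2001.

```python
def index_to_base(series, base_year):
    """Percentage change of each value relative to the base-year value."""
    base = series[base_year]
    return {year: (value / base - 1) * 100 for year, value in series.items()}

# Hypothetical data: two series of very different magnitude.
spending = {2000: 200.0, 2001: 150.0, 2002: 180.0, 2003: 210.0}
fees = {2000: 4.0, 2001: 4.2, 2002: 4.4, 2003: 4.6}

for base_year in (2000, 2001):
    s = index_to_base(spending, base_year)
    f = index_to_base(fees, base_year)
    print(f"base {base_year}: spending {s[2003]:+.0f}%, "
          f"fees {f[2003]:+.0f}% by 2003")
```

With 2000 as the base, fees appear to grow faster than spending; with 2001 as the base, a year in which spending happened to dip, the ranking reverses. A base year that is an unusual high or low for one series will exaggerate or understate its later change.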
When several time series lines are printed in black and white, it is sometimes difficult to separate out the different trend lines. Mixing solid, dotted, and dashed lines for each variable may solve this problem, although it is sometimes difficult to distinguish between dotted and dashed lines.
The 2D scatter plot is the most efficient medium for the graphical display of data. A simple scatter plot will tell you more about the relationship between two interval-level variables than any other method of presenting or summarizing such data.
Rules for scatter plots:
- Use two interval-level variables.
- Fully define the variables with the axis titles.
- The chart title should identify the two variables and the cases (e.g. cities or states).
- If there is an implied causal relationship between the variables, place the independent variable (the one that causes the other) on the x-axis and the dependent variable (the one that may be caused by the other) on the y-axis.
- Scale the axes to maximize the use of the plot area for displaying the data points.
- It’s a good idea to add data labels to identify the cases.
With good labeling of the variables and cases and common-sense scaling of the x and y-axes, there’s not a lot that can go wrong with a scatter plot, although extreme outliers on one or more of the variables can obscure patterns in the data.
In figure 20, TV viewing is the independent variable. (If you were trying to predict which type of students watch the most TV, the axes would be reversed). The scatter plot contains two optional plotting features: a regression trendline denoting the linear relationship between the two variables and the use of State postal ID data labels to indicate each state’s position on the chart (these labels require a special add-in to the Excel program). Although the chart suffers from overlapping data labels, the interpretation is straightforward; the higher the percentage of students in a state watching more than 6 hours of TV each day, the lower the state’s math scores.
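For readers who want to see what a regression trendline summarizes, the following Python sketch computes an ordinary least-squares line by hand. The five data points are hypothetical and are not the state values behind figure 20; the function and variable names are assumptions chosen for illustration.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical state-level values, not the data behind figure 20.
tv_share = [10, 15, 20, 25, 30]            # x: independent variable
math_score = [285, 280, 272, 268, 260]     # y: dependent variable

slope, intercept = least_squares_line(tv_share, math_score)
print(f"score = {intercept:.1f} {slope:+.2f} * tv_share")
```

A negative slope corresponds to a trendline that falls from left to right. Swapping the two lists would fit a different line, which is one reason the choice of independent variable for the x-axis matters.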
John W. Tukey invented the box plot as a convenient method of displaying the distribution of interval-based variables.
Rules for box plots:
- A simple box plot plots the median and four quartiles of data for an interval level variable.
- Box plots are best used for comparing the distribution of the same variable for two or more groups or two or more points.
- Box plots are an excellent means of displaying how a single case compares to a large number of other cases.
The simple box plot, as shown in figure 21, displays the four quartiles of the data, with the “box” comprising the two middle quartiles, separated by the median. The upper and lower quartiles are represented by the single lines extending from the box. More detailed versions of the box plot restrict the “whiskers” on the plot to 1.5 times the size of the boxes and plot the higher or lower values (outliers) as individual points. Some versions also plot the mean in addition to the median.
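The numbers a detailed box plot draws can be computed with the Python standard library. This is a sketch under stated assumptions: the sample values are hypothetical, the function name is invented, and the quartiles use the "inclusive" method of statistics.quantiles, which is only one of several quartile conventions in use.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers for a detailed box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside),    # whiskers stop at the last value inside the fences
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical vote percentages for nine cases.
sample = [38, 41, 43, 45, 47, 49, 52, 55, 78]
print(box_plot_summary(sample))
```

In the simple version of the plot the upper whisker would run all the way to the maximum value; in the detailed version it stops at the largest value within 1.5 box lengths of the box, and anything beyond that is plotted as an individual point.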
A single box plot (as in figure 21) rarely reveals much about the data, and graphs of single-variable data distributions (using stem-and-leaf or histogram charts) rarely offer a more detailed graphic representation of the data distribution. The real advantages of the box plot come through, however, in single charts using several box plots to compare the distribution of a variable across groups or over time. An especially useful elaboration of the box plot involves plotting an individual case over the box plot to compare single cases to the overall distribution (see figure 22).
Thus, figure 22 displays the percentage Democratic vote for the 50 states over the past seven presidential elections. Labeling a single case, we can see that the Democratic vote in Nevada has moved steadily higher relative to the other states. One can easily imagine applying the same plotting strategy in a variety of other settings, for example, comparing one school district’s test scores to the distribution of test scores across other school districts.