This post is the first in a two-part series called “The Simplified Guide to Understanding Statistics in the Social Sciences” on our blog. The second post in this series is about data reliability and validity.
“There are Three Kinds of Lies: Lies, Damned Lies, and Statistics.” — Mark Twain
This famous quote is a half-truth. The truth in it derives from the fact that statistics can easily be misleading (or misinterpreted) if you are not looking at the whole picture. The falsity of this statement lies in the fact that the numbers cannot in themselves be “lies,” assuming that they are derived from high-quality data. Statistics can only “lie” to those who are not looking for the whole truth.
The bread and butter of understanding statistics is being able to understand the numbers (the actual “statistics”), as well as the illustrations (graphs) that depict differences and relationships between numbers. Read on to learn more about how to understand and properly interpret statistics in terms of both the numbers and graphs.
Deciphering the Digits
The first step to reading statistics is to understand how the numbers are being reported. Are the stats reported as raw numbers, presented as ratios or percentages, or written as a percent change? We will begin by discussing the ways in which statistics can be reported and how you can reinterpret these numbers to make them more digestible.
Often, in social statistics, the numbers being reported refer to individual people. For example, in a poll of 1000 people for or against a proposed law, 400 people may have opined in favor of the law, while the remaining 600 were against the law. These results are the raw numbers: 400 for, and 600 against the proposed law.
Ratios (Proportions) and Percentages
The raw numbers can be used to calculate ratios (also sometimes called proportions) or percentages. Looking at the percentages can be useful to illustrate large-scale trends. In this example, 40% (or 4 out of every 10) of respondents were for, and 60% (or 6 out of every 10) of respondents were against this new law. The percentages here also tell you that the majority of respondents were not in favor of the new law.
Sometimes the raw numbers are more useful than percentages, for example when the sample size is small. Saying that 2% of survey respondents held a certain opinion sounds very different from saying that one person in a sample of 50 did, even though one in 50 is also 2%.
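The poll arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name `share` is made up for this example, and the counts are the ones from the poll and the 50-person sample described in the text.

```python
from fractions import Fraction


def share(count, total):
    """Return (ratio, percent) for `count` out of `total`.

    `share` is a hypothetical helper written for this example.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    ratio = Fraction(count, total)   # exact, reduced fraction
    percent = float(ratio * 100)     # the same value as a percentage
    return ratio, percent


# Poll from the text: 400 for and 600 against, out of 1000 respondents.
for_ratio, for_pct = share(400, 1000)
against_ratio, against_pct = share(600, 1000)
print(f"For: {for_ratio} of respondents ({for_pct:.0f}%)")
print(f"Against: {against_ratio} of respondents ({against_pct:.0f}%)")

# Small sample from the text: one person out of 50.
one_ratio, one_pct = share(1, 50)
print(f"1 of 50 respondents = {one_pct:.0f}%")
```

Keeping the raw count next to the percentage, as the print statements do, makes it easier to spot when a percentage rests on only a handful of people.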
Rates describe how often something occurs within a population, usually expressed per a fixed number of people. Common examples are the death rate, birth rate, and incarceration rate. For example, you may report that the death rate for a given disease is 500 out of every 100,000 people.
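As a rough sketch of how a rate like this is computed, the snippet below scales a raw count to "per 100,000 people." The function name `rate_per` and the death and population figures are assumptions chosen only so that the result matches the 500 per 100,000 mentioned above; they are not real data.

```python
def rate_per(events, population, per=100_000):
    """Events per `per` people (hypothetical helper for illustration)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events * per / population


# Assumed figures, chosen only to reproduce the rate in the text.
deaths = 10_000
population = 2_000_000

death_rate = rate_per(deaths, population)
print(f"{death_rate:.0f} deaths per 100,000 people")
```

Expressing counts per 100,000 lets you compare populations of very different sizes on the same footing.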
Percent gain, or percent change, is the difference between two time periods or two groups, expressed as a percentage of the starting value. Let’s say a group of 50,000 people were surveyed twice. In the first survey, 8,000 people said they would vote for the candidate; in the second survey, 10,000 people said they would vote for that candidate. The increase of 2,000 on a base of 8,000 translates to a percent gain of 25% from the first survey to the second. Another way to say this is that the number of people supporting the candidate at time two is 125% of the number supporting the candidate at time one.
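The survey example can be worked through in Python as a quick check. This is a minimal sketch using the numbers from the text; the function name `percent_change` and the variable names are made up for illustration. The last three lines are an addition to the original example: they compute each survey's share of the full 50,000-person sample, which changes by percentage points rather than by percent.

```python
def percent_change(old, new):
    """Percent change from `old` to `new` (hypothetical helper)."""
    if old == 0:
        raise ValueError("the starting value must be non-zero")
    return (new - old) * 100 / old


sample_size = 50_000
first_survey = 8_000     # supporters in the first survey
second_survey = 10_000   # supporters in the second survey

gain = percent_change(first_survey, second_survey)
relative_size = second_survey * 100 / first_survey
print(f"Percent gain: {gain:.0f}%")
print(f"Second survey as a percent of the first: {relative_size:.0f}%")

# Share of the whole sample at each time, and the percentage-point change.
share_first = first_survey * 100 / sample_size
share_second = second_survey * 100 / sample_size
point_change = share_second - share_first
print(f"Share of sample: {share_first:.0f}% -> {share_second:.0f}% "
      f"({point_change:.0f} percentage points)")
```

The same shift can be described as a 25% gain or as a move of 4 percentage points, so it is worth checking which of the two a report is using.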
Sometimes, translating a statistic into a different measurement can make it easier to understand, or more illustrative of the actual data you are looking at. For example, if statistics are reported as raw numbers, it can be more illuminating to convert these raw numbers to a relative percentage, or proportion.
Do Your Eyes Deceive You?
The most important part of interpreting data visualizations is to know what you are looking at. Examine the axes, legends, and any other information that is available. Check for error bars, which are often included in line and bar graphs, and sometimes in scatter plots, to indicate uncertainty or variability in the patterns shown.
Scales on the plot axes can sometimes be misleading. If two graphs use different scales, the data they depict can look very different, even when the underlying data are the same.
For example, compare these two graphs, which both show immigrants as a share of the US population between 1860 and 2017 (depicted by the orange line in the first graph):
Source: Migration Policy Institute. “U.S. Immigrant Population and Share over Time, 1850-Present”.
Source: BBC News. “Six Charts on the Immigrants Who Call the US Home” (Statistics from the American Community Survey 2000-2017).
These two graphs display the exact same information: the percentage of the US population who are immigrants, as it has changed over time. However, the bottom graph, the “zoomed-in” version, makes the rises in the late 1800s and the 2010s and the fall in the 1970s seem far more drastic.
Data visualization does not typically put two graphs with different scaling side-by-side — this is frowned upon as it can be misleading. However, it is important to know that scaling makes a difference in how the data appears.
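One way to put a number on the effect of scaling is to ask what fraction of the vertical axis a given change occupies. The sketch below does that for two axis ranges. The function name `visual_span` and all of the values are hypothetical, chosen only to illustrate the idea; they are not taken from the two immigration graphs.

```python
def visual_span(low, high, axis_min, axis_max):
    """Fraction of the axis height covered by the move from `low` to `high`.

    Hypothetical helper: it only captures how tall a change looks on a
    linear axis, not anything about the data itself.
    """
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    return (high - low) / (axis_max - axis_min)


# The same hypothetical change (from 10 to 15) drawn on two different axes.
wide_axis = visual_span(10, 15, axis_min=0, axis_max=100)
zoomed_axis = visual_span(10, 15, axis_min=8, axis_max=16)

print(f"Axis from 0 to 100: change fills {wide_axis:.1%} of the height")
print(f"Axis from 8 to 16: change fills {zoomed_axis:.1%} of the height")
```

The underlying change is identical in both cases; only the axis range differs, which is why it pays to read the axis limits before judging how dramatic a rise or fall is.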
Now that we’ve covered the basics of understanding data, we will discuss the five common methods of illustrating statistics: bar graphs, line graphs, pie charts, scatter plots, and pictographs.
Bar graphs use horizontal bars or vertical bars (this is also called a “column” graph) to illustrate statistics. The following bar graph shows the age breakdown of people who have at least one disability:
Example of a bar graph. Source: USAFacts/US Census. “Share of age group with at least one disability.”
Advantages of Bar Graphs
- The major benefit of the bar graph is its visual simplicity. Bar graphs allow you to quickly and clearly compare the numbers side-by-side.
Disadvantages of Bar Graphs
- The visual simplicity of bar graphs is also their major disadvantage. Bar graphs can only depict a single relationship – for example, the relationship between age and disability as shown above. Bar graphs cannot depict more complex relationships between properties of the data.
- A single bar graph is poorly suited to depicting growth or decline, as it typically offers a snapshot of the data at one moment in time.
- The simplified nature of bar graphs can oversimplify, and therefore obscure, more minute differences in the numbers.
When Should I Use a Bar Graph?
Bar graphs are best used to depict stark differences in statistics among a single population concerning a single variable (characteristic).
Line graphs use one or more connected lines to visualize many different data points. An example of a line graph is this visualization of the male and female populations over time:
An example of a line graph. “US Population by Gender” (Male/Female). Source: USAFacts/2020 US Census.
Advantages of Line Graphs
- Line graphs are more complex than bar graphs, but not overly complex. With line graphs, you can visualize more information than with bar graphs, such as the upper and lower limits of the statistics. Line graphs can also compare, or show change across, more than one population sharing a single characteristic.
- Line graphs can depict changes over time.
- Because line graphs literally “connect the dots” (the data points), they are ideal to depict patterns of growth and decline.
Disadvantages of Line Graphs
- Line graphs can obscure differences in the data. The line itself is designed to depict the “best fit,” or average shape of difference, between data points, rather than illustrate exact data point values.
- The line absorbs and connects all dots and disregards outliers — points that are extremely high or extremely low.
- The connecting line cannot clearly show the differences among a large number of data points. Line graphs are clearest when they connect fewer than 50 points on the graph.
When Should I Use a Line Graph?
Line graphs are best for displaying changes over time using statistics that compare only a handful of different groups. With too many groups displayed on the line graph, the lines for each group overlap too much and become visually muddled. Line graphs are also ideal for representing change measured in a handful of time periods; however, if there are too many notches along the x-axis, differences between points at each time will be difficult to see.
Pie charts, as the name indicates, show data as a fraction of a whole – like slices of a pie. The size of each “slice” corresponds to what proportion of the whole “pie” it represents. This pie chart shows the breakdown of average monthly expenses of a middle-class family:
Example of a pie chart. Source: USAFacts/US Census and IRS. “What Will a $1958 Relief Check Cover for the Average Middle Class Family?”
Advantages of Pie Charts
- Like bar graphs, pie charts are visually simple and easy to interpret.
- Pie charts are among the most visually appealing types of graphs.
- Pie charts are a simple way to depict the relative proportion of a whole (what percent of 100%) that each “piece” (category) represents.
Disadvantages of Pie Charts
- Pie charts typically depict whole percentages, so they are not exact.
- Smaller fractions appear as slivers on the pie chart, which are difficult to see. Pie charts only remain visually simple when there are fewer than 10 categories (or pieces).
- Pie charts can only depict comparisons between subcategories within a larger category. They cannot be used to show more complex relationships between the different categories.
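The point about whole percentages can be seen with a small calculation. In this sketch, three equally sized categories are labeled with rounded whole percentages, and the labels no longer add up to 100. The function name `whole_percent_labels` and the counts are hypothetical, used only for illustration.

```python
def whole_percent_labels(counts):
    """Round each category's share to a whole percent (hypothetical helper)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [round(count * 100 / total) for count in counts]


# Three equally sized categories, each one third of the whole.
counts = [1, 1, 1]
labels = whole_percent_labels(counts)
print(f"Labels: {labels}, total: {sum(labels)}%")
```

When the labels on a pie chart do not sum to exactly 100%, rounding like this is a likely explanation.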
When Should I Use Pie Charts?
Pie charts are best used when showing a particular subcategory in the proportion of a whole (for example, food spending as a part of overall monthly expenditures). The best and most visually appealing use of pie charts is to differentiate between only a handful of different subcategories holding relatively significant proportions of the whole.
A scatter plot (also called a “scatter chart” or “scatter graph”) depicts statistics as dots on a graph. Each dot corresponds to a particular number. In this scatter plot, we see US population by state charted over 119 years:
Example of a scatter plot. Source: 2020 US Census. “US Population By State 1900 vs. 2019”. https://usafacts.org/data/topics/people-society/population-and-demographics/population-data/population/
Advantages of Scatter Plots
- Scatter plots are the most exact way of depicting statistical data.
- Because scatter plots show each individual point and do not include a “best fit” or averaging line, they relay minute differences between points that line graphs do not.
- Scatter plots can depict clusters – areas where there are large numbers of individuals or groups that have the same or similar statistical values. Scatter plots also show “outliers”, or values that are extremely high or extremely low.
Disadvantages of Scatter Plots
- Due to the increased complexity and the lack of clear bars or lines indicating connections or differences between points, scatter plots can be difficult to read.
- The dots on the scatter plot, especially when clustered, can be difficult to differentiate, and minute differences between points on the plot can be less than obvious.
- Different colored dots or different shapes can be used to represent different groups; however, this can create a dizzying amount of clutter if too many points are included. Lines can be used to connect the points for each group to make the graph easier to read.
When Should I Use Scatter Plots?
Scatter plots are preferable for graphs depicting many groups of data points, and/or data that may exhibit minute differences on one or two characteristics. Scatter plots are also ideal for depicting clustering and outliers.
Pictographs represent data in terms of representative images. The following pictograph shows US population density by state — deeper shades of green indicate greater population density.
Example of a pictograph. Source: USAFacts/2020 US Census. “Population by US State”.
Advantages of Pictographs
- Pictographs are the simplest and most visually interesting way to illustrate data. The images representing the categories visually symbolize the category in a more appealing way than a simple bar, line, or dot on a plane.
- The visual differences in the pictograph correspond to statistical differences, making differences easier to interpret.
Disadvantages of Pictographs
- Because pictographs are so simple, they are very inexact.
- Pictographs cannot show minute differences between categories.
- The images used in pictographs can easily mislead, as the visual cues that differentiate symbols can be exaggerated or can otherwise fail to reflect the underlying data.
- Pictographs, like bar graphs and pie charts, only work well when there is a relatively small number of categories being compared.
When Should I Use Pictographs?
Pictographs work best when comparing a relatively small number of discrete categories in which the data differs widely by category. Things like state-by-state differences on a single variable can be easily captured with pictographs.
Mind the Big Picture
If the key to telling lies through statistics is to give only a partial picture or to paint an exaggerated impression of small trends in the data, then the key to getting the truth out of statistics is to pay attention to the big picture.
Take the time to understand what the numbers mean. This means you must understand the significance of a statistic in relation to the dataset in general. Also, consider other categories or groups to which the statistic is being compared, and what it all means in relation to the whole. Be cautious in interpreting graphs and charts. Don’t let the representation (dot, bar, picture, etc.) speak for itself. Always consider what the graphical representation depicts, and interpret it relative to the scale. Be mindful of the limitations of each representation, which will also help guide your understanding.
Ultimately, the key to making statistics tell the truth is to seek out the truth in statistics. Statisticians and those reporting on statistics will focus on what they consider to be the most pertinent and significant aspects of their findings. They have a point to make, and they will make that point as starkly as possible, which can sometimes mean leaving out other contextualizing information. What is left out of the numbers and visualizations is often just as important as, and sometimes more important than, what is included.
Read Part 2 of this series here.