The Journalist’s Guide to Statistics

There are three kinds of lies: lies, damned lies, and statistics.

Mark Twain, “My Autobiography”

Journalists need a good understanding of numbers. Tapping into the power of data would let them create more meaningful and effective stories. But making sense of numbers can be difficult. Reporting on data is often not as straightforward or manageable as other types of journalism. Writers need to separate signal from noise.

What’s more, researchers and writers need to know the context of data to draw appropriate conclusions. You can know everything about how candidates in an election fare against one another in polls and surveys, but until you know why people would vote that way, you can’t say much about those statistics. I’ve written more here on the nature of causation in the context of scientific research. This guide provides a logical, reader-friendly approach for writers who want to harness the power of statistics.

Table of contents:

  1. Know the numbers
  2. Study the source
  3. Remember the reader
  4. Present the product

1. Know the numbers

Too often, writers throw around numbers not knowing what they mean. Here is a run-down of statistics terms you should know as a journalist:

  • Bayesian statistics
    • If it rains, how does that affect which football team will win? This branch of statistics lets you figure out how likely something is to occur given what you know about the factors it depends on. This lets you account for false positives (when a test detects something that isn’t there), such as a medical screening that flags healthy tissue as cancer. With Bayesian models, you can combine different sources of information when putting together these conditional probabilities.
    • Using Bayesian statistics to update how likely a hypothesis or future event is as new evidence arrives is called “Bayesian inference.”
  • Beta distribution
    • Using a pre-defined distribution, you can estimate how well a baseball player will do at the beginning of a season even when you haven’t collected much data so far. Using her batting average of .270, you can create a beta distribution (shown above) with α = 81 and β = 219. The mean is .270 and the standard deviation is σ ≈ .026.
    • If you don’t know the exact probability that something occurs, you can figure out which probability is most likely by selecting it from a beta distribution of probabilities. You can use α and β to calculate the mean μ and standard deviation σ with μ = α / (α + β) and σ = √(αβ / ((α + β)²(α + β + 1))).
    • You’ll also find binomial distributions which use the same probability for all trials instead of letting it change.
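The mean and standard deviation of a beta distribution can be checked with a few lines of Python. This is a minimal sketch: the function name `beta_mean_and_sd` is made up for illustration, and the shape parameters α = 81 and β = 219 are an assumption chosen because they reproduce the .270 average.

```python
from math import sqrt


def beta_mean_and_sd(alpha, beta):
    """Mean and standard deviation of a beta distribution with shape parameters alpha and beta."""
    mean = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, sqrt(variance)


# Assumed shape parameters that reproduce the .270 batting average in the text.
mean, sd = beta_mean_and_sd(81, 219)
print(round(mean, 3), round(sd, 3))   # 0.27 0.026
```

Larger values of α and β (more at-bats observed) shrink the standard deviation, which is how the estimate tightens as the season goes on.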
  • Chi-square test (χ² test)
      Suppose you wanted to find the relationship between being HIV positive and sexual preference. You survey 30 males and find the following data (in a contingency table):

                         Sexual preference                  Total
      HIV+               4          2          3            9
      Not HIV+           3          16         2            21
      Total              7          18         5            30

      Then, for each cell, you multiply its row total by its column total and divide by the overall total. This gives you the expected values, the counts you would see if HIV status and sexual preference were unrelated, which differ from the observed ones as shown below:

                         Sexual preference                                          Total
      HIV+
        Observed (O)     4                  2                    3                  9
        Expected (E)     (9*7)/30 = 2.1     (9*18)/30 = 5.4      (9*5)/30 = 1.5
      Not HIV+
        Observed (O)     3                  16                   2                  21
        Expected (E)     (21*7)/30 = 4.9    (21*18)/30 = 12.6    (21*5)/30 = 3.5

    • If you have an expectation or prediction of what your results should look like, the chi-square test compares it to what you actually observe to tell you how well your predictions match what happens. This example is borrowed from David Stockburger at Missouri State.
    • Researchers calculate this by summing, over every cell, the squared difference between observed and expected values divided by the expected value: χ² = Σ (observed − expected)² / expected.
    • Sometimes you’ll see the difference between observed and expected values referred to as the “residual.”
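To see the arithmetic behind the chi-square example, here is a short Python sketch that rebuilds the expected counts and the test statistic from the observed table. The variable names are illustrative, and the code stops at the statistic rather than looking up a p-value.

```python
# Observed counts from the contingency table (rows: HIV+, not HIV+).
observed = [
    [4, 2, 3],
    [3, 16, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom = (rows - 1) * (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print([[round(e, 1) for e in row] for row in expected])
print(round(chi_square, 3), degrees_of_freedom)
```

The statistic comes out near 7.66 with 2 degrees of freedom, above the 5 percent critical value of about 5.99. Several expected counts are below 5, though, so a test designed for small samples may be the safer choice for a table like this.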
  • Confounding variable
    • If you want to test if texting leads to an increase in crashes, you would want to make sure that text messages, not weather or traffic, cause the crashes. These extra variables the study doesn’t account for are confounding variables.
  • Controlled experiment
    • If you give a drug to students to observe how it affects sleep, you should compare this group (the treatment group) to a control group, a set of students under the same conditions, but without the drug. This makes sure you can determine that it was the drug causing differences in sleep and not some other variable.
  • Correlation
    • This tells you how well two variables are related to one another. Two stocks that change in similar ways to one another over time may be correlated.
  • Fisher’s exact test
                 Cured    Not cured    Total
      Drug A     42       58           100
      Drug B     14       86           100
    • Similar to the chi-square test, this test uses a contingency table (shown above) to check whether an outcome depends on the group. Rather than relying on an approximation, it calculates the exact probability of seeing a table at least as lopsided as yours if the treatment made no difference, which makes it useful for small samples.
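As a rough sketch of how Fisher’s exact test works, the Python below computes a one-sided p-value for the drug table using the hypergeometric distribution. The helper name `table_probability` and the choice of a one-sided test are assumptions made for illustration; statistical packages typically offer two-sided versions as well.

```python
from math import comb

# Contingency table from the text.
cured_a, not_cured_a = 42, 58   # Drug A
cured_b, not_cured_b = 14, 86   # Drug B

n_a = cured_a + not_cured_a            # patients on Drug A
n_b = cured_b + not_cured_b            # patients on Drug B
total_cured = cured_a + cured_b        # cured patients overall
total = n_a + n_b                      # all patients


def table_probability(k):
    """Hypergeometric probability that exactly k of the cured patients took Drug A."""
    return comb(total_cured, k) * comb(total - total_cured, n_a - k) / comb(total, n_a)


# One-sided p-value: chance of at least as many Drug A cures as observed
# if the drug assignment had no bearing on who was cured.
largest_possible = min(total_cured, n_a)
p_value = sum(table_probability(k) for k in range(cured_a, largest_possible + 1))

print(p_value)
```

A very small p-value here means a split as uneven as 42 versus 14 cures would rarely happen by chance alone.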
  • Histogram
    • This shows you how data is distributed, like the normal distribution shown above. The height of each bar shows you how many data points fall in that bin along the x-axis or how likely a data point is to fall in that bin. For probabilities, the areas of the bins should add up to 100 percent.
  • Margin of error
    • When you make a measurement, the margin of error (sometimes called “uncertainty”) tells you how much that measurement could vary due to chance or other factors. You’ll typically find it reported as a range around the result, such as “40 percent +/- 1 percent.”
    • If you’re polling a sample of people, the margin of error can tell you how closely the sample represents the entire population.
    • Writer Robert Niles defines this as “1 divided by the square root of the number of people in the sample,” a rule of thumb that roughly matches a 95 percent confidence level.
    • You can further break down error into bias and systematic error:
      • The same way standing on a weighing scale while wearing clothes makes you heavier, a bias creates an error based on how you measure something.
      • If, instead, the weighing scale itself isn’t calibrated properly, there’s a systematic error. This affects all results due to the nature of your measuring equipment itself.
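The rule of thumb quoted from Robert Niles is easy to try out. This small Python sketch (the function name `margin_of_error` is just illustrative) shows how the margin shrinks as the sample grows.

```python
from math import sqrt


def margin_of_error(sample_size):
    """Rule-of-thumb margin of error quoted in the text: 1 / sqrt(n)."""
    return 1 / sqrt(sample_size)


for n in (100, 400, 1000):
    print(n, f"{margin_of_error(n):.1%}")
```

Note that quadrupling the sample from 100 to 400 people only halves the margin, from 10 percent to 5 percent.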
  • Mean
    • This is the average of a set of data points, generally written using μ. When dealing with statistics, keep your language precise to communicate the most effective message possible. If the average life expectancy in the U.S. is 79 years, know the standard deviation and sample size. You may not need to report those factors, but they’ll help you put your averages in context.
    • When journalists write about the “average citizen” or the “average voter,” in most cases, they’re not referring to the strict mathematical definition of an average (the sum of each data point divided by the number of data points). Rather, journalists tend to refer to the “average” as a common, representative individual in a population. Keep in mind the statistical average only represents this “average individual” as well as the standard deviation and sample size allow.
  • Median
    • If you listed your data points from highest to lowest, the value in the middle is the median. Because it doesn’t depend on how spread out or varied the data points are, the median is, more or less, the “middle.” It doesn’t matter that the richest person in America makes four times as much as the middle class. What matters is whether each value is greater or less than the one in the middle.
    • In some cases, the median can give you a more accurate idea of the “average” person in a population when reporting. Make sure you understand where the median falls in the space between the highest and lowest data point. That can tell you more about how the numbers are distributed.
    • Paleontologist Stephen Jay Gould cited Twain’s “damned lies” line to argue that the eight-month median survival time for peritoneal mesothelioma was misleading when read as a prediction for any one patient. Many people, like Gould, who lived for two more decades, survive for years beyond the median, a reason to take an optimistic, positive view of statistics in general.
  • Mode
    • The mode is the number occurring most often. This simple and clean measurement can tell you who’s the most popular candidate in an election. You won’t see this much, but it’s helpful for comparing raw numbers against one another like sales figures.
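The mean, median and mode described above can give very different pictures of the same data. Here is a minimal Python sketch using the standard `statistics` module; the income figures are hypothetical and exist only to show the effect of one extreme value.

```python
from statistics import mean, median, mode

# Hypothetical annual incomes in thousands of dollars (illustrative only).
incomes = [30, 35, 40, 45, 50, 50, 1000]

print(round(mean(incomes), 1))   # pulled upward by the single very high income
print(median(incomes))           # the middle value once the incomes are sorted
print(mode(incomes))             # the most common value
```

The mean lands near 178.6 even though six of the seven people earn 50 or less, while the median of 45 stays close to what a typical person in the list makes.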
  • Multiplication rule
    • If there’s a 1/2 chance you’ll draw a red card from a deck and a 1/13 chance the card is a King, then the chance of drawing a red King is 1/2 * 1/13 = 1/26. This holds for independent events.
    • Keep track of how one event may affect the other. If you draw a red card from a deck (with a 1/2 probability), the chance the next card is red is now 25/51 because you have one less red card in the deck.
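The card examples can be verified with exact fractions in Python. This is only a sketch of the two cases in the text, with variable names chosen for readability.

```python
from fractions import Fraction

# Independent events: the color and the rank of a single card.
p_red = Fraction(26, 52)
p_king = Fraction(4, 52)
p_red_king = p_red * p_king
print(p_red_king)            # 1/26

# Dependent events: drawing two red cards in a row without replacement.
p_second_red = Fraction(25, 51)   # one red card is already gone
p_two_reds = p_red * p_second_red
print(p_two_reds)            # 25/102
```

Using `Fraction` instead of decimals keeps the results in the same form a reader would see in a probability textbook.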
  • Normal (or standard or Gaussian) distribution
    • Imagine taking a set of heights in a population and graphing the heights on the x-axis with how many times they occur on the y-axis. If the data is “normally” distributed, then most people should fall around an average height, with fewer and fewer heights farther away from this average, as shown in the graph above. You can define a normal distribution using just the mean μ and the standard deviation σ.
    • The normal distribution centers on the average and, with a greater standard deviation, it becomes more spread out in both directions. You most likely won’t report the normal distribution explicitly in a news story.
    • The standard deviation lets you compare individual values to the mean. About 68 percent of people are within one standard deviation of the mean (in either direction), 95 percent are within two standard deviations and almost everyone (99.7 percent) is within three. The Z-score tells you how many standard deviations a data point is from the mean.
    • If you wanted to test if a new psychiatric drug changed the frequency of mood swings, you might measure the number of mood swings in a population with the drug and a population without. If you found that the means of the two distributions are separated by a certain number of standard deviations, you can convert that to a p-value. The p-value is the probability of seeing a difference at least that large if the drug had no effect. The smaller the p-value, the harder it is to explain the difference as random variation rather than the drug itself. It is not, however, the probability that the drug works.
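Python’s standard library includes a `NormalDist` class that can reproduce these figures. The sketch below checks the share of values within one, two and three standard deviations and converts a Z-score to a two-sided p-value; the helper `two_sided_p_value` is a name invented for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, f"{share:.1%}")


def two_sided_p_value(z_score):
    """Probability of a result at least this many standard deviations from the mean."""
    return 2 * (1 - standard_normal.cdf(abs(z_score)))


print(round(two_sided_p_value(1.96), 3))
```

The loop prints roughly 68.3, 95.4 and 99.7 percent, and a Z-score of 1.96 corresponds to the familiar p-value of about 0.05.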
  • Null hypothesis (H₀)
    • To figure out if smoking truly causes cancer, scientists look for ways to show that “smoking doesn’t cause cancer” is false. This is a null hypothesis (H₀), the default assumption that there is no effect or no relationship between the variables being studied. In the words of scientists, they look for ways to “reject the null hypothesis.”
    • The p-value tells you how likely results at least as extreme as yours would be if the null hypothesis were true. When the p-value is small enough, researchers reject the null hypothesis.
  • Quartile
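    • Split the sorted data into four equally sized groups. The three cut points between the groups are the quartiles: the lower quartile (below which a quarter of the data falls), the median and the upper quartile (above which a quarter of the data falls). The distance between the upper and lower quartiles is the interquartile range, which covers the middle half of the data.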
  • Range
    • This is the highest value minus the smallest. Note that the range is a single number, not a range of numbers.
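Quartiles and the range can be computed with Python’s `statistics.quantiles`. The data set below is hypothetical, and different software packages use slightly different conventions for where the quartile cut points fall, so treat the exact values as one reasonable answer.

```python
from statistics import quantiles

# Hypothetical data set (illustrative only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

lower_quartile, middle, upper_quartile = quantiles(data, n=4)
interquartile_range = upper_quartile - lower_quartile
value_range = max(data) - min(data)

print(lower_quartile, middle, upper_quartile)  # 3.0 6.0 9.0
print(interquartile_range)                     # 6.0
print(value_range)                             # 10
```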
  • Regression
    • Regression models how one variable changes with another. If smoking really does cause an increase in cancer, then you should see it if you make a graph of cancer prevalence vs. smoking like the graph above, usually with a line of best fit (shown in red). With a regression on several variables at once, you can separate a scientific observation into the contributions of the variables associated with it.
    • Keep in mind correlation does not imply causation. If you find that video game sales rise around similar times when violent crimes occur, you still need to show that one caused the other before drawing conclusions about the two. Otherwise, the pattern may be a coincidence or just a matter of randomness.
    • You’ll see an R value (how strongly, and in which direction, one variable tracks the other) or an R² value (the share of the variation in the data that the model explains). The ANOVA (Analysis of Variance) behind a regression yields the R² and tells you whether the result is “statistically significant.”
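To make the line of best fit and the R values concrete, here is a minimal least-squares sketch in plain Python. The paired observations are hypothetical, and the calculation covers only the simplest case of one explanatory variable.

```python
from math import sqrt

# Hypothetical paired observations (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

slope = s_xy / s_xx                    # least-squares line of best fit
intercept = mean_y - slope * mean_x
r = s_xy / sqrt(s_xx * s_yy)           # correlation coefficient
r_squared = r ** 2                     # share of variation explained

print(round(slope, 2), round(intercept, 2))   # 0.6 2.2
print(round(r, 3), round(r_squared, 2))       # 0.775 0.6
```

An R² of 0.6 means the fitted line accounts for about 60 percent of the variation in y. It says nothing on its own about whether x causes y.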
  • Standard deviation
    • The standard deviation is how widely values are spread apart or how much the data varies. This, along with mean, defines a normal distribution.
    • You can calculate the standard deviation of a population with σ = √(Σ(xᵢ − x̄)² / n), where x̄ is the average of the data points x, n is the number of data points and Σ sums each squared difference (xᵢ − x̄)². If you want to estimate the standard deviation from a sample, use n − 1 instead of n in the denominator because you only know the mean of that sample, not the population.
    • The standard deviation squared gives you the variance. Sometimes researchers use “deviation” and “variance” interchangeably so keep in mind the difference.
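The difference between the population and sample versions is easiest to see side by side. This Python sketch uses the standard `statistics` module on a small set of hypothetical measurements.

```python
from statistics import pstdev, pvariance, stdev

# Hypothetical measurements (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(pstdev(values))           # population standard deviation (divides by n)
print(pvariance(values))        # population variance, the standard deviation squared
print(round(stdev(values), 3))  # sample standard deviation (divides by n - 1)
```

For these values the population standard deviation is 2 and the variance is 4, while the sample standard deviation is slightly larger at about 2.138.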
  • Stochastic models
    • These are ways to predict future data like financial portfolios or weather forecasts that depend on randomness. Using distributions like the normal or beta distributions, you can simulate what future data will look like and form predictions.
  • Variable
    • Variables are anything that differs from person to person or sample to sample.
    • Categorical variables are ways of labeling people into groups (like biological sex or state of residence), continuous ones lie on a scale (like age or temperature), qualitative ones use adjectives (like colors) and random variables are what scientists measure as outcomes of experiments (like flipping a coin).
2. Study the source

    The Society of Professional Journalists says you should remain accountable and transparent, seek the truth and report it, and act independently. In the context of numbers, this means remaining open and honest in data analysis, scrutinizing findings and mathematical methods, and doing so free from anything that may interfere with an investigation. After you know the definitions of statistics, you need to know where those numbers came from. This means not only knowing how data was collected, but appealing to statistics in a way that reflects the current principles of journalism.

    Writer William Davies argued the authority of statistics and the researchers who study them is declining. In a post-statistical society, journalists need to remain objective and skeptical of statistics while still appreciating them for what they are. It won’t be a battle between elite facts and populist feelings, but, rather, public rhetoric and the forces against it.

    Remember to keep numbers in the context of their original source or how they were measured. If someone asks where you got your information from or how a number was calculated, you should have an appropriate answer. If you’re reporting a p-value for a biomedical study, which variables were measured? How does the standard deviation affect the certainty of the results? Make sure that, for whatever claim or argument a scientist has put forward in a study, you can be responsible for however you report on it.

    As you become more statistically literate, you’ll naturally reevaluate how you reason. Becoming aware of common fallacies and pitfalls journalists fall into can make you more prepared to present accurate scientific findings. Be careful when you read a claim that, because people are losing jobs, the economy must be doing poorly, or when a study finds no evidence of a link between fossil fuels and climate change and you are tempted to treat that absence of evidence as evidence of absence. You can begin to see through arguments that the majority of people saying something is true makes it true and, instead, take a more empirical approach to forming an opinion.

    Much more sinister are those who prey on individuals without a strong statistical or mathematical literacy. Showing that the cost of attending college is a smaller percent of the national debt now than it was in the 1960s doesn’t show that today’s college students pay less for their education. As you study the context and nuances of scientific findings, you’ll become better prepared to separate signal from noise in these situations.

    If there’s a 20 percent chance of rain, does that mean it will rain 20 percent of the time? If a medical procedure has a false positive rate of 1 out of 10 trials, how does that change its effectiveness? It’s easy to appeal to the authority of statistics and science without investigating for yourself. Check what experiments were performed or the historical use of tests like Fisher’s exact test.

    This way, you’re acting as both a writer and a researcher. The key here is to avoid resorting to phrases like “studies show” or “survey says,” and, instead, ask yourself if you really know what the scientific studies purport. Many times scientists will refer to terms like “standard deviation” or “variance” interchangeably so make sure you know what’s being reported.

3. Remember the reader

    Now that you have a deep understanding of what you’re reporting and what it means, you need to put it in a context that a general audience can understand.

    If you ask a drunkard what number is larger, 2/3 or 3/5, he won’t be able to tell you. But if you rephrase the question: what is better, 2 bottles of vodka for 3 people or 3 bottles of vodka for 5 people, he will tell you right away: 2 bottles for 3 people, of course.

    Edward Frenkel, “Love and Math: The Heart of Hidden Reality”

    In the quote above, how does the drunkard arrive at the correct answer? The statistics are presented differently. In the rephrased question, he has a more “tangible,” usable way of understanding how the proportions of vodka would arise from the distribution among people.

    How well do you understand what you write? Try answering this question to find out.

    Imagine you conduct a breast cancer screening using mammography in a certain region. You know the following information about the women in this region: The probability that a woman has breast cancer is 1 percent (known as “prevalence”). If a woman has breast cancer, the probability that she tests positive is 90 percent (“sensitivity”). If a woman does not have breast cancer, the probability that she nevertheless tests positive is 9 percent (false-positive rate). A woman tests positive. She wants to know from you whether that means that she has breast cancer for sure, or what the chances are. What is the best answer?

      A. The probability that she has breast cancer is about 81 percent.
      B. Out of 10 women with a positive mammogram, about 9 have breast cancer.
      C. Out of 10 women with a positive mammogram, about 1 has breast cancer.
      D. The probability that she has breast cancer is about 1 percent.

    When German psychologist Gerd Gigerenzer posed the question to about 1000 gynecologists, about 21 percent chose the correct answer, C. While that is a little worse than random guessing, I must admit that, on my first attempt, I failed to answer this question correctly as well. Through his research, Gigerenzer has crafted a theory of understanding statistics that would help us in situations like this.

    Similar to Frenkel’s example of the fractions of vodka, psychologists like Daniel Kahneman and Gerd Gigerenzer have shown that asking statistics questions in different ways can influence the ways we understand them. For example, when the information preceding the question is framed differently (as shown below), 87 percent of gynecologists answered correctly.

    Assume you conduct breast cancer screening using mammography in a certain region. You know the following information about the women in this region:

    • Ten out of every 1,000 women have breast cancer
    • Of these 10 women with breast cancer, 9 test positive
    • Of the 990 women without cancer, about 89 nevertheless test positive

    In both examples (of breast cancer screening and of bottles of vodka), when we change from “conditional probabilities” to “natural frequencies,” we suddenly understand statistics much better. Of the 98 women who test positive (9 with cancer and 89 without), only 9 actually have breast cancer, or about 1 in 10. Like Gigerenzer, I believe we can teach the appropriate way to interpret statistics, and, with the effect it has on our health and society, we have a moral imperative to do so.
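    The screening question can also be worked through with Bayes’ theorem. The Python sketch below uses the prevalence, sensitivity and false-positive rate from the question and then repeats the calculation with the natural frequencies; the variable names are illustrative.

```python
# Figures from the screening question in the text.
prevalence = 0.01            # P(cancer)
sensitivity = 0.90           # P(positive | cancer)
false_positive_rate = 0.09   # P(positive | no cancer)

# Bayes' theorem: P(cancer | positive).
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_cancer_given_positive = sensitivity * prevalence / p_positive
print(round(p_cancer_given_positive, 3))   # 0.092

# The same answer from natural frequencies: 9 true positives and 89 false positives.
print(round(9 / (9 + 89), 3))              # 0.092
```

    Both routes land at roughly 9 percent, which is why answer C (about 1 in 10) is the best choice.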

    You can use a confusion matrix like the one above to keep track of the accuracy metrics of an experiment when presenting information to colleagues.

    This isn’t a simple case of deliberately communicating false information or lying about the statistics we use. While there may be agendas and conflicts of interest between professionals (including scientists), often we simply don’t understand how to interpret statistics. And, in the field of medicine, this can have disastrous results. We make poor decisions about how long a patient may live, how prevalent cancer is among smokers, and the harms and benefits of screening for breast cancer.

4. Present the product

    Many ways of visualizing, illustrating or explaining statistics exist no matter the medium. Looking across FiveThirtyEight, The Guardian’s Data section or other data journalism publications, you can find effective ways of communicating complicated concepts either to the audience of your publication or to colleagues. Use figures and graphs to explain take-home messages and conclusions from your reporting. Make sure they’re easy to read and follow.

    Python and R offer ways of visualizing statistical findings, with R providing much more extensive libraries for statistics than Python. My work in creating interactive network graphs, word clouds and even periodic tables shows some examples. To produce a confusion matrix like the one shown below, you can use this code.

    Compare this confusion matrix to the Null hypothesis table above. Though it might be too complicated for someone reading a newspaper, you can use it to present findings to other researchers.
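    As a minimal sketch (not the code linked above), a confusion matrix can be laid out with plain Python. The counts come from the mammography example per 1,000 women; the true-negative count of 901 is derived here as 990 minus 89, and the layout and variable names are assumptions made for illustration.

```python
# Counts per 1,000 women from the mammography example earlier in this guide.
true_positives = 9     # cancer, positive test
false_negatives = 1    # cancer, negative test
false_positives = 89   # no cancer, positive test
true_negatives = 901   # no cancer, negative test (990 - 89)

# Print the confusion matrix: rows are the true condition, columns the test result.
print(f"{'':<12}{'Test +':>8}{'Test -':>8}")
print(f"{'Cancer':<12}{true_positives:>8}{false_negatives:>8}")
print(f"{'No cancer':<12}{false_positives:>8}{true_negatives:>8}")

# Accuracy metrics derived from the matrix.
sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
precision = true_positives / (true_positives + false_positives)
print(round(sensitivity, 2), round(specificity, 2), round(precision, 2))
```

    The last line prints 0.9, 0.91 and 0.09: the test catches most cancers and clears most healthy women, yet fewer than one in ten positive results is a true case.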

    It’s a good idea to value openness and transparency with your code and work in creating visualizations. This gives other researchers and writers ways to check and re-examine what you’ve done. The chart below shows how much the University of California, Santa Cruz Science Communication class of 2020 used Slack during their fall quarter (with its code here). Interactive graphs give the reader a better sense of data and let you communicate more information as effectively as possible.

    Make sure to perform statistical tests to confirm results from research when you report. In the movie “Rosencrantz and Guildenstern Are Dead,” the two protagonists flip a coin heads 92 times in a row. The chance of this happening is about 1 in 5 octillion. In a more realistic setting, the Dallas Cowboys have won 6 out of 8 coin tosses in the history of Super Bowls. In R, you can use a binomial distribution to find the probability of exactly 6 wins in 8 fair tosses, which returns the value 0.109375.

    probability <- 0.5  # chance of winning a single fair coin toss
    wins <- 6           # number of winning coin flips
    totalFlips <- 8     # total coin flips
    # Binomial probability of exactly `wins` successes in `totalFlips` tries
    dbinom(wins, totalFlips, probability)
    # [1] 0.109375

    In the code, the comments are written with a # in front of them explaining what each line does. These comments are notes programmers write to explain things without affecting the code.
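    For readers working in Python rather than R, the same calculation can be sketched with the standard library. The function name `binomial_pmf` is made up for this example; it mirrors what R’s `dbinom` computes, and the last line checks the figure quoted for 92 heads in a row.

```python
from math import comb


def binomial_pmf(successes, trials, probability):
    """Probability of exactly `successes` wins in `trials` independent tries."""
    return (
        comb(trials, successes)
        * probability ** successes
        * (1 - probability) ** (trials - successes)
    )


# Exactly 6 wins in 8 fair coin tosses, matching R's dbinom(6, 8, 0.5).
print(binomial_pmf(6, 8, 0.5))          # 0.109375

# 92 heads in a row: a 1-in-2**92 event, roughly 1 in 5 octillion.
print(f"1 in {2 ** 92:.2e}")            # 1 in 4.95e+27
```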

    With enough coin tosses, you can make a graph of how these probabilities depend on the number of heads and flips. A single flip with only two outcomes (heads or tails) follows a Bernoulli distribution, and the number of heads across many flips follows a binomial distribution.

    How likely is it the coin is fair? (Code found here.)

    Not all visuals are created equal. Statisticians William Cleveland and Robert McGill found that people can tell differences between length and angles much more easily than shapes and colors. This means, where appropriate, you should use charts and plots that rely on lines and slopes and avoid pie charts.

    No matter the code or plot you make, taking an independent, investigative approach to statistics can let you harness the power of data in your stories. Becoming more savvy with numbers and calculations can let you present more accurate, verified findings. Though you can’t just drop statistics without context or understanding of how they came about, newsrooms and other publications can take a more empirical approach to presenting scientific research for what it is. Whether it’s journalists themselves or a hired analyst creating statistical models of disease prevalence, they should adhere to the established standards of journalism.

    Life expectancy: visualized. (Code found here.)

    Journalism emphasizes quick, easy-to-understand conclusions and messages. While some projects can require more complicated workflows such as Bayesian models, bootstrapping or exploratory data analysis, sometimes all that matters is whether an experiment worked or didn’t. In many cases, you simply don’t have the time or capacity to explain what a p-value or regression test is. Still, becoming statistically literate and understanding the mathematics behind calculations involved in research can make you all the more prepared in presenting stories. Being able to tell the difference between causation and correlation can save you from drawing false conclusions and make your arguments more justified on the basis of statistics. It can give you the power to check the work of others and move journalism into a domain of peer-reviewed, egalitarian work. In writing this guide, I hope to do so as well.
