Image credit: Randall Munroe.
With election season approaching, everyone wants to know how the future of the United States’ leadership will shape up. Turning to data, we can make predictions by drawing inferences from the past and present, as statisticians such as Nate Silver would explain. As the title suggests, this post asks: under what conditions, exactly, can we use experimental data to deduce a causal relationship between two or more variables?
Israeli-American computer scientist and philosopher Judea Pearl laid much of the groundwork for modern theories of causality. Causal inference is still debated among scientists and philosophers, but its premises, arguments, and conclusions can give us an understanding of correlation that doesn’t fall prey to errors in reasoning.
Pearl’s causal calculus is a set of three simple, powerful algebraic rules that can be used to make inferences about causal relationships. In particular, I’ll talk about the ways causal inference is possible, but I’ll also go into detail about the limits of these methods.
To see what explaining a cause involves, consider the relationship between smoking and lung cancer. Several decades ago, the U.S. Surgeon General published a report claiming that cigarette smoking causes lung cancer. But the report came under attack not just by tobacco companies, but also by some of the world’s most prominent statisticians, including the great Ronald Fisher. Other factors could have been at play in the complicated relationship between smoking and lung cancer, such as genetics, environmental factors, and personal characteristics such as age or race. In fact, one would have to understand the relationship between lung cancer and an individual’s decision to smoke, not just the simple correlation between smoking and lung cancer. The way these factors relate to one another was most likely much more nuanced than the bare claim that cigarette smoking causes cancer. Understanding the specifics of these relationships matters if individuals are to make healthy and beneficial decisions in their lives.
A randomized controlled experiment may be possible in principle, but the way it should be done requires explanation. A scientist could run an experiment that randomly assigns each participant to smoke or not and, with an appropriately sized sample of smokers and non-smokers, observe the cancer rates in the two groups. The scientist would have to keep all other factors equal and ensure that the groups are randomly assigned and large enough to support the generalization that smoking causes lung cancer. The scientist could then determine whether there is a causal relationship.
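To make the logic of such an experiment concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the function name `run_trial`, the 30% prevalence of a hidden risk factor, the baseline risks, and the "true" effect of smoking are all invented numbers, not data from any study. The point is only that when a coin flip decides who smokes, the hidden factor is balanced across the two groups, so the difference in cancer rates estimates the causal effect.

```python
import random


def run_trial(n: int, true_effect: float, seed: int = 0) -> float:
    """Simulate a randomized experiment and return the difference in
    cancer rates (smokers minus non-smokers).

    All probabilities below are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    # smokes -> [cancer cases, group size]
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        hidden = rng.random() < 0.3   # hidden risk factor (assumed 30% prevalence)
        smokes = rng.random() < 0.5   # coin flip: independent of the hidden factor
        risk = 0.02 + (0.15 if hidden else 0.0) + (true_effect if smokes else 0.0)
        cancer = rng.random() < risk
        counts[smokes][0] += cancer
        counts[smokes][1] += 1
    rate = {group: cases / size for group, (cases, size) in counts.items()}
    return rate[True] - rate[False]


if __name__ == "__main__":
    # With a large sample, the estimate should land close to the assumed effect.
    print(run_trial(200_000, true_effect=0.10))
```

Because assignment ignores the hidden factor, the estimate should come out close to the assumed `true_effect`, up to sampling noise.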
Yet, in reality, things are not so simple. Experiments such as these are time-consuming, difficult to maintain, and rely on controlling for many factors that complicate the issue being observed, and that is before we consider the ethical or legal problems of performing such an experiment. This is a fundamental obstacle for scientists seeking to explore the relationships between activities like smoking and their associated health consequences.
The causal models we construct to analyze the smoking-cancer connection let us draw diagrams of the possibilities. One possibility is that a hidden factor is at play, influencing both smoking and lung cancer. In these diagrams, the arrows indicate which factor causes which. Since we don’t know exactly what the hidden factor is or how it behaves, we illustrate it this way:
[Figure: causal diagram with a hidden factor linked to both smoking and lung cancer. Caption: simple and clean.]
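A small worked calculation shows why a hidden factor is worrying. In the sketch below, smoking has no effect on cancer at all: cancer risk depends only on the hidden factor. The probabilities and the function name `p_cancer_given_smoking` are made up for illustration. Even so, smokers end up with a higher cancer rate, simply because the hidden factor makes people more likely both to smoke and to develop cancer.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen only for illustration.
P_HIDDEN = F(3, 10)                                       # P(hidden factor present)
P_SMOKE_GIVEN_HIDDEN = {True: F(4, 5), False: F(1, 5)}    # P(smokes | hidden factor)
P_CANCER_GIVEN_HIDDEN = {True: F(1, 5), False: F(1, 50)}  # P(cancer | hidden factor)
# Note: smoking does not appear in the cancer probabilities at all.


def p_cancer_given_smoking(smokes: bool) -> F:
    """P(cancer | smoking status), summing over the unobserved hidden factor."""
    joint_cancer = F(0)   # P(cancer and this smoking status)
    marginal = F(0)       # P(this smoking status)
    for hidden, p_hidden in ((True, P_HIDDEN), (False, 1 - P_HIDDEN)):
        p_smoke = P_SMOKE_GIVEN_HIDDEN[hidden]
        p_status = p_smoke if smokes else 1 - p_smoke
        joint_cancer += p_hidden * p_status * P_CANCER_GIVEN_HIDDEN[hidden]
        marginal += p_hidden * p_status
    return joint_cancer / marginal


if __name__ == "__main__":
    print("smokers:    ", float(p_cancer_given_smoking(True)))
    print("non-smokers:", float(p_cancer_given_smoking(False)))
```

With these invented numbers the observed cancer rate is about 13.4% among smokers and about 3.7% among non-smokers, a strong correlation with no causal arrow from smoking to cancer.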
There is also a third possibility: that smoking and a hidden factor together contribute to lung cancer. This makes our correlations and relationships even more complicated, but allows for more nuanced and detailed accounts of them. We could even argue that smoking itself reduces the probability of lung cancer while some hidden factor increases it by enough that we still observe a higher cancer risk among smokers. These explanations may seem to hinge on unnecessarily complicated premises, but, given how many factors are at play in the empirical evidence on smoking and cancer, they give us much more room to build accurate arguments about the issue.
The paper went on to argue that women tended to apply to departments with low rates of admission for all or most applicants (such as English), while men tended to apply to departments with high rates of admission (such as engineering and chemistry).
From a purely causal point of view (that is, in understanding which factors determine other factors), this result seems paradoxical. Making clear, educated statements about whether an individual is likely to be accepted to Berkeley may hinge more on the assumptions and premises that lead up to our conclusions than on easy-to-digest, generalizable conclusions. Two variables that appear correlated can become anti-correlated when another factor is taken into account.
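The reversal is easier to believe with numbers in hand. The sketch below uses invented application counts (they are not the Berkeley figures) for two hypothetical departments, one with a high admission rate and one with a low admission rate. Women are admitted at a higher rate than men within each department, yet at a lower rate overall, because most women applied to the department that admits few people.

```python
from fractions import Fraction

# department -> gender -> (applicants, admitted). Invented numbers for illustration.
APPLICATIONS = {
    "high_admit_dept": {"men": (800, 480), "women": (100, 70)},
    "low_admit_dept": {"men": (200, 40), "women": (900, 225)},
}


def rate_within(dept: str, gender: str) -> Fraction:
    """Admission rate for one gender inside one department."""
    applied, admitted = APPLICATIONS[dept][gender]
    return Fraction(admitted, applied)


def rate_overall(gender: str) -> Fraction:
    """Admission rate for one gender with the departments pooled together."""
    applied = sum(APPLICATIONS[d][gender][0] for d in APPLICATIONS)
    admitted = sum(APPLICATIONS[d][gender][1] for d in APPLICATIONS)
    return Fraction(admitted, applied)


if __name__ == "__main__":
    for dept in APPLICATIONS:
        print(dept, float(rate_within(dept, "men")), float(rate_within(dept, "women")))
    print("overall", float(rate_overall("men")), float(rate_overall("women")))
```

Here women lead within each department (70% vs. 60%, and 25% vs. 20%), but the pooled rates are 52% for men and 29.5% for women. The department an applicant chooses is the extra factor that flips the comparison.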
In any case, causality itself seems to fly under the radar of too many scientists. To assess the truth and validity of causal arguments and use them in any discipline, one must come to understand the nature of causality itself. By building such arguments and recognizing the limitations of these methods of inquiry, we can refine our understanding of the universe and put more certainty into our predictions and inferences.
There are ways to resolve Simpson’s paradox, though. With a causal Bayesian network (a directed acyclic graph like the ones we’ve been working with; let’s say X causes Y), we can measure how changing X would change Y and so determine the relationship between them. As in our smoking example, there may be ethical and logistical reasons why this is not possible. One could also show that an extra variable Z correlates with both X and Y; for instance, we could determine that X causes Z, which in turn causes Y. Finally, one might have a variable Z that affects both X and Y, so that Z causes X and Z causes Y. As the graphs for smoking and lung cancer illustrated, we generally want our measurements to rule out or account for these hidden variables in order to determine how a causal model works.
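For the last case (Z causes X and Z causes Y), one standard remedy, when Z can actually be measured, is to compare X and Y within each level of Z and then average over Z. This is usually called the back-door adjustment, P(y | do(x)) = Σ_z P(y | x, z) P(z). The sketch below applies it to an invented table of counts; the table, the binary coding of X, Y, and Z, and the function names are all assumptions for illustration.

```python
from fractions import Fraction

# (z, x) -> (number of people, number with outcome y). Invented counts.
COUNTS = {
    (0, 0): (400, 40),
    (0, 1): (100, 20),
    (1, 0): (100, 40),
    (1, 1): (400, 200),
}


def naive_rate(x: int) -> Fraction:
    """P(y | x): pool everyone with this x, ignoring z."""
    people = sum(n for (_, xv), (n, _) in COUNTS.items() if xv == x)
    cases = sum(y for (_, xv), (_, y) in COUNTS.items() if xv == x)
    return Fraction(cases, people)


def adjusted_rate(x: int) -> Fraction:
    """Back-door adjustment: sum over z of P(y | x, z) * P(z)."""
    total = sum(n for n, _ in COUNTS.values())
    result = Fraction(0)
    for z in (0, 1):
        n_z = sum(COUNTS[(z, xv)][0] for xv in (0, 1))   # people in stratum z
        n, y = COUNTS[(z, x)]
        result += Fraction(y, n) * Fraction(n_z, total)
    return result


if __name__ == "__main__":
    print("naive difference:   ", float(naive_rate(1) - naive_rate(0)))
    print("adjusted difference:", float(adjusted_rate(1) - adjusted_rate(0)))
```

With these counts the naive difference is 0.28, while the adjusted difference is 0.10, which matches the gap seen inside each level of Z. The adjustment only helps if Z is observed and really is the common cause; a truly hidden variable cannot be handled this way.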