What techniques can be used to differentiate between correlation and causation?

Psychology is unusual among the medical sciences in that it cannot ask test participants to like, prefer or support something. As such, a vast body of psychological studies is purely observational and can find correlation but not causation.

For example, if there is a link between self-awareness and empathy, how can one test whether being more empathetic increases self-awareness, being more self-aware increases empathy, or, let's say, the level of some hormone controls both and they are otherwise independent of each other?


Let's compare those two situations:

  1. A study found that people who eat a lot of sugar are fat. How can researchers test whether a) eating sugar makes people fat, b) being fat makes people eat more sugar, or c) both sugar consumption and body fat are influenced by the same external factor?
  2. A study found that people who like the color yellow are more likely to be psychopaths. How can researchers test whether a) liking yellow turns people into psychopaths, b) being a psychopath makes people like yellow, or c) being a psychopath and liking yellow are both influenced by the same external factor?

In the first scenario the answer is easy: split test subjects into two groups, force one group to restrict sugar intake and monitor their weight, and help the second group lose weight and monitor their sugar intake.

In the second, I cannot find the answer. Because psychology generally concerns itself with what people are as opposed to what they do, the standard approach in medicine described above cannot be utilized. Researchers can't force someone to like yellow, nor can they turn someone into a psychopath.


When researchers find a statistical correlation between character traits or preferences, how do they test whether there is a causal relationship between those characteristics, if traits cannot be adjusted?

There is no statistical test for distinguishing mere correlation from causation. It has to come from ulterior (model) considerations.

Having said that, a common way is to design a new experiment, called a controlled study, in which [based on ulterior model considerations] you change something in some but not all the items/subjects observed. This is basically what you describe (propose to do) in your situation 1.

If you have no way whatsoever to change/adjust anything, then you are stuck, so to speak; there's no way to go beyond observational studies. In practice, researchers come up with an idea of what causes what and try to "eliminate" (or rather account for) confounders--what you call an "external factor". There are various degrees of statistical sophistication at which you can do this, e.g. so-called path analysis. However, coming up with a complete list of confounders is hopeless.

A science in which by necessity many or most studies are observational is called an observational science (duh). Psychology is not the only science where you find a lot of observational studies. Archeology, geology, epidemiology or astronomy are other, perhaps more extreme examples. In general, a science being (mostly) observational entails that it's more difficult to replicate study findings… but some contend that this state of affairs is something people got used to… so the so-called "replication crisis" has mostly to do with controlled rather than observational studies. Quoting from the previous link:

But Peng also describes his own field, epidemiology, in the top left of the chart, as having largely been spared the reproducibility crisis, even though it is as theoretically rudderless as medicine or psychology (or archaeology). His contention is that in fields that rely on experimentation, breeds an unreasonable expectation that the results of single experiments (if well-designed and statistically validated) are true. Meanwhile, observational sciences, which are accustomed to the fact that they have little control over their observations and that single results may well be wrong, have already learned the core lesson of the reproducibility crisis: don't trust that that isn't replicated.

In fact, some of the most controversial results in psychology have both a theoretical (albeit not deeply so) basis and rely on controlled rather than observational studies, e.g. the ego depletion hypothesis, which is roughly that sugar restores willpower.

Fizz's answer that "there is no statistical test for distinguishing mere correlation from causation" is wrong. There is. It's called the Granger causality test. It's basically a test for correlation across time, which is what a causal relation essentially is:

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969. Ordinarily, regressions reflect "mere" correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. (Wikipedia)

As long as you're able to observe the phenomena whose causal relation you're trying to ascertain at separate points in time, you can test whether the earlier phenomenon is a good statistical forecast of the later one. Whether we fully understand the mechanics of how the former causes the latter or not is immaterial to them being in a causal relation, and, in fact, any such explanation is just a breakdown of one such lagged correlation into a series of correlations between intervening steps.
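The "does the earlier series help forecast the later one" idea can be sketched in a few lines. This is a minimal illustration with a single lag and simulated data, not a full Granger test: the two series, their coefficients, and the helpers `ols_rss` and `lag_f_stat` are all assumptions invented for the example. An established implementation (for instance `grangercausalitytests` in statsmodels) would normally be used instead.

```python
import random

def ols_rss(X, y):
    """Least-squares fit of y on the columns of X; returns the residual sum of squares."""
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))

def lag_f_stat(cause, effect):
    """F statistic: does cause[t-1] improve a forecast of effect[t] beyond effect[t-1]?"""
    y = effect[1:]
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    full = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    rss_r, rss_u = ols_rss(restricted, y), ols_rss(full, y)
    return (rss_r - rss_u) / (rss_u / (len(y) - 3))

# Simulated data (assumption): x drives y with a one-step delay, never the reverse.
random.seed(0)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.3 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 1))

f_x_to_y = lag_f_stat(x, y)   # expected to be large
f_y_to_x = lag_f_stat(y, x)   # expected to be small
print(f_x_to_y, f_y_to_x)
```

A large F statistic in one direction only says that the lagged series improves the forecast. On real data, a third variable that reaches the two series with different delays would produce the same pattern.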

For more on the Granger causality test see:

Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 37(3), 424-438. doi:10.2307/1912791

Correlation and Causation Regarding the Etiology of Lung Cancer in Regard to Smoking

The etiology of the increase in lung cancer over the course of the 20th century has been a large area of debate. Was it smoking that caused cancer? Or was some other, unknown, factor the cause? Causation is multifactorial and multi-level—that is, causes of anything are numerous and these causes all interact with each other. But when it comes to smoking, it was erroneously argued that genotypic differences between individuals were the cause of both smoking and cancer. We know now that smoking is directly related to the incidence of lung cancer, but in the 20th century, there were researchers who were influenced and bribed to bring about favorable conclusions for the tobacco companies.

Psychologist Hans Eysenck (1916-1997) was a controversial psychologist researching many things, perhaps most controversially, racial differences in intelligence. It came out recently, though, that he published fraudulent papers with bad data (Smith, 2019). He, among other weird things, believed that smoking was not causal in regard to cancer. Now, why might Eysenck think that? Well, he was funded by many tobacco companies (Rose, 2010; Smith, 2019). He accepted money from tobacco companies to attempt to disprove the strong correlation between smoking and cancer. Between 1977 and 1989, Eysenck accepted about 800,000 pounds from tobacco companies. He is not alone in holding erroneous beliefs such as this, however.

Biostatistician Ronald Fisher (1890-1962) (a pipe smoker himself), the inventor of many statistical techniques still used today, also held the erroneous belief that smoking was not causal in regard to cancer (Parascandola, 2004). Fisher (1957) argued in a letter to the British Medical Journal that while there was a correlation between smoking and the acquisition of lung cancer, “both [are] influenced by a common cause, in this case the individual genotype.” He went on to add that “Differentiation of genotype is not in itself an unreasonable possibility”, since it has been shown that genotypic differences in mice precede differences “in the frequency, age-incidence and type of the various kinds of cancer.”

So, if we look at the chain, it goes like this: people smoke, and smoking is related to the incidence of cancer, but it does not follow that the smoking is the cause of the cancer, since an unknown third factor could cause both the smoking and the cancer. So now we have four hypotheses: (1) smoking causes lung cancer; (2) lung cancer causes smoking; (3) both smoking and lung cancer are caused by an unknown third factor, which in this case would be the individual genotype; and (4) the relationship is spurious. Fisher was of the belief that “although lung cancer occurred in cigarette smokers it did not necessarily follow that the cancer was caused by cigarettes because there might have been something in the genetic make up of people destined to have lung cancer that made them addicted to cigarettes” (Cowen, 1999). Arguments of this type were popular in the 19th and 20th centuries. I would term them ‘genetic determinist’ arguments, in that genes dispose people to certain behaviors. In this case, genes disposed people to lung cancer which made them addicted to cigarettes.

Now, the argument is as follows: smoking, while correlated with cancer, is not causal in regard to cancer. Those who choose to smoke would have acquired cancer anyway, as they were predisposed to both smoke and acquire cancer at X age. We now know, of course, that such claims are ridiculous, no matter which “scientific authorities” they come from. Fisher’s idea was that differences in genotype caused differences in cancer acquisition and so along with it, caused people to either acquire the behavior of smoking or not. While at the time such an argument could have been seen as plausible, the mounting evidence against the argument did nothing to sway Fisher’s belief that smoking did not outright cause lung cancer.

The fact that smoking caused lung cancer was initially resisted by the mainstream press in America (Cowen, 1999). Cowen (1999) notes that Eysenck stated that, just because smoking and lung cancer were statistically associated, it did not follow that smoking caused lung cancer. Of course, when thinking about what causes, for example, an observed disease, we must look at similar habits they have. And if they have similar habits and it is likely that those with similar habits have the hypothesized outcome (smokers having a higher incidence of lung cancer, in this case), then it would not be erroneous to conclude that the habit in question was a driving factor behind the hypothesized disease.

It just so happens that we now have good sociological research on the foundations of smoking. Cockerham (2013: 13) cites Hughes’ (2003) Learning to Smoke: Tobacco Use in the West where he describes the five stages that smokers go through: “(1) becoming a smoker, (2) continued smoking, (3) regular smoking, (4) addicted smoking, and, for some, (5) stopping smoking.” Most people report their first few times smoking cigarettes as unpleasant, but power through it to become a part of the group. Smoking becomes somewhat of a social ritual for kids in high school, with kids being taught how to light a cigarette and how to inhale properly. For many, starting smoking is a social thing that they do with their friends; much like social drinkers, they are social smokers. There is good evidence that, for many, their journey as smokers starts with, and depends more on, their social environment than actual physical addiction (Johnson et al, 2003; Haines et al, 2009).

One individual interviewed in Johnson et al (2003: 1484) stated that “the social setting of it all [smoking] is something that is somewhat addictive itself.” So, not only is the nicotine the addictive substance on the mind of the youth, so too is the social situation for the youth in which the smoking occurs. The need to fit in with their peers is one important driver for the beginning, and continuance, of the behavior of smoking. So we now have a causal chain in regard to smoking, the social, and disease: youths are influenced/pressured to smoke by their social group which then leads to addiction and then, eventually, health problems such as lung cancer.

The fact that the etiology of smoking is social leads us to a necessary conclusion: change the social network, change the behavior. Just as people begin smoking in social groups, so too do people quit smoking in social groups (Christakis and Fowler, 2008). We can then state, on the basis of the cited research, that the social is ultimately causal in the etiology of lung cancer, the vehicle of cancer-causation being the cigarettes pushed by the social group.

Eysenck and Fisher, two pioneers of statistics and different methods in psychology, were blinded by self-interest. It is very clear with both Eysenck and Fisher, that their beliefs were driven by Big Tobacco and the money they acquired from them. Philosopher Donald Davidson famously stated that reasons are causes for actions (Davidson, 1963). Eysenck’s and Fisher’s “pro-belief” (in this case, the non-causation of smoking to lung cancer) would be their “pro-attitude” and their beliefs lead to their actions (taking money from Big Tobacco in an attempt to show that cigarettes do not cause cancer).

The etiology of lung cancer as brought on by smoking is multifactorial, multilevel, and complex. We do have ample research showing that the beginnings of smoking for a large majority of smokers are social in nature. They begin smoking in social groups, and their identity as a smoker is then refined by others in their social group who see them as “a smoker.” Since individuals both begin smoking in groups and quit in groups, it then follows that the acquisition of lung cancer can be looked at as a social phenomenon as well, since most people start smoking in a peer group.

The lung cancer-smoking debate is one of the best examples of the dictum post hoc, ergo propter hoc, or, correlation does not equal causation (indeed, this is where the dictum first originated). While Fisher and Eysenck did hold to that view in regard to the etiology of lung cancer (they did not believe that, because smokers were more likely to acquire lung cancer, smoking caused lung cancer), it does speak to the biases the two men had in their personal and professional lives. These beliefs were disproven by showing a dose-dependent relationship in regard to smoking and lung cancer: heavier smokers had more serious cancer incidences, which tapered down the less an individual smoked. Fisher’s belief, though, that differences in genotype caused both behavior that led to smoking and the lung cancer itself, while plausible at the time, was nothing more than a usual genetic determinist argument. We now know that genes are not causes on their own; they do not cause traits irrespective of their uses for the physiological system (Noble, 2012).

Everyone is biased. Everyone. Now, this does not mean that objective science cannot be done. But what it does show is that “… scientific ideas did not develop in a vacuum but rather reflected underlying political or economic trends” (Hilliard, 2012: 85). This, and many more examples, speak to the biases of scientists. Reasons like this, though, are why science is about the reproduction of evidence. And, for that, the ideas of Eysenck and Fisher will be left in the dustbin of history.

In that case, how do we establish causality using observational data?

There has been a good amount of research done on this particular issue. The entire objective of these methodologies is to eliminate the effect of any unobserved variable. In this section, I will introduce you to some of these well-known techniques:

1. Panel Model (Ordinary regression): This method comes in very handy if the unobserved dimension is invariant along at least one dimension. For instance, if the unobserved dimension is invariant over time, we can try building a panel model which can separate out the bias coming from the unobserved dimension. For instance, in the B Schools => High Salary example, we assume that the unobservable dimension is invariant over time.

Let’s try this methodology.

1. Following is the equation of regressing y (salary) against both Premium college (subscript T) and unobserved dimensions (subscript U)

2. But, because the unobserved dimension is invariant over time, we can simplify the equation as follows :

3. We can now eliminate the unobserved factor by differencing over time

Now it becomes possible to find the actual coefficient of the causal relationship between college and salary.
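The differencing argument can be checked on simulated data. This is a minimal sketch under assumptions that are not in the text: a linear model y_it = β·T_it + U_i + ε_it with a time-invariant unobserved term U_i, two periods, nobody treated in period 1, and made-up numbers (`TRUE_EFFECT`, the sample size, the noise levels). Subtracting period 1 from period 2 gives Δy_i = β·ΔT_i + Δε_i, so U_i drops out.

```python
import random

def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(1)
TRUE_EFFECT = 2.0          # assumed causal effect of the premium college on salary
n = 2000
t2, y1, y2 = [], [], []
for _ in range(n):
    u = random.gauss(0, 2)                             # unobserved, constant over time
    treated = 1 if u + random.gauss(0, 1) > 0 else 0   # high-u people enrol more often
    t2.append(treated)
    y1.append(u + random.gauss(0, 1))                           # period 1: nobody treated
    y2.append(TRUE_EFFECT * treated + u + random.gauss(0, 1))   # period 2

naive = slope(t2, y2)                                 # cross-section, confounded by u
diffed = slope(t2, [b - a for a, b in zip(y1, y2)])   # change in y on change in T
print(round(naive, 2), round(diffed, 2))
```

In this setup the cross-sectional slope overstates the effect because `u` raises both enrolment and salary, while the differenced slope should land near `TRUE_EFFECT`. The approach relies entirely on the unobserved term really being constant over time.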

2. Simulated Control: The biggest concern with observational data is that we do not get treatment and non-treatment data for the same data point. For instance, referring to the smoking example above, a person can’t be a smoker and non-smoker at the same time.

But what if we can find, for every treated individual, a look-alike in the non-treated group, and then compare the responses among the look-alikes? This is the most common method implemented currently in the industry.

The look-alike can be found using a nearest neighbor algorithm, a k-d tree or any other algorithm. Let’s take an example: we have two people of the same age group, gender, income etc. One of them starts smoking and the other does not. Now the stress level can be compared over a period of time, given that no other condition changes between them.

While this might sound like a very exciting approach in theory, it is usually difficult to carve out a pure simulated / virtual control, and this can sometimes lead to conclusions which may not be correct. This is actually a topic for a different article in future.
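The look-alike idea can be sketched with a tiny nearest-neighbor match. Everything here is illustrative: the eight records, the two covariates, and the helper `nearest_control` are invented for the example, and plain Euclidean distance on raw age and income is an assumption (in practice covariates are usually rescaled, or a propensity score is matched on instead).

```python
import math

# (age, income in thousands, smoker?, stress score): invented illustrative records
people = [
    (25, 30, True, 62), (40, 55, True, 70), (58, 80, True, 75),
    (26, 31, False, 55), (41, 52, False, 61), (57, 83, False, 69),
    (33, 45, False, 58), (65, 20, False, 80),
]

treated = [p for p in people if p[2]]
controls = [p for p in people if not p[2]]

def nearest_control(person):
    """Return the non-smoker closest to `person` in (age, income) space."""
    return min(controls, key=lambda c: math.dist(person[:2], c[:2]))

# Pair every smoker with a look-alike and average the stress differences.
pairs = [(t, nearest_control(t)) for t in treated]
matched_effect = sum(t[3] - c[3] for t, c in pairs) / len(pairs)

# For comparison: the raw difference in group means, ignoring covariates.
naive_effect = (sum(p[3] for p in treated) / len(treated)
                - sum(p[3] for p in controls) / len(controls))
print(matched_effect, naive_effect)
```

The matched estimate only adjusts for the covariates used in the distance, so any unobserved difference between smokers and their look-alikes still biases it.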

3. Instrumental Variable (IV): This is probably the one I find hardest to implement. Following are the steps to implement this technique:

  1. Find the cause – effect pair.
  2. Find an attribute which is related to the cause but is independent of the error which we get by regressing the cause-effect pair. This variable is known as the Instrumental Variable.
  3. Now estimate the cause variables using IV.
  4. Try regressing estimated cause – effect to find the actual coefficient of causality.

In observational data, any regression technique applied to a cause-effect pair can give a biased coefficient. Using this methodology, we arrive at a consistent estimate, provided the instrument is valid. For example, in the cigarette and mental stress pair, we might think the relationship could be influenced by reverse causality.

Now, if we can find any information which is connected to cigarette consumption but not to mental stress, we might be able to find the actual relationship. Generally, IVs are regulation-based variables. For example, suppose we found a tax regulation which increased only the cigarette price and led to lower consumption of cigarettes across the board. We can now try estimating the effect on mental stress using the 4-step method discussed above.
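The four steps can be sketched as a two-stage regression on simulated data. This is a minimal illustration, not a production estimator: the data-generating numbers, the binary `tax` instrument, and the unobserved confounder `u` are all assumptions made up for the example (a confounder is simulated here rather than reverse causality, since it is simpler to generate).

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

random.seed(2)
TRUE_EFFECT = 1.5   # assumed effect of one extra cigarette on the stress score
z, x, y = [], [], []
for _ in range(5000):
    tax = random.choice([0, 1])    # instrument: shifts consumption, not stress directly
    u = random.gauss(0, 1)         # unobserved factor raising both smoking and stress
    cigs = 10 - 3 * tax + 2 * u + random.gauss(0, 1)
    stress = TRUE_EFFECT * cigs + 4 * u + random.gauss(0, 1)
    z.append(tax); x.append(cigs); y.append(stress)

# Plain regression of effect on cause: biased because u sits in the error term.
ols = cov(x, y) / cov(x, x)

# Step 3: predict the cause from the instrument alone.
b1 = cov(z, x) / cov(z, z)
x_hat = [mean(x) + b1 * (zi - mean(z)) for zi in z]

# Step 4: regress the effect on the predicted cause.
iv = cov(x_hat, y) / cov(x_hat, x_hat)
print(round(ols, 2), round(iv, 2))
```

With a single instrument the second-stage slope equals cov(z, y) / cov(z, x). The estimate is only as good as the claim that the tax affects stress through cigarette consumption alone.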

4. Regression discontinuity design: This is one of my favourite choices. It makes the observational data really close to an experimental design.

In the graph shown below, we are looking for a dimension along which the ratio of treated to non-treated population jumps sharply. Suppose we want to test the effect of a college scholarship on students' grades at the end of the course. Note that the scholarship is offered to students scoring more than 80% in their pre-college examinations. Here’s the twist. Because these students are already bright, they might continue being on top in future as well. Hence, this is a very difficult cause-effect relation to crack!

But what if we compare students who scored just below 80% (say 79.9%) with those who scored just above 80% (say 80.1%) on their grades at the end of college? The assumption is that a 79.9% student won’t be much different from an 80.1% student, so the only thing which differs is the scholarship. This is known as Quasi Randomized Selection.

Hence, the results come very close to sound conclusions on causality. The only challenge with this methodology is that it is very difficult to find a dimension which gives such a clean split between the treated and non-treated population.
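The comparison of students just below and just above the cutoff can be sketched on simulated data. All numbers here are assumptions invented for the illustration: the 80% cutoff comes from the text, but `TRUE_EFFECT`, the one-point `BANDWIDTH`, and the way grades depend on entry scores are made up.

```python
import random

def mean(v):
    return sum(v) / len(v)

random.seed(3)
CUTOFF, TRUE_EFFECT, BANDWIDTH = 80.0, 5.0, 1.0
students = []
for _ in range(40000):
    score = random.uniform(50, 100)            # pre-college exam score
    scholarship = score >= CUTOFF
    # Brighter students get better grades regardless of the scholarship.
    grade = 40 + 0.3 * score + (TRUE_EFFECT if scholarship else 0.0) + random.gauss(0, 3)
    students.append((score, scholarship, grade))

# Comparing all scholarship holders with everyone else mixes in ability.
naive = (mean([g for _, s, g in students if s])
         - mean([g for _, s, g in students if not s]))

# Comparing only students within BANDWIDTH of the cutoff keeps ability nearly equal.
just_above = [g for sc, _, g in students if CUTOFF <= sc < CUTOFF + BANDWIDTH]
just_below = [g for sc, _, g in students if CUTOFF - BANDWIDTH <= sc < CUTOFF]
rdd = mean(just_above) - mean(just_below)
print(round(naive, 2), round(rdd, 2))
```

A simple difference in means still carries a small bias from the slope inside the window. Fitting a separate straight line on each side of the cutoff and comparing the two fitted values at the cutoff is the usual refinement.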

What is Correlation?

Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. Put simply, correlation is a relationship between things. The most common objective of the analysis is to identify the extent to which one variable relates to another, i.e., to see how a target variable depends on an independent variable.

A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the difference in the values of the other variable.

What is the Difference Between Correlation and Causation?

Let’s start with the basics. What is the definition of causation versus correlation?

What is correlation?

The Australian Bureau of Statistics provides a great definition of correlation:

“[It is] a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.”

In other words, a change in one variable will typically be mirrored by a positive or negative change in the other.

What are the different types of correlations?

  • Positive correlation: Variables A and B move in the same direction. For example, as Variable A increases, so does B.
  • Negative correlation: Variables A and B move in opposite directions. For example, as Variable A increases, B decreases.
  • No correlation: There is no apparent link between Variables A and B.

The correlation coefficient, which measures the strength of the linear relationship between two variables, can range from -1 (perfect negative correlation) to 1 (perfect positive correlation). The closer the correlation coefficient is to either -1 or 1, the stronger the relationship. On the other hand, a correlation coefficient of 0 indicates that there is no linear correlation between these two variables.
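The three cases can be reproduced with a direct implementation of the Pearson coefficient. This is a small sketch with made-up numbers; the helper name `pearson_r` and the data are assumptions for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

a = [1, 2, 3, 4, 5]
r_positive = pearson_r(a, [2, 4, 6, 8, 10])   # B rises with A
r_negative = pearson_r(a, [10, 8, 6, 4, 2])   # B falls as A rises
r_none = pearson_r(a, [2, 1, 3, 1, 2])        # no linear trend
print(r_positive, r_negative, r_none)
```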

However, a correlation does not necessarily mean the given independent and dependent variables are causally linked. Which brings us to causation…

What is causation?

Also known as ‘causality,’ the Australian Bureau of Statistics goes on to define causation the following way:

“…one event is the result of the occurrence of the other event i.e., there is a causal relationship between the two events. This is also referred to as cause and effect.”

In other words, does one variable actually impact the other?

A well-designed graph organizes its data in a way that allows the reader to see the main conclusion the graph maker has drawn from their data set. A particularly clever graph might present enough information for the reader to draw two, three, or even more, conclusions from the one graph.

When designing a graph it is important to keep in mind what conclusions you want the reader to draw from the graph—and then to design the graph in such a way as to draw the reader’s attention to the data that lead to that conclusion. When reading someone else’s graph it is important to look at the graph with a mind to determining what conclusions are being presented to you. For instance, if any control data sets are plotted, what alternative explanations do they rule out? What conclusions are left to explain the experimental group data?

One way to highlight a particular conclusion you want your reader to draw from your data plot is to draw lines through the data to illustrate a correlation between your data variables.

A correlation is a measure of how strongly one variable is related to another. If, for instance, weight were perfectly correlated with height (it isn’t), then every person would fall exactly on one straight line relating the two, and knowing someone's height would tell you their weight exactly. In the simplest such case, where the line passes through zero, a person twice as tall as another would also be exactly twice as heavy, and a person 50% as tall as another would weigh exactly 50% as much.

One reason to determine if certain variables are correlated is to investigate whether one causes the other. For instance, if smoking causes lung cancer, then we expect to find that incidences of lung cancer should correlate well with smoking rates. Groups with more smokers should have higher rates of lung cancer. Groups with fewer smokers should have lower rates of lung cancer, etc. If you are trying to prove one thing “causes” another (known as having a “causal relationship”) then one line of evidence would be that those two things have a strong correlation with one another.

One important thing to remember about correlated data is that correlation does not prove causation. That is to say, if you have causation you will usually find correlation, but just having correlation does not always mean that there is a causal relationship.

Sometimes two things are correlated because one causes another. For instance, we are now certain that lung cancer rates are correlated with smoking rates because smoking does cause cancer.

But sometimes two things are correlated because they are both influenced by a third variable that you might or might not be aware of. For instance, there might be a correlation between ice cream sales and physical assaults. When ice cream sales are higher, assault rates are higher, and when ice cream sales are lower, assault rates are lower. However, it is rather unlikely that eating ice cream causes people to get into fights. More likely there is a third variable we are missing. Perhaps it is heat. On hotter days more people buy ice cream, and also on hotter days more people are short-tempered and get into more fights.
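The role of the third variable can be made concrete with a simulation. This is a hedged sketch: the temperature range, the coefficients, and the noise levels are invented, and ice cream sales have no effect on assaults by construction. The helper `residuals` removes the straight-line effect of heat from each series before correlating what is left.

```python
import math
import random

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residuals(y, x):
    """What is left of y after subtracting a straight-line fit on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

random.seed(4)
heat = [random.uniform(0, 35) for _ in range(2000)]        # daily temperature
ice_cream = [5 * t + random.gauss(0, 20) for t in heat]    # sales depend on heat only
assaults = [0.5 * t + random.gauss(0, 3) for t in heat]    # assaults depend on heat only

raw_r = pearson_r(ice_cream, assaults)
r_given_heat = pearson_r(residuals(ice_cream, heat), residuals(assaults, heat))
print(round(raw_r, 2), round(r_given_heat, 2))
```

The raw correlation is strong even though neither variable influences the other, and it should largely vanish once heat is accounted for. This only works because heat was measured; an unmeasured third variable cannot be removed this way.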

The correlation between ice cream sales and assaults is evidence that ice cream might cause fights, but it isn’t definitive proof and more evidence of that hypothesis would be needed to convince anyone. (In the case of the connection between lung cancer and smoking, the evidence started out as correlations, but eventually came to include many other types of evidence, which is why we now accept the causal relationship.)

When you are plotting data, showing that two variables correlate well is interesting, and can be used as one piece of evidence of perhaps a causal relationship, but the correlation by itself will never be enough. Often a correlation is the first step in establishing the causal relationship.

Association, correlation and causation

Correlation implies association, but not causation. Conversely, causation implies association, but not correlation.

Most studies include multiple response variables, and the dependencies among them are often of great interest. For example, we may wish to know whether the levels of mRNA and the matching protein vary together in a tissue, or whether increasing levels of one metabolite are associated with changed levels of another. This month we begin a series of columns about relationships between variables (or features of a system), beginning with how pairwise dependencies can be characterized using correlation.

Two variables are independent when the value of one gives no information about the value of the other. For variables X and Y, we can express independence by saying that the chance of measuring any one of the possible values of X is unaffected by the value of Y, and vice versa, or by using conditional probability, P(X|Y) = P(X). For example, successive tosses of a coin are independent: for a fair coin, P(H) = 0.5 regardless of the outcome of the previous toss, because a toss does not alter the properties of the coin. In contrast, if a system is changed by observation, measurements may become associated or, equivalently, dependent. Cards drawn without replacement are not independent; when a red card is drawn, the probability of drawing a black card increases, because now there are fewer red cards.
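Both examples can be verified with exact arithmetic. The sketch below assumes a fair coin and a standard 52-card deck with 26 red and 26 black cards; the variable names are chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

# Two tosses of a fair coin: the four outcomes are equally likely.
tosses = list(product("HT", repeat=2))
first_heads = [t for t in tosses if t[0] == "H"]
p_heads = Fraction(sum(t[1] == "H" for t in tosses), len(tosses))
p_heads_given_heads = Fraction(sum(t[1] == "H" for t in first_heads), len(first_heads))

# Cards drawn without replacement: removing a red card changes the deck.
red, black = 26, 26
p_black = Fraction(black, red + black)
p_black_given_red_drawn = Fraction(black, red + black - 1)

print(p_heads, p_heads_given_heads)        # equal: the tosses are independent
print(p_black, p_black_given_red_drawn)    # different: the draws are dependent
```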

Association should not be confused with causality; if X causes Y, then the two are associated (dependent). However, associations can arise between variables in the presence (i.e., X causes Y) and absence (i.e., they have a common cause) of a causal relationship, as we've seen in the context of Bayesian networks¹. As an example, suppose we observe that people who daily drink more than 4 cups of coffee have a decreased chance of developing skin cancer. This does not necessarily mean that coffee confers resistance to cancer; one alternative explanation would be that people who drink a lot of coffee work indoors for long hours and thus have little exposure to the sun, a known risk. If this is the case, then the number of hours spent outdoors is a confounding variable, that is, a cause common to both observations. In such a situation, a direct causal link cannot be inferred; the association merely suggests a hypothesis, such as a common cause, but does not offer proof. In addition, when many variables in complex systems are studied, spurious associations can arise. Thus, association does not imply causation.

In everyday language, dependence, association and correlation are used interchangeably. Technically, however, association is synonymous with dependence and is different from correlation (Fig. 1a). Association is a very general relationship: one variable provides information about another. Correlation is more specific: two variables are correlated when they display an increasing or decreasing trend. For example, in an increasing trend, observing that X > μ_X implies that it is more likely that Y > μ_Y. Because not all associations are correlations, and because causality, as discussed above, can be connected only to association, we cannot equate correlation with causality in either direction.
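A small constructed case shows an association that is not a correlation. In this sketch Y is completely determined by X, yet there is no increasing or decreasing trend; the data and the helper `pearson_r` are assumptions made up for the illustration.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]      # knowing x tells you y exactly

r = pearson_r(x, y)
print(r)                     # no linear trend despite perfect dependence
```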

Figure 1. (a) Scatter plots of associated (but not correlated), non-associated and correlated variables. In the lower association example, variance in y is increasing with x. (b) The Pearson correlation coefficient (r, black) measures linear trends, and the Spearman correlation coefficient (s, red) measures increasing or decreasing trends. (c) Very different data sets may have similar r values. Descriptors such as curvature or the presence of outliers can be more specific.

For quantitative and ordinal data, there are two primary measures of correlation: Pearson's correlation (r), which measures linear trends, and Spearman's (rank) correlation (s), which measures increasing and decreasing trends that are not necessarily linear (Fig. 1b). Like other statistics, these have population values, usually referred to as ρ. There are other measures of association that are also referred to as correlation coefficients, but which might not measure trends.

When “correlated” is used unmodified, it generally refers to Pearson's correlation, given by ρ(X, Y) = cov(X, Y)/(σ_X σ_Y), where cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]. The correlation computed from the sample is denoted by r. Both variables must be on an interval or ratio scale; r cannot be interpreted if either variable is ordinal. For a linear trend, |r| = 1 in the absence of noise and decreases with noise, but it is also possible that |r| < 1 for perfectly associated nonlinear trends (Fig. 1b). In addition, data sets with very different associations may have the same correlation (Fig. 1c). Thus, a scatter plot should be used to interpret r. If either variable is shifted or scaled by a positive factor, r does not change and r(X, Y) = r(aX + b, Y) for a > 0. However, r is sensitive to nonlinear monotone (increasing or decreasing) transformation. For example, when applying log transformation, r(X, Y) ≠ r(X, log(Y)). It is also sensitive to the range of X or Y values and can decrease as values are sampled from a smaller range.

If an increasing or decreasing but nonlinear relationship is suspected, Spearman's correlation is more appropriate. It is a nonparametric method that converts the data to ranks and then applies the formula for the Pearson correlation. It can be used when X is ordinal and is more robust to outliers. It is also not sensitive to monotone increasing transformations because they preserve ranks—for example, s(X, Y) = s(X, log(Y)). For both coefficients, a smaller magnitude corresponds to increasing scatter or a non-monotonic relationship.
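The behaviour of the two coefficients under transformation can be checked numerically. This is a minimal sketch on six made-up points; the helper `ranks` assumes there are no tied values (ties would need averaged ranks), and all names are chosen for the illustration.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_s(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 3.5, 3.0, 8.0, 20.0, 55.0]       # increasing but clearly nonlinear
log_y = [math.log(v) for v in y]

r_raw, r_log = pearson_r(x, y), pearson_r(x, log_y)      # differ
s_raw, s_log = spearman_s(x, y), spearman_s(x, log_y)    # identical
r_shifted = pearson_r([3 * v + 7 for v in x], y)         # same as r_raw
print(round(r_raw, 3), round(r_log, 3), round(s_raw, 3), round(s_log, 3))
```

The log transform changes Pearson's r because it changes the shape of the trend, while Spearman's s is unchanged because the ordering of the values is preserved.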

It is possible to see large correlation coefficients even for random data (Fig. 2a). Thus, r should be reported together with a P value, which measures the degree to which the data are consistent with the null hypothesis that there is no trend in the population. For Pearson's r, to calculate the P value we use the test statistic √[d.f. × r²/(1 − r²)], which is t-distributed with d.f. = n – 2 when (X, Y) has a bivariate normal distribution (P for s does not require normality) and the population correlation is 0. Even more informative is a 95% confidence interval, often calculated using the bootstrap method (ref. 2). In Figure 2a we see that values with |r| < 0.63 are not statistically significant: their confidence intervals span zero. More important, there are very large correlations that are statistically significant (Fig. 2a) even though they are drawn from a population in which the true correlation is ρ = 0. These spurious cases (Fig. 2b) should be expected any time a large number of correlations is calculated. For example, a study with only 140 genes yields 9,730 pairwise correlations. Conversely, modest correlations between a few variables, known to be noisy, could be biologically interesting.
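The numbers quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the two-sided 5% critical value of t with 8 degrees of freedom, 2.306, taken from standard tables (the Python standard library has no t distribution); the function names are illustrative.

```python
import math

def t_statistic(r, n):
    """Test statistic sqrt(d.f. * r^2 / (1 - r^2)) with d.f. = n - 2."""
    df = n - 2
    return math.sqrt(df * r * r / (1 - r * r))

def critical_r(t_crit, n):
    """Smallest |r| whose test statistic reaches t_crit (algebraic inverse of t_statistic)."""
    df = n - 2
    return t_crit / math.sqrt(df + t_crit ** 2)

T_CRIT_DF8 = 2.306  # assumption: tabulated two-sided 5% critical value for 8 d.f.

r_threshold = critical_r(T_CRIT_DF8, n=10)   # about 0.63 for samples of size 10
n_pairs = math.comb(140, 2)                  # distinct pairs among 140 genes

print(f"|r| needed for significance at n = 10: {r_threshold:.3f}")
print(f"pairwise correlations among 140 genes: {n_pairs}")
```

With thousands of pairs tested at α = 0.05, hundreds of nominally significant correlations are expected by chance alone.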

(a) Distribution (left) and 95% confidence intervals (right) of correlation coefficients of 10,000 n = 10 samples of two independent normally distributed variables. Statistically significant coefficients (α = 0.05) and corresponding intervals that do not include r = 0 are highlighted in blue. (b) Samples with the three largest and smallest correlation coefficients (statistically significant) from a.
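A simulation along the lines of this figure can be sketched as follows. It is not the authors' code: the seed, the function names and the hard-coded threshold |r| > 0.632 (the approximate 5% cutoff for n = 10) are assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fraction_significant(n_samples=10_000, n=10, r_crit=0.632, seed=1):
    """Share of independent normal samples whose |r| exceeds the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]   # independent of xs, so rho = 0
        if abs(pearson_r(xs, ys)) > r_crit:
            hits += 1
    return hits / n_samples

frac = fraction_significant()
print(f"fraction of 'significant' correlations in pure noise: {frac:.3f}")
```

The fraction should land near the chosen α of 0.05, even though every sample comes from a population with no correlation at all.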

Because P depends on both r and the sample size, it should never be used as a measure of the strength of the association. It is possible for a smaller r, whose magnitude can be interpreted as the estimated effect size, to be associated with a smaller P merely because of a large sample size (ref. 3). Statistical significance of a correlation coefficient does not imply substantive and biologically relevant significance.
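A short numerical sketch of this point, using the test statistic given earlier in the text. The two (r, n) pairs are made up for illustration, and the critical values in the comments are standard table values rather than something computed here.

```python
import math

def t_statistic(r, n):
    """Test statistic sqrt(d.f. * r^2 / (1 - r^2)) with d.f. = n - 2."""
    df = n - 2
    return math.sqrt(df * r * r / (1 - r * r))

# Hypothetical case 1: weak correlation, large sample.
weak_large = t_statistic(0.1, 1000)   # about 3.18, above the ~1.96 cutoff for large d.f.

# Hypothetical case 2: strong correlation, small sample.
strong_small = t_statistic(0.5, 10)   # about 1.63, below the 2.306 cutoff for 8 d.f.

print(f"r = 0.1, n = 1000: t = {weak_large:.2f}")
print(f"r = 0.5, n = 10:   t = {strong_small:.2f}")
```

The weaker correlation is the statistically significant one, purely because of sample size, so P says nothing about which association is stronger.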

The value of both coefficients will fluctuate with different samples, as seen in Figure 2, as well as with the amount of noise and/or the sample size. With enough noise, the correlation coefficient can cease to be informative about any underlying trend. Figure 3a shows a perfectly correlated relationship (X, X) where X is a set of n = 20 points uniformly distributed in the range [0, 1] in the presence of different amounts of normally distributed noise with a standard deviation σ. As σ increases from 0.1 to 0.3 to 0.6, r(X, X + σ) decreases from 0.95 to 0.69 to 0.42. At σ = 0.6 the noise is high enough that r = 0.42 (P = 0.063) is not statistically significant—its confidence interval includes ρ = 0.

(a) r of an n = 20 sample of (X, X + ɛ), where ɛ is the normally distributed noise scaled to standard deviation σ. The amount of scatter and value of r at three values of σ are shown. The shaded area is the 95% confidence interval. Intervals that do not include r = 0 are highlighted in blue (σ < 0.58), and those that do are highlighted in gray and correspond to nonsignificant r values (ns e.g., r = 0.42 with P = 0.063). (b) As sample size increases, r becomes less variable, and the estimate of the population correlation improves. Shown are samples with increasing size and noise: n = 20 (σ = 0.1), n = 100 (σ = 0.3) and n = 200 (σ = 0.6). Traces at the bottom show r calculated from a subsample, created from the first m values of each sample.

When the linear trend is masked by noise, larger samples are needed to confidently measure the correlation. Figure 3b shows how the correlation coefficient varies for subsamples of size m drawn from samples at different noise levels: m = 4–20 (σ = 0.1), m = 4–100 (σ = 0.3) and m = 4–200 (σ = 0.6). When σ = 0.1, the correlation coefficient converges to 0.96 once m > 12. However, when noise is high, not only is the value of r lower for the full sample (e.g., r = 0.59 for σ = 0.3), but larger subsamples are needed to robustly estimate ρ.
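The effect of noise and sample size can be explored with a small simulation. For X uniform on [0, 1] (variance 1/12) and independent noise with standard deviation σ, the population correlation of (X, X + ɛ) is 1/√(1 + 12σ²). This gives roughly 0.94, 0.69 and 0.43 for σ = 0.1, 0.3 and 0.6, close to the sample values quoted above. The code below is a sketch under those assumptions; the seed and function names are arbitrary, so individual traces will not match the figure.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def population_rho(sigma):
    """Correlation of (X, X + noise) for X uniform on [0, 1]."""
    return 1 / math.sqrt(1 + 12 * sigma ** 2)

def running_r(sigma, n, seed=0, start=4):
    """r computed from the first m points of one sample, for m = start..n."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [x + rng.gauss(0, sigma) for x in xs]
    return [pearson_r(xs[:m], ys[:m]) for m in range(start, n + 1)]

for sigma, n in [(0.1, 20), (0.3, 100), (0.6, 200)]:
    trace = running_r(sigma, n)
    print(f"sigma = {sigma}: rho = {population_rho(sigma):.2f}, "
          f"r at m = 4: {trace[0]:.2f}, r at m = {n}: {trace[-1]:.2f}")
```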

The Pearson correlation coefficient can also be used to quantify how much fluctuation in one variable can be explained by its correlation with another variable. A previous discussion about analysis of variance (ref. 4) showed that the effect of a factor on the response variable can be described as explaining the variation in the response: the response varied, and once the factor was accounted for, the variation decreased. The squared Pearson correlation coefficient r² has a similar role: it is the proportion of variation in Y explained by X (and vice versa). For example, r = 0.05 means that only 0.25% of the variance of Y is explained by X (and vice versa), and r = 0.9 means that 81% of the variance of Y is explained by X. This interpretation is helpful in assessments of the biological importance of the magnitude of r when it is statistically significant.
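One way to see the "variance explained" reading of r² is to fit a least-squares line and compare the residual variation with the total variation. The sketch below uses made-up data and a hypothetical helper name; it relies on the standard identity r² = 1 − SS_residual/SS_total for simple linear regression.

```python
def r_squared_two_ways(xs, ys):
    """Return (r^2 from the correlation, 1 - SS_res/SS_tot from a least-squares line)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)             # total variation in Y
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return r * r, 1 - ss_res / syy

# Hypothetical measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 3.8, 6.1, 5.7, 7.4, 8.3]

from_r, from_fit = r_squared_two_ways(x, y)
print(f"r^2 = {from_r:.4f}, explained fraction from the fit = {from_fit:.4f}")

# The two worked figures from the text:
print(f"r = 0.05 -> {0.05 ** 2:.2%} explained; r = 0.9 -> {0.9 ** 2:.0%} explained")
```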

Besides the correlation among features, we may also talk about the correlation among the items we are measuring. This is also expressed as the proportion of variance explained. In particular, if the units are clustered, then the intraclass correlation (which should be thought of as a squared correlation) is the percent variance explained by the clusters and given by σ_b²/(σ_b² + σ_w²), where σ_b² is the between-cluster variation and σ_b² + σ_w² is the total between- and within-cluster variation. This formula was discussed previously in an examination of the percentage of total variance explained by biological variation (ref. 5), where the clusters are the technical replicates for the same biological replicate. As with the correlation between features, the higher the intraclass correlation, the less scatter in the data, this time measured not from the trend curve but from the cluster centers.
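A sketch of the intraclass correlation formula follows. The first function is just the ratio from the text. The second estimates the two variance components from balanced clustered data with the usual one-way ANOVA estimator, which is a technique swapped in here for illustration and is not described in the source. All names, parameter values and the simulated data are assumptions.

```python
import random

def intraclass_correlation(var_between, var_within):
    """sigma_b^2 / (sigma_b^2 + sigma_w^2)."""
    return var_between / (var_between + var_within)

def estimate_icc(clusters):
    """One-way ANOVA estimate of the ICC for clusters of equal size k."""
    g = len(clusters)
    k = len(clusters[0])
    means = [sum(c) / k for c in clusters]
    grand = sum(means) / g
    msb = k * sum((m - grand) ** 2 for m in means) / (g - 1)          # between mean square
    msw = sum((v - m) ** 2 for c, m in zip(clusters, means) for v in c) / (g * (k - 1))
    var_between = (msb - msw) / k     # since E[MSB] = sigma_w^2 + k * sigma_b^2
    return intraclass_correlation(var_between, msw)

# Simulated example: 2,000 clusters of 5 replicates, sigma_b = 2, sigma_w = 1.
rng = random.Random(7)
clusters = []
for _ in range(2000):
    center = rng.gauss(0, 2.0)
    clusters.append([center + rng.gauss(0, 1.0) for _ in range(5)])

true_icc = intraclass_correlation(2.0 ** 2, 1.0 ** 2)   # 4 / (4 + 1) = 0.8
print(f"true ICC = {true_icc:.2f}, estimated ICC = {estimate_icc(clusters):.2f}")
```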

Association is the same as dependence and may be due to direct or indirect causation. Correlation implies specific types of association such as monotone trends or clustering, but not causation. For example, when the number of features is large compared with the sample size, large but spurious correlations frequently occur. Conversely, when there are a large number of observations, small and substantively unimportant correlations may be statistically significant.

Subtle issues

Although the above examples were obviously silly, correlation is very often mistaken for causation in ways that are not immediately obvious in the real world. When reading and interpreting statistics, one must take great care to understand exactly what the data and its statistics are implying – and more importantly, what they are not implying.

One recent example of the need for caution in interpreting data is the excitement earlier this year surrounding the apparent groundbreaking detection of gravitational waves – an announcement that appears to have been made prematurely, before all the variables that were affecting the data were accounted for.

Unfortunately, analysing statistics, probabilities and risks is not a skill set wired into our human intuition, and so it is all too easy to be led astray. Entire books have been written on the subtle ways in which statistics can be misinterpreted (or used to mislead). To help keep your guard up, here are some common slippery statistical problems that you should be aware of:

1) The Healthy Worker Effect, where sometimes two groups cannot be directly compared on a level playing field.

Consider a hypothetical study comparing the health of a group of office-workers with the health of a group of astronauts. If the study shows no significant difference between the two – no correlation between healthiness and working environment – are we to conclude that living and working in space carries no long-term health risks for astronauts?

No! The groups are not on the same footing: the astronaut corps screen applicants to find healthy candidates, who then maintain a comprehensive fitness regime in order to proactively combat the effects of living in “microgravity”.

We would therefore expect them to be significantly healthier than office workers, on average, and should rightly be concerned if they were not.

2) Categorisation and the Stage Migration Effect – shuffling people between groups can have dramatic effects on statistical outcomes.

This is also known as the Will Rogers effect, after the US comedian who reportedly quipped:

When the Okies left Oklahoma and moved to California, they raised the average intelligence level in both states.

To illustrate, imagine dividing a large group of friends into a “short” group and a “tall” group (perhaps in order to arrange them for a photo). Having done so, it’s surprisingly easy to raise the average height of both groups at once.

Simply ask the shortest person in the “tall” group to switch over to the “short” group. The “tall” group lose their shortest member, thus bumping up their average height – but the “short” group gain their tallest member yet, and thus also gain in average height.

This has major implications in medical studies, where patients are often sorted into “healthy” or “unhealthy” groups in the course of testing a new treatment. If diagnostic methods improve, some very-slightly-unhealthy patients may be recategorised – leading to the health outcomes of both groups improving, regardless of how effective (or not) the treatment is.
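The height example above is easy to verify with a few numbers. The heights (in centimetres) are invented for illustration.

```python
from statistics import mean

# Hypothetical heights in cm.
short = [150, 155, 160]
tall = [170, 180, 190]

before = (mean(short), mean(tall))    # (155, 180)

# Move the shortest member of the "tall" group into the "short" group.
mover = min(tall)
tall_after = list(tall)
tall_after.remove(mover)
short_after = short + [mover]

after = (mean(short_after), mean(tall_after))   # (158.75, 185)

print(f"before: short = {before[0]}, tall = {before[1]}")
print(f"after:  short = {after[0]}, tall = {after[1]}")
```

Both averages rise even though nobody's height changed, which is exactly what happens when borderline patients are recategorised.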

Picking and choosing among the data can lead to the wrong conclusions. The skeptics see periods of cooling (blue) when the data really show long-term warming (green).

3) Data mining – when an abundance of data is present, bits and pieces can be cherry-picked to support any desired conclusion.

This is bad statistical practice, but if done deliberately can be hard to spot without knowledge of the original, complete data set.

Consider the above graph showing two interpretations of global warming data, for instance. Or fluoride – in small amounts it is one of the most effective preventative medicines in history, but the positive effect disappears entirely if one only ever considers toxic quantities of fluoride.

For similar reasons, it is important that the procedures for a given statistical experiment are fixed in place before the experiment begins and then remain unchanged until the experiment ends.

4) Clustering – which is to be expected even in completely random data.

Consider a medical study examining how a particular disease, such as cancer or multiple sclerosis, is geographically distributed. If the disease strikes at random (and the environment has no effect) we would expect to see numerous clusters of patients as a matter of course. If patients were spread out perfectly evenly, the distribution would be most un-random indeed!

So the presence of a single cluster, or a number of small clusters of cases, is entirely normal. Sophisticated statistical methods are needed to determine just how much clustering is required to deduce that something in that area might be causing the illness.

Unfortunately, any cluster at all – even a non-significant one – makes for an easy (and at first glance, compelling) news headline.
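To get a feel for how clumpy pure chance is, one can scatter cases uniformly at random over a map grid and count cases per cell. The grid size, number of cases and seed below are arbitrary assumptions chosen for illustration.

```python
import random
from collections import Counter

def random_case_counts(n_cases=100, grid=10, seed=42):
    """Drop n_cases uniformly at random onto a grid x grid map; count cases per cell."""
    rng = random.Random(seed)
    return Counter((rng.randrange(grid), rng.randrange(grid)) for _ in range(n_cases))

counts = random_case_counts()
busiest = max(counts.values())        # size of the largest "cluster"
empty_cells = 10 * 10 - len(counts)   # cells that received no case at all

print(f"average cases per cell: {100 / (10 * 10):.1f}")
print(f"largest count in a single cell: {busiest}")
print(f"cells with no cases: {empty_cells}")
```

Although the average is one case per cell, some cells collect several cases and many stay empty, with no underlying cause at all.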

One must always be wary when drawing conclusions from data! Randall Munroe, CC BY-NC

Statistical analysis, like any other powerful tool, must be used very carefully – and in particular, one must always be careful when drawing conclusions based on the fact that two quantities are correlated.

Instead, we must always insist on separate evidence to argue for cause-and-effect – and that evidence will not come in the form of a single statistical number.

Seemingly compelling correlations, say between given genes and schizophrenia or between a high fat diet and heart disease, may turn out to be based on very dubious methodology.

We are perhaps as a species cognitively ill prepared to deal with these issues. As Canadian educator Kieran Egan put it in his book Getting it Wrong from the Beginning:

The bad news is that our evolution equipped us to live in small, stable, hunter-gatherer societies. We are Pleistocene people, but our languaged brains have created massive, multicultural, technologically sophisticated and rapidly changing societies for us to live in.

In consequence, we must constantly resist the temptation to see meaning in chance and to confuse correlation and causation.

This article is part of a series on Understanding Research.


No matter what you want to believe about any medical quackery, no matter how hard you want to convince yourself it is real, and no matter how much you want everyone to believe your anecdotes, getting from a potential correlation to causation is very difficult. It requires a logical process, not a claim that it must be so because of anecdotes or belief.

There is a logical process that is required to get from correlation to causality. Those who attempt to shortcut that process to reach a pre-ordained conclusion have established neither correlation nor causality.

Causation and Correlation in Education

I have had an interesting fortnight in my role as a school leader and Research-lead. In this job, you get to share a lot of teacher training materials and the like, coupled with, or more often decoupled from, the evidence. In just the last couple of weeks I have been repeatedly ‘exposed’ to popular zombie edu-theories that simply won’t die. Discredited ideas keep bouncing back, recast and relabeled for the promise of a new generation of hard-pressed teachers.

I’ve had the ubiquitous learning styles foisted into my inbox. The crumbling edifice that is the ‘learning pyramid’, or cone, or whatever it is branded as. I have seen a fistful of dubious GCSE programmes that proclaim that their evidence will secure the GCSEs of your students’ dreams. Sadly, I really could go on and on.

There is no easy antidote. Working with experts like Professor Rob Coe and Stuart Kime in the RISE project helps. Reading excellent blogs like this one from Nick Rose certainly helps too. Reading the newly created Edudatalab was a boon in this regard. Networks like ResearchEd and organizations like the Institute of Effective Education provide ballast to still the ship against the rising tide of bullcrap.

What we eventually need is a workforce of teachers who are critical consumers of research evidence and powerfully evidence-informed.

Now, I have a huge amount to learn about research evidence, but one of the turning points in my understanding was when I grasped the difference between correlation and causation (a threshold concept for research evidence):

“Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable.

Causation indicates that one event is the result of the occurrence of the other event i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.

Theoretically, the difference between the two types of relationships are easy to identify — an action or occurrence can cause another (e.g. smoking causes an increase in the risk of developing lung cancer), or it can correlate with another (e.g. smoking is correlated with alcoholism, but it does not cause alcoholism). In practice, however, it remains difficult to clearly establish cause and effect, compared with establishing correlation.” (Source: Australian Bureau of Statistics)

This is of course of crucial importance in schools. We are constantly being sold silver bullets whose evidence is based on loose correlation (or worse) and nothing like causation. Fundamentally, we must move toward better evaluating what we do. We can ask the question: when do we attempt to put a control group and a treatment group in place for our latest innovations? Finding evidence of causation, which is obviously very tricky, requires decent controls to be in place and a transparent statistical model that doesn’t fiddle the numbers to dredge up a positive result.

Most ‘evidence’ in schools, and education more widely, fails this test.

The debate about evidence and what has value is now part of the educational landscape. The evidence of a randomised controlled trial is matched up against political ideologies and personal prejudices at every step. We are forced to navigate a minefield of information. Teachers don’t know what to believe and therefore they stop listening.

Of course, we in schools are ourselves guilty of this basic failing when we analyze our evidence. In our punitive accountability model we are not encouraged to honestly evaluate our interventions and their impact. We work backwards: we spent money on X, results improved generally = X caused the improvement and is worth the money. The perils of this lazy correlating pattern are brilliantly exposed by Tyler Vigen’s aptly titled website ‘Spurious Correlations’ (thanks Stuart Kime for sharing this gem). Take a look at these two graphs – as they’re graphs, we of course give them credence:

And there is this irrefutable evidence too!

These examples are comic, but there isn’t a quantum leap from them to our estimations when we evaluate school spending and suchlike. We buy shiny new tablets, or we create a brilliant brain-friendly programme, and – hey presto – students do better. Our new thing is the thing, of course! School leaders and teachers can sink their heart and their next promotion into such interventions – there are potent reasons not to evaluate well and properly seek out causation and not dubious correlation. There are issues with control groups, or the efficacy of trials, but we should approach these head on in the pursuit of better evidence.

When presented with evidence we should question the correlation and causation. When setting up evaluations of our own we need to be mindful of this too. A new time-consuming intervention that costs teacher time and students’ curriculum time must be evaluated better if we really want to go some way to having robust evidence. We all have a long way to go.