Correlation vs. Causation: Understanding the Difference in Data Analysis

Learn the critical difference between correlation and causation in data analysis. Understand real-world examples and avoid common pitfalls in interpreting data.
Updated Sep 12, 2024  · 8 min read

As part of Data Literacy Month, this series will clarify key concepts from the world of data, answer the questions you may be too afraid to ask, and have some fun along the way. If you want to start at the beginning, read our first entry in the series: What is a Dataset?

Ever heard the phrase ‘correlation does not imply causation’? It’s a common pitfall that trips up data enthusiasts and professionals alike. In this article, we’ll explore the key differences between correlation and causation, debunk myths, and share real-world examples to help you avoid one of the biggest errors in data interpretation.

What is Correlation?

Correlation is a measure of the relationship between two things. That is, when one thing goes up, does the other thing go up or down? The scatter plot below shows data from the 2017 American Community Survey. On the x-axis, you can see the average yearly income for each of the fifty US states, and on the y-axis, you can see the average monthly rent payments for those states. As incomes increase, so do rent payments.

Annual Income compared to Average Monthly Rent (in USD)

If one thing goes up when the other goes up—as in this case—they are said to be positively correlated. They are said to be negatively correlated if one goes down when the other goes up.

In statistics and data science, correlation is more precise, referring to the strength of a linear relationship between two things. In the variation of the scatter plot below, a straight line has been fitted through the data. The line follows the points fairly closely, indicating a linear relationship between income and rent.

A fitted line between the two variables, indicating a positive relationship between both of them
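The strength of a linear relationship is usually summarized by the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below computes it from scratch. The numbers are made-up illustrative values, not the American Community Survey data, and the variable names are assumptions chosen for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five regions (NOT real survey data).
income = [45_000, 52_000, 58_000, 66_000, 75_000]
rent_rising = [800, 900, 950, 1_150, 1_300]   # goes up with income
rent_falling = [1_300, 1_150, 950, 900, 800]  # goes down as income goes up

r_positive = pearson(income, rent_rising)
r_negative = pearson(income, rent_falling)

print(f"positive correlation: {r_positive:.3f}")
print(f"negative correlation: {r_negative:.3f}")
```

A value close to +1 or -1 means the points sit close to a straight line. The sign tells you whether that line slopes up or down.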

By contrast, in this example below, using the classic diamonds dataset, if we plot the price of a diamond against its weight in carats and draw a line of best fit through it, you can see that the line curves upwards. That is, the price increases faster than linearly.

Diamond price plotted against weight in carats, with a curved line of best fit

Correlation will capture the positive relationship between the two things—as weight increases, so does the price—but not the non-linear aspect of that relationship.
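To see how a linear measure can understate a curved relationship, here is a minimal sketch using invented numbers in which price grows with the cube of weight. This is an assumption made for illustration and is not the diamonds dataset. The Pearson coefficient comes out below 1 even though price is completely determined by weight, while a rank-based (Spearman) correlation, computed here by applying Pearson to the ranks, equals 1 because the relationship is perfectly monotonic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank of each value (1 = smallest). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Hypothetical data: price rises faster than linearly with weight.
carats = [0.2 * k for k in range(1, 11)]
price = [1_000 * c ** 3 for c in carats]

r_linear = pearson(carats, price)               # strength of the LINEAR relationship
r_rank = pearson(ranks(carats), ranks(price))   # Spearman rank correlation

print(f"Pearson:  {r_linear:.3f}")
print(f"Spearman: {r_rank:.3f}")
```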

What is Causation?

Causation is a stronger statement than correlation. It means that changes in one thing cause another thing to change. You see examples of causation a lot in medical advice, for example, "smoking causes cancer" or "taking ibuprofen reduces pain levels."

You can also see many examples of causation in day-to-day life. Eating healthily causes improved lifestyle outcomes, working out causes you to get in better shape, and studying on DataCamp causes you to know more about data.

The Correlation-Causation Fallacy

The correlation-causation fallacy is when people assume a cause-and-effect relationship simply from correlation. In the example with income and rent, the data showed that rent payments are positively correlated with income. However, economics is complicated, and the data is insufficient to make the bolder claim that higher income causes higher rent payments.

The correlation-causation fallacy is prevalent in most societies, since everyone working in marketing would like you to believe that buying their product will make your life better, without taking the time to run a rigorous scientific experiment to test that claim. This idea was taken to its logical conclusion in The Tamperer's 1990s pop tune If You Buy This Record (Your Life Will Be Better). There are three main cases in which a correlation can occur without causation.

Coincidental Relationship

Given enough data, two completely unrelated things can show a correlation. The example below is taken from Tyler Vigen's excellent Spurious Correlations site. The line plot shows that the divorce rate in the US state of Maine is highly correlated with per capita consumption of margarine.

Divorce rate in the US state of Maine compared to per capita consumption of margarine

It would be absurd to think that eating more or less margarine could influence a divorce rate on a state-wide level, so we cannot claim causation. The correlation shown here is purely coincidental.
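The "given enough data" point can be illustrated with a small simulation. The sketch below generates 50 short, completely independent random series and then searches every pair for the strongest correlation. The series are random walks, chosen here as an assumption because unrelated yearly trends often behave this way. The sizes, seed, and function names are arbitrary choices for this example.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series where each value is the previous one plus random noise."""
    position = 0.0
    walk = []
    for _ in range(length):
        position += rng.gauss(0, 1)
        walk.append(position)
    return walk


rng = random.Random(42)
n_series, n_years = 50, 10
series = [random_walk(rng, n_years) for _ in range(n_series)]

# Compare every pair of series (1,225 pairs) and keep the strongest correlation.
pairs = list(itertools.combinations(range(n_series), 2))
best_pair = max(pairs, key=lambda p: abs(pearson(series[p[0]], series[p[1]])))
best_r = pearson(series[best_pair[0]], series[best_pair[1]])

print(f"pairs compared: {len(pairs)}")
print(f"strongest correlation: series {best_pair} with r = {best_r:.3f}")
```

None of these series influences any other, yet searching through enough pairs turns up a strong correlation. That is the same mechanism that makes it easy to find striking but meaningless matches between real-world statistics.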

Even investors aren't immune to this logical fallacy. The Super Bowl Indicator suggests that the stock market will rise if a team from the National Football Conference wins the Super Bowl, and that it will fall if a team from the American Football Conference wins.

Confounding Variable

Correlations can appear when a third thing (a "confounding variable") affects both the supposed cause and the supposed effect.

A classic example is that sunburns are correlated with ice cream sales. It would be silly to say that eating ice cream causes an increase in sunburns. A more reasonable explanation is that in hot, sunny weather, people are more likely to eat ice cream, and people are also more likely to go sunbathing. So the weather is a confounding variable.
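A short simulation can make the confounding effect concrete. In the sketch below, which is a toy model under stated assumptions and uses no real data, temperature drives both ice cream sales and sunburns, and neither has any effect on the other. The raw correlation between the two is clearly positive, but the partial correlation, which removes the linear effect of temperature, is close to zero.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)
n_days = 5_000

# Assumed toy model: temperature (standardized units) is the confounder.
temperature = [rng.gauss(0, 1) for _ in range(n_days)]
# Each outcome depends on temperature plus its own independent noise.
ice_cream = [t + rng.gauss(0, 1) for t in temperature]
sunburn = [t + rng.gauss(0, 1) for t in temperature]

r_raw = pearson(ice_cream, sunburn)
r_partial = partial_correlation(ice_cream, sunburn, temperature)

print(f"raw correlation:              {r_raw:.3f}")
print(f"controlling for temperature:  {r_partial:.3f}")
```

Partial correlation only works when the confounder has been identified and measured. In real studies, the difficult part is knowing which confounders exist in the first place.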

In 2001, a scientific paper noted that people who eat lots of vegetables and olive oil have less wrinkly skin. That's a valid observation, but Ben Goldacre noted in a TED talk that nutritionists pounced on this result and began claiming that eating olive oil causes you to get fewer wrinkles. This ignores the effect of several confounding factors. Olive oil is expensive compared to other cooking oils, so if you can afford olive oil, you are more likely to have an indoor job with less sun exposure and less likely to smoke. That is before counting the hundreds of other lifestyle differences that the study did not consider.

Reverse Causation

Reverse causation is when people get confused about which thing causes which. A silly example is that when wind turbines spin faster, you also detect faster wind speed, so you conclude that wind is caused by wind turbines spinning. Taking a moment to think about this, you should realize that the causation runs the other way around: wind causes wind turbines to spin.
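One reason the direction is easy to get wrong is that correlation is symmetric: swapping the two variables gives exactly the same number, so the coefficient alone carries no information about which one is the cause. A minimal sketch with invented wind and turbine readings (hypothetical values and units):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings, not real measurements.
wind_speed = [3, 5, 7, 9, 12]      # metres per second
turbine_rpm = [6, 9, 13, 15, 19]   # rotations per minute

r_wind_then_turbine = pearson(wind_speed, turbine_rpm)
r_turbine_then_wind = pearson(turbine_rpm, wind_speed)

print(f"corr(wind, turbine) = {r_wind_then_turbine:.4f}")
print(f"corr(turbine, wind) = {r_turbine_then_wind:.4f}")
```

Deciding the direction therefore needs information from outside the correlation itself, such as timing or a known mechanism.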

Understanding the direction of causality is a common problem in medical data analysis. For example, there is a positive correlation between depression and cannabis usage. That is, people who are depressed are more likely to smoke cannabis. This scientific paper found that in 2016, people with depression had 216% higher odds of smoking cannabis near-daily compared to people who weren't depressed.

There has been a lot of debate about which thing causes which. Does smoking cannabis cause depression, or does having depression make you more likely to smoke cannabis? This scientific paper suggests that causation may happen in both directions. There is some evidence that the use of cannabis may lead to the onset of depression; however, strong evidence points to the inverse association, i.e., that depression may lead to the onset of cannabis use or to an increase in how often it is used.

What do You Need to Establish Causation?

Determining causation can be trickier than one might think. To establish causation, you need four things.

  1. Correlation: While having a correlation between two things doesn't guarantee causation, some association between them is normally required. If the two things show no association at all, a causal relationship is highly unlikely.
  2. A temporal relationship: The thing you identify as the cause needs to occur before the thing you identify as the effect.
  3. A mechanism: There must be an explanation as to how the first thing would cause the second.
  4. Control of confounding variables: The effect of possible confounding variables needs to be controlled for using careful experimental design, data collection, and statistical methods.

Observational and Experimental Studies

When exploring relationships between things, there are two types of study.

  • Observational studies collect data in a way that doesn't affect how the data are created. The olive oil study mentioned previously is an example of this: it recorded what people's diets were, rather than prescribing that some people must cook with olive oil and others not.
  • Experiments randomly assign people to different "treatments." This is common in testing medicines, where some people get the real medicine and others get a placebo.

Observational studies usually make it very difficult or impossible to determine causation. However, experiments (designed and conducted correctly) allow causation to be determined. You can read more about experiments in our next entry of Data Demystified on A/B testing.
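The difference between the two study types can be sketched with a simulation. In the toy model below, olive oil has no effect on wrinkles by construction, and wealth is a hidden factor that makes people both more likely to use olive oil and less wrinkled. All numbers and names are assumptions for illustration and do not come from the study mentioned earlier. The observational comparison shows a sizeable gap between users and non-users, while random assignment makes the gap shrink to roughly zero.

```python
import random


def mean(values):
    return sum(values) / len(values)


def wrinkle_gap(rng, n_people, randomized):
    """Mean wrinkle score of olive oil users minus that of non-users."""
    users, non_users = [], []
    for _ in range(n_people):
        wealth = rng.gauss(0, 1)  # hidden confounder
        if randomized:
            # Experiment: a coin flip decides who cooks with olive oil.
            uses_oil = rng.random() < 0.5
        else:
            # Observational: wealthier people are more likely to choose olive oil.
            uses_oil = wealth + rng.gauss(0, 1) > 0
        # Wrinkles depend on wealth only; olive oil has NO effect in this model.
        wrinkles = 50 - 5 * wealth + rng.gauss(0, 5)
        (users if uses_oil else non_users).append(wrinkles)
    return mean(users) - mean(non_users)


rng = random.Random(1)
observational_gap = wrinkle_gap(rng, 20_000, randomized=False)
experimental_gap = wrinkle_gap(rng, 20_000, randomized=True)

print(f"observational study gap: {observational_gap:.2f}")
print(f"randomized experiment gap: {experimental_gap:.2f}")
```

Random assignment works because the coin flip cannot be influenced by wealth or by any other lifestyle factor, so the two groups end up similar in everything except the treatment.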

Want to Learn More?

In summary, the correlation-causation fallacy has been widely studied and is one of the biggest pitfalls you can fall into early in your data upskilling journey. Here are things to remember:

  • Correlation measures the relationship between two things.
  • You can't always infer a causal relationship between these things.
  • To determine causation reliably, you generally need to run a well-designed experiment.

To take your skills to the next level, take Introduction to Statistics and start your data literacy journey. The next entry in the Data Demystified series covers experimentation in more depth, focusing on a practice called A/B testing.

Correlation vs. Causation FAQs

Can you have causation without correlation?

Almost never. For one variable to cause another, there must be a relationship between them, so some form of association is effectively a precondition for causation, although it is not sufficient on its own. If there is no correlation, it’s highly unlikely that one thing is causing the other.

How do I distinguish between correlation and causation in my data?

To distinguish between correlation and causation, you need to go beyond simple data analysis. Start by identifying whether there is a correlation between the variables. Then, check for three key elements: (1) A temporal relationship (the cause must happen before the effect), (2) a plausible mechanism explaining how the cause leads to the effect, and (3) control for any confounding variables through experimental design or advanced statistical methods.

What methods can I use to test for causality?

The best way to test for causality is through experiments, such as randomized controlled trials (RCTs), where subjects are randomly assigned to different groups and exposed to different treatments. A/B testing is another common method in business contexts. In addition, you can use statistical techniques like regression analysis, Granger causality tests, or structural equation modeling to infer causality from observational data, though these methods are less conclusive than well-designed experiments.
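As a rough illustration of how the result of a randomized A/B test might be checked, the sketch below runs a simple permutation test on invented data. It repeatedly shuffles the group labels and counts how often a gap at least as large as the observed one appears by chance alone. The group values, function name, and number of shuffles are all assumptions for this example, and a permutation test is only one of several reasonable choices.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_b) - mean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random, as if the treatment did nothing.
        rng.shuffle(pooled)
        gap = abs(mean(pooled[n_a:]) - mean(pooled[:n_a]))
        if gap >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical outcome measurements for randomly assigned users.
control = [10, 12, 11, 13, 12, 11, 10, 12]
treatment = [15, 16, 14, 17, 15, 16, 18, 15]

p_value = permutation_p_value(control, treatment)
print(f"difference in means: {mean(treatment) - mean(control):.3f}")
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value says the gap is unlikely to be a fluke of the assignment. It supports a causal reading only because the groups were assigned at random in the first place.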

Is it possible for two things to be correlated but not have a direct relationship?

Yes, two things can be correlated without having a direct relationship due to the presence of confounding variables. A third factor can influence both variables, creating the illusion of a direct relationship when none exists. For example, ice cream sales and sunburns are correlated, but the true cause behind both is warm weather, not the consumption of ice cream.

What is a spurious correlation?

A spurious correlation occurs when two variables appear to be related but, in reality, have no logical connection. These correlations are often coincidental and can be misleading if not properly analyzed. For example, the divorce rate in Maine has been found to correlate with per capita margarine consumption, but these two factors obviously have no meaningful connection.
