
Simpson's Paradox: Avoid Being Misled by the Data

Break down misleading trends to uncover what’s really going on in your data. Learn to identify confounders, segment your analysis, and avoid false conclusions caused by Simpson’s Paradox.
Aug 7, 2025  · 7 min read

I found Simpson’s paradox super confusing when I first learned about it in college. I almost didn’t know what I was looking at. There it was, a high-level trend, and the story seemed clear enough. But then, when I separated the underlying groups, the trend reversed.

My first thought, seeing this, was something like, ‘Well, I guess you can’t trust statistics.’ But in the intervening time, I’ve done some studying and I’m happy to say that I do trust statistics again. If you’re as confused as I was, keep reading, and I’ll help you understand what’s happening. 

What Is Simpson's Paradox?

An experienced data analyst learns to be skeptical of broad trends, because a simple average can hide something more complicated in the data. With Simpson’s paradox, this ‘something else’ is pretty remarkable: The aggregated data doesn't just obscure the facts, it points to the exact opposite conclusion.

In other words, Simpson’s paradox happens when a trend appears in separate groups of data but disappears or flips entirely when those groups are combined. It’s a sharp reminder that looking at the big picture without understanding its parts can lead to trouble.

A Simpson's Paradox Example

It’s best to show this with an example. I’ll start with something simple and then I’ll point to famous examples that you can study yourself. 

Imagine a study comparing tree growth in two soil types, Soil A and Soil B. When we look at growth within each climate, the results seem clear:

  • For trees in cool climates (Group 1), Soil A has a better outcome.
  • For trees in warm climates (Group 2), Soil A still has a better outcome.

Based on this, Soil A looks like the obvious winner. But when we combine all the data, the paradox appears: Soil B now seems to be the more effective option overall.

In case you don’t believe me, I’ll put numbers to this:

Tree growth in cool climates

| Soil Type | Number of Trees | Avg Growth Rate |
|---|---|---|
| Soil A | 90 | 30 cm/year |
| Soil B | 10 | 25 cm/year |

In cool climates, Soil A supports faster growth.

Tree growth in warm climates

| Soil Type | Number of Trees | Avg Growth Rate |
|---|---|---|
| Soil A | 10 | 60 cm/year |
| Soil B | 90 | 55 cm/year |

In warm climates, Soil A still performs better, though the gap is smaller in relative terms (it is 5 cm/year in both climates).

But when you combine all trees

| Soil Type | Total Trees | Weighted Avg Growth Rate |
|---|---|---|
| Soil A | 100 | 33 cm/year |
| Soil B | 100 | 52 cm/year |

Now Soil B looks better overall, even though Soil A outperforms it in both climates.
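If you want to check the arithmetic yourself, here is a minimal sketch in plain Python. It treats each overall figure as the per-climate growth rate weighted by the number of trees in that climate; the dictionary layout and the `weighted_average` helper are my own naming for illustration, not part of any library.

```python
# Per-climate figures from the two climate tables:
# (number of trees, average growth in cm/year).
data = {
    "Soil A": {"cool": (90, 30), "warm": (10, 60)},
    "Soil B": {"cool": (10, 25), "warm": (90, 55)},
}

def weighted_average(groups):
    """Average growth across climates, weighted by the number of trees in each."""
    total_trees = sum(n for n, _ in groups.values())
    total_growth = sum(n * rate for n, rate in groups.values())
    return total_growth / total_trees

overall = {soil: weighted_average(groups) for soil, groups in data.items()}
for soil, avg in overall.items():
    print(f"{soil}: {avg:.0f} cm/year")
```

Because each soil's average is dominated by the climate where most of its trees grow, the overall ranking can differ from the ranking inside every climate.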

So how is this possible? The answer is a confounding variable: a hidden factor that influences both which group a subject ends up in and the final outcome. In this case, the climate is the confounder.

Specifically, we should say that: 

  • Soil A is used more often in cooler climates, where all trees grow more slowly, no matter the soil.
  • And soil B is used more often in warmer climates, where trees grow faster in general.

So, climate influences growth rate and it’s also unevenly distributed between the soil groups.
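One quick way to see that imbalance is to compute each soil's climate mix. This is only a sketch using the tree counts from the example; `climate_mix` is a hypothetical helper name.

```python
# Number of trees per soil and climate, taken from the example.
tree_counts = {
    "Soil A": {"cool": 90, "warm": 10},
    "Soil B": {"cool": 10, "warm": 90},
}

def climate_mix(counts):
    """Fraction of a soil's trees found in each climate."""
    total = sum(counts.values())
    return {climate: n / total for climate, n in counts.items()}

mix = {soil: climate_mix(counts) for soil, counts in tree_counts.items()}
for soil, shares in mix.items():
    print(soil, {climate: f"{share:.0%}" for climate, share in shares.items()})
```

If the two soils had similar climate mixes, the combined averages would rank them the same way the per-climate tables do.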

Simpson’s Paradox Classic Examples 

Simpson’s paradox is often taught through a few well-documented historical cases.

A famous example comes from UC Berkeley graduate admissions in 1973. At first, the data suggested women were accepted at lower rates than men. But when broken down by department, most departments admitted women at equal or higher rates. The confounder was department choice: women applied more often to competitive departments with lower acceptance rates overall, while men applied more often to less competitive ones.

Another case is a 1986 study on kidney stone treatments. Overall, a less invasive method seemed more effective. But when split by stone size, the more invasive surgery had higher success rates for both small and large stones. The confounder here was case severity: the tougher cases, meaning larger stones, more often went to the invasive surgery, making its overall numbers look worse.

In both cases, the combined data gave the wrong impression. Only after breaking things down did the truth come out.
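To make the kidney stone case concrete, here is a short sketch that recomputes the success rates. The counts are the ones commonly cited for this study, so treat them as an assumption and double-check them against the original paper before reusing them. The labels and helper names are mine.

```python
# (successes, patients) per treatment and stone size.
# Assumption: these are the commonly cited counts for the 1986 study.
outcomes = {
    "more invasive": {"small": (81, 87), "large": (192, 263)},
    "less invasive": {"small": (234, 270), "large": (55, 80)},
}

def success_rate(successes, patients):
    """Share of patients treated successfully."""
    return successes / patients

def overall_rate(groups):
    """Pooled success rate across stone sizes."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes / patients

for treatment, groups in outcomes.items():
    by_size = {size: f"{success_rate(*g):.0%}" for size, g in groups.items()}
    print(treatment, by_size, f"overall {overall_rate(groups):.0%}")
```

Notice where the patients are: most large-stone cases sit in the more invasive group, which is the same kind of imbalance as in the soil example.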

What Causes Simpson’s Paradox?

In Simpson's paradox, the numbers are correct for both the combined and the individual groups, so there’s no math error of any kind. The problem is one of interpretation. It tests our ability to keep all the facts straight.

As I started to mention earlier, Simpson’s paradox happens when two conditions are met:

  1. A confounding variable is present: There is a third factor that is connected to both the independent variable and the outcome.
  2. The groups are imbalanced: In our trees example, soil A was used more often in cooler climates, where trees grow more slowly overall. Soil B was used more in warmer climates, where growth is faster. This imbalance skews the combined average and causes the reversal.
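These two conditions suggest a simple mechanical check. The sketch below flags a reversal when one option wins in every subgroup but loses once the subgroups are pooled. The function name and input layout are hypothetical, and it assumes both options have the same subgroups, each given as a count and a mean outcome.

```python
def shows_simpsons_reversal(a_groups, b_groups):
    """True if one option wins in every subgroup but loses in the pooled data.

    Each argument maps subgroup name -> (count, mean outcome).
    Assumes both arguments share the same subgroup names.
    """
    def pooled(groups):
        total = sum(n for n, _ in groups.values())
        return sum(n * mean for n, mean in groups.values()) / total

    diffs = [a_groups[g][1] - b_groups[g][1] for g in a_groups]
    overall_diff = pooled(a_groups) - pooled(b_groups)

    a_wins_everywhere = all(d > 0 for d in diffs)
    b_wins_everywhere = all(d < 0 for d in diffs)
    return (a_wins_everywhere and overall_diff < 0) or (
        b_wins_everywhere and overall_diff > 0
    )

# Imbalanced groups (the soil example) versus a balanced version of the same rates.
imbalanced = shows_simpsons_reversal(
    {"cool": (90, 30), "warm": (10, 60)},
    {"cool": (10, 25), "warm": (90, 55)},
)
balanced = shows_simpsons_reversal(
    {"cool": (50, 30), "warm": (50, 60)},
    {"cool": (50, 25), "warm": (50, 55)},
)
print(imbalanced, balanced)
```

With identical growth rates but balanced group sizes, the reversal goes away, which is why the imbalance matters as much as the confounder itself.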

What to Do About Simpson’s Paradox

Now this might be the most important part: How do you keep Simpson’s paradox from sneaking into your own analysis? And if it does show up, which version of events are you supposed to report?

What to do beforehand

Simpson’s paradox is best handled before it has a chance to distort your conclusions. That means developing a few disciplined habits:

  • Segment your data: Don’t rely on top-level averages. Break the data into relevant subgroups, such as age, region, product type, or severity, and check if the trend holds within those slices.
  • Search for confounding variables: Always ask: What else could be influencing this outcome? Look for factors that might be unevenly distributed across your groups, especially those you know from your domain expertise.
  • Remember that correlation isn’t causation: Just because a trend appears in the aggregate doesn’t mean it reflects a true cause-and-effect relationship. Simpson’s paradox often emerges when a superficial correlation masks a deeper imbalance.
  • Insist on context: Know where your data came from and what might be shaping it. The methods of collection, the nature of the subjects, and external influences all matter.
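Here is roughly what the segmenting habit looks like on row-level data, sketched with the standard library only. The records are made up for illustration, and `mean_by` is a hypothetical helper; with pandas you would typically reach for a group-by instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical row-level records: (soil, climate, growth in cm/year).
records = [
    ("A", "cool", 31), ("A", "cool", 29), ("A", "cool", 30), ("A", "warm", 60),
    ("B", "cool", 25), ("B", "warm", 54), ("B", "warm", 56), ("B", "warm", 55),
]

def mean_by(rows, key):
    """Group rows using `key` and return the mean growth per group."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row[2])
    return {group: mean(values) for group, values in buckets.items()}

top_level = mean_by(records, key=lambda r: r[0])          # by soil only
segmented = mean_by(records, key=lambda r: (r[0], r[1]))  # by soil and climate

print(top_level)
print(segmented)
```

Comparing the two dictionaries side by side is the whole habit: if the ranking changes once you add a grouping variable, you have found something worth investigating.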

What to do after it appears

If Simpson’s paradox appears, don’t panic. This is your signal to look a little more closely:

  • Investigate the imbalance. What’s unevenly distributed across the groups? That’s likely your confounder.
  • Report both views, but prioritize clarity. It’s okay to show the aggregated result too, but make sure to explain why it’s misleading and highlight the disaggregated analysis that better reflects the true pattern.
  • Let your purpose guide your reporting. If you're making policy decisions or operational changes, you'll usually want to act based on subgroup-level insights, not rolled-up summaries.

Is one version of the results “better”, the aggregated or the disaggregated one? There’s no one-size-fits-all answer. That said, when confounding is present, the disaggregated analysis is typically more trustworthy: it shows how a variable behaves under different conditions, while the aggregate can mislead if a confounder influences both the grouping and the outcome. What matters most is understanding why the reversal happens and communicating it clearly in your reporting.
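If you still need a single summary number per group, one option is to reweight every group to the same mix of the confounder, a technique usually called direct standardization. This is my own illustration on the soil example, not the only way to adjust for a confounder, and the helper names are hypothetical.

```python
# (number of trees, average growth in cm/year) per soil and climate.
data = {
    "Soil A": {"cool": (90, 30), "warm": (10, 60)},
    "Soil B": {"cool": (10, 25), "warm": (90, 55)},
}

def reference_mix(data):
    """Share of all trees, both soils pooled, in each climate."""
    totals = {}
    for groups in data.values():
        for climate, (n, _) in groups.items():
            totals[climate] = totals.get(climate, 0) + n
    grand_total = sum(totals.values())
    return {climate: n / grand_total for climate, n in totals.items()}

def standardized_average(groups, mix):
    """Average growth if this soil's trees followed the reference climate mix."""
    return sum(mix[climate] * rate for climate, (_, rate) in groups.items())

mix = reference_mix(data)
adjusted = {soil: standardized_average(groups, mix) for soil, groups in data.items()}
for soil, avg in adjusted.items():
    print(f"{soil}: {avg:.0f} cm/year at a common climate mix")
```

Once both soils are evaluated on the same climate mix, the summary agrees with the per-climate tables and Soil A comes out ahead.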

Conclusion

Simpson's Paradox is a great lesson in the art of interpreting data. The ability to look past a misleading total and ask, "What am I missing?" is the mark of a mature analyst. It's the skill that separates someone who just reports numbers from someone who uncovers insights.

If you find yourself interested in the "why" behind these reversals (I certainly do), the paradox is a great gateway into the wider field of causal inference. Our Machine Learning for Business course covers causal models, among other topics, and our Foundations of Inference in Python course is another very good option.


Author
Josef Waples

I'm a data science writer and editor with contributions to research articles in scientific journals. I'm especially interested in linear algebra, statistics, R, and the like. I also play a fair amount of chess! 
