Mean vs. Median: Knowing the Difference
When interpreting data, choosing the right measure of central tendency can make or break your analysis. Among the most common metrics are the mean and median, which are two seemingly straightforward concepts that carry profound implications in data interpretation. While the mean gives us the arithmetic average, the median is the central point in a sorted set of values, such that half the observations lie on either side. But which one is more reliable? The answer often depends on your data's distribution, the presence of outliers, and the story you're trying to tell.
In this article, I will break down the differences between mean and median, their strengths and weaknesses, and how to choose the right one for different scenarios. I will also explore how skewed distributions and outliers affect these measures, providing practical examples and visuals to help you understand these fundamental concepts. We'll also dip a toe into more advanced ideas.
Mean and Median Definitions
To fully understand the differences between the mean and the median, let us look at each of these measures and highlight their key properties.
What is the mean?
The mean can be viewed as the “balance point” (or center of mass) of the data. It considers all data points in a dataset and provides a single value that represents the average. More exactly, the mean is calculated by summing all the values in a dataset and then dividing by the number of values.
What is the median?
The median is the middle value when the data is sorted. It is more robust against outliers than the mean, which often makes it a better measure of central tendency for skewed data.
What about the mode?
The mode is another measure of central tendency, representing the most frequently occurring value in a dataset. For example, in this series:
1, 3, 3, 6, 8, 9
the mode is 3 because it appears twice.
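As a quick check, here is a minimal sketch using Python's built-in `statistics` module. The variable names are my own, chosen for illustration.

```python
from statistics import mode, multimode

values = [1, 3, 3, 6, 8, 9]

# mode() returns a single most common value
most_common = mode(values)

# multimode() returns every value tied for most common, as a list
all_modes = multimode(values)

print(most_common)  # 3
print(all_modes)    # [3]
```

`multimode` is handy when a dataset has more than one most-frequent value, since it returns all of them.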
How to Calculate the Mean and Median
Reading a definition is one thing, but calculating is another. In this section, I will break down the steps for calculating each measure and highlight their computational differences.
How to find the mean
The mean is the arithmetic average of a dataset and is calculated as follows:
- Sum the Values: Add up all the numbers in your dataset.
- Divide by the Total Number of Values: Take the total sum and divide it by the count of values.
Here is the process represented as a general equation:
How to find the mean. Image by Author
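In standard notation, the two steps above can be written as a single formula. The symbols are my own choice and may differ from those in the image: $\bar{x}$ is the mean, $x_i$ is the $i$-th value, and $n$ is the number of values.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```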
For example, consider a dataset of exam scores:
78, 85, 92, 88, 70
- Step 1 (Sum): 78 + 85 + 92 + 88 + 70 = 413
- Step 2 (Divide): 413 ÷ 5 = 82.6
The mean score is 82.6.
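The same two steps can be sketched in Python. This is only an illustration with names of my choosing; the last line cross-checks the manual result against the standard library's `statistics.mean`.

```python
from statistics import mean

scores = [78, 85, 92, 88, 70]

total = sum(scores)          # Step 1: sum the values
count = len(scores)          # how many values there are
manual_mean = total / count  # Step 2: divide the sum by the count

print(total)         # 413
print(manual_mean)   # 82.6
print(mean(scores))  # 82.6
```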
How to find the median
The median is the middle value of a dataset when arranged in ascending order. Here is how to find it:
- Sort the Data: Arrange the values from smallest to largest.
- Identify the Middle Value: If the dataset contains an odd number of values, the median is the value in the middle; if the dataset contains an even number of values, the median is the average of the two middle values.
And here are those steps represented as equations:
Median formula. Image by Author
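One common way to write these two cases is shown below. The notation is an assumption on my part and may not match the image exactly: $x_{(k)}$ denotes the $k$-th value after sorting in ascending order, and $n$ is the number of values.

```latex
\text{median} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[6pt]
\dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even}
\end{cases}
```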
I also created a visual to highlight the process.
How to find the median. Image by Author
Here’s an example dataset with an odd number of values:
70, 78, 85, 88, 92
- Step 1 (Sort): Already sorted.
- Step 2 (Middle Value): The third value is 85.
The median is 85.
Here’s another example but with an even number of values:
70, 78, 85, 88
- Step 1 (Sort): Already sorted.
- Step 2 (Average of middle values): (78 + 85) ÷ 2 = 81.5
The median is 81.5.
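Both cases can be handled by one small function. The sketch below uses a hypothetical helper, `find_median`, that follows the sort-then-pick-the-middle steps described above and assumes a non-empty list; in practice, `statistics.median` from the standard library does the same job.

```python
from statistics import median

def find_median(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)  # Step 1: sort
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_scores = [70, 78, 85, 88, 92]
even_scores = [70, 78, 85, 88]

print(find_median(odd_scores))   # 85
print(find_median(even_scores))  # 81.5
```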
Why the Difference Matters: Outliers and Skew
While both the mean and median describe the center of a dataset, their behavior diverges significantly in the presence of outliers and skewed distributions. Understanding this difference is important for accurately interpreting data and avoiding misleading conclusions.
Impact of outliers
Outliers are values that are significantly higher or lower than the rest of the data. They can heavily influence the mean but have little to no effect on the median.
Let’s consider a dataset of monthly incomes (in thousands):
3, 3.5, 4, 4.5, 5, 6, 50
The mean income here is about 10.86k, which is pulled upward by the extreme value of 50k.
On the other hand, the median value is 4.5k, which is, I would argue, a much more typical representation of income for this group.
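To see how much of the gap comes from that one value, here is a small sketch that computes both measures with and without the 50k income. Dropping the last element is just for illustration, not a recommendation to delete outliers.

```python
from statistics import mean, median

incomes = [3, 3.5, 4, 4.5, 5, 6, 50]  # monthly incomes, in thousands
without_outlier = incomes[:-1]        # same data minus the 50k value

print(round(mean(incomes), 2))          # 10.86 (76 / 7 = 10.857...)
print(median(incomes))                  # 4.5
print(round(mean(without_outlier), 2))  # 4.33
print(median(without_outlier))          # 4.25
```

Removing one value moves the mean by more than 6k, while the median moves by only 0.25k.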
Skewed distributions
The mean and median also differ in their representation of data in skewed distributions (datasets that are not symmetrical).
For example, in right-skewed distributions (e.g., income or housing prices), most values are clustered at the lower end, with a few extreme values pulling the tail to the right.
- Mean: Shifts toward the tail, resulting in a value higher than the median.
- Median: Remains closer to the cluster of typical values, better reflecting the “typical” case.
Consider incomes:
30k, 35k, 40k, 45k, 50k, 100k, 200k
- Mean: 71.4k (pulled upward by 100k and 200k).
- Median: 45k (closer to the majority of incomes).
Why this matters
- In skewed data: The median is often more representative of a “typical” data point because it is not pulled by extreme values.
- In symmetrical data: The mean and median will be nearly identical, so either can be used as a measure of central tendency.
One thing you should take away from this is that it's important to always examine your data's distribution before deciding whether to use the mean or median. Tools like histograms and box plots can help visualize skewness and identify outliers. We'll cover these later on. Examining the difference between the mean and median is also one way of assessing skewness: a mean well above the median suggests a right skew, and a mean well below it suggests a left skew.
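As a rough illustration of using the mean-median gap to gauge skew, the sketch below reuses the income example from this section. It also computes Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, which is one conventional way to scale that gap; treating it as the right summary for your data is an assumption, not a rule.

```python
from statistics import mean, median, stdev

incomes = [30, 35, 40, 45, 50, 100, 200]  # in thousands

# A positive gap means the mean sits above the median
gap = mean(incomes) - median(incomes)

# Pearson's second skewness coefficient (uses the sample standard deviation)
pearson_skew = 3 * gap / stdev(incomes)

print(round(mean(incomes), 1))  # 71.4
print(median(incomes))          # 45
print(gap > 0)                  # True
print(pearson_skew > 0)         # True
```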
Choosing Mean or Median in Different Scenarios
When analyzing data, deciding whether to use the mean or median depends on the characteristics of your dataset and the insights you are trying to extract. Below is a quick reference table to guide your choice:
Use the Mean When | Use the Median When |
---|---|
The data distribution is approximately normal (symmetrical). | The data is highly skewed (e.g., income, property values). |
Outliers are minimal or irrelevant to the analysis. | Outliers are present and could distort the results if included. |
You need a measure that is sensitive to every data point, such as in predictive modeling or when calculating totals. | You want to reflect the “typical” value rather than the “mathematical center” of the dataset. |
A practical tip: always start with a visual analysis of your data (e.g., a histogram or box plot) to check for symmetry, skewness, and the presence of outliers. This will help you decide whether the mean or median is a better fit for your scenario.
Visualizing Mean vs. Median
Visualizations are powerful tools for understanding the behavior of the mean and median in different datasets. They can clearly demonstrate how these measures respond to outliers and skewed distributions, helping to inform better data-driven decisions.
Bar chart example
Imagine a small dataset of incomes in thousands:
30, 35, 40, 45, 50, 55, 1000
The following bar chart demonstrates how a single extreme value can drastically affect the mean, while leaving the median relatively stable. In this case, most data points cluster between 30 and 55, but the presence of an outlier (1000) pulls the mean upward.
Bar chart showing effect of an outlier on mean vs. median. Image by Author
Histogram example
In a right-skewed distribution (such as incomes or housing prices), the mean is often pulled toward the long tail of high values, while the median remains closer to the “typical” data point. This makes the median a better measure of central tendency in such cases.
The histogram below shows a simulated income distribution where the mean (red dashed line) is significantly larger than the median (green dashed line) due to the skew.
Histogram showing a right-skewed distribution. Image by Author
Notice how the right skew stretches the tail, creating a clear difference between the mean and the median.
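If you want to reproduce the effect numerically, here is a minimal simulation sketch. It assumes a log-normal distribution as a stand-in for income-like data; the parameters (10.5 and 0.8), the sample size, and the seed are arbitrary choices of mine, not the values behind the histogram above.

```python
import random
from statistics import mean, median

random.seed(42)  # fixed seed so repeated runs give the same draws

# Log-normal draws: many moderate values plus a long right tail
simulated_incomes = [random.lognormvariate(10.5, 0.8) for _ in range(10_000)]

sim_mean = mean(simulated_incomes)
sim_median = median(simulated_incomes)

print(f"mean:   {sim_mean:,.0f}")
print(f"median: {sim_median:,.0f}")
print(sim_mean > sim_median)  # the long right tail pulls the mean above the median
```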
Box plot example
A box plot is an excellent way to visualize the impact of outliers on the median. Below, we compare two groups: one with outliers and one without. The median (vertical line inside the box) remains stable even with the presence of extreme values, but the overall range of the data is heavily impacted by the outlier.
Box plot showing effect of outliers on median. Image by Author
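The numbers behind a box plot can be sketched without any plotting library. The example below assumes the common convention of flagging points beyond 1.5 × IQR from the quartiles. The helper name `box_plot_summary` and the two groups are hypothetical, reusing the small income dataset from the bar chart example.

```python
from statistics import median, quantiles

def box_plot_summary(values):
    """Return the median, range, and points beyond the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartiles
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "median": median(values),
        "range": max(values) - min(values),
        "outliers": outliers,
    }

without_outlier = [30, 35, 40, 45, 50, 55]
with_outlier = without_outlier + [1000]

print(box_plot_summary(without_outlier))
print(box_plot_summary(with_outlier))
```

In this sketch the median moves from 42.5 to 45, while the range jumps from 25 to 970.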
These visualizations highlight how the mean and median respond to different data characteristics, providing clarity on when to use each measure. Whether analyzing skewed data, outlier-prone datasets, or comparing groups, visual aids like these can make complex relationships much easier to grasp.
Some More Advanced Ideas
Let's now look at some more advanced ideas if you are curious to learn more.
Mean vs. median imputation
Now, if you are a data scientist and you need to fill in gaps in your data, you may have to choose an imputation method. You might now be wondering, what is the practical difference between mean vs. median imputation?
As you might guess, mean imputation replaces missing values with the average of the available data, which, as we have said, can be skewed by extreme values. Median imputation, on the other hand, replaces missing values with the middle value of the dataset.
A useful rule of thumb is to look at the distribution of your data first. If the distribution is skewed and many values are missing, mean imputation fills every gap with a value pulled toward the tail, which can noticeably alter the shape of the distribution. In that case, median imputation is often the safer default.
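Here is a minimal sketch of both approaches on a small, made-up income column, where `None` marks a missing value. The data and variable names are hypothetical, and real projects would typically use a data-frame library rather than plain lists.

```python
from statistics import mean, median

# Hypothetical incomes in thousands; None marks a missing value
incomes = [30, 35, None, 45, 50, None, 200]

# Only the observed values are used to compute the fill value
observed = [v for v in incomes if v is not None]

mean_fill = mean(observed)
median_fill = median(observed)

mean_imputed = [mean_fill if v is None else v for v in incomes]
median_imputed = [median_fill if v is None else v for v in incomes]

print(mean_fill)       # 72
print(median_fill)     # 45
print(mean_imputed)    # [30, 35, 72, 45, 50, 72, 200]
print(median_imputed)  # [30, 35, 45, 45, 50, 45, 200]
```

With mean imputation, the two filled-in values (72) sit above every observed income except the 200k outlier, which illustrates how a skewed column can produce atypical fill values.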
Mean vs. median: parametric or non-parametric?
In many parametric methods, the mean (and variance) are central parameters. For example, a simple linear regression model typically assumes that errors are normally distributed with a mean of zero, so the regression line describes the conditional mean of the response. When your data meet the normality assumption, the sample mean is a natural estimator and fits well within parametric frameworks.
The median, by contrast, has a non-parametric orientation and is arguably the quintessential non-parametric measure of central tendency. Many rank-based tests, like the Mann–Whitney U test, effectively compare medians (or, more generally, distributions) rather than means. So, if your data show strong skew or contain outliers, focusing on the median aligns more naturally with non-parametric statistics.
All this is to say that understanding the distinction between the mean and the median is not just about describing data correctly; it also matters in hypothesis testing.
Mean vs. median stability testing
When deciding whether to use a mean or a median, one key question is how stable our statistics are for a given dataset. Bootstrapping is one option that would allow us to empirically estimate the sampling distribution of both the mean and the median by repeatedly resampling (with replacement) from the original data.
You can highlight the difference in stability empirically: introduce a few outliers into a dataset, re-run the bootstrap procedure, and compare how far the mean's sampling distribution shifts relative to the median's. Bootstrapping also makes the comparison concrete by showing how wide your confidence intervals might be in realistic scenarios. Read our tutorial on applying bootstrap methods to learn more.
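A rough sketch of that experiment, using only the standard library, might look like the following. Everything here is an illustrative assumption: the helper `bootstrap_spread`, the made-up base data (the integers 30 to 79), the two injected outliers, and the choice of 2,000 resamples.

```python
import random
from statistics import mean, median, stdev

def bootstrap_spread(values, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of a statistic by resampling with replacement."""
    rng = random.Random(seed)  # local generator so results are repeatable
    estimates = [
        statistic(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]
    return stdev(estimates)

base_values = list(range(30, 80))           # 50 well-behaved values
with_outliers = base_values + [1000, 1200]  # same data plus two extreme values

mean_se_clean = bootstrap_spread(base_values, mean)
median_se_clean = bootstrap_spread(base_values, median)
mean_se_outliers = bootstrap_spread(with_outliers, mean)
median_se_outliers = bootstrap_spread(with_outliers, median)

print(f"mean SE:   {mean_se_clean:.2f} -> {mean_se_outliers:.2f}")
print(f"median SE: {median_se_clean:.2f} -> {median_se_outliers:.2f}")
```

On the clean data the mean is expected to be at least as stable as the median. Once the outliers are added, the bootstrap spread of the mean grows sharply while the median's barely changes.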
Mean vs. median as optimization problems
Let me now provide an alternate but equally true definition: The mean is the value that minimizes the sum of squared deviations from the data, whereas the median is the value that minimizes the sum of absolute deviations.
Take a look at this equation, which gives the sum of squared deviations from a candidate value m:

$$S(m) = \sum_{i=1}^{n} (x_i - m)^2$$

If you take the derivative of this equation with respect to m, set it to zero, and solve, you will find that the minimizing value is simply the arithmetic mean. This matters because in many statistical methods, like OLS regression, we minimize squared errors for mathematical convenience and to conform to assumptions of normally distributed errors.
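To make the derivative step concrete, here is a short worked version. The name $S(m)$ for the sum of squared deviations and the index notation are labels I am assuming for illustration.

```latex
S(m) = \sum_{i=1}^{n} (x_i - m)^2
\quad\Longrightarrow\quad
\frac{dS}{dm} = -2 \sum_{i=1}^{n} (x_i - m) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} x_i = n\,m
\quad\Longrightarrow\quad
m = \frac{1}{n} \sum_{i=1}^{n} x_i
```

Because the second derivative, $2n$, is positive, this stationary point is a minimum.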
Now consider a different idea: Instead of squaring each deviation, we measure the absolute error between m and each data point:

$$A(m) = \sum_{i=1}^{n} |x_i - m|$$
Here we want to find m that minimizes this total absolute deviation. It turns out (by analyzing the derivative of the absolute loss, or by a geometric argument) that the solution is the median of the dataset.
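Intuitively, if m sits to the left of the median, more data points lie to its right than to its left, so moving m to the right shortens more distances than it lengthens. The same argument applies in reverse when m is to the right of the median. Only at the median does the pull from left and right balance out, minimizing the total absolute distance.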
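Both claims can be checked numerically with a brute-force search. The sketch below reuses the exam scores from earlier and tries candidate centers on a grid from 70 to 92 in steps of 0.1. The grid, the step size, and the function names are my own assumptions, and a grid search is only a sanity check, not a proof.

```python
from statistics import mean, median

data = [70, 78, 85, 88, 92]

def squared_loss(m):
    """Sum of squared deviations of the data from m."""
    return sum((x - m) ** 2 for x in data)

def absolute_loss(m):
    """Sum of absolute deviations of the data from m."""
    return sum(abs(x - m) for x in data)

# Candidate centers from 70.0 to 92.0 in steps of 0.1
candidates = [70 + i / 10 for i in range(221)]

best_squared = min(candidates, key=squared_loss)
best_absolute = min(candidates, key=absolute_loss)

print(round(best_squared, 1), mean(data))     # 82.6 82.6
print(round(best_absolute, 1), median(data))  # 85.0 85
```

The squared loss bottoms out at the mean and the absolute loss at the median, matching the two definitions above.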
Mean vs. median computational complexity
Finally, the mean is computationally simpler at scale: you can compute it incrementally as data streams in, without needing to sort or store the values.
The median often requires sorting. Sorting a large dataset can be computationally expensive, especially with millions of values. For very large datasets, approximate algorithms (such as streaming quantile estimators) can be used to estimate the median more efficiently. Our new Concepts in Computer Science course is a great resource for learning about these things.
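The contrast can be sketched with two small classes. Both are hypothetical helpers written for illustration. `RunningMean` keeps only a count and the current mean, while `RunningMedian` takes the simplest exact approach of keeping every value in a sorted list, which is precisely the cost that approximate streaming algorithms try to avoid.

```python
import bisect

class RunningMean:
    """Track the mean of a stream using constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        # Move the old mean toward x by 1/count of the gap between them
        self.mean += (x - self.mean) / self.count
        return self.mean

class RunningMedian:
    """Track the exact median of a stream by storing every value in sorted order."""

    def __init__(self):
        self.sorted_values = []

    def update(self, x):
        bisect.insort(self.sorted_values, x)  # memory grows with the stream
        n = len(self.sorted_values)
        mid = n // 2
        if n % 2 == 1:
            return self.sorted_values[mid]
        return (self.sorted_values[mid - 1] + self.sorted_values[mid]) / 2

running_mean = RunningMean()
running_median = RunningMedian()
for score in [78, 85, 92, 88, 70]:
    current_mean = running_mean.update(score)
    current_median = running_median.update(score)

print(round(current_mean, 1))  # 82.6
print(current_median)          # 85
```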
Next Steps
As you have seen, the mean is the arithmetic average of a dataset, which makes it sensitive to extreme values, while the median represents the middle value in an ordered dataset. The right choice can make all the difference. That said, in real-world analyses it is often best to report both the mean and median alongside additional statistics like the mode, standard deviation, and percentiles, because together they give a more complete picture.
If you're eager to dig deeper into statistical concepts, there are several areas worth focusing on. Start by reading up on more advanced variations of the mean, such as the trimmed mean, geometric mean, and weighted mean, each of which has its own purpose. I would also take our technology-agnostic Introduction to Statistics course.
Then, to build real expertise, you will want to choose and master a tool. Our Introduction to Statistics in R course and Statistician in R career track are both informative starting points if you want to use R, which is a popular language for data science and statistics. If you prefer working with spreadsheets or a programming language like Python, our Introduction to Statistics in Google Sheets course and Introduction to Statistics in Python course provide a hands-on approach to statistical analysis using formulas and powerful libraries.
Experienced data professional and writer who is passionate about empowering aspiring experts in the data space.
Mean vs. Median FAQs
What is the main difference between the mean and median?
The mean is the arithmetic average of all data points, while the median is the middle value when data is sorted.
When should I use the median instead of the mean?
Use the median when your data is skewed or contains outliers that could distort the mean.
Can the mean and median be the same?
Yes, they can be the same in a perfectly symmetrical distribution, such as a normal distribution.
Are there situations where neither mean nor median is sufficient?
Yes, for multimodal distributions or datasets with multiple peaks, neither may be representative. In such cases, additional measures like mode or percentiles might be more appropriate.
Why is the mean more affected by outliers than the median?
To answer this question, consider how the mean is calculated: The mean is the sum of all data values divided by the number of observations. An outlier (an extremely high or low value) heavily influences that sum, pulling the mean away from what might be considered a typical value.
Now consider how the median is calculated: The median is the middle value in a sorted dataset. It depends only on the ordering of the data—not on how large or small the individual points are. A single outlier doesn’t shift the position of the middle value in the sorted list and therefore barely affects the median.
How do you think about choosing between the mean and median?
Let’s look at some key considerations:
- When precision is critical: The mean considers all data points, making it ideal for calculations that require every value (e.g., average fuel consumption across all vehicles).
- When robustness is needed: The median offers more reliability in skewed datasets or when extreme values could distort the mean. For example, the median is often preferred in reporting household incomes to avoid misrepresentation due to a few ultra-high earners.