Data Demystified: An Overview of Descriptive Statistics

In the fifth entry of Data Demystified, we provide an overview of the basics of descriptive statistics, one of the fundamental areas of data science.
Sep 2022  · 6 min read

Welcome to part five of our month-long Data Demystified series. As part of Data Literacy Month, this series will clarify key concepts from the world of data, answer the questions you may be too afraid to ask, and have fun along the way. If you want to start at the beginning, read our first entry in the series: What is a Dataset?

An Overview of Descriptive Statistics

In this entry, we’ll dive deep into descriptive statistics, one of the fundamental areas of data science. 

What is descriptive statistics?

Descriptive statistics is the primary tool used in descriptive analytics, one of the four types of analytics. To understand descriptive statistics, it is first helpful to understand the concept of a variable. 

A variable is a quantity that can be measured or counted. For example, given a group of people, you could measure each of their heights, and height would be considered a variable. Likewise, you could count how many people had each hair color, and hair color would also be considered a variable.

Descriptive statistics are numbers that summarize variables. Calculating descriptive statistics is so common that several names for it have arisen. "Summary statistics" and "aggregations" mean the same as descriptive statistics.

In the next sections, we’ll outline the main techniques in descriptive statistics, all taken from our introduction to statistics course.

Counts and proportions

When you have categorical variables—that is, data that consists of discrete groups like hair color—the most natural ways to summarize those variables are by counting them or talking about proportions. 

For example, in a group of ten people, four had black hair, two had brown hair, two had blonde hair, one had ginger hair, and one had grey hair. Alternatively, you can say 40% of people had black hair, 20% had brown hair, 20% had blonde hair, 10% had ginger hair, and 10% had grey hair.

| Hair Color | Count | Proportion |
|------------|-------|------------|
| Black      | 4     | 40%        |
| Brown      | 2     | 20%        |
| Blonde     | 2     | 20%        |
| Ginger     | 1     | 10%        |
| Grey       | 1     | 10%        |
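
As a minimal sketch, the counts and proportions above could be computed in Python with the standard library. The list `hair_colors` is simply the ten-person example from the text written out by hand; the variable names are illustrative.

```python
from collections import Counter

# The ten hair colors from the example above, written out one per person.
hair_colors = (
    ["black"] * 4 + ["brown"] * 2 + ["blonde"] * 2 + ["ginger"] + ["grey"]
)

# Count how many people fall into each category.
counts = Counter(hair_colors)
total = len(hair_colors)

# Proportion = count for the category divided by the total number of people.
proportions = {color: count / total for color, count in counts.items()}

# Print one row per category, most common first.
for color, count in counts.most_common():
    print(f"{color:<7} {count:>2} {proportions[color]:.0%}")
```

Counts keep the original scale of the data, while proportions make groups of different sizes easier to compare.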

Measures of center

Measures of center, more popularly known as averages, summarize your data by capturing one value that describes the center of its distribution. Here are the most common measures of center. 

Mode

The mode is the most accessible type of average to calculate. It's just the most common value. In the example of hair colors, black is the most common color, so the mode is "black."
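
A short sketch of finding the mode in Python, assuming the same ten hair colors as in the example above and using the standard library's `statistics.mode`:

```python
import statistics

# The ten hair colors from the example above.
hair_colors = ["black"] * 4 + ["brown"] * 2 + ["blonde"] * 2 + ["ginger", "grey"]

# The mode is the value that appears most often.
most_common_color = statistics.mode(hair_colors)
print(most_common_color)  # black
```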

Arithmetic mean

Perhaps the best-known type of average is the arithmetic mean. If people speak casually about the average of something without specifying which type of average they mean, they are usually talking about the arithmetic mean.

To calculate it, add up all the values, then divide by how many values you have. For example, suppose you have a variable capturing the height of 10 people. To calculate the average, you sum up all their heights and divide the total by 10.
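
The calculation can be sketched in a few lines of Python. The ten heights below are hypothetical values invented for illustration, since the article does not list them.

```python
import statistics

# Hypothetical heights in cm for ten people (illustrative values only).
heights_cm = [158, 162, 165, 168, 170, 172, 175, 177, 180, 183]

# Add up all the values, then divide by how many values there are.
mean_by_hand = sum(heights_cm) / len(heights_cm)

# The standard library performs the same calculation.
mean_from_library = statistics.mean(heights_cm)

print(mean_by_hand, mean_from_library)
```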

One of the strengths (and a weakness, as we'll see in a moment) of the mean is that every data point is used in the calculation. That is, it uses the most information possible in its calculation.

There are other types of mean, known as the geometric mean and harmonic mean, which have more niche use cases.

Median

The median is calculated by sorting the values from smallest to largest, then taking the middle value. If there is an even number of values, the median is the average of the two middle values.

Using the height example, you sort the heights from lowest to highest and find the midway point. With 10 values there is no single middle value, so the median is the average of the fifth and sixth values. The median is known as a "robust" descriptive statistic since it won't change much if one of the values changes.

To illustrate this, suppose we had received another data point of someone extraordinarily tall or short. The arithmetic mean of these heights would change quite a lot since outliers can influence the final value for the arithmetic mean. The median, however, would only vary slightly since the midway position hasn’t changed much.  
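
The sketch below illustrates this robustness in Python. The ten heights and the 250 cm outlier are hypothetical values chosen for illustration, not data from the article.

```python
import statistics

# Hypothetical heights in cm for ten people (illustrative values only).
heights_cm = [158, 162, 165, 168, 170, 172, 175, 177, 180, 183]

# Add one extraordinarily tall (made-up) person to the data.
heights_with_outlier = heights_cm + [250]

mean_before = statistics.mean(heights_cm)
mean_after = statistics.mean(heights_with_outlier)

# statistics.median sorts the data and averages the two middle values
# when the number of data points is even.
median_before = statistics.median(heights_cm)
median_after = statistics.median(heights_with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

With these made-up numbers, the mean moves by about 7 cm while the median moves by 1 cm.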

Other measures of location

Sometimes a middle value isn't a helpful summary of your variable. Perhaps you care about the minimum (smallest value) or the maximum (largest value). These other measures of the size of the values are called measures of location.

Another critical measure of location is called a percentile. This uses cutoff points that divide the data into 100 intervals with the same amount of data in each interval. The 100th percentile is the same as the maximum, and the 50th percentile is the same as the median. Percentiles aren't as relevant when you only have 10 data points as in the example of heights but are useful for larger datasets. 

A variation on percentiles is quartiles, where you split the data into four intervals rather than 100.
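
As a rough sketch, Python's `statistics.quantiles` returns the cutoff points for quartiles (`n=4`) or percentiles (`n=100`). The heights are hypothetical, and the exact cutoffs depend on the interpolation convention a tool uses, so other software may give slightly different numbers.

```python
import statistics

# Hypothetical heights in cm for ten people (illustrative values only).
heights_cm = [158, 162, 165, 168, 170, 172, 175, 177, 180, 183]

# n=4 returns the three cut points that split the data into four intervals.
quartiles = statistics.quantiles(heights_cm, n=4)

# n=100 returns the 99 cut points between the 100 intervals.
# The maximum (100th percentile) is not included in this list.
percentiles = statistics.quantiles(heights_cm, n=100)

# The second quartile and the 50th percentile both match the median.
median = statistics.median(heights_cm)
print(quartiles[1], percentiles[49], median)
```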

Measures of spread

Sometimes, rather than caring about the size of values, you care about how different they are. Descriptive statistics that calculate how different the values are from each other are called measures of spread, or measures of variation.

Range

The simplest measure of spread is the range, which is the maximum value minus the minimum value. In the example of the height, the range would be the maximum height minus the minimum height. 
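
A minimal Python sketch of the minimum, maximum, and range, again using hypothetical heights that are not taken from the article:

```python
# Hypothetical heights in cm for ten people (illustrative values only).
heights_cm = [158, 162, 165, 168, 170, 172, 175, 177, 180, 183]

smallest = min(heights_cm)
largest = max(heights_cm)

# Range = maximum value minus minimum value.
height_range = largest - smallest
print(smallest, largest, height_range)  # 158 183 25
```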

Variance

The variance is a more technical measure of spread often used in conjunction with the arithmetic mean. It is calculated as the sum of the squares of the differences between each value and the mean, all divided by one less than the number of data points.

To illustrate this, let’s look at an even simpler example of the height variable. Assume that we have captured four people's heights, as listed in the table below.

| Name                   | Height (cm) |
|------------------------|-------------|
| Isabella Leslie-Miller | 158         |
| Valeria Kogan-Higgs    | 172         |
| Hadrien Lacroix        | 177         |
| Sara Billen            | 183         |

The arithmetic mean of people’s height here amounts to 172.5 cm (calculated by summing up all their heights and dividing by 4). 

In this example, the variance would be 

((158 - 172.5)^2 + (172 - 172.5)^2 + (177 - 172.5)^2 + (183 - 172.5)^2) / (4 - 1) = 341 / 3 ≈ 113.67 cm².

As with the mean, the fact that it uses every data point in the calculation makes it very powerful, but also potentially subject to larger changes when the data changes.
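
The same calculation can be checked in Python. This sketch uses the four heights from the table above; `statistics.variance` computes the sample variance, which also divides by one less than the number of data points.

```python
import statistics

# The four heights (in cm) from the table above.
heights_cm = [158, 172, 177, 183]

mean_height = sum(heights_cm) / len(heights_cm)

# Sum of squared differences from the mean, divided by (n - 1).
squared_differences = [(h - mean_height) ** 2 for h in heights_cm]
variance_by_hand = sum(squared_differences) / (len(heights_cm) - 1)

# The standard library's sample variance uses the same (n - 1) divisor.
variance_from_library = statistics.variance(heights_cm)

print(f"mean = {mean_height} cm, variance = {variance_by_hand:.2f} cm^2")
```

Because the differences are squared, the variance is expressed in squared units (cm² here) rather than in the units of the original variable.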

Correlation

Correlation is a measure of the linear relationship between two variables. That is, when one variable goes up, does the other go up or down? There are several ways to calculate correlation, but the result is always a score between minus one and one.

For two variables, X and Y, the correlation has the following interpretation.

| Correlation score | Interpretation |
|-------------------|----------------|
| -1                | When X increases, Y decreases. The closer to -1 the correlation value is, the stronger the relationship. |
| Between -1 and 0  | When X increases, Y decreases. The closer to 0, the weaker the relationship is. |
| 0                 | There is no linear relationship between X and Y, so there is no correlation. |
| Between 0 and 1   | When X increases, Y increases. The closer to 0, the weaker the relationship is. |
| 1                 | When X increases, Y increases. The closer to 1 the correlation value is, the stronger the relationship. |

Note that correlation does not account for non-linear effects, so if X and Y do not have a straight-line relationship, the correlation score may not be meaningful.
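
To make this concrete, here is a sketch of one common way to calculate correlation, the Pearson correlation coefficient. The article does not say which method it has in mind, so treat this choice as an assumption. The function name `pearson_correlation` and the small datasets are made up for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)


# Made-up data: y rises in a straight line with x, then falls with x.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

# Made-up data with a perfect but non-linear (U-shaped) relationship.
x_symmetric = [-2, -1, 0, 1, 2]
u_shaped = [value ** 2 for value in x_symmetric]

print(round(pearson_correlation(x, rising), 3))             # close to 1
print(round(pearson_correlation(x, falling), 3))            # close to -1
print(round(pearson_correlation(x_symmetric, u_shaped), 3)) # close to 0
```

In the last case Y is completely determined by X, yet the score is zero because the relationship is not a straight line.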

Want to Learn More?

We hope you enjoyed this short introduction to descriptive statistics. In the next series entry, we’ll explore correlation more deeply. Specifically, we’ll look at where the saying "correlation does not imply causation" comes from and how you can avoid this pitfall.
