Data Demystified: Data Visualizations that Capture Distributions

In part 10 of Data Demystified, we’ll continue our look at data visualization with charts that capture distributions.
Sep 2022  · 8 min read

Welcome to part ten of our month-long data demystified series. As part of Data Literacy Month, this series will clarify key concepts from the world of data, answer the questions you may be too afraid to ask and have fun along the way. If you want to start at the beginning, read our first entry in the series: What is a Dataset?

Data visualization is often called the gateway drug into data science. In this post, we’ll look at common data visualizations that capture distributions and how to interpret them.

Visualizations that Capture Distribution

A key use case of data visualization is capturing the distribution of a variable. Capturing distributions allows you to understand critical statistical properties of the data you’re visualizing and helps audiences make educated, data-driven decisions on key outcomes. Before diving in, here are some key questions to keep in mind when visualizing distributions:

  1. What is the shape of the distribution? Is the distribution symmetrical or otherwise?
  2. What is the distribution’s spread? For example, what is the difference between the smallest and the largest value in the dataset?
  3. Are there any outliers in the distribution?
  4. Is there any pattern to the distribution? Is the distribution random, or is there an obvious shape?
  5. What is the typical value (mean, median, or mode)?

The four visualizations below help us capture these pointers. 
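
Before plotting anything, several of these questions can be answered roughly with a few summary numbers. The following is a minimal sketch using only Python's standard library; the salary figures are made up for illustration, and comparing the mean with the median is only a rough hint about skew, not a formal test.

```python
import statistics

# Hypothetical sample: salaries in thousands (made-up numbers for illustration).
salaries = [32, 35, 35, 38, 40, 41, 43, 45, 47, 120]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
modes = statistics.multimode(salaries)   # list of the most frequent value(s)
spread = max(salaries) - min(salaries)   # range: largest minus smallest value

# Rough heuristic (an assumption of this sketch): a mean well above the
# median suggests a long right tail.
shape = "right-skewed" if mean > median else "left-skewed or symmetric"

print(f"mean={mean}, median={median}, modes={modes}, range={spread}, shape={shape}")
```

Here the single large salary pulls the mean well above the median, which is exactly the kind of feature a histogram or box plot makes visible at a glance.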

Histograms

A histogram is a graph showing a numerical variable's distribution with bars. It is a convenient way to illustrate the major features of the distribution, especially when the data set is large. Key examples where histograms shine are capturing the salary distribution of employees in a company or the blood sugar levels of a cohort of patients. 

A histogram depicting the age at death for Australian males in 2022 (Source: Oosterbaan)

To build a histogram, the numerical data is first divided into several ranges, or bins, and the number of values that fall into each bin is counted. The horizontal axis shows the bins, while the vertical axis represents the frequency or percentage of values in each bin. Histograms immediately show whether a variable's distribution is skewed and where it peaks. Here are examples from our Data Visualization for Everyone course.

A histogram can be symmetric, left-skewed, or right-skewed. (Source: DataCamp)

A histogram can have multiple modes. (Source: DataCamp)
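
The binning-and-counting step described above can be sketched in a few lines. This is an illustrative example only: the `histogram_counts` helper, the ages, and the bin settings are all hypothetical, and plotting libraries handle bin edges with their own conventions.

```python
def histogram_counts(values, bin_start, bin_width, n_bins):
    """Count how many values fall into each equal-width bin.

    Each bin covers [left, left + bin_width), except the last bin,
    which also includes its right edge.
    """
    counts = [0] * n_bins
    right_edge = bin_start + n_bins * bin_width
    for v in values:
        index = int((v - bin_start) // bin_width)
        if v == right_edge:
            index = n_bins - 1          # the right edge belongs to the last bin
        if 0 <= index < n_bins:         # values outside the range are ignored
            counts[index] += 1
    return counts

# Hypothetical ages (made-up numbers for illustration).
ages = [23, 25, 31, 34, 35, 38, 41, 42, 44, 47, 52, 58, 60]
counts = histogram_counts(ages, bin_start=20, bin_width=10, n_bins=4)

# Print a text histogram: one row per bin, one '#' per value.
for i, c in enumerate(counts):
    low = 20 + 10 * i
    print(f"{low}-{low + 10}: {'#' * c}")
```

Each row of `#` characters plays the role of one bar: its length is the frequency of that bin.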

While histograms and bar charts bear resemblances, they serve distinct functions and thus are not to be confused. Here are the key differences.

 

Functional difference

  Histogram: To display the distribution of a numerical variable.
  Bar chart: To compare values across categories.

Visual difference

  Histogram: There is no space between the bars.
  Bar chart: There is usually a space between the bars.
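
To make the functional difference concrete, here is a small sketch of the data behind a bar chart: one value per category, rather than counts over contiguous numeric ranges. The department labels are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical data: one department label per employee.
departments = ["Sales", "Engineering", "Sales", "Support", "Engineering", "Sales"]

# A bar chart needs one value per category; here, a simple count.
bar_heights = Counter(departments)

# Print a text bar chart, most common category first.
for category, count in bar_heights.most_common():
    print(f"{category:<12} {'#' * count}")
```

The categories have no numeric order and no ranges between them, which is why the bars of a bar chart are usually drawn with gaps, while histogram bars touch.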

Density plots

Just like a histogram, a density plot represents the distribution of a numerical variable. Unlike a histogram, a density plot uses a smooth line instead of bars. The horizontal axis of a density plot is the numerical variable, while the vertical axis shows the probability density. The probability that the variable falls within a given range is the area under the curve over that range.

The probability that a mouse has a birth weight between 1.0 and 1.2 grams is the area under the density plot over that range (Source: SPSS)
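
The "probability is the area under the curve" idea can be checked numerically. The sketch below builds a simple Gaussian kernel density estimate by hand and integrates it with the trapezoid rule. The birth weights, the bandwidth of 0.1, and the helper names `gaussian_kde` and `area_under` are all assumptions made for illustration; statistical libraries choose the bandwidth automatically and may give somewhat different curves.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density function: the average of Gaussian bumps centred on each data point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density

def area_under(density, low, high, steps=1000):
    """Approximate the area under the density between low and high (trapezoid rule)."""
    width = (high - low) / steps
    total = 0.5 * (density(low) + density(high))
    total += sum(density(low + i * width) for i in range(1, steps))
    return total * width

# Hypothetical birth weights in grams (made-up numbers for illustration).
weights = [0.8, 0.9, 0.95, 1.0, 1.05, 1.1, 1.1, 1.15, 1.2, 1.3, 1.4]
density = gaussian_kde(weights, bandwidth=0.1)  # bandwidth is an assumed smoothing choice

p = area_under(density, 1.0, 1.2)      # estimated probability of 1.0 to 1.2 grams
total = area_under(density, 0.0, 2.5)  # area over (almost) the whole curve

print(f"P(1.0 <= weight <= 1.2) is roughly {p:.2f}")
print(f"Total area under the curve is roughly {total:.2f}")
```

The total area comes out very close to 1, which is what makes it valid to read the area over any sub-range as a probability.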

A density plot can show the shape of a distribution more effectively than a histogram. A histogram with too few or too many bins might hide the actual shape of the underlying distribution. In contrast, a density plot does not require binning and displays a smooth distribution curve, although its smoothness does depend on a bandwidth setting that most tools choose automatically.

The choice of bin count in a histogram is crucial. (Source: Laerd Statistics)
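
A small experiment shows why the bin count matters. In this sketch, the same made-up data (two clusters, around 2 and around 8) is binned three ways over the fixed range 0 to 10; the `bin_counts` helper is hypothetical and written just for this example.

```python
def bin_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins covering [low, high)."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v < high:
            counts[int((v - low) / width)] += 1
    return counts

# Hypothetical data with two clusters (made-up numbers for illustration).
data = [1, 2, 2, 2, 3, 3, 7, 8, 8, 8, 9, 9]

few = bin_counts(data, 0, 10, 2)     # [6, 6]: looks flat, the two clusters are hidden
medium = bin_counts(data, 0, 10, 5)  # [1, 5, 0, 1, 5]: two peaks and the gap appear
many = bin_counts(data, 0, 10, 10)   # [0, 1, 3, 2, 0, 0, 0, 1, 3, 2]: more detail, more noise

for label, counts in [("2 bins", few), ("5 bins", medium), ("10 bins", many)]:
    print(f"{label:>7}: {counts}")
```

With only two bins the data looks evenly spread, even though nothing at all falls between 4 and 6. With real, larger datasets, very many bins tend to produce the opposite problem: a jagged chart where random noise is hard to tell apart from real structure.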

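A density plot is also better than a histogram at comparing multiple distributions, because overlapping curves stay readable where overlapping or side-by-side bars quickly become cluttered.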

Comparing distributions with density plots vs histograms (Source: Will Koehrsen)

Box plots

Histograms are well-suited for displaying the overall distribution of the data, but box plots are excellent at summarizing a distribution. 

The anatomy of a box plot (Source: Galarnyk)

Visualizing data with a box plot reveals the following:

  1. The median: The middle value of a dataset where 50% of the data is less than the median, and 50% of the data is higher than the median. 
  2. The upper quartile: The 75th percentile of a dataset where 75% of the data is less than the upper quartile, and 25% of the data is higher than the upper quartile. 
  3. The lower quartile: The 25th percentile of a dataset where 25% of the data is less than the lower quartile and 75% is higher than the lower quartile. 
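  4. The interquartile range: The upper quartile minus the lower quartile.
  5. The upper adjacent value: Colloquially, the “maximum”. It is the largest data point that is no greater than the upper quartile plus 1.5 times the interquartile range, and it marks the end of the upper whisker.
  6. The lower adjacent value: Colloquially, the “minimum”. It is the smallest data point that is no less than the lower quartile minus 1.5 times the interquartile range, and it marks the end of the lower whisker.
  7. Outliers: Any values above the upper quartile plus 1.5 times the interquartile range, or below the lower quartile minus 1.5 times the interquartile range.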
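
The quantities in this list can be computed directly. The sketch below uses Python's `statistics.quantiles` on made-up data. It assumes the common plotting convention in which the "fences" sit 1.5 times the interquartile range beyond the quartiles and the whiskers stop at the most extreme data points inside those fences; the `"inclusive"` quartile method is also an assumption, and other conventions give slightly different quartiles.

```python
import statistics

# Hypothetical measurements with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

# Quartiles: the three cut points that split the data into four parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                      # interquartile range

# Fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whisker ends: the most extreme data points that still lie inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
lower_adjacent = min(inside)
upper_adjacent = max(inside)

# Outliers: everything outside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), whiskers=({lower_adjacent}, {upper_adjacent})")
print(f"outliers={outliers}")
```

For this data the upper fence is 17.0, but the upper whisker stops at 12, the largest value inside the fence, and 30 is drawn as a separate outlier point.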

Violin plots

A violin plot is a hybrid between a box plot and a density plot. 

A violin plot showing the distribution of total bill (Source: DataCamp)

Like a density plot, a violin plot displays the density of the data. Like a box plot, it also shows summary statistics. Violin plots are an effective tool for simultaneously displaying and summarizing the distribution of a numerical variable.

The anatomy of a violin plot (Source: Hintze and Nelson)

Get Started with Data Visualization Today

We hope you enjoyed this short introduction to data visualization. In the next series entry, we’ll look at how AI is covered in the news and how to grow a healthy skepticism around the latest advancements in the field. To start your data learning journey today, check out the following resources. 

Interactive Data Visualization with plotly in R (Beginner, 4 hours): Learn to create interactive graphics entirely in R with plotly.

Data Visualization in Spreadsheets (Beginner, 4 hours): Learn the fundamentals of data visualization using spreadsheets.

Introduction to Data Visualization with Matplotlib (Beginner, 4 hours): Learn how to create, customize, and share data visualizations using Matplotlib.