
Histograms in Matplotlib

Learn about histograms and how you can use them to gain insights from data with the help of matplotlib.
Jun 2019  · 8 min read


According to Wikipedia, a histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson. It is a type of bar graph. To construct a histogram, the first step is to “bin” the range of values (that is, divide the entire range of values into a series of intervals) and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size.
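
The binning step described above can be sketched in a few lines of NumPy. The sample values and the bin edges below are made up purely for illustration, and the bins are deliberately of unequal width.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate binning.
values = np.array([0.5, 1.2, 1.9, 2.4, 2.5, 3.1, 3.8, 4.0])

# Consecutive, non-overlapping, adjacent intervals: [0, 2), [2, 3), [3, 4].
# The bins do not need to be the same width.
edges = np.array([0.0, 2.0, 3.0, 4.0])

# Count how many values fall into each interval.
counts, _ = np.histogram(values, bins=edges)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count} values")
```

Every value lands in exactly one bin, so the counts always add up to the number of values.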

Apart from numerical data, histograms can also be used to visualize the distribution of pixel intensities in images, since an image is nothing but a grid of picture elements (pixels) whose values range from $0$ to $255$. The x-axis of the histogram shows the bins (value ranges), while the y-axis shows the frequency, that is, the number of values that fall into each bin. The number of bins is a parameter that you can vary depending on how you want to visualize the distribution of your data.

For example, say you would like to plot a histogram of an RGB image whose pixel values range from $0$ to $255$. The maximum number of bins this image can have is $256$, one per possible pixel value. If you use $256$ bins on the x-axis, then the y-axis will show the frequency of each pixel value in that image.
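
As a minimal sketch of the one-bin-per-pixel-value idea, the counts for an 8-bit image can be obtained with np.bincount. The image here is a synthetic random array (an assumption, not a real photograph), and the names rng, image, and counts are illustrative.

```python
import numpy as np

# A synthetic 8-bit grayscale "image" (an assumption; any uint8 array works).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

# One bin per possible pixel value: 0, 1, ..., 255 gives 256 bins.
# minlength=256 keeps a slot even for values that never occur.
counts = np.bincount(image.ravel(), minlength=256)

print(len(counts), counts.sum())
```

counts[v] is the number of pixels with value v, and the counts sum to the total number of pixels in the image.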

Histograms are similar in spirit to bar graphs. Let's take a look at one pictorial example of a histogram:

A histogram is an excellent tool for visualizing and understanding the probability distribution of numerical data or image data, and it is intuitively understood by almost everyone. Python offers many options for building and plotting histograms. Several third-party libraries are available for creating graphs, and one of the most widely used is matplotlib.

In today's tutorial, you will be mostly using matplotlib to create and visualize histograms on various kinds of data sets.

So without any further ado, let's get started.

Plotting Histogram using NumPy and Matplotlib

import numpy as np

For reproducibility, you will use NumPy's seed function, which makes the random number generator produce the same output each time the code is executed.

np.random.seed(100)

You will plot the histogram of a Gaussian (normal) distribution with a mean of $0$ and a standard deviation of $1$.

np_hist = np.random.normal(loc=0, scale=1, size=1000)
np_hist[:10]
array([-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604,  0.98132079,
        0.51421884,  0.22117967, -1.07004333, -0.18949583,  0.25500144])

Let's verify the mean and standard deviation of the above sample.

np_hist.mean(), np_hist.std()
(-0.016772157343909157, 1.0458427194167)

Next, you will use NumPy's histogram function, which returns hist (the count in each bin) and bin_edges (the bin boundaries, one more than the number of bins).

hist, bin_edges = np.histogram(np_hist)
hist
array([  7,  37, 111, 207, 275, 213, 105,  31,  10,   4])
bin_edges
array([-3.20995538, -2.50316588, -1.79637637, -1.08958687, -0.38279736,
        0.32399215,  1.03078165,  1.73757116,  2.44436066,  3.15115017,

Next, let's plot the histogram using matplotlib's plt.bar function, where the x-axis and y-axis will be bin_edges and hist, respectively.

import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=[10,8])
# One bar per bin; bin_edges has one more entry than hist, so drop the last edge
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7)
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.title('Normal Distribution Histogram',fontsize=15)

Let's round the bin_edges to the nearest integer.

bin_edges = np.round(bin_edges,0)
plt.figure(figsize=[10,8])
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7)
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.title('Normal Distribution Histogram',fontsize=15)

Now the graph looks much better, doesn't it?
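
Rounding the edges is a quick cosmetic fix, but it moves the bars away from the true bin boundaries. As an alternative sketch (not part of the original walkthrough), each bar can be drawn from its true left edge with its true width. The sample below is a freshly generated illustrative array rather than np_hist, and the names rng, data, widths, and bars are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(100)  # illustrative data, not the tutorial's exact sample
data = rng.normal(loc=0, scale=1, size=1000)

hist, bin_edges = np.histogram(data)
widths = np.diff(bin_edges)  # one width per bin

fig, ax = plt.subplots(figsize=(10, 8))
# align='edge' anchors each bar at its left bin edge instead of centering it there
bars = ax.bar(bin_edges[:-1], hist, width=widths, align='edge',
              color='#0504aa', alpha=0.7, edgecolor='white')
ax.set_xlim(bin_edges[0], bin_edges[-1])
ax.grid(axis='y', alpha=0.75)
ax.set_title('Normal Distribution Histogram', fontsize=15)
```

Outside a notebook, call plt.show() afterwards to display the figure.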

To understand hist and bin_edges, let's look at an example:

You will define an array of arbitrary values and define the bin edges with range(5), that is, the integers $0$ through $4$.

hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3],bins=range(5))
hist
array([0, 2, 4, 1])

In the above output, hist indicates that there are $0$ items in bin $0$, $2$ in bin $1$, $4$ in bin $2$, and $1$ in bin $3$.

bin_edges
array([0, 1, 2, 3, 4])

Similarly, the above bin_edges output means bin $0$ is the interval $[0,1)$, bin $1$ is the interval $[1,2)$, bin $2$ is the interval $[2,3)$, and bin $3$ is the interval $[3,4]$. Every bin is half-open except the last one, which also includes its right edge.
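
To see how NumPy assigns values that sit exactly on an edge, here is a small sketch with one made-up value per edge. It shows that each edge value goes to the bin on its right, except the final edge, which is counted in the last bin.

```python
import numpy as np

# Each edge value appears once so the bin that receives it is easy to see.
hist, bin_edges = np.histogram([0, 1, 2, 3, 4], bins=range(5))
print(hist)       # counts per bin
print(bin_edges)  # five edges define four bins
```

The values $3$ and $4$ both land in the last bin, so that bin has a count of $2$ while the others have a count of $1$.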

Plotting Histogram using only Matplotlib

Plotting a histogram using matplotlib is a piece of cake. All you have to do is call matplotlib's plt.hist() function and pass in the data along with the number of bins and a few optional parameters.

In plt.hist(), passing bins='auto' lets NumPy estimate a suitable number of bins from the data instead of using a fixed count. The idea is to select a bin width that generates the most faithful representation of your data.
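
To get a feel for what 'auto' chooses, you can ask NumPy for the bin edges directly with np.histogram_bin_edges and compare a few rules. This is a rough sketch on an illustrative random sample; the names rng, data, and n_bins are assumptions, and the exact counts depend on the data and the NumPy version.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative sample
data = rng.normal(size=1000)

# 'sturges' and 'fd' (Freedman-Diaconis) are two of the named rules NumPy offers;
# 'auto' picks between estimators like these based on the data.
n_bins = {}
for rule in ('auto', 'sturges', 'fd'):
    edges = np.histogram_bin_edges(data, bins=rule)
    n_bins[rule] = len(edges) - 1
    print(rule, n_bins[rule])
```

The same strings can be passed as the bins argument of plt.hist(), since it relies on NumPy to compute the bins.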

That's all. So, let's plot the histogram of the normal distribution sample (np_hist) you built using numpy.

n, bins, patches = plt.hist(x=np_hist, bins=8, color='#0504aa',alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.title('Normal Distribution Histogram',fontsize=15)

Plotting two histograms together

x = 0.3*np.random.randn(1000)
y = 0.3*np.random.randn(1000)
n, bins, patches = plt.hist([x, y])
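
When two datasets are passed as a list, plt.hist() computes one shared set of bins and draws the bars side by side. The sketch below adds labels and a legend so the two samples can be told apart. The second sample is shifted by $0.5$ (an assumption made only so the difference is visible), and it uses a seeded generator named rng for repeatability.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = 0.3 * rng.standard_normal(1000)
y = 0.3 * rng.standard_normal(1000) + 0.5  # shifted so the two samples differ visibly

fig, ax = plt.subplots()
# n holds one row of counts per dataset; bins is shared by both
n, bins, patches = ax.hist([x, y], bins=10, label=['x', 'y'])
ax.legend()
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
```

Because the bins span the combined range of both samples, every value of x and y is counted.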

Plotting Histogram of Iris Data using Pandas

You will use sklearn to load a dataset called iris. sklearn has a module called datasets, which includes the Iris dataset and can load it on the fly. So, let's quickly load the iris dataset.

from sklearn.datasets import load_iris
import pandas as pd
data = load_iris().data
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']

Next, you will create the dataframe of the iris dataset.

dataset = pd.DataFrame(data,columns=names)
dataset.head()
   sepal-length  sepal-width  petal-length  petal-width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

Finally, you will use the pandas built-in method .hist(), which plots a histogram for every feature in the dataset.

Isn't that wonderful?

fig = plt.figure(figsize = (8,8))
ax = fig.gca()
dataset.hist(ax=ax)

The petal-length and petal-width histograms show two separate peaks (a bimodal distribution), whereas sepal-length and sepal-width are unimodal, with sepal-width coming closest to a Gaussian curve. This kind of analysis is useful because it lets you choose an algorithm that works well with the distribution at hand.

Plot Histogram with Matplotlib

f,a = plt.subplots(2,2)
a = a.ravel()
for idx,ax in enumerate(a):
    ax.hist(dataset.iloc[:,idx], bins='auto', color='#0504aa',alpha=0.7, rwidth=0.85)

Plotting Histogram of Binary and RGB Images

In the final segment of the tutorial, you will visualize images from two different domains: binary (document) images and natural images. Binary images are images whose pixel values are mostly $0$ or $255$, whereas each color channel of a natural image can take any value from $0$ to $255$.

Analyzing the pixel distribution by plotting a histogram of intensity values is a direct way of measuring how often each pixel value occurs in a given image. For an 8-bit grayscale image, there are 256 possible intensity values. Hence, to study the difference between document and natural images in the pixel distribution domain, you will plot two histograms, one for the document image and one for the natural image. Because readable documents are sparse (mostly background with comparatively little ink), the distribution of pixels in these document images should be heavily skewed.
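
The sparsity argument can be illustrated without any image files. The sketch below builds two synthetic stand-ins (assumptions, not real images) and counts how many of the 256 intensity levels each one actually uses. The helper occupied_levels is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not real images):
# a "document" that is 90% white background and 10% black text,
# and a "natural" image with pixel values spread over the full range.
document = rng.choice(np.array([0, 255], dtype=np.uint8), size=(128, 128), p=[0.1, 0.9])
natural = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)

def occupied_levels(image):
    """Number of distinct intensity values that occur at least once."""
    return int(np.count_nonzero(np.bincount(image.ravel(), minlength=256)))

print(occupied_levels(document), occupied_levels(natural))
```

A real scanned document will use more than two levels because of anti-aliasing and noise, but most of its pixels should still pile up in a few bins.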

You will use the opencv module to load the two images, convert them to grayscale with cv2.cvtColor, and resize the document image so that both images have the same size.

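import cv2
lena_rgb = cv2.imread('lena.png')
lena_gray = cv2.cvtColor(lena_rgb,cv2.COLOR_BGR2GRAY)
lena_gray.shape
(512, 512)
binary = cv2.imread('binary.png')
binary.shape
(1394, 2190, 3)
# Pass interpolation by keyword: the third positional argument of cv2.resize is dst
binary = cv2.resize(binary,(512,512),interpolation=cv2.INTER_CUBIC)
binary.shape
(512, 512, 3)
binary_gray = cv2.cvtColor(binary,cv2.COLOR_BGR2GRAY)
binary_gray.shape
(512, 512)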

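Let's visualize both the natural and document images.

plt.subplot(121)
plt.imshow(lena_rgb[:,:,::-1])  # OpenCV loads images as BGR; reverse the channels for display
plt.title("Natural Image")
plt.subplot(122)
plt.imshow(binary[:,:,::-1])
plt.title("Document Image")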

Note: You will use $256$ bins, since an 8-bit image has $256$ possible pixel values and you want to visualize the frequency of each value from $0$ to $255$. This requires $257$ bin edges, so that the value $255$ gets its own bin.

hist, edges = np.histogram(binary_gray,bins=range(257))
plt.figure(figsize=[10,8])
plt.bar(edges[:-1], hist, width = 0.8, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.title('Document Image Histogram',fontsize=15)

hist, edges = np.histogram(lena_gray,bins=range(257))
plt.figure(figsize=[10,8])
plt.bar(edges[:-1], hist, width = 0.5, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.ylabel('Frequency of Pixels',fontsize=15)
plt.title('Natural Image Histogram',fontsize=15)
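
One detail worth checking when choosing edges for 8-bit images: np.histogram ignores values that lie outside the outermost edges. A small sketch with made-up pixel values shows how the edge range affects which values are counted. The names pixels, short_hist, and full_hist are illustrative.

```python
import numpy as np

# A tiny made-up "image" containing the extreme 8-bit values.
pixels = np.array([0, 254, 255, 255], dtype=np.uint8)

# range(255) yields edges 0..254, so the value 255 lies outside every bin.
short_hist, _ = np.histogram(pixels, bins=range(255))

# range(257) yields edges 0..256, giving one bin for each value 0..255.
full_hist, _ = np.histogram(pixels, bins=range(257))

print(short_hist.sum(), full_hist.sum())
```

For a document image, where white background pixels (value $255$) dominate, dropping that value would remove the tallest bar from the histogram.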

From the above figures, you can observe that the natural image is spread across all 256 intensities with no single dominant value, whereas the document image shows a unimodal distribution that is heavily skewed, with most pixels concentrated in a narrow band of the $0$ to $255$ range.


Congratulations on finishing the tutorial.

This tutorial was a good starting point for learning how to create a histogram using matplotlib with the help of numpy and pandas.

You also learned how you can leverage the power of histograms to differentiate between two different image domains, namely document and natural images.

If you would like to learn more about data visualization techniques, consider taking DataCamp's Data Visualization with Matplotlib course.

Please feel free to ask any questions related to this tutorial in the comments section below.
