Histograms in Matplotlib

Learn about histograms and how you can use them to gain insights from data with the help of matplotlib.

Introduction

According to Wikipedia, a histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson. It is a type of bar graph. To construct a histogram, the first step is to “bin” the range of values (that is, divide the entire range of values into a series of intervals) and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size.
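The binning procedure described above can be sketched by hand in a few lines. This is only an illustration of the idea, not how any library implements it; the `values` and `edges` arrays are made-up sample data, and treating the last bin as closed on the right is an assumption chosen to match NumPy's convention.

```python
import numpy as np

# Hypothetical sample data and bin edges, chosen only for illustration.
values = np.array([1.0, 1.5, 2.2, 2.8, 3.1, 3.9, 4.0])
edges = np.array([1.0, 2.0, 3.0, 4.0])  # three adjacent, equal-width bins

counts = []
n_bins = len(edges) - 1
for i in range(n_bins):
    left, right = edges[i], edges[i + 1]
    if i == n_bins - 1:
        # Last bin is closed on the right so the maximum value is counted.
        in_bin = (values >= left) & (values <= right)
    else:
        # Every other bin is half-open: [left, right).
        in_bin = (values >= left) & (values < right)
    counts.append(int(in_bin.sum()))

print(counts)  # [2, 2, 3]
# The manual count agrees with NumPy's own binning.
print(np.histogram(values, bins=edges)[0].tolist())  # [2, 2, 3]
```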

Apart from numerical data, histograms can also be used to visualize the distribution of pixel intensities in images, since an image is nothing but a combination of picture elements (pixels) with values ranging from $0$ to $255$. The x-axis of a histogram shows the bin ranges, while the y-axis shows the frequency of (the number of values in) each bin. The number of bins is a parameter that you can vary depending on how you want to visualize the distribution of your data.

For example, let's say you would like to plot a histogram of an image whose pixel values range from $0$ to $255$. The maximum number of bins this image can have is $256$, one per possible pixel value. If you use $256$ bins on the x-axis, then the y-axis will denote the frequency of each pixel value in that image.
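As a rough sketch of "one bin per pixel value", the snippet below counts how often each intensity occurs in a tiny array. The `image` array is hypothetical data standing in for a real picture, and `np.bincount` is used here simply as a convenient way to count integer values.

```python
import numpy as np

# Hypothetical 2x3 grayscale "image" with 8-bit pixel values.
image = np.array([[0, 0, 255],
                  [128, 255, 255]], dtype=np.uint8)

# One counter per possible intensity: 0, 1, ..., 255 (256 in total).
pixel_counts = np.bincount(image.ravel(), minlength=256)

print(len(pixel_counts))  # 256
print(pixel_counts[0], pixel_counts[128], pixel_counts[255])  # 2 1 3
```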

Histograms are similar in spirit to bar graphs. Let's take a look at one pictorial example of a histogram:

A histogram is an excellent tool for visualizing and understanding the probability distribution of numerical or image data, and it is intuitively understood by almost everyone. Python has a lot of different options for building and plotting histograms. Several third-party libraries are available for creating graphs, and one such library is matplotlib.

In today's tutorial, you will be mostly using matplotlib to create and visualize histograms on various kinds of data sets.

So without any further ado, let's get started.

Plotting Histogram using Numpy and Matplotlib

import numpy as np

For reproducibility, you will use the seed function of numpy, which will give the same output each time it is executed.

You will plot the histogram of a Gaussian (normal) distribution with a mean of $0$ and a standard deviation of $1$.

np.random.seed(100)
np_hist = np.random.normal(loc=0, scale=1, size=1000)
np_hist[:10]
array([-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604,  0.98132079,
        0.51421884,  0.22117967, -1.07004333, -0.18949583,  0.25500144])
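To see what seeding buys you, here is a small sketch that draws from the same seed twice. The variable names `first`, `second`, and `third` are made up for this illustration.

```python
import numpy as np

np.random.seed(100)
first = np.random.normal(loc=0, scale=1, size=5)

np.random.seed(100)  # reset the global generator to the same state
second = np.random.normal(loc=0, scale=1, size=5)

print(np.array_equal(first, second))  # True

# Without reseeding, the next draw continues the random stream and differs.
third = np.random.normal(loc=0, scale=1, size=5)
print(np.array_equal(first, third))   # False
```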

Let's verify the mean and standard deviation of the above distribution.

np_hist.mean(),np_hist.std()
(-0.016772157343909157, 1.0458427194167)
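The sample mean and standard deviation are close to, but not exactly, $0$ and $1$ because they are estimates from a finite sample. The sketch below, with sample sizes picked arbitrarily for illustration, suggests how the estimates tighten as the sample grows. Note that `std()` uses `ddof=0` by default, that is, it divides by the sample size.

```python
import numpy as np

np.random.seed(100)
for size in (100, 10_000, 1_000_000):
    sample = np.random.normal(loc=0, scale=1, size=size)
    # Larger samples should land closer to the true mean 0 and std 1.
    print(size, round(sample.mean(), 4), round(sample.std(), 4))
```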

Next, you will use numpy's histogram function, which will return hist and bin_edges.

hist,bin_edges = np.histogram(np_hist)
hist
array([  7,  37, 111, 207, 275, 213, 105,  31,  10,   4])
bin_edges
array([-3.20995538, -2.50316588, -1.79637637, -1.08958687, -0.38279736,
        0.32399215,  1.03078165,  1.73757116,  2.44436066,  3.15115017,
        3.85793967])
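A few quick checks can make the relationship between the two return values concrete. This sketch rebuilds the same data so that it runs on its own; `bin_widths` and `bin_centers` are helper names introduced only for this example.

```python
import numpy as np

np.random.seed(100)
np_hist = np.random.normal(loc=0, scale=1, size=1000)
hist, bin_edges = np.histogram(np_hist)  # bins=10 by default

# There is always one more edge than there are bins.
print(len(hist), len(bin_edges))  # 10 11

# Every sample falls into exactly one bin, so the counts add up to the size.
print(hist.sum())  # 1000

# With an integer number of bins, all bins have the same width.
bin_widths = np.diff(bin_edges)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
print(np.allclose(bin_widths, bin_widths[0]))  # True
```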

Next, let's plot the histogram using matplotlib's plt.bar function where your x-axis and y-axis will be bin_edges and hist, respectively.

import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=[10,8])

# bin_edges has one more entry than hist, so drop the last edge;
# align='edge' starts each bar at the left edge of its bin
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7, align='edge')
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()

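Let's round the bin_edges to the nearest integer so that the bars sit on whole-number positions. Note that rounding can map neighbouring edges to the same integer (here, for example, both $-3.21$ and $-2.50$ round to $-3$), so some bars will overlap.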

bin_edges = np.round(bin_edges,0)
plt.figure(figsize=[10,8])

# align='edge' starts each bar at the (rounded) left edge of its bin
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7, align='edge')
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()

Now the graph looks much better, doesn't it?

To understand hist and bin_edges, let's look at an example:

You will define an array having arbitrary values and define bins with a range of $5$.

hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3],bins=range(5))
hist
array([0, 2, 4, 1])

In the above output, hist indicates that there are $0$ items in bin $0$, $2$ in bin $1$, $4$ in bin $2$, and $1$ in bin $3$.

bin_edges
array([0, 1, 2, 3, 4])

Similarly, the above bin_edges output means bin $0$ is the interval $[0,1)$, bin $1$ is the interval $[1,2)$, bin $2$ is the interval $[2,3)$, and bin $3$ is the interval $[3,4]$. All bins are half-open except the last one, which also includes its right edge.
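The edge behaviour is easy to check with single-value inputs. In this small sketch the input lists are arbitrary test values, and the bin edges are the same `range(5)` as above.

```python
import numpy as np

edges = range(5)  # bin edges 0, 1, 2, 3, 4

# A value equal to an interior edge is counted in the bin to its right.
print(np.histogram([1], bins=edges)[0].tolist())  # [0, 1, 0, 0]

# A value equal to the final edge still lands in the last bin.
print(np.histogram([4], bins=edges)[0].tolist())  # [0, 0, 0, 1]

# Values outside the outermost edges are ignored.
print(np.histogram([-1, 5], bins=edges)[0].tolist())  # [0, 0, 0, 0]
```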

Plotting Histogram using only Matplotlib

Plotting a histogram using matplotlib is a piece of cake. All you have to do is call matplotlib's plt.hist() function and pass in the data along with the number of bins and a few optional parameters.

In plt.hist(), passing bins='auto' gives you the “ideal” number of bins. The idea is to select a bin width that generates the most faithful representation of your data.
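String values of bins are binning strategies handled by NumPy, and NumPy documents 'auto' as a combination of the Sturges and Freedman-Diaconis ('fd') estimators. The sketch below uses `np.histogram_bin_edges` to compare how many bins each rule picks for the same sample; the data is regenerated here so that the snippet runs on its own.

```python
import numpy as np

np.random.seed(100)
data = np.random.normal(loc=0, scale=1, size=1000)

bin_counts = {}
for rule in ("auto", "sturges", "fd"):
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1  # one fewer bin than edges
    print(rule, "->", bin_counts[rule], "bins")
```

The "ideal" number depends on the rule and on the data, so it is worth trying a couple of settings before settling on one.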

Now, let's plot the histogram of the normal distribution (np_hist) you built using numpy.

plt.figure(figsize=[10,8])
# rwidth=0.85 draws each bar at 85% of its bin width, leaving small gaps
n, bins, patches = plt.hist(x=np_hist, bins=8, color='#0504aa',alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()
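The three values returned by the hist call mirror what `np.histogram` gave you earlier, plus the drawn bars. Here is a small sketch to confirm this; the `Agg` backend line is an assumption for running outside a notebook and can be dropped otherwise.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this in a notebook
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(100)
data = np.random.normal(loc=0, scale=1, size=1000)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=8, rwidth=0.85)

# n: counts per bin, bins: bin edges, patches: the bar artists.
print(len(n), len(bins), len(patches))  # 8 9 8

counts, edges = np.histogram(data, bins=8)
print(np.array_equal(n, counts), np.allclose(bins, edges))  # True True
plt.close(fig)
```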

Plotting two histograms together

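plt.figure(figsize=[10,8])
x = 0.3*np.random.randn(1000)
y = 0.3*np.random.randn(1000)
# Passing a list of arrays draws the datasets side by side within each bin
n, bins, patches = plt.hist([x, y], label=['x', 'y'])
plt.legend()
plt.show()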
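An alternative to side-by-side bars is to overlay two semi-transparent histograms that share the same bin edges. The following is one possible sketch, not the only way to do it: the shift of `0.5` applied to `y`, the choice of 20 bins, and the name `shared_edges` are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this in a notebook
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(100)
x = 0.3 * np.random.randn(1000)
y = 0.3 * np.random.randn(1000) + 0.5  # shifted so the two samples differ

# Compute the edges once from both samples so the bars line up.
shared_edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=20)

fig, ax = plt.subplots(figsize=[10, 8])
nx, _, _ = ax.hist(x, bins=shared_edges, alpha=0.5, label="x")
ny, _, _ = ax.hist(y, bins=shared_edges, alpha=0.5, label="y")
ax.legend()

print(int(nx.sum()), int(ny.sum()))  # 1000 1000
plt.close(fig)
```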

Plotting histogram of Iris data using Pandas

You will use sklearn to load a dataset called iris. sklearn has a module called datasets, which includes the Iris dataset and lets you load it on the fly. So, let's quickly load the iris dataset.

from sklearn.datasets import load_iris
import pandas as pd
data = load_iris().data
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']

Next, you will create the dataframe of the iris dataset.

dataset = pd.DataFrame(data,columns=names)
dataset.head()
   sepal-length  sepal-width  petal-length  petal-width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
dataset.columns[0]
'sepal-length'

Finally, you will use the pandas built-in DataFrame method .hist(), which plots a histogram for every numeric feature in the dataset.

Isn't that wonderful?

fig = plt.figure(figsize = (8,8))
ax = fig.gca()
dataset.hist(ax=ax)
plt.show()
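The .hist() method also accepts options such as `bins` and `figsize` and returns the axes it draws on. Here is a minimal sketch on a made-up DataFrame (random data with hypothetical column names `a` to `d`, used so that the snippet does not depend on sklearn):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris features: 150 rows, 4 numeric columns.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])

# One subplot per column; bins and figsize are passed straight through.
axes = frame.hist(bins=15, figsize=(8, 8))
print(axes.size)  # 4
plt.tight_layout()
plt.close("all")
```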

The petal-length and petal-width histograms each show two separate clusters (a bimodal shape), sepal-length is roughly unimodal, and sepal-width is the closest to a Gaussian curve. These are all useful observations, because you can then consider an algorithm that works well with that kind of distribution.

Plot Histogram with Matplotlib

# Pass figsize to plt.subplots directly; a separate plt.figure() call
# would create an extra, empty figure
f,a = plt.subplots(2,2, figsize=[10,10])
a = a.ravel()
for idx,ax in enumerate(a):
    ax.hist(dataset.iloc[:,idx], bins='auto', color='#0504aa',alpha=0.7, rwidth=0.85)
    ax.set_title(dataset.columns[idx])
plt.tight_layout()
plt.show()

Plotting Histogram of Binary and RGB Images

In the final segment of the tutorial, you will be visualizing images from two different domains: binary (document) images and natural images. Binary images are those whose pixel values are mostly $0$ or $255$, whereas each color channel of a natural image can take any value from $0$ to $255$.

Plotting a histogram of an image's intensity values is a direct way of measuring how often each pixel value occurs in that image. For an 8-bit grayscale image, there are 256 possible intensity values. Hence, to study the difference between the document and the natural images in the pixel distribution domain, you will plot two histograms, one for the document image and one for the natural image. Considering that readable documents have sparsity in them, the distribution of pixels in these document images should be mostly skewed.

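You will use the opencv module to load the two images, resize the document image to the same size as the natural image, and convert both to grayscale with cv2.cvtColor. (Alternatively, passing $0$ as the second argument to cv2.imread loads an image directly in grayscale.)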

import cv2
lena_rgb = cv2.imread('lena.png')
lena_gray = cv2.cvtColor(lena_rgb,cv2.COLOR_BGR2GRAY)
lena_gray.shape
(512, 512)
binary = cv2.imread('binary.png')
binary.shape
(1394, 2190, 3)
binary = cv2.resize(binary,(512,512),interpolation=cv2.INTER_CUBIC)
binary.shape
(512, 512, 3)
binary_gray = cv2.cvtColor(binary,cv2.COLOR_BGR2GRAY)
binary_gray.shape
(512, 512)

Let's visualize both the natural and the document images.

plt.figure(figsize=[10,10])

plt.subplot(121)
plt.imshow(lena_rgb[:,:,::-1])
plt.title("Natural Image")

plt.subplot(122)
plt.imshow(binary[:,:,::-1])
plt.title("Document Image")
plt.show()
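The `[:,:,::-1]` slice is there because OpenCV stores color channels in BGR order while matplotlib expects RGB. Here is a tiny sketch with a hypothetical one-pixel image; `cv2.cvtColor` with `cv2.COLOR_BGR2RGB` would be an equivalent way to do the same conversion.

```python
import numpy as np

# Hypothetical 1x1 image in OpenCV's BGR order: pure blue.
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reversing the last axis swaps the channel order from BGR to RGB.
rgb = bgr[:, :, ::-1]

print(bgr[0, 0].tolist())  # [255, 0, 0]
print(rgb[0, 0].tolist())  # [0, 0, 255]
```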

Note: You will use $256$ bins, since an 8-bit image has $256$ possible pixel values and you want to visualize the frequency of each value from $0$ to $255$. Passing bins=range(257) gives the $257$ edges $0, 1, \dots, 256$, so that every value, including $255$, gets its own bin.

# 257 edges (0..256) give 256 bins, one per pixel value
hist, edges = np.histogram(binary_gray,bins=range(257))
plt.figure(figsize=[10,8])

plt.bar(edges[:-1], hist, width = 0.8, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Document Image Histogram',fontsize=15)
plt.show()

# 257 edges (0..256) give 256 bins, one per pixel value
hist, edges = np.histogram(lena_gray,bins=range(257))
plt.figure(figsize=[10,8])

plt.bar(edges[:-1], hist, width = 0.5, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Pixels',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Natural Image Histogram',fontsize=15)
plt.show()
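The choice of bin edges matters at the top of the intensity range. This small sketch, on a made-up `pixels` array, shows that edges from `range(257)` count every 8-bit value, while a shorter range such as `range(255)` silently drops the value $255$.

```python
import numpy as np

# Hypothetical pixel values, including the extremes 0 and 255.
pixels = np.array([0, 0, 10, 254, 255, 255, 255], dtype=np.uint8)

full, _ = np.histogram(pixels, bins=range(257))   # 256 bins, edges 0..256
short, _ = np.histogram(pixels, bins=range(255))  # 254 bins, top edge 254

print(len(full), int(full.sum()))    # 256 7
print(len(short), int(short.sum()))  # 254 4  (the three 255s are dropped)

# With one bin per value, the histogram equals a plain value count.
print(np.array_equal(full, np.bincount(pixels, minlength=256)))  # True
```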

From the above figures, you can observe that the natural image is spread across all 256 intensities without any obvious pattern, whereas the document image shows a heavily skewed distribution, with most pixels piled up at the extreme values ($0$ or $255$) rather than spread across the range.

Conclusion

Congratulations on finishing the tutorial.

This tutorial was a good starting point for learning how to create a histogram using matplotlib with the help of numpy and pandas.

You also learned how you could leverage the power of histograms to differentiate between two different image domains, namely document and natural images.

If you would like to learn more about data visualization techniques, consider taking DataCamp's Data Visualization with Python course.

Please feel free to ask any questions related to this tutorial in the comments section below.
