Introduction
According to Wikipedia, a histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson. It is a type of bar graph. To construct a histogram, the first step is to “bin” the range of values (that is, divide the entire range of values into a series of intervals) and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size.
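The two steps described above, binning the range and then counting the values per bin, can be sketched by hand. The values below are made up purely for illustration, and np.histogram is used only as a cross-check of the manual count.

```python
import numpy as np

# Made-up sample values, chosen only for illustration.
values = np.array([1.2, 2.5, 2.7, 3.1, 4.8, 4.9, 5.0])

# Step 1: "bin" the range into 3 adjacent, equal-width intervals.
n_bins = 3
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Step 2: count how many values fall into each interval.
counts = np.zeros(n_bins, dtype=int)
for v in values:
    # Index of the interval whose left edge is the last one <= v.
    idx = np.searchsorted(edges, v, side="right") - 1
    # The maximum value sits on the final edge, so it belongs to the last bin.
    idx = min(idx, n_bins - 1)
    counts[idx] += 1

print(counts)                                # [1 3 3]
print(np.histogram(values, bins=n_bins)[0])  # [1 3 3]
```

The manual loop is only there to make the mechanics visible; in practice the single np.histogram call is the simpler choice.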
Apart from numerical data, histograms can also be used for visualizing the distribution of pixel values in images, since an image is nothing but a grid of picture elements (pixels) whose values, for 8-bit images, range from $0$ to $255$. The x-axis of the histogram shows the bins (the value intervals), while the y-axis shows the frequency, that is, how many values fall into each bin. The number of bins is a parameter that you can vary depending on how you want to visualize the distribution of your data.
For example, let's say you would like to plot a histogram of an RGB image whose pixel values range from $0$ to $255$. The maximum number of useful bins for this image is $256$, one per possible value. If you use $256$ bins on the x-axis, the y-axis will show how often each pixel value occurs in that image.
Histograms are similar in spirit to bar graphs. Let's take a look at one pictorial example of a histogram:
A histogram is an excellent tool for visualizing and understanding the distribution of numerical data or image data, and it is intuitively understood by almost everyone. Python has a lot of different options for building and plotting histograms. Several third-party libraries for creating graphs are available in Python, and one such library is matplotlib.
In today's tutorial, you will mostly be using matplotlib to create and visualize histograms on various kinds of data sets.
So without any further ado, let's get started.
Plotting Histogram using NumPy and Matplotlib
import numpy as np
For reproducibility, you will use the seed function of NumPy, which makes the random draws come out the same each time the code is executed.
You will plot the histogram of a Gaussian (normal) distribution with a mean of $0$ and a standard deviation of $1$.
np.random.seed(100)
np_hist = np.random.normal(loc=0, scale=1, size=1000)
np_hist[:10]
array([-1.74976547, 0.3426804 , 1.1530358 , -0.25243604, 0.98132079,
0.51421884, 0.22117967, -1.07004333, -0.18949583, 0.25500144])
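To check that seeding really makes the draw repeatable, here is a minimal sketch that resets the same seed and compares two draws. The last lines show the newer Generator-based interface as an alternative; note that it produces different numbers from np.random.seed, so it is not a drop-in replacement for the output shown above.

```python
import numpy as np

np.random.seed(100)
first = np.random.normal(loc=0, scale=1, size=5)

np.random.seed(100)  # reset the global generator to the same seed
second = np.random.normal(loc=0, scale=1, size=5)

print(np.array_equal(first, second))  # True

# Alternative: a local Generator object instead of the global seed.
# Its stream differs from the legacy np.random.seed stream.
rng = np.random.default_rng(100)
sample = rng.normal(loc=0, scale=1, size=5)
print(sample.shape)  # (5,)
```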
Let's verify the mean and standard deviation of the above distribution.
np_hist.mean(),np_hist.std()
(-0.016772157343909157, 1.0458427194167)
Next, you will use NumPy's histogram function, which returns hist (the count in each bin) and bin_edges (the boundaries of the bins). By default, it uses $10$ equal-width bins.
hist,bin_edges = np.histogram(np_hist)
hist
array([ 7, 37, 111, 207, 275, 213, 105, 31, 10, 4])
bin_edges
array([-3.20995538, -2.50316588, -1.79637637, -1.08958687, -0.38279736,
0.32399215, 1.03078165, 1.73757116, 2.44436066, 3.15115017,
3.85793967])
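The relationship between the two returned arrays can be checked directly. The sketch below regenerates the same kind of sample and verifies three properties: there is one more edge than there are bins, the counts add up to the sample size, and the default bins all have the same width.

```python
import numpy as np

np.random.seed(100)
data = np.random.normal(loc=0, scale=1, size=1000)

hist, bin_edges = np.histogram(data)  # default: 10 equal-width bins

print(len(hist), len(bin_edges))        # 10 11
print(hist.sum())                       # 1000

widths = np.diff(bin_edges)             # width of every bin
print(np.allclose(widths, widths[0]))   # True

# Bin centers are handy when you want to place a marker or label per bin.
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
print(len(centers))                     # 10
```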
Next, let's plot the histogram using matplotlib's plt.bar function, where the x positions come from bin_edges (all but the last edge) and the bar heights come from hist.
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=[10,8])
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7)
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()
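The plt.bar call above centers fixed-width bars on the left edge of each bin. As a variation, and not the tutorial's original plot, the following sketch draws each bar exactly across its own bin by passing the bin widths and align="edge".

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(100)
data = np.random.normal(loc=0, scale=1, size=1000)
hist, bin_edges = np.histogram(data)

fig, ax = plt.subplots(figsize=(10, 8))
bars = ax.bar(
    bin_edges[:-1],             # left edge of every bin
    hist,                       # bar heights
    width=np.diff(bin_edges),   # each bar spans its whole bin
    align="edge",               # anchor bars at the left edge, not the center
    color="#0504aa",
    alpha=0.7,
    edgecolor="white",
)
ax.set_xlim(bin_edges[0], bin_edges[-1])
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Normal Distribution Histogram")
# In a script, call plt.show() here to display the figure.
```

With this layout the bars touch each other, which matches the idea that histogram bins are adjacent intervals.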
Let's round the bin_edges to the nearest integer.
bin_edges = np.round(bin_edges,0)
plt.figure(figsize=[10,8])
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7)
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()
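The bars now sit on integer positions, which makes the axis easier to read. Keep in mind that this is only a cosmetic step: rounding moves the bars away from their true edges, and several edges here round to the same integer (for example, $-3.21$ and $-2.50$ both round to $-3$), so some bars are drawn on top of each other.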
To understand hist and bin_edges, let's look at an example:
You will define an array of arbitrary values and pass bins=range(5), which sets the bin edges to $0, 1, 2, 3, 4$ and therefore creates four bins.
hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3],bins=range(5))
hist
array([0, 2, 4, 1])
In the above output, hist indicates that there are $0$ items in bin $0$, $2$ in bin $1$, $4$ in bin $2$, and $1$ in bin $3$.
bin_edges
array([0, 1, 2, 3, 4])
Similarly, the above bin_edges output means bin $0$ is the interval $[0,1)$, bin $1$ is $[1,2)$, bin $2$ is $[2,3)$, and bin $3$ is $[3,4]$. Every bin is half-open except the last one, which also includes its right edge.
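NumPy treats the last bin as closed on the right, while values outside the outermost edges are ignored. A small sketch with made-up inputs shows both behaviors:

```python
import numpy as np

# 0 -> bin 0, 1 -> bin 1, 3.5 -> bin 3, and 4 -> bin 3 as well,
# because the last bin [3, 4] includes its right edge.
hist, bin_edges = np.histogram([0, 1, 3.5, 4], bins=range(5))
print(hist)       # [1 1 0 2]

# A value beyond the last edge is silently dropped.
hist_outside, _ = np.histogram([4, 5], bins=range(5))
print(hist_outside)  # [0 0 0 1]
```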
Plotting Histogram using only Matplotlib
Plotting a histogram using matplotlib is a piece of cake. All you have to do is use the plt.hist() function of matplotlib and pass in the data along with the number of bins and a few optional parameters.
In plt.hist(), passing bins='auto' lets NumPy choose the number of bins from the data. The idea is to select a bin width that gives a faithful representation of your data. You can also pass an integer, as in the plot below, which uses bins=8.
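To see what bins='auto' actually picks, you can ask NumPy for the edges without plotting anything. The sketch below uses np.histogram_bin_edges and compares 'auto' with the 'sturges' rule, one of the estimators that NumPy's documentation says 'auto' is built from. The exact counts depend on the data and the NumPy version, so none are hard-coded here.

```python
import numpy as np

np.random.seed(100)
data = np.random.normal(loc=0, scale=1, size=1000)

auto_edges = np.histogram_bin_edges(data, bins="auto")
sturges_edges = np.histogram_bin_edges(data, bins="sturges")

n_auto = len(auto_edges) - 1        # number of bins chosen by 'auto'
n_sturges = len(sturges_edges) - 1  # number of bins chosen by 'sturges'

print("auto bins:   ", n_auto)
print("sturges bins:", n_sturges)
```

Since 'auto' never uses a wider bin than 'sturges', it should give at least as many bins.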
That's all. So, let's plot the histogram of the normal distribution sample, np_hist, that you built using NumPy.
plt.figure(figsize=[10,8])
n, bins, patches = plt.hist(x=np_hist, bins=8, color='#0504aa',alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()
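The three values returned by the hist call are the counts, the bin edges, and the drawn bar objects. The counts and edges are the same ones np.histogram would give for the same data and bins, which this short check illustrates on a freshly generated sample:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(100)
data = np.random.normal(loc=0, scale=1, size=1000)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=8)

counts, edges = np.histogram(data, bins=8)
print(np.array_equal(n, counts))  # True: same counts
print(np.allclose(bins, edges))   # True: same bin edges
print(len(patches))               # 8: one rectangle per bin
```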
Plotting two histograms together
plt.figure(figsize=[10,8])
x = 0.3*np.random.randn(1000)
y = 0.3*np.random.randn(1000)
n, bins, patches = plt.hist([x, y])
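When two samples overlap, it can help to count them over the same bin edges and label them. The following is a sketch of one way to do that; the shift of 0.5 applied to y is an assumption added here only so that the two histograms differ visibly.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(100)
x = 0.3 * np.random.randn(1000)
y = 0.3 * np.random.randn(1000) + 0.5  # shifted for illustration only

# Shared edges so that both samples are counted over the same intervals.
edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=20)

fig, ax = plt.subplots(figsize=(10, 8))
ax.hist(x, bins=edges, alpha=0.6, label="x")  # semi-transparent overlay
ax.hist(y, bins=edges, alpha=0.6, label="y")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.legend()
# In a script, call plt.show() here to display the figure.
```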
Plotting Histogram of Iris Data using Pandas
You will use sklearn to load a dataset called iris. sklearn has a module called datasets, which includes the Iris dataset and can load it on the fly. So, let's quickly load the iris dataset.
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris().data
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
Next, you will create the dataframe of the iris dataset.
dataset = pd.DataFrame(data,columns=names)
dataset.head()
|   | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
dataset.columns[0]
'sepal-length'
Finally, you will use the built-in pandas method .hist(), which plots a histogram for every feature (column) in the dataset.
Isn't that wonderful?
dataset.hist(figsize=(8, 8))
plt.show()
The petal-length and petal-width histograms look bimodal, with a separate cluster of small values, while sepal-length and sepal-width are closer to a single-peaked shape, with sepal-width the nearest to a Gaussian curve. This kind of analysis is useful because it helps you choose an algorithm that works well with the distribution at hand.
Plot Histogram with Matplotlib
f, a = plt.subplots(2, 2, figsize=(10, 10))
a = a.ravel()
for idx, ax in enumerate(a):
    ax.hist(dataset.iloc[:, idx], bins='auto', color='#0504aa', alpha=0.7, rwidth=0.85)
    ax.set_title(dataset.columns[idx])
plt.tight_layout()
Plotting Histogram of Binary and RGB Images
In the final segment of the tutorial, you will be visualizing images from two different domains: binary (document) images and natural images. Binary images are those whose pixel values are mostly $0$ or $255$, whereas a natural image can have pixel values anywhere between $0$ and $255$.
Analyzing the pixel distribution by plotting a histogram of an image's intensity values is a direct way of measuring how often each pixel value occurs in that image. For an 8-bit grayscale image, there are 256 possible intensity values. Hence, to study the difference between the document and the natural images in the pixel distribution domain, you will plot two histograms, one for the document image and one for the natural image. Considering that readable documents are sparse (mostly background with a little ink), the distribution of pixels in these document images should be heavily skewed.
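Before loading real files, the idea can be tried on synthetic data. In the sketch below, both arrays are hypothetical stand-ins rather than the tutorial's images: the "document" is white with roughly 10% black pixels, and the "natural" image is imitated with uniformly random intensities. np.bincount with minlength=256 yields one count per possible 8-bit intensity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real images (not the tutorial's files):
# a "document" that is white paper (255) with about 10% black ink (0) ...
document = np.where(rng.random((512, 512)) < 0.1, 0, 255).astype(np.uint8)
# ... and a "natural" image imitated by uniformly random intensities.
natural = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)

# One counter per intensity value 0..255.
doc_counts = np.bincount(document.ravel(), minlength=256)
nat_counts = np.bincount(natural.ravel(), minlength=256)

print(len(doc_counts), doc_counts.sum() == document.size)  # 256 True
print(np.count_nonzero(doc_counts))  # 2: only the values 0 and 255 occur
print(np.count_nonzero(nat_counts))  # number of distinct values in the noise image
```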
You will use the opencv module to load the two images, resize the document image to the size of the natural image, and convert both to grayscale with cv2.cvtColor.
import cv2
lena_rgb = cv2.imread('lena.png')
lena_gray = cv2.cvtColor(lena_rgb,cv2.COLOR_BGR2GRAY)
lena_gray.shape
(512, 512)
binary = cv2.imread('binary.png')
binary.shape
(1394, 2190, 3)
binary = cv2.resize(binary,(512,512),interpolation=cv2.INTER_CUBIC)
binary.shape
(512, 512, 3)
binary_gray = cv2.cvtColor(binary,cv2.COLOR_BGR2GRAY)
binary_gray.shape
(512, 512)
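For intuition about what the grayscale conversion does, OpenCV documents the RGB-to-gray formula as the weighted sum $Y = 0.299R + 0.587G + 0.114B$. The helper below, bgr_to_gray, is a hypothetical NumPy re-implementation of that formula for illustration only. Its results may differ from cv2.cvtColor by a rounding step, so it is not a substitute for the OpenCV call.

```python
import numpy as np

def bgr_to_gray(image_bgr):
    """Approximate grayscale conversion of an (H, W, 3) uint8 array in BGR order."""
    b = image_bgr[..., 0].astype(np.float64)
    g = image_bgr[..., 1].astype(np.float64)
    r = image_bgr[..., 2].astype(np.float64)
    gray = 0.299 * r + 0.587 * g + 0.114 * b  # weighted sum of the channels
    return np.round(gray).astype(np.uint8)

# Tiny made-up 1x3 image: a blue, a green and a white pixel in BGR order.
pixels = np.array([[[255, 0, 0], [0, 255, 0], [255, 255, 255]]], dtype=np.uint8)

gray = bgr_to_gray(pixels)
print(gray)        # [[ 29 150 255]]
print(gray.shape)  # (1, 3): the channel axis is gone
```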
Let's visualize both the natural and document images.
plt.figure(figsize=[10,10])
plt.subplot(121)
plt.imshow(lena_rgb[:,:,::-1])
plt.title("Natural Image")
plt.subplot(122)
plt.imshow(binary[:,:,::-1])
plt.title("Document Image")
plt.show()
Note: You will use $256$ bins, since an 8-bit image has $256$ possible pixel values and you want to visualize the frequency of each value from $0$ to $255$. Passing bins=range(257) gives the $257$ edges $0, 1, \dots, 256$ that enclose those $256$ bins.
hist, edges = np.histogram(binary_gray,bins=range(257))
plt.figure(figsize=[10,8])
plt.bar(edges[:-1], hist, width = 0.8, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Document Image Histogram',fontsize=15)
plt.show()
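The choice of edges matters for document images in particular, because most of their pixels sit at the very top of the range. A minimal sketch with made-up pixel values shows that edges stopping at 254 silently drop every white pixel, while edges running up to 256 count all of them:

```python
import numpy as np

# Made-up "document" pixels: mostly white (255) with a little black (0).
pixels = np.array([255] * 90 + [0] * 10, dtype=np.uint8)

too_few, _ = np.histogram(pixels, bins=range(255))  # edges 0..254
full, _ = np.histogram(pixels, bins=range(257))     # edges 0..256

print(len(too_few), too_few.sum())  # 254 10  -> the 90 white pixels are dropped
print(len(full), full.sum())        # 256 100 -> every pixel is counted
print(full[255])                    # 90
```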
hist, edges = np.histogram(lena_gray,bins=range(257))
plt.figure(figsize=[10,8])
plt.bar(edges[:-1], hist, width = 0.5, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Pixels',fontsize=15)
plt.ylabel('Frequency of Pixels',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Natural Image Histogram',fontsize=15)
plt.show()
From the above figures, you can observe that the natural image is spread across all 256 intensities with no single dominant value, whereas the document image shows a unimodal distribution that is heavily skewed, with most of its pixels concentrated in a narrow part of the $0$ to $255$ range.
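If you want a number to go with the visual impression, one possible summary is the share of pixels covered by the few most frequent intensity values. The helper top_k_share and the synthetic arrays below are assumptions made for illustration and are not part of the original tutorial; real document and natural images will give less extreme values.

```python
import numpy as np

def top_k_share(gray, k=10):
    """Fraction of pixels covered by the k most frequent intensity values."""
    counts = np.bincount(gray.ravel(), minlength=256)
    top_k = np.sort(counts)[-k:]  # the k largest counts
    return top_k.sum() / gray.size

rng = np.random.default_rng(0)
# Synthetic stand-ins: a two-valued "document" and uniform-noise "natural" image.
document_like = np.where(rng.random((256, 256)) < 0.1, 0, 255).astype(np.uint8)
natural_like = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)

doc_share = top_k_share(document_like)
nat_share = top_k_share(natural_like)
print(doc_share)  # 1.0, because only two values occur
print(nat_share)  # small, roughly 10 / 256 for uniform noise
```

A share close to 1 points to a concentrated, document-like histogram, while a small share points to intensities spread over the whole range.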
Conclusion
Congratulations on finishing the tutorial.
This tutorial was a good starting point for learning how to create a histogram using matplotlib with the help of numpy and pandas.
You also learned how you could leverage the power of histograms to differentiate between two different image domains, namely document and natural images.
If you would like to learn more about data visualization techniques, consider taking DataCamp's Data Visualization with Matplotlib course.
Please feel free to ask any questions related to this tutorial in the comments section below.