# Histograms in Matplotlib

### Introduction

According to *Wikipedia*, a histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson. It is a type of bar graph. To construct a histogram, the first step is to “bin” the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size.
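The binning-and-counting procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not how any library implements it: the helper name `count_into_bins` and the sample values are made up for this example, and it assumes equal-width bins where the last bin also includes the maximum value.

```python
def count_into_bins(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins.

    Hypothetical helper for illustration only. Every bin is half-open
    [left, right) except the last one, which also includes the maximum.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Index of the bin that contains v; clamp the maximum into the last bin
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return counts, edges


counts, edges = count_into_bins([1, 2, 2, 3, 3, 3, 4, 5], num_bins=4)
print(counts)  # [1, 2, 3, 2]
print(edges)   # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Note that four bins need five edges, a pattern that shows up again in numpy's `histogram` output.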

Apart from numerical data, histograms can also be used to visualize the distribution of pixel values in an image, since an image is just a grid of picture elements (pixels), each with a value from $0$ to $255$ in the 8-bit case. The x-axis of the histogram shows the bins (the value intervals), while the y-axis shows the frequency, that is, how many values fall into each bin. The number of bins is a parameter you can vary depending on how you want to visualize the distribution of your data.

For example, say you would like to plot a histogram of an 8-bit image (for an RGB image, one channel at a time) whose pixel values range from $0$ to $255$. The maximum number of useful bins for this image is $256$, one per possible value. If you use $256$ bins on the x-axis, then the y-axis shows the frequency of each pixel value in that image.

Histograms are similar in spirit to `bar` graphs. Let's take a look at one pictorial example of a histogram:

A histogram is an excellent tool for visualizing and understanding the distribution of numerical data or image data, and it is intuitively understood by almost everyone. Python has a lot of different options for building and plotting histograms. Several third-party libraries can create graphs, and one of the most widely used is `matplotlib`.

In today's tutorial, you will mostly use `matplotlib` to create and visualize histograms on various kinds of data sets.

So without any further ado, let's get started.

### Plotting Histogram using Numpy and Matplotlib

```
import numpy as np
```

For reproducibility, you will use numpy's `seed` function, which makes the random number generator produce the same output each time the code is executed.

You will plot the histogram of a Gaussian (normal) distribution with a mean of $0$ and a standard deviation of $1$.

```
np.random.seed(100)
np_hist = np.random.normal(loc=0, scale=1, size=1000)
```
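To see what seeding buys you, the short check below draws the same sample twice, resetting the seed before each draw, and confirms that the two draws are identical. The variable names `first` and `second` are used only for this illustration.

```python
import numpy as np

# Re-seeding with the same value restarts the generator from the same state
np.random.seed(100)
first = np.random.normal(loc=0, scale=1, size=5)

np.random.seed(100)
second = np.random.normal(loc=0, scale=1, size=5)

print(np.array_equal(first, second))  # True
```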

```
np_hist[:10]
```

```
array([-1.74976547, 0.3426804 , 1.1530358 , -0.25243604, 0.98132079,
0.51421884, 0.22117967, -1.07004333, -0.18949583, 0.25500144])
```

Let's verify the mean and standard deviation of the above distribution.

```
np_hist.mean(),np_hist.std()
```

```
(-0.016772157343909157, 1.0458427194167)
```

Next, you will use numpy's `histogram` function, which returns `hist` (the count in each bin) and `bin_edges` (the bin boundaries). By default, it uses $10$ equal-width bins.

```
hist,bin_edges = np.histogram(np_hist)
```

```
hist
```

```
array([ 7, 37, 111, 207, 275, 213, 105, 31, 10, 4])
```

```
bin_edges
```

```
array([-3.20995538, -2.50316588, -1.79637637, -1.08958687, -0.38279736,
0.32399215, 1.03078165, 1.73757116, 2.44436066, 3.15115017,
3.85793967])
```
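A few quick checks make the relationship between the two arrays concrete. This sketch regenerates the same kind of sample (the name `sample` is used only to keep the block self-contained) and relies on `np.histogram` defaulting to 10 equal-width bins that span the minimum to the maximum of the data.

```python
import numpy as np

np.random.seed(100)
sample = np.random.normal(loc=0, scale=1, size=1000)

hist, bin_edges = np.histogram(sample)  # default: 10 equal-width bins

# There is always one more edge than there are bins
print(len(hist), len(bin_edges))  # 10 11

# Every value lands in exactly one bin, so the counts add up to the sample size
print(hist.sum() == sample.size)

# The edges span the data range and are evenly spaced
print(np.isclose(bin_edges[0], sample.min()), np.isclose(bin_edges[-1], sample.max()))
print(np.allclose(np.diff(bin_edges), np.diff(bin_edges)[0]))
```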

Next, let's plot the histogram using matplotlib's `plt.bar` function, where the x positions come from `bin_edges` (all but the last edge) and the bar heights come from `hist`.

```
import matplotlib.pyplot as plt
%matplotlib inline
```

```
plt.figure(figsize=[10,8])
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7)
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()
```
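The call above draws fixed-width bars centered on each left edge, which is fine for a quick look. If you want each bar to cover exactly the interval of its bin, one option is to pass the bin widths and left-align the bars. This is a sketch of that alternative, not a required change; it regenerates the sample as `sample` so that it runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(100)
sample = np.random.normal(loc=0, scale=1, size=1000)
hist, bin_edges = np.histogram(sample)

widths = np.diff(bin_edges)  # width of each bin

fig, ax = plt.subplots(figsize=(10, 8))
# align='edge' starts each bar at its left bin edge instead of centering it there
bars = ax.bar(bin_edges[:-1], hist, width=widths, align='edge',
              color='#0504aa', alpha=0.7, edgecolor='white')
ax.set_xlabel('Value', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
ax.set_title('Normal Distribution Histogram', fontsize=15)
plt.show()
```

With this layout the bars touch each other, which matches the idea that the bins are adjacent intervals.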

Let's round the `bin_edges` to the nearest integer.

```
bin_edges = np.round(bin_edges,0)
```

```
plt.figure(figsize=[10,8])
plt.bar(bin_edges[:-1], hist, width = 0.5, color='#0504aa',alpha=0.7)
plt.xlim(min(bin_edges), max(bin_edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()
```

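The bars now sit on integer positions, which makes the x-axis easier to read. Keep in mind that rounding is only cosmetic here: several edges round to the same integer, so some bars are drawn on top of each other and the plot is only a rough picture of the distribution.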

To understand `hist` and `bin_edges`, let's look at an example:

You will define an array of arbitrary values and pass `bins=range(5)`, which sets the bin edges to $0, 1, 2, 3, 4$ and therefore creates four bins.

```
hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3],bins=range(5))
```

```
hist
```

```
array([0, 2, 4, 1])
```

In the above output, `hist` indicates that there are $0$ items in bin $0$, $2$ in bin $1$, $4$ in bin $2$, and $1$ in bin $3$.

```
bin_edges
```

```
array([0, 1, 2, 3, 4])
```

Similarly, the above `bin_edges` output means bin $0$ is the interval $[0,1)$, bin $1$ is $[1,2)$, bin $2$ is $[2,3)$, and bin $3$ is $[3,4]$. All bins are half-open except the last one, which also includes its right edge.
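`np.histogram` treats the last bin as closed on both sides, and you can confirm this by adding a value that sits exactly on the right-most edge. The extra values `4` and `5` below are added only for this check: `4` is counted in the last bin, while `5`, which lies beyond the last edge, is ignored.

```python
import numpy as np

# Same data as above plus a 4, which lies exactly on the last edge
hist_on_edge, _ = np.histogram([1, 1, 2, 2, 2, 2, 3, 4], bins=range(5))
print(hist_on_edge)  # [0 2 4 2]

# Values outside the outer edges are not counted at all
hist_outside, _ = np.histogram([1, 1, 2, 2, 2, 2, 3, 5], bins=range(5))
print(hist_outside)  # [0 2 4 1]
```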

### Plotting Histogram using only Matplotlib

Plotting a histogram using `matplotlib` is a piece of cake. All you have to do is call matplotlib's `plt.hist()` function and pass in the data along with the number of bins and a few optional parameters.

In `plt.hist()`, passing `bins='auto'` lets numpy choose the number of bins for you based on the data. The idea is to select a bin width that gives a faithful representation of your data. The example below instead sets the number of bins explicitly with `bins=8`.
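To see what `'auto'` picks for a given sample, you can ask numpy directly with `np.histogram_bin_edges`, which accepts the same string options. The sketch below regenerates the normal sample as `sample` and compares `'auto'` with the two named estimators it combines according to the numpy documentation, `'sturges'` and `'fd'` (Freedman-Diaconis). The dictionary name `num_bins` is only for this illustration.

```python
import numpy as np

np.random.seed(100)
sample = np.random.normal(loc=0, scale=1, size=1000)

num_bins = {}
for rule in ('auto', 'sturges', 'fd'):
    edges = np.histogram_bin_edges(sample, bins=rule)
    # The number of bins is one less than the number of edges
    num_bins[rule] = len(edges) - 1
    print(rule, num_bins[rule])
```

The `'auto'` rule uses the narrower of the two bin widths, so it ends up with the larger of the two bin counts.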

That's all. So, let's plot the histogram of the normal distribution `np_hist` you built using `numpy`.

```
plt.figure(figsize=[10,8])
n, bins, patches = plt.hist(x=np_hist, bins=8, color='#0504aa',alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Normal Distribution Histogram',fontsize=15)
plt.show()
```
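The three values returned by `plt.hist()` are the bin counts, the bin edges, and the drawn bar objects. As a rough check, the sketch below (with the sample regenerated as `sample` so the block stands alone) confirms that the counts and edges match what `np.histogram` returns for the same number of bins.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(100)
sample = np.random.normal(loc=0, scale=1, size=1000)

n, bins, patches = plt.hist(sample, bins=8, color='#0504aa', alpha=0.7, rwidth=0.85)
hist_np, edges_np = np.histogram(sample, bins=8)

print(np.array_equal(n, hist_np))   # counts agree
print(np.allclose(bins, edges_np))  # edges agree
print(len(patches))                 # one bar per bin
plt.show()
```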

#### Plotting two histograms together

```
plt.figure(figsize=[10,8])
x = 0.3*np.random.randn(1000)
y = 0.3*np.random.randn(1000)
n, bins, patches = plt.hist([x, y])
```
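When you pass a list of arrays, matplotlib draws the bars of the two data sets side by side within shared bins. A slightly extended sketch with a legend follows; the labels `'x'` and `'y'`, the colors, and the seed are arbitrary choices for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(100)
x = 0.3 * np.random.randn(1000)
y = 0.3 * np.random.randn(1000)

plt.figure(figsize=[10, 8])
# Both data sets share the same bin edges; n holds one row of counts per data set
n, bins, patches = plt.hist([x, y], bins=10, color=['#0504aa', '#aa0504'],
                            alpha=0.7, label=['x', 'y'])
plt.legend()
plt.xlabel('Value', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()

print(np.asarray(n).shape)  # (2, 10)
```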

### Plotting histogram of Iris data using Pandas

You will use `sklearn` to load a dataset called `iris`. sklearn has a module called `datasets` that includes the Iris dataset, which can be loaded on the fly. So, let's quickly load the iris dataset.

```
from sklearn.datasets import load_iris
import pandas as pd
```

```
data = load_iris().data
```

```
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
```

Next, you will create the dataframe of the `iris` dataset.

```
dataset = pd.DataFrame(data,columns=names)
```

```
dataset.head()
```

| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |

```
dataset.columns[0]
```

```
'sepal-length'
```

Finally, you will use the `pandas` built-in method `.hist()`, which plots a histogram for every numeric feature in the dataset.

Isn't that wonderful?

```
# Let pandas create one subplot per column on an 8x8 inch figure
dataset.hist(figsize=(8, 8))
plt.show()
```

In this plot, petal-length and petal-width look bimodal, with a separate cluster of small values, while sepal-length looks roughly unimodal and sepal-width comes closest to a Gaussian curve. This kind of analysis is useful because it can guide you toward an algorithm that works well with the distribution you observe.
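If you want the numbers behind a panel rather than the picture, you can bin a column yourself with `np.histogram`. The sketch below uses a small made-up DataFrame (not the Iris data) with hypothetical column names, and it uses 10 bins to match the default of `DataFrame.hist`.

```python
import numpy as np
import pandas as pd

# Made-up measurements for illustration only (not the Iris data)
df = pd.DataFrame({
    'length': [1.2, 1.4, 1.5, 1.3, 4.5, 4.7, 5.1, 4.9, 5.0, 1.6],
    'width':  [0.2, 0.2, 0.3, 0.2, 1.5, 1.4, 1.8, 1.5, 1.7, 0.4],
})

# DataFrame.hist uses 10 bins by default, so use the same number here
for column in df.columns:
    counts, edges = np.histogram(df[column], bins=10)
    print(column, counts)

# Two well-separated clusters show up as empty bins between the occupied ones
length_counts, _ = np.histogram(df['length'], bins=10)
print((length_counts == 0).sum())
```

Empty bins in the middle of the range, with occupied bins on both sides, are a quick numeric hint that a feature has more than one mode.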

#### Plot Histogram with Matplotlib

```
# Create a 2x2 grid of axes on a single 10x10 inch figure
f, a = plt.subplots(2, 2, figsize=(10, 10))
a = a.ravel()
# Draw one histogram per feature column
for idx, ax in enumerate(a):
    ax.hist(dataset.iloc[:, idx], bins='auto', color='#0504aa', alpha=0.7, rwidth=0.85)
    ax.set_title(dataset.columns[idx])
plt.tight_layout()
plt.show()
```

### Plotting Histogram of Binary and RGB Images

In the final segment of the tutorial, you will visualize images from two different domains: binary (document) images and natural images. Binary images are images whose pixel values are mostly $0$ or $255$, whereas a channel of a color image can have pixel values anywhere from $0$ to $255$.

Plotting a histogram of an image's intensity values is a direct way of measuring how often each pixel value occurs in that image. For an 8-bit grayscale image, there are 256 possible intensity values. Hence, to study the difference between document and natural images in terms of pixel distribution, you will plot two histograms, one for the document image and one for the natural image. Since readable documents are sparse (mostly background with a little text), the pixel distribution of a document image should be heavily skewed.

You will use the `opencv` module to load the two images, resize the document image to the same size as the natural image, and convert both to grayscale with `cv2.cvtColor`.

```
import cv2
```

```
lena_rgb = cv2.imread('lena.png')
lena_gray = cv2.cvtColor(lena_rgb,cv2.COLOR_BGR2GRAY)
```

```
lena_gray.shape
```

```
(512, 512)
```

```
binary = cv2.imread('binary.png')
binary.shape
```

```
(1394, 2190, 3)
```

```
binary = cv2.resize(binary, (512, 512), interpolation=cv2.INTER_CUBIC)
```

```
binary.shape
```

```
(512, 512, 3)
```

```
binary_gray = cv2.cvtColor(binary,cv2.COLOR_BGR2GRAY)
```

```
binary_gray.shape
```

```
(512, 512)
```
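For intuition about what the grayscale conversion does, the sketch below reproduces the weighted sum that the OpenCV documentation gives for color-to-gray conversion, $Y = 0.299R + 0.587G + 0.114B$, in plain numpy on a tiny made-up BGR array. The helper name `bgr_to_gray` is hypothetical, and rounding details may differ slightly from `cv2.cvtColor`.

```python
import numpy as np

def bgr_to_gray(image_bgr):
    """Approximate grayscale conversion of an 8-bit BGR image (hypothetical helper)."""
    # OpenCV stores channels in B, G, R order
    b = image_bgr[:, :, 0].astype(np.float64)
    g = image_bgr[:, :, 1].astype(np.float64)
    r = image_bgr[:, :, 2].astype(np.float64)
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return np.clip(np.round(gray), 0, 255).astype(np.uint8)


# A made-up 2x2 image: black, white, pure blue, pure red (in BGR order)
tiny = np.array([[[0, 0, 0], [255, 255, 255]],
                 [[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
gray = bgr_to_gray(tiny)
print(gray.shape)  # (2, 2): the channel axis is gone
print(gray)
```

The result has a single value per pixel, which is why the shapes above drop from `(512, 512, 3)` to `(512, 512)`.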

Let's visualize both the natural and document image.

```
plt.figure(figsize=[10,10])
plt.subplot(121)
plt.imshow(lena_rgb[:,:,::-1])
plt.title("Natural Image")
plt.subplot(122)
plt.imshow(binary[:,:,::-1])
plt.title("Document Image")
plt.show()
```

**Note**: You will use $256$ bins, since an 8-bit image has $256$ possible pixel values and you want to visualize the frequency of each value from $0$ to $255$. Passing `bins=range(257)` gives the $257$ edges $0, 1, \dots, 256$ that bound those $256$ bins.

```
hist, edges = np.histogram(binary_gray,bins=range(257))
```

```
plt.figure(figsize=[10,8])
plt.bar(edges[:-1], hist, width = 0.8, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value',fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Document Image Histogram',fontsize=15)
plt.show()
```
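Because 8-bit pixel values are small non-negative integers, `np.bincount` offers another way to get one count per intensity. The sketch below uses a small synthetic array, `fake_gray`, in place of `binary_gray` (the real image is not needed for the comparison) and checks that the result agrees with `np.histogram` when the 257 edges from $0$ to $256$ are used.

```python
import numpy as np

# Synthetic stand-in for a document-like grayscale image: mostly white, some black
rng = np.random.default_rng(0)
fake_gray = np.where(rng.random((64, 64)) < 0.1, 0, 255).astype(np.uint8)

# One counter per possible intensity 0..255
counts = np.bincount(fake_gray.ravel(), minlength=256)

# 257 edges give 256 unit-wide bins; the last bin [255, 256] holds the value 255
hist_check, edges_check = np.histogram(fake_gray, bins=range(257))

print(np.array_equal(counts, hist_check))         # True
print(counts[0] + counts[255] == fake_gray.size)  # True

# With edges that stop at 254, the white pixels (value 255) would be dropped
hist_short, _ = np.histogram(fake_gray, bins=range(255))
print(hist_short.sum() < fake_gray.size)          # True
```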

```
hist, edges = np.histogram(lena_gray,bins=range(257))
```

```
plt.figure(figsize=[10,8])
plt.bar(edges[:-1], hist, width = 0.5, color='#0504aa')
plt.xlim(min(edges), max(edges))
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Pixels',fontsize=15)
plt.ylabel('Frequency of Pixels',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Natural Image Histogram',fontsize=15)
plt.show()
```

From the above figures, you can observe that the natural image is spread across all 256 intensities with no single dominant value, whereas the document image shows a unimodal distribution that is strongly skewed, with most pixels concentrated in a narrow band of the $0$ to $255$ range.
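One way to put a number on this difference is to measure how much of the image is covered by its most common intensities. The sketch below uses two synthetic arrays as stand-ins for the document and natural images (the real files are not needed to show the idea); the helper `top_k_share` and the choice of `k=2` are assumptions made for this illustration.

```python
import numpy as np

def top_k_share(gray, k=2):
    """Fraction of pixels that fall on the k most frequent intensities (hypothetical helper)."""
    counts = np.bincount(gray.ravel(), minlength=256)
    return np.sort(counts)[-k:].sum() / gray.size


rng = np.random.default_rng(0)
# Document-like stand-in: only black text pixels on a white background
document_like = np.where(rng.random((128, 128)) < 0.1, 0, 255).astype(np.uint8)
# Natural-like stand-in: intensities spread over the whole 0..255 range
natural_like = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)

print(top_k_share(document_like))  # 1.0, every pixel is either 0 or 255
print(top_k_share(natural_like))   # much smaller, since no intensity dominates
```

A share close to $1$ points to a document-like image, while a small share points to intensities spread over many values.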

### Conclusion

Congratulations on finishing the tutorial.

This tutorial was a good starting point for learning how to create a histogram using matplotlib with the help of numpy and pandas.

You also learned how you could leverage the power of histograms to differentiate between two different image domains, namely document and natural images.

If you would like to learn more about data visualization techniques, consider taking DataCamp's Data Visualization with Python course.

Please feel free to ask any questions related to this tutorial in the comments section below.