Naive Bayes Classification using Scikit-learn
Learn how to build and evaluate a Naive Bayes classifier using Python's scikit-learn package.
Suppose you are a product manager and want to classify customer reviews into positive and negative classes. Or, as a loan manager, you want to identify which loan applicants are safe and which are risky. As a healthcare analyst, you want to predict which patients are likely to develop diabetes. All of these examples pose the same kind of problem: classifying reviews, loan applicants, and patients.
Naive Bayes is one of the simplest and fastest classification algorithms, which makes it suitable for large datasets. Naive Bayes classifiers have been used successfully in applications such as spam filtering, text classification, sentiment analysis, and recommender systems. The algorithm uses Bayes' theorem to predict the class of an unknown sample.
In this tutorial, you are going to learn about all of the following:
 Classification Workflow
 What is Naive Bayes classifier?
 How Naive Bayes classifier works?
 Classifier building in Scikit-learn
 Zero Probability Problem
 Its advantages and disadvantages
Classification Workflow
Whenever you perform classification, the first step is to understand the problem and identify potential features and the label. Features are the characteristics or attributes that affect the value of the label. For example, in the case of loan distribution, bank managers consider a customer's occupation, income, age, location, previous loan history, transaction history, and credit score. These characteristics are known as features, and they help the model classify customers.
Classification has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, the performance of the trained classifier is tested. Performance is evaluated using metrics such as accuracy, error, precision, and recall.
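The evaluation metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration for a binary problem, not a replacement for a library implementation; the function name `classification_metrics` and the label lists are hypothetical and made up for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, error, precision and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical true and predicted labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy is the share of correct predictions, error is its complement, precision asks how many predicted positives were right, and recall asks how many actual positives were found.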
What is Naive Bayes Classifier?
Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest supervised learning algorithms. Naive Bayes classifiers are fast to train and to apply, and they often reach good accuracy on large datasets despite their simplicity.
The Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of the other features. For example, whether a loan applicant is desirable depends on his/her income, previous loan and transaction history, age, and location. Even if these features are interdependent, they are still considered independently. This assumption simplifies computation, which is why the classifier is called naive. The assumption is known as class conditional independence.
Bayes' theorem relates these probabilities as follows:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
 P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
 P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence, or the prior probability of the data.
 P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
 P(D|h): the probability of data D given that the hypothesis h was true. This is known as the likelihood.
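Combining Bayes' theorem with the class conditional independence assumption gives the rule that the classifier applies. As a sketch, write the data D as n feature values x_1, ..., x_n (this notation is introduced here for illustration and is not used elsewhere in the tutorial):

```latex
\begin{aligned}
P(h \mid x_1, \dots, x_n)
  &= \frac{P(x_1, \dots, x_n \mid h)\, P(h)}{P(x_1, \dots, x_n)} \\
  &\propto P(h) \prod_{i=1}^{n} P(x_i \mid h)
\end{aligned}
```

The denominator does not depend on h, so it can be dropped when the goal is only to find the most probable class.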
How Naive Bayes classifier works?
Let's understand how Naive Bayes works through an example. Consider a dataset of weather conditions and whether players played a sport. You need to calculate the probability of playing and then classify whether players will play or not, based on the weather condition.
First Approach (In case of a single feature)
Naive Bayes classifier calculates the probability of an event in the following steps:
 Step 1: Calculate the prior probability of each class label.
 Step 2: Find the likelihood of each attribute value for each class.
 Step 3: Put these values in Bayes' formula and calculate the posterior probability.
 Step 4: See which class has the higher posterior probability; the input is assigned to that class.
To simplify the calculation of prior and posterior probabilities, you can use two kinds of tables: a frequency table and likelihood tables. The frequency table contains the number of occurrences of each label for every feature value. There are two likelihood tables: Likelihood Table 1 shows the prior probabilities of the labels, and Likelihood Table 2 shows the conditional probability of each feature value given a label, which is the quantity plugged into Bayes' formula to obtain the posterior probability.
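The frequency and likelihood tables can be rebuilt with a few lines of standard-library Python. The sketch below assumes that the weather example uses the same 14 observations as the dummy dataset defined later in this tutorial under "Defining Dataset"; that is an assumption, and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Assumption: the same 14 observations as the dummy dataset used later on.
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

# Frequency table: how often each (weather, label) pair occurs.
frequency = Counter(zip(weather, play))

# Prior probability of each label: P(label).
label_counts = Counter(play)
priors = {label: Fraction(n, len(play)) for label, n in label_counts.items()}

# Likelihood of each weather value given a label: P(weather | label).
likelihood = {
    (w, label): Fraction(frequency[(w, label)], label_counts[label])
    for w in sorted(set(weather))
    for label in sorted(label_counts)
}

for w in sorted(set(weather)):
    print(w, "No:", frequency[(w, 'No')], "Yes:", frequency[(w, 'Yes')])
for label in sorted(priors):
    print("P(" + label + ") =", priors[label])
```

Using `Fraction` keeps the probabilities exact, which avoids rounding differences when they are multiplied later.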
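Now suppose you want to calculate the probability of playing when the weather is overcast.
Probability of playing:
$$P(\text{Yes} \mid \text{Overcast}) = \frac{P(\text{Overcast} \mid \text{Yes})\,P(\text{Yes})}{P(\text{Overcast})} \tag{1}$$
Calculate prior probabilities: read P(Overcast) and P(Yes) from the frequency and likelihood tables.
Calculate the likelihood: read P(Overcast | Yes) from the likelihood tables.
Put the prior probabilities and the likelihood in equation (1).
Similarly, you can calculate the probability of not playing:
Probability of not playing:
$$P(\text{No} \mid \text{Overcast}) = \frac{P(\text{Overcast} \mid \text{No})\,P(\text{No})}{P(\text{Overcast})} \tag{2}$$
Calculate prior probabilities: read P(Overcast) and P(No) from the frequency and likelihood tables.
Calculate the likelihood: read P(Overcast | No) from the likelihood tables.
Put the prior probabilities and the likelihood in equation (2).
The probability of the 'Yes' class is higher. So you can conclude that if the weather is overcast, then players will play the sport.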
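The single-feature calculation can be checked numerically. The following sketch applies Bayes' theorem directly to raw counts, assuming the weather example uses the same 14 observations as the dummy dataset defined later in this tutorial; the helper name `posterior` is hypothetical.

```python
from fractions import Fraction

# Assumption: the same 14 observations as the dummy dataset used later on.
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']


def posterior(label, weather_value):
    """Return P(label | weather_value) using Bayes' theorem on raw counts."""
    n = len(play)
    n_label = play.count(label)
    prior = Fraction(n_label, n)                          # P(label)
    evidence = Fraction(weather.count(weather_value), n)  # P(weather_value)
    both = sum(1 for w, p in zip(weather, play)
               if w == weather_value and p == label)
    likelihood = Fraction(both, n_label)                  # P(weather_value | label)
    return likelihood * prior / evidence


p_yes = posterior('Yes', 'Overcast')
p_no = posterior('No', 'Overcast')
print("P(Yes | Overcast) =", p_yes)
print("P(No | Overcast) =", p_no)
```

On these 14 rows every overcast day carries the label 'Yes', so the 'Yes' class wins for overcast weather.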
Second Approach (In case of multiple features)
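Now suppose you want to calculate the probability of playing when the weather is overcast and the temperature is mild.
Probability of playing:
$$P(\text{Yes} \mid \text{Overcast}, \text{Mild}) = \frac{P(\text{Overcast}, \text{Mild} \mid \text{Yes})\,P(\text{Yes})}{P(\text{Overcast}, \text{Mild})} \tag{1}$$
$$P(\text{Overcast}, \text{Mild} \mid \text{Yes}) = P(\text{Overcast} \mid \text{Yes})\,P(\text{Mild} \mid \text{Yes}) \tag{2}$$
Equation (2) follows from the class conditional independence assumption.
Calculate prior probabilities: P(Yes).
Calculate the likelihoods: P(Overcast | Yes) and P(Mild | Yes).
Put the likelihoods in equation (2).
Put the prior probability and the result of equation (2) in equation (1).
Similarly, you can calculate the probability of not playing:
Probability of not playing:
$$P(\text{No} \mid \text{Overcast}, \text{Mild}) = \frac{P(\text{Overcast}, \text{Mild} \mid \text{No})\,P(\text{No})}{P(\text{Overcast}, \text{Mild})} \tag{3}$$
$$P(\text{Overcast}, \text{Mild} \mid \text{No}) = P(\text{Overcast} \mid \text{No})\,P(\text{Mild} \mid \text{No}) \tag{4}$$
Calculate prior probabilities: P(No).
Calculate the likelihoods: P(Overcast | No) and P(Mild | No).
Put the likelihoods in equation (4).
Put the prior probability and the result of equation (4) in equation (3).
The denominator P(Overcast, Mild) is the same in equations (1) and (3), so it does not affect which class has the higher probability.
The probability of the 'Yes' class is higher. So you can say that if the weather is overcast and the temperature is mild, then players will play the sport.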
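The multi-feature case can be sketched the same way: multiply the prior by one likelihood per feature, then normalize across the classes. This assumes the example uses the same 14 observations as the dummy dataset defined later in this tutorial; `naive_bayes_scores` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction

# Assumption: the same 14 observations as the dummy dataset used later on.
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']


def naive_bayes_scores(weather_value, temp_value):
    """Return P(label) * P(weather | label) * P(temp | label) for each label."""
    scores = {}
    for label in sorted(set(play)):
        rows = [(w, t) for w, t, p in zip(weather, temp, play) if p == label]
        prior = Fraction(len(rows), len(play))
        p_weather = Fraction(sum(1 for w, _ in rows if w == weather_value), len(rows))
        p_temp = Fraction(sum(1 for _, t in rows if t == temp_value), len(rows))
        scores[label] = prior * p_weather * p_temp
    return scores


scores = naive_bayes_scores('Overcast', 'Mild')
# Normalizing replaces the shared denominator; this assumes at least one nonzero score.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
prediction = max(scores, key=scores.get)
print(scores)
print(posteriors)
print("Prediction:", prediction)
```

The score for 'No' is exactly zero here because no overcast day in these rows has the label 'No'; a single zero likelihood wipes out the whole product.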
Classifier Building in Scikit-learn
Naive Bayes Classifier
Defining Dataset
In this example, you can use a dummy dataset with three columns: weather, temperature, and play. The first two (weather, temperature) are features and the third is the label.
# Assigning features and label variables
Weather = ['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
           'Rainy','Sunny','Overcast','Overcast','Rainy']
Temp = ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
Play = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
Encoding Features
First, you need to convert these string labels into numbers, for example 'Overcast', 'Rainy', 'Sunny' as 0, 1, 2. This is known as label encoding. Scikit-learn provides the LabelEncoder class for encoding labels with a value between 0 and one less than the number of discrete classes.
# Import the preprocessing module, which provides LabelEncoder
from sklearn import preprocessing
# Create a LabelEncoder
le = preprocessing.LabelEncoder()
# Convert string labels into numbers
weather_encoded = le.fit_transform(Weather)
print(weather_encoded)
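LabelEncoder sorts the distinct labels and numbers them in that order. The following standard-library sketch mimics that mapping for the weather column without scikit-learn, which can help when reading the encoded output; `classes`, `mapping`, and `weather_codes` are illustrative names, not part of the library.

```python
Weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']

# Sorted unique values play the role of the encoder's learned classes.
classes = sorted(set(Weather))
# Each class gets its position in the sorted list as its integer code.
mapping = {value: index for index, value in enumerate(classes)}
weather_codes = [mapping[value] for value in Weather]

print(mapping)
print(weather_codes)
```

Because the order is alphabetical, 'Overcast' maps to 0, 'Rainy' to 1, and 'Sunny' to 2, regardless of the order in which the values appear in the data.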
Similarly, you can also encode the temp and play columns.
# Converting string labels into numbers
temp_encoded = le.fit_transform(Temp)
label = le.fit_transform(Play)
print("Temp:", temp_encoded)
print("Play:", label)
Now combine both features (weather and temp) into a single variable (a list of tuples).
# Combining weather and temp into a single list of tuples
features = list(zip(weather_encoded, temp_encoded))
print(features)
Generating Model
Generate a model using the Naive Bayes classifier in the following steps:
 Create a Naive Bayes classifier
 Fit the classifier on the dataset
 Perform prediction
# Import the Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
# Create a Gaussian Naive Bayes classifier
model = GaussianNB()
# Train the model using the training set
model.fit(features, label)
# Predict the output
predicted = model.predict([[0, 2]])  # 0: Overcast, 2: Mild
print("Predicted Value:", predicted)
Here, 1 is the encoded value of 'Yes', which indicates that players can play.
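To see roughly what a Gaussian Naive Bayes model does with the encoded features, the sketch below fits a per-class mean and variance for each feature and compares log scores for the query (0, 2). It is a simplified, standard-library illustration and not scikit-learn's implementation: it omits the small variance-smoothing term that GaussianNB adds, and the function names are hypothetical. The encoded values assume LabelEncoder's alphabetical mapping.

```python
import math

# Encoded data, assuming the alphabetical mapping:
# weather: Overcast=0, Rainy=1, Sunny=2; temp: Cool=0, Hot=1, Mild=2; play: No=0, Yes=1.
features = [(2, 1), (2, 1), (0, 1), (1, 2), (1, 0), (1, 0), (0, 0),
            (2, 2), (2, 0), (1, 2), (2, 2), (0, 2), (0, 1), (1, 2)]
label = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]


def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means and population variances per class."""
    params = {}
    for cls in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == cls]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows)
            for j in range(n_features)
        ]
        params[cls] = (len(rows) / len(y), means, variances)
    return params


def log_score(x, prior, means, variances):
    """Log prior plus the sum of per-feature Gaussian log densities."""
    total = math.log(prior)
    for value, mean, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return total


params = fit_gaussian_nb(features, label)
class_scores = {cls: log_score((0, 2), *p) for cls, p in params.items()}
predicted_class = max(class_scores, key=class_scores.get)
print("Predicted class:", predicted_class)
```

Note that the Gaussian model treats the integer codes as continuous measurements, so the ordering Overcast < Rainy < Sunny introduced by the encoding influences the fitted means and variances.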
Naive Bayes with Multiple Labels
So far you have learned Naive Bayes classification with binary labels. Now you will learn about multiclass classification with Naive Bayes, where the label can take more than two values. (This is not the same as multinomial Naive Bayes, a name that refers to the distribution assumed for the features rather than to the number of classes.) For example, you may want to classify a news article as technology, entertainment, politics, or sports.
In the model building part, you can use the wine dataset, which is a well-known multiclass classification problem. "This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars." (UC Irvine)
The dataset comprises 13 features (alcohol, malic_acid, ash, alcalinity_of_ash, magnesium, total_phenols, flavanoids, nonflavanoid_phenols, proanthocyanins, color_intensity, hue, od280/od315_of_diluted_wines, proline) and the type of wine cultivar. The data has three types of wine: class_0, class_1, and class_2. Here you can build a model to classify the type of wine.
The dataset is available in the scikit-learn library.
Loading Data
Let's first load the required wine dataset from the scikit-learn datasets module.
# Import the scikit-learn datasets module
from sklearn import datasets
# Load the wine dataset
wine = datasets.load_wine()