
Naive Bayes Classifier

Welcome to your next lab! You will build a Naive Bayes classifier.

You will classify spam/ham messages.

You will learn to:

  • Build the general architecture of a learning algorithm with OOP in mind:
    • Helper functions
      • Preprocessing data
      • Calculate prior probabilities for the classes
      • Calculate likelihood probabilities of each word within a class
    • Main Model Class
      • Training
      • Prediction

0 - Download data

!pip install wget
import wget
wget.download('https://dru.fra1.digitaloceanspaces.com/DS_Fundamentals/datasets/04_supervised_learning/Naive_Bayes_Classifier/spam.csv')

1 - Packages

First, let's run the cell below to import all the packages that you will need during this assignment.

  • numpy is the fundamental package for scientific computing with Python.
  • pandas is a library that provides convenient data structures for working with tabular data.
  • re is the built-in module for regular expressions.
import pandas as pd
import numpy as np
import re
!head spam.csv

2 - Overview of the Problem set

Problem Statement: You are given a dataset containing:

  • a training set of m_train examples
  • a test set of m_test examples
  • each example is a message that belongs to a particular class: ham or spam.

Let's get more familiar with the dataset. Load the data by running the following code.

We won't split our data into features (X) and target (Y) here, because we first need to preprocess it in a special way.

# Loading the data 

def load_data():
    # Read the raw CSV; latin-1 encoding handles the non-UTF-8 characters in the file
    df = pd.read_csv('spam.csv', encoding='latin-1')
    # Keep the first few rows for the small sanity checks used later in the lab
    df_for_tests = df.head()
    
    # Shuffle the row positions and take an 80/20 train/test split
    idx = np.arange(df.shape[0])
    np.random.shuffle(idx)

    train_set_size = int(df.shape[0] * 0.8)

    train_set = df.loc[idx[:train_set_size]]
    test_set = df.loc[idx[train_set_size:]]
    
    return train_set, test_set, df_for_tests
train_set, test_set, df_for_tests = load_data()

3 - Naive Bayes Classifier

Mathematical expression of the algorithm:

This algorithm is based on Bayes' theorem: \begin{equation} P(A_{j}\hspace{0.1cm}\rvert\hspace{0.1cm}x_{1},\dots,x_{n}) = \frac{P(x_{1},\dots,x_{n}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})P(A_{j})}{P(x_{1},\dots,x_{n})} \end{equation}

Ignoring the denominator (because it stays the same for all classes):

\begin{equation} P(A_{j}\hspace{0.1cm}\rvert\hspace{0.1cm}x_{1},\dots,x_{n}) \propto P(x_{1},\dots,x_{n}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})P(A_{j}) \end{equation}

By making the assumption that the $x_{i}$ are conditionally independent of each other given the class:

\begin{equation} P(x_{1},\dots,x_{n}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) = \prod_{i=1}^{n}P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) \end{equation}

We can calculate the posterior probability, if we know the prior probability $P(A_{j})$ and the likelihoods $P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})$, and then choose the most probable class:

\begin{equation} \hat{y} = \underset{j}{\operatorname{argmax}}\hspace{0.1cm} P(A_{j})\prod_{i=1}^{n}P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) \end{equation}

Due to floating point underflow, the above is usually replaced with the numerically tractable expression:

\begin{equation} \hat{y} = \underset{j}{\operatorname{argmax}}\hspace{0.1cm} \left( \log P(A_{j}) + \sum_{i=1}^{n}\log P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) \right) \end{equation}
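
To make the log-space expression concrete, here is a minimal sketch of a prediction step, assuming the log-priors and log-likelihoods have already been estimated; the numbers in log_priors and log_likelihoods are made up for illustration and are not values from spam.csv.

# A minimal sketch of predicting with log-probabilities (illustrative values only)
log_priors = {'ham': np.log(0.87), 'spam': np.log(0.13)}
log_likelihoods = {
    'ham':  {'hello': np.log(0.002),  'prize': np.log(0.0001)},
    'spam': {'hello': np.log(0.0005), 'prize': np.log(0.003)},
}

def predict_sketch(words):
    """Return the class that maximizes log P(A_j) + sum_i log P(x_i | A_j)."""
    scores = {}
    for label in log_priors:
        score = log_priors[label]
        for word in words:
            # Unknown words are simply skipped in this sketch; the real model
            # handles them with Laplace smoothing (see below)
            if word in log_likelihoods[label]:
                score += log_likelihoods[label][word]
        scores[label] = score
    return max(scores, key=scores.get)

print(predict_sketch(['hello', 'prize']))  # -> 'spam' with these made-up numbers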

For a more thorough understanding of the Naive Bayes classifier, we highly recommend reading this Stanford University article

Training the Naive Bayes Classifier:

How can we find the probabilities $P(A_{j})$ and $P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})$? We'll simply use the frequencies in the data. For the class prior we find what percentage of the messages in our training set are in each class $A_{j}$:

\begin{equation} P(A_{j}) = \frac{N_{A_{j}}}{N_{total}} \end{equation}

where $N_{A_{j}}$ is the number of messages in our training data with class $A_{j}$ and $N_{total}$ is the total number of messages.
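
As an illustration, here is a minimal sketch of that frequency estimate on a hypothetical list of labels (the real labels will come from the training set loaded above):

# A minimal sketch of estimating class priors from a list of labels (toy data)
labels = ['ham', 'ham', 'spam', 'ham', 'spam']

def estimate_priors(labels):
    total = len(labels)
    # P(A_j) = (number of messages with class A_j) / (total number of messages)
    return {label: labels.count(label) / total for label in set(labels)}

print(estimate_priors(labels))  # {'ham': 0.6, 'spam': 0.4} (order may vary)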

For $P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})$ we just compute the fraction of times the word $x_{i}$ appears among all words in all messages of class $A_{j}$:

\begin{equation} P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) = \frac{count(x_{i}, A_{j})}{\sum_{x \in V} count(x, A_{j})} \end{equation}

where $count(x_{i}, A_{j})$ is the number of times the word $x_{i}$ appears in messages from class $A_{j}$ and $\sum_{x \in V} count(x, A_{j})$ is the total count of all words in class $A_{j}$.
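
Here is a minimal sketch of that estimate for a single class, using a hypothetical handful of already-cleaned messages (not the real dataset):

# A minimal sketch of the (unsmoothed) word likelihoods for one class (toy data)
messages_of_class = ['win a free prize now', 'free entry win win']

word_counts = {}
for message in messages_of_class:
    for word in message.split():
        word_counts[word] = word_counts.get(word, 0) + 1

total_words = sum(word_counts.values())  # total count of all words in this class
likelihoods = {word: count / total_words for word, count in word_counts.items()}
print(likelihoods['win'])  # 3 / 9 = 0.333...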

Laplace smoothing

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data. Given an observation $x = (x_{1},\dots,x_{n})$, a "smoothed" version of the data gives the estimator:

\begin{equation} P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) = \frac{count(x_{i}, A_{j}) + \alpha}{\sum_{x \in V} count(x, A_{j}) + \alpha\lvert V \rvert} \end{equation}

where the pseudocount $\alpha > 0$ is the smoothing parameter ($\alpha = 0$ corresponds to no smoothing) and $V$ is the vocabulary, which consists of the union of all the words in all classes.
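
A minimal sketch of the smoothed estimate, continuing the toy example above (the counts, the vocabulary and $\alpha = 1$ below are illustrative choices, not values required by the lab):

# A minimal sketch of the Laplace-smoothed likelihood for one class (toy data)
alpha = 1.0
word_counts = {'win': 3, 'a': 1, 'free': 2, 'prize': 1, 'now': 1, 'entry': 1}
vocabulary = {'win', 'a', 'free', 'prize', 'now', 'entry', 'hello', 'cash'}
total_words = sum(word_counts.values())  # 9

def smoothed_likelihood(word):
    # (count(x_i, A_j) + alpha) / (total words in class + alpha * |V|)
    return (word_counts.get(word, 0) + alpha) / (total_words + alpha * len(vocabulary))

print(smoothed_likelihood('win'))    # (3 + 1) / (9 + 1 * 8) = 0.235...
print(smoothed_likelihood('hello'))  # an unseen word still gets a non-zero probability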

3.1 - Preprocessing the data

Our data consists of different messages. Messages contain some excess symbols, which don't affect the content of the text, but add noise to the data. For example: "Does not \operate 66.7 after & lt;# & gt; or what".

Let's clean our data: keep only the letters a-z, A-Z and the digits 0-9, cast all letters to lowercase, collapse runs of two or more spaces into a single space, and remove leading and trailing spaces.

# Clean the data

def clean_data(message):
    
    """ 
    Returns string which consists of message words
    
    Argument:
    message -- message from dataset; 
        type(message) -> <class 'str'>
    
    Returns:
    result -- cleaned message, which contains only letters a-z, and numbers 0-9, with only one space between words;
        type(clean_data(message)) -> <class 'str'>
    
    """
    
    # Lowercase the message and replace each run of non-word characters with a space
    cleaned_message = re.sub(r'[^\w]+', ' ', message.lower())
    # Collapse any remaining runs of whitespace into a single space
    cleaned_message = re.sub(r'\s+', ' ', cleaned_message)
    # Remove leading/trailing spaces
    return cleaned_message.strip()
sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser;   ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'
print('cleaned: ',clean_data(sentence_1))
print('cleaned: ',clean_data(sentence_2))

Expected Output:

cleaned: doesn t get how to operate 66 7 after it lt gt won t or what
cleaned: o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s

Now let's clean each sentence and split the data into features (X) and target (Y).
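
As a rough illustration, a minimal sketch of that step is shown below, assuming the label and text columns of spam.csv are named 'v1' and 'v2' (an assumption about the file layout, not something confirmed above):

# A minimal sketch (assumed column names: 'v1' = label, 'v2' = message text)
train_X = train_set['v2'].apply(clean_data)
train_Y = train_set['v1']
test_X = test_set['v2'].apply(clean_data)
test_Y = test_set['v1']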