Naive Bayes Classifier
Welcome to your next lab! You will build a Naive Bayes Classifier.
You will use it to classify spam/ham messages.
You will learn to:
- Build the general architecture of a learning algorithm with OOP in mind:
  - Helper functions
    - Preprocessing data
    - Calculate prior probs for classes
    - Calculate likelihood probs of each word in a class
  - Main Model Class
    - Training
    - Prediction
  - Helper functions
0 - Download data
!pip install wget

import wget
wget.download('https://dru.fra1.digitaloceanspaces.com/DS_Fundamentals/datasets/04_supervised_learning/Naive_Bayes_Classifier/spam.csv')

import pandas as pd
import numpy as np
import re

!head spam.csv

2 - Overview of the Problem set
Problem Statement: You are given a dataset containing:
- a training set of m_train examples
- a test set of m_test examples
- each example is a message that belongs to a particular class: ham or spam.
Let's get more familiar with the dataset. Load the data by running the following code.
We won't split our data into features (X) and target (Y) here, because we first need to preprocess it in a special way.
# Loading the data
def load_data():
    df = pd.read_csv('spam.csv', encoding='latin-1')
    # Keep the first five rows as a small fixed sample for checks later in the lab
    df_for_tests = df.head()
    # Shuffle the row indices and split 80% / 20% into train and test sets
    idx = np.arange(df.shape[0])
    np.random.shuffle(idx)
    train_set_size = int(df.shape[0] * 0.8)
    train_set = df.loc[idx[:train_set_size]]
    test_set = df.loc[idx[train_set_size:]]
    return train_set, test_set, df_for_tests

train_set, test_set, df_for_tests = load_data()

3 - Naive Bayes Classifier
Mathematical expression of the algorithm:
This algorithm is based on Bayes' theorem: \begin{equation} P(A_{j}\hspace{0.1cm}\rvert\hspace{0.1cm}x_{1},\dots,x_{n}) = \frac{P(x_{1},\dots,x_{n}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})P(A_{j})}{P(x_{1},\dots,x_{n})} \end{equation}
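As a quick illustration with made-up numbers (not taken from the dataset): suppose $P(spam) = 0.2$, $P(ham) = 0.8$, and for the one-word message "free" we have $P(free\hspace{0.1cm}\rvert\hspace{0.1cm}spam) = 0.3$ and $P(free\hspace{0.1cm}\rvert\hspace{0.1cm}ham) = 0.05$. Then \begin{equation} P(spam\hspace{0.1cm}\rvert\hspace{0.1cm}free) = \frac{0.3 \cdot 0.2}{0.3 \cdot 0.2 + 0.05 \cdot 0.8} = \frac{0.06}{0.10} = 0.6 \end{equation} so the message would be classified as spam.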
Ignoring the denominator (because it stays the same for all cases): \begin{equation} P(A_{j}\hspace{0.1cm}\rvert\hspace{0.1cm}x_{1},\dots,x_{n}) \propto P(x_{1},\dots,x_{n}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})P(A_{j}) \end{equation}
By making an assumption that the features $x_{1},\dots,x_{n}$ are conditionally independent given the class (the "naive" assumption), the likelihood factorizes: \begin{equation} P(x_{1},\dots,x_{n}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) = \prod_{i=1}^{n}P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) \end{equation}
We can then score each class, if we know the prior probability, and choose the most probable one: \begin{equation} \hat{y} = \underset{j}{\mathrm{argmax}}\ P(A_{j})\prod_{i=1}^{n}P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) \end{equation}
Due to floating point underflow, the above is usually replaced with the numerically tractable expression: \begin{equation} \hat{y} = \underset{j}{\mathrm{argmax}}\ \left(\log P(A_{j}) + \sum_{i=1}^{n}\log P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})\right) \end{equation}
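To make this concrete, here is a minimal sketch of the log-space decision rule (not the lab solution; log_prior and log_likelihood are assumed to be dictionaries produced by training, and words unseen in a class are simply skipped):

# Sketch only: assumes log_prior[c] = log P(c) and log_likelihood[c][w] = log P(w | c)
def predict_class(words, log_prior, log_likelihood):
    scores = {}
    for c in log_prior:
        # log P(c) + sum of log P(w | c) over the words of the message
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in words if w in log_likelihood[c]
        )
    # Return the class with the highest log-score
    return max(scores, key=scores.get)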
For a more thorough understanding of the NBC algorithm, we highly recommend reading this Stanford University article.
Training the Naive Bayes Classifier:
How can we find the probabilities $P(A_{j})$ and $P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j})$? The prior is estimated from the class frequencies in the training set: \begin{equation} P(A_{j}) = \frac{N_{j}}{N} \end{equation}
where $N_{j}$ is the number of messages of class $A_{j}$ and $N$ is the total number of messages in the training set.
In the same way, the likelihood of each word is estimated from the word counts per class: \begin{equation} P(x_{i}\hspace{0.1cm}\rvert\hspace{0.1cm}A_{j}) = \frac{count(x_{i}, A_{j})}{\sum_{w \in V} count(w, A_{j})} \end{equation}
where $count(x_{i}, A_{j})$ is the number of times the word $x_{i}$ occurs in messages of class $A_{j}$, and $V$ is the vocabulary of the training set.
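As an illustration of these two formulas, here is a sketch under the assumption that each message is already cleaned and split into a list of words (the function and variable names are illustrative, not part of the lab interface):

from collections import Counter, defaultdict

def estimate_parameters(messages, labels):
    # Prior: fraction of training messages in each class
    class_counts = Counter(labels)
    prior = {c: class_counts[c] / len(labels) for c in class_counts}

    # Count how many times each word appears in each class
    word_counts = defaultdict(Counter)
    for words, c in zip(messages, labels):
        word_counts[c].update(words)

    # Likelihood: count(word, class) / total number of words in that class
    likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        likelihood[c] = {w: cnt / total for w, cnt in counts.items()}

    return prior, likelihood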
Laplace smoothing
In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data. Given an observation $x = (x_{1},\dots,x_{d})$ from a multinomial distribution with $N$ trials, a "smoothed" version of the counts gives the estimator: \begin{equation} \hat{\theta}_{i} = \frac{x_{i} + \alpha}{N + \alpha d} \end{equation}
where the pseudocount $\alpha > 0$ is the smoothing parameter ($\alpha = 0$ corresponds to no smoothing). It prevents zero probabilities for words that never appear in a given class in the training data.
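A sketch of how the smoothed word likelihood could look (alpha and the helper name are illustrative assumptions, not part of the lab interface):

def smoothed_likelihood(word, class_word_counts, vocabulary_size, alpha=1.0):
    # (count(word, class) + alpha) / (total words in the class + alpha * |V|)
    total = sum(class_word_counts.values())
    return (class_word_counts.get(word, 0) + alpha) / (total + alpha * vocabulary_size)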
3.1 - Preprocessing the data
Our data consists of different messages. Messages contain some excess symbols which don't affect the content of the text but add noise to the data. For example: "Does not \operate 66.7 after & lt;# & gt; or what".
Let's clean our data: keep only the letters a-z, A-Z and the digits 0-9, cast all letters to lowercase, replace runs of two or more spaces with a single space, and remove leading and trailing spaces.
# Clean the data
def clean_data(message):
    """
    Returns string which consists of message words

    Argument:
    message -- message from dataset;
    type(message) -> <class 'str'>

    Returns:
    result -- cleaned message, which contains only letters a-z, and numbers 0-9, with only one space between words;
    type(clean_data(message)) -> <class 'str'>
    """
    # Lowercase and replace every run of non-word characters with a single space
    cleaned_message = re.sub(r'[^\w]+', ' ', message.lower())
    # Collapse repeated whitespace into one space
    cleaned_message = re.sub(r'\s+', ' ', cleaned_message)
    # Remove leading and trailing spaces
    return cleaned_message.strip()

sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser; ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'
print('cleaned: ', clean_data(sentence_1))
print('cleaned: ', clean_data(sentence_2))

Expected Output:
| cleaned: | doesn t get how to operate 66 7 after it lt gt won t or what |
| cleaned: | o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s |
Now let's clean each sentence and split the data into features (X) and target (Y).
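One possible sketch of that step (assuming the label and text columns are named 'v1' and 'v2', as in the common version of spam.csv; check df.columns before relying on this):

def prepare_data(df):
    # Clean every message and keep the corresponding labels
    X = df['v2'].apply(clean_data).values  # cleaned messages
    Y = df['v1'].values                     # 'ham' / 'spam' labels
    return X, Y

# Example usage:
# X_train, Y_train = prepare_data(train_set)
# X_test, Y_test = prepare_data(test_set)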