Support Vector Machines

Welcome to your next lab! You will build the SVM algorithm from scratch and explore how this model behaves with different kernels.

You will learn to:

  • Build the general architecture of a learning algorithm with OOP in mind (see the sketch after this list):
    • Helper functions:
      • Kernels
      • Kernel matrix
      • Computing Lagrange multipliers
      • Extracting support vectors
    • Main Model Class:
      • Training
      • Prediction
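
To make this structure concrete, here is one possible skeleton of what you will build. All names below (linear_kernel, gram_matrix, SVM, fit, predict) are illustrative assumptions, not the interface required by the lab.

# A minimal, hedged sketch of the structure built in this lab.
# The names here are illustrative assumptions, not the required API.
import numpy as np

def linear_kernel(x1, x2):
    """Simplest kernel: the dot product of two feature vectors."""
    return np.dot(x1, x2)

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K[i, j] = kernel(x_i, x_j) for the columns of X."""
    m = X.shape[1]
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[:, i], X[:, j])
    return K

class SVM:
    def __init__(self, kernel=linear_kernel, C=1.0):
        self.kernel = kernel
        self.C = C

    def fit(self, X, y):
        # 1. Build the kernel matrix.
        # 2. Solve the dual quadratic program for the Lagrange multipliers.
        # 3. Keep only the support vectors (multipliers above a small threshold).
        raise NotImplementedError

    def predict(self, X):
        # Sign of the decision function built from the support vectors.
        raise NotImplementedError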

Before you start

SVM is a rather complex method compared to the ones we have studied before. For a better understanding of the steps in this lab, we strongly recommend working through these topics (links to related guides); a minimal cvxopt example follows the list:

  • Lagrangian
  • SVM algorithm and kernels intuition
  • Quadratic optimization problem
  • Using cvxopt
  • cvxopt tutorial
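
To give a feel for how cvxopt is used later on, here is a tiny sketch that solves a generic quadratic program with cvxopt.solvers.qp. The numbers are made up purely for illustration; in the lab, the matrices will come from the SVM dual problem.

# A tiny sketch of solving a quadratic program with cvxopt:
#   minimize (1/2) x^T P x + q^T x   subject to   G x <= h
# The numbers below are arbitrary and only illustrate the API.
import numpy as np
import cvxopt

P = cvxopt.matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))
q = cvxopt.matrix(np.array([-2.0, -5.0]))
G = cvxopt.matrix(np.array([[-1.0, 0.0], [0.0, -1.0]]))  # x >= 0 rewritten as -x <= 0
h = cvxopt.matrix(np.zeros(2))

solution = cvxopt.solvers.qp(P, q, G, h)
x = np.ravel(solution['x'])  # optimal point, here approximately [1.0, 2.5]
print(x)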

0 - Download data

# Install the wget package and download the feature and label files.
!pip install wget
import wget

wget.download('https://dru.fra1.digitaloceanspaces.com/DS_Fundamentals/datasets/04_supervised_learning/Support_Vector_Machines/mush_features.csv')
wget.download('https://dru.fra1.digitaloceanspaces.com/DS_Fundamentals/datasets/04_supervised_learning/Support_Vector_Machines/mush_labels.csv')

1 - Packages

First, let's run the cell below to import all the packages that you will need during this assignment.

  • numpy is the fundamental package for scientific computing with Python.
  • matplotlib is a famous library to plot graphs in Python.
  • cvxopt is a software package for convex optimization.
# !pip install cvxopt
import numpy as np
import cvxopt
import matplotlib.pyplot as plt

# Silence the solver's iteration log so the notebook output stays clean.
cvxopt.solvers.options['show_progress'] = False

2 - Overview of the Dataset

Problem Statement: You are given a dataset containing:

  • a training set of m_train examples

  • a test set of m_test examples

  • each example is of shape (number of features, 1)

    This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; the latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poisonous Oak and Ivy.

def load_data():
    from sklearn.model_selection import train_test_split
    
    # Load features and labels (add delimiter=',' here if the files are comma-separated).
    X = np.genfromtxt('mush_features.csv')
    Y = np.genfromtxt('mush_labels.csv')
    
    train_set_x, test_set_x, train_set_y, test_set_y = train_test_split(X, Y, test_size=0.33, random_state=42)
    
    # Keep a manageable subset: 300 training and 100 test examples.
    train_set_x = train_set_x[:300].astype(float)
    train_set_y = train_set_y[:300].astype(float)
    
    test_set_x = test_set_x[:100].astype(float)
    test_set_y = test_set_y[:100].astype(float)
    
    # A tiny slice of the training data, used later for quick sanity checks.
    x_test = train_set_x[:5]
    y_test = train_set_y[:5]   
    
    # Store examples column-wise: shape (number of features, number of examples).
    train_set_x = train_set_x.reshape(train_set_x.shape[0], -1).T
    test_set_x = test_set_x.reshape(test_set_x.shape[0], -1).T
    
    # Labels become row vectors of shape (1, number of examples).
    train_set_y = train_set_y.reshape((1, train_set_y.shape[0]))
    test_set_y = test_set_y.reshape((1, test_set_y.shape[0]))
    
    x_test = x_test.reshape(x_test.shape[0], -1).T
    y_test = y_test.reshape((1, y_test.shape[0]))
    
    return train_set_x, test_set_x, train_set_y, test_set_y, x_test, y_test

Many software bugs in machine learning come from matrix/vector dimensions that don't match. If you keep your dimensions straight, you will go a long way toward eliminating them.

So, let's check shapes:

train_set_x, test_set_x, train_set_y, test_set_y, x_test, y_test = load_data()
print('train_set_x.shape: ', train_set_x.shape)
print('test_set_x.shape: ', test_set_x.shape)
print('train_set_y.shape: ', train_set_y.shape)
print('test_set_y.shape: ', test_set_y.shape)

Expected output:

train_set_x.shape: (22, 300)
test_set_x.shape: (22, 100)
train_set_y.shape: (1, 300)
test_set_y.shape: (1, 100)
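
If you want to be strict about it, you can also assert the shapes and take a quick look at the label values (SVM formulations usually expect labels in {-1, +1}, so it is worth knowing what the files actually contain). This is an optional sanity check, not part of the lab's required code.

# Optional sanity check (not required by the lab): fail fast on wrong shapes
# and inspect which label values are present in the data.
assert train_set_x.shape == (22, 300)
assert test_set_x.shape == (22, 100)
assert train_set_y.shape == (1, 300)
assert test_set_y.shape == (1, 100)

print('label values in the training set:', np.unique(train_set_y))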

Distribution of samples in train set:

plt.figure(figsize=(4, 3))
plt.hist(train_set_y.T)
plt.xlabel("Class")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Distribution of samples in test set:

plt.figure(figsize=(4, 3))
plt.hist(test_set_y.T)
plt.xlabel("Class")
plt.ylabel("Count")
plt.tight_layout()
plt.show()