
Support Vector Machine Classification

Support-vector machines (SVMs) are supervised learning models used for classification and regression, known for the kernel trick, which lets them handle nonlinear input spaces. This template builds, trains, and tunes an SVM for a classification problem. If you would like to learn more about SVMs, take a look at DataCamp's Linear Classifiers in Python course.

To swap your dataset into this template, the following is required:

  • There must be at least one feature column and a column with the categorical target variable you would like to predict.
  • The features must be cleaned and preprocessed, including categorical encoding (see the sketch after this list).
  • There must be no NaN/NA values. You can use this template to impute missing values if needed.
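As a minimal sketch of these preprocessing steps, assuming a hypothetical raw file and a hypothetical categorical column named "market_segment", encoding and missing-value handling could look like this:

# A minimal preprocessing sketch; the file path and "market_segment" column are hypothetical
import pandas as pd

raw_df = pd.read_csv("data/hotel_bookings_raw.csv")

# One-hot encode the assumed categorical column
raw_df = pd.get_dummies(raw_df, columns=["market_segment"], drop_first=True)

# Either drop rows with missing values...
raw_df = raw_df.dropna()

# ...or impute them instead, e.g. with each numeric column's median
# raw_df = raw_df.fillna(raw_df.median(numeric_only=True))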

The placeholder dataset in this template consists of hotel booking data with details such as length of stay. Each row represents a booking and whether the booking was canceled (the target variable). You can find more information on this dataset's source and data dictionary here.

1. Loading packages and data

# Load packages
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV

# Load the data and replace with your CSV file path
df = pd.read_csv("data/hotel_bookings_clean.csv")
df.head()
# Check if there are any null values
print(df.isnull().sum())
# Check columns to make sure you have feature(s) and a target variable
df.info()

2. Splitting the data

To split the data, we'll use the train_test_split() function.

# Split the data into features (X) and the target variable (y)
X = df.iloc[:, 1:]  # Specify at least one column as feature(s)
y = df["is_canceled"]  # Specify one column as the target variable

# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
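If your target classes are imbalanced, you can preserve the class proportions in both subsets by passing train_test_split()'s stratify parameter. A minimal variation on the split above:

# Stratified variant: keeps the class ratio of y the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y
)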

3. Building a support vector machine classifier

The following code builds a scikit-learn support vector machine classifier (svm.SVC) using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Linear Classifiers in Python course or the scikit-learn documentation.

# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "kernel": "linear",  # Kernel type: 'linear', 'poly', 'rbf', 'sigmoid', or 'precomputed'
    "C": 1,  # Regularization parameter, squared l2 penalty
    "gamma": 0.01,  # Kernel coefficient (a float, 'scale', or 'auto') for 'rbf', 'poly' and 'sigmoid'
    "degree": 3,  # Degree of ‘poly’ kernel function
    "random_state": 123,
}

# Create a svm.SVC with the parameters above
clf = svm.SVC(**params)

# Train the SVM classifier on the train set
clf = clf.fit(X_train, y_train)

# Predict the outcomes on the test set
y_pred = clf.predict(X_test)

To evaluate this classifier, we will use accuracy, implemented with sklearn's metrics.accuracy_score() function. Note that accuracy may not be the best evaluation metric for your problem, especially if your dataset has class imbalance.

# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

4. Hyperparameter tuning with random search

Hyperparameter tuning is considered best practice for improving the efficiency and effectiveness of your machine learning model. In this section, we'll use random search, in which a fixed number of hyperparameter settings is sampled from specified distributions. To learn more about other hyperparameter tuning options, such as grid search, check out DataCamp's Hyperparameter Tuning in Python course.

Note: SVMs can take noticeably longer to train on larger datasets than other models. If that's the case, you can narrow the parameter space, reduce the number of folds and candidates in RandomizedSearchCV(), or tune on a subsample of the training data, as sketched below. Otherwise, you may want to consider another classification model, such as a decision tree.
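A minimal subsampling sketch, assuming a hypothetical subsample size of 5,000 rows:

# Tune on a random subsample of the training data (the size of 5,000 is hypothetical)
n_sample = min(5000, len(X_train))
X_train_small = X_train.sample(n=n_sample, random_state=123)
y_train_small = y_train.loc[X_train_small.index]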

# Define a parameter grid with distributions of possible parameters to use
rs_param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
    "gamma": [0.00001, 0.0001, 0.001, 0.01, 0.1],
}
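The lists above are sampled uniformly; RandomizedSearchCV() also accepts continuous distributions via any object with an rvs() method. A minimal sketch using scipy.stats.loguniform (the ranges are hypothetical) that you could pass as param_distributions instead:

# Alternative: sample C and gamma from continuous log-uniform distributions
from scipy.stats import loguniform

rs_param_dist = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": loguniform(1e-1, 1e1),
    "gamma": loguniform(1e-5, 1e-1),
}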

# Create a svm.SVC object
clf = svm.SVC(random_state=123)

# Instantiate RandomizedSearchCV() with clf and the parameter grid
clf_rs = RandomizedSearchCV(
    estimator=clf,
    param_distributions=rs_param_grid,
    cv=3,  # Number of folds
    n_iter=5,  # Number of parameter candidate settings to sample
    verbose=2,  # The higher this is, the more messages are output
    random_state=123,
)

# Train the model on the training set
clf_rs.fit(X_train, y_train)

# Print the best parameters and the best cross-validated accuracy
print("Best parameters found: ", clf_rs.best_params_)
print("Best cross-validation accuracy found: ", clf_rs.best_score_)