
Supervised Learning with scikit-learn

Run the code cell below to import the data used in this course.

# Importing pandas
import pandas as pd

# Importing the course datasets 
diabetes = pd.read_csv('datasets/diabetes_clean.csv')
music = pd.read_csv('datasets/music_clean.csv')
advertising = pd.read_csv('datasets/advertising_and_sales_clean.csv')
telecom = pd.read_csv("datasets/telecom_churn_clean.csv")

There are two types of supervised learning: classification and regression. Recall that binary classification is used to predict a target variable that has only two labels, typically represented numerically as a zero or a one.
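
For example, the churn column of the telecom dataset loaded above is a binary target; a quick sketch to see its two labels:

# churn takes only two values: 0 (stayed) and 1 (churned)
print(telecom["churn"].value_counts())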

Building a classification model, or classifier, to predict the labels of unseen data

  • Build a model (classifier)
  • The classifier learns from the labeled data we pass to it (labeled data = training data)
  • Pass unlabeled data to the classifier as input
  • The classifier predicts the labels of the unseen data; see the sketch below
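
A minimal sketch of this fit/predict workflow, using the telecom features that appear later in this notebook; the X_new observations are hypothetical, made-up unlabeled data:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Labeled data (training data): features and known labels
X = telecom[["total_day_charge", "total_eve_charge"]].values
y = telecom["churn"].values

# The classifier learns from the labeled data
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# Hypothetical unlabeled observations (feature values are invented)
X_new = np.array([[30.0, 17.5], [107.0, 24.1], [213.0, 10.9]])

# The model predicts labels for the unseen data
print(knn.predict(X_new))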

Measuring Model Performance (Accuracy)

  • Accuracy = correct predictions / total observations
  • Use a train/test split: fit the model on the training data, then measure accuracy on the test data
  • Use 20-30% of the data as the test set (test_size=0.2 or test_size=0.3)
  • Pass stratify=y so the class proportions in the train and test sets match those of the full dataset

Interpreting k using a model complexity curve

  • Interpreting model complexity is a great way to evaluate performance in supervised learning.
  • Your aim is to produce a model that captures the relationship between features and the target variable, and generalizes well when exposed to new observations.
# Imports for the KNN workflow
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = telecom[["total_day_charge", "total_eve_charge"]].values
y = telecom["churn"].values


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))


# Create a range of neighbor values to test
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
    # Set up a KNN classifier with the current number of neighbors
    knn = KNeighborsClassifier(n_neighbors=neighbor)

    # Fit the model
    knn.fit(X_train, y_train)

    # Compute accuracy on the training and test sets
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)

print(neighbors, '\n', train_accuracies, '\n', test_accuracies)

# Plot model accuracies
# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")

# Plot test accuracies
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()

# Look for the number of neighbors where test accuracy peaks
# before falling off; in this case it's 7

Regression

  • The target variable is continuous; in classification it is discrete

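A minimal sketch of simple linear regression on a single feature, assuming the advertising dataset loaded above has "radio" and "sales" columns (the column names are an assumption):

from sklearn.linear_model import LinearRegression

# Assumed column names: "radio" (feature) and "sales" (target)
X = advertising["radio"].values.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = advertising["sales"].values

reg = LinearRegression()
reg.fit(X, y)
predictions = reg.predict(X)
print(predictions[:5])
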
# Multiple linear regression
from sklearn.linear_model import LinearRegression

# Assumption: sales_df is the numeric part of the advertising dataset loaded above
sales_df = advertising.select_dtypes("number")

# Create X and y arrays
X = sales_df.drop("sales", axis=1).values
y = sales_df["sales"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X_train,y_train)

# Make predictions
y_pred = reg.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))

Regression performance
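
A minimal sketch of common regression metrics, assuming the reg model, test split, and y_pred from above: R-squared via .score() and RMSE via mean_squared_error.

from sklearn.metrics import mean_squared_error
import numpy as np

# R-squared: proportion of variance in the target explained by the features
r_squared = reg.score(X_test, y_test)

# RMSE: root of the mean squared error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("R^2: {}".format(r_squared))
print("RMSE: {}".format(rmse))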

Cross Validation

  • If we compute R-squared on the test set, the value we get depends on how we happened to split the data.
  • The data points in the test set may have peculiarities that make the R-squared computed on it unrepresentative of the model's ability to generalize to unseen data.
  • To combat this dependence on an essentially random split, we use a technique called cross-validation.
  • We begin by splitting the dataset into five groups, or folds.
  • Then we set aside the first fold as a test set, fit the model on the remaining four folds, predict on the test set, and compute the metric of interest, such as R-squared.
  • Next, we set aside the second fold as the test set, fit on the remaining data, predict on the test set, and compute the metric. We repeat this for the remaining folds; see the sketch below.
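
A minimal sketch of 5-fold cross-validation with cross_val_score, assuming the X and y arrays from the regression section above (for a regressor, the default scoring is R-squared):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Shuffle the data before splitting it into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
reg = LinearRegression()

# One R-squared score per fold
cv_scores = cross_val_score(reg, X, y, cv=kf)
print(cv_scores)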

Interpreting validation results using measures of central tendency: an average score of 0.75 with a low standard deviation is pretty good for a model out of the box! The sketch below shows how to summarize the fold scores.
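
A sketch of summarizing the fold scores, assuming the cv_scores array from the cross-validation sketch above:

import numpy as np

print(np.mean(cv_scores))                      # average score across folds
print(np.std(cv_scores))                       # spread across folds
print(np.quantile(cv_scores, [0.025, 0.975]))  # 95% interval of fold scores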

