Supervised Learning with scikit-learn
Run the hidden code cell below to import the data used in this course.
# Importing pandas
import pandas as pd
# Importing the course datasets
diabetes = pd.read_csv('datasets/diabetes_clean.csv')
music = pd.read_csv('datasets/music_clean.csv')
advertising = pd.read_csv('datasets/advertising_and_sales_clean.csv')
telecom = pd.read_csv("datasets/telecom_churn_clean.csv")
There are two types of supervised learning: classification and regression. Recall that binary classification predicts a target variable that has only two labels, typically represented numerically as 0 or 1.
Building a classification model, or classifier, to predict the labels of unseen data:
- Build a model (classifier).
- The classifier learns from the labeled data we pass to it. Labeled data = training data.
- Pass unlabeled data to the classifier as input.
- The classifier predicts the labels of the unseen data.
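The workflow above can be sketched in a few lines. This is a minimal illustration, not the course's own code: the tiny arrays `X_train`, `y_train`, and `X_new` are made-up data, and the choice of `n_neighbors=3` is arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled training data: two features per observation, binary label
X_train = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
                    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Build the classifier and let it learn from the labeled data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Pass unlabeled observations to the fitted model to get predicted labels
X_new = np.array([[1.0, 1.0], [5.0, 5.0]])
predictions = knn.predict(X_new)
print(predictions)
```

Each new point is assigned the majority label among its three nearest training points.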
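Measuring Model Performance (Accuracy)
- Accuracy = correct predictions / total observations.
- Use a train/test split: fit the model on the training set, then compute accuracy on the test set.
- Hold out 20-30% of the data as the test set (test_size=0.2 or test_size=0.3).
- Pass stratify=y so that both splits keep the same proportion of each label as the full dataset.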
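To see what stratify=y does and how accuracy is computed by hand, here is a small sketch on synthetic data. The imbalanced labels (80 zeros, 20 ones) and the "always predict 0" baseline are assumptions made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 observations labeled 0 and 20 labeled 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# With stratify=y, both splits keep roughly the original 20% share of ones
print(y_train.mean(), y_test.mean())

# Accuracy = correct predictions / total observations,
# shown here for a naive baseline that always predicts 0
y_pred = np.zeros_like(y_test)
accuracy = (y_pred == y_test).sum() / len(y_test)
print(accuracy)
```

The baseline scores about 0.8 without learning anything, which is why accuracy should be read against the class balance of the data.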
Interpreting k using a model complexity curve
- Plotting accuracy against model complexity (here, the number of neighbors k) is a useful way to evaluate a supervised learning model.
- The aim is a model that captures the relationship between the features and the target variable and also generalizes well to new observations.
- In KNN, a small k gives a more complex model that can overfit, while a large k gives a simpler model that can underfit.
# Imports for the KNN model and the accuracy plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X = telecom[["total_day_charge", "total_eve_charge"]].values
y = telecom["churn"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}
for neighbor in neighbors:
    # Set up a KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    # Fit the model
    knn.fit(X_train, y_train)
    # Compute accuracy
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)
print(neighbors, '\n', train_accuracies, '\n', test_accuracies)
# Plot model accuracies
# Add a title
plt.title("KNN: Varying Number of Neighbors")
# Plot training accuracies (list() turns the dict view into a plain sequence)
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")
# Plot test accuracies
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
# Display the plot
plt.show()
# Look for the number of neighbors where training and test accuracy both peak
# In this case it's 7
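Reading the peak off the plot can be backed up with a quick lookup. The sketch below uses a made-up `test_accuracies` dictionary (hypothetical values, not the results from the telecom data) and picks the k with the highest test accuracy.

```python
# Hypothetical test accuracies keyed by number of neighbors (illustrative values only)
test_accuracies = {1: 0.79, 2: 0.85, 3: 0.84, 4: 0.87, 5: 0.86}

# Pick the k whose test accuracy is highest
best_k = max(test_accuracies, key=test_accuracies.get)
print(best_k, test_accuracies[best_k])
```

The same one-liner should work on the dictionary built in the loop above, since it is also keyed by the number of neighbors.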
Regression
- The target variable is continuous; in classification it is discrete.
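# Multiple linear regression
from sklearn.linear_model import LinearRegression
# Assumption: sales_df is a numeric DataFrame with a "sales" column, prepared from the advertising data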
# Create X and y arrays
X = sales_df.drop("sales", axis=1).values
y = sales_df["sales"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Instantiate the model
reg = LinearRegression()
# Fit the model to the data
reg.fit(X_train, y_train)
# Make predictions
y_pred = reg.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))
Regression performance
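As a rough sketch of how regression performance might be measured, the example below computes R-squared on a test set, plus RMSE as a common companion metric. It uses synthetic data from `make_regression` as a stand-in for the sales data, so the numbers it prints say nothing about the course dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption standing in for the sales data)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# R-squared: share of the variance in the target explained by the features
r_squared = reg.score(X_test, y_test)
# RMSE: typical prediction error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("R^2: {}".format(r_squared))
print("RMSE: {}".format(rmse))
```

For a regressor, `score` returns R-squared by default, so no separate metric function is needed for it.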
Cross Validation
- If we compute R-squared on a single test set, the value depends on how the data happened to be split.
- The data points in the test set may have peculiarities, so the R-squared computed on them may not be representative of the model's ability to generalize to unseen data.
- To combat this dependence on what is essentially a random split, we use a technique called cross-validation.
- We begin by splitting the dataset into five groups, or folds.
- We set aside the first fold as a test set, fit the model on the remaining four folds, predict on the test set, and compute the metric of interest, such as R-squared.
- Next, we set aside the second fold as the test set, fit on the remaining data, predict on the test set, and compute the metric of interest. We repeat this for each of the remaining folds, which gives five values of the metric.
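The fold-by-fold procedure above can be sketched with scikit-learn's `KFold`, then checked against the `cross_val_score` shortcut. The data here is synthetic (an assumption for illustration), and shuffling with `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption, used only to make the sketch runnable)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Five folds; shuffle so the folds do not depend on row order
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual version: hold out each fold in turn, fit on the rest, score on the held-out fold
manual_scores = []
for train_idx, test_idx in kf.split(X):
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])
    manual_scores.append(reg.score(X[test_idx], y[test_idx]))  # R-squared

# Shortcut: cross_val_score runs the same loop and returns one score per fold
cv_scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(cv_scores)
```

Both approaches should give the same five R-squared values, because the fixed `random_state` makes the folds identical on each call.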
Interpreting validation results using measures of central tendency
- An average score of 0.75 with a low standard deviation is pretty good for a model out of the box!
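One way to summarize cross-validation results is with the mean, the standard deviation, and an approximate 95% interval from the quantiles. The `cv_scores` array below is hypothetical, chosen only so that its mean is 0.75; it is not output from the course data.

```python
import numpy as np

# Hypothetical fold scores (illustrative values, not real results)
cv_scores = np.array([0.74, 0.77, 0.75, 0.73, 0.76])

mean_score = np.mean(cv_scores)   # central tendency across folds
std_score = np.std(cv_scores)     # spread across folds; low means stable performance
interval = np.quantile(cv_scores, [0.025, 0.975])  # rough 95% interval

print(mean_score, std_score)
print(interval)
```

A small standard deviation relative to the mean suggests the score does not depend much on which fold was held out.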