
Supervised Learning with scikit-learn

Run the code cell below to import the data used in this course.

# Importing pandas and numpy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scikitplot
# Importing the course datasets 
diabetes = pd.read_csv('datasets/diabetes_clean.csv')
music = pd.read_csv('datasets/music_clean.csv')
advertising = pd.read_csv('datasets/advertising_and_sales_clean.csv')
telecom = pd.read_csv("datasets/telecom_churn_clean.csv")

Chapter 1: Classification

What is machine learning?

Process whereby:

  • Computers learn to make decisions from data without being explicitly programmed

Supervised learning

  • The predicted values are known
  • Aim: Predict the target values of unseen data, given the features
  • Uses features to predict the value of a target variable

Types of supervised learning

  • Classification: target variable consists of categories (fraudulent vs non-fraudulent transaction is an example of binary classification)
  • Regression: Target variable is continuous

Naming conventions

  • Feature = predictor variable = independent variable (column in table)
  • Target variable = dependent variable = response variable

Requirements before using supervised learning:

  • No missing values
  • Data in numeric form
  • Data stored in a pandas DataFrame or NumPy array
  • Perform EDA first (a quick check is sketched after this list)
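
A minimal sketch of how these requirements might be checked on the telecom DataFrame loaded above; the checks are standard pandas calls, and which columns need attention will depend on the dataset.

# Check for missing values in each column (should all be 0)
print(telecom.isna().sum())
# Confirm every column is numeric before modeling
print(telecom.dtypes)
# Basic EDA: summary statistics for each numeric column
print(telecom.describe())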

scikit-learn syntax

from sklearn.module_name import ModelName

model = ModelName()
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)


Classification challenge

Classifying labels of unseen data

  1. Build a model
  2. Model learns from the labeled data we pass to it
  3. Pass unlabeled data to the model as input
  4. Model predicts the labels of the unseen data
  • Labeled data = training data

k-Nearest Neighbors

  • KNN predicts label of a data point by
  • Looking at the k closest labeled data points
  • Taking a majority vote
telecom.head()
from sklearn.neighbors import KNeighborsClassifier
X = telecom[['total_day_charge', 'total_eve_charge']].values
y = telecom['churn'].values
# .values converts X and y to numpy arrays
print(X.shape, y.shape)
print(f'There are {X.shape[0]} observations of {X.shape[1]} features, and {y.shape[0]} observations of the target feature.')
# instantiate the classifier with n_neighbors=15
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)

Predicting on unlabeled data

# inputting new data to test the model
X_new = np.array([[56.8, 17.5], [24.4, 24.1], [50.1, 10.9]])

# predicting the new values
predictions = knn.predict(X_new)

print('Predictions: {}'.format(predictions))

The model predicts that the first customer will churn and that the next two won't.

Measuring model performance

  • In classification, accuracy is a commonly used metric
  • Accuracy on the training data is NOT indicative of the model's ability to generalize to unseen data

Computing Accuracy

Split the data into a training set and a test set, fit/train the classifier on the training set, then calculate accuracy on the test set.
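
A minimal sketch of this split-and-score workflow, reusing the telecom X and y from above; the test_size and random_state values are illustrative choices, not taken from the course text.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out 30% of the data for testing; stratify keeps churn proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)

# .score returns accuracy: the fraction of correct predictions on the test set
print(knn.score(X_test, y_test))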