Supervised Learning with scikit-learn
Run the hidden code cell below to import the data used in this course.
# Importing pandas, NumPy, and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scikitplot
# Importing the course datasets
diabetes = pd.read_csv('datasets/diabetes_clean.csv')
music = pd.read_csv('datasets/music_clean.csv')
advertising = pd.read_csv('datasets/advertising_and_sales_clean.csv')
telecom = pd.read_csv("datasets/telecom_churn_clean.csv")
Chapter 1: Classification
What is machine learning?
Process whereby:
- Computers learn to make decisions from data without being explicitly programmed
Supervised learning
- The true values of the target variable are known for the training data
- Aim: Predict the target values of unseen data, given the features
- Uses features to predict the value of a target variable
Types of supervised learning
- Classification: target variable consists of categories (fraudulent vs non-fraudulent transaction is an example of binary classification)
- Regression: Target variable is continuous
Naming conventions
- Feature = predictor variable = independent variable (column in table)
- Target variable = dependent variable = response variable
Requirements before using supervised learning:
- No missing values
- Data in numeric form
- Data stored in a pandas DataFrame or NumPy array
- Perform EDA first
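A quick sketch of how these requirements might be checked with pandas, using the diabetes DataFrame loaded above (the specific checks are just illustrative, not part of the course):
# Check for missing values in each column
print(diabetes.isna().sum())
# Confirm every column is numeric (object dtypes would need encoding first)
print(diabetes.dtypes)
# Basic EDA: summary statistics for each feature
print(diabetes.describe())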
scikit-learn syntax
`from sklearn.module_name import ModelName
model = ModelName()
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)`
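As a concrete illustration of this pattern, here is a minimal sketch with KNeighborsClassifier on made-up toy arrays (the values are invented purely for demonstration):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Toy labeled data: two features per observation, binary target
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)  # learn from the labeled data
X_new = np.array([[1.2, 1.9], [5.5, 8.5]])
predictions = model.predict(X_new)  # predict labels for unseen data
print(predictions)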
Classification challenge
Classifying labels of unseen data
- Build a model
- Model learns from the labeled data we pass to it
- Pass unlabeled data to the model as input
- Model predicts the labels of the unseen data
- Labeled data = training data
k-Nearest Neighbors
- KNN predicts label of a data point by
- Looking at the k closest labeled data points
- Taking a majority vote
telecom.head()
from sklearn.neighbors import KNeighborsClassifier
X = telecom[['total_day_charge', 'total_eve_charge']].values
y = telecom['churn'].values
# .values converts X and y to numpy arrays
print(X.shape, y.shape)
print(f'There are {X.shape[0]} observations of {X.shape[1]} features, and {y.shape[0]} observations of the target feature.')
# Instantiate the algorithm with n_neighbors=15
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)
Predicting on unlabeled data
# inputting new data to test the model
X_new = np.array([[56.8, 17.5], [24.4, 24.1], [50.1, 10.9]])
# predicting the new values
predictions = knn.predict(X_new)
print('Predictions: {}'.format(predictions))
Predicted that the first customer will churn, and the next 2 won't.
Measuring model performance
- In classification, accuracy is a commonly used metric
- Accuracy on the training data is NOT indicative of the model's ability to generalize to unseen data
Computing Accuracy
Split the data into a training set and a test set, fit/train the classifier on the training set, then calculate accuracy on the test set.
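A minimal sketch of that workflow, continuing with the telecom X and y defined above (the 70/30 split, random_state, and n_neighbors values are arbitrary choices, not from the course):
from sklearn.model_selection import train_test_split
# Hold out 30% of the data as a test set; stratify keeps the class balance similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)  # train on the training set only
# For classifiers, .score returns accuracy: the fraction of correct predictions
print(knn.score(X_test, y_test))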