
Supervised Learning with scikit-learn

Run the hidden code cell below to import the data used in this course.

# Importing pandas
import pandas as pd

# Importing the course datasets 
diabetes = pd.read_csv('datasets/diabetes_clean.csv')
music = pd.read_csv('datasets/music_clean.csv')
advertising = pd.read_csv('datasets/advertising_and_sales_clean.csv')
telecom = pd.read_csv("datasets/telecom_churn_clean.csv")

CROSS-VALIDATION IN SCIKIT-LEARN

  • Data is divided into k "folds."
  • first, the first fold is held out as test data and the metric of interest is computed after training on the other (k-1) folds
  • then, the second fold is held out as test data and the metric is computed the same way
  • this process is repeated for all k folds
  • using more folds is more computationally expensive


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=6, shuffle=True, random_state=42)
# shuffle=True means that the dataset is shuffled before it is split into folds
# random_state sets a seed so that the data is split in the same way each run, making the process reproducible
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)
# The score reported is R^2, as this is the default for regression
print(cv_results)  # Prints the score from each fold
print(np.mean(cv_results), np.std(cv_results))  # Prints the mean and standard deviation of the fold scores
print(np.quantile(cv_results, [0.025, 0.975]))  # Prints the 95% confidence interval, i.e. the range of likely mean values calculated from a sample

REGULARIZED REGRESSION

  • Linear regression works by minimizing a loss function
  • Large coefficients in y = ax + b can lead to overfitting
  • regularization penalizes large coefficients
  1. RIDGE REGRESSION

  • Ridge penalizes large positive or negative coefficients
  • alpha - hyperparameter chosen by us (similar to k in KNN)
  • alpha controls model complexity
  • if alpha is very large - coefficients are heavily penalized, which can lead to underfitting
  • if alpha is very small - little penalization, which can lead to overfitting
from sklearn.linear_model import Ridge

scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    scores.append(ridge.score(X_test, y_test))  # R^2 on the test set
print(scores)
  2. LASSO REGRESSION

  • can select important features of a dataset
  • it shrinks the coefficients of less important features to zero
  • features not shrunk to zero are selected by lasso
from sklearn.linear_model import Lasso

scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    scores.append(lasso.score(X_test, y_test))  # R^2 on the test set
print(scores)

# LASSO FOR FEATURE SELECTION

from sklearn.linear_model import Lasso

X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
names = diabetes_df.drop("glucose", axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coeff = lasso.fit(X, y).coef_
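A common next step is to plot these coefficients to see which features lasso kept. This is a sketch with hypothetical toy data and feature names standing in for the diabetes columns:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy data standing in for the diabetes features above
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=100)
names = ["feature_0", "feature_1", "feature_2", "feature_3"]

lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_

# Features whose bars are shrunk to zero were dropped by lasso;
# the non-zero ones are the selected features
plt.bar(names, lasso_coef)
plt.xticks(rotation=45)
plt.show()
```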

MODEL CLASSIFICATION METRICS

  1. ACCURACY
  • Accuracy = correctly classified samples / total number of samples
  • not always useful (e.g. a model predicting fraudulent transactions could be 99% accurate by always predicting "legitimate", yet catch no fraud)
  • this is called class imbalance, where the frequency of classes is uneven
  2. CONFUSION MATRIX
  • Plots the actual values on the y-axis and the predicted values on the x-axis; used to derive the metrics below

    a) Accuracy:

  • Accuracy = (tp + tn) / (tp + tn + fp + fn)

    b) Precision:

  • Precision = true positives / (true positives + false positives)
  • high precision = low false positive rate (i.e. fewer legitimate transactions predicted as fraudulent)

    c) Recall:

  • Recall = true positives / (true positives + false negatives)
  • high recall = low false negative rate (i.e. most fraudulent transactions predicted correctly)

    d) F1 Score:

  • harmonic mean (HM) of precision and recall
  • F1 Score = (2 * precision * recall) / (precision + recall)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # Reports precision, recall, F1 score, and support per class
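As a quick sanity check on the formulas above, the four confusion-matrix counts and the derived metrics can be computed by hand on toy labels (hypothetical data, not the course datasets):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy binary labels: 1 = fraudulent, 0 = legitimate (hypothetical data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 2) = 0.6
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values agree with scikit-learn's helpers
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```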

LOGISTIC REGRESSION

  • used for binary classification
  • logistic regression uses probabilities
  • if p>0.5 ; model returns 1 for target prediction (default probability threshold = 0.5)
  • if p<0.5 ; model returns 0 for target prediction
  • it produces a linear decision boundary

When the default probability threshold is changed, the true positive and false positive rates change with it.

RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE: used to visualize how different threshold values affect the true positive and false positive rates

ROC AUC (Area Under the Curve): used to quantify the model's performance based on the ROC curve

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# predict_proba returns probabilities for both classes; [:, 1] keeps the positive class
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[0])

# Plotting the ROC Curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)  # fpr - false positive rates, tpr - true positive rates
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal = a model that guesses at random
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.show()

#Calculating AUC
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

HYPERPARAMETER TUNING

  • hyperparameters are parameters we specify before fitting the model (like alpha and n_neighbors)
  • try many different values, fit a model for each separately, see how they perform, and choose the best-performing values
  • use cross-validation to avoid overfitting to the test set
  1. GRID SEARCH CROSS-VALIDATION: tries every combination of the hyperparameter values supplied
  2. RANDOMIZED SEARCH CV: picks random hyperparameter values instead of trying them all, which is cheaper for large grids
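The grid search approach can be sketched with scikit-learn's GridSearchCV; toy data stands in for the course datasets, and the alpha grid is just an example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Hypothetical toy regression data standing in for the course datasets
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

kf = KFold(n_splits=6, shuffle=True, random_state=42)
param_grid = {"alpha": [0.1, 1.0, 10.0, 100.0, 1000.0]}

# Fits one Ridge model per alpha per fold and keeps the alpha with the best mean R^2
grid = GridSearchCV(Ridge(), param_grid, cv=kf)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

RandomizedSearchCV has the same interface but takes param_distributions plus an n_iter argument, sampling that many hyperparameter settings instead of trying them all.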