
Supervised Learning with scikit-learn

Run the hidden code cell below to import the data used in this course.

# Importing pandas
import pandas as pd

# Importing the course datasets 
diabetes = pd.read_csv('datasets/diabetes_clean.csv')
music = pd.read_csv('datasets/music_clean.csv')
advertising = pd.read_csv('datasets/advertising_and_sales_clean.csv')
telecom = pd.read_csv("datasets/telecom_churn_clean.csv")

CROSS-VALIDATION IN SCIKIT-LEARN

  • Data is divided into k "folds."
  • first, the first fold is held out as test data and the metric of interest is computed after training on the other (k-1) folds
  • then, the second fold is held out as test data and the metric is computed the same way
  • this process is repeated for all k folds
  • using more folds is more computationally expensive


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=6, shuffle=True, random_state=42)
# shuffle=True means that the dataset is shuffled before it is split into folds
# random_state sets a seed so that the data is split in the same way each run, making the process reproducible
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)
# The score reported is R^2, as this is the default for regression
print(cv_results)  # Prints the score from each fold
print(np.mean(cv_results), np.std(cv_results))  # Prints the mean and standard deviation of the fold scores
print(np.quantile(cv_results, [0.025, 0.975]))  # Prints the 95% confidence interval, i.e. the range of likely mean values calculated from a sample

REGULARIZED REGRESSION

  • Linear regression works by minimizing a loss function
  • Large coefficients in y = ax + b can lead to overfitting
  • regularization penalizes large coefficients
  1. RIDGE REGRESSION

  • Ridge penalizes large positive or negative coefficients
  • alpha - hyperparameter chosen by us (similar to k in KNN)
  • alpha controls model complexity
  • if alpha is very large - coefficients are heavily penalized, which can lead to underfitting
  • if alpha is very small - little penalization, which can lead to overfitting
from sklearn.linear_model import Ridge

scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    scores.append(ridge.score(X_test, y_test))  # R^2 on the test set
print(scores)
  2. LASSO REGRESSION

  • can select important features of a dataset
  • it shrinks the coefficients of less important features to zero
  • features not shrunk to zero are selected by lasso
from sklearn.linear_model import Lasso

scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    scores.append(lasso.score(X_test, y_test))  # R^2 on the test set
print(scores)

# LASSO FOR FEATURE SELECTION

from sklearn.linear_model import Lasso

X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
names = diabetes_df.drop("glucose", axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coeff = lasso.fit(X, y).coef_
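A common next step is to plot these coefficients to see which features lasso kept. This is a sketch with hypothetical toy data and feature names standing in for the diabetes columns:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy data standing in for the diabetes features above
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=100)
names = ["feature_0", "feature_1", "feature_2", "feature_3"]

lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_

# Features whose bars are shrunk to zero were dropped by lasso;
# the non-zero ones are the selected features
plt.bar(names, lasso_coef)
plt.xticks(rotation=45)
plt.show()
```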

MODEL CLASSIFICATION METRICS

  1. ACCURACY
  • Accuracy = correctly classified samples / total number of samples
  • not always useful (e.g. a model predicting fraudulent transactions could be 99% accurate by always predicting "legitimate", yet catch no fraud)
  • this is called class imbalance, where the frequency of classes is uneven
  2. CONFUSION MATRIX
  • Plots the actual values on the y-axis and the predicted values on the x-axis; used to derive the metrics below

    a) Accuracy:

  • Accuracy = (tp + tn) / (tp + tn + fp + fn)

    b) Precision:

  • Precision = true positives / (true positives + false positives)
  • high precision = low false positive rate (i.e. fewer legitimate transactions predicted as fraudulent)

    c) Recall:

  • Recall = true positives / (true positives + false negatives)
  • high recall = low false negative rate (i.e. most fraudulent transactions predicted correctly)

    d) F1 Score:

  • harmonic mean (HM) of precision and recall
  • F1 Score = (2 * precision * recall) / (precision + recall)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # Reports precision, recall, F1 score, and support per class
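As a quick sanity check on the formulas above, the four confusion-matrix counts and the derived metrics can be computed by hand on toy labels (hypothetical data, not the course datasets):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy binary labels: 1 = fraudulent, 0 = legitimate (hypothetical data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 2) = 0.6
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values agree with scikit-learn's helpers
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```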

LOGISTIC REGRESSION

  • used for binary classification
  • logistic regression uses probabilities
  • if p>0.5 ; model returns 1 for target prediction (default probability threshold = 0.5)
  • if p<0.5 ; model returns 0 for target prediction
  • it produces a linear decision boundary

When the default probability threshold is changed, the true positive and false positive rates change with it.

RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE: used to visualize how different threshold values affect the true positive and false positive rates

ROC AUC (Area Under the Curve): used to quantify the model's performance based on the ROC curve

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# predict_proba returns probabilities for both classes; [:, 1] keeps the positive class
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[0])

# Plotting the ROC Curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)  # fpr - false positive rates, tpr - true positive rates
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal = a model that guesses at random
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.show()

#Calculating AUC
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

HYPERPARAMETER TUNING

  • hyperparameters are parameters we specify before fitting the model (like alpha and n_neighbors)
  • try many different values, fit a model for each separately, see how they perform, and choose the best-performing values
  • use cross-validation to avoid overfitting to the test set
  1. GRID SEARCH CROSS-VALIDATION: tries every combination of the hyperparameter values supplied
  2. RANDOMIZED SEARCH CV: picks random hyperparameter values instead of trying them all, which is cheaper for large grids
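The grid search approach can be sketched with scikit-learn's GridSearchCV; toy data stands in for the course datasets, and the alpha grid is just an example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Hypothetical toy regression data standing in for the course datasets
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

kf = KFold(n_splits=6, shuffle=True, random_state=42)
param_grid = {"alpha": [0.1, 1.0, 10.0, 100.0, 1000.0]}

# Fits one Ridge model per alpha per fold and keeps the alpha with the best mean R^2
grid = GridSearchCV(Ridge(), param_grid, cv=kf)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

RandomizedSearchCV has the same interface but takes param_distributions plus an n_iter argument, sampling that many hyperparameter settings instead of trying them all.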