
Problem: classify the churn status of customers of a telecommunications company.

Here we will make use of a churn dataset, in which we seek to classify the churn status of customers of a telecommunications company. For this, we use the k-Nearest Neighbors (KNN) algorithm.

k-Nearest Neighbors

The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.

For more details, see: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
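
To make the idea of "proximity" concrete, below is a minimal sketch of the KNN vote itself, using made-up points rather than the churn data (the helper knn_predict and the toy arrays are purely illustrative):

import numpy as np

# Toy training data: two features per point, binary labels (hypothetical values)
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]])
y_toy = np.array([0, 0, 1, 1, 0])

def knn_predict(x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_toy - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return np.bincount(y_toy[nearest]).argmax()

print(knn_predict(np.array([1.1, 1.0])))  # 0: its 3 nearest neighbors are all labelled 0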

import pandas as pd
import numpy as np
import warnings

# Importing the course datasets 
diabetes = pd.read_csv('datasets/diabetes_clean.csv')
music = pd.read_csv('datasets/music_clean.csv')
advertising = pd.read_csv('datasets/advertising_and_sales_clean.csv')
telecom = pd.read_csv("datasets/telecom_churn_clean.csv")

k-Nearest Neighbors: Fit

In this exercise, you will build your first classification model using the churn_df dataset, which has been preloaded for the remainder of the chapter.

The features to use will be "account_length" and "customer_service_calls". The target, "churn", needs to be a single column with the same number of observations as the feature data.

You will convert the features and the target variable into NumPy arrays, create an instance of a KNN classifier, and then fit it to the data.

numpy has also been preloaded for you as np.

Instructions:

  • Import KNeighborsClassifier from sklearn.neighbors.
  • Create an array called X containing values from the "account_length" and "customer_service_calls" columns, and an array called y for the values of the "churn" column.
  • Instantiate a KNeighborsClassifier called knn with 6 neighbors.
  • Fit the classifier to the data using the .fit() method.

# Display settings and warning filters
pd.set_option('display.expand_frame_repr', False)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Load the churn dataset and preview the first rows
churn_df = pd.read_csv('./datasets/telecom_churn_clean.csv')
churn_df.head()
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the target variable
X = churn_df[["account_length", "customer_service_calls"]].values
y = churn_df["churn"].values


# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)
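
As a quick sanity check (an addition, not part of the exercise), the fitted classifier exposes .kneighbors(), which returns the distances to, and row indices of, the k nearest training points for any query:

# Distances and indices of the 6 nearest training points
# for the first observation in X
distances, indices = knn.kneighbors(X[:1])
print(distances)
print(indices)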

k-Nearest Neighbors: Predict

Now that you have fit a KNN classifier, you can use it to predict the labels of new data points. All available data was used for training; fortunately, however, there are new observations available. These have been preloaded for you as X_new.

The model knn, which you created and fit to the data in the last exercise, has been preloaded for you. You will use your classifier to predict the labels of a set of new data points:

X_new = np.array([[30.0, 17.5], [107.0, 24.1], [213.0, 10.9]])

Instructions:

  • Create y_pred by predicting the target values of the unseen features X_new.
  • Print the predicted labels for the set of predictions.

X_new = np.array([[30.0, 17.5],
                  [107.0, 24.1],
                  [213.0, 10.9]])

# Predict the labels for the X_new
y_pred = knn.predict(X_new)

# Print the predictions for X_new
print("Predictions: {}".format(y_pred))   

0 indicates no churn, while 1 indicates churn.
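
Beyond hard labels, KNeighborsClassifier also provides .predict_proba(), which here amounts to the fraction of the 6 neighbors voting for each class (a quick sketch, not part of the exercise):

# Class membership probabilities for each new point;
# columns are ordered as in knn.classes_
print(knn.predict_proba(X_new))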

Measuring model performance

In classification, accuracy is a commonly used metric:

Accuracy = (correct predictions) / (total observations)
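
For instance, with hypothetical labels and predictions, accuracy is simply the mean of the correct-prediction indicator:

# Hypothetical labels and predictions to illustrate the formula
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# 4 of 5 predictions are correct -> accuracy = 0.8
print(np.mean(y_true == y_hat))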

As k increases, the decision boundary becomes less affected by individual observations, reflecting a simpler model. Simpler models are less able to detect relationships in the dataset, which is known as underfitting.

On the other hand, complex models can be sensitive to noise in the training data rather than reflecting general trends. This is known as overfitting.

Model complexity and over/underfitting:

from sklearn.model_selection import train_test_split

# Split the data into training and test sets, preserving the
# churn proportion in each split (stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=21, stratify=y)

# Fit the classifier on the training set only
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# Accuracy on the unseen test set
print(knn.score(X_test, y_test))
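
A quick check (an addition, not from the exercise) of what stratify=y does: the churn proportion is preserved across the splits:

# stratify=y keeps the class balance roughly equal in both splits
print(np.mean(y), np.mean(y_train), np.mean(y_test))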

We can also interpret k using a model complexity curve. As k increases beyond the peak, test accuracy falls away as the model begins to underfit. The peak test accuracy occurs at 13 neighbors.
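
A sketch of how such a model complexity curve can be produced (the exact range of k values is an assumption):

import matplotlib.pyplot as plt

# Evaluate train and test accuracy for a range of k values
neighbors = np.arange(1, 21)
train_accuracies = {}
test_accuracies = {}

for n in neighbors:
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    train_accuracies[n] = knn.score(X_train, y_train)
    test_accuracies[n] = knn.score(X_test, y_test)

# Plot accuracy as a function of model complexity (k)
plt.title("KNN: Varying Number of Neighbors")
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()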