Churn Classification Model Shootout
This dataset contains account information for phone plan customers, along with a column indicating whether each customer churned. Using the data compiled for each account, can we build a predictive model that tells us whether a customer will churn? Below is some data cleaning, EDA, and model testing!
# Import Packages and Modules
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
#From Scikit Learn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
Read in Churn Calls dataset as a dataframe called Churn
#Import csv into Pandas DataFrame called Churn
Churn = pd.read_csv('Churn_Calls.csv', sep = ",")
Churn.head()
Churn.dtypes
Set the target variable to churn and move it to the first column.
# designate target variable name
targetName = 'churn'
#print(targetName)
targetSeries = Churn[targetName]
#remove target from current location and insert in column number 0
del Churn[targetName]
Churn.insert(0, targetName, targetSeries)
#reprint dataframe and see target is in position 0
Churn.head(10)
#Check for NaN values
Churn.isna().any()
Exploratory Data Analysis
#Create a bar chart of our target variable
groupby = Churn.groupby(targetName)
targetEDA=groupby[targetName].aggregate(len)
print(targetEDA)
plt.figure()
targetEDA.plot(kind='bar', grid=False)
plt.axhline(0, color='k')
#Describe the database
Churn.describe()
#Check out the variable correlation
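#Create correlation matrix (numeric columns only; recent pandas versions raise an error on non-numeric columns)
corr_matrix = Churn.iloc[:,1:].select_dtypes(include='number').corr()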
corr_matrix
plt.figure(figsize=(15,15)) #need to adjust size as needed.
mask = np.zeros_like(corr_matrix, dtype=bool) #the np.bool alias is missing from some NumPy releases; builtin bool always works
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr_matrix,
            vmin=-1,
            vmax=1,
            cmap='coolwarm',
            annot=True,
            mask=mask)
plt.show()
We can see that roughly 14% of the customers (707/5000) in our dataset have churned. Because churners are such a small share, our model will need to be very precise to catch which customers are a churn risk. Since roughly 86% of our customers don't churn, a model that always predicts "no churn" would already be about 86% accurate, so we need predictive performance greater than 86% for this model to be useful.
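The baseline arithmetic can be checked with a few lines of pandas. This is only a sketch: `churn_flags` is a hypothetical stand-in for the target column, built from the 707/5000 counts quoted above, and the 0/1 encoding is an assumption about how churn is stored.

```python
import pandas as pd

# Hypothetical stand-in for the target column: 707 churners out of 5000
# accounts, matching the counts quoted in the text. 1 = churned, 0 = stayed
# (this encoding is an assumption, not taken from the dataset).
churn_flags = pd.Series([1] * 707 + [0] * (5000 - 707), name="churn")

# Share of customers who churned
churn_rate = churn_flags.mean()

# Accuracy of a model that always predicts the majority class ("no churn")
baseline_accuracy = churn_flags.value_counts(normalize=True).max()

print(f"Churn rate: {churn_rate:.1%}")                              # 14.1%
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")  # 85.9%
```

Any candidate model should beat this majority-class baseline on accuracy. With classes this imbalanced, recall and precision on the churn class are also worth comparing.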
We can also see from the correlation matrix that there is not much correlation between most of the variables, but a handful are heavily correlated with each other. For example, Total Daily Charge and Total Daily Min are highly correlated, which makes sense since the charge is presumably computed from the minutes. The same goes for Total Night Charge and Total Night Min. These pairs may act like duplicate variables.
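One common way to handle near-duplicate variables is to scan the upper triangle of the absolute correlation matrix and drop one column from each highly correlated pair. The sketch below uses synthetic data, not the Churn dataset. The column names, the per-minute rates, and the 0.95 threshold are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: each charge column is an exact multiple of its
# minutes column, mimicking the minutes/charge pairs described above.
# Column names and per-minute rates are hypothetical.
rng = np.random.default_rng(0)
day_minutes = rng.uniform(0, 350, size=200)
night_minutes = rng.uniform(0, 400, size=200)
features = pd.DataFrame({
    "total_day_minutes": day_minutes,
    "total_day_charge": 0.17 * day_minutes,
    "total_night_minutes": night_minutes,
    "total_night_charge": 0.045 * night_minutes,
    "customer_service_calls": rng.integers(0, 10, size=200),
})

# Absolute correlations, keeping only the strict upper triangle so that
# each pair of columns is examined exactly once
abs_corr = features.corr().abs()
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# Flag the later column of any pair whose correlation exceeds the threshold
threshold = 0.95  # illustrative cutoff, tune as needed
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Kept:", list(reduced.columns))
```

On this synthetic frame the two charge columns are flagged and the minutes columns are kept. Which member of each pair to keep is a judgment call, since either one carries the same information.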