The telecommunications (telecom) sector in India is changing rapidly, with new telecom businesses being created and many customers switching between providers. "Churn" refers to customers or subscribers ceasing to use a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve their service quality and customer satisfaction. As the data scientist on this project, you will explore customer behavior and demographics in the Indian telecom sector to predict customer churn, using two datasets covering four major telecom partners: Airtel, Reliance Jio, Vodafone, and BSNL.
`telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable | Description |
|---|---|
| customer_id | Unique identifier for each customer. |
| telecom_partner | The telecom partner associated with the customer. |
| gender | The gender of the customer. |
| age | The age of the customer. |
| state | The Indian state in which the customer is located. |
| city | The city in which the customer is located. |
| pincode | The pincode of the customer's location. |
| registration_event | When the customer registered with the telecom partner. |
| num_dependents | The number of dependents (e.g., children) the customer has. |
| estimated_salary | The customer's estimated salary. |

`telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable | Description |
|---|---|
| customer_id | Unique identifier for each customer. |
| calls_made | The number of calls made by the customer. |
| sms_sent | The number of SMS messages sent by the customer. |
| data_used | The amount of data used by the customer. |
| churn | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned). |

# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Read the demographics dataset
telecom_partner = pd.read_csv('telecom_demographics.csv')
display(telecom_partner)
print(telecom_partner.info())
#print and check objects
print(telecom_partner['telecom_partner'].unique())
print(telecom_partner.gender.unique())
print(telecom_partner.registration_event.unique())
print(telecom_partner.state.unique())
print(telecom_partner.city.unique())
#convert objects to categories
for col in telecom_partner.columns:
    if telecom_partner[col].dtype == object:
        telecom_partner[col] = telecom_partner[col].astype('category')
#keep registration event as category
#print final cleaned dataset info
print(telecom_partner.info())
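
The conversion loop above can also be written without an explicit loop. The following is a minimal sketch on a made-up frame (`demo` and its values are hypothetical, not the project data), assuming the text columns are the ones to convert. The `"string"` entry is included on the assumption that some newer pandas versions store text columns with a string dtype rather than `object`.

```python
import pandas as pd

# Hypothetical toy frame standing in for telecom_demographics.csv
demo = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "telecom_partner": ["Airtel", "BSNL", "Airtel"],
    "gender": ["F", "M", "F"],
    "age": [25, 41, 33],
})

# Select the text columns (object or string dtype) in one call
text_cols = demo.select_dtypes(include=["object", "string"]).columns

# Convert all of them to the category dtype at once
demo[text_cols] = demo[text_cols].astype("category")

print(demo.dtypes)
```

Numeric columns such as `age` are left untouched, which matches what the loop does.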
#calculate summary stats to see outliers
print(telecom_partner.describe())

Load and clean telecom_usage
#load telecom usage dataset and check datatypes and missing values
telecom_usage=pd.read_csv('telecom_usage.csv')
print(telecom_usage.info())

Merge the two files, turn categories into dummies, identify features and target, and scale the features
#merge the two files into churn_df
churn_df = telecom_partner.merge(telecom_usage, on='customer_id')
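
The merge above is an inner join on `customer_id`, so duplicated or unmatched ids would silently add or drop rows. As a hedged sketch on hypothetical toy tables (`demographics` and `usage` are made-up names and values), `validate` and `indicator` can be used to check both cases:

```python
import pandas as pd

# Hypothetical toy tables sharing the customer_id key
demographics = pd.DataFrame({"customer_id": [1, 2, 3], "age": [25, 41, 33]})
usage = pd.DataFrame({"customer_id": [1, 2, 4], "churn": [0, 1, 0]})

# validate="one_to_one" raises a MergeError if customer_id repeats on either side
merged = demographics.merge(usage, on="customer_id", how="inner", validate="one_to_one")

# An outer merge with indicator=True adds a _merge column showing where each row came from
audit = demographics.merge(usage, on="customer_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(merged), "matched rows;", len(unmatched), "unmatched ids")
```
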
#calculate churn rate and print
churn_rate = churn_df['churn'].value_counts() / len(churn_df)
print(churn_rate)
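
The churn rate also sets a baseline for the models: a classifier that always predicts the majority class already reaches that class's share as accuracy. A small sketch on a made-up `churn` series (the values are illustrative only), using `value_counts(normalize=True)` as an equivalent to dividing by the length:

```python
import pandas as pd

# Hypothetical churn labels: 8 customers stayed, 2 churned
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 0, 0], name="churn")

# normalize=True returns proportions instead of raw counts
churn_share = churn.value_counts(normalize=True)

# Accuracy of a model that always predicts the most common class
majority_baseline = churn_share.max()

print(churn_share)
print("Majority-class baseline accuracy:", majority_baseline)
```
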
# Turn categories into dummies
churn_df = pd.get_dummies(churn_df, columns=['telecom_partner', 'gender', 'state', 'city', 'registration_event'], drop_first=True)
# Define the feature set: drop customer_id and churn
X = churn_df.drop(['customer_id', 'churn'], axis=1)
cols=X.columns
print(cols)
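
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame (`toy` and its values are made up). Each encoded column loses its first level, which becomes the implicit reference category, and the untouched columns come first in the result.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "telecom_partner": ["Airtel", "BSNL", "Vodafone"],
    "age": [25, 41, 33],
})

# One-hot encode the categorical columns, dropping the first level of each
encoded = pd.get_dummies(toy, columns=["gender", "telecom_partner"], drop_first=True)

# "gender_F" and "telecom_partner_Airtel" are absent: they are the reference levels
print(list(encoded.columns))
```

With columns that have many distinct values, such as `city`, the number of dummy columns grows by one per extra level, so printing the column list is a quick way to see how wide the feature set has become.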
# Define target: churn
y = churn_df.churn
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Logistic regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
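
Note that the scaler was fit on all rows before this split, so the test rows contribute to the scaling statistics. One possible alternative, sketched here on synthetic data from `make_classification` (not the project data), is to split first and let a pipeline fit the scaler on the training rows only. The `stratify` argument is an added assumption that keeps the churn ratio similar in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn features and target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Split the unscaled data; stratify keeps the class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits StandardScaler on the training rows only,
# then applies the same transformation when predicting on the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```
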
# Fit logistic regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# Print confusion matrix and classification report
conf_mat = confusion_matrix(y_test, y_pred)
print(conf_mat)
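
For a binary target, scikit-learn lays out the confusion matrix with true labels as rows and predicted labels as columns, so `ravel()` yields true negatives, false positives, false negatives, and true positives in that order. A short sketch on hypothetical labels (`y_true` and `y_hat` are made up) shows how precision and recall for the churn class follow from those counts:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = churned, 0 = not churned)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# Unpack the 2x2 matrix: rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: share of predicted churners who really churned
precision = tp / (tp + fp)
# Recall: share of actual churners the model caught
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
```
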
rep = classification_report(y_test, y_pred)
print(rep)

Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Fit random forest classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
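# Get feature importances, labelled with the feature names saved in cols
feature_imp = pd.Series(rf.feature_importances_, index=cols)
# Keep the ten most important features (not just the first ten columns)
top_features = feature_imp.nlargest(10)
print(top_features)
sns.barplot(x=top_features.values, y=list(top_features.index))
plt.show()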
# Print confusion matrix and report
conf_mat = confusion_matrix(y_test, rf_pred)
print(conf_mat)
rep = classification_report(y_test, rf_pred)
print(rep)
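
With both models fitted, their test accuracy can be compared directly. The sketch below is an illustration on synthetic data only (the `models` dictionary, `scores`, and `higher_accuracy` are hypothetical names, and the numbers it produces say nothing about the telecom data); on the real project the same comparison would use `y_test`, `y_pred`, and `rf_pred`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled churn features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The two model families used in this notebook
models = {
    "logreg": LogisticRegression(random_state=42),
    "rf": RandomForestClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Name of the model with the higher test accuracy
higher_accuracy = max(scores, key=scores.get)
print(scores, higher_accuracy)
```

Accuracy alone can be misleading when few customers churn, so it is worth reading it alongside the classification reports printed above.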