
The telecommunications (telecom) sector in India is changing rapidly: new telecom businesses keep entering the market, and many customers switch between providers. "Churn" refers to customers or subscribers ceasing to use a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you will explore how customer behavior and demographics in the Indian telecom sector relate to churn, using two datasets from four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

  • telecom_demographics.csv contains information related to Indian customer demographics:
Variable            Description
customer_id         Unique identifier for each customer.
telecom_partner     The telecom partner associated with the customer.
gender              The gender of the customer.
age                 The age of the customer.
state               The Indian state in which the customer is located.
city                The city in which the customer is located.
pincode             The pincode of the customer's location.
registration_event  When the customer registered with the telecom partner.
num_dependents      The number of dependents (e.g., children) the customer has.
estimated_salary    The customer's estimated salary.
  • telecom_usage.csv contains information about the usage patterns of Indian customers:
Variable            Description
customer_id         Unique identifier for each customer.
calls_made          The number of calls made by the customer.
sms_sent            The number of SMS messages sent by the customer.
data_used           The amount of data used by the customer.
churn               Binary variable indicating whether the customer has churned (1 = churned, 0 = not churned).
# Import libraries and methods/functions
import pandas as pd

Task overview

Does Logistic Regression or Random Forest produce a higher accuracy score in predicting telecom churn in India?

Task 1

Load the two CSV files into separate DataFrames. Merge them into a DataFrame named churn_df. Calculate the proportion of customers who have churned, and identify the categorical variables in churn_df.
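The steps in Task 1 can be sketched on a tiny synthetic example. The two frames below are made-up stand-ins for the CSV files, and the names `demographics`, `usage`, `toy_df`, `churn_rate`, and `categorical_cols` are illustrative, not part of the project. The `validate` argument is an optional safeguard, not something the task requires.

```python
import pandas as pd

# Tiny synthetic stand-ins for the two CSV files (all values are made up).
demographics = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "telecom_partner": ["Airtel", "BSNL", "Vodafone", "Reliance Jio"],
    "age": [25, 41, 33, 58],
})
usage = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "calls_made": [10, 55, 23, 7],
    "churn": [0, 1, 0, 0],
})

# validate="one_to_one" raises an error if customer_id repeats on either side.
toy_df = demographics.merge(usage, on="customer_id", validate="one_to_one")

# For a 0/1 target, the column mean equals the share of churned customers.
churn_rate = toy_df["churn"].mean()

# Non-numeric columns are the candidates for categorical encoding.
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(churn_rate)
print(categorical_cols)
```

On the real data, the same calls would run on the frames read with `pd.read_csv`.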

Task 2

Convert the categorical features in churn_df into numeric form (for example, by one-hot encoding). Then separate the features from the target, scale the features, and store the result as features_scaled. Define your scaled features and target variable for the churn prediction model.
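A minimal sketch of Task 2 as worded, on made-up data. The frame `toy_df` and the names `encoded`, `features`, and `target` are assumptions for illustration; only `features_scaled` comes from the task text. Note that this version fits the scaler on all rows before any split, so test-set statistics influence the scaling; fitting the scaler on the training rows only avoids that.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data with one categorical and one numeric feature.
toy_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "telecom_partner": ["Airtel", "BSNL", "Airtel", "Vodafone"],
    "calls_made": [10.0, 20.0, 30.0, 40.0],
    "churn": [0, 1, 0, 0],
})

# One-hot encode the text column; dtype=int keeps the dummy columns numeric.
encoded = pd.get_dummies(toy_df, columns=["telecom_partner"], dtype=int)

# Drop the identifier and the target before scaling.
features = encoded.drop(columns=["customer_id", "churn"])
target = encoded["churn"]

# StandardScaler centres each column at 0 and rescales it to unit variance.
features_scaled = StandardScaler().fit_transform(features)

print(features_scaled.shape)
```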

Task 3

Split the processed data into training and testing sets named X_train, X_test, y_train, and y_test, using an 80-20 split and a random state of 42 for reproducibility.
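The split can be illustrated with synthetic arrays. The names `X_toy`, `y_toy`, and the `_tr`/`_te` suffixes are placeholders, and `stratify` is an optional addition that the task text does not ask for; it keeps the churn share the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up rows with 2 features each; 5 of the 20 labels are positive.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# test_size=0.2 holds out 4 of the 20 rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With stratification, both parts keep the original 25% share of positive labels.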

Task 4

Train Logistic Regression and Random Forest Classifier models, setting a random seed of 42. Store the model predictions in logreg_pred and rf_pred. Assess the models on the test data. Assign the name of the model with the higher accuracy ("LogisticRegression" or "RandomForest") to higher_accuracy.
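The comparison step can be sketched with made-up labels. The lists `y_true`, `pred_a`, and `pred_b` are hypothetical stand-ins for the test labels and the two models' predictions; only the two label strings come from the task text.

```python
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only.
y_true = [0, 1, 0, 0, 1]
pred_a = [0, 1, 0, 1, 1]  # hypothetical first model: 4 of 5 correct
pred_b = [0, 0, 0, 0, 0]  # hypothetical second model: 3 of 5 correct

# Accuracy is the fraction of predictions that match the true labels.
acc_a = accuracy_score(y_true, pred_a)
acc_b = accuracy_score(y_true, pred_b)

# The label strings follow the task wording exactly (no spaces).
best = "LogisticRegression" if acc_a > acc_b else "RandomForest"

print(acc_a, acc_b, best)
```

With this conditional, a tie falls to the second label.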

import pandas as pd
import numpy as np


# Load the two CSV files into separate DataFrames
df_customers = pd.read_csv('telecom_demographics.csv')
df_usage = pd.read_csv('telecom_usage.csv')


# Merge demographics and usage on the shared customer identifier
churn_df = pd.merge(df_customers, df_usage, on='customer_id')


# Share of customers in each churn class (1 = churned, 0 = not churned)
churn_proportion = churn_df['churn'].value_counts(normalize=True)
print("Churn proportion:\n", churn_proportion)


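# Separate the features from the target; customer_id carries no predictive signal
X = churn_df.drop(['churn', 'customer_id'], axis=1)
y = churn_df['churn']

# Identify the categorical variables
categorical_vars = X.select_dtypes(include=['object', 'category']).columns.tolist()

# One-hot encode the categorical variables
X_encoded = pd.get_dummies(X, columns=categorical_vars, drop_first=True)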

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)


from sklearn.preprocessing import StandardScaler



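# Work on explicit copies so the column assignments below do not
# trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

numeric_cols = X_train.select_dtypes(include=['int64', 'float64']).columns

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])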



from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# Logistic Regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)



logreg_acc = accuracy_score(y_test, logreg_pred)
rf_acc = accuracy_score(y_test, rf_pred)

# The task expects the exact labels "LogisticRegression" or "RandomForest"
higher_accuracy = "LogisticRegression" if logreg_acc > rf_acc else "RandomForest"

print(f"Logistic Regression Accuracy: {logreg_acc:.4f}")
print(f"Random Forest Accuracy: {rf_acc:.4f}")
print("Higher Accuracy Model:", higher_accuracy)