Telecom Customer Churn Prediction
Business Problem: Telecom companies lose revenue when customers churn. We need a model that identifies high-risk customers early for targeted retention actions.
Tools Used: Python, Pandas, Matplotlib, Seaborn, Scikit-learn
Success Metrics:
- Reduce churn rate
- Improve retention campaign ROI
- Prioritize high-value customers
This dataset is derived from a telecom company operating in Iran. It contains real customer-level information including demographics, call usage, SMS frequency, and plan details.
Each row represents one customer.
🔧 Project Workflow
This notebook is structured as follows:
- Data Preview & Understanding
- Exploratory Data Analysis
- Statistical Testing
- Churn Prediction using Machine Learning
- Model Evaluation & Insights
The goal is to translate behavioral data into actionable churn predictions.
1️⃣ Load Dataset
We begin by loading the customer churn dataset into memory and performing an initial structural inspection to understand the shape and completeness of the data.
import pandas as pd
churn = pd.read_csv("data/customer_churn.csv")
print(churn.shape)
churn.head(100)
Data Dictionary
| Column | Explanation |
|---|---|
| Call Failure | number of call failures |
| Complaints | binary (0: No complaint, 1: complaint) |
| Subscription Length | total months of subscription |
| Charge Amount | ordinal attribute (0: lowest amount, 9: highest amount) |
| Seconds of Use | total seconds of calls |
| Frequency of use | total number of calls |
| Frequency of SMS | total number of text messages |
| Distinct Called Numbers | total number of distinct phone numbers called |
| Age Group | ordinal attribute (1: younger age, 5: older age) |
| Tariff Plan | binary (1: Pay as you go, 2: contractual) |
| Status | binary (1: active, 2: non-active) |
| Age | age of customer |
| Customer Value | the calculated value of customer |
| Churn | class label (1: churn, 0: non-churn) |
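As a quick sanity check, the encodings listed in the table can be compared with the values actually present in the data. The sketch below is only illustrative: it runs on a tiny hand-made frame rather than the real CSV, and the `EXPECTED_VALUES` mapping and `find_encoding_violations` helper are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping built from the data dictionary: column -> allowed codes.
EXPECTED_VALUES = {
    "Complaints": {0, 1},
    "Charge Amount": set(range(0, 10)),
    "Age Group": set(range(1, 6)),
    "Tariff Plan": {1, 2},
    "Status": {1, 2},
    "Churn": {0, 1},
}


def find_encoding_violations(df, expected=EXPECTED_VALUES):
    """Return {column: sorted unexpected values} for the columns present in df."""
    violations = {}
    for column, allowed in expected.items():
        if column not in df.columns:
            continue  # skip columns the frame does not contain
        # .tolist() converts NumPy scalars to plain Python numbers
        observed = set(df[column].dropna().unique().tolist())
        unexpected = observed - allowed
        if unexpected:
            violations[column] = sorted(unexpected)
    return violations


# Toy rows (made up for illustration); the second row has an invalid Tariff Plan.
toy = pd.DataFrame(
    {
        "Complaints": [0, 1, 0],
        "Tariff Plan": [1, 3, 2],
        "Churn": [0, 1, 0],
    }
)
violations = find_encoding_violations(toy)
print(violations)
```

On the real data, the same call would be `find_encoding_violations(churn)`, assuming the column names match the dictionary exactly.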
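import matplotlib.pyplot as plt
import seaborn as sns
# Column types and non-null counts
churn.info()
# Summary statistics for the numeric columns
churn.describe()
# Missing values per column
churn.isna().sum()
# Class counts, then class shares in percent
churn['Churn'].value_counts()
churn['Churn'].value_counts(normalize=True) * 100
sns.countplot(x='Churn', data=churn)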
plt.title("Churn Class Distribution")
plt.show()
Before modelling, I always check data completeness, data types, and class balance. Here there are no missing values, but only ~16% of customers churn, so the classes are imbalanced. A model that predicted "no churn" for everyone would already be about 84% accurate, which is why I focus on precision, recall, and F1 rather than accuracy alone.
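To make the point about accuracy concrete, here is a small worked example in plain Python. The counts are made up for illustration (1,000 customers with 160 churners, roughly matching the ~16% share), and `classification_metrics` is a hypothetical helper, not something from the original notebook. It compares a baseline that never predicts churn with an imagined model that catches most churners.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 1,000 customers, 160 of them churners.
# Baseline: always predict "no churn", so every churner is missed.
baseline = classification_metrics(tp=0, fp=0, fn=160, tn=840)

# Imagined model: finds 120 of the 160 churners, with 60 false alarms.
model = classification_metrics(tp=120, fp=60, fn=40, tn=780)

for name, scores in [("baseline", baseline), ("model", model)]:
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

The baseline reaches 84% accuracy while finding no churners at all, so its recall and F1 are zero. The imagined model is only six points more accurate, yet it recovers three quarters of the churners, and only precision, recall, and F1 make that difference visible.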
2️⃣ Exploratory Data Analysis (EDA)
EDA is conducted to find trends or behaviors linked to churn.
We start by analyzing whether customer preferences for SMS versus calls vary by age group, since different communication styles may correlate with churn.
usage_by_age = churn.groupby("Age Group")[["Frequency of SMS", "Frequency of use"]].mean()
print(usage_by_age)
To avoid bias from different group sizes, I look at average usage per customer rather than totals. This gives a fairer comparison of behaviour across age groups.
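A small made-up example shows why group size matters here. The toy frame below is not the real data: it has four customers in age group 2 and one in age group 4, and it reuses the notebook's column names. Totals make the larger group look like the heavier SMS user, while per-customer means show the opposite.

```python
import pandas as pd

# Toy data (invented for illustration): group 2 has four customers, group 4 has one.
toy = pd.DataFrame(
    {
        "Age Group": [2, 2, 2, 2, 4],
        "Frequency of SMS": [10, 20, 10, 20, 50],
        "Frequency of use": [30, 40, 30, 40, 20],
    }
)

grouped = toy.groupby("Age Group")[["Frequency of SMS", "Frequency of use"]]

# Totals grow with the number of customers in each group.
totals = grouped.sum()
# Means give usage per customer, which is comparable across groups.
averages = grouped.mean()
# Group sizes explain the gap between the two views.
sizes = toy.groupby("Age Group").size()

print(totals)
print(averages)
print(sizes)
```

Printing the group sizes next to the means is a cheap habit worth keeping: a mean computed from very few customers is fairer than a total, but it is also noisier.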