Telco Customer Churn Prediction
Project Overview
Customer churn, the loss of customers to competitors, is one of the biggest challenges in the telecommunications industry. Acquiring new customers is often five times more expensive than retaining existing ones.
This project aims to analyze historical customer data and build a predictive model that identifies customers most likely to churn. By recognizing these at-risk customers early, the company can take proactive actions such as offering promotions, improving service quality, or providing personalized engagement, ultimately reducing churn and increasing customer lifetime value.
Problem Statement
The objective of this project is to predict customer churn based on demographics, account information, and service usage patterns.
Specifically, we aim to:
- Understand the factors contributing to customer churn through exploratory data analysis (EDA).
- Build and compare machine learning models to classify whether a customer is likely to churn.
- Evaluate the models using key performance metrics.
- Provide actionable business insights to guide retention strategies.
Business Value
Predicting churn enables the company to:
- Reduce revenue loss by identifying customers at risk of leaving.
- Improve retention rates through targeted campaigns and service improvements.
- Optimize marketing spend by focusing on customers with the highest churn probability.
For example, if each retained customer represents an average revenue of $1000 per year, reducing churn by even 5% can lead to significant savings.
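The arithmetic behind that example can be sketched as follows. Only the $1000 figure comes from the text; the customer count and baseline churn rate are hypothetical, and the 5% is read here as a relative reduction in churn, which is an assumption.

```python
def annual_revenue_saved(customers, churn_rate, revenue_per_customer, relative_reduction):
    """Estimate yearly revenue retained when churn falls by a relative fraction."""
    churners_before = customers * churn_rate
    churners_avoided = churners_before * relative_reduction
    return churners_avoided * revenue_per_customer


# Hypothetical inputs: 10,000 customers and 20% annual churn.
# $1000 per customer is the figure used in the text.
saved = annual_revenue_saved(
    customers=10_000,
    churn_rate=0.20,
    revenue_per_customer=1000,
    relative_reduction=0.05,
)
print(f"Estimated revenue retained per year: ${saved:,.0f}")
```

Under these assumed inputs, 100 fewer customers leave each year, which corresponds to about $100,000 in retained revenue. The estimate scales linearly with each input.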
Dataset Information
Dataset name: WA_Fn-UseC_-Telco-Customer-Churn.csv
Source: Kaggle, Telco Customer Churn Dataset
Rows: 7,043
Columns: 21
Tools and Technologies
- Python Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn
- Data Handling: Python (for preprocessing and analysis)
- Visualization: Matplotlib, Seaborn
- Modeling: Logistic Regression, Random Forest
- Evaluation: Classification metrics and ROC curves
- IDE: Jupyter Notebook
Project Workflow
- Data Loading and Inspection
The dataset used in this analysis is the Telco Customer Churn Dataset from Kaggle. It contains customer-level information such as tenure, payment method, contract type, and churn status.
- Total Records: 7,043
- Features: 21
- Target Variable: Churn (Yes / No)
- Data Cleaning and Preprocessing
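- Handled missing values in TotalCharges by converting the column to numeric and dropping the rows that could not be parsed.
- Converted categorical variables into numerical form using One-Hot Encoding for the features and a binary (label) mapping for the Churn target (Yes = 1, No = 0).
- Standardized the numerical features (tenure, MonthlyCharges, TotalCharges) using StandardScaler to ensure model stability.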
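One way to package these preprocessing steps is a scikit-learn Pipeline, sketched below. It is an alternative arrangement rather than the notebook's own code: the scaler and encoder are fitted only on the rows passed to `fit`, so test rows never influence the scaling statistics. The toy rows are made up, and only a few of the dataset's column names are reused.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration; TotalCharges arrives as text, one entry blank.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 62],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10, 29.75, 56.15],
    "TotalCharges": ["29.85", "1889.5", "108.15", "1840.75",
                     "820.5", "1949.4", " ", "3487.95"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Month-to-month", "Month-to-month", "One year"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "No"],
})

# Blank strings become NaN, and those rows are dropped.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
toy = toy.dropna(subset=["TotalCharges"])

X = toy.drop(columns="Churn")
y = toy["Churn"].map({"Yes": 1, "No": 0})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges", "TotalCharges"]),
    ("cat", OneHotEncoder(drop="first"), ["Contract"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Simple positional split: the first five rows train, the rest are held out.
X_train, X_test = X.iloc[:5], X.iloc[5:]
y_train, y_test = y.iloc[:5], y.iloc[5:]

model.fit(X_train, y_train)            # scaler statistics come from X_train only
test_prob = model.predict_proba(X_test)[:, 1]
print(test_prob.round(3))
```

With the real dataset, the same pipeline object could be passed to `train_test_split` output or to cross-validation without refitting the scaler by hand.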
- Exploratory Data Analysis (EDA)
Exploratory analysis was conducted to uncover trends and relationships in the data.
Key insights:
- Tenure: Customers with shorter tenures have a significantly higher churn rate.
- Contract Type: Month-to-month customers are more likely to churn compared to those on annual or two-year contracts.
- Payment Method: Customers using electronic checks show higher churn tendencies.
- Monthly Charges: Higher monthly charges correlate with higher churn probability.
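Insights of this kind usually come from comparing churn rates across groups. The sketch below shows the pattern on a small made-up frame; the tenure bands are an arbitrary choice for illustration, and the numbers are not results from the dataset.

```python
import pandas as pd

# Made-up rows; Churn is already encoded as 1 (Yes) / 0 (No).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year"],
    "tenure": [1, 3, 40, 14, 30, 70],
    "Churn": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the churn rate within each group.
churn_by_contract = toy.groupby("Contract")["Churn"].mean()

# Bucket tenure (in months) into bands before grouping.
tenure_band = pd.cut(toy["tenure"], bins=[0, 12, 24, 72],
                     labels=["0-12", "13-24", "25-72"])
churn_by_tenure = toy.groupby(tenure_band, observed=True)["Churn"].mean()

print(churn_by_contract)
print(churn_by_tenure)
```

The same two lines of grouping logic apply to PaymentMethod or to binned MonthlyCharges.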
EDA Visualizations:
- Churn Distribution: Bar chart showing class imbalance between churned and retained customers.
- Contract Type vs Churn: Stacked bar showing churn rates by contract duration.
Model Development
Two classification models were developed and compared:
- Logistic Regression (LR)
- Chosen for interpretability and as a strong baseline model.
- Provides insights into how individual features affect churn likelihood through coefficient values.
- Random Forest Classifier (RF)
- Chosen for its robustness and ability to capture complex, non-linear relationships.
- Provides feature importance rankings for interpretability.
Model Evaluation
The dataset was split into 80% training and 20% testing subsets, stratified on the churn label. Model performance was evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC, so that overall accuracy is reported alongside metrics that stay informative when the classes are imbalanced.
| Metric | Logistic Regression | Random Forest |
|---|---|---|
| Accuracy | 0.8045 | 0.7875 |
| Precision | 0.6495 | 0.6221 |
| Recall | 0.5749 | 0.5107 |
| F1-Score | 0.6099 | 0.5609 |
| ROC-AUC | 0.8360 | 0.8178 |
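As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so the F1 row can be recomputed from the two rows above it. The helper name below is hypothetical; the input values are taken from the table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
lr_f1 = f1_from(0.6495, 0.5749)
rf_f1 = f1_from(0.6221, 0.5107)

print(f"Logistic Regression F1: {lr_f1:.4f}")
print(f"Random Forest F1:       {rf_f1:.4f}")
```

Both recomputed values agree with the reported F1-scores (0.6099 and 0.5609) to within rounding.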
Findings:
- Logistic Regression outperformed Random Forest on all five metrics, with the largest gaps in Recall (0.5749 vs 0.5107) and F1-score (0.6099 vs 0.5609).
- Both models achieved strong AUC scores (>0.81), indicating a good ability to distinguish between churned and non-churned customers.
- Logistic Regression's interpretability makes it a strong choice for business decision-making, while Random Forest offers robustness and non-linearity handling.
Model Optimization
Although the baseline models achieved strong overall results, their recall (0.57 for Logistic Regression, 0.51 for Random Forest) means that roughly 43% to 49% of actual churners were missed, and their precision (0.65 and 0.62) means that about 35% to 38% of the customers flagged as churners were in fact loyal.
In churn prediction, this trade-off is critical:
- Low recall → missed churners → potential revenue loss.
- Low precision → loyal customers targeted unnecessarily → increased marketing costs.
To address this, model optimization was performed through:
- Class Weight Balancing: to make both models more sensitive to the minority churn class.
- Threshold Optimization: to find the decision threshold that maximizes the F1-score (the balance between precision and recall).
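The two steps can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's exact procedure: it selects the threshold on a separate validation split (an assumption made here so that a test set would stay untouched), and the data comes from `make_classification` rather than the Telco dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the churn problem.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.73, 0.27], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Step 1: class weights make errors on the minority class cost more.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
val_prob = model.predict_proba(X_val)[:, 1]

# Step 2: scan candidate thresholds and keep the one with the highest F1.
precisions, recalls, thresholds = precision_recall_curve(y_val, val_prob)
# The curve has one more precision/recall pair than thresholds; drop the last pair.
f1 = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-10)
best_threshold = thresholds[np.argmax(f1)]

val_pred = (val_prob >= best_threshold).astype(int)
print(f"threshold={best_threshold:.3f}  F1={f1_score(y_val, val_pred):.3f}")
```

Choosing the threshold on data that is also used for the final evaluation tends to make the reported metrics optimistic, which is why a held-out validation split is used in this sketch.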
Optimized Model Results
After optimization, both models improved their ability to capture churners, achieving better recall and balanced F1-scores.
| Metric | Logistic Regression (Optimized) | Random Forest (Optimized) |
|---|---|---|
| Accuracy | 0.787 | 0.760 |
| Precision | 0.588 | 0.538 |
| Recall | 0.671 | 0.701 |
| F1-Score | 0.627 | 0.609 |
| ROC-AUC | 0.835 | 0.819 |
Interpretation:
- The optimized models achieved higher recall (Logistic Regression 0.5749 to 0.671, Random Forest 0.5107 to 0.701), improving their ability to detect churners.
- Accuracy and precision dropped somewhat, but this trade-off is acceptable in business settings where missing churners is costlier than flagging loyal customers.
- The optimized Logistic Regression model offered the best trade-off between interpretability, recall, and overall performance.
Business Impact:
- The optimized models now identify a larger proportion of true churners, helping reduce customer loss through targeted retention campaigns.
- Logistic Regression is preferred for deployment due to its simplicity, interpretability, and balanced performance.
- Random Forest can serve as a secondary model for experimental use where higher recall is prioritized.
Top 10 Most Influential Features
Logistic Regression (Top 10 Features)
| Feature | Coefficient |
|---|---|
| Contract_Two year | -1.364197 |
| tenure | -1.348550 |
| InternetService_Fiber optic | 1.108470 |
| Contract_One year | -0.748888 |
| TotalCharges | 0.634001 |
| PhoneService_Yes | -0.517122 |
| MonthlyCharges | -0.430947 |
| PaymentMethod_Electronic check | 0.380924 |
| OnlineSecurity_Yes | -0.373453 |
| StreamingTV_Yes | 0.371831 |
Visualization: Top 10 Features (LR)
Interpretation:
- Negative coefficients (-) reduce churn likelihood. → Customers with longer tenure or multi-year contracts tend to stay.
- Positive coefficients (+) increase churn likelihood. → Customers with fiber optic service, electronic check payments, or streaming TV are more likely to churn.
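A common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below does this for four coefficients copied from the table above. The interpretation assumes all other features are held fixed; for tenure, which was standardized, the ratio applies per one standard deviation rather than per month.

```python
import math

# Coefficients copied from the Logistic Regression table above.
coefficients = {
    "Contract_Two year": -1.364197,
    "tenure": -1.348550,
    "InternetService_Fiber optic": 1.108470,
    "PaymentMethod_Electronic check": 0.380924,
}

# exp(coefficient) is the multiplicative change in the odds of churn.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "lower" if ratio < 1 else "higher"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} churn odds)")
```

Read this way, a two-year contract multiplies the odds of churn by roughly 0.26 relative to the baseline contract category, while fiber optic service multiplies them by roughly 3.0.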
Random Forest (Top 10 Features)
| Feature | Importance |
|---|---|
| TotalCharges | 0.194313 |
| tenure | 0.168529 |
| MonthlyCharges | 0.167972 |
| InternetService_Fiber optic | 0.038913 |
| PaymentMethod_Electronic check | 0.037898 |
| Contract_Two year | 0.031862 |
| gender_Male | 0.028939 |
| OnlineSecurity_Yes | 0.027288 |
| PaperlessBilling_Yes | 0.025595 |
| Partner_Yes | 0.023280 |
Visualization: Top 10 Features (RF)
Interpretation:
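- Tenure, TotalCharges, and MonthlyCharges are the strongest churn predictors.
- Fiber optic service and electronic check payment rank next in importance. Importance scores are unsigned; the Logistic Regression coefficients indicate that both are associated with higher churn.
- Two-year contracts and online security also appear in the top 10; the Logistic Regression coefficients indicate that both lower churn risk.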
Comparative Insight
Both models highlight contract type, tenure, monthly charges, and payment method as key churn drivers. While Logistic Regression reveals the direction of each relationship (positive or negative), Random Forest confirms their importance ranking. Together, they emphasize that:
Customers with short contracts, higher bills, or electronic payments are more prone to churn, while long-term, loyal customers with secure services are more likely to stay.
Business Insights and Recommendation
- Contract Type: Customers on month-to-month contracts are more likely to churn. → Encourage long-term contracts through loyalty discounts.
- Monthly Charges: High-charging customers tend to leave. → Consider tiered pricing or loyalty rewards for premium users.
- Tenure: New customers have a higher risk of churn. → Introduce onboarding programs to improve early retention.
- Payment Method: Electronic check users churn more often. → Promote credit card or auto-pay options for convenience.
- Internet Service: Fiber optic customers churn more frequently. → Investigate service quality and satisfaction issues in this segment.
- Model Deployment: The final model can be integrated into a CRM system to flag high-risk customers for proactive engagement.
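The flagging step of such an integration might look like the sketch below. The function name, customer IDs, probabilities, and the 0.45 threshold are all hypothetical; in practice the probabilities would come from a fitted model's `predict_proba` output and the threshold from the optimization step.

```python
import pandas as pd


def flag_high_risk(customer_ids, churn_probabilities, threshold):
    """Return customers sorted by churn probability with a high-risk flag."""
    scored = pd.DataFrame({
        "customerID": list(customer_ids),
        "churn_probability": list(churn_probabilities),
    })
    scored["high_risk"] = scored["churn_probability"] >= threshold
    return scored.sort_values("churn_probability", ascending=False).reset_index(drop=True)


# Hypothetical IDs, scores, and threshold for illustration only.
flagged = flag_high_risk(
    ["A-001", "A-002", "A-003"],
    [0.12, 0.81, 0.47],
    threshold=0.45,
)
print(flagged)
```

Sorting by probability lets a retention team work from the top of the list when campaign budget is limited.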
Conclusion
This project demonstrates how machine learning can effectively predict customer churn and support data-driven retention strategies. While Logistic Regression provided interpretability and strong recall, Random Forest added robust feature insights.
With recall-optimized models, telecom companies can proactively identify customers at risk, reduce churn, and enhance customer lifetime value.
# 1. Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve
)
import warnings
warnings.filterwarnings("ignore")
# 2. Load dataset
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# 3. Display shape and first few rows
print("Dataset shape:", df.shape)
df.head()
# 4. DATA OVERVIEW & VALIDATION
# Check info and missing values
df.info()
df.isna().sum()
# Check for unique values in categorical columns
for col in df.select_dtypes('object').columns:
    print(f"{col}: {df[col].nunique()} unique values")
# Quick statistics for numeric columns
df.describe()
# 5. DATA CLEANING & PREPARATION
# Convert TotalCharges to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Check how many are NaN
print("Missing TotalCharges:", df['TotalCharges'].isna().sum())
# Drop rows with missing TotalCharges
df = df.dropna(subset=['TotalCharges'])
# Drop customerID (unique identifier)
df.drop('customerID', axis=1, inplace=True)
# Convert target column
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
# Strip whitespace from object columns
for col in df.select_dtypes('object').columns:
    df[col] = df[col].str.strip()
print("Cleaned data shape:", df.shape)
df.head()
# 6. EXPLORATORY DATA ANALYSIS (EDA)
# Churn distribution
sns.countplot(x='Churn', data=df)
plt.title("Customer Churn Distribution")
plt.show()
# Tenure distribution
sns.histplot(df['tenure'], bins=30)
plt.title("Tenure Distribution")
plt.show()
# Monthly Charges distribution
sns.histplot(df['MonthlyCharges'], bins=30)
plt.title("Monthly Charges Distribution")
plt.show()
# Contract type vs Churn
pd.crosstab(df['Contract'], df['Churn'], normalize='index').plot(kind='bar', stacked=True)
plt.title("Contract Type vs Churn")
plt.ylabel("Proportion")
plt.show()
# Correlation heatmap
num_cols = df.select_dtypes('number').columns
plt.figure(figsize=(10,6))
sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap (Numerical Features)")
plt.show()
# 7. FEATURE ENGINEERING
# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']
# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)
# Standardize numeric columns
scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
X[num_cols] = scaler.fit_transform(X[num_cols])
print("Feature matrix shape:", X.shape)
X.head()
# 8. MODEL DEVELOPMENT
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Baseline: Logistic Regression
lr = LogisticRegression(max_iter=2000)
lr.fit(X_train, y_train)
# Compare with Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# 8. MODEL EVALUATION
def evaluate(model, X_test, y_test, name="Model"):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"===== {name} =====")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_prob))
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=f"{name} (AUC={roc_auc_score(y_test, y_prob):.2f})")
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.title("ROC Curve")
    plt.show()
# Evaluate both models
evaluate(lr, X_test, y_test, "Logistic Regression")
evaluate(rf, X_test, y_test, "Random Forest")
# Model Optimization for Churn Prediction
# The other libraries used below were already imported in step 1; only GridSearchCV is new.
from sklearn.model_selection import GridSearchCV
# 1. Retrain Models with Balanced Class Weights
print("Training Balanced Models...")
lr_balanced = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
rf_balanced = RandomForestClassifier(class_weight='balanced', n_estimators=200, random_state=42)
lr_balanced.fit(X_train, y_train)
rf_balanced.fit(X_train, y_train)
print("Models trained successfully.\n")
# 2. Find Optimal Threshold for Each Model
def optimal_threshold(model, X_test, y_test, model_name):
    """Find the optimal threshold that maximizes F1-score."""
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
    # precision_recall_curve returns one more precision/recall value than thresholds;
    # drop the final point so every index maps to an actual threshold.
    precisions, recalls = precisions[:-1], recalls[:-1]
    f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
    best_index = np.argmax(f1_scores)
    best_threshold = thresholds[best_index]
    best_precision = precisions[best_index]
    best_recall = recalls[best_index]
    print(f"===== {model_name} =====")
    print(f"Best Threshold: {best_threshold:.3f}")
    print(f"Precision: {best_precision:.3f}")
    print(f"Recall: {best_recall:.3f}")
    print(f"F1 Score: {f1_scores[best_index]:.3f}\n")
    # Plot precision and recall against the decision threshold
    plt.figure(figsize=(6, 5))
    plt.plot(thresholds, precisions, label='Precision', color='blue')
    plt.plot(thresholds, recalls, label='Recall', color='red')
    plt.title(f'Precision-Recall vs Threshold ({model_name})')
    plt.xlabel('Threshold')
    plt.ylabel('Score')
    plt.legend()
    plt.grid()
    plt.show()
    return best_threshold
lr_threshold = optimal_threshold(lr_balanced, X_test, y_test, "Logistic Regression (Balanced)")
rf_threshold = optimal_threshold(rf_balanced, X_test, y_test, "Random Forest (Balanced)")
# 3. Evaluate Models Using the Optimal Thresholds
def evaluate_model_threshold(model, X_test, y_test, threshold, model_name):
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (y_pred_proba >= threshold).astype(int)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    print(f"===== {model_name} (Optimized Threshold) =====")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"F1 Score: {f1:.3f}")
    print(f"ROC-AUC: {auc:.3f}\n")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    # Plot Confusion Matrix
    plt.figure(figsize=(5, 4))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'Confusion Matrix - {model_name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()
evaluate_model_threshold(lr_balanced, X_test, y_test, lr_threshold, "Logistic Regression")
evaluate_model_threshold(rf_balanced, X_test, y_test, rf_threshold, "Random Forest")
# =====================================================
# 4. (Optional) Hyperparameter Tuning for Further Improvement
# =====================================================
# Uncomment to run
"""
# Logistic Regression Tuning
params_lr = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs'],
    'class_weight': ['balanced']
}
grid_lr = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), params_lr, cv=5, scoring='f1')
grid_lr.fit(X_train, y_train)
print("Best Logistic Regression Params:", grid_lr.best_params_)
# Random Forest Tuning
params_rf = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced']
}
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), params_rf, cv=3, scoring='f1')
grid_rf.fit(X_train, y_train)
print("Best Random Forest Params:", grid_rf.best_params_)
"""
# 9. Feature Importance: Logistic Regression (Top 10 Clean Output)
# Create dataframe for coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr.coef_[0]
})
# Add absolute coefficient values for sorting
coefficients['Abs_Coefficient'] = np.abs(coefficients['Coefficient'])
# Sort by absolute value and select top 10
top10_lr = coefficients.sort_values('Abs_Coefficient', ascending=False).head(10).reset_index(drop=True)
# Display clean output
print("Top 10 Features - Logistic Regression:\n")
print(top10_lr[['Feature', 'Coefficient']].to_string(index=False))
# Plot the top 10 coefficients
plt.figure(figsize=(8,6))
plt.barh(top10_lr['Feature'], top10_lr['Coefficient'])
plt.title('Top 10 Features - Logistic Regression')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.gca().invert_yaxis()
plt.show()
# 10. Feature Importance: Random Forest (Top 10)
# Calculate feature importance
importances = rf.feature_importances_
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
# Sort and select top 10
top10_rf = importance_df.sort_values('Importance', ascending=False).head(10)
# Reset index to remove numbering
top10_rf = top10_rf.reset_index(drop=True)
# Display clean table
print("Top 10 Features - Random Forest:\n")
print(top10_rf.to_string(index=False))
# Plot the top 10 features
plt.figure(figsize=(8,6))
plt.barh(top10_rf['Feature'], top10_rf['Importance'])
plt.title('Top 10 Features - Random Forest')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.gca().invert_yaxis()
plt.show()