Telco Customer Churn

Telco Customer Churn Prediction

Project Overview

Customer churn, the loss of customers to competitors, is one of the biggest challenges in the telecommunications industry. Acquiring new customers is often five times more expensive than retaining existing ones.

This project aims to analyze historical customer data and build a predictive model that identifies customers most likely to churn. By recognizing these at-risk customers early, the company can take proactive actions such as offering promotions, improving service quality, or providing personalized engagement, ultimately reducing churn and increasing customer lifetime value.

Problem Statement

The objective of this project is to predict customer churn based on demographics, account information, and service usage patterns.

Specifically, we aim to:

Understand the factors contributing to customer churn through exploratory data analysis (EDA).
Build and compare machine learning models to classify whether a customer is likely to churn.
Evaluate the models using key performance metrics.
Provide actionable business insights to guide retention strategies.

Business Value

Predicting churn enables the company to:

Reduce revenue loss by identifying customers at risk of leaving.
Improve retention rates through targeted campaigns and service improvements.
Optimize marketing spend by focusing on customers with the highest churn probability.

For example, if each retained customer represents an average revenue of $1000 per year, retaining just 5% of the customers who would otherwise churn keeps $50,000 per year for every 1,000 would-be churners.
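
As a rough illustration of this arithmetic, the sketch below estimates the revenue kept when a share of would-be churners is retained. The function name, the customer count, and the churn rate are hypothetical placeholders; only the $1000 revenue figure and the 5% reduction come from the text above, and the 5% is read here as a relative reduction in the number of churners, which is one possible interpretation.

```python
def estimate_retained_revenue(n_customers, churn_rate, relative_reduction, revenue_per_customer):
    """Yearly revenue kept if `relative_reduction` of expected churners are retained."""
    expected_churners = n_customers * churn_rate
    retained_customers = expected_churners * relative_reduction
    return retained_customers * revenue_per_customer


# Hypothetical inputs: 10,000 customers and a 25% annual churn rate.
savings = estimate_retained_revenue(
    n_customers=10_000,
    churn_rate=0.25,
    relative_reduction=0.05,
    revenue_per_customer=1000,
)
print(f"Estimated retained revenue: ${savings:,.0f} per year")
```

The estimate scales linearly with each input, so it is only as reliable as the assumed churn rate and revenue per customer.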

Dataset Information

Dataset name: WA_Fn-UseC_-Telco-Customer-Churn.csv

Source: Kaggle – Telco Customer Churn Dataset

Rows: 7,043

Columns: 21

Tools and Technologies

Python Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn
Data Handling: Python (for preprocessing and analysis)
Visualization: Matplotlib, Seaborn
Modeling: Logistic Regression, Random Forest
Evaluation: Classification metrics and ROC curves
IDE: Jupyter Notebook

Project Workflow

Data Loading and Inspection

The dataset used in this analysis is the Telco Customer Churn Dataset from Kaggle. It contains customer-level information such as tenure, payment method, contract type, and churn status.

Total Records: 7,043
Features: 21
Target Variable: Churn (Yes / No)

Data Cleaning and Preprocessing

Converted TotalCharges to a numeric column and dropped the rows where the value was missing.
Encoded the target (Churn) as 0/1 and converted the remaining categorical variables into numerical form using One-Hot Encoding.
Standardized the numerical features (tenure, MonthlyCharges, TotalCharges) using StandardScaler to ensure model stability.
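
The cleaning steps can be sketched on a toy frame. The values below are made up for illustration and the variable names are hypothetical; the sketch mirrors the notebook code further down, which coerces TotalCharges to numeric, drops the rows that fail to parse, and one-hot encodes the categorical columns.

```python
import pandas as pd

# Toy frame mimicking three columns of the dataset; all values are made up.
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5", "108.15"],
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "Churn": ["No", "No", "No", "Yes"],
})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop the unparseable rows and encode the target as 0/1.
clean = toy.dropna(subset=["TotalCharges"]).copy()
clean["Churn"] = clean["Churn"].map({"Yes": 1, "No": 0})

# One-hot encode the categorical column, dropping the first level as a reference.
encoded = pd.get_dummies(clean, columns=["Contract"], drop_first=True)
print(n_missing, encoded.columns.tolist())
```

Dropping the first level of each category avoids perfectly collinear dummy columns, which matters for the Logistic Regression coefficients discussed later.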

Exploratory Data Analysis (EDA)

Exploratory analysis was conducted to uncover trends and relationships in the data.

Key insights:

Tenure: Customers with shorter tenures have a significantly higher churn rate.
Contract Type: Month-to-month customers are more likely to churn compared to those on annual or two-year contracts.
Payment Method: Customers using electronic checks show higher churn tendencies.
Monthly Charges: Higher monthly charges correlate with higher churn probability.
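
Insights of this kind typically come from comparing churn rates across groups. The snippet below is a minimal sketch on made-up data (not the project's results), assuming the target has already been encoded as 0/1 so that the group mean equals the churn rate.

```python
import pandas as pd

# Made-up toy data: Churn is 1 for churned customers and 0 otherwise.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 3 + ["Two year"] * 3,
    "Churn": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# With a 0/1 target, the mean per group is the churn rate for that group.
churn_rate_by_contract = toy.groupby("Contract")["Churn"].mean()
print(churn_rate_by_contract.sort_values(ascending=False))
```

The same pattern applies to PaymentMethod, or to tenure and MonthlyCharges after binning them with `pd.cut`.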

EDA Visualizations:

Churn Distribution: Bar chart showing class imbalance between churned and retained customers.

Contract Type vs Churn: Stacked bar showing churn rates by contract duration.

Model Development

Two classification models were developed and compared:

Logistic Regression (LR)

Chosen for interpretability and as a strong baseline model.
Provides insights into how individual features affect churn likelihood through coefficient values.

Random Forest Classifier (RF)

Chosen for its robustness and ability to capture complex, non-linear relationships.
Provides feature importance rankings for interpretability.

Model Evaluation

The dataset was split into 80% training and 20% testing subsets, stratified by the churn label. Model performance was evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC, so that performance on the minority churn class is measured alongside overall accuracy.

Metric	Logistic Regression	Random Forest
Accuracy	0.8045	0.7875
Precision	0.6495	0.6221
Recall	0.5749	0.5107
F1-Score	0.6099	0.5609
ROC-AUC	0.8360	0.8178
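
As a quick consistency check, the F1-scores in the table can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from` is a hypothetical convenience; the numbers are copied from the table above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the baseline results table.
reported = {
    "Logistic Regression": {"precision": 0.6495, "recall": 0.5749, "f1": 0.6099},
    "Random Forest": {"precision": 0.6221, "recall": 0.5107, "f1": 0.5609},
}

for name, m in reported.items():
    recomputed = f1_from(m["precision"], m["recall"])
    print(f"{name}: reported F1={m['f1']:.4f}, recomputed F1={recomputed:.4f}")
```

Both recomputed values agree with the table to four decimal places.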

Findings:

Logistic Regression performed slightly better than Random Forest on all five metrics, with the largest gaps in Recall and F1-score.
Both models achieved strong AUC scores (>0.81), indicating a good ability to distinguish between churned and non-churned customers.
Logistic Regression’s interpretability makes it a strong choice for business decision-making, while Random Forest offers robustness and non-linearity handling.

Model Optimization

Although the baseline models achieved strong results, their recall values (0.57 and 0.51) meant that roughly 43% to 49% of churners were still being missed, and their precision values (0.65 and 0.62) meant that more than a third of the customers flagged as churners were in fact loyal.

In churn prediction, this trade-off is critical:

Low recall → missed churners → potential revenue loss.
Low precision → loyal customers targeted unnecessarily → increased marketing costs.

To address this, model optimization was performed through:

Class Weight Balancing — to make both models more sensitive to the minority churn class.
Threshold Optimization — to find the best decision threshold that maximizes the F1-score (balance between precision and recall).
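
For intuition on class weight balancing, the sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`: each class receives the weight n_samples / (n_classes * class_count). The function name and the toy labels are hypothetical.

```python
import numpy as np


def balanced_class_weights(y):
    """Per-class weights computed as n_samples / (n_classes * count_per_class)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Toy labels: 7 retained customers (0) and 3 churners (1).
y_toy = [0] * 7 + [1] * 3
weights = balanced_class_weights(y_toy)
print(weights)
```

The minority class gets the larger weight, so misclassifying a churner costs the model more during training than misclassifying a retained customer.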

Optimized Model Results

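After optimization, both models captured more churners: recall rose from 0.575 to 0.671 for Logistic Regression and from 0.511 to 0.701 for Random Forest, and F1-scores rose from 0.610 to 0.627 and from 0.561 to 0.609, respectively.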

Metric	Logistic Regression (Optimized)	Random Forest (Optimized)
Accuracy	0.787	0.760
Precision	0.588	0.538
Recall	0.671	0.701
F1-Score	0.627	0.609
ROC-AUC	0.835	0.819

Interpretation:

The optimized models achieved higher recall, improving their ability to detect churners.
Accuracy and precision dropped somewhat (for Logistic Regression, accuracy from 0.80 to 0.79 and precision from 0.65 to 0.59), but this trade-off is acceptable in business settings where missing a churner is costlier than flagging a loyal customer.
The optimized Logistic Regression model offered the best trade-off between interpretability, recall, and overall performance.
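
Whether the trade-off pays off depends on the relative cost of the two error types. The sketch below compares two operating points under stated assumptions: the confusion-matrix counts and the $50 offer cost are hypothetical, and the $1000 cost of a missed churner reuses the revenue figure from the Business Value example.

```python
def campaign_cost(false_negatives, false_positives, cost_missed_churner, cost_unneeded_offer):
    """Total cost of missed churners plus retention offers sent to loyal customers."""
    return false_negatives * cost_missed_churner + false_positives * cost_unneeded_offer


# Hypothetical error counts before and after lowering the decision threshold.
baseline_cost = campaign_cost(
    false_negatives=150, false_positives=100,
    cost_missed_churner=1000, cost_unneeded_offer=50,
)
tuned_cost = campaign_cost(
    false_negatives=100, false_positives=200,
    cost_missed_churner=1000, cost_unneeded_offer=50,
)
print(f"Baseline cost: ${baseline_cost:,}  Tuned cost: ${tuned_cost:,}")
```

With these assumed costs, the higher-recall operating point is cheaper even though it flags twice as many loyal customers; the conclusion would reverse if offers were expensive relative to lost revenue.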

Business Impact:

The optimized models now identify a larger proportion of true churners, helping reduce customer loss through targeted retention campaigns.
Logistic Regression is preferred for deployment due to its simplicity, interpretability, and balanced performance.
Random Forest can serve as a secondary model for experimental use where higher recall is prioritized.

Top 10 Most Influential Features

Logistic Regression (Top 10 Features)

Feature	Coefficient
Contract_Two year	-1.364197
tenure	-1.348550
InternetService_Fiber optic	1.108470
Contract_One year	-0.748888
TotalCharges	0.634001
PhoneService_Yes	-0.517122
MonthlyCharges	-0.430947
PaymentMethod_Electronic check	0.380924
OnlineSecurity_Yes	-0.373453
StreamingTV_Yes	0.371831

Visualization: Top 10 Features (LR)

Interpretation:

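Negative coefficients (−) reduce churn likelihood. → Customers with longer tenure or multi-year contracts tend to stay.
Positive coefficients (+) increase churn likelihood. → Customers with fiber optic service, electronic check payments, or streaming TV are more likely to churn.
Each coefficient is conditional on the other features in the model, so its sign can differ from the raw trend seen in EDA. → MonthlyCharges has a negative coefficient here even though higher monthly charges go with more churn in the EDA, most likely because it is correlated with other features such as fiber optic service and TotalCharges.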
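
Coefficients are easier to read as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of churn for a one-unit increase in that feature, holding the others fixed. The snippet below is a small sketch using four coefficients copied from the table above; for standardized features such as tenure, one unit means one standard deviation.

```python
import math

# Coefficients copied from the Logistic Regression table above.
coefficients = {
    "Contract_Two year": -1.364197,
    "tenure": -1.348550,
    "InternetService_Fiber optic": 1.108470,
    "PaymentMethod_Electronic check": 0.380924,
}

# exp(coefficient) is the odds ratio: below 1 lowers the odds of churn, above 1 raises them.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio near 0.26 for a two-year contract, for example, means the odds of churn are roughly a quarter of those for the reference contract type, all else equal.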

Random Forest (Top 10 Features)

Feature	Importance
TotalCharges	0.194313
tenure	0.168529
MonthlyCharges	0.167972
InternetService_Fiber optic	0.038913
PaymentMethod_Electronic check	0.037898
Contract_Two year	0.031862
gender_Male	0.028939
OnlineSecurity_Yes	0.027288
PaperlessBilling_Yes	0.025595
Partner_Yes	0.023280

Visualization: Top 10 Features (RF)

Interpretation:

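Tenure, TotalCharges, and MonthlyCharges are the strongest churn predictors, together accounting for roughly 53% of total importance.
Fiber optic service and electronic check payment are the most important categorical features. Importance scores carry no sign; the EDA and the Logistic Regression coefficients indicate that both are associated with higher churn.
Two-year contracts and online security also rank in the top 10; the Logistic Regression coefficients indicate that both lower churn risk.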

Comparative Insight

Both models highlight contract type, tenure, monthly charges, and payment method as key churn drivers. While Logistic Regression reveals the direction of each relationship (positive or negative), Random Forest confirms their importance ranking. Together, they emphasize that:

Customers with short contracts, higher bills, or electronic payments are more prone to churn, while long-term, loyal customers with secure services are more likely to stay.

Business Insights and Recommendations

Contract Type: Customers on month-to-month contracts are more likely to churn. → Encourage long-term contracts through loyalty discounts.
Monthly Charges: Customers with high monthly charges tend to leave. → Consider tiered pricing or loyalty rewards for premium users.
Tenure: New customers have a higher risk of churn. → Introduce onboarding programs to improve early retention.
Payment Method: Electronic check users churn more often. → Promote credit card or auto-pay options for convenience.
Internet Service: Fiber optic customers churn more frequently. → Investigate service quality and satisfaction issues in this segment.
Model Deployment: The final model can be integrated into a CRM system to flag high-risk customers for proactive engagement.
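
A minimal sketch of the flagging step might look like the following. The function name, customer IDs, probabilities, and the 0.5 threshold are all hypothetical; in practice the probabilities would come from the trained model's `predict_proba` output and the threshold from the optimization step.

```python
import pandas as pd


def flag_high_risk(customer_ids, churn_probabilities, threshold):
    """Return customers at or above the threshold, highest risk first."""
    scored = pd.DataFrame({
        "customer_id": customer_ids,
        "churn_probability": churn_probabilities,
    })
    flagged = scored[scored["churn_probability"] >= threshold]
    return flagged.sort_values("churn_probability", ascending=False).reset_index(drop=True)


# Hypothetical scores for four customers.
flagged = flag_high_risk(
    customer_ids=["A-001", "A-002", "A-003", "A-004"],
    churn_probabilities=[0.82, 0.15, 0.47, 0.64],
    threshold=0.5,
)
print(flagged)
```

Sorting by probability lets a retention team work through the list from the highest-risk customers down when campaign budget is limited.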

Conclusion

This project demonstrates how machine learning can effectively predict customer churn and support data-driven retention strategies. While Logistic Regression provided interpretability and the best overall balance of precision and recall, Random Forest added robust feature insights.

With recall-optimized models, telecom companies can proactively identify customers at risk, reduce churn, and enhance customer lifetime value.

# 1. Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve
)
import warnings
warnings.filterwarnings("ignore")

# 2. Load dataset 
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# 3. Display shape and first few rows
print("Dataset shape:", df.shape)
df.head()

# 4. DATA OVERVIEW & VALIDATION

# Check info and missing values
df.info()
df.isna().sum()

# Check for unique values in categorical columns
for col in df.select_dtypes('object').columns:
    print(f"{col}: {df[col].nunique()} unique values")

# Quick statistics for numeric columns
df.describe()

# 5. DATA CLEANING & PREPARATION

# Convert TotalCharges to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check how many are NaN
print("Missing TotalCharges:", df['TotalCharges'].isna().sum())

# Drop rows with missing TotalCharges (.copy() avoids chained-assignment warnings later)
df = df.dropna(subset=['TotalCharges']).copy()

# Drop customerID (unique identifier with no predictive value)
df = df.drop(columns='customerID')

# Convert target column
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Strip whitespace from object columns
for col in df.select_dtypes('object').columns:
    df[col] = df[col].str.strip()

print("Cleaned data shape:", df.shape)
df.head()

# 6. EXPLORATORY DATA ANALYSIS (EDA)

# Churn distribution
sns.countplot(x='Churn', data=df)
plt.title("Customer Churn Distribution")
plt.show()

# Tenure distribution
sns.histplot(df['tenure'], bins=30)
plt.title("Tenure Distribution")
plt.show()

# Monthly Charges distribution
sns.histplot(df['MonthlyCharges'], bins=30)
plt.title("Monthly Charges Distribution")
plt.show()

# Contract type vs Churn
pd.crosstab(df['Contract'], df['Churn'], normalize='index').plot(kind='bar', stacked=True)
plt.title("Contract Type vs Churn")
plt.ylabel("Proportion")
plt.show()

# Correlation heatmap
num_cols = df.select_dtypes('number').columns
plt.figure(figsize=(10,6))
sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap (Numerical Features)")
plt.show()

# 7. FEATURE ENGINEERING

# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)

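print("Feature matrix shape:", X.shape)
X.head()

# 8. MODEL DEVELOPMENT

# Train-test split (stratified so both subsets keep the same churn rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize numeric columns; fit the scaler on the training set only
# so that test-set statistics do not leak into preprocessing
scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
X_train, X_test = X_train.copy(), X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])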

# Baseline: Logistic Regression
lr = LogisticRegression(max_iter=2000)
lr.fit(X_train, y_train)

# Compare with Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# 8. MODEL EVALUATION

def evaluate(model, X_test, y_test, name="Model"):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:,1]

    print(f"===== {name} =====")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_prob))
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=f"{name} (AUC={roc_auc_score(y_test, y_prob):.2f})")
    plt.plot([0,1], [0,1], 'k--')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.title("ROC Curve")
    plt.show()

# Evaluate both models
evaluate(lr, X_test, y_test, "Logistic Regression")
evaluate(rf, X_test, y_test, "Random Forest")


#  Model Optimization for Churn Prediction

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, classification_report, confusion_matrix, roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# 1. Retrain Models with Balanced Class Weights

print("Training Balanced Models...")

lr_balanced = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
rf_balanced = RandomForestClassifier(class_weight='balanced', n_estimators=200, random_state=42)

lr_balanced.fit(X_train, y_train)
rf_balanced.fit(X_train, y_train)

print("Models trained successfully.\n")



# 2. Find Optimal Threshold for Each Model

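def optimal_threshold(model, X_test, y_test, model_name):
    """Find the decision threshold that maximizes the F1-score.

    Note: tuning the threshold on the test set makes the reported metrics
    optimistic; a separate validation split is preferable when available.
    """
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
    # precision_recall_curve returns one more precision/recall value than thresholds;
    # skip the final (precision=1, recall=0) point so indices line up with thresholds
    f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
    best_index = np.argmax(f1_scores)
    best_threshold = thresholds[best_index]
    best_precision = precisions[best_index]
    best_recall = recalls[best_index]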
    
    print(f"===== {model_name} =====")
    print(f"Best Threshold: {best_threshold:.3f}")
    print(f"Precision: {best_precision:.3f}")
    print(f"Recall: {best_recall:.3f}")
    print(f"F1 Score: {f1_scores[best_index]:.3f}\n")

    # Plot precision and recall against the decision threshold
    plt.figure(figsize=(6,5))
    plt.plot(thresholds, precisions[:-1], label='Precision', color='blue')
    plt.plot(thresholds, recalls[:-1], label='Recall', color='red')
    plt.title(f'Precision-Recall vs Threshold ({model_name})')
    plt.xlabel('Threshold')
    plt.ylabel('Score')
    plt.legend()
    plt.grid()
    plt.show()

    return best_threshold

lr_threshold = optimal_threshold(lr_balanced, X_test, y_test, "Logistic Regression (Balanced)")
rf_threshold = optimal_threshold(rf_balanced, X_test, y_test, "Random Forest (Balanced)")



# 3. Evaluate Models Using the Optimal Thresholds

def evaluate_model_threshold(model, X_test, y_test, threshold, model_name):
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (y_pred_proba >= threshold).astype(int)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    print(f"===== {model_name} (Optimized Threshold) =====")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"F1 Score: {f1:.3f}")
    print(f"ROC-AUC: {auc:.3f}\n")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    # Plot Confusion Matrix
    plt.figure(figsize=(5,4))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'Confusion Matrix - {model_name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()


evaluate_model_threshold(lr_balanced, X_test, y_test, lr_threshold, "Logistic Regression")
evaluate_model_threshold(rf_balanced, X_test, y_test, rf_threshold, "Random Forest")


# =====================================================
# 4️⃣ (Optional) Hyperparameter Tuning for Further Improvement
# =====================================================
# Uncomment to run

"""
# Logistic Regression Tuning
params_lr = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs'],
    'class_weight': ['balanced']
}

grid_lr = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), params_lr, cv=5, scoring='f1')
grid_lr.fit(X_train, y_train)
print("Best Logistic Regression Params:", grid_lr.best_params_)

# Random Forest Tuning
params_rf = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced']
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), params_rf, cv=3, scoring='f1')
grid_rf.fit(X_train, y_train)
print("Best Random Forest Params:", grid_rf.best_params_)
"""

# 🔍 9. Feature Importance — Logistic Regression (Top 10 Clean Output)

# Create dataframe for coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr.coef_[0]
})

# Add absolute coefficient values for sorting
coefficients['Abs_Coefficient'] = np.abs(coefficients['Coefficient'])

# Sort by absolute value and select top 10
top10_lr = coefficients.sort_values('Abs_Coefficient', ascending=False).head(10).reset_index(drop=True)

# Display clean output
print("Top 10 Features - Logistic Regression:\n")
print(top10_lr[['Feature', 'Coefficient']].to_string(index=False))

# Plot the top 10 coefficients
plt.figure(figsize=(8,6))
plt.barh(top10_lr['Feature'], top10_lr['Coefficient'])
plt.title('Top 10 Features - Logistic Regression')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.gca().invert_yaxis()
plt.show()

# 10. Feature Importance — Random Forest (Top 10)

# Calculate feature importance
importances = rf.feature_importances_
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})

# Sort and select top 10
top10_rf = importance_df.sort_values('Importance', ascending=False).head(10)

# Reset index to remove numbering
top10_rf = top10_rf.reset_index(drop=True)

# Display clean table
print("Top 10 Features - Random Forest:\n")
print(top10_rf.to_string(index=False))

# Plot the top 10 features
plt.figure(figsize=(8,6))
plt.barh(top10_rf['Feature'], top10_rf['Importance'])
plt.title('Top 10 Features - Random Forest')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.gca().invert_yaxis()
plt.show()
