Telecom Customer Churn

This dataset comes from an Iranian telecom company, with each row representing one customer over a one-year period. Along with a churn label, it includes information on each customer's activity, such as call failures and subscription length.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd
churn = pd.read_csv("data/customer_churn.csv")
print(churn.shape)
churn.head(100)

Data Dictionary

Column	Explanation
Call Failure	number of call failures
Complaints	binary (0: No complaint, 1: complaint)
Subscription Length	total months of subscription
Charge Amount	ordinal attribute (0: lowest amount, 9: highest amount)
Seconds of Use	total seconds of calls
Frequency of use	total number of calls
Frequency of SMS	total number of text messages
Distinct Called Numbers	total number of distinct phone numbers called
Age Group	ordinal attribute (1: younger age, 5: older age)
Tariff Plan	binary (1: Pay as you go, 2: contractual)
Status	binary (1: active, 2: non-active)
Age	age of customer
Customer Value	the calculated value of customer
Churn	class label (1: churn, 0: non-churn)
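
Before analysis, it can help to confirm that the coded columns only contain the values the data dictionary lists. The sketch below is one way to do that. The helper `find_unexpected_values` and the `toy` frame are hypothetical illustrations, not part of the dataset or the original notebook; with the real data you would pass `churn` instead.

```python
import pandas as pd

# Allowed codes, copied from the data dictionary above.
EXPECTED_VALUES = {
    "Complaints": {0, 1},
    "Tariff Plan": {1, 2},
    "Status": {1, 2},
    "Churn": {0, 1},
}

def find_unexpected_values(df, expected=EXPECTED_VALUES):
    """Return {column: sorted unexpected values} for columns that break the dictionary."""
    problems = {}
    for column, allowed in expected.items():
        if column not in df.columns:
            continue  # skip columns that are absent from this frame
        observed = set(df[column].dropna().unique().tolist())
        unexpected = observed - allowed
        if unexpected:
            problems[column] = sorted(unexpected)
    return problems

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Complaints": [0, 1, 0],
    "Tariff Plan": [1, 2, 3],  # 3 is deliberately invalid
    "Status": [1, 2, 1],
    "Churn": [0, 0, 1],
})
problems = find_unexpected_values(toy)
print(problems)
```

An empty result means every coded column matches the dictionary; any entry points to values worth inspecting before modeling.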

Source of dataset and source of dataset description.

Citation: Jafari-Marandi, R., Denton, J., Idris, A., Smith, B. K., & Keramati, A. (2020). Optimum Profit-Driven Churn Decision Making: Innovative Artificial Neural Networks in Telecom Industry. Neural Computing and Applications.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

🗺️ Explore: Which age groups send more SMS messages than make phone calls?
📊 Visualize: Create a plot visualizing the number of distinct phone calls by age group. Within the chart, differentiate between short, medium, and long calls (by the number of seconds).
🔎 Analyze: Are there significant differences in the length of phone calls between tariff plans?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You have just been hired by a telecom company. A competitor has recently entered the market and is offering an attractive plan to new customers. The telecom company is worried that this competitor may start attracting its customers.

You have access to a dataset of the company's customers, including whether customers churned. The telecom company wants to know whether you can use this data to predict whether a customer will churn. They also want to know what factors increase the probability that a customer churns.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
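
A simple first step toward the "what factors increase churn" question is to compare churn rates across the levels of one candidate factor. The sketch below is exploratory only and is not a predictive model. The helper `churn_rate_by` and the `toy` frame are hypothetical; with the real data you would call it on `churn` with a column such as `Complaints`.

```python
import pandas as pd

def churn_rate_by(df, factor, label="Churn"):
    """Return customer count and churn rate for each level of `factor`."""
    summary = df.groupby(factor)[label].agg(customers="size", churn_rate="mean")
    # Highest churn rate first, so the riskiest level is at the top.
    return summary.sort_values("churn_rate", ascending=False)

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Complaints": [0, 0, 0, 0, 1, 1],
    "Churn":      [0, 0, 1, 0, 1, 1],
})
rates = churn_rate_by(toy, "Complaints")
print(rates)
```

Because `Churn` is coded 0/1, its mean within a group is that group's churn rate. Keeping the customer count next to the rate makes it easier to spot rates that rest on very few customers.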

list(churn.columns)

import pandas as pd

# Group by Age Group and calculate mean frequency of SMS and calls
age_group_stats = churn.groupby('Age Group').agg({
    'Frequency of SMS': 'mean',
    'Frequency of use': 'mean'
}).reset_index()

# Find age groups where average SMS frequency > call frequency
sms_heavy_groups = age_group_stats[age_group_stats['Frequency of SMS'] > age_group_stats['Frequency of use']]

print(sms_heavy_groups[['Age Group', 'Frequency of SMS', 'Frequency of use']])
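
Group means can be dominated by a few heavy users, so a complementary view is the share of customers in each age group who send more SMS messages than they make calls. This is a sketch under that assumption; the helper `sms_heavy_share` and the `toy` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def sms_heavy_share(df):
    """Share of customers per age group whose SMS count exceeds their call count."""
    sms_heavy = df["Frequency of SMS"] > df["Frequency of use"]
    # The mean of a boolean Series is the fraction of True values.
    return sms_heavy.groupby(df["Age Group"]).mean()

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Age Group": [1, 1, 2, 2],
    "Frequency of SMS": [50, 5, 0, 10],
    "Frequency of use": [10, 20, 30, 40],
})
share = sms_heavy_share(toy)
print(share)
```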

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

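# Note: 'Seconds of Use' is each customer's total call time, so these bins
# classify customers by total usage rather than by individual calls.
bins = [0, 60, 300, float('inf')]  # Short: <1min, Medium: 1-5min, Long: >5min
labels = ['Short (<1min)', 'Medium (1-5min)', 'Long (>5min)']

# Categorize customers by duration; include_lowest=True keeps rows with 0 seconds
churn['Call Duration'] = pd.cut(churn['Seconds of Use'], bins=bins, labels=labels, include_lowest=True)

# Count customers (rows) per Age Group and Call Duration
call_counts = churn.groupby(['Age Group', 'Call Duration'], observed=False).size().unstack().fillna(0)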

# Plot (Stacked Bar Chart)
call_counts.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Number of Customers by Age Group and Call Duration')
plt.xlabel('Age Group')
plt.ylabel('Number of Customers')
plt.legend(title='Call Duration')
plt.show()

# Alternative: Grouped Bar Chart (using Seaborn)
plt.figure(figsize=(10, 6))
sns.barplot(
    data=churn,
    x='Age Group',
    y='Distinct Called Numbers',  
    hue='Call Duration',
    estimator='sum'  
)
plt.title('Number of Distinct Calls by Age Group and Call Duration')
plt.ylabel('Total Distinct Calls')
plt.legend(title='Call Duration')
plt.show()

import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Option 1: One-Way ANOVA (if assumptions hold)
# Q("...") quotes column names that contain spaces
model = ols('Q("Seconds of Use") ~ C(Q("Tariff Plan"))', data=churn).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Check homogeneity of variances (Levene's test)
groups = [group['Seconds of Use'] for name, group in churn.groupby('Tariff Plan')]
p_value_levene = stats.levene(*groups)[1]
print(f"Levene's Test p-value: {p_value_levene}")

# Option 2: Kruskal-Wallis (if ANOVA assumptions fail)
stat, p_value_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis p-value: {p_value_kw}")
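
A p-value says whether a difference is detectable, not how large it is. Since `Tariff Plan` has only two levels, an effect size such as Cohen's d can be reported alongside the tests above. The sketch below computes d with the pooled sample standard deviation. The helper `cohens_d` and the `toy` frame are hypothetical additions, not part of the original analysis.

```python
import math
import pandas as pd

def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / math.sqrt(pooled_var)

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Tariff Plan": [1, 1, 1, 2, 2, 2],
    "Seconds of Use": [100, 200, 300, 400, 500, 600],
})
plan_1 = toy.loc[toy["Tariff Plan"] == 1, "Seconds of Use"]
plan_2 = toy.loc[toy["Tariff Plan"] == 2, "Seconds of Use"]
d = cohens_d(plan_1, plan_2)
print(f"Cohen's d (plan 1 vs plan 2): {d:.2f}")
```

The sign of d follows the argument order: a negative value here means plan 1 has the lower mean. The pooled standard deviation assumes roughly similar variances in both groups, which is the same assumption Levene's test checks.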
