Telecom Customer Churn
This dataset comes from an Iranian telecom company, with each row representing a customer over a one-year period. Along with a churn label, there is information on each customer's activity, such as call failures and subscription length.
Not sure where to begin? Scroll to the bottom to find challenges!
import pandas as pd
churn = pd.read_csv("data/customer_churn.csv")
print(churn.shape)
churn.head(100)
Data Dictionary
Column | Explanation |
---|---|
Call Failure | number of call failures |
Complaints | binary (0: No complaint, 1: complaint) |
Subscription Length | total months of subscription |
Charge Amount | ordinal attribute (0: lowest amount, 9: highest amount) |
Seconds of Use | total seconds of calls |
Frequency of use | total number of calls |
Frequency of SMS | total number of text messages |
Distinct Called Numbers | total number of distinct numbers called |
Age Group | ordinal attribute (1: younger age, 5: older age) |
Tariff Plan | binary (1: Pay as you go, 2: contractual) |
Status | binary (1: active, 2: non-active) |
Age | age of customer |
Customer Value | the calculated value of customer |
Churn | class label (1: churn, 0: non-churn) |
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- 🗺️ Explore: Which age groups send more SMS messages than make phone calls?
- 📊 Visualize: Create a plot visualizing the number of distinct phone calls by age group. Within the chart, differentiate between short, medium, and long calls (by the number of seconds).
- 🔎 Analyze: Are there significant differences in the length of phone calls between tariff plans?
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
You have just been hired by a telecom company. A competitor has recently entered the market and is offering an attractive plan to new customers. The telecom company is worried that this competitor may start attracting its customers.
You have access to a dataset of the company's customers, including whether customers churned. The telecom company wants to know whether you can use this data to predict whether a customer will churn. They also want to know what factors increase the probability that a customer churns.
You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
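As a possible first look at which factors go along with churn, the churn rate can be compared across the levels of a single column. The sketch below assumes the column names from the data dictionary; the helper `churn_rate_by` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def churn_rate_by(df: pd.DataFrame, factor: str) -> pd.Series:
    """Share of churned customers (Churn == 1) for each level of `factor`."""
    return df.groupby(factor)["Churn"].mean()

# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "Complaints": [0, 0, 0, 0, 1, 1],
    "Churn":      [0, 0, 0, 1, 1, 1],
})

rates = churn_rate_by(toy, "Complaints")
print(rates)
```

On the real data, the same call would be `churn_rate_by(churn, 'Complaints')`. A difference in rates is descriptive only and does not by itself show that the factor causes churn.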
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
churn.info()
print(churn['Age'].value_counts())
# Age groups of "texters": customers who sent more SMS messages than they made phone calls
texters = churn.loc[churn['Frequency of SMS'] > churn['Frequency of use']]
print(texters['Age Group'].value_counts(ascending = False))
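Raw counts of texters depend on how many customers each age group has. A minimal sketch of one alternative, the share of texters within each age group, is shown below. The helper name `texter_share_by_age_group` and the toy rows are hypothetical; the column names follow the data dictionary.

```python
import pandas as pd

def texter_share_by_age_group(df: pd.DataFrame) -> pd.Series:
    """Fraction of customers in each age group who send more SMS than they make calls."""
    is_texter = df["Frequency of SMS"] > df["Frequency of use"]
    # The mean of a boolean column is the fraction of True values.
    return is_texter.groupby(df["Age Group"]).mean().sort_values(ascending=False)

# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "Age Group":        [1, 1, 2, 2, 2, 3],
    "Frequency of SMS": [50, 5, 30, 40, 2, 0],
    "Frequency of use": [10, 20, 10, 10, 60, 15],
})

share = texter_share_by_age_group(toy)
print(share)
```

On the real data this would be called as `texter_share_by_age_group(churn)`.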
# Visualize
# Categorize total call time ('Seconds of Use') into three brackets, split at the 33rd and 66th percentiles
call_length = ['short', 'medium', 'long']
seconds = churn['Seconds of Use']
call_brackets = [0, seconds.quantile(0.33), seconds.quantile(0.66), seconds.max()]
churn['Length of Call'] = pd.cut(seconds, labels = call_length, bins = call_brackets)
# pd.cut excludes the left edge of the first bin by default, so customers with
# 0 seconds of use get NaN here and are removed by the dropna below
churn.dropna(subset = ['Length of Call'], inplace = True)
churn.info()
churn
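One detail of `pd.cut` is worth checking at this step: with default arguments the first bin excludes its left edge, so a value equal to the lowest bin edge (here 0) is left unlabelled. The sketch below uses made-up values to show the effect, and `include_lowest=True` as one possible way to keep such rows. Whether zero-usage customers should be kept is an analysis decision, not something the source settles.

```python
import pandas as pd

# Made-up values for illustration only; two customers have zero seconds of use.
seconds = pd.Series([0, 0, 100, 500, 900, 1500, 4000], name="Seconds of Use")
labels = ["short", "medium", "long"]
bins = [0, seconds.quantile(0.33), seconds.quantile(0.66), seconds.max()]

# Default behaviour: intervals are (left, right], so 0 falls outside the first bin.
default_cut = pd.cut(seconds, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is labelled.
inclusive_cut = pd.cut(seconds, bins=bins, labels=labels, include_lowest=True)

print("unlabelled with defaults:", default_cut.isna().sum())
print("unlabelled with include_lowest=True:", inclusive_cut.isna().sum())
```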
# Distinct called numbers by age group, split by call-time bracket
# (bar heights show the mean per group, which is the catplot default)
sns.catplot(data = churn, kind = 'bar', x = 'Age Group', y = 'Distinct Called Numbers', hue = 'Length of Call')
plt.show()
churn.isna().sum()
# Analyze
print(churn['Tariff Plan'].value_counts())
sns.catplot(data = churn, kind = 'count', x = 'Length of Call', col = 'Tariff Plan')
plt.show()
# In the plots, most Tariff Plan 2 users fall in the 'long' bracket, while most Tariff Plan 1 users fall in the 'medium' bracket.
sns.histplot(data = churn, x = 'Seconds of Use', hue = 'Tariff Plan', multiple = 'stack')
plt.show()
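According to the data dictionary, `Seconds of Use` is the total call time and `Frequency of use` is the total number of calls, so the brackets above describe total usage. If the length of an individual call is of interest, one rough approximation is the ratio of the two columns. The sketch below assumes that ratio is meaningful; the helper `add_avg_call_seconds`, the new column name `Avg Call Seconds`, and the toy rows are all hypothetical.

```python
import numpy as np
import pandas as pd

def add_avg_call_seconds(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with an approximate average duration per call."""
    out = df.copy()
    # Replace zero call counts with NaN to avoid dividing by zero.
    calls = out["Frequency of use"].replace(0, np.nan)
    out["Avg Call Seconds"] = out["Seconds of Use"] / calls
    return out

# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "Tariff Plan":      [1, 1, 1, 2, 2],
    "Seconds of Use":   [600, 1200, 0, 3000, 900],
    "Frequency of use": [10, 30, 0, 25, 10],
})

per_call = add_avg_call_seconds(toy)
by_plan = per_call.groupby("Tariff Plan")["Avg Call Seconds"].mean()
print(by_plan)
```

Customers with no calls end up with a missing average and are skipped by `mean()`.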
# Further analysis: summary statistics of seconds of use per tariff plan
sum_stats = churn.groupby('Tariff Plan').agg(
    call_length_mean = ('Seconds of Use', 'mean'),
    call_length_std = ('Seconds of Use', 'std'),
    call_length_median = ('Seconds of Use', 'median'),
).round(2)
sum_stats
# According to the summary statistics, Tariff Plan 2 customers have more seconds of use than Tariff Plan 1 customers.
# Summary statistics alone are not enough to conclude that the difference is significant, so a t-test is used next.
# T-test
from scipy.stats import ttest_ind
tariff_1 = churn.loc[churn['Tariff Plan'] == 1, 'Seconds of Use']
tariff_2 = churn.loc[churn['Tariff Plan'] == 2, 'Seconds of Use']
t_stat, p_value = ttest_ind(tariff_1, tariff_2, equal_var=False) # Welch's t-test
print("t-stat:", t_stat, "p-value:", p_value)
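# Significant difference: the p-value is below the conventional 0.05 threshold, so the mean seconds of use differ between the two tariff plans.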
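A significant p-value says little about how large the difference is, especially with many rows. One common complement is an effect size such as Cohen's d. The sketch below is a plain NumPy version using the pooled standard deviation, which assumes roughly similar variances in both groups (a stronger assumption than Welch's t-test makes). The function name `cohens_d` and the example numbers are hypothetical.

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Made-up numbers for illustration only.
group_a = [2.0, 4.0, 6.0]
group_b = [5.0, 7.0, 9.0]
d = cohens_d(group_a, group_b)
print("Cohen's d:", d)
```

With the notebook's variables, the call would be `cohens_d(tariff_1, tariff_2)`; a negative value means the first group has the lower mean.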