Telecom Customer Churn
This dataset comes from an Iranian telecom company, with each row representing one customer over a one-year period. Along with a churn label, it contains information on each customer's activity, such as call failures and subscription length.
Not sure where to begin? Scroll to the bottom to find challenges!
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier, AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
df = pd.read_csv("data/customer_churn.csv")
print(df.shape)
df.head(100)
Data Dictionary
| Column | Explanation | 
|---|---|
| Call Failure | number of call failures | 
| Complaints | binary (0: No complaint, 1: complaint) | 
| Subscription Length | total months of subscription | 
| Charge Amount | ordinal attribute (0: lowest amount, 9: highest amount) | 
| Seconds of Use | total seconds of calls | 
| Frequency of use | total number of calls | 
| Frequency of SMS | total number of text messages | 
| Distinct Called Numbers | total number of distinct phone calls | 
| Age Group | ordinal attribute (1: younger age, 5: older age) | 
| Tariff Plan | binary (1: Pay as you go, 2: contractual) | 
| Status | binary (1: active, 2: non-active) | 
| Age | age of customer | 
| Customer Value | the calculated value of customer | 
| Churn | class label (1: churn, 0: non-churn) | 
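Several columns in the dictionary above are integer codes rather than quantities. As a minimal sketch (the `CODE_LABELS` mapping and the `decode` helper are illustrative names, not part of the original notebook), the codes could be turned into readable labels before plotting or reporting:

```python
import pandas as pd

# Code-to-label mappings taken from the data dictionary above.
CODE_LABELS = {
    "Complaints": {0: "No complaint", 1: "Complaint"},
    "Tariff Plan": {1: "Pay as you go", 2: "Contractual"},
    "Status": {1: "Active", 2: "Non-active"},
    "Churn": {0: "Non-churn", 1: "Churn"},
}


def decode(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with coded columns replaced by their text labels."""
    out = data.copy()
    for col, mapping in CODE_LABELS.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out


# Tiny hand-made example (not rows from the real dataset).
toy = pd.DataFrame({"Tariff Plan": [1, 2, 1], "Churn": [0, 1, 0]})
decoded = decode(toy)
print(decoded)
```

Keeping the original numeric columns for modelling and decoding only a copy for display avoids accidentally feeding strings to an estimator.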
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- Explore: Which age groups send more SMS messages than make phone calls?
- Visualize: Create a plot visualizing the number of distinct phone calls by age group. Within the chart, differentiate between short, medium, and long calls (by the number of seconds).
- Analyze: Are there significant differences in the length of phone calls between different tariff plans?
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
You have just been hired by a telecom company. A competitor has recently entered the market and is offering an attractive plan to new customers. The telecom company is worried that this competitor may start attracting its customers.
You have access to a dataset of the company's customers, including whether customers churned. The telecom company wants to know whether you can use this data to predict whether a customer will churn. They also want to know what factors increase the probability that a customer churns.
You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
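One possible starting point for the scenario is a simple, interpretable baseline. The sketch below is an assumption about how one might begin, not a prescribed solution: it fits a scaled logistic regression, reports a hold-out ROC AUC, and lists the standardized coefficients as a rough indication of which factors push churn probability up or down. The `fit_baseline` helper and the synthetic `toy` data are illustrative; with the real data, `df` would be passed in instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_baseline(data: pd.DataFrame, target: str = "Churn"):
    """Fit a scaled logistic regression; return (test ROC AUC, coefficients)."""
    X = data.drop(columns=[target]).select_dtypes(include="number")
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    # Coefficients are on standardized features, so their sizes are comparable.
    coefs = pd.Series(
        model.named_steps["logisticregression"].coef_[0], index=X.columns
    ).sort_values(ascending=False)
    return auc, coefs


# Synthetic stand-in data (NOT the real dataset): churn is made more likely
# by complaints and less likely by a longer subscription.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Complaints": rng.integers(0, 2, n),
    "Call Failure": rng.poisson(5, n),
    "Subscription Length": rng.integers(3, 48, n),
})
logit = -2.0 + 2.5 * toy["Complaints"] - 0.03 * toy["Subscription Length"]
toy["Churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

auc, coefs = fit_baseline(toy)
print(f"Test ROC AUC: {auc:.3f}")
print(coefs)
```

A positive coefficient indicates a feature associated with higher churn probability. Coefficients describe association in this model, not causation, which is worth stating in a report aimed at a broad audience.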
df.isna().sum()
# Which age groups send more SMS messages than make phone calls?
age_groups = df.groupby('Age Group')
avg_sms = age_groups['Frequency of SMS'].mean()
avg_call = age_groups['Frequency of use'].mean()
more_sms = avg_sms > avg_call
print(more_sms)
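The boolean series printed above still has to be read by eye. A small sketch of how the matching age groups could be extracted directly, using a hand-made `toy` frame in place of the real data (the values are invented for illustration):

```python
import pandas as pd

# Hand-made example data (not rows from the real dataset).
toy = pd.DataFrame({
    "Age Group": [1, 1, 2, 2, 3, 3],
    "Frequency of SMS": [80, 100, 20, 30, 5, 15],
    "Frequency of use": [40, 60, 50, 70, 60, 80],
})

# Mean SMS count and mean call count per age group, side by side.
means = toy.groupby("Age Group")[["Frequency of SMS", "Frequency of use"]].mean()

# Keep only the age groups whose mean SMS count exceeds the mean call count.
sms_heavy = means.index[means["Frequency of SMS"] > means["Frequency of use"]].tolist()
print(sms_heavy)
```

Comparing means is one reasonable reading of the question; comparing medians, or the share of customers in each group who text more than they call, would be equally defensible.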
# Group the data by age group and calculate the mean number of distinct called numbers
distinct_calls_by_age = df.groupby('Age Group')['Distinct Called Numbers'].mean()
# Create a new column to categorize the calls as short, medium, or long based on the number of seconds
# include_lowest=True keeps customers with 0 seconds of use in the 'Short' bin instead of turning them into NaN
df['Call Duration Category'] = pd.cut(df['Seconds of Use'], bins=[0, 60, 300, float('inf')], labels=['Short', 'Medium', 'Long'], include_lowest=True)
# Group the data by age group and call duration category, and calculate the median number of distinct called numbers
distinct_calls_by_age_duration = df.groupby(['Age Group', 'Call Duration Category'], observed=False)['Distinct Called Numbers'].median()
# Reshape the data to have age groups as rows and call duration categories as columns
distinct_calls_by_age_duration = distinct_calls_by_age_duration.unstack()
# Plot the data
distinct_calls_by_age_duration.plot(kind='bar', stacked=True)
# Set the labels and title
plt.xlabel('Age Group')
plt.ylabel('Median Distinct Called Numbers')
plt.title('Median Distinct Called Numbers by Age Group and Call Duration')
# Show the plot
plt.show()
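Because `Seconds of Use` is a yearly total per customer, fixed cut points such as 60 and 300 seconds may place almost every customer in the top bin. One alternative, sketched here on synthetic data with an illustrative `Usage Tercile` column name, is to let `pd.qcut` split customers into three equally sized groups:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "Age Group": rng.integers(1, 6, 300),
    "Seconds of Use": rng.gamma(2.0, 2000.0, 300).round(),
    "Distinct Called Numbers": rng.integers(0, 60, 300),
})

# Quantile-based bins: each label covers roughly one third of the customers.
toy["Usage Tercile"] = pd.qcut(
    toy["Seconds of Use"], q=3, labels=["Short", "Medium", "Long"]
)

# Mean distinct called numbers per age group and usage tercile.
table = (
    toy.groupby(["Age Group", "Usage Tercile"], observed=False)["Distinct Called Numbers"]
    .mean()
    .unstack()
)
print(table.round(1))
```

The resulting table can be passed to `table.plot(kind="bar")` for a grouped bar chart. Grouped rather than stacked bars are usually easier to read here, because the values are averages and do not add up to a meaningful total.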
# Group the data by tariff plan and calculate the mean call duration
mean_call_duration_by_tariff = df.groupby('Tariff Plan')['Seconds of Use'].mean()
# Plot the mean call duration for each tariff plan
sns.barplot(x=mean_call_duration_by_tariff.index, y=mean_call_duration_by_tariff.values)
# Set the labels and title
plt.xlabel('Tariff Plan')
plt.ylabel('Mean Call Duration (seconds)')
plt.title('Mean Call Duration by Tariff Plan')
# Show the plot
plt.show()
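A bar chart of means shows the size of the gap but not whether it is statistically significant. As a hedged sketch, the two tariff plans could be compared with a Mann-Whitney U test (no normality assumption) and Welch's t-test (unequal variances allowed). The choice of tests is an assumption, and the data below is synthetic:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the real dataset): plan 2 is given higher usage.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Tariff Plan": np.repeat([1, 2], 200),
    "Seconds of Use": np.concatenate([
        rng.gamma(shape=2.0, scale=2000.0, size=200),
        rng.gamma(shape=2.0, scale=3000.0, size=200),
    ]),
})

plan1 = toy.loc[toy["Tariff Plan"] == 1, "Seconds of Use"]
plan2 = toy.loc[toy["Tariff Plan"] == 2, "Seconds of Use"]

# Rank-based test: robust to the skew that is typical of usage data.
u_stat, p_mw = stats.mannwhitneyu(plan1, plan2, alternative="two-sided")
# Welch's t-test: compares the means without assuming equal variances.
t_stat, p_t = stats.ttest_ind(plan1, plan2, equal_var=False)

print(f"Mann-Whitney U p-value: {p_mw:.4g}")
print(f"Welch t-test p-value:   {p_t:.4g}")
```

With the real data, `df` would replace `toy`. Since `Seconds of Use` is a total per customer, dividing it by `Frequency of use` (for customers with at least one call) would be closer to an average call length.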
def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
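    # Restrict to numeric columns: the categorical 'Call Duration Category' column has no quantiles
    print(dataframe.select_dtypes(include="number").quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)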
    
check_df(df)
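The helpers `grab_col_names`, `cat_summary`, and `correlation_matrix` used in the following cells are not defined anywhere in the visible text. The sketch below is a hypothetical reconstruction based only on how they are called; the actual definitions may differ, and the original `correlation_matrix` may well draw a heatmap rather than print a table.

```python
import pandas as pd


def grab_col_names(dataframe, cat_th=10, car_th=20):
    """Split columns into categorical, numerical, and high-cardinality lists.

    Assumed rules: numeric columns with fewer than `cat_th` unique values are
    treated as categorical; non-numeric columns with more than `car_th` unique
    values are treated as categorical but cardinal.
    """
    nunique = dataframe.nunique()
    numeric = [c for c in dataframe.columns
               if pd.api.types.is_numeric_dtype(dataframe[c])]
    non_numeric = [c for c in dataframe.columns if c not in numeric]
    num_but_cat = [c for c in numeric if nunique[c] < cat_th]
    cat_but_car = [c for c in non_numeric if nunique[c] > car_th]
    cat_cols = [c for c in non_numeric + num_but_cat if c not in cat_but_car]
    num_cols = [c for c in numeric if c not in num_but_cat]
    return cat_cols, num_cols, cat_but_car


def cat_summary(dataframe, col_name):
    """Print and return counts and percentage share for one categorical column."""
    counts = dataframe[col_name].value_counts()
    summary = pd.DataFrame({col_name: counts, "Ratio": 100 * counts / len(dataframe)})
    print(summary)
    return summary


def correlation_matrix(dataframe, cols):
    """Print and return the Pearson correlation table for the given columns."""
    corr = dataframe[cols].corr()
    print(corr.round(2))
    return corr


# Hand-made example (not rows from the real dataset).
toy = pd.DataFrame({
    "Complaints": [0, 1, 0, 0, 1, 0],
    "Seconds of Use": [4370, 318, 2453, 4198, 2393, 3775],
    "Call Duration Category": ["Long", "Medium", "Long", "Long", "Medium", "Long"],
})
cat_cols, num_cols, cat_but_car = grab_col_names(toy, cat_th=5, car_th=20)
print(cat_cols, num_cols, cat_but_car)
```

Under these assumed rules, coded columns such as `Complaints` or `Tariff Plan` end up in `cat_cols` even though they are stored as integers, which matches how the data dictionary describes them.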
cat_cols, num_cols, cat_but_car = grab_col_names(df, cat_th=5, car_th=20)
# checking the categorical cols
for col in cat_cols:
    cat_summary(df, col)
# checking the numerical cols
df[num_cols].describe().T
# correlation in numerical variables
correlation_matrix(df, num_cols)