knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(class.output = "code-background")
.code-background {
background-color: lightgreen;
border: 3px solid brown;
font-weight: bold;
}
library(tidyverse)
packages <- c("janitor", "skimr", "tidymodels", "ranger",
"xgboost", "tictoc", "vip")
packages %>%
walk(~ install.packages(.x))
library(janitor)
library(skimr)
library(tidymodels)
library(ranger)
library(xgboost)
library(tictoc)
library(vip)
data <- readr::read_csv('data/customer_churn.csv')
Telecom Customer Churn
This dataset comes from an Iranian telecom company, with each row representing a customer over a one-year period. Along with a churn label, it contains information on each customer's activity, such as call failures and subscription length.
Data Dictionary
Column | Explanation |
---|---|
Call Failure | number of call failures |
Complains | binary (0: No complaint, 1: complaint) |
Subscription Length | total months of subscription |
Charge Amount | ordinal attribute (0: lowest amount, 9: highest amount) |
Seconds of Use | total seconds of calls |
Frequency of use | total number of calls |
Frequency of SMS | total number of text messages |
Distinct Called Numbers | total number of distinct phone calls |
Age Group | ordinal attribute (1: younger age, 5: older age) |
Tariff Plan | binary (1: Pay as you go, 2: contractual) |
Status | binary (1: active, 2: non-active) |
Age | age of customer |
Customer Value | the calculated value of customer |
Churn | class label (1: churn, 0: non-churn) |
Source of dataset and source of dataset description.
Citation: Jafari-Marandi, R., Denton, J., Idris, A., Smith, B. K., & Keramati, A. (2020). Optimum Profit-Driven Churn Decision Making: Innovative Artificial Neural Networks in Telecom Industry. Neural Computing and Applications.
Research question
- Can we predict the probability of a customer churning based on attributes such as age, charge amount, subscription length and so on?
- This is a classic supervised binary classification problem
The plan
- We are going to use machine learning models such as logistic regression, decision trees, and random forest
- Tune their hyperparameters using grid search
- And determine the best model based on the ROC-AUC value
- All of this will be done using the tidymodels package
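As a rough illustration of the plan above, the following sketch tunes a single model (a decision tree) with a regular grid and picks the candidate with the best ROC-AUC. The synthetic `toy` data, the predictor names `x1` and `x2`, the 5 folds and the 3-level grid are all assumptions made for the example, not the settings used in this analysis.

```r
library(tidymodels)

set.seed(1)

# Synthetic stand-in for the churn data (x1 and x2 are hypothetical predictors)
toy <- tibble(x1 = rnorm(400), x2 = rnorm(400)) %>%
  mutate(churn = factor(ifelse(x1 + rnorm(400) > 1, "yes", "no"),
                        levels = c("yes", "no")))

# Stratified 5-fold cross-validation
toy_folds <- vfold_cv(toy, v = 5, strata = churn)

# Decision tree with two hyperparameters marked for tuning
tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%
  add_formula(churn ~ .) %>%
  add_model(tree_spec)

# Regular grid: 3 levels per parameter gives 9 candidate combinations
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 3)

# Evaluate every candidate on every fold, scoring by ROC-AUC only
tree_res <- tune_grid(
  tree_wf,
  resamples = toy_folds,
  grid = tree_grid,
  metrics = metric_set(roc_auc)
)

# Keep the combination with the highest mean ROC-AUC
best_tree <- select_best(tree_res, metric = "roc_auc")
final_wf <- finalize_workflow(tree_wf, best_tree)
```

The same pattern would apply to the other models by swapping the model specification and the grid; comparing the best ROC-AUC of each model then gives one way to choose among them.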
Data preparation
- We read the data in and turn all variable names into snake_case for easier coding
data <- data %>%
clean_names()
skim(data)
- Luckily, there are no missing values, so no imputation or deletion of rows is necessary
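The `skim()` output reports missing values per column. A more direct check could look like the minimal sketch below. The helper `count_missing()` and the toy data frame are hypothetical and only serve as illustration; on the real data one would call `count_missing(data)` and expect all zeros.

```r
# Count missing values per column; all zeros means nothing needs imputing
count_missing <- function(df) {
  colSums(is.na(df))
}

# Toy data frame (made-up values) with a single missing entry
toy <- data.frame(call_failure = c(8, 0, NA), churn = c(0, 1, 0))
missing_counts <- count_missing(toy)
```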
glimpse(data)
- Some of the categorical variables are coded as numerical values. We should transform their data types
data <- data %>%
mutate(complains = as.factor(complains),
charge_amount = factor(charge_amount, ordered = TRUE),
age_group = factor(age_group, ordered = TRUE),
tariff_plan = as.factor(tariff_plan),
status = as.factor(status),
churn = ifelse(churn == 1, "yes", "no"),
churn = as.factor(churn))
data %>%
group_by(churn) %>%
summarize(n_obs = n())
495 / (495 + 2655)
- The dataset is quite imbalanced, as only about 16 % of the rows (495 of 3,150) are churns
- Stratified sampling on churn does not remove this imbalance, but it keeps the class ratio the same in the training and testing sets when we create the splits
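The class shares can also be computed directly instead of by hand. This is a small sketch that rebuilds the two counts from the summary above (2,655 non-churners and 495 churners) in a tibble named `churn_counts`, which is an illustrative name; on the real data the same `mutate()` step could be appended to the `summarize()` call.

```r
library(dplyr)

# Class counts taken from the grouped summary above
churn_counts <- tibble(churn = c("no", "yes"), n_obs = c(2655, 495))

# Share of each class among all rows
churn_share <- churn_counts %>%
  mutate(share = n_obs / sum(n_obs))

# Accuracy of a trivial model that always predicts "no"
baseline_accuracy <- max(churn_share$share)
```

Because always predicting "no" would already be right for roughly 84 % of the rows, plain accuracy is a weak yardstick here, which is one reason to compare models on ROC-AUC.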
Workflow
Train-test split and cross validation
- 75 % of the data are assigned to the training set, the remaining 25 % to the testing set
set.seed(1)
data_split <- initial_split(data, prop = 0.75,
strata = churn)
data_train <- training(data_split)
data_test <- testing(data_split)
data_folds <- vfold_cv(data_train, v = 10)
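To see what stratification does, the sketch below splits a synthetic data set with a 16 % churn rate and compares the churn share in both parts. The `toy` tibble and its 160/840 class counts are assumptions chosen to resemble the real class ratio; the same check could be run on `data_train` and `data_test`.

```r
library(rsample)
library(dplyr)

set.seed(1)

# Synthetic data with 16 % churners (made-up counts, roughly the real ratio)
toy <- tibble(
  id = 1:1000,
  churn = factor(rep(c("yes", "no"), times = c(160, 840)))
)

# Stratified 75/25 split, as in the chunk above
toy_split <- initial_split(toy, prop = 0.75, strata = churn)

# Churn share in each part; both should stay close to 0.16
train_share <- mean(training(toy_split)$churn == "yes")
test_share <- mean(testing(toy_split)$churn == "yes")
```

Without `strata`, the two shares could drift apart by chance, which matters more the rarer the positive class is.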