knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(class.output = "code-background")
.code-background {
  background-color: lightgreen;
  border: 3px solid brown;
  font-weight: bold;
}
library(tidyverse)
packages <- c("janitor", "skimr", "tidymodels", "ranger",
  "xgboost", "tictoc", "vip")

# Install only the packages that are not yet available
packages %>%
  setdiff(rownames(installed.packages())) %>%
  walk(~ install.packages(.x))

library(janitor)
library(skimr)
library(tidymodels)
library(ranger)
library(xgboost)
library(tictoc)
library(vip)

data <- readr::read_csv('data/customer_churn.csv')

Telecom Customer Churn

This dataset comes from an Iranian telecom company, with each row representing a customer over a one-year period. Along with a churn label, it contains information on the customers' activity, such as call failures and subscription length.

Data Dictionary

| Column | Explanation |
|---|---|
| Call Failure | number of call failures |
| Complains | binary (0: No complaint, 1: complaint) |
| Subscription Length | total months of subscription |
| Charge Amount | ordinal attribute (0: lowest amount, 9: highest amount) |
| Seconds of Use | total seconds of calls |
| Frequency of use | total number of calls |
| Frequency of SMS | total number of text messages |
| Distinct Called Numbers | total number of distinct phone calls |
| Age Group | ordinal attribute (1: younger age, 5: older age) |
| Tariff Plan | binary (1: Pay as you go, 2: contractual) |
| Status | binary (1: active, 2: non-active) |
| Age | age of customer |
| Customer Value | the calculated value of customer |
| Churn | class label (1: churn, 0: non-churn) |

Source of dataset and source of dataset description.

Citation: Jafari-Marandi, R., Denton, J., Idris, A., Smith, B. K., & Keramati, A. (2020). Optimum Profit-Driven Churn Decision Making: Innovative Artificial Neural Networks in Telecom Industry. Neural Computing and Applications.

Research question

  • Can we predict the probability of a customer churning based on attributes such as age, charge amount, subscription length and so on?

  • This is a classic supervised machine learning problem: binary classification

The plan

  • We are going to use machine learning models such as logistic regression, decision trees, and random forest
    • Tune their hyperparameters using grid search
      • And determine the best model based on the ROC-AUC value
  • All of this will be done using the tidymodels package
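
The plan above can be sketched end to end on simulated data. This is only a minimal illustration, not the analysis itself: the toy data frame `toy`, its columns `x1` and `x2`, the choice of a decision tree, and the three grid values are all assumptions made for the example.

```r
library(tidymodels)

set.seed(1)

# Hypothetical toy data: two numeric predictors and a binary outcome
n <- 400
toy <- tibble(x1 = rnorm(n), x2 = rnorm(n)) %>%
  mutate(churn = factor(ifelse(x1 + rnorm(n) > 1, "yes", "no"),
                        levels = c("yes", "no")))

# 5-fold cross-validation, stratified by the outcome
toy_folds <- vfold_cv(toy, v = 5, strata = churn)

# Decision tree with one hyperparameter marked for tuning
tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%
  add_formula(churn ~ x1 + x2) %>%
  add_model(tree_spec)

# Grid search over a small, arbitrary grid, scored by ROC-AUC
tree_grid <- tibble(cost_complexity = c(0.001, 0.01, 0.1))

tree_res <- tune_grid(
  tree_wf,
  resamples = toy_folds,
  grid = tree_grid,
  metrics = metric_set(roc_auc)
)

# One row per grid candidate, then the candidate with the highest ROC-AUC
collect_metrics(tree_res)
best <- select_best(tree_res, metric = "roc_auc")
best
```

The same pattern (model specification, workflow, `tune_grid()`, `select_best()`) carries over to other model types by swapping the specification.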

Data preparation

  • We read the data in and turn all variable names into snake_case for easier coding
data <- data  %>% 
  clean_names()
skim(data)
  • Luckily, there are no missing values, so no imputation or deletion of rows is necessary
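
A quick base R cross-check of the missing-value count could look like the sketch below. The data frame `toy` is a hypothetical stand-in with one deliberately missing value; on the real data the same calls would be applied to `data`.

```r
# Hypothetical toy data frame with one missing value in column `a`
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# TRUE if there is at least one missing value anywhere
any_missing <- anyNA(toy)
any_missing
```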
glimpse(data)
  • Some of the categorical variables are coded as numerical values. We should transform their data types
data <- data %>% 
  mutate(complains = as.factor(complains),
         charge_amount = factor(charge_amount, ordered = TRUE),
         age_group = factor(age_group, ordered = TRUE),
         tariff_plan = as.factor(tariff_plan),
         status = as.factor(status),
         churn = ifelse(churn == 1, "yes", "no"),
         churn = as.factor(churn))
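
To illustrate the difference between the two conversions used here, the following sketch compares an unordered and an ordered factor. The vector `x` is a hypothetical stand-in for a column such as `charge_amount`.

```r
# Hypothetical values standing in for an ordinal column
x <- c(0, 2, 1, 9, 2)

unordered_x <- as.factor(x)
ordered_x <- factor(x, ordered = TRUE)

# Levels are sorted from the numeric values in both cases
levels(ordered_x)

# Only the second version records that the levels have an order
is.ordered(unordered_x)
is.ordered(ordered_x)

# Ordered factors support comparisons between levels
first_below_fourth <- ordered_x[1] < ordered_x[4]
first_below_fourth
```

Unordered factors are suitable for nominal variables such as `tariff_plan`, while ordered factors keep the ranking of variables such as `age_group`.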
data %>% 
  group_by(churn) %>% 
  summarize(n_obs = n())

495 / (495 + 2655)
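  • The dataset is fairly imbalanced, as only about 16 % of the rows (495 of 3,150) are churns
  • Stratified sampling does not remove this imbalance, but it keeps the share of churners the same in the training and testing sets, so we use it when we create the train/test split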
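
The effect of stratification can be checked with a small sketch. It rebuilds only the class label from the counts shown above (495 churners, 2,655 non-churners); the object names `labels`, `split`, `prop_train`, and `prop_test` are made up for this example.

```r
library(dplyr)
library(rsample)

# Rebuild only the class label from the reported counts
labels <- tibble(churn = factor(rep(c("yes", "no"), times = c(495, 2655))))

# Share of each class in the full data
labels %>%
  count(churn) %>%
  mutate(prop = n / sum(n))

# A stratified 75/25 split keeps that share nearly identical in both parts
set.seed(1)
split <- initial_split(labels, prop = 0.75, strata = churn)

prop_train <- mean(training(split)$churn == "yes")
prop_test <- mean(testing(split)$churn == "yes")

c(train = prop_train, test = prop_test)
```

Both shares should stay close to 495 / 3150; the classes remain imbalanced, they are just represented equally in each part.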

Workflow

Train-test split and cross validation

  • 75 % of the data get assigned to the training set, the other 25 % to the testing set
set.seed(1)
data_split <- initial_split(data, prop = 0.75,
                            strata = churn)

data_train <- training(data_split)
data_test <- testing(data_split)


# vfold_cv() takes the number of folds as `v`, not `k`
data_folds <- vfold_cv(data_train, v = 10)
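
The fold structure can be inspected as in the sketch below. The data frame `toy_train` is a hypothetical stand-in for `data_train` with a similar churn share, and `strata = churn` is an optional addition that is not part of the call above; it is shown here only as an assumption about how the folds could also be stratified.

```r
library(rsample)

set.seed(1)

# Hypothetical stand-in for data_train: 1000 rows, about 16 % churners
toy_train <- data.frame(
  id = 1:1000,
  churn = factor(rep(c("yes", "no"), times = c(157, 843)))
)

# 10-fold cross-validation, stratified by the outcome
folds <- vfold_cv(toy_train, v = 10, strata = churn)

# Each fold holds out roughly one tenth of the rows for assessment
assess_sizes <- vapply(folds$splits,
                       function(s) nrow(assessment(s)),
                       integer(1))

# Share of churners in each held-out part
churn_share <- vapply(folds$splits,
                      function(s) mean(assessment(s)$churn == "yes"),
                      numeric(1))

assess_sizes
round(churn_share, 3)
```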