[R] Supervised ML: Predicting Telecom Customer Churn

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(class.output = "code-background")

.code-background {
  background-color: lightgreen;
  border: 3px solid brown;
  font-weight: bold;
}

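library(tidyverse)
packages <- c("janitor", "skimr", "tidymodels", "ranger",
  "xgboost", "tictoc", "vip")

# Install only the packages that are not already available
packages %>%
  setdiff(rownames(installed.packages())) %>%
  walk(~ install.packages(.x))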

library(janitor)
library(skimr)
library(tidymodels)
library(ranger)
library(xgboost)
library(tictoc)
library(vip)

data <- readr::read_csv('data/customer_churn.csv')

Telecom Customer Churn

This dataset comes from an Iranian telecom company, with each row representing one customer over a one-year period. Along with a churn label, it contains information on the customers' activity, such as call failures and subscription length.

Data Dictionary

Column	Explanation
Call Failure	number of call failures
Complains	binary (0: No complaint, 1: complaint)
Subscription Length	total months of subscription
Charge Amount	ordinal attribute (0: lowest amount, 9: highest amount)
Seconds of Use	total seconds of calls
Frequency of use	total number of calls
Frequency of SMS	total number of text messages
Distinct Called Numbers	total number of distinct phone calls
Age Group	ordinal attribute (1: younger age, 5: older age)
Tariff Plan	binary (1: Pay as you go, 2: contractual)
Status	binary (1: active, 2: non-active)
Age	age of customer
Customer Value	the calculated value of customer
Churn	class label (1: churn, 0: non-churn)

Source of dataset and source of dataset description.

Citation: Jafari-Marandi, R., Denton, J., Idris, A., Smith, B. K., & Keramati, A. (2020). Optimum Profit-Driven Churn Decision Making: Innovative Artificial Neural Networks in Telecom Industry. Neural Computing and Applications.

Research question

Can we predict the probability that a customer churns from attributes such as age, charge amount, and subscription length?
This is a classic supervised machine learning problem: binary classification

The plan

We are going to
- Fit machine learning models such as logistic regression, decision trees, and random forests
- Tune their hyperparameters using grid search
- Determine the best model based on the ROC-AUC value
All of this will be done using the tidymodels package

Data preparation

We read the data in and turn all variable names into snake_case for easier coding

data <- data  %>% 
  clean_names()

skim(data)

Luckily, there are no missing values, so no imputation or deletion of rows is necessary
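A quick way to double-check the `skim()` summary is to count the missing values per column directly. The sketch below uses a small made-up tibble (`toy`, not the churn data) so that it runs on its own. Applied to `data` instead, every count should be zero if the statement above holds.

```r
library(dplyr)

# Made-up example data with two missing values (not the churn dataset)
toy <- tibble(
  calls = c(10, NA, 7),
  plan  = c("a", "b", NA),
  id    = 1:3
)

# Count the missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```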

glimpse(data)

Some of the categorical variables are coded as numerical values. We should transform their data types

data <- data %>% 
  mutate(complains = as.factor(complains),
         charge_amount = factor(charge_amount, ordered = TRUE),
         age_group = factor(age_group, ordered = TRUE),
         tariff_plan = as.factor(tariff_plan),
         status = as.factor(status),
         churn = ifelse(churn == 1, "yes", "no"),
         churn = as.factor(churn))
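
Note that `as.factor()` sorts the levels alphabetically, so `"no"` becomes the first level of `churn`. This matters later because yardstick metrics treat the first level as the event of interest by default (`event_level = "first"`). A minimal sketch with a made-up vector (`churn_raw` is hypothetical, not a column of the dataset) shows the resulting order and one possible way to put `"yes"` first:

```r
# Made-up 0/1 labels standing in for the raw churn column
churn_raw <- c(1, 0, 0, 1, 0)

# Same conversion as above: levels are sorted alphabetically
churn_default <- as.factor(ifelse(churn_raw == 1, "yes", "no"))
levels(churn_default)

# Alternative: set the level order explicitly so that "yes" comes first
churn_yes_first <- factor(ifelse(churn_raw == 1, "yes", "no"),
                          levels = c("yes", "no"))
levels(churn_yes_first)
```

Whether to reorder the levels or to pass `event_level = "second"` to the metric functions is a design choice. Either works as long as it is applied consistently.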

data %>% 
  group_by(churn) %>% 
  summarize(n_obs = n())

495 / (495 + 2655)

The dataset is imbalanced: only about 16 % of the rows are churners (495 of 3,150)
We account for this imbalance by using stratified sampling when we create the train/test split, so that both sets keep roughly the same churn rate
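
The class shares can also be computed in one step instead of by hand. The sketch below defines a small helper, `class_share()` (a hypothetical name, not part of any package), and runs it on a made-up tibble `toy` that only reproduces the class counts reported above.

```r
library(dplyr)

# Hypothetical helper: number and share of observations per class
class_share <- function(df, outcome) {
  df %>%
    count({{ outcome }}, name = "n_obs") %>%
    mutate(share = n_obs / sum(n_obs))
}

# Toy stand-in with the same class counts as the churn data
toy <- tibble(churn = factor(rep(c("no", "yes"), times = c(2655, 495))))

class_share(toy, churn)
```

In the notebook itself, the equivalent call would be `class_share(data, churn)`.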

Workflow

Train-test split and cross validation

75 % of the data are assigned to the training set and the other 25 % to the testing set, stratified by churn

set.seed(1)
data_split <- initial_split(data, prop = 0.75,
                            strata = churn)

data_train <- training(data_split)
data_test <- testing(data_split)
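
To see what `strata = churn` does, it can help to compare the churn share in both parts of a split. The sketch below is self-contained and uses a made-up tibble `toy` with the same class counts as the churn data. The names `toy`, `toy_split`, and `split_shares` are illustrative only.

```r
library(rsample)
library(dplyr)

set.seed(1)

# Toy stand-in with the same class counts as the churn data
toy <- tibble(
  id    = 1:3150,
  churn = factor(rep(c("no", "yes"), times = c(2655, 495)))
)

# Stratified 75/25 split, as in the chunk above
toy_split <- initial_split(toy, prop = 0.75, strata = churn)

# Share of each class within the training and the testing set
split_shares <- bind_rows(
  train = training(toy_split),
  test  = testing(toy_split),
  .id   = "set"
) %>%
  count(set, churn) %>%
  group_by(set) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

split_shares
```

The share of `"yes"` should come out close to the overall churn rate in both sets. Running the same `count()` pipeline on `data_train` and `data_test` gives the corresponding check for the real split.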


# 10-fold cross-validation on the training set (the number of folds is set with `v`)
data_folds <- vfold_cv(data_train, v = 10)
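
Because churners are a minority, one option (a suggestion here, not what the chunk above does) is to stratify the folds as well, using the `strata` argument of `vfold_cv()`. The sketch below uses a made-up training set `toy_train` and checks the churn share in the assessment part of each fold.

```r
library(rsample)
library(dplyr)
library(purrr)

set.seed(1)

# Toy training set with roughly the class mix of the churn data
toy_train <- tibble(
  churn = factor(rep(c("no", "yes"), times = c(1991, 371)))
)

# 10 folds, stratified by the outcome
toy_folds <- vfold_cv(toy_train, v = 10, strata = churn)

# Share of churners in the held-out part of every fold
fold_shares <- map_dbl(
  toy_folds$splits,
  ~ mean(assessment(.x)$churn == "yes")
)

fold_shares
```

With stratification, the shares should be nearly identical across folds, which keeps the per-fold ROC-AUC estimates comparable.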
