[R] Supervised ML: Predicting Telecom Customer Churn
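  • knitr::opts_chunk$set(echo = TRUE)
    knitr::opts_chunk$set(class.output = "code-background")
    # Packages used in this analysis
    packages <- c("janitor", "skimr", "tidymodels", "ranger",
      "xgboost", "tictoc", "vip")
    # Install only the packages that are not yet available, then attach them all
    to_install <- setdiff(packages, rownames(installed.packages()))
    if (length(to_install) > 0) install.packages(to_install)
    invisible(lapply(packages, library, character.only = TRUE))
    data <- readr::read_csv('data/customer_churn.csv')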

    Telecom Customer Churn

    This dataset comes from an Iranian telecom company, with each row representing one customer over a one-year period. Along with a churn label, it contains information on the customers' activity, such as call failures and subscription length.

    Data Dictionary

    Call Failure: number of call failures
    Complains: binary (0: no complaint, 1: complaint)
    Subscription Length: total months of subscription
    Charge Amount: ordinal attribute (0: lowest amount, 9: highest amount)
    Seconds of Use: total seconds of calls
    Frequency of use: total number of calls
    Frequency of SMS: total number of text messages
    Distinct Called Numbers: total number of distinct phone calls
    Age Group: ordinal attribute (1: younger age, 5: older age)
    Tariff Plan: binary (1: pay as you go, 2: contractual)
    Status: binary (1: active, 2: non-active)
    Age: age of customer
    Customer Value: the calculated value of customer
    Churn: class label (1: churn, 0: non-churn)

    Source of dataset and source of dataset description.

    Citation: Jafari-Marandi, R., Denton, J., Idris, A., Smith, B. K., & Keramati, A. (2020). Optimum Profit-Driven Churn Decision Making: Innovative Artificial Neural Networks in Telecom Industry. Neural Computing and Applications.

    Research question

    • Can we predict the probability of a customer churning based on attributes such as age, charge amount, subscription length and so on?

    • This is a classic supervised machine learning problem: binary classification

    The plan

    • We are going to use machine learning models such as logistic regression, decision trees, and random forest
      • Tune their hyperparameters using grid search
        • And determine the best model based on the ROC-AUC value
    • All of this will be done using the tidymodels package
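
    The plan above can be sketched end to end for a single model. This is only a minimal illustration under stated assumptions, not the analysis itself: the `toy` data, its columns `x1` and `x2`, the 5-fold setup, and the 3-level grid are all hypothetical stand-ins, and a decision tree with the `rpart` engine is used because it fits quickly.

    ```r
    library(tidymodels)

    set.seed(123)
    # Hypothetical toy data standing in for the churn dataset
    toy <- tibble(
      x1 = rnorm(300),
      x2 = rnorm(300),
      churn = factor(ifelse(x1 + rnorm(300) > 1, "yes", "no"),
                     levels = c("yes", "no"))
    )
    toy_folds <- vfold_cv(toy, v = 5, strata = churn)

    # Model specification with two hyperparameters marked for tuning
    tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
      set_engine("rpart") %>%
      set_mode("classification")

    tree_wf <- workflow() %>%
      add_model(tree_spec) %>%
      add_formula(churn ~ .)

    # Regular grid: 3 values per hyperparameter, 9 combinations in total
    tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 3)

    # Fit every combination on every fold and score it by ROC-AUC
    tree_res <- tune_grid(
      tree_wf,
      resamples = toy_folds,
      grid = tree_grid,
      metrics = metric_set(roc_auc)
    )

    # Hyperparameter combination with the highest mean ROC-AUC
    best <- select_best(tree_res, metric = "roc_auc")
    ```

    The same pattern (specification, workflow, grid, `tune_grid()`, `select_best()`) should carry over to the other model types by swapping the model specification.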

    Data preparation

    • We read the data in and turn all variable names into snake_case for easier coding
    data <- data %>%
      janitor::clean_names()
    • Luckily, there are no missing values, so no imputation or deletion of rows is necessary
    • Some of the categorical variables are coded as numerical values. We should transform their data types
    data <- data %>% 
      mutate(complains = as.factor(complains),
             # ordinal variables become ordered factors
             charge_amount = factor(charge_amount, ordered = TRUE),
             age_group = factor(age_group, ordered = TRUE),
             tariff_plan = as.factor(tariff_plan),
             status = as.factor(status),
             # recode the 0/1 label into a readable factor
             churn = ifelse(churn == 1, "yes", "no"),
             churn = as.factor(churn))
    data %>% 
      group_by(churn) %>% 
      summarize(n_obs = n())
    495 / (495 + 2655)
    • The dataset is fairly imbalanced: only about 16 % of the rows (495 of 3,150) are churns
    • Stratified sampling when creating the train/test split keeps this class ratio roughly the same in both sets
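
    The two checks described in this section, missing values per column and the share of each class, can be done in base R. The sketch below uses a hypothetical `toy` data frame built from the class counts reported above; the `call_failure` values are made up purely for illustration.

    ```r
    # Hypothetical toy data frame with the class counts reported above
    toy <- data.frame(
      call_failure = rep(c(0, 3, 8), length.out = 3150),
      churn = factor(rep(c("no", "yes"), times = c(2655, 495)))
    )

    # Number of missing values per column; all zeros means nothing to impute
    na_per_column <- colSums(is.na(toy))

    # Share of each class among all rows
    churn_share <- prop.table(table(toy$churn))
    round(churn_share, 3)
    ```

    With these counts, a model that always predicts "no" would already be right for roughly 84 % of the rows, which is one reason to judge the models by ROC-AUC rather than by accuracy.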


    Train-test split and cross validation

    • 75 % of the data get assigned to the training set, the other 25 % to the testing set
    data_split <- initial_split(data, prop = 0.75,
                                strata = churn)
    data_train <- training(data_split)
    data_test <- testing(data_split)
    # 10-fold cross-validation on the training set (the argument is `v`, not `k`)
    data_folds <- vfold_cv(data_train, v = 10)
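
    Whether stratification keeps the churn share similar in both partitions can be checked directly. The following is a small sketch on a hypothetical `toy` data frame with the same class counts as the real data; the seed and all `toy_*` names are assumptions made for the example. It also shows that `vfold_cv()` accepts a `strata` argument, should stratified folds be wanted as well.

    ```r
    library(rsample)

    set.seed(42)
    # Hypothetical stand-in with 495 churners and 2655 non-churners
    toy <- data.frame(
      id = seq_len(3150),
      churn = factor(rep(c("yes", "no"), times = c(495, 2655)))
    )

    toy_split <- initial_split(toy, prop = 0.75, strata = churn)
    toy_train <- training(toy_split)
    toy_test  <- testing(toy_split)

    # Share of churners in each partition; both should be close to 495 / 3150
    train_share <- mean(toy_train$churn == "yes")
    test_share  <- mean(toy_test$churn == "yes")

    # Optional: stratify the cross-validation folds in the same way
    toy_folds <- vfold_cv(toy_train, v = 10, strata = churn)
    ```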