📖 Background and Overview

You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.

The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables affect employee churn, you can present your findings along with your ideas on how to tackle the problem.

The questions that I will focus on are:

  1. Which department has the highest employee turnover? Which one has the lowest?
  2. Investigate which variables seem to be better predictors of employee departure.
  3. What recommendations would you make regarding ways to reduce employee turnover?

# Set renv activate the current project
renv::install(c("corrr", "glmnet", "ranger", "rmdformats", "vip"), prompt = FALSE) # install packages that are not in cache
renv::hydrate(update = FALSE) # install any packages used in the Rnotebook but not provided, do not update
renv::snapshot(prompt = FALSE)

library(readr)
library(tidymodels)
library(rmdformats)
library(psych)
library(glmnet)
library(ranger)
library(vip)
library(corrr)

employee_churn_data <- read_csv("./data/employee_churn_data.csv", show_col_types = FALSE)

employee_churn_data <- employee_churn_data %>%
  mutate(salary = factor(salary, levels = c("low", "medium", "high"), ordered = TRUE),
         left = factor(left, levels = c("yes", "no"), labels = c("Yes", "No")),
         promoted = factor(promoted, levels = c("1", "0"), labels = c("Yes", "No")),
         bonus = factor(bonus, levels = c("1", "0"), labels = c("Yes", "No"))) %>%
  mutate_if(is.character, as.factor)

1. Which department has the highest employee turnover? Which has the lowest?

Before answering the first question, it is a good idea to look at each of the variables to see if there are any problems that might affect the analysis.

🩺 First step - healthcheck and EDA {.tabset}

Structure of data

The HR team has assembled data on almost 10,000 employees, using information from exit interviews, performance reviews, and employee records.

There is a mix of factor, ordinal, and numeric input variables, and a single output variable.

  • "department" - the department the employee belongs to.
  • "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
  • "review" - the composite score the employee received in their last evaluation.
  • "projects" - how many projects the employee is involved in.
  • "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
  • "tenure" - how many years the employee has been at the company.
  • "satisfaction" - a measure of employee satisfaction from surveys.
  • "avg_hrs_month" - the average hours the employee worked in a month.
  • "left" - "yes" if the employee ended up leaving, "no" otherwise.

NA's

There are no occurrences of NA in the data. This is highly unusual, but a testament to the HR team's diligence in record keeping and attention to detail.

employee_churn_data[rowSums(is.na(employee_churn_data))!=0,]
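The row filter above returns an empty result when nothing is missing. As a complementary check, missing values can also be counted per column, which shows at a glance where any gaps would sit. The following is a minimal base R sketch on a small toy data frame; `toy` and its values are hypothetical stand-ins, not the HR data.

```r
# Toy stand-in for employee_churn_data (hypothetical values, for illustration only)
toy <- data.frame(
  department = c("sales", "IT", NA, "retail"),
  review     = c(0.61, NA, 0.72, 0.58),
  left       = c("no", "yes", "no", "no")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(toy))

print(na_per_column)
print(rows_with_na)
```

Applied to `employee_churn_data`, every count would be expected to be zero, given that the row filter returns nothing.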

Overall outcome

Let's first see how many cases of each outcome there are. The dataset does seem to be unbalanced, with an approximately 70/30 split between employees who remained and those who left.


employee_churn_data %>%
  group_by(left) %>%
  summarise(employees = n()) %>%
    ggplot(aes(left, employees)) +
    geom_col(aes(fill = left)) +
    geom_text(aes(label = employees), position = position_stack(vjust = 0.5)) +
    labs(x = "Did employee leave?") +
    scale_y_continuous(labels = NULL, breaks = NULL) + labs(y = "") +
    guides(fill = "none")
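The plot shows raw counts. To put a number on the imbalance, the share of each outcome can be computed directly. This is a small sketch using `dplyr::count()` on a toy data frame; the `toy` data and the `employees` and `share` column names are illustrative assumptions.

```r
library(dplyr)

# Toy stand-in with a 70/30 split (hypothetical values, for illustration only)
toy <- data.frame(
  left = factor(c(rep("No", 7), rep("Yes", 3)), levels = c("Yes", "No"))
)

# Count employees per outcome and convert the counts to proportions
class_balance <- toy %>%
  count(left, name = "employees") %>%
  mutate(share = employees / sum(employees))

print(class_balance)
```

The same pipeline run on `employee_churn_data` would give the exact split for the real data.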

Data values

Considering the numeric variables, it looks like there will be a need for scaling when it comes to modelling as some of the values are much larger than others.

describe(employee_churn_data, omit = TRUE)
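One possible way to handle the scaling mentioned above is a `recipes` preprocessing step (`recipes` is part of tidymodels, which is loaded earlier). The sketch below centres and scales the numeric predictors of a toy data frame. It is an assumption about how scaling could be done, not necessarily the approach used later in this analysis, and the `toy` values are hypothetical.

```r
library(dplyr)
library(recipes)

# Toy stand-in with predictors on very different scales (hypothetical values)
toy <- data.frame(
  review        = c(0.55, 0.62, 0.70, 0.48),
  avg_hrs_month = c(180, 185, 190, 178),
  left          = factor(c("No", "Yes", "No", "No"))
)

# Centre and scale every numeric predictor to mean 0 and standard deviation 1
scaling_recipe <- recipe(left ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors())

# Estimate the means and standard deviations, then apply them to the same data
scaled <- scaling_recipe %>%
  prep() %>%
  bake(new_data = NULL)

print(scaled)
```

Keeping the scaling inside a recipe means the means and standard deviations are estimated on the training data only and then reused on new data.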

🔍 Getting an overview {.tabset}

The following tabs give a sense of the structure and make-up of the organisation.

Depts

The organisation is relatively conventional with regard to its structure. There are significant engineering, sales, retail and operations departments, which are supported by a number of smaller back-office functions.

employee_churn_data %>%
  group_by(department, left) %>%
  summarise(employees = n()) %>%
    ggplot(aes(reorder(department, employees, sum), employees)) +
    geom_col() +
    labs(y = "Employees", x = "Department") +
    coord_flip()
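Headcount alone does not answer the turnover question, because large departments will have more leavers simply by being large. A turnover rate per department (leavers divided by headcount) makes departments comparable. The sketch below shows the idea on toy data; `toy`, `leavers` and `turnover_rate` are hypothetical names and values.

```r
library(dplyr)

# Toy stand-in (hypothetical values, for illustration only)
toy <- data.frame(
  department = c("sales", "sales", "sales", "sales", "IT", "IT"),
  left       = c("Yes", "No", "No", "No", "Yes", "No")
)

# Headcount, leavers and turnover rate per department, highest rate first
turnover_by_department <- toy %>%
  group_by(department) %>%
  summarise(
    employees     = n(),
    leavers       = sum(left == "Yes"),
    turnover_rate = leavers / employees
  ) %>%
  arrange(desc(turnover_rate))

print(turnover_by_department)
```

In this toy example the smaller department has the higher rate even though both have one leaver, which is exactly the effect raw counts would hide.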

Promos

Very few employees were promoted. We need to see how these promotions were distributed across the employee population and in relation to other variables.

employee_churn_data %>%
  group_by(promoted) %>%
  summarise(employees = n()) %>%
    ggplot(aes(as.factor(promoted), employees)) +
    geom_col(aes(fill = promoted)) +
    geom_text(aes(label = employees), position = position_stack(vjust = 0.5)) +
    labs(x = "Was employee promoted?") +
    scale_y_continuous(labels = NULL, breaks = NULL) + labs(y = "") +
    guides (fill = "none")

Review

The review scores of employees seem to be approximately normally distributed, with values between 0 and 1.

employee_churn_data %>%
  ggplot(aes(review)) +
    geom_histogram() +
    labs(x = "Review", y = "Employees") +
    guides (fill = "none")

Projects

The majority of employees are members of 3 projects. Of the others, a large proportion are part of one additional project, with a small number part of 2 or 5.

employee_churn_data %>%
  group_by(projects) %>%
  summarise(employees = n()) %>%
    ggplot(aes(x = as.factor(projects), y = employees)) +
    geom_col() +
    labs(x = "Number of projects", y = "Employees") +
    guides (fill = "none")

Salary

The majority of employees sit in the medium range with a smaller number at low and high ranges.

employee_churn_data %>%
  group_by(salary) %>%
  summarise(employees = n()) %>%
    ggplot(aes(x = salary, y = employees)) +
    geom_col() +
    labs(x = "Salary range", y = "Employees") +
    guides (fill = "none")

Tenure

The tenure of employees seems to be approximately normally distributed (although there is more bunching around the mean / median, possibly due to rounding to whole years). Strangely, the organisation has a very small number of new employees; this could be another indicator of issues with culture / leadership.

employee_churn_data %>%
  group_by(tenure) %>%
  summarise(employees = n()) %>%
    ggplot(aes(x = as.factor(tenure), y = employees)) +
    geom_col() +
    labs(x = "Tenure", y = "Employees") +
    guides (fill = "none")