Background and Overview
You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.
The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables impact employee churn, you can present your findings along with your ideas on how to attack the problem.
The questions that I will focus on are:
- Which department has the highest employee turnover? Which one has the lowest?
- Investigate which variables seem to be better predictors of employee departure.
- What recommendations would you make regarding ways to reduce employee turnover?
# Set renv activate the current project
renv::install(c("corrr", "glmnet", "ranger", "rmdformats", "vip"), prompt = FALSE) # install packages that are not in cache
renv::hydrate(update = FALSE) # install any packages used in the Rnotebook but not provided, do not update
renv::snapshot(prompt = FALSE)

library(readr)
library(tidymodels)
library(rmdformats)
library(psych)
library(glmnet)
library(ranger)
library(vip)
library(corrr)

employee_churn_data <- read_csv("./data/employee_churn_data.csv", show_col_types = FALSE)

# Recode the categorical columns as (ordered) factors with readable labels
employee_churn_data <- employee_churn_data %>%
  mutate(salary = factor(salary, levels = c("low", "medium", "high"), ordered = TRUE),
         left = factor(left, levels = c("yes", "no"), labels = c("Yes", "No")),
         promoted = factor(promoted, levels = c("1", "0"), labels = c("Yes", "No")),
         bonus = factor(bonus, levels = c("1", "0"), labels = c("Yes", "No"))) %>%
  mutate_if(is.character, as.factor)
1. Which department has the highest employee turnover? Which has the lowest?
In preparation for answering the first question, it is a good idea to look at each of the variables to see whether there are any issues that might cause problems in the analysis.
First step - healthcheck and EDA {.tabset}
Structure of data
The HR team has assembled data on almost 10,000 employees, using information from exit interviews, performance reviews, and employee records.
There is a mix of factor, ordinal and numeric input variables and a single output variable.
- "department" - the department the employee belongs to.
- "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
- "review" - the composite score the employee received in their last evaluation.
- "projects" - how many projects the employee is involved in.
- "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
- "tenure" - how many years the employee has been at the company.
- "satisfaction" - a measure of employee satisfaction from surveys.
- "avg_hrs_month" - the average hours the employee worked in a month.
- "left" - "yes" if the employee ended up leaving, "no" otherwise.
NA's
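There are no occurrences of NA in the data. This is highly unusual, but a testament to the HR team's diligence in record keeping and attention to detail. The check below returns every row containing at least one NA, so an empty result confirms that the data is complete.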
employee_churn_data[rowSums(is.na(employee_churn_data))!=0,]
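A complementary view is to count missing values per column, which shows where any gaps would sit rather than which rows contain them. The sketch below uses a small made-up data frame (`toy`, with invented values) in place of `employee_churn_data`, so the counts it produces say nothing about the real data.

```r
# Hypothetical stand-in for employee_churn_data; the values are invented
toy <- data.frame(
  review = c(0.58, NA, 0.72),
  tenure = c(5, 6, NA),
  left   = c("no", "yes", "no")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(toy))
na_counts
```

On the real data, every count would be expected to be zero if the row-level check above returns no rows.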
Overall outcome
Let's first see how many cases of each outcome there are. The dataset is unbalanced, with an approximately 70/30 split between employees who remained and those who left.
employee_churn_data %>%
group_by(left) %>%
summarise(employees = n()) %>%
ggplot(aes(left, employees)) +
geom_col(aes(fill = left)) +
geom_text(aes(label = employees), position = position_stack(vjust = 0.5)) +
labs(x = "Did employee leave?") +
scale_y_continuous(labels = NULL, breaks = NULL) + labs(y = "") +
guides(fill = "none")
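The chart shows raw counts; the split itself can be checked by turning those counts into shares. This is a minimal sketch on a ten-row made-up factor (`toy`), constructed to have three leavers purely for illustration.

```r
library(dplyr)

# Hypothetical stand-in: 7 employees who stayed and 3 who left
toy <- tibble(
  left = factor(c(rep("No", 7), rep("Yes", 3)), levels = c("Yes", "No"))
)

# Count each outcome, then express it as a share of all employees
outcome_share <- toy %>%
  count(left, name = "employees") %>%
  mutate(share = employees / sum(employees))

outcome_share
```

Replacing `toy` with `employee_churn_data` would give the actual proportions behind the approximate 70/30 figure.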
Data values
Considering the numeric variables, it looks like there will be a need for scaling when it comes to modelling as some of the values are much larger than others.
describe(employee_churn_data, omit = TRUE)
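To illustrate what scaling means here, the sketch below centres each numeric column on zero and divides by its standard deviation. It runs on a made-up three-row table (`toy`), and using `scale()` inside `across()` is one possible approach, not necessarily the preprocessing used later in this analysis.

```r
library(dplyr)

# Hypothetical values on very different scales, as in the real data
toy <- tibble(
  avg_hrs_month = c(180, 185, 190),
  review        = c(0.5, 0.6, 0.7)
)

# Standardise every numeric column: subtract the mean, divide by the sd
toy_scaled <- toy %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

toy_scaled
```

After this step both columns share the same scale, so a model that is sensitive to magnitude (such as a penalised regression) does not favour `avg_hrs_month` simply because its values are larger.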
Getting an overview {.tabset}
The following tabs give a sense of the structure and make-up of the organisation.
Depts
The organisation is relatively conventional with regard to its structure. There are significant engineering, sales, retail and operations departments, which are supported by a number of smaller back-office functions.
employee_churn_data %>%
group_by(department, left) %>%
summarise(employees = n()) %>%
ggplot(aes(reorder(department, employees, sum), employees)) +
geom_col() +
labs(y = "Employees", x = "Department") +
coord_flip()
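Headcount alone does not answer the first question, because a large department can lose many people while still having a low turnover rate. One way to compare departments fairly is the share of each department's employees who left. The sketch below runs on a made-up table (`toy`) with invented department values, so its output is not a finding about the organisation.

```r
library(dplyr)

# Hypothetical stand-in for the department and left columns
toy <- tibble(
  department = c("sales", "sales", "sales", "sales", "IT", "IT"),
  left       = c("Yes", "No", "No", "No", "Yes", "No")
)

# Turnover rate = leavers / headcount, computed per department
turnover <- toy %>%
  group_by(department) %>%
  summarise(
    employees     = n(),
    leavers       = sum(left == "Yes"),
    turnover_rate = leavers / employees
  ) %>%
  arrange(desc(turnover_rate))

turnover
```

The first and last rows of the sorted result give the departments with the highest and lowest turnover respectively.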
Promos
Very few employees were promoted. We need to see how these promotions were distributed through the employee population and in relation to other variables.
employee_churn_data %>%
group_by(promoted) %>%
summarise(employees = n()) %>%
ggplot(aes(as.factor(promoted), employees)) +
geom_col(aes(fill = promoted)) +
geom_text(aes(label = employees), position = position_stack(vjust = 0.5)) +
labs(x = "Was employee promoted?") +
scale_y_continuous(labels = NULL, breaks = NULL) + labs(y = "") +
guides (fill = "none")
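A first step towards relating promotions to other variables is a cross-tabulation against the outcome. The sketch below, again on made-up data (`toy`), computes the share of leavers and stayers within each promotion group; the figures are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for the promoted and left columns
toy <- tibble(
  promoted = c("Yes", "Yes", "No", "No", "No", "No"),
  left     = c("No", "Yes", "No", "No", "Yes", "No")
)

# Count each promoted/left combination, then take shares within promoted
promo_by_outcome <- toy %>%
  count(promoted, left, name = "employees") %>%
  group_by(promoted) %>%
  mutate(share = employees / sum(employees)) %>%
  ungroup()

promo_by_outcome
```

With so few promotions in the real data, the shares for the promoted group rest on a small number of employees and should be read with caution.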
Review
The review scores of employees seem to be approximately normally distributed, with values between 0 and 1.
employee_churn_data %>%
ggplot(aes(review)) +
geom_histogram() +
labs(x = "Review", y = "Employees") +
guides (fill = "none")
Projects
The majority of employees are members of 3 projects. Of the others, a large proportion are part of one additional project, with a small number part of 2 or 5.
employee_churn_data %>%
group_by(projects) %>%
summarise(employees = n()) %>%
ggplot(aes(x = as.factor(projects), y = employees)) +
geom_col() +
labs(x = "Number of projects", y = "Employees") +
guides (fill = "none")
Salary
The majority of employees sit in the medium range with a smaller number at low and high ranges.
employee_churn_data %>%
group_by(salary) %>%
summarise(employees = n()) %>%
ggplot(aes(x = salary, y = employees)) +
geom_col() +
labs(x = "Salary range", y = "Employees") +
guides (fill = "none")
Tenure
The tenure of employees seems to be approximately normally distributed, although there is more bunching around the mean and median, possibly due to rounding to whole years. Strangely, the organisation has a very small number of new employees, which could be another indicator of issues with culture or leadership.
employee_churn_data %>%
group_by(tenure) %>%
summarise(employees = n()) %>%
ggplot(aes(x = as.factor(tenure), y = employees)) +
geom_col() +
labs(x = "Tenure", y = "Employees") +
guides (fill = "none")