Background
You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.
The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables affect employee churn, you can present your findings along with your ideas on how to address the problem.
The data
The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.
- "department" - the department the employee belongs to.
- "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
- "review" - the composite score the employee received in their last evaluation.
- "projects" - how many projects the employee is involved in.
- "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
- "tenure" - how many years the employee has been at the company.
- "satisfaction" - a measure of employee satisfaction from surveys.
- "avg_hrs_month" - the average hours the employee worked in a month.
- "left" - "yes" if the employee ended up leaving, "no" otherwise.
0) A technical note about the packages
The packages used are listed below:
library(tidyverse)
install.packages(c("Factoshiny","missMDA","FactoInvestigate"))
library(FactoMineR)
install.packages("factoextra")
library(factoextra)
library(corrplot)
library(knitr)
0) A technical note about the dataset
df <- readr::read_csv('./data/employee_churn_data.csv')
head(df)
sum(is.na(df))
str(df)
A quick glimpse of the dataset:
kable(head(df))
More about the dataset:
The dataset contains no missing values and has ten columns. The column bonus is not described; it contains only 0 and 1. The following analyses assume that bonus is coded like promoted: 1 if the employee received a bonus in the previous 24 months, 0 otherwise.
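The checks behind these statements can be sketched as follows. This is a minimal illustration on an invented toy data frame, not the real employee data; the helper `is_binary_flag` is a hypothetical name introduced here.

```r
# Toy stand-in for the real data; the values are invented for illustration only.
toy <- data.frame(
  promoted = c(0, 1, 0, 0, 1),
  bonus    = c(1, 0, 0, 1, 0),
  left     = c("no", "yes", "no", "no", "yes")
)

# Hypothetical helper: TRUE if every non-missing value of a column is 0 or 1.
is_binary_flag <- function(data, column) {
  values <- data[[column]]
  all(values[!is.na(values)] %in% c(0, 1))
}

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy))

# "bonus" behaves like a 0/1 flag, while "left" does not (it holds "yes"/"no").
bonus_is_binary <- is_binary_flag(toy, "bonus")
left_is_binary  <- is_binary_flag(toy, "left")
```

On the real data, the same helper could be called as `is_binary_flag(df, "bonus")`. A 0/1 coding is consistent with the assumption about bonus, but it does not confirm what the column actually measures.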
1) The employee turnover changes across different departments and is due to a common problem
Let's start our analysis by looking at the number of people who left their job, and the corresponding fractions, across departments.
left_people_dep <- df %>%
  filter(left == "yes") %>%
  group_by(department) %>%
  count(left)
ggplot() +
  geom_col(left_people_dep, mapping = aes(x = reorder(department, -n), y = n, fill = n)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        text = element_text(size = 18, color = "black"),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_gradient(low = "darkgreen", high = "darkred") + xlab("Departments") + ylab("Count") +
  labs(title = "Number of people that left the department") +
  guides(fill = "none")
In terms of the number of people leaving, the departments can be divided into 3 main areas:
- The sales department, which has the largest number of people leaving their position.
- A "gray zone" made up of departments where the problem involves a similar number of workers, but fewer than in the sales department.
  - 2.a Retail, Engineering, and Operations: the turnover involves more than 400 employees per department.
  - 2.b Marketing and Support: the turnover involves around 200 employees per department.
- A third area that includes the administration, logistics, IT and finance departments. In this case the turnover involves around 100 employees.
We now look at the percentage of people who left each department.
df_perc <- df %>%
  group_by(department) %>%
  count(left) %>%
  mutate(fract = round((n / sum(n)) * 100, digits = 1))
ggplot(df_perc, mapping = aes(x = factor(department), y = fract, fill = left)) +
  geom_bar(position = "stack", stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        text = element_text(size = 18, color = "black"),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  ylab("Percentage of employees") +
  xlab("Departments")
Looking at the percentages, we see similar shares of people leaving across departments. This suggests that the problem does not lie within a single department and that the underlying cause affects the whole company.
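One way to back the visual impression with a number is a chi-squared test of independence between department and leaving. This test is not part of the original analysis; it is a minimal sketch on invented counts for three departments, and on the real data the table could be built with `table(df$department, df$left)`.

```r
# Invented counts (stayed / left) for three departments, for illustration only.
counts <- matrix(
  c(700, 300,
    215,  85,
    138,  62),
  ncol = 2, byrow = TRUE,
  dimnames = list(department = c("sales", "IT", "finance"),
                  left = c("no", "yes"))
)

# Percentage of leavers within each department (row-wise proportions).
leaving_rate <- round(100 * prop.table(counts, margin = 1)[, "yes"], 1)

# Chi-squared test of independence between department and leaving.
test_result <- chisq.test(counts)
p_value <- test_result$p.value
```

A large p-value means the data give no evidence that the leaving rate differs between departments. It does not prove that the rates are equal, so the result should be read alongside the plot rather than in place of it.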
2) High-performing employees leave early
The reasons behind employee turnover are searched for across multiple variables. A PCA on the quantitative data reveals which variables best explain the separation between employees who left and those who remained. The PCA also reduces the problem to a smaller set of variables.
# Quantitative Data
res.pca.quat <- PCA(df[, -c(1, 2, 5, 8, 10)], ncp = 10, graph = FALSE)
eig.val <- get_eigenvalue(res.pca.quat)
noprint <- fviz_screeplot(res.pca.quat)
fviz_pca_ind(res.pca.quat, axes = c(1, 2),
             label = "none", # hide individual labels
             habillage = as.factor(df$left), # color by groups
             palette = c("#E69F00", "#56B4E9")
)
The plot above shows how the employees are distributed along the first two principal components (Dim1 and Dim2).
Dim1 explains 41.2% of the variance and provides the better separation between the employees who left and those who remained. Dim2 also separates the two groups well.
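The percentage attached to each dimension comes from the eigenvalues of the PCA. As a minimal sketch, assuming base R's `prcomp` as a stand-in for FactoMineR's `PCA` and the built-in `USArrests` table in place of the employee data, the shares can be computed by hand:

```r
# Stand-in data: the built-in USArrests table, not the employee data.
# Variables are centered and scaled so each one contributes equally.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca$sdev^2

# Share of the total variance carried by each component, in percent.
pct_variance <- 100 * eigenvalues / sum(eigenvalues)

# Running total, useful to decide how many components to keep.
cumulative_pct <- cumsum(pct_variance)
```

With standardized variables the eigenvalues sum to the number of variables, so each percentage is simply the eigenvalue divided by that count. In the analysis above, the corresponding figures are stored in `eig.val`.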
The correlation plot shows which variables contribute most to the separation observed along these axes.
corrplot(res.pca.quat$var$contrib, is.corr = FALSE)
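To make the plotted quantity concrete: the contribution of a variable to a dimension can be read as its squared loading on that dimension, expressed as a percentage. The sketch below assumes equal variable weights and uses base R's `prcomp` on the built-in `USArrests` table as a stand-in, so the numbers are not those of the employee data.

```r
# Stand-in data: the built-in USArrests table, not the employee data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each column of the rotation matrix has unit length, so its squared
# entries sum to 1 and can be read as shares of that component.
contrib <- 100 * pca$rotation^2

# Contributions add up to 100% within each component.
total_per_component <- colSums(contrib)

# Variable contributing most to the first component.
top_variable_pc1 <- names(which.max(contrib[, "PC1"]))
```

A variable with a contribution above `100 / ncol(USArrests)` percent contributes more than it would if all variables weighed equally, which is a common rule of thumb for reading such a plot.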