Understanding the problem is half the solution: Using a classification model to predict the recent high employee turnover

📖 Background

You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company. The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables impact employee churn, you can present your findings along with your ideas on how to attack the problem.

# Installing extra packages
!pip install shap sklearn_pandas catboost==0.26

# Preliminaries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import shap

# Render matplotlib figures inline and set a default figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 6)

# Load the employee churn dataset
df = pd.read_csv('./data/employee_churn_data.csv')

Introduction

To reduce the company's recent high employee turnover, it is important to make sense of the situation first so that specific solutions can be designed to tackle the problem. This analysis therefore investigates the following questions:

Which department has the highest employee turnover? Which one has the lowest?
Which variables seem to be better predictors of employee departure?
What recommendations should be made regarding ways to reduce employee turnover?

💾 Some information about the data

The human capital department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

"department" - the department the employee belongs to.
"promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
"review" - the composite score the employee received in their last evaluation.
"projects" - how many projects the employee is involved in.
"salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
"tenure" - how many years the employee has been at the company.
"satisfaction" - a measure of employee satisfaction from surveys.
"bonus" - 1 if the employee received a bonus, 0 otherwise.
"avg_hrs_month" - the average hours the employee worked in a month.
"left" - "yes" if the employee ended up leaving, "no" otherwise.

A glimpse at the data

Before beginning the analysis, it is useful to check whether the dataset contains any missing values or misclassified data types. Fortunately, none of the features has missing values. As for data types, note that promoted and bonus are binary features encoded as the integers 0 and 1, and that the left column will need to be recoded as integers for modelling later. Moreover, since the values of the salary variable imply a hierarchy, it is better to recode it as an ordinal feature with pandas. I will also recode department as a nominal feature.

# Head of the DataFrame
display(df.head())

# Checking data type
print("\nNumber of entries and data types of the features in the dataset:\n")
df.info()

# Recoding salary into an ordinal feature (low < medium < high)
df['salary'] = pd.Categorical(df.salary, categories=["low", "medium", "high"], ordered=True)
# Recoding department into a nominal (unordered) feature
df['department'] = pd.Categorical(df.department)
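The checks and recodings described above can be sketched on a small stand-in frame. The toy DataFrame and its values are invented for illustration; only the column names and the salary tiers come from the dataset description. The integer recoding of left shown here is one possible approach and is an assumption, not necessarily the one used later in the analysis.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "salary": ["low", "high", "medium", "low"],
    "left": ["no", "yes", "no", "yes"],
})

# Count missing values per column (all zeros means nothing is missing).
missing_per_column = toy.isna().sum()

# An ordered categorical keeps the low < medium < high hierarchy,
# so comparisons and integer codes respect the tier order.
toy["salary"] = pd.Categorical(
    toy["salary"], categories=["low", "medium", "high"], ordered=True
)
salary_codes = toy["salary"].cat.codes       # low -> 0, medium -> 1, high -> 2
at_least_medium = toy["salary"] >= "medium"  # element-wise ordinal comparison

# One possible integer recoding of the target (assumption): "yes" -> 1, "no" -> 0.
left_flag = toy["left"].map({"no": 0, "yes": 1})
```

Because the categories are ordered, sorting or filtering by salary follows the tier order rather than alphabetical order.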

While exploratory data analysis will be done later, looking at the summary statistics of the features is a helpful first pass at the dataset, for example to check whether the left column, which will serve as the target variable in the classification model, suffers from severe class imbalance. Judging from the two tables created with the .describe() method, the values of the features fall within reasonable ranges, so any outliers that exist are unlikely to stem from data entry or collection errors. Some noteworthy observations drawn from the .describe() tables:

The overall turnover rate of the company is 29.18%. Although this implies that the class imbalance in the left column is mild, the figure itself is concerning: roughly 3 out of 10 employees in the dataset left the company. If retention strategies are not designed and implemented soon, the hiring team may face a considerable burden of constantly looking for new employees.
Only around 3% of the employees were promoted during the last 2 years. At the same time, at least 75% of the employees had already worked at the company for at least 5 years when the data were collected.
If we use 0.5 as the threshold for deciding whether an employee is satisfied with their work at the company, then around half of the employees fall below this threshold.

# Summary stats for numerical features
print("Summary stats for numerical features: ")
display(df.describe())

print("\nSummary stats for categorical features: ")
display(df.describe(exclude="number"))
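The overall turnover rate discussed above is simply the share of "yes" values in the left column. A minimal sketch, using an invented toy series rather than the real data (so the resulting figure is not the one reported for the company):

```python
import pandas as pd

# Hypothetical target column; in the notebook this would be df["left"].
left = pd.Series(["yes", "no", "no", "yes", "no", "no", "no", "no", "yes", "no"])

# Share of each class; normalize=True returns proportions instead of counts.
class_share = left.value_counts(normalize=True)

# Turnover rate in percent = proportion of "yes" values * 100.
turnover_rate = round(class_share["yes"] * 100, 2)
```

The same class_share table also gives a quick read on class imbalance before modelling.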

Which department suffers the most from the turnover problem?

After this brief look at the dataset, I will start the analysis with the turnover rate by department. The first step is to see how many employees there are in each department. If the headcounts differ considerably between departments, the turnover figures need to be normalised before departments can be compared properly. As the table below shows, headcounts differ widely, from the sales department with 1883 employees to the IT department with only 356. Therefore, it is more useful to compare turnover rates (the number of employees who left divided by the total number of employees in each department) to see whether a particular department suffers more from employee turnover.

# Counting the number of employees by department
employee_by_dept = df.department.value_counts(ascending=False)
print(f"Number of employees by department:\n{employee_by_dept}")

The bar chart below presents the number of employees who left and who stayed in each department, together with the department's turnover rate (shown by hovering over the bars). The departments on the x-axis are sorted from the highest turnover rate on the left to the lowest on the right. Some observations worth mentioning:

The department with the highest turnover rate is IT (30.9%), whereas the one with the lowest is finance (26.87%).
Four departments (IT, logistics, retail and marketing) have turnover rates higher than the company's overall rate (29.18%). The turnover rates of the remaining departments, except for finance, are also very close to the overall rate.

# Finding out the turnover rate per department
turnover_by_dept = (
    pd.crosstab(df.department, df.left, margins=True, margins_name="total_employees")
    .drop(index="total_employees")  # drop the all-departments margin row
    .reset_index()
)
# Turnover rate (%) = employees who left / all employees in the department
turnover_by_dept["turnover_rate"] = (
    turnover_by_dept["yes"] / turnover_by_dept["total_employees"] * 100
).round(2)
turnover_by_dept = turnover_by_dept.sort_values(by="turnover_rate", ascending=False)

# Plotting 
fig = go.Figure()
for turnover in ["yes", "no"]:
    fig.add_trace(go.Bar(x=turnover_by_dept["department"], 
                         y=turnover_by_dept[turnover], 
                         name=turnover, 
                         customdata=turnover_by_dept["turnover_rate"], 
                         hovertemplate="<br>".join(["Number of employees: %{y}", 
                                                    "Turnover rate of this department: %{customdata}%"])))
        
# Formatting the layout
fig.update_layout({"xaxis": {"title": {"text": "Department"}}, 
                   "yaxis": {"title": {"text": "Count"}},
                   "legend": {"title": {"text": "Left company"}},
                   "title": "Turnover statistics by department"})
fig.show()
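As a cross-check on the crosstab above, the same per-department rate can be derived with a groupby, since the mean of a 0/1 flag is the share of leavers. This is an alternative sketch on invented data, not the method used in this notebook; toy, left_flag and rates are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "department": ["IT", "IT", "sales", "sales", "sales", "sales"],
    "left": ["yes", "no", "yes", "no", "no", "no"],
})

rates = (
    toy.assign(left_flag=toy["left"].eq("yes").astype(int))  # 1 = left, 0 = stayed
       .groupby("department")["left_flag"]
       .agg(leavers="sum", total_employees="count", turnover_rate="mean")
)

# Express the rate as a percentage and sort from highest to lowest.
rates["turnover_rate"] = (rates["turnover_rate"] * 100).round(2)
rates = rates.sort_values("turnover_rate", ascending=False)
```

Normalising by headcount in this way is what makes a small department such as IT comparable with a large one such as sales.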

To alleviate the company's high turnover rate, it is important to know which factors are more closely related to employee departure so that the human capital department can devise specific retention strategies. I will now move on to exploratory data analysis to see whether the dataset contains any particular patterns that should be taken into account when building the classification model.

Exploratory data analysis (EDA)

Since we are mostly concerned with predicting whether an employee is likely to leave the company based on the features in the dataset, we can first look at the summary statistics of the features grouped by the left column to see whether the two groups of employees differ in any notable way. This provides a starting point for finding which features may be more strongly associated with the likelihood of leaving.

Let's start with the numerical features of the dataset. To shorten the output, I excluded the 25th and 75th percentiles of the numerical features. Several observations are worth noting:

The proportion of remaining employees who got promoted (3.43%) is about 67% higher than that of employees who left the company (2.05%).
Surprisingly, the mean and median review scores of employees who left are higher than those of remaining employees.
In terms of tenure, the median for employees who left the company is higher than that for remaining ones.
Lastly, the percentage of employees receiving a bonus is slightly higher among those remaining at the company (21.5%) than among those who left (20.5%).
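A grouped summary of this kind can be produced with groupby and agg. The sketch below uses a tiny invented frame, so its numbers are not the ones quoted above; the column names follow the dataset description, while toy, summary and promoted_share are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "left": ["no", "no", "no", "yes", "yes"],
    "promoted": [1, 0, 0, 0, 0],
    "review": [0.60, 0.64, 0.62, 0.70, 0.66],
    "tenure": [5, 6, 7, 8, 7],
})

# Mean and median of each numerical feature, split by whether the employee left.
summary = toy.groupby("left")[["promoted", "review", "tenure"]].agg(["mean", "median"])

# For a 0/1 column the group mean is a proportion, shown here in percent.
promoted_share = (summary[("promoted", "mean")] * 100).round(2)
```

Reading the rows of summary side by side makes it easy to spot features whose typical values differ between leavers and stayers.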
