
Can you help reduce employee turnover?

📖 Background

You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.

The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables impact employee churn, you can present your findings along with your ideas on how to address the problem.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

💾 The data

The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

  • "department" - the department the employee belongs to.
  • "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
  • "review" - the composite score the employee received in their last evaluation.
  • "projects" - how many projects the employee is involved in.
  • "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
  • "tenure" - how many years the employee has been at the company.
  • "satisfaction" - a measure of employee satisfaction from surveys.
  • "avg_hrs_month" - the average hours the employee worked in a month.
  • "left" - "yes" if the employee ended up leaving, "no" otherwise.

df = pd.read_csv('./data/employee_churn_data.csv')
# One-hot encode the categorical columns (department and salary tier)
df = pd.concat([df, pd.get_dummies(df.department), pd.get_dummies(df.salary)], axis=1)
# Flag employees whose last review beat the company-wide mean
df['performer'] = df['review'] > df['review'].mean()
df.head()
df.info()
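
Before modeling, it also helps to know how balanced the target is. A quick check (a sketch using the "left" column described above):

# Share of employees who left vs. stayed
df.left.value_counts(normalize=True)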

Department turnover analysis

As the graph below shows, the IT department had the highest relative employee turnover, while the finance department had the lowest. That said, relative turnover is quite similar across departments. The main reason to use relative rather than absolute turnover is that absolute counts would look large for some departments simply because those departments employ the most people.

%matplotlib inline
# Which department has the highest employee turnover? Which one has the lowest?
# Encode the target as 1 (left) / 0 (stayed)
df["has_left"] = (df.left == "yes").astype(int)
# Leavers per department, with the head count as denominator
# (counting any non-null column, here "promoted", gives the department size)
departments = df.groupby("department").agg({"has_left": "sum", "promoted": "count"})
departments["prop"] = departments.has_left / departments.promoted
departments.sort_values("prop", ascending=False).prop.plot(kind="bar")
_ = plt.xlabel("Department")
_ = plt.ylabel("Relative employee turnover")
_ = plt.title("Proportion of employees that left by department")
plt.show()
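
To make the point about absolute counts concrete, here is a small sketch contrasting raw leaver counts with the proportions plotted above (reusing the departments table from the previous cell):

# Absolute leaver counts vs. relative turnover: the largest departments dominate the former
comparison = departments.rename(columns={"has_left": "leavers", "promoted": "headcount"})
print(comparison.sort_values("leavers", ascending=False))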

Variable predictive capacity analysis

To study which of the numerous variables in the dataset have the best predictive capacity, we need to build a classification model. Here I use a logistic regression model with some additional steps to fine-tune it, such as scaling the variables and tuning the hyperparameters.

features = ['promoted', 'review', 'tenure', 'satisfaction', 'avg_hrs_month']
X = df[features]
y = df["has_left"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline = Pipeline([("standard", StandardScaler()),
                     ("clf", LogisticRegression(C=0.9591836734693877))])
pipeline.fit(X_train, y_train)
# Cross-validate the full pipeline so the scaler is refit inside every fold
cv_log = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print("Median 10-fold CV accuracy: {}".format(np.median(cv_log)))
# Tune the regularization strength C (it must be strictly positive, so start above 0)
params = {'clf__C': np.linspace(0.01, 1)}
hiper = RandomizedSearchCV(pipeline, params, cv=10, scoring='accuracy')
hiper.fit(X_train, y_train)
hiper.best_params_
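
Since we already held out a test set, it seems worthwhile to sanity-check the tuned model on unseen data; a minimal sketch using classification_report from sklearn.metrics:

from sklearn.metrics import classification_report

# best_estimator_ is the pipeline refit on the training data with the best C found
y_pred = hiper.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))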

Once we have our model, it is time to select the variables with the most predictive capacity. For that purpose, I select based on the weights the model assigns to each variable. In addition, I also fit an L1-regularized model, so that the coefficients of the least important variables are driven to zero. The results can be seen in the graph below.

It seems that there are three important variables in our database:

  • The composite score the employee received in their last evaluation, stored in the "review" variable.
  • The measured employee satisfaction, stored in the "satisfaction" variable.
  • The average hours the employee worked in a month, stored in the "avg_hrs_month" variable.
# Drop the raw/target columns plus one dummy per categorical ("engineering", "high") to avoid redundancy
X = df.loc[:, [i for i in df.columns if i not in ["left", "has_left", "department", "salary", "high", "engineering"]]]
y = df["has_left"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for i in range(1, 4):
    # threshold=-np.inf makes SelectFromModel keep exactly the top max_features coefficients
    pipeline = Pipeline([("standard", StandardScaler()),
                         ("clf", SelectFromModel(estimator=LogisticRegression(C=0.9591836734693877),
                                                 threshold=-np.inf, max_features=i))])
    pipeline.fit(X_train, y_train)
    result = pd.DataFrame({'features': X.columns, 'selected': pipeline[1].get_support()})
    variables = result[result.selected].features.tolist()
    print("The selected {} variables are: ".format(i), variables)
# Reusing the same X, y, and train/test split from the previous cell
pipeline = Pipeline([("standard", StandardScaler()),
                     ("clf", LogisticRegression(solver="liblinear", penalty="l1"))])
pipeline.fit(X_train, y_train)
df_v = (pd.DataFrame({'features': X.columns, 'coeffs': pipeline[1].coef_[0]})
        .set_index("features")
        .sort_values("coeffs", ascending=False))
df_v.plot(kind="bar")
_ = plt.xlabel("Variable name")
_ = plt.ylabel("Model coefficient")
_ = plt.title("L1 regularized coefficients for each variable")
plt.show()
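
As a quick check on the plot, we can also list the variables whose L1-penalized coefficient was driven exactly to zero (a sketch over the df_v table built above):

# Variables the L1 penalty deems irrelevant (coefficient exactly zero)
print("Zeroed out by the L1 penalty:", df_v[df_v.coeffs == 0].index.tolist())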

Variable analysis and recommendations

Finally, it is time to analyze the relationship between the variables we identified above and employee turnover, so that we can understand what we can do to reduce it. First, let's take a look at the review variable, as it is the most important one in our dataset.

# review variable analysis
%matplotlib inline
plt.rcParams["figure.figsize"] = 10, 5
# Mean review per department, split by whether the employee left
left = df.groupby(["department", "left"]).agg({"review": "mean", "has_left": "sum"}).reset_index()
sns.barplot(x=left.review, y=left.department, hue=left.left, orient="h",
            order=departments.sort_values("prop").index)
_ = plt.xlabel("Mean review of the employee group")
_ = plt.ylabel("Department")
_ = plt.title("Mean review by department and employee status (left the company: yes/no)")
plt.show()
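
The same comparison can be run for the other two influential variables; a minimal sketch of the group means by churn status:

# Mean satisfaction and average monthly hours, split by whether the employee left
for var in ["satisfaction", "avg_hrs_month"]:
    print(df.groupby("left")[var].mean(), "\n")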