Competition - employee turnover

Can you help reduce employee turnover?

📖 Background

You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.

The team needs to understand better the situation, which employees are more likely to leave, and why. Once it is clear what variables impact employee churn, you can present your findings along with your ideas on how to attack the problem.

💾 The data

The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

"department" - the department the employee belongs to.
"promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
"review" - the composite score the employee received in their last evaluation.
"projects" - how many projects the employee is involved in.
"salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
"tenure" - how many years the employee has been at the company.
"satisfaction" - a measure of employee satisfaction from surveys.
"avg_hrs_month" - the average hours the employee worked in a month.
"left" - "yes" if the employee ended up leaving, "no" otherwise.

import pandas as pd
import numpy as np
df = pd.read_csv('./data/employee_churn_data.csv')
df.head()

Exploratory Data Analysis

this exploration is guided by the objective to understand the variables in this dataset and how the independet variables relate to the dependent variable 'left' representing employee churn. Looking a the variable types, we have numerical and categorical vars and catch a first look.

categorical_var = ['department', 'promoted', 'salary', 'bonus', 'left']
numerical_var = ['review', 'projects', 'tenure', 'satisfaction', 'avg_hrs_month']

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.set()
sns.set(rc={"figure.figsize":(14, 9)})

df[numerical_var].hist(layout=(2,3), figsize=(8,6))
plt.suptitle('Numerical Variables')


fig, ax = plt.subplots(2,3, figsize=(8,6))
for catvar, sub in zip(categorical_var, ax.flatten()):
    df[catvar].value_counts().plot(kind='bar', title=catvar, layout=(2,3), ax=sub)
plt.suptitle('Categorical Variables')
fig.tight_layout()
plt.show()

Okay let's turn to the first question we have been tasked.

1. Which department has the highest employee turnover? Which one has the lowest?

turnover = pd.get_dummies(df)
print(turnover.columns)
col_order= ['promoted', 'review', 'projects', 'tenure', 'satisfaction', 'bonus',
       'avg_hrs_month', 'department_IT', 'department_admin',
       'department_engineering', 'department_finance', 'department_logistics',
       'department_marketing', 'department_operations', 'department_retail',
       'department_sales', 'department_support', 'salary_high', 'salary_low',
       'salary_medium', 'left' ]
turnover = turnover.reindex(columns=col_order)
turnover.head()

grouped_turnover = turnover.groupby('left')
#print(grouped_turnover['department_IT'])
grouped_turnover.first()

turnover.groupby('department_IT').first()

#df.left.value_counts(normalize=True)
df['left'] = df['left'].map({'yes':True, 'no':False})

#df.pivot(columns = 'department', values =['left'], aggfunc=np.mean)
piv = pd.pivot_table(df,values=['left'], index='tenure', columns = 'department', aggfunc=np.mean, margins=True, margins_name='total')
piv.columns = piv.columns.droplevel(0)
piv.loc['total_diff'] = (piv.loc['total'] / piv.loc['total','total'] -1 )
piv

piv.loc['total_diff'].index

def bar_values(x,y):
    for i in range(len(x)):
        if y[i]>0:
            plt.text(i-0.2,y[i]+0.001,round(y[i],3))
        else:
            plt.text(i-0.2,y[i]-0.004,round(y[i],3))
            
diff = piv.loc['total_diff'].sort_values()
print(diff)
rank = np.arange(0,len(diff))

palette = sns.color_palette(sns.diverging_palette(250, 15, s=75, l=40,
                                 n=11))

ax = sns.barplot(x = diff.index, y = diff.values, palette=np.array(palette)[rank])
plt.xticks(rotation=0)
plt.ylabel('rate difference')
bar_values(diff.index, diff.values)
plt.title('Turnover rate by department in relation to the total turnover rate')
plt.show()

Formal answer:

With some separation the departments marketing, retail, logistics and especially IT stand out with above average churn. Better than average are support, engineering, operations, sales, admin and particularly the finance department.