
Cracking the Code of Employee Attrition: Unraveling Insights and Strategies for Retention and Success in the Workplace

Photo Credit: wallpaperflare

📷 HR Metrics: A Data-driven Journey

🌐 Link to Tableau Public dashboard


🧾 Executive summary

Below is a summary and breakdown of my analysis of the employee, performance-rating, and education-level data.

Employee Data:

  • The employee data consists of 1,470 rows and 23 columns, including employee information such as demographics, employment details, and performance metrics.
  • There are no missing values in the employee data, indicating a complete dataset.

Performance Rating Data:

  • The performance rating data consists of 6,709 rows and 1 column.
  • However, the file appears to be semicolon-delimited and was not parsed into individual columns, requiring further processing before analysis.

Education Level Data:

  • The education level data consists of 5 rows and 2 columns.
  • No missing values are present in the education level data.

Salary Analysis:

  • The average salary is 85,029.39, with a minimum of 20,387 and a maximum of 150,000.
  • There is a significant difference in salaries between different departments, as indicated by the T-tests and P-values.

Attrition Analysis:

  • The attrition rate is approximately 37%.
  • There is a significant association between attrition and various factors such as gender, education, and ethnicity, as indicated by the Chi-square tests.

Age and Years at Company Analysis:

  • Age and years at the company significantly impact attrition, as shown by the T-tests and ANOVA tests.
  • Employees in the age range of 30-39 have the highest attrition rate.
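The age-band comparison above can be reproduced by bucketing ages with `pd.cut` and averaging the boolean `Attrition` flag per band. A minimal sketch on toy data (the real analysis runs on the cleaned `employee_data` frame; these values are hypothetical):

```python
import pandas as pd

# Toy frame standing in for employee_data; Attrition is boolean after cleaning.
df = pd.DataFrame({
    'Age':       [25, 34, 38, 45, 52, 31],
    'Attrition': [False, True, True, False, False, True],
})

# Bucket ages into bands and compare attrition rates across them.
df['AgeBand'] = pd.cut(df['Age'], bins=[19, 29, 39, 49, 65],
                       labels=['20-29', '30-39', '40-49', '50-65'])
attrition_by_age = df.groupby('AgeBand', observed=True)['Attrition'].mean()
print(attrition_by_age)
```

On the full dataset, the band with the largest mean identifies the highest-attrition age group.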

Department-wise Attrition:

  • The human resources department has the highest attrition rate, followed by sales and technology.
  • The attrition rates differ significantly across departments.

Job Satisfaction Analysis:

  • Job satisfaction varies across departments, with different distributions of ratings.
  • Employees working overtime tend to have higher job satisfaction levels.

Retention Rate:

  • The retention rate for employees working overtime is 52.26%, while for non-overtime employees, it is 21.96%.
  • This indicates that overtime employees are more likely to stay with the company.
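The retention figures above come down to one groupby: retention is the share of employees per overtime group who did not leave. A minimal sketch on toy data (hypothetical values, not the real counts):

```python
import pandas as pd

# Toy frame standing in for the cleaned employee_data (Attrition is boolean).
df = pd.DataFrame({
    'OverTime':  ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Attrition': [False, True, False, True, True, False],
})

# Retention rate = share of employees who did NOT leave, per overtime group.
retention_by_overtime = 1 - df.groupby('OverTime')['Attrition'].mean()
print(retention_by_overtime)
```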

Based on these findings, it is recommended to focus on strategies to improve employee retention, especially in the human resources department. Addressing factors such as job satisfaction, work-life balance, and career development opportunities can contribute to reducing attrition rates and improving overall employee engagement.

Below is my Python code and analysis that I used to create my executive summary and Tableau Public dashboard:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency
import scipy.stats as stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the employee data from the CSV file
employee_data = pd.read_csv('data/employee.csv')

# Load the performance rating data
performance_data = pd.read_csv('data/performance_rating.csv')

# Load the education level data
education_data = pd.read_csv('data/education_level.csv')

# Display the first few rows of the employee data
print("First few rows of employee data:")
print(employee_data.head())

# Check the structure of the employee data (number of rows, columns, data types)
print("\nData structure of employee data:")
print(employee_data.info())

# Check for missing values in the employee data
print("\nMissing values in employee data:")
print(employee_data.isnull().sum())

# Get summary statistics for numerical columns in the employee data
print("\nSummary statistics of employee data:")
print(employee_data.describe())

# Explore unique values in categorical columns in the employee data
print("\nUnique values in categorical columns of employee data:")
categorical_columns = ['Gender', 'BusinessTravel', 'Department', 'State', 'Ethnicity', 'EducationField',
                       'JobRole', 'MaritalStatus', 'OverTime', 'Attrition']
for column in categorical_columns:
    unique_values = employee_data[column].unique()
    print(f"{column}: {unique_values}")

# Explore unique values and frequency counts for Education Level in the employee data
print("\nEducation Level in employee data:")
education_level_counts = employee_data['Education'].value_counts()
print(education_level_counts)

# Display the first few rows of the performance rating data
print("\nFirst few rows of performance rating data:")
print(performance_data.head())

# Check the structure of the performance rating data (number of rows, columns, data types)
print("\nData structure of performance rating data:")
print(performance_data.info())

# Check for missing values in the performance rating data
print("\nMissing values in performance rating data:")
print(performance_data.isnull().sum())

# Display the first few rows of the education level data
print("\nFirst few rows of education level data:")
print(education_data.head())

# Check the structure of the education level data (number of rows, columns, data types)
print("\nData structure of education level data:")
print(education_data.info())

# Check for missing values in the education level data
print("\nMissing values in education level data:")
print(education_data.isnull().sum())

Summary of Data

Employee Data:

  • The employee data consists of 1,470 rows and 23 columns.
  • The columns include EmployeeID, FirstName, LastName, Gender, Age, BusinessTravel, Department, DistanceFromHome (KM), State, Ethnicity, Education, EducationField, JobRole, MaritalStatus, Salary, StockOptionLevel, OverTime, HireDate, Attrition, YearsAtCompany, YearsInMostRecentRole, YearsSinceLastPromotion, and YearsWithCurrManager.
  • There are no missing values in the employee data.

Performance Rating Data:

  • The performance rating data consists of 6,709 rows and 1 column.
  • The column contains information separated by semicolons. It seems that the data has not been properly parsed into individual columns.
  • There are no missing values in the performance rating data.

Education Level Data:

  • The education level data consists of 5 rows and 2 columns.
  • The columns include EducationLevelID and EducationLevel.
  • There are no missing values in the education level data.

There is an issue with the format of the performance rating data: the file is semicolon-delimited, so the entire record landed in a single column. The data needs to be parsed into separate columns before further analysis can be performed on it.
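The simplest fix is to re-read the file with the correct delimiter rather than splitting the single column by hand. A sketch, assuming the file is semicolon-separated as its header suggests (the inline sample below is a hypothetical stand-in for `data/performance_rating.csv`):

```python
import io
import pandas as pd

# Two hypothetical rows standing in for the semicolon-delimited CSV file.
raw = (
    "PerformanceID;EmployeeID;JobSatisfaction\n"
    "PR1;E-001;4\n"
    "PR2;E-002;3\n"
)

# Passing sep=';' parses the columns directly, avoiding a manual str.split().
performance_data = pd.read_csv(io.StringIO(raw), sep=';')
print(performance_data.columns.tolist())
```

In the real pipeline this would be `pd.read_csv('data/performance_rating.csv', sep=';')`; the manual split shown further below achieves the same result.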

# Drop unnecessary columns
columns_to_drop = ['FirstName', 'LastName', 'HireDate']
employee_data = employee_data.drop(columns=columns_to_drop)

# Convert 'Attrition' column to boolean values (No: False, Yes: True)
employee_data['Attrition'] = employee_data['Attrition'].map({'No': False, 'Yes': True})

# Convert 'OverTime' column to boolean values (No: False, Yes: True)
employee_data['OverTime'] = employee_data['OverTime'].map({'No': False, 'Yes': True})

# Convert 'DistanceFromHome (KM)' column to float
employee_data['DistanceFromHome (KM)'] = employee_data['DistanceFromHome (KM)'].astype(float)

# Convert 'Salary' column to float
employee_data['Salary'] = employee_data['Salary'].astype(float)

# Handle outliers in 'Salary' column (e.g., cap values above a certain threshold)
salary_threshold = 150000
employee_data.loc[employee_data['Salary'] > salary_threshold, 'Salary'] = salary_threshold

# Handle outliers in 'Age' column (e.g., remove values above a certain threshold)
age_threshold = 65
employee_data = employee_data[employee_data['Age'] <= age_threshold]

# Handle outliers in 'YearsAtCompany' column (e.g., remove values above a certain threshold)
years_at_company_threshold = 30
employee_data = employee_data[employee_data['YearsAtCompany'] <= years_at_company_threshold]

# Handle missing values (e.g., drop rows with missing values)
employee_data = employee_data.dropna()

# Parse performance rating data into separate columns
performance_data = performance_data['PerformanceID;EmployeeID;ReviewDate;EnvironmentSatisfaction;JobSatisfaction;RelationshipSatisfaction;TrainingOpportunitiesWithinYear;TrainingOpportunitiesTaken;WorkLifeBalance;SelfRating;ManagerRating'].str.split(';', expand=True)
performance_data.columns = ['PerformanceID', 'EmployeeID', 'ReviewDate', 'EnvironmentSatisfaction', 'JobSatisfaction',
                            'RelationshipSatisfaction', 'TrainingOpportunitiesWithinYear', 'TrainingOpportunitiesTaken',
                            'WorkLifeBalance', 'SelfRating', 'ManagerRating']

# Convert 'EmployeeID' to a common integer type in both DataFrames
# (regex=True is required in recent pandas, where str.replace defaults to literal matching)
employee_data['EmployeeID'] = employee_data['EmployeeID'].str.replace(r'\D+', '', regex=True).astype(int)
performance_data['EmployeeID'] = performance_data['EmployeeID'].str.replace(r'\D+', '', regex=True).astype(int)

# Merge performance rating data with employee data
employee_data = pd.merge(employee_data, performance_data, on='EmployeeID', how='left')

# Define education level data
education_level_data = pd.DataFrame({
    'EducationLevelID': [1, 2, 3, 4, 5],
    'EducationLevel': ['Below College', 'College', 'Bachelor', 'Master', 'Doctor']
})

# Merge education level data with employee data
employee_data = pd.merge(employee_data, education_level_data, left_on='Education', right_on='EducationLevelID', how='left')

# Verify the cleaned data
print(employee_data.head())
# Visualize Salary Distribution: Histogram
plt.figure(figsize=(8, 6))
plt.hist(employee_data['Salary'], bins=20)
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.title('Salary Distribution')
plt.show()

# Calculate Descriptive Statistics
salary_stats = employee_data['Salary'].describe()
print(salary_stats)

# Identify Outliers: Box Plot
plt.figure(figsize=(8, 6))
plt.boxplot(employee_data['Salary'])
plt.xlabel('Salary')
plt.title('Salary Outliers')
plt.show()

# Explore Factors Influencing Salary: Scatter Plot
plt.figure(figsize=(8, 6))
plt.scatter(employee_data['YearsAtCompany'], employee_data['Salary'])
plt.xlabel('Years at Company')
plt.ylabel('Salary')
plt.title('Relationship between Years at Company and Salary')
plt.show()

# Perform Statistical Tests: T-test

# Assuming you want to compare salary distributions between three groups: Sales, Technology, and Human Resources
group1 = employee_data[employee_data['Department'] == 'Sales']['Salary']
group2 = employee_data[employee_data['Department'] == 'Technology']['Salary']
group3 = employee_data[employee_data['Department'] == 'Human Resources']['Salary']

# Perform t-tests for all pairwise comparisons
t_stat1, p_value1 = ttest_ind(group1, group2)
t_stat2, p_value2 = ttest_ind(group1, group3)
t_stat3, p_value3 = ttest_ind(group2, group3)

print('T-statistic (Sales vs. Technology):', t_stat1)
print('P-value (Sales vs. Technology):', p_value1)
print('T-statistic (Sales vs. Human Resources):', t_stat2)
print('P-value (Sales vs. Human Resources):', p_value2)
print('T-statistic (Technology vs. Human Resources):', t_stat3)
print('P-value (Technology vs. Human Resources):', p_value3)

# Create Visualizations: Bar Chart
avg_salary_by_department = employee_data.groupby('Department')['Salary'].mean()

plt.figure(figsize=(8, 6))
avg_salary_by_department.plot(kind='bar')
plt.xlabel('Department')
plt.ylabel('Average Salary')
plt.title('Average Salary by Department')
plt.show()

The descriptive statistics for the Salary column in the employee data are as follows:

  • Count: 7,384
  • Mean: 85,029.39
  • Standard Deviation: 45,178.38
  • Minimum: 20,387.00
  • 25th Percentile: 44,779.00
  • Median (50th Percentile): 73,918.00
  • 75th Percentile: 133,865.00
  • Maximum: 150,000.00

The pairwise t-tests between the Sales, Technology, and Human Resources departments all return small p-values, indicating statistically significant differences in salary distributions, and therefore in compensation, across the departments.
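Since three departments are being compared, a single one-way ANOVA is the conventional first test before running pairwise t-tests (which otherwise inflate the family-wise error rate). A sketch on toy salary samples (the group means and spreads below are hypothetical, not the real department figures):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Toy salary samples standing in for the three department groups.
sales = rng.normal(70000, 5000, 50)
tech = rng.normal(90000, 5000, 50)
hr = rng.normal(75000, 5000, 50)

# One-way ANOVA tests the null hypothesis that all three group means are equal.
f_stat, p_value = f_oneway(sales, tech, hr)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.2e}")
```

A significant ANOVA result would then justify the pairwise follow-up comparisons shown above, ideally with a multiple-comparison correction.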

# Attrition factors to consider
attrition_factors = ['BusinessTravel', 'Department', 'EducationField', 'Ethnicity','Gender', 'JobRole', 'MaritalStatus']

# Calculate attrition rates for each factor
attrition_rates = {}
for factor in attrition_factors:
    factor_attrition_rates = employee_data.groupby(factor)['Attrition'].mean()
    attrition_rates[factor] = factor_attrition_rates

# Plot attrition rates for each factor
for factor in attrition_factors:
    plt.figure(figsize=(8, 6))
    if factor in ['EducationField', 'JobRole']:
        sns.barplot(x=attrition_rates[factor].index, y=attrition_rates[factor].values)
        plt.xlabel(factor)
        plt.ylabel('Attrition Rate')
        plt.title('Attrition Rate by {}'.format(factor))
        plt.xticks(rotation=45, ha='right')  # Rotate x-labels by 45 degrees
    elif factor == 'Ethnicity':
        plt.xticks(rotation=60, ha='right')  # Adjust rotation for 'Ethnicity' factor
        sns.barplot(x=attrition_rates[factor].index, y=attrition_rates[factor].values)
        plt.xlabel(factor)
        plt.ylabel('Attrition Rate')
        plt.title('Attrition Rate by {}'.format(factor))
    else:
        sns.barplot(x=attrition_rates[factor].index, y=attrition_rates[factor].values)
        plt.xlabel(factor)
        plt.ylabel('Attrition Rate')
        plt.title('Attrition Rate by {}'.format(factor))
    plt.show()

attrition_counts = employee_data['Attrition'].value_counts()
print(attrition_counts)

# Perform statistical tests (e.g., t-tests or chi-square tests) to determine the significance of these relationships

# Categorical variables: Chi-square test
categorical_variables = ['Department', 'Gender', 'Education', 'Ethnicity']

for variable in categorical_variables:
    contingency_table = pd.crosstab(employee_data['Attrition'], employee_data[variable])

    # Chi-square test
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    print(f"Chi-square test results for {variable}:")
    print(f"Chi-square statistic: {chi2_stat}")
    print(f"P-value: {p_value}")
    print('\n')

# Continuous variables: t-test and ANOVA
continuous_variables = ['Age', 'YearsAtCompany']

for variable in continuous_variables:
    attrition_yes = employee_data[employee_data['Attrition']][variable]
    attrition_no = employee_data[~employee_data['Attrition']][variable]

    # T-test
    t_stat, p_value_ttest = stats.ttest_ind(attrition_yes, attrition_no, nan_policy='omit')
    print(f"T-test results for {variable}:")
    print(f"T-statistic: {t_stat}")
    print(f"P-value: {p_value_ttest}")

    # ANOVA test
    f_stat, p_value_anova = stats.f_oneway(attrition_yes, attrition_no)
    print(f"ANOVA test results for {variable}:")
    print(f"F-statistic: {f_stat}")
    print(f"P-value: {p_value_anova}")
    print('\n')

For the categorical variables (Department, Gender, Education, and Ethnicity), all four chi-square p-values are extremely small (close to zero), giving strong evidence to reject the null hypothesis of independence. We can therefore conclude that each of these variables has a statistically significant association with attrition.
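With samples this large, almost any association will reach statistical significance, so an effect-size measure such as Cramér's V helps gauge how strong each association actually is. A sketch on a hypothetical attrition-by-department contingency table (not part of the original analysis; the counts are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table: attrition (rows) x department (columns).
table = np.array([[60, 80, 20],
                  [40, 120, 30]])

chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
# Cramér's V scales the chi-square statistic to [0, 1] by sample size and table shape.
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"Cramér's V: {cramers_v:.3f}")
```

Values near 0 indicate a negligible association even when the p-value is tiny; values toward 1 indicate a strong one.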

# List the distinct ethnicity categories present in the data
ethnicity_values = employee_data['Ethnicity'].unique()
print(ethnicity_values)