Skip to content
Competition - HR Analytics in Tableau
Data Cleaning
import pandas as pd
el_df = pd.read_csv("data/education_level.csv")
em_df = pd.read_csv("data/employee.csv", parse_dates=['HireDate'])
pr_df = pd.read_csv("data/performance_rating.csv", delimiter=';', parse_dates=['ReviewDate'])em_df.info()em_df.nunique()Number of unique EmployeeID is lesser than the number of EmployeeID. This means that there are EmployeeID duplicates. The EmployeeID duplicates turned out to be same value assigned to three different employees. I will assign a unique EmployeeID to these three employees after doing the table merge with education_level table and performance_rating table.
em_duplicate = em_df[em_df.duplicated("EmployeeID", keep=False)]
em_duplicatedf1 = em_df.merge(el_df, left_on="Education", right_on="EducationLevelID")
df1.head()df1.info()pr_df.info()pr_df.nunique()Performance_rating table has about 4x more rows than employee table because every employee has to do yearly review from the day it joined the company. For example, an employee who joined in year 2015 would have 7 reviews up till year 2022.
pr_df.sort_values(by='ReviewDate',ascending=False)
pr_dfdf2 = df1.merge(pr_df, on="EmployeeID")
df2.info()df2.sort_values(by='ReviewDate',ascending=False)
df2[12]
df2['HireDate'] = pd.to_datetime(df2['HireDate'])
df2['ReviewDate'] = pd.to_datetime(df2['ReviewDate'])
df3 = df2[df2['HireDate'] <= df2['ReviewDate']]
df3