Skip to content
0

Data Cleaning

import pandas as pd

el_df = pd.read_csv("data/education_level.csv")
em_df = pd.read_csv("data/employee.csv", parse_dates=['HireDate'])
pr_df = pd.read_csv("data/performance_rating.csv", delimiter=';', parse_dates=['ReviewDate'])
em_df.info()
em_df.nunique()

Number of unique EmployeeID is lesser than the number of EmployeeID. This means that there are EmployeeID duplicates. The EmployeeID duplicates turned out to be same value assigned to three different employees. I will assign a unique EmployeeID to these three employees after doing the table merge with education_level table and performance_rating table.

em_duplicate = em_df[em_df.duplicated("EmployeeID", keep=False)]
em_duplicate
df1 = em_df.merge(el_df, left_on="Education", right_on="EducationLevelID")
df1.head()
df1.info()
pr_df.info()
pr_df.nunique()

Performance_rating table has about 4x more rows than employee table because every employee has to do yearly review from the day it joined the company. For example, an employee who joined in year 2015 would have 7 reviews up till year 2022.

pr_df.sort_values(by='ReviewDate',ascending=False)
pr_df
df2 = df1.merge(pr_df, on="EmployeeID")
df2.info()
df2.sort_values(by='ReviewDate',ascending=False)
df2
[12]
df2['HireDate'] = pd.to_datetime(df2['HireDate'])
df2['ReviewDate'] = pd.to_datetime(df2['ReviewDate'])
df3 = df2[df2['HireDate'] <= df2['ReviewDate']]
df3