Skip to content
Project: Consolidating Employee Data
Joining data from disparate sources
You just got hired as the first and only data practitioner at a small business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files...
You'll work with the following data in the datasets
folder:
- Office addresses are currently saved in
office_addresses.csv
. If the value for office isNaN
, then the employee is remote. - Employee addresses are saved on the first tab of
employee_information.xlsx
. - Employee emergency contacts are saved on the second tab of
employee_information.xlsx
; this tab is calledemergency_contacts
. However, this sheet was edited at some point, and the headers were removed! The HR manager let you know that they should be:employee_id
,last_name
,first_name
,emergency_contact
,emergency_contact_number
, andrelationship
. - Employee roles, teams, and salaries have been exported from the company's human resources management system into a JSON file titled
employee_roles.json
. Here are the first few lines of that file:
{"A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }
import pandas as pd
# Read the CSV
office_df = pd.read_csv('datasets/office_addresses.csv')
# print(office_df.columns)
# Read the first Excel sheet (default)
emp_addr_df=pd.read_excel('datasets/employee_information.xlsx')
print(emp_addr_df.columns )
# Create the header as the second sheet doesn't have one
header = ['employee_id', 'employee_last_name', 'employee_first_name', 'contact_name', 'contact_no', 'contact_relationship']
# Read the second sheet (sheet_name=1), has no header, set the header
emp_contact_df=pd.read_excel('datasets/employee_information.xlsx', sheet_name=1, header=None, names=header)
print(emp_contact_df )
emp_addr_contact_df = emp_contact_df.merge(emp_addr_df, on='employee_id', suffixes=('_con','_add'))
print(emp_addr_contact_df.columns)
emp_office_df = emp_addr_contact_df.merge(office_df, how='left', left_on='employee_country', right_on='office_country', suffixes=('_emp', '_ofc'))
print(emp_office_df.columns)
# Read the json data, specifying the format with orient para
emp_roles = pd.read_json('datasets/employee_roles.json', orient='index')
emp_roles['employee_id']= emp_roles.index
emp_roles.reset_index()
# emp_roles.columns=['employee_id', 'title', 'monthly_salary', 'team']
print(emp_roles)
employees_final = emp_office_df.merge(emp_roles, on='employee_id')
# Set emp_id as index
employees_final.set_index('employee_id')
# drop redundant columns
employees_final.drop(['employee_last_name_add', 'employee_first_name_add'], axis=1, inplace=True)
# rename the columns to specified
employees_final.columns = ['employee_id','last_name', 'first_name', 'emergency_contact', 'emergency_contact_number', 'relationship','employee_country', 'employee_city', 'employee_street', 'employee_street_number', 'office', 'office_country', 'office_city','office_street', 'office_street_number', 'title', 'monthly_salary', 'team']
# Re-arrange the columns
employees_final = employees_final[['employee_id','first_name','last_name','employee_country', 'employee_city', 'employee_street', 'employee_street_number','emergency_contact', 'emergency_contact_number', 'relationship','monthly_salary', 'team','title', 'office', 'office_country', 'office_city','office_street', 'office_street_number']]
employees_final.index = employees_final['employee_id']
employees_final.drop('employee_id', axis=1, inplace=True)
# Replace NaN in office with 'Remote'
employees_final.fillna('Remote', inplace=True)
print(employees_final.isnull())
print(employees_final.columns)
print(employees_final.loc['M1Z7U9'])