Skip to content
Project: Consolidating Employee Data
I just got hired as the first and only data practitioner at an imaginary small business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files...
I will work with the following data in the datasets folder:
- Office addresses
- Saved in
office_addresses.csv. - If the value for office is
NaN, then the employee is remote.
- Saved in
- Employee addresses
- Saved on the first tab of
employee_information.xlsx.
- Saved on the first tab of
- Employee emergency contacts
- Saved on the second tab of
employee_information.xlsx; this tab is calledemergency_contacts. - However, this sheet was edited at some point, and the headers were removed! The HR manager let you know that they should be:
employee_id,last_name,first_name,emergency_contact,emergency_contact_number, andrelationship.
- Saved on the second tab of
- Employee roles, teams, and salaries
- This information has been exported from the company's human resources management system into a JSON file titled
employee_roles.json.
- This information has been exported from the company's human resources management system into a JSON file titled
#Import required libraries
import pandas as pd
import json
#Read in csv and excel files as Dataframes
office = pd.read_csv('office_addresses.csv')
xls = pd.read_excel('employee_information.xlsx')
#Read in json file
with open('employee_roles.json') as f:
data = json.load(f)
#Convert to Dataframe
info = pd.DataFrame(data)xls.head()#Count total sheets in excel file
excel = pd.ExcelFile('employee_information.xlsx')
num_sheets = len(excel.sheet_names)
print(num_sheets)#Preview excel file format
preview = pd.read_excel('employee_information.xlsx', sheet_name=1)preview.head()#Create header for dataframe
emergency_contacts_header = ['employee_id', 'employee_last_name', 'employee_first_name', 'emergency_contact', 'emergency_contact_number', 'relationship']#Read in 2nd page of excel file with the header fixed
contacts = pd.read_excel('employee_information.xlsx', sheet_name=1, header=None, names=emergency_contacts_header)contacts.head()#Merge both excel Dataframes together
df = xls.merge(contacts, how='left', on=['employee_id', 'employee_last_name', 'employee_first_name'])df.head()info.head()#Transpose json Dataframe
roles = info.transpose()#Create employee id column for merge
roles.index.rename('employee_id', inplace=True)#Reset index
roles.reset_index()