You just got hired as the first and only data practitioner at a small business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files...
You'll work with the following data in the datasets
folder:
- Office addresses are currently saved in
office_addresses.csv
. If the value for office isNaN
, then the employee is remote. - Employee addresses are saved on the first tab of
employee_information.xlsx
. - Employee emergency contacts are saved on the second tab of
employee_information.xlsx
; this tab is calledemergency_contacts
. However, this sheet was edited at some point, and the headers were removed! The HR manager let you know that they should be:employee_id
,last_name
,first_name
,emergency_contact
,emergency_contact_number
, andrelationship
. - Employee roles, teams, and salaries have been exported from the company's human resources management system into a JSON file titled
employee_roles.json
. Here are the first few lines of that file:
{"A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }
Task Instructions
- Create a single DataFrame called
employees_final
containing: Index:employee_id
,
Columns: first_name
, last_name
, employee_country
, employee_city
, employee_street
,
employee_street_number
, emergency_contact
, emergency_contact_number
,
relationship
, monthly_salary
, team
, title
, office
, office_country
,
office_city
, office_street
, office_street_number
.
- Change any missing values in column names starting with
office
to the word"Remote"
.
import pandas as pd
# Start coding here...
#Import office addresses file
office_addresses = pd.read_csv('datasets/office_addresses.csv')
#Check the contents of the dataframe
office_addresses.head()
Read Excel
For merge purposes, i assumed that an employee lives in the City where the office city is located as well, it only makes sense right?
I noticed that the value of office city for US is 'New York City' in the office_addresses table but it's 'New-York' in the employee address table, so i had to fix the value in this table since it will be merged on
#Import employee addresses from the first worksheet of employee_information.xlsx
employee_addresses=pd.read_excel('datasets/employee_information.xlsx', sheet_name=0)
#Set the name of employee city to match the one in office city
employee_addresses.loc[employee_addresses['employee_city'] == 'New-York', 'employee_city'] = 'New York City'
employee_addresses.head()
# Import employees emergency contacts from the second worksheet of employee_information.xlsx
column_names = ["employee_id", "last_name", "first_name", "emergency_contact", "emergency_contact_number", "relationship"]
sheet= 'emergency_contacts'
employee_emergency_contacts= pd.read_excel('datasets/employee_information.xlsx', sheet_name=sheet, header=None, names=column_names)
employee_emergency_contacts.head()
Read JSON
#Read the json file
employee_roles = pd.read_json('datasets/employee_roles.json', orient='index')
#Make a column(employee_id) out of the index
employee_roles.reset_index(inplace=True)
employee_roles.rename(columns={'index': 'employee_id'}, inplace=True)
# Check the contents of the file
employee_roles.head()
Merge Dataframes
# merge the tables
#Merge employee adresses with employee emergency contacts
addresses_emergency_contacts = employee_addresses.merge(employee_emergency_contacts, how='left', on='employee_id')
#Merge the result of the merge above with employee roles table
addresses_emergency_role= addresses_emergency_contacts.merge(employee_roles, how='left', on='employee_id')
#Merge the resulting df from above with office addresses
merged_dfs = addresses_emergency_role.merge(office_addresses, how='left', left_on='employee_country', right_on='office_country')
#Check the contents of the merged datasets
merged_dfs
Filter required Columns
The last action after merging is to get the required dataframe with selected columns.
#create a list of the columns i want the final dataframe to have
columns = ['employee_id',
'first_name', 'last_name', 'employee_country', 'employee_city', 'employee_street',
'employee_street_number', 'emergency_contact', 'emergency_contact_number',
'relationship', 'monthly_salary', 'team', 'title', 'office', 'office_country',
'office_city', 'office_street', 'office_street_number'
]
#Filter the merged dataframe on the columns
employees_final= merged_dfs[columns]
#Set the employee id column as the index for rows
employees_final.set_index('employee_id', inplace=True)
employees_final
#Check for any missing values in the column names starting with 'office'
office_columns = ['office', 'office_country', 'office_city', 'office_street', 'office_street_number']
office_addresses_na=employees_final[office_columns].isna().any(axis=1)
#Returns 1 row with missing values
#Fill missing values in columns starting with 'office' with 'Remote'
employees_final[office_columns] = employees_final[office_columns].fillna('Remote')
#Check the final dataframe
employees_final