Skip to content
Project: Consolidating Employee Data
  • AI Chat
  • Code
  • Report
  • Spinner

    You just got hired as the first and only data practitioner at a small business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files...

    You'll work with the following data in the datasets folder:

    • Office addresses are currently saved in office_addresses.csv. If the value for office is NaN, then the employee is remote.
    • Employee addresses are saved on the first tab of employee_information.xlsx.
    • Employee emergency contacts are saved on the second tab of employee_information.xlsx; this tab is called emergency_contacts. However, this sheet was edited at some point, and the headers were removed! The HR manager let you know that they should be: employee_id, last_name, first_name, emergency_contact, emergency_contact_number, and relationship.
    • Employee roles, teams, and salaries have been exported from the company's human resources management system into a JSON file titled employee_roles.json. Here are the first few lines of that file:
    {"A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }
    import pandas as pd
    
    # load csv file
    df_office = pd.read_csv("./datasets/office_addresses.csv")
    df_office.head()
    #load first worksheet
    employee_excel = pd.read_excel("./datasets/employee_information.xlsx", sheet_name=0, index_col=0)
    
    #rename some columns to format provided in solution
    employee_excel.rename(columns={"employee_last_name":"last_name", "employee_first_name":"first_name"}, inplace=True)
    employee_excel.head()
    #Names of columns
    columns_contact = ["employee_id", "last_name", "first_name","emergency_contact","emergency_contact_number","relationship" ]
    
    # load 2nd excel worksheet with specified names as columns
    employee_contacts = pd.read_excel("./datasets/employee_information.xlsx", sheet_name=1, names=columns_contact, header=None, index_col=0)
    
    #drop names column since it contains employee_id
    employee_contacts.drop(columns=["first_name","last_name"], inplace=True)
    employee_contacts.head()
    # import json file
    df_roles = pd.read_json("./datasets/employee_roles.json", orient="index")
    df_roles = df_roles.loc[:,["monthly_salary", "team", "title"]]
    
    df_roles
    # concat the 3 datasets with same id as employee_id and reset its index
    employee_details = pd.concat([employee_excel, employee_contacts,df_roles], axis=1, sort=False)
    employee_details.reset_index(inplace=True)
    employee_details.rename(columns={'index':"employee_id"}, inplace=True)
    
    
    employee_details
    employees_final = employee_details.merge(df_office,how='outer',right_on='office_country', left_on='employee_country')
    
    
    employees_final = employees_final.set_index("employee_id")
    employees_final
    #fill null values
    employees_final.fillna("Remote", inplace=True)
    
    #switching column 1 with 2 and vice versa
    col_list = employees_final.columns.tolist()
    col_list[0], col_list[1] = col_list[1], col_list[0]
    employees_final = employees_final[col_list]
    
    #validate final result
    employees_final