
📖 Background

You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. If you're tackling the advanced level, you'll go a step further: building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.

💾 The data

The data comes from a survey hosted by an HR consultancy, available in 'salaries.csv'.

Each row represents a single employee's salary record for a given year:
  • work_year - The year the salary was paid.
  • experience_level - Employee experience level:
    • EN: Entry-level / Junior
    • MI: Mid-level / Intermediate
    • SE: Senior / Expert
    • EX: Executive / Director
  • employment_type - Employment type:
    • PT: Part-time
    • FT: Full-time
    • CT: Contract
    • FL: Freelance
  • job_title - The job title during the year.
  • salary - Gross salary paid (in local currency).
  • salary_currency - Salary currency (ISO 4217 code).
  • salary_in_usd - Salary converted to USD using average yearly FX rate.
  • employee_residence - Employee's primary country of residence (ISO 3166 code).
  • remote_ratio - Percentage of remote work:
    • 0: No remote work (<20%)
    • 50: Hybrid (50%)
    • 100: Fully remote (>80%)
  • company_location - Employer's main office location (ISO 3166 code).
  • company_size - Company size:
    • S: Small (<50 employees)
    • M: Medium (50–250 employees)
    • L: Large (>250 employees)

EXECUTIVE SUMMARY

  1. Employees resident in countries such as the US, GB and AU earn considerably more, on average, than those resident in RU, DZ or NG.
  2. Employees at the Senior/Expert and Executive/Director levels earn higher salaries on average. Those at the other experience levels earn considerably less.
  3. Employees working on-site or fully remotely have a higher average salary than those working hybrid.
  4. A model was built and used to predict the annual salary of a mid-level employee working fully remotely in the US. The predicted figure is $33,400.38 per annum.
  5. The strongest predictors of salary are employee residence in the US and Senior Executive experience level.
  6. Incorporating company size and employment type improved the salary-prediction model by reducing the mean absolute error.
  7. To improve future salary predictions, additional features that capture more salary-influencing factors can be introduced. These features can be derived from job characteristics, market conditions, and employee background.
#import data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

salaries_df=pd.read_csv('salaries.csv')
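#creating a new DataFrame that keeps only the job roles of interest

roles = ["Data Analyst", "Data Scientist", "Machine Learning Engineer"]
#.copy() makes roles_df independent of salaries_df, so columns can be added later without a SettingWithCopyWarning
roles_df = salaries_df[salaries_df["job_title"].isin(roles)].copy()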
roles_df.head(10)
#top 10 mean salaries by country of residence and job title
roles_df.groupby(["employee_residence", "job_title"])["salary_in_usd"].mean().round(2).sort_values(ascending=False).head(10)
#salary distribution by country (visualization)
plt.figure(figsize=(12, 6))
sns.boxplot(data=roles_df, x="employee_residence", y="salary_in_usd", hue="job_title")
plt.xticks(rotation=90)
plt.title("Salary by Country")

Employees resident in countries such as the US, GB and AU earn considerably more, on average, than those resident in RU, DZ or NG.
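
Country averages can be skewed when a country has only a handful of records. The following is a minimal sketch of reporting the mean alongside the record count and keeping only countries above a threshold. It uses a small made-up DataFrame in place of roles_df, and the min_count threshold is an arbitrary assumption for illustration.

```python
import pandas as pd

# Toy stand-in for roles_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "employee_residence": ["US", "US", "US", "GB", "GB", "NG"],
    "salary_in_usd": [150000, 130000, 140000, 90000, 110000, 30000],
})

# Hypothetical threshold: skip countries with fewer records than this.
min_count = 2

# Mean salary and number of records per country of residence.
country_stats = (
    toy_df.groupby("employee_residence")["salary_in_usd"]
    .agg(mean_salary="mean", n_records="count")
)

# Keep only countries with enough records, highest mean first.
reliable = (
    country_stats[country_stats["n_records"] >= min_count]
    .sort_values("mean_salary", ascending=False)
)
print(reliable)
```

On the real data, the same pattern could be applied to roles_df with a threshold chosen to suit the sample sizes in the survey.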

#salary distribution by experience level
roles_df.groupby(["experience_level", "job_title"])["salary_in_usd"].mean().round(2).unstack()
#salary distribution by experience level (visualization)

plt.figure(figsize=(8, 5))
sns.boxplot(data=roles_df, x="experience_level", y="salary_in_usd", hue="job_title")
plt.title("Salary Distribution by Experience Level")

Employees at the Senior/Expert and Executive/Director levels earn higher salaries on average. Those at the other experience levels earn considerably less.
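
The experience codes sort alphabetically (EN, EX, MI, SE), which places Executive before Mid-level in tables and plots. One possible way to show them in seniority order is an ordered categorical. This sketch uses invented toy data, and the name level_order is an assumption introduced here.

```python
import pandas as pd

# Toy stand-in for roles_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "experience_level": ["SE", "EN", "EX", "MI", "SE", "EN"],
    "salary_in_usd": [140000, 60000, 190000, 95000, 150000, 70000],
})

# Seniority order taken from the data dictionary: entry, mid, senior, executive.
level_order = ["EN", "MI", "SE", "EX"]
toy_df["experience_level"] = pd.Categorical(
    toy_df["experience_level"], categories=level_order, ordered=True
)

# Group results now follow the categorical order instead of alphabetical order.
mean_by_level = (
    toy_df.groupby("experience_level", observed=True)["salary_in_usd"].mean()
)
print(mean_by_level)
```

The same list could also be passed to sns.boxplot through its order argument so that the x-axis follows seniority.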

#salary distribution by remote ratio

#define remote ratio categories
def categorize_remote(ratio):
    if ratio == 100:
        return "Remote"
    elif ratio == 0:
        return "On-site"
    else:
        return "Hybrid"
#create a new column 'remote category'
roles_df["remote_category"] = roles_df["remote_ratio"].apply(categorize_remote)
#group the job titles by remote category
roles_df.groupby(["job_title","remote_category"])["salary_in_usd"].mean().round(2).unstack()
#salary distribution by remote ratio (visualization)

plt.figure(figsize=(8, 5))
sns.boxplot(data=roles_df, x="remote_category", y="salary_in_usd", hue="job_title")
plt.title("Salary vs Remote Work")

Employees working on-site or fully remotely have a higher average salary than those working hybrid.

#Develop a predictive model to estimate an employee's salary (in USD) using experience level, employee residence, and remote ratio.

import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Select relevant features
df = salaries_df[["salary_in_usd", "experience_level", "employee_residence", "remote_ratio"]]

# Drop missing values (reassign instead of inplace=True, since df is a slice of salaries_df)
df = df.dropna()

# One-Hot Encode categorical variables
df = pd.get_dummies(df, columns=["experience_level", "employee_residence", "remote_ratio"], drop_first=True)

# Separate features and target variable
X = df.drop("salary_in_usd", axis=1)
y = df["salary_in_usd"]
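
With X and y separated, one way the imported train_test_split and LinearRegression might be used is sketched below, scored with the mean absolute error mentioned in the executive summary. This is an illustration on invented toy data, not the notebook's actual fitting code. The split size and random_state are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy stand-in for salaries_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "salary_in_usd": [60000, 95000, 140000, 190000, 40000, 70000, 100000, 150000],
    "experience_level": ["EN", "MI", "SE", "EX", "EN", "MI", "SE", "EX"],
    "employee_residence": ["US", "US", "US", "US", "GB", "GB", "GB", "GB"],
    "remote_ratio": [0, 50, 100, 0, 50, 100, 0, 50],
})

# Same encoding as above: one dummy column per category, first level dropped.
encoded = pd.get_dummies(
    toy_df,
    columns=["experience_level", "employee_residence", "remote_ratio"],
    drop_first=True,
)
X = encoded.drop("salary_in_usd", axis=1)
y = encoded["salary_in_usd"]

# Hold out 25% of rows for evaluation (assumed split; fixed seed for repeatability).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit ordinary least squares on the training rows.
model = LinearRegression()
model.fit(X_train, y_train)

# Mean absolute error on the held-out rows, in USD.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on held-out rows: {mae:,.2f}")
```

Because drop_first=True removes one level per feature, each coefficient is read relative to the dropped baseline category (for example, EN for experience_level).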