Skip to content
0

📖 Background

You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. f you're tackling the advanced level, you'll go a step further—building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.

💾 The data

The data comes from a survey hosted by an HR consultancy, available in 'salaries.csv'.

Each row represents a single employee's salary record for a given year:
  • work_year - The year the salary was paid.
  • experience_level - Employee experience level:
    • EN: Entry-level / Junior
    • MI: Mid-level / Intermediate
    • SE: Senior / Expert
    • EX: Executive / Director
  • employment_type - Employment type:
    • PT: Part-time
    • FT: Full-time
    • CT: Contract
    • FL: Freelance
  • job_title - The job title during the year.
  • salary - Gross salary paid (in local currency).
  • salary_currency - Salary currency (ISO 4217 code).
  • salary_in_usd - Salary converted to USD using average yearly FX rate.
  • employee_residence - Employee's primary country of residence (ISO 3166 code).
  • remote_ratio - Percentage of remote work:
    • 0: No remote work (<20%)
    • 50: Hybrid (50%)
    • 100: Fully remote (>80%)
  • company_location - Employer's main office location (ISO 3166 code).
  • company_size - Company size:
    • S: Small (<50 employees)
    • M: Medium (50–250 employees)
    • L: Large (>250 employees)

1 hidden cell

💪 Competition challenge

In this first level, you’ll explore and summarise the dataset to understand its structure and key statistics. If you want to push yourself further, check out level two! Create a report that answers the following:

  • How many records are in the dataset, and what is the range of years covered?
  • What is the average salary (in USD) for Data Scientists and Data Engineers? Which role earns more on average?
  • How many full-time employees based in the US work 100% remotely?
import pandas as pd
salaries_df = pd.read_csv('salaries.csv')
salaries_df.head(5)

1A. How many records are in the dataset?

salaries_df.shape

There are 57194 rows in the and 11 columns. There are 57194 records with 11 fields of additional information

1B. What is the range of years covered?

# I need to check for unique year  dates in the  work_year column
unique_years = salaries_df['work_year'].unique()
unique_years

The years covered are between 2020 and 2024

2A. What is the average salary (in USD) for Data Scientists and Data Engineers?

'''For Data Scientists'''
# I need to obtain a filter of a dataframe containing Data Scientist only
ds = salaries_df['job_title'] == 'Data Scientist'
ds_df = salaries_df[ds]

# Due to outliers, I need to use the median for the average. I need to sort the df by the salary_in_usd
ds_df_sorted = ds_df.sort_values(by = 'salary_in_usd')
ds_df_sorted['salary_in_usd'].median()
'''For Data Engineers'''
# I need to obtain a filter of a dataframe containing Data Engineers only
de = salaries_df['job_title'] == 'Data Engineer'
de_df = salaries_df[de]

# Due to outliers, I need to use the median for the average. I need to sort the df by the salary_in_usd
de_df_sorted = de_df.sort_values(by = 'salary_in_usd')
de_df_sorted['salary_in_usd'].median()

Data Scientists earn 154660.0 while Data Engineers earn 140000.0 on average