Data salaries 💰 - Analysis & Insights

💪 Competition challenge

First level: explore and summarise the dataset to understand its structure and key statistics.

How many records are in the dataset, and what is the range of years covered?
What is the average salary (in USD) for Data Scientists and Data Engineers? Which role earns more on average?
How many full-time employees based in the US work 100% remotely?

📖 Background

You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. If you're tackling the advanced level, you'll go a step further, building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.

📖 Executive Summary

This report provides an analysis of salary trends in the global tech industry using data from a survey of employees worldwide. Our goal is to understand how job roles, experience levels, remote work arrangements, and company size impact salaries.

Target Audience

The target audience consists of talent acquisition professionals looking to attract top data science talent, especially in competitive markets, and individuals aiming to pursue a career in data.

🔮 Key Findings:

The dataset includes more than 57,000 records from 2020 to 2024, and 81% of them are from 2024. This concentration benefits analysis of the current market, but it limits how reliably historical trends can be examined.
Data Engineers earn an average salary of $149K per year, while Data Scientists earn an average of $159K per year. This $10K difference might reflect the current demand and market value of data science expertise, but other factors, such as experience level or location, may also contribute to it and need to be examined.
The majority of salary records (68%) represent full-time, non-remote positions. However, 19.4% of records indicate full-time remote work, which demonstrates a substantial presence of remote work arrangements within the data-related job market.
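As a rough illustration of how figures like these could be derived, the sketch below computes the average salary per role, the share of full-time records by remote ratio, and the count of US-based, full-time, fully remote employees. It runs on a tiny made-up frame rather than 'salaries.csv', and the helper names are hypothetical. It also assumes that the raw codes (such as 'FT') are still in place and that the shares are expressed relative to all records.

```python
import pandas as pd

# Toy stand-in for salaries.csv; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "employment_type": ["FT", "FT", "FT", "PT"],
    "job_title": ["Data Scientist", "Data Engineer", "Data Scientist", "Data Engineer"],
    "salary_in_usd": [150_000, 140_000, 170_000, 60_000],
    "employee_residence": ["US", "US", "DE", "US"],
    "remote_ratio": [100, 0, 100, 100],
})

def mean_salary_by_role(df, roles):
    """Mean salary_in_usd for each of the given job titles."""
    subset = df[df["job_title"].isin(roles)]
    return subset.groupby("job_title")["salary_in_usd"].mean()

def full_time_share_by_remote_ratio(df):
    """Percentage of ALL records that are full-time, split by remote_ratio."""
    full_time = df[df["employment_type"] == "FT"]
    return full_time["remote_ratio"].value_counts().div(len(df)).mul(100)

def count_us_full_time_remote(df):
    """Number of full-time, US-resident records with remote_ratio == 100."""
    mask = (
        (df["employment_type"] == "FT")
        & (df["employee_residence"] == "US")
        & (df["remote_ratio"] == 100)
    )
    return int(mask.sum())

role_means = mean_salary_by_role(toy_df, ["Data Scientist", "Data Engineer"])
remote_shares = full_time_share_by_remote_ratio(toy_df)
us_remote_count = count_us_full_time_remote(toy_df)
```

On the real data, the same helpers would be called with `salaries_df` in place of `toy_df`. If the employment codes have already been mapped to labels, the comparison value 'FT' would need to be replaced accordingly.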

💾 The data

The data comes from a survey hosted by an HR consultancy, available in 'salaries.csv'.

Each row represents a single employee's salary record for a given year:

work_year - The year the salary was paid.
experience_level - Employee experience level:
- EN: Entry-level / Junior
- MI: Mid-level / Intermediate
- SE: Senior / Expert
- EX: Executive / Director
employment_type - Employment type:
- PT: Part-time
- FT: Full-time
- CT: Contract
- FL: Freelance
job_title - The job title during the year.
salary - Gross salary paid (in local currency).
salary_currency - Salary currency (ISO 4217 code).
salary_in_usd - Salary converted to USD using the average yearly FX rate.
employee_residence - Employee's primary country of residence (ISO 3166 code).
remote_ratio - Percentage of remote work:
- 0: No remote work (<20%)
- 50: Hybrid (50%)
- 100: Fully remote (>80%)
company_location - Employer's main office location (ISO 3166 code).
company_size - Company size:
- S: Small (<50 employees)
- M: Medium (50–250 employees)
- L: Large (>250 employees)
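The data dictionary above can double as a light validation step. The following is a minimal sketch, assuming the coded columns should only contain the values listed above; the `ALLOWED_CODES` dictionary and the `find_unexpected_codes` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Allowed codes, copied from the data dictionary above.
ALLOWED_CODES = {
    "experience_level": {"EN", "MI", "SE", "EX"},
    "employment_type": {"PT", "FT", "CT", "FL"},
    "remote_ratio": {0, 50, 100},
    "company_size": {"S", "M", "L"},
}

def find_unexpected_codes(df, allowed=ALLOWED_CODES):
    """Return {column: set of values that are not in the data dictionary}."""
    unexpected = {}
    for column, codes in allowed.items():
        if column not in df.columns:
            continue  # Skip columns that are absent from this frame.
        extra = set(df[column].dropna().unique()) - codes
        if extra:
            unexpected[column] = extra
    return unexpected

# Made-up example with one invalid experience level and one invalid remote ratio.
example_df = pd.DataFrame({
    "experience_level": ["EN", "XX"],
    "remote_ratio": [0, 75],
})
problems = find_unexpected_codes(example_df)
```

An empty result would suggest that the coded columns match the documented values; anything else points at rows worth inspecting before analysis.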

🔎 Dataset Overview

import pandas as pd
import matplotlib.ticker as mticker
salaries_df = pd.read_csv('salaries.csv')

To understand the dataset structure, the general information (columns, data types, and non-null counts) was inspected first.

salaries_df.info()
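Beyond `info()`, the number of records, the range of years, and the share of records per year can be summarised in a few lines. The sketch below is one possible way to do this; `summarise_years` is a hypothetical helper, and the small frame is invented so that the example runs without 'salaries.csv'.

```python
import pandas as pd

def summarise_years(df):
    """Return the record count, (first_year, last_year), and % of records per year."""
    n_records = len(df)
    year_range = (int(df["work_year"].min()), int(df["work_year"].max()))
    share_by_year = (
        df["work_year"]
        .value_counts(normalize=True)  # Fraction of records per year.
        .sort_index()
        .mul(100)
        .round(1)
    )
    return n_records, year_range, share_by_year

# Made-up years, for illustration only.
years_df = pd.DataFrame({"work_year": [2020, 2023, 2024, 2024, 2024]})
n_records, year_range, share_by_year = summarise_years(years_df)
```

Calling `summarise_years(salaries_df)` on the loaded data would give the corresponding figures for the full dataset.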

📊 Data Cleaning and Preprocessing

For better readability, categorical variables were mapped:

mapping_employment_type = {
    'FT': 'Full-Time',
    'PT': 'Part-Time',
    'CT': 'Contract',
    'FL': 'Freelance'
}
# Replace known codes with readable labels; unknown codes are kept unchanged.
salaries_df['employment_type'] = salaries_df['employment_type'].map(lambda x: mapping_employment_type.get(x, x))

mapping_experience_level = {
    'EN': 'Entry-level / Junior',
    'MI': 'Mid-level / Intermediate',
    'SE': 'Senior / Expert',
    'EX': 'Executive / Director'
}
# Same pattern as above: map known codes, leave anything else untouched.
salaries_df['experience_level'] = salaries_df['experience_level'].map(lambda x: mapping_experience_level.get(x, x))
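The same mapping pattern could be wrapped in a small helper and reused for other coded columns. The sketch below applies it to `remote_ratio`; the `map_with_fallback` helper and the label wording are assumptions for illustration (the labels follow the data dictionary), not part of the original analysis.

```python
import pandas as pd

# Hypothetical labels, based on the data dictionary.
mapping_remote_ratio = {
    0: 'No remote work',
    50: 'Hybrid',
    100: 'Fully remote',
}

def map_with_fallback(series, mapping):
    """Replace known codes with labels; keep unknown codes unchanged."""
    return series.map(lambda code: mapping.get(code, code))

# Made-up values; 25 is not a documented code and therefore stays as it is.
remote_example = pd.Series([0, 50, 100, 25])
remote_labels = map_with_fallback(remote_example, mapping_remote_ratio)
```

Keeping unknown codes instead of turning them into missing values makes unexpected entries easy to spot after the mapping.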
