
Executive Summary

Objective

The objective of this project was to analyze compensation trends in data-related roles and uncover key insights on salary distribution, job titles, and regional differences using a structured dataset.

Scope

The project focused on salary data for tech professionals across various roles and geographies. Data was analyzed using Python (Pandas), covering ~60,000 records with attributes such as job title, salary, location, and remote work ratio.

Key Findings

Data Scientists earned an average salary of $164,397.43 between 2020 and 2024 (USD-denominated salaries).

Data Engineers earned an average salary of $154,315.04 over the same period.

Out of 57,194 employees, 11,125 are full-time, US-based, and 100% remote.
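
To put these findings side by side, the arithmetic behind them can be sketched as follows. The four input numbers are copied from the findings above; the variable names are illustrative, and the 57,194 denominator is used exactly as stated, on the assumption that it is the population the remote count should be compared against.

```python
# Figures copied from the Key Findings; names are illustrative only
avg_data_scientist = 164_397.43
avg_data_engineer = 154_315.04
remote_ft_us = 11_125
total_employees = 57_194  # assumption: the denominator as stated in the finding

# Absolute and relative gap between the two average salaries
salary_gap = avg_data_scientist - avg_data_engineer
gap_pct = salary_gap / avg_data_engineer * 100

# Share of the stated population that is full-time, US-based, and fully remote
remote_share = remote_ft_us / total_employees * 100

print(f"Salary gap: ${salary_gap:,.2f} ({gap_pct:.1f}% above Data Engineers)")
print(f"Fully remote share: {remote_share:.1f}%")
```

Read this way, Data Scientists average roughly $10,000 more than Data Engineers, and about one employee in five in the stated population is fully remote.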

Tools & Technologies Used

Languages/Frameworks: Python, Pandas, NumPy, Plotly

Data Source: Public salary dataset (~60k records)

📖 Background

You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. If you're tackling the advanced level, you'll go a step further: building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.


import pandas as pd
salaries_df = pd.read_csv('salaries.csv')
salaries_df
# Get total number of records in the dataset
total_records = len(salaries_df)
print(f"There are a total of {total_records:,.0f} records.")
# Get the minimum and maximum years for the date range
year_range = salaries_df['work_year'].min(), salaries_df['work_year'].max()
print(f"Year range: {year_range[0]} to {year_range[1]}")
# Filter dataset for only Data Scientists and Data Engineers where currency is USD
filtered = salaries_df[(salaries_df['job_title'].isin(['Data Scientist', 'Data Engineer'])) & (salaries_df['salary_currency'] == 'USD')]

# Group by job title and calculate average salary
avg_salaries = filtered.groupby('job_title')['salary'].mean().reset_index()

# Sort from highest to lowest
avg_salaries = avg_salaries.sort_values(by='salary', ascending=False)

# Format salary to include $ and two decimal places
avg_salaries['salary'] = avg_salaries['salary'].apply(lambda x: f"${x:,.2f}")


print(avg_salaries.rename(columns={'job_title': 'Role', 'salary': 'Avg. Salary (USD)'}).to_string(index=False, col_space=25))
# Filter dataset for only Data Scientists and Data Engineers where currency is USD
filtered = salaries_df[
    (salaries_df['job_title'].isin(['Data Scientist', 'Data Engineer'])) &
    (salaries_df['salary_currency'] == 'USD')
]

# Group by job title and work year, calculate average salary
avg_salaries = (
    filtered.groupby(['job_title', 'work_year'])['salary']
    .mean()
    .reset_index()
    .sort_values(by=['work_year', 'salary'], ascending=[True, False])
)


import pandas as pd
import plotly.graph_objects as go

# Group and average
avg_summary = (
    filtered.groupby(['work_year', 'job_title'])['salary']
    .mean()
    .reset_index()
    .rename(columns={'salary': 'Avg Salary'})
)

# Pivot for plotting
pivot_avg = avg_summary.pivot(index='work_year', columns='job_title', values='Avg Salary').fillna(0)

# Convert index to string for plotting
years = pivot_avg.index.astype(str)

# Create interactive bar chart
fig = go.Figure()

# Add Data Scientist bars
fig.add_trace(go.Bar(
    x=years,
    y=pivot_avg['Data Scientist'],
    name='Data Scientist',
    marker_color='#1f77b4',
))

# Add Data Engineer bars
fig.add_trace(go.Bar(
    x=years,
    y=pivot_avg['Data Engineer'],
    name='Data Engineer',
    marker_color='#ff7f0e',
))

# Update layout
fig.update_layout(
    barmode='group',
    title='Interactive Avg Salary Comparison: Data Scientist vs Data Engineer (5 Years)',
    xaxis_title='Work Year',
    yaxis_title='Average Salary (USD)',
    yaxis_tickprefix='$',
    yaxis_tickformat=',.2f',
    template='plotly_white',
    height=500,
    width=900
)
fig.show()

On average, Data Scientists earn more than Data Engineers. Data Scientists had the higher average salary in 4 of the 5 years.
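
A year-by-year claim like this can be checked directly from a pivoted table rather than read off the chart. The sketch below is a minimal illustration: it builds a small stand-in for `pivot_avg` with placeholder numbers (not values from the dataset) and counts the years in which the Data Scientist average is higher. The names `ds_higher`, `years_ds_higher`, and `years_total` are hypothetical.

```python
import pandas as pd

# Placeholder averages for illustration only; NOT values from the dataset
pivot_avg = pd.DataFrame(
    {
        "Data Engineer": [100.0, 110.0, 120.0],
        "Data Scientist": [105.0, 108.0, 130.0],
    },
    index=pd.Index([2020, 2021, 2022], name="work_year"),
)

# Boolean Series: True in years where Data Scientists average more
ds_higher = pivot_avg["Data Scientist"] > pivot_avg["Data Engineer"]

years_ds_higher = int(ds_higher.sum())
years_total = len(pivot_avg)
exception_years = ds_higher[~ds_higher].index.tolist()

print(f"Data Scientists earned more in {years_ds_higher} of {years_total} years.")
print(f"Years where Data Engineers earned more or the same: {exception_years}")
```

One caveat when applying this to the real pivot: because missing role-year combinations are filled with 0, a year with no records for one role would count as a year in which that role earned less.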

# Get the full-time employees in the US who are fully remote
ft_employee = salaries_df[(salaries_df['employment_type'] == 'FT') & (salaries_df['employee_residence'] == 'US') & (salaries_df['remote_ratio'] == 100)]
print(f"There are {len(ft_employee):,.0f} full-time employees in the US that are 100% remote.")
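
The count above becomes easier to interpret next to its denominator. The following sketch shows one possible way to compute the fully remote share among full-time US employees using boolean masks. It runs on a tiny placeholder frame with the same column names as the notebook; the rows are invented for illustration, and `sample_df`, `is_ft_us`, and `is_remote` are hypothetical names.

```python
import pandas as pd

# Tiny placeholder frame with the notebook's column names; rows are invented
sample_df = pd.DataFrame({
    "employment_type": ["FT", "FT", "FT", "PT", "FT"],
    "employee_residence": ["US", "US", "CA", "US", "US"],
    "remote_ratio": [100, 0, 100, 100, 50],
})

# Mask for the base population: full-time employees residing in the US
is_ft_us = (
    (sample_df["employment_type"] == "FT")
    & (sample_df["employee_residence"] == "US")
)
# Mask for fully remote records
is_remote = sample_df["remote_ratio"] == 100

ft_us_total = int(is_ft_us.sum())
ft_us_remote = int((is_ft_us & is_remote).sum())
remote_share = ft_us_remote / ft_us_total

print(f"{ft_us_remote:,} of {ft_us_total:,} full-time US employees "
      f"are 100% remote ({remote_share:.1%}).")
```

Swapping `sample_df` for `salaries_df` would give the corresponding figures for the real data, assuming the columns hold the same codes used in the filter above.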