How Are Tech Salaries Shaping Up in 2020-2024? Part 3
by John Mike Asuncion
Executive Summary
This Level 3 analysis examines tech salary trends from 2020 to 2024 using a dataset of 57,194 salary records. The report addresses three key objectives to provide actionable insights for an international HR consultancy:
- Impact of Country, Experience Level, and Remote Ratio on Salaries: The US offers the highest average salaries, with Machine Learning Engineers at $201,774, followed by Data Scientists at $165,127. Senior and Executive-level professionals earn the most, with Executive Machine Learning Engineers averaging around $210,000. On-site roles generally pay more, with Data Scientists earning $160,000 on-site compared to $80,000 in hybrid setups. The single highest figures, however, come from Mexico (mx), where on-site Data Analysts earn $429,950 and on-site Data Scientists $352,500; the modeling results treat these as outliers.
- Predictive Model for Salary Estimation: A Linear Regression model using experience level, company location, and remote ratio achieves an R-squared of 0.1932 and an RMSE of $62,095. The strongest predictors are company_location_pr (Puerto Rico, coefficient: $101,803.64), experience_level_ex (Executive, $96,428.34), and company_location_mx (Mexico, $81,031.21), highlighting the significant influence of specific locations and executive roles on salaries.
- Expanded Model with Additional Features: Adding company size and employment type improves the model only slightly, increasing the R-squared to 0.1942 and reducing the RMSE to $62,055. The top predictors are now company_location_ir (Iran, $100,359.60), company_location_pr (Puerto Rico, $98,330.67), and experience_level_ex (Executive, $96,058.18). The marginal improvement suggests that outliers, such as the high salaries in Mexico, continue to limit accuracy. New features like industry sector and a cost of living index are proposed to enhance future predictions.
Brief Recommendations: Offer competitive salaries above $350,000 for top roles in high-paying regions like Mexico, especially for on-site positions, to attract talent. Focus on Senior and Executive-level professionals in the US, where average salaries are highest. Enhance data collection with variables like industry sector and cost of living index to improve model accuracy. Explore advanced models like Random Forest to better handle outliers and non-linear relationships.
I. Background
Tech companies face fierce competition for talent, making salary insights critical for attracting and retaining skilled professionals. This analysis leverages a global salary dataset to uncover trends in job roles, experience levels, and remote work, helping an international HR consultancy stay competitive. With remote work surging and tech roles diversifying, understanding these drivers is more vital than ever.
II. Objectives
This report aims to provide actionable insights into tech salary trends by addressing the following goals:
- Examine the impact of country, experience level, and remote ratio on salaries for Data Analysts, Data Scientists, and Machine Learning Engineers, and identify the conditions that yield the highest compensation.
- Construct a predictive model for salary estimation using experience level, company location, and remote ratio, and determine the strongest predictors.
- Improve the salary prediction model with company size and employment type, evaluate its performance, and propose future enhancements.
III. Data Description
The dataset, sourced from a survey hosted by an HR consultancy, is stored in salaries.csv. Each row represents an employee’s salary record for a given year. The columns are as follows:
Column Name | Description | Expected Data Type |
---|---|---|
work_year | Year of work | int |
experience_level | Level of experience (e.g., EN, MI, SE, EX) | str |
employment_type | Type of employment (e.g., FT, PT, CT, FL) | str |
job_title | Job title of the employee | str |
salary | Salary amount in original currency | int |
salary_currency | Currency of the salary | str |
salary_in_usd | Salary amount converted to USD | int |
employee_residence | Country code of employee residence | str |
remote_ratio | Remote work ratio (0, 50, 100) | int |
company_location | Country code of company location | str |
company_size | Size of the company (S, M, L) | str |
Let's load the data to begin our analysis.
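# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Modeling and evaluation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# Load the survey data (salaries.csv is expected in the working directory)
salaries_df = pd.read_csv('salaries.csv')
# Preview the first rows
salaries_df.head()
# Show column dtypes and non-null counts
salaries_df.info()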
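Once the data is loaded, the value sets listed in the data description table can serve as a quick sanity check. The sketch below is illustrative only: `undocumented_values` and `DOCUMENTED_VALUES` are hypothetical names, the sample frame is made up, and treating the "e.g." lists for experience level and employment type as exhaustive is an assumption.

```python
import pandas as pd

# Value sets copied from the data description table. The table introduces
# the experience and employment codes with "e.g.", so treating these sets
# as complete is an assumption.
DOCUMENTED_VALUES = {
    "experience_level": {"EN", "MI", "SE", "EX"},
    "employment_type": {"FT", "PT", "CT", "FL"},
    "remote_ratio": {0, 50, 100},
    "company_size": {"S", "M", "L"},
}

def undocumented_values(df, documented=DOCUMENTED_VALUES):
    """Map each checked column to the values that are not in the table."""
    result = {}
    for column, allowed in documented.items():
        if column in df.columns:
            # tolist() converts NumPy scalars to plain Python values
            found = set(df[column].dropna().unique().tolist()) - allowed
            if found:
                result[column] = found
    return result

# Made-up sample (not real survey data) with two undocumented values
sample_df = pd.DataFrame({
    "experience_level": ["SE", "MI", "XX"],
    "employment_type": ["FT", "FT", "CT"],
    "remote_ratio": [0, 75, 100],
    "company_size": ["M", "L", "S"],
})
issues = undocumented_values(sample_df)
print(issues)
```

On the real data, the same call would be `undocumented_values(salaries_df)`; an empty dictionary would mean every checked column contains only documented codes.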
IV. Data Preparation
Data Quality Check and Data Cleaning
Data preparation ensures reliable analysis by addressing potential issues. Each check builds a foundation for trustworthy insights.
- Checking Column Headers
Headers guide our analysis. Missing or misnamed columns, like salary_in_usd, could derail salary comparisons.
salaries_df.columns.tolist()
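Listing the headers relies on a visual comparison with the data description table. As a possible complement, the sketch below compares a frame's headers against the documented column names programmatically. `compare_columns` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration, and the sample frame is deliberately misnamed; it is not the real dataset.

```python
import pandas as pd

# Column names taken from the data description table
EXPECTED_COLUMNS = [
    "work_year", "experience_level", "employment_type", "job_title",
    "salary", "salary_currency", "salary_in_usd", "employee_residence",
    "remote_ratio", "company_location", "company_size",
]

def compare_columns(df, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to the expected list."""
    # Strip stray whitespace so ' salary ' still matches 'salary'
    actual = [str(c).strip() for c in df.columns]
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected

# Illustrative empty frame in which salary_in_usd was misnamed salary_usd
sample_columns = [c for c in EXPECTED_COLUMNS if c != "salary_in_usd"] + ["salary_usd"]
sample_df = pd.DataFrame(columns=sample_columns)

missing, unexpected = compare_columns(sample_df)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

Calling `compare_columns(salaries_df)` on the loaded data should return two empty lists if the headers match the table.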
- Checking for Missing Values
Missing data, such as blank salaries, skews results. Checking ensures completeness for accurate averages.
salaries_df.isnull().sum()
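Raw counts are easier to judge alongside the share of rows they represent. The following sketch adds a percentage column and sorts the worst columns first. `missing_summary` is a hypothetical helper, and the sample frame is invented purely to show the output shape.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing_count", ascending=False)

# Invented sample (not real survey data) with gaps in two columns
sample_df = pd.DataFrame({
    "experience_level": ["SE", "MI", None, "EN"],
    "salary_in_usd": [150000, np.nan, 90000, np.nan],
    "remote_ratio": [0, 100, 50, 0],
})
summary = missing_summary(sample_df)
print(summary)
```

For the report's data the call would be `missing_summary(salaries_df)`; columns showing a nonzero percentage would need attention before averages are computed.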
- Checking Data Types
Incorrect types, like text for salary_in_usd, prevent calculations. Verification ensures compatibility.