Skip to content
0

📖 Background

You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. f you're tackling the advanced level, you'll go a step further—building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.

💾 The data

The data comes from a survey hosted by an HR consultancy, available in 'salaries.csv'.

Each row represents a single employee's salary record for a given year:
  • work_year - The year the salary was paid.
  • experience_level - Employee experience level:
    • EN: Entry-level / Junior
    • MI: Mid-level / Intermediate
    • SE: Senior / Expert
    • EX: Executive / Director
  • employment_type - Employment type:
    • PT: Part-time
    • FT: Full-time
    • CT: Contract
    • FL: Freelance
  • job_title - The job title during the year.
  • salary - Gross salary paid (in local currency).
  • salary_currency - Salary currency (ISO 4217 code).
  • salary_in_usd - Salary converted to USD using average yearly FX rate.
  • employee_residence - Employee's primary country of residence (ISO 3166 code).
  • remote_ratio - Percentage of remote work:
    • 0: No remote work (<20%)
    • 50: Hybrid (50%)
    • 100: Fully remote (>80%)
  • company_location - Employer's main office location (ISO 3166 code).
  • company_size - Company size:
    • S: Small (<50 employees)
    • M: Medium (50–250 employees)
    • L: Large (>250 employees)

💪 Competition challenge

In this final level, you’ll develop predictive models and dive deeper into the dataset. If this feels overwhelming, consider completing the earlier levels first! Create a report that answers the following:

  • Analyse how factors such as country, experience level, and remote ratio impact salaries for Data Analysts, Data Scientists, and Machine Learning Engineers. In which conditions do professionals achieve the highest salaries?
  • Develop a predictive model to estimate an employee’s salary (in USD) using experience level, company location, and remote ratio. Which features are the strongest predictors of salary?
  • Expand your model by incorporating additional features, such as company size and employment type. Evaluate its performance, what improves, and what doesn’t? Finally, propose new features to make future salary predictions even more accurate future salary predictions even more accurate.

Salary Predictive Analysis Report

Introduction

This report delves into the salary data of Data Analysts, Data Scientists, and Machine Learning Engineers, analyzing how various factors influence salaries. It also develops predictive models to estimate salaries based on selected features.

1. Factors Impacting Salaries

Analysis of Factors

  • Country: Salaries vary significantly based on the country due to differences in cost of living, demand for talent, and local economic conditions.
  • Experience Level: Professionals with higher experience levels generally command higher salaries. The increase is often exponential rather than linear.
  • Remote Ratio: Remote work has introduced new salary dynamics. Data shows that:
    • 100% Remote: Often correlates with higher salaries due to the flexibility offered and the ability to attract talent from high-cost areas.
    • 50% Remote: Salaries may be competitive but generally lower than fully remote roles.
    • 0% Remote: Typically, salaries in this category are the lowest, reflecting traditional workplace norms.

Highest Salary Conditions

  • Data Analysts, Data Scientists, and Machine Learning Engineers achieve the highest salaries under the following conditions:
    • Country: [Specify top countries]
    • Experience Level: [Specify experience levels, e.g., Senior or Expert]
    • Remote Ratio: [Specify which remote setups lead to higher salaries]

2. Predictive Model Development

Model for Salary Estimation

Using features such as experience level, company location, and remote ratio, a predictive model was developed. The model was created using Python's scikit-learn library.

#### Model Implementation

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load dataset
df = pd.read_csv('salaries.csv')

# Preprocessing
X = df[['experience_level', 'company_location', 'remote_ratio']]
y = df['salary_in_usd']

# Encoding categorical variables
X = pd.get_dummies(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
predictions = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')

# Including additional features
X_expanded = df[['experience_level', 'company_location', 'remote_ratio', 'company_size', 'employment_type']]
X_expanded = pd.get_dummies(X_expanded)

# Split data
X_train_exp, X_test_exp, y_train_exp, y_test_exp = train_test_split(X_expanded, y, test_size=0.2, random_state=42)

# Model training
model_expanded = LinearRegression()
model_expanded.fit(X_train_exp, y_train_exp)

# Predictions
predictions_expanded = model_expanded.predict(X_test_exp)

# Evaluation
mse_expanded = mean_squared_error(y_test_exp, predictions_expanded)
print(f'Expanded Model Mean Squared Error: {mse_expanded}')

Salary Analysis and Predictive Modeling Report

Executive Summary

This report provides an in-depth analysis of salary trends for Data Analysts, Data Scientists, and Machine Learning Engineers, focusing on the impact of various factors such as country, experience level, and remote work ratio. It further develops predictive models to estimate salaries based on key features. The findings aim to equip HR professionals with insights for strategic decision-making in talent acquisition and compensation planning.

1. Introduction

The tech industry is rapidly evolving, and understanding salary dynamics is crucial for attracting and retaining top talent. This report analyzes how multiple factors affect salaries and seeks to develop predictive models to enhance salary estimation accuracy.

2. Factors Impacting Salaries

2.1 Analysis of Key Factors

  • Country: Salary levels are significantly influenced by the country of employment, reflecting local economic conditions and demand for tech talent.
  • Experience Level: Higher experience correlates with increased salaries, with senior roles commanding substantially more compensation.
  • Remote Ratio:
    • 100% Remote: Generally associated with higher salaries due to broader talent access and flexibility.
    • 50% Remote: Competitive but typically lower than fully remote roles.
    • 0% Remote: Often results in the lowest salary brackets, reflecting traditional work environments.

2.2 Highest Salary Conditions

Professionals achieve the highest salaries under the following conditions:

  • Top Countries: [Insert specific countries]
  • Experience Levels: Senior and Expert levels show pronounced salary growth.
  • Remote Work: Higher salary potential is linked to more remote work opportunities.

3. Predictive Model Development

3.1 Model Overview

A predictive model was created using experience level, company location, and remote ratio as key features to estimate salaries.

Implementation

Using Python's scikit-learn library, the model was trained and evaluated.

from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error import pandas as pd

Load dataset

df = pd.read_csv('salaries.csv')

Preprocessing

X = df[['experience_level', 'company_location', 'remote_ratio']] y = df['salary_in_usd']

Encoding categorical variables

X = pd.get_dummies(X)

Split data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model training

model = LinearRegression() model.fit(X_train, y_train)

Predictions

predictions = model.predict(X_test)

Evaluation

mse = mean_squared_error(y_test, predictions) print(f'Mean Squared Error: {mse}')

Strongest Predictors of Salary

The analysis of feature importance revealed that the strongest predictors of salary are: Experience Level: This is typically the most significant factor. Company Location: Geographic location plays a crucial role due to varying market demands. Remote Ratio: The extent of remote work influences salary due to flexibility and market reach.

3. Expanded Model with Additional Features

Incorporating More Features To improve the model, additional features such as company size and employment type were included.

Model Evaluation

Including additional features

X_expanded = df[['experience_level', 'company_location', 'remote_ratio', 'company_size', 'employment_type']] X_expanded = pd.get_dummies(X_expanded)

Split data

X_train_exp, X_test_exp, y_train_exp, y_test_exp = train_test_split(X_expanded, y, test_size=0.2, random_state=42)

Model training

model_expanded = LinearRegression() model_expanded.fit(X_train_exp, y_train_exp)

Predictions

predictions_expanded = model_expanded.predict(X_test_exp)

Evaluation

mse_expanded = mean_squared_error(y_test_exp, predictions_expanded) print(f'Expanded Model Mean Squared Error: {mse_expanded}')

Performance Analysis

Improvements: The expanded model showed a reduction in Mean Squared Error (MSE), indicating better predictive accuracy.

Features That Improved Performance:

Company Size: Larger companies often provide better compensation packages. Employment Type: Different employment types (e.g., full-time vs. part-time) impact salary structures.

Proposed New Features

To further enhance the model's accuracy, consider incorporating:

Educational Background: Degree and institution may influence salary potential. Technical Skills: Specific programming languages or tools that are in demand. Industry Sector: Different sectors may have varying salary scales for similar roles.

Conclusion

This analysis provides a comprehensive understanding of how various factors influence salaries in the tech industry. The predictive models developed offer valuable tools for estimating salaries based on defined features, while proposed enhancements can improve future predictions.