Competition - Demystifying data salaries 💰 - Level 1

Introduction

Recently, I discovered DataCamp, and it has been an incredible journey of learning and growth. With my basic knowledge, I am participating in this competition as a way to challenge myself and expand my skills.

In this competition, I'll be analyzing salary data from thousands of employees globally. Using a data-driven approach, I'll explore how factors like job role, experience level, remote work, and company size impact salaries.

I am eager to dig deeper into data science and analysis and to explore new techniques and methodologies that build my expertise. If you gain some insights from my analysis, please support it.

Given dataset description

💾 The data

Here’s a breakdown of the data in points:

Data Source: The data comes from a survey hosted by an HR consultancy, available in the 'salaries.csv' file.
Each Row: Represents a single employee's salary record for a given year.

Columns:

work_year: The year the salary was paid.
experience_level: Employee experience level:
- EN: Entry-level / Junior
- MI: Mid-level / Intermediate
- SE: Senior / Expert
- EX: Executive / Director
employment_type: Employment type:
- PT: Part-time
- FT: Full-time
- CT: Contract
- FL: Freelance
job_title: The job title during the year.
salary: Gross salary paid (in local currency).
salary_currency: Salary currency (ISO 4217 code).
salary_in_usd: Salary converted to USD using the average yearly FX rate.
employee_residence: Employee's primary country of residence (ISO 3166 code).
remote_ratio: Percentage of remote work:
- 0: No remote work (<20%)
- 50: Hybrid (50%)
- 100: Fully remote (>80%)
company_location: Employer's main office location (ISO 3166 code).
company_size: Company size:
- S: Small (<50 employees)
- M: Medium (50–250 employees)
- L: Large (>250 employees)
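
The coded columns above are compact but hard to read in charts and tables. A minimal sketch of turning them into readable labels follows; the dictionaries are copied from the column descriptions, while the helper name `add_readable_labels` and the two sample rows are made up for illustration and are not part of the competition data.

```python
import pandas as pd

# Code-to-label mappings taken from the column descriptions above.
EXPERIENCE_LABELS = {"EN": "Entry-level", "MI": "Mid-level", "SE": "Senior", "EX": "Executive"}
EMPLOYMENT_LABELS = {"PT": "Part-time", "FT": "Full-time", "CT": "Contract", "FL": "Freelance"}
REMOTE_LABELS = {0: "No remote work", 50: "Hybrid", 100: "Fully remote"}
SIZE_LABELS = {"S": "Small", "M": "Medium", "L": "Large"}


def add_readable_labels(df):
    """Return a copy of df with one human-readable label column per coded column.

    Hypothetical helper: codes missing from a mapping become NaN,
    which makes unexpected values easy to spot.
    """
    out = df.copy()
    out["experience_label"] = out["experience_level"].map(EXPERIENCE_LABELS)
    out["employment_label"] = out["employment_type"].map(EMPLOYMENT_LABELS)
    out["remote_label"] = out["remote_ratio"].map(REMOTE_LABELS)
    out["size_label"] = out["company_size"].map(SIZE_LABELS)
    return out


# Made-up rows that only mimic the structure of salaries.csv.
sample = pd.DataFrame({
    "experience_level": ["EN", "SE"],
    "employment_type": ["FT", "CT"],
    "remote_ratio": [100, 0],
    "company_size": ["S", "L"],
})

labeled = add_readable_labels(sample)
print(labeled[["experience_label", "employment_label", "remote_label", "size_label"]])
```

On the real data the same call would be `add_readable_labels(salaries_df)`, assuming the column names match the description.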

Loading and analyzing the dataset

Importing required libraries

# Importing required libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the dataset and previewing it
salaries_df = pd.read_csv('salaries.csv')
salaries_df

Q1. How many records are in the dataset, and what is the range of years covered?

num_records = salaries_df.shape[0]
year_range = (salaries_df["work_year"].min()
              ,salaries_df["work_year"].max())

#Plotting the records

plt.figure(figsize=(1, 4))
plt.bar(["Total Records"], [num_records], color='skyblue')
plt.ylabel("Count")
plt.title("Total Number of Records")
plt.show()

# Plotting the number of records per year

yearly_counts = salaries_df["work_year"].value_counts().sort_index()  # Count records per year

plt.figure(figsize=(7, 4))
plt.bar(yearly_counts.index, yearly_counts.values, color=('lightgreen','lightblue','orange'))
plt.xlabel("work_year")
plt.ylabel("Count of Records")
plt.title("Number of Records per Year")
plt.xticks(yearly_counts.index)  # Ensure all years are labeled
plt.show()
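
Raw counts per year can be easier to compare as shares of the whole dataset. The sketch below shows one way to do that with `value_counts(normalize=True)`; the `work_year` values are made up for illustration, so the percentages say nothing about the real survey.

```python
import pandas as pd

# Made-up years that only mimic the work_year column.
sample = pd.DataFrame({"work_year": [2020, 2021, 2021, 2022, 2022, 2022, 2022, 2022]})

# Fraction of records per year, sorted chronologically.
yearly_share = sample["work_year"].value_counts(normalize=True).sort_index()

# Convert to percentages for display.
print((yearly_share * 100).round(1))
```

With the real data, replacing `sample` with `salaries_df` should give the share of records contributed by each year.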

Q2. What is the average salary (in USD) for Data Scientists and Data Engineers? Which role earns more on average?

roles = ["Data Scientist", "Data Engineer"]
avg_salaries = salaries_df[salaries_df["job_title"].isin(roles)].groupby("job_title")["salary_in_usd"].mean()
higher_paid_role = avg_salaries.idxmax()

# Plotting Average salary for Data Scientists and Data Engineers

plt.figure(figsize=(3, 4))
sns.barplot(x=avg_salaries.index, y=avg_salaries.values, palette="coolwarm")
plt.ylabel("Average Salary (USD)")
plt.title("Average Salary Comparison: Data Scientist vs. Data Engineer")
plt.show()
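
An average alone can be skewed by a few very high salaries or by a small group size. As a rough sketch, the same filter and group-by can report the median and the number of records next to the mean; the rows below are made-up numbers, not values from the survey.

```python
import pandas as pd

# Made-up rows that only mimic the job_title and salary_in_usd columns.
sample = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Engineer",
                  "Data Engineer", "Data Analyst"],
    "salary_in_usd": [100000, 120000, 90000, 150000, 70000],
})

roles = ["Data Scientist", "Data Engineer"]

# Keep only the two roles, then summarise salary per role.
summary = (
    sample[sample["job_title"].isin(roles)]
    .groupby("job_title")["salary_in_usd"]
    .agg(["mean", "median", "count"])
)

print(summary)
print("Higher mean:", summary["mean"].idxmax())
```

If the mean and median disagree strongly for a role, the mean is probably being pulled by outliers and the median may be the fairer comparison.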

Q3. How many full-time employees based in the US work 100% remotely?

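# "Based in the US" is interpreted here as the company's location;
# swap in "employee_residence" to filter on where the employee lives instead.
full_time_remote_us = salaries_df[(salaries_df["employment_type"] == "FT") &
                         (salaries_df["company_location"] == "US") &
                         (salaries_df["remote_ratio"] == 100)].shape[0]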

# Plot: Full-Time Remote Employees in the US

remote_counts = [full_time_remote_us, num_records - full_time_remote_us]
labels = ["Full-Time Remote US", "Others"]

plt.figure(figsize=(4, 5))  # Adjust the figure size
plt.bar(labels, remote_counts, color=['pink', 'royalblue'])
plt.title("Number of Full-Time Remote Employees in the US")
plt.ylabel("Count")
plt.show()
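
"Based in the US" can be read as either the employer's location or the employee's residence, and the two readings can give different counts. The sketch below compares them, assuming `remote_ratio` is stored as the integers 0, 50, and 100 as described in the data section. The helper `count_ft_remote_us` and the sample rows are hypothetical.

```python
import pandas as pd


def count_ft_remote_us(df, location_col="company_location"):
    """Count full-time, fully remote rows whose location_col equals "US".

    Hypothetical helper; location_col chooses which reading of
    "based in the US" is applied.
    """
    mask = (
        (df["employment_type"] == "FT")
        & (df[location_col] == "US")
        & (df["remote_ratio"] == 100)
    )
    return int(mask.sum())


# Made-up rows that only mimic the structure of salaries.csv.
sample = pd.DataFrame({
    "employment_type": ["FT", "FT", "PT", "FT", "FT", "FT"],
    "company_location": ["US", "US", "US", "DE", "US", "US"],
    "employee_residence": ["US", "CA", "US", "US", "US", "MX"],
    "remote_ratio": [100, 100, 100, 100, 50, 100],
})

by_company = count_ft_remote_us(sample)
by_residence = count_ft_remote_us(sample, location_col="employee_residence")

print(f"US company location: {by_company}")
print(f"US employee residence: {by_residence}")
```

Whichever reading is chosen, it helps to state it next to the result so the number is not misread.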

Displaying result numbers

# Displaying results

print(f"Number of records: {num_records}")
print(f"Year range: {year_range[0]} - {year_range[1]}")
print("Average salaries (USD):")
print(avg_salaries)
print(f"Higher paid role on average: {higher_paid_role}")
print(f"Number of full-time remote employees in the US: {full_time_remote_us}")
