

Winning the Job Hunt: Data-Driven Strategies

Introduction

LinkedIn has become a powerful online platform for posting jobs and seeking talent. For job seekers, it offers an enormous opportunity to explore the job market and build professional connections.

Although there are countless other channels for recruiting people, virtually every reputable company, large or small, uses LinkedIn to fill positions. As a proactive LinkedIn user, you might be curious about the insights hidden in its data, including where and how to look for a job, or how to hire the best talent.

In this data analysis project, I will answer business questions such as:

  1. What are the most in-demand job roles across various industries?
  2. Where are the highest concentrations of job postings geographically located?
  3. Who are the top competitors regarding applicant volume for high-salary positions?
  4. Do remote or on-site positions attract more applications, and why?
  5. How can recent graduates and career changers improve their chances of securing employment?

Every data exploration and cleaning stage of this project was done in Python, using the Pandas and NumPy libraries for data manipulation and the Matplotlib and Seaborn libraries for creating visualizations. The writing was done in Markdown, and the DataLab Workspace was used for building and formatting.

All the visualizations and conclusions are based on the publicly available Kaggle dataset of 2023 LinkedIn job postings, under this license. The dataset can be downloaded here in CSV format.

Data Preparation

Before getting my hands dirty with data cleaning, I needed to assemble all the job posting data from the different raw data files.

As a first step, I loaded every CSV file using pandas' read_csv function. Next, each supporting table was merged into the core table, loaded from the 'job_postings.csv' file, on its key field.
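For illustration, here is a minimal sketch of this loading step. Only 'job_postings.csv' is named in the text; the other file names, and the way the benefits table is pivoted into one row per posting, are my assumptions:

import pandas as pd

# Core table: one row per job posting
database_raw = pd.read_csv('job_postings.csv')

# Supporting tables (file names assumed to match the table names used later)
job_skills = pd.read_csv('job_skills.csv')
salaries = pd.read_csv('salaries.csv')
job_industries = pd.read_csv('job_industries.csv')
industries = pd.read_csv('industries.csv')
companies = pd.read_csv('companies.csv')
benefits = pd.read_csv('benefits.csv')

# Assumed pivoting of the benefits table: one row per job_id,
# counting the benefit 'type' entries (renamed to no_of_benefits during cleaning)
benefits_pivoted = (
    benefits.groupby('job_id', as_index=False)
            .agg(type=('type', 'count'))
)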

The code below shows the merge operations in detail:

# Merging the supporting tables into the main job openings table based on key fields
database_plus_job_id = (
    database_raw.merge(job_skills, on='job_id', how='left')
                .merge(salaries, on='job_id', how='left')
                .merge(job_industries, on='job_id', how='left')
)
database_plus_industries = database_plus_job_id.merge(industries, on='industry_id', how='left')
database_plus_companies = database_plus_industries.merge(companies, on='company_id', how='left')
database_aggregated = database_plus_companies.merge(benefits_pivoted, on='job_id', how='left')

All the data is now stored in a single table called 'database_aggregated'. It contains information about the job posting description, salary range, location, number of applications, industry, and more.

The combined table before cleaning contained 77,766 rows and 47 columns.

Data Exploration and Cleaning

Each row in the database represents one job opening on LinkedIn. To see the whole picture, I first checked the data type of each field and the number of missing values in each column.
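A quick way to get this overview (a minimal sketch; the merged table is called database_aggregated, as above):

# Data type of each field
print(database_aggregated.dtypes)

# Number of missing values per column, largest first
print(database_aggregated.isnull().sum().sort_values(ascending=False))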

Since almost every column required some validation and cleaning, I went through them one by one. Here is how:

  • job_id: The unique identifier of a job opening. Duplicates should not be present here, so I removed the duplicated records, leaving no duplicates or null values.

  • company_id: There were 654 job openings with no company information. These records were deleted as well.

  • title: No null values were found; identical job titles across different postings are expected, so no changes were made.

  • description_x: 1 null value was changed into 'No information'.

  • max_salary_x / med_salary_x / min_salary_x: This was a bit tricky. Some postings specified a maximum and minimum wage but no median, others listed only the median without a minimum and maximum, and in some all three values were null.
    To fix this, I started by imputing the missing median salary from the average of the industry a job listing belongs to (computed from the rest of the table's records). This affected rows with yearly salaries and decreased the number of null values. I then repeated this step for jobs whose salary is specified monthly and hourly. (A simplified sketch of this imputation appears after this list.)
    With the median salary in place, it was easy to fill in the min and max salary, assuming they are 10% below and above the median, respectively. Conversely, a missing med_salary_x was computed as the average of max_salary_x and min_salary_x.
    There were still 1,323 rows with no salary information at all. Since this accounts for less than 2% of the data, the best solution was to remove these records.

  • pay_period_x: Null values were changed into 'YEARLY', 'MONTHLY', or 'HOURLY', depending on the amount (assuming an amount below 100 is an hourly wage, below 13,500 is monthly, and anything above that is yearly).

  • formatted_work_type: Specifies whether it's a full-time, part-time, or contract job. No null values, no changes applied.

  • location: The location of a job posting. No null values, no changes made.

  • applies: The number of applications the posting received through LinkedIn. It doesn't include applications submitted through other channels. I replaced null values with the median number of applications for the core skill associated with a job posting. After this operation, the 318 rows that still had null values were removed.

  • original_listed_time: The time and date a posting was published on LinkedIn. It was stored as an integer and was converted to a timestamp, which is easier to read and comprehend (see the sketch after this list).

  • remote_allowed: There were only two values: 1 and null. Assuming 1 means 'Yes', the null values were changed to 0.

  • views: The number of times a posting has been viewed. The missing values were replaced with 0. Additionally, I checked that views is greater than or equal to applies, correcting the values where needed.

  • job_posting_url: Self-explanatory. I didn't modify anything.

  • application_url: Self-explanatory. Null values were replaced with 'Not available'.

  • application_type: There are 3 possible values: 'SimpleOnsiteApply', 'ComplexOnsiteApply', and 'OffsiteApply'. No null values, thus no changes were made.

  • expiry: Shows the deadline for applying. I converted it to a timestamp, as with original_listed_time.

  • closed_time: The time a posting was closed. After converting to the timestamp, all null values were replaced with '2023-12-31 23:59:59'.

  • formatted_experience_level: The experience level required (e.g., Junior). Null values were replaced with 'Not specified'.

  • skills_desc: Since the description was too long in each cell, I trimmed it to the first 50 characters. Null values were changed into 'No information'.

  • listed_time: The integer value was converted into a timestamp.

  • posting_domain: A website where the original posting was published. I substituted missing values with 'Not specified'.

  • sponsored: I replaced the values, assuming 0 means 'No' and 1 means 'Yes'.

  • work_type: The work type, e.g., full-time or part-time. This column was left unchanged.

  • currency_x: The currency of the salary. Almost all job postings in the database are located in the US or are remote US positions. The null values were changed to USD to keep the figures comparable.

  • compensation_type_x: The way compensation is paid. I ended up with two values: BASE_SALARY and OTHER. The latter replaced the null values.

  • scraped: A field with no analytical value; it was removed.

  • skill_abr: The skill abbreviation a posting is associated with. I put 'None' instead of null values.

  • max_salary_y, med_salary_y, min_salary_y, pay_period_y, currency_y, compensation_type_y were deleted. They were either duplicates or null and provided no additional value for the analysis.

  • industry_id: The unique identifier of an industry the posting is associated with. After Internet research, I replaced 9 null values with the corresponding industry_id for each posting.

  • industry_name: The full name of an industry. As above, I filled the missing values with the corresponding full names.

  • name: For the records where a company 'name' was missing, the values were set to 'Not listed'.

  • description_y: The job description. I set 'No information' for all null values.

  • company_size: Represents the size of the company the listing belongs to, with 1 being the smallest and 7 the largest. Null values were changed to 0.

  • state: The state/province/federal land where the business is located. This column required additional validation, since the same geographical area was written in various ways. For example, 'CA', 'ca', 'Ca', 'california', and 'CALIFORNIA' are just some of the ways the US state of California was spelled. To standardize everything, I set 'California' for all values representing that state, however it was spelled. The same procedure was applied to all states and to administrative units overseas where such validation was required. If it was impossible to identify the area where the company was located, the record was deleted.

  • country: The country where the business is located. The complex validation wasn't necessary here. I only made sure there were no null values in the column.

  • city: The city where the business is located. The missing values were removed.

  • zip_code / address: Self-explanatory. If the value was null, I replaced it with 0.

  • url: The company's website URL. Null values were changed to 'Not listed'.

  • type, renamed to no_of_benefits: The field shows the number of benefits the posting provides. The missing values were marked as 0.
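As promised above, here is a simplified sketch of the salary imputation and the timestamp conversion. The column names follow the merged table, but the grouping keys and the assumption that the integer time columns are Unix epochs in milliseconds are mine rather than confirmed by the dataset documentation:

import pandas as pd

df = database_aggregated  # the merged table from the Data Preparation step

# 1) Impute missing median salaries from the industry-level average,
#    handled separately for each pay period (YEARLY, MONTHLY, HOURLY)
industry_avg = df.groupby(['industry_name', 'pay_period_x'])['med_salary_x'].transform('mean')
df['med_salary_x'] = df['med_salary_x'].fillna(industry_avg)

# 2) Derive missing min/max salaries as 10% below/above the median ...
df['min_salary_x'] = df['min_salary_x'].fillna(df['med_salary_x'] * 0.9)
df['max_salary_x'] = df['max_salary_x'].fillna(df['med_salary_x'] * 1.1)
# ... and, conversely, a missing median as the average of min and max
df['med_salary_x'] = df['med_salary_x'].fillna((df['min_salary_x'] + df['max_salary_x']) / 2)

# Drop the rows that still have no salary information at all
df = df.dropna(subset=['med_salary_x'])

# 3) Convert the integer time columns to readable timestamps (assumed epoch milliseconds)
for col in ['original_listed_time', 'expiry', 'closed_time', 'listed_time']:
    df[col] = pd.to_datetime(df[col], unit='ms')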

In the end, I obtained the cleaned data about 30,883 individual job postings on LinkedIn in 2023, stored in 40 columns. Get ready for insights!



Analysis and Visualization

Thank you for reading this far! The boring part is over; now things get interesting!

General data characteristics

Within the analyzed data, a total of 10,585 companies published job postings (an average of 2.92 jobs per employer).

The most popular title is Sales Director (Owner/Operator), having 120 postings. The other popular titles within the top five are Sales Manager (95 jobs), Project Manager (94 jobs), Retail Sales Associate (82 openings), and Sales Associate (79 postings). For comparison, there are 26 roles named 'Data Analyst' and 24 openings named 'Data Scientist'.
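These figures come from simple frequency counts. A minimal sketch of how they can be reproduced (df being the cleaned table from the previous chapter):

# Number of distinct employers and the average number of postings per employer
n_companies = df['company_id'].nunique()
print(n_companies, round(len(df) / n_companies, 2))

# The five most frequent job titles
print(df['title'].value_counts().head(5))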


Wondering which listing offered the highest compensation? The answer is the Quantitative Developer position at Goliath Partners, based in Chicago, IL, with a listed salary of up to 1,300,000 USD per year! The top 5 highest-paid listings are all tech jobs, each with a minimum salary of at least $150,000 a year.
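A hedged sketch of how such a ranking can be pulled out, assuming yearly pay periods and the column names used during cleaning (name holds the company name):

# Five highest-paid yearly listings
top_paid = (
    df[df['pay_period_x'] == 'YEARLY']
    .nlargest(5, 'max_salary_x')[['title', 'name', 'location', 'min_salary_x', 'max_salary_x']]
)
print(top_paid)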

Full-time positions account for more than four-fifths of all postings. One in ten jobs is contract-based, one in fifteen is part-time, and only a small fraction of listings are temporary jobs, internships, or volunteer positions.


About 2,100 jobs in our database list their location simply as the entire United States. Technically, this makes it the most popular location; however, it's not uncommon for remote jobs to specify a broad geographical area where employers are hiring. Narrowing down to cities, here are our top 10 job locations:
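A minimal sketch of the ranking behind this chart follows; it assumes the broad 'United States' label is excluded and that the location column holds the city-level string shown in the posting:

import matplotlib.pyplot as plt
import seaborn as sns

# Top 10 city-level locations, excluding the country-wide 'United States' label
top_locations = df.loc[df['location'] != 'United States', 'location'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_locations.values, y=top_locations.index, color='steelblue')
plt.xlabel('Number of job postings')
plt.ylabel('Location')
plt.title('Top 10 job locations')
plt.tight_layout()
plt.show()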


Not surprisingly, the biggest city in the US has the largest number of unfilled jobs, followed by Chicago and Houston. Moreover, 264 jobs are located in the New York City Metropolitan Area (including suburbs), bringing the total to 1,042 jobs.

The number of applications each posting received via LinkedIn is distributed very unevenly. Values range from 0 to 1,615, yet the 25th percentile is only 2 applications per job posting. The median value is 3, and the 75th percentile is 8 applications.
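These summary statistics can be read off directly (a sketch, using the cleaned applies column):

# Distribution of applications per posting
print(df['applies'].describe(percentiles=[0.25, 0.5, 0.75]))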


How the number of applications is associated with salaries, industries, company size, etc. is covered more in detail in the chapter What contributes to more applicants?.

The average number of times a job posting on LinkedIn has been viewed is about 47, whereas the median value (50th percentile) is only 9. Some listings have been viewed a staggering 5 thousand times!

Looking at the application process, almost a third of jobs required a complex application directly on a company's website (usually involving answering many questions and submitting a CV, a cover letter, etc.). Only 6.45% of jobs required a simple application procedure on a company's website, and the remaining 61.82% didn't involve an external website at all.
