Project: User Search and Employer Interaction Analysis on a Job Platform

1. Executive Summary

This analysis examines user search behavior and employer interactions on Prosple’s job search platform. By parsing URL structures, processing taxonomy IDs, and evaluating user-employer interactions, I identified key trends in job-seeker intent, popular opportunity categories, and competitor overlap for a target employer (Company 1).

Key insights include:

The National Capital Region (NCR) is the most popular internship location in the Philippines, receiving the highest number of internship-related searches.
IT & Computer Science and Business & Management are the top study fields among graduate job seekers.
Technology is the most frequently searched industry sector in the Philippines.
Across the top industry sectors, users predominantly search for Graduate Jobs and Internships.
User activity peaked during Weeks 40–43, indicating a seasonal trend in job-search behavior.
Among users who viewed and applied to Company 1, many viewed multiple employers, but only a small subset applied to competitors, suggesting high application conversion for Company 1 relative to competing employers.

2. Introduction

The objective of this analysis is to explore a snapshot of Prosple’s site-activity data, with emphasis on understanding how early-career job seekers interact with search filters, employer profiles, and application workflows. The task is divided into two components:

User Search Metrics — analyzing URL parameters and taxonomy mappings to understand the behavior of internship and graduate-job seekers.
Employer Competitor Analysis — identifying how users who viewed and applied to Company 1 also interacted with other employers.

Two datasets were used in the analysis:

associate_data_analyst_task_dataset.csv — contains timestamped user activity records, including URLs, employers, and event types.
taxonomy_data.csv — maps taxonomy IDs to readable labels used in URL filters.

3. Methodology

I began by exploring taxonomy_data.csv using SQL to identify the TAXONOMY_ID values associated with Internship and Graduate Job categories. These IDs were needed to correctly interpret URL parameters in the main dataset.

After confirming the taxonomy mappings, all subsequent processing and visualization were performed in Python. The methodology includes the following steps:

3.1. Data Cleaning and Standardization

Standardized URL strings and removed trailing spaces.
Parsed URL query parameters using urllib.parse to extract filter values embedded in URLs (e.g., industry IDs, location IDs).
Converted extracted taxonomy IDs to numeric form and merged them with taxonomy_data.csv for readability.
Normalized employer names and user identifiers for cross-employer analysis.

3.2. Data Preparation

After cleaning the raw dataset, I prepared the data for analysis through the following steps:

Isolated job-search activity by retaining only URLs beginning with /search-jobs, which ensured that subsequent analysis focused strictly on search behavior rather than page views unrelated to job discovery (e.g., articles, resources, login pages).
Identified internship and graduate-job searches using taxonomy-based filters embedded in the URL query parameters (e.g., opportunity_types=2 for internships).
Extracted and normalized taxonomy IDs for location, industry sector, study field, and job type, converting them into readable labels via a merge with taxonomy_data.csv.
Standardized location values by mapping cities and provinces to their corresponding Philippine regions.
Constructed user-level employer activity by grouping events according to anonymous IDs, enabling cross-employer behavior analysis in the later sections.

DataFrameas

df1

variable

SELECT *
FROM 'taxonomy_data.csv'
WHERE LABEL ILIKE '%internship%'
	OR LABEL ILIKE '%graduate%';

4. User Search Metrics

4.1. Most Popular Region for Internships in the Philippines

Internship-related searches were identified using URLs with opportunity_types=2. After extracting and standardizing location taxonomy IDs, I mapped all cities and provinces to their administrative regions. The top-level geographic filter for Philippines was excluded to focus on subnational search intent.

import pandas as pd
import urllib.parse

# Load datasets
df = pd.read_csv("associate_data_analyst_task_dataset.csv")
tax = pd.read_csv("taxonomy_data.csv")

# Clean URL column
df["URL"] = df["URL"].astype(str).str.strip()

# Keep only job-search URLs
df_search = df[df["URL"].str.startswith("/search-jobs")]

# Filter: internship searches only
df_intern = df_search[df_search["URL"].str.contains("opportunity_types=2", na=False)]

# Extract location taxonomy_ids
def extract_locations(url):
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)
    locs = params.get("locations", [])
    if not locs:
        return []

    result = []
    for loc in locs:
        for part in loc.replace('|', ',').split(','):
            part = part.strip()
            if part:
                result.append(part)
    return result

df_intern["location_ids"] = df_intern["URL"].apply(extract_locations)

# Only keep rows with location filters
df_intern = df_intern[df_intern["location_ids"].map(len) > 0]

# Explode multiple locations
df_locs = df_intern.explode("location_ids").copy()

# Convert IDs to numeric
df_locs["location_ids"] = pd.to_numeric(df_locs["location_ids"], errors="coerce")
df_locs = df_locs.dropna(subset=["location_ids"])
df_locs["location_ids"] = df_locs["location_ids"].astype(int)

# Join with taxonomy to get readable location labels
locations_tax = tax[tax["TAXONOMY_TYPE"] == "locations"][["TAXONOMY_ID", "LABEL"]]

df_locs = df_locs.merge(locations_tax, 
                        left_on="location_ids", 
                        right_on="TAXONOMY_ID", 
                        how="left")

# Remove top-level country filter "Philippines"
df_locs = df_locs[df_locs["LABEL"] != "Philippines"]

# City -> Region mapping
city_to_region = {
    # NCR
    "Manila": "National Capital Region",
    "Makati": "National Capital Region",
    "Quezon City": "National Capital Region",
    "Taguig City": "National Capital Region",
    "Pasig City": "National Capital Region",
    "Parañaque City": "National Capital Region",
    "Caloocan City": "National Capital Region",
    "Las Piñas": "National Capital Region",
    "Malabon": "National Capital Region",
    "Mandaluyong": "National Capital Region",
    "Marikina": "National Capital Region",
    "Muntinlupa": "National Capital Region",
    "Navotas": "National Capital Region",
    "Pasay": "National Capital Region",
    "San Juan City": "National Capital Region",
    "Valenzuela City": "National Capital Region",

    # Luzon
    "Baguio": "Cordillera Administrative Region",
    "San Fernando (La Union)": "Ilocos Region",
    "Laoag": "Ilocos Region",
    "Dagupan": "Ilocos Region",
    "Tuguegarao": "Cagayan Valley",
    "Cauayan": "Cagayan Valley",
    "San Fernando (Pampanga)": "Central Luzon",
    "Clark Freeport Zone": "Central Luzon",
    "Angeles City": "Central Luzon",
    "Tarlac City": "Central Luzon",
    "Concepcion": "Central Luzon",
    "Limay": "Central Luzon",
    "Cabanatuan": "Central Luzon",
    "Bulacan": "Central Luzon",
    "Olongapo City": "Central Luzon",

    "Calamba": "Calabarzon",
    "Cavite": "Calabarzon",
    "Dasmariñas": "Calabarzon",
    "Bacoor": "Calabarzon",
    "Laguna": "Calabarzon",
    "Santa Rosa": "Calabarzon",
    "San Pablo City": "Calabarzon",
    "Santa Pedro": "Calabarzon",
    "Batangas City": "Calabarzon",
    "Lipa City": "Calabarzon",
    "Antipolo": "Calabarzon",
    "Cainta": "Calabarzon",
    "Lucena": "Calabarzon",

    "Calapan": "Mimaropa",
    "Puerto Galera": "Mimaropa",
    "Puerto Princesa": "Mimaropa",
    "Palawan": "Mimaropa",

    "Legazpi": "Bicol Region",
    "Sorsogon": "Bicol Region",
    "Camarines Sur": "Bicol Region",

    # Visayas
    "Iloilo City": "Western Visayas",
    "Bacolod City": "Western Visayas",
    "Roxas City": "Western Visayas",
    "Aklan": "Western Visayas",

    "Cebu City": "Central Visayas",
    "Mandaue": "Central Visayas",
    "Dumaguete": "Central Visayas",
    "Negros Occidental": "Central Visayas",

    "Tacloban": "Eastern Visayas",
    "Samar": "Eastern Visayas",

    # Mindanao
    "Zamboanga City": "Zamboanga Peninsula",
    "Dipolog City": "Zamboanga Peninsula",
    "Pagadian": "Zamboanga Peninsula",

    "Davao City": "Davao Region",
    "Digos City": "Davao Region",

    "General Santos City": "Soccsksargen",
    "Koronadal": "Soccsksargen",

    "Cagayan de Oro": "Northern Mindanao",
    "Iligan City": "Northern Mindanao",
    "Malaybalay": "Northern Mindanao",
    "Valencia City": "Northern Mindanao",

    "Butuan": "Caraga",
    "Surigao": "Caraga",

    "Cotabato City": "Bangsamoro",
}

ph_regions = [
    "Central Luzon", "Central Visayas", "Northern Mindanao", "National Capital Region",
    "Cordillera Administrative Region", "Ilocos Region", "Cagayan Valley", "Calabarzon",
    "Mimaropa", "Bicol Region", "Western Visayas", "Eastern Visayas", "Zamboanga Peninsula",
    "Davao Region", "Soccsksargen", "Caraga", "Bangsamoro"
]

def map_region(label):
    if label in ph_regions:
        return label
    if label in city_to_region:
        return city_to_region[label]
    return None

df_locs["REGION"] = df_locs["LABEL"].apply(map_region)

# Remove anything that is not a region
df_locs = df_locs.dropna(subset=["REGION"])

# Count and rank regions
region_rank = (
    df_locs.groupby("REGION")
           .size()
           .reset_index(name="search_count")
           .sort_values("search_count", ascending=False)
           .reset_index(drop=True)
)

region_rank

Result:

The National Capital Region (NCR) is the most selected region among users searching for internships in the Philippines.

This indicates that internship demand is highly concentrated in Metro Manila, which aligns with the region’s status as the country’s primary business and commercial hub.

4.2. Top 10 Study Fields for Graduate Job Seekers in the Philippines

To identify which study fields are most popular among graduate job seekers, I first filtered the dataset for search URLs containing the parameter opportunity_types=1, which corresponds to graduate job searches. Among these, I retained only the URLs where users applied study field filters.

Each URL may contain one or multiple study field taxonomy IDs, so these values were extracted, separated, and normalized. After converting them into their corresponding labels using taxonomy_data.csv, I aggregated and ranked the study fields based on the frequency of their appearance in user searches.

import pandas as pd
import urllib.parse

# Load datasets
df = pd.read_csv("associate_data_analyst_task_dataset.csv")
tax = pd.read_csv("taxonomy_data.csv")

# Clean URL column
df["URL"] = df["URL"].astype(str).str.strip()

# Keep only job search URLs
df_search = df[df["URL"].str.startswith("/search-jobs")]

# Keep only graduate job searches
df_grad = df_search[
    df_search["URL"].str.contains("opportunity_types=1", na=False)
].copy()

# Extract study field IDs
def extract_study_fields(url):
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)

    # The parameter is definitely "study_fields" in Prosple URLs
    fields = params.get("study_fields", [])

    if not fields:
        return []

    extracted = []
    for f in fields:
        # Handle comma-separated, pipe-separated, or encoded values
        f = f.replace("%2C", ",").replace("|", ",") 
        for part in f.split(","):
            part = part.strip()
            if part:
                extracted.append(part)

    return extracted

df_grad["study_field_ids"] = df_grad["URL"].apply(extract_study_fields)

# Keep only rows with study fields
df_grad = df_grad[df_grad["study_field_ids"].map(len) > 0]

# Explode the list
df_fields = df_grad.explode("study_field_ids").copy()

# Convert to numeric; drop invalid rows
df_fields["study_field_ids"] = pd.to_numeric(df_fields["study_field_ids"], errors="coerce")
df_fields = df_fields.dropna(subset=["study_field_ids"])
df_fields["study_field_ids"] = df_fields["study_field_ids"].astype(int)

# Join to taxonomy
study_tax = tax[tax["TAXONOMY_TYPE"] == "study_field"][["TAXONOMY_ID", "LABEL"]]

df_fields = df_fields.merge(
    study_tax,
    left_on="study_field_ids",
    right_on="TAXONOMY_ID",
    how="left"
)

# Count and rank
top_study_fields = (
    df_fields.groupby("LABEL")
             .size()
             .reset_index(name="search_count")
             .sort_values("search_count", ascending=False)
             .reset_index(drop=True)
)

top_study_fields["Rank"] = (
    top_study_fields["search_count"].rank(method="dense", ascending=False).astype(int)
)

# Show top 10
top10 = top_study_fields.sort_values("Rank").head(10)

top10[["Rank", "LABEL", "search_count"]]

FIndings:

IT & Computer Science is the most frequently selected study field among graduate job seekers.
Business & Management ranks second, reflecting its broad relevance across sectors.
The top fields viewed on the platform are heavily concentrated in Business and Commerce-related programs, indicating large portion of users coming from business-related academic backgrounds.

These results indicate that demand is highest in technology- and business-related fields, aligning with current industry hiring trends in the Philippines.

4.3. Top Industry Sectors for Job Searches in the Philippines

To understand which industries attract the most job-seeking activity, I analyzed how often users applied industry sector filters while browsing jobs on the platform. Unlike study fields, which relate to academic background, industry filters reflect broader career interests regardless of degree, job type, or location. This provides a high-level view of which sectors are most in demand among early-career talent.

I extracted the industry sector taxonomy IDs embedded in the job-search URLs and mapped them to readable labels using the taxonomy file. After aggregating the frequency of each selected industry sector and ranking them, I identified the top 10 most commonly used filters.

import pandas as pd
import urllib.parse
import matplotlib.pyplot as plt

# Load datasets
df = pd.read_csv("associate_data_analyst_task_dataset.csv")
tax = pd.read_csv("taxonomy_data.csv")

# Keep only search job URLs
df["URL"] = df["URL"].astype(str).str.strip()
df_search = df[df["URL"].str.startswith("/search-jobs")].copy()

# Extract industry sector ID from URL
def extract_industry_id(url):
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)
    sectors = params.get("industry_sectors", [])
    return sectors[0] if sectors else None

df_search["industry_sector_id"] = df_search["URL"].apply(extract_industry_id)
df_search = df_search[df_search["industry_sector_id"].notna()].copy()
df_search["industry_sector_id"] = pd.to_numeric(df_search["industry_sector_id"], errors="coerce")

# Load industry sector labels from taxonomy
industry_tax = tax[tax["TAXONOMY_TYPE"] == "industry_sectors"][["TAXONOMY_ID", "LABEL"]]

# Merge labels with search data
df_search = df_search.merge(
    industry_tax,
    left_on="industry_sector_id",
    right_on="TAXONOMY_ID",
    how="left"
)
df_search = df_search[df_search["LABEL"].notna()].copy()

# Count occurrences of each industry sector and select top 10
top_industries = df_search["LABEL"].value_counts().nlargest(10).reset_index()
top_industries.columns = ["Industry Sector", "Count"]

# Plot horizontal bar chart
plt.figure(figsize=(10, 6))
plt.barh(
    top_industries["Industry Sector"],
    top_industries["Count"],
    color='deeppink'
)
plt.xlabel("Number of Searches", fontweight='bold')
plt.ylabel("Industry Sector", fontweight='bold')
plt.title("Top 10 Industry Sectors Used as Filters in Job Searches", fontsize=16, fontweight='bold')
plt.gca().invert_yaxis()  # largest bar on top
plt.tight_layout()
plt.show()

Key observations:

Technology far surpasses all other sectors, suggesting strong demand for software, IT, and digital-focused roles.
Finance & Banking and Accounting & Advisory follow, indicating continued interest in corporate and financial services.
HR & Recruitment, Events & Hospitality, and Media & Entertainment also attract consistent search volume.

This blend of sectors suggests a diverse set of interests among early-career job seekers.

4.4. Popular Job Types Among Top Industry Sectors

While the previous section identifies which industries attract the most search activity, it is also important to understand what types of opportunities users are seeking within those industries. Opportunity types—such as Graduate Jobs, Internships, or Part-Time roles—reflect users’ career intent and help clarify whether demand is driven by students, fresh graduates, or early-career professionals.

To analyze this, I extracted both the opportunity type IDs and industry sector IDs from job-search URLs and mapped them to their taxonomy labels. I then created a pivot table summarizing how often each opportunity type appeared within each of the top 10 industry sectors. Expressing the counts as percentages allowed for comparison across sectors of varying popularity.

import pandas as pd
import urllib.parse
import matplotlib.pyplot as plt

# Load datasets
df = pd.read_csv("associate_data_analyst_task_dataset.csv")
tax = pd.read_csv("taxonomy_data.csv")

# Keep only search job URLs
df["URL"] = df["URL"].astype(str).str.strip()
df_search = df[df["URL"].str.startswith("/search-jobs")].copy()

# Extract opportunity_type_id and merge taxonomy labels
def extract_opp_type(url):
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)
    opp = params.get("opportunity_types", [])
    return opp[0] if opp else None

df_search["opportunity_type_id"] = df_search["URL"].apply(extract_opp_type)
df_search = df_search[df_search["opportunity_type_id"].notna()]
df_search["opportunity_type_id"] = pd.to_numeric(df_search["opportunity_type_id"], errors="coerce")

opp_tax = tax[tax["TAXONOMY_TYPE"] == "opportunity_type"][["TAXONOMY_ID", "LABEL"]]
df_search = df_search.merge(opp_tax, left_on="opportunity_type_id", right_on="TAXONOMY_ID", how="left")
df_search = df_search[df_search["LABEL"].notna()]

# Extract industry_sector_id and merge taxonomy labels
def extract_industry(url):
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)
    sectors = params.get("industry_sectors", [])
    return sectors[0] if sectors else None

df_search["industry_sector_id"] = df_search["URL"].apply(extract_industry)
df_search = df_search[df_search["industry_sector_id"].notna()]
df_search["industry_sector_id"] = pd.to_numeric(df_search["industry_sector_id"], errors="coerce")

industry_tax = tax[tax["TAXONOMY_TYPE"] == "industry_sectors"][["TAXONOMY_ID", "LABEL"]]
df_search = df_search.merge(industry_tax, left_on="industry_sector_id", right_on="TAXONOMY_ID", how="left")
df_search = df_search[df_search["LABEL_y"].notna()]

# Keep only top 10 sectors
top_n = 10
top_sectors = df_search["LABEL_y"].value_counts().head(top_n).index
df_search_top = df_search[df_search["LABEL_y"].isin(top_sectors)].copy()

# Prepare pivot table: counts
pivot_counts = df_search_top.pivot_table(
    index="LABEL_y",    # Industry sector
    columns="LABEL_x",  # Opportunity type
    values="URL",
    aggfunc="count",
    fill_value=0
)

# Convert counts to percentages per sector
pivot_percent = pivot_counts.div(pivot_counts.sum(axis=1), axis=0) * 100

# Plot stacked bar chart
pivot_percent.plot(
    kind='bar',
    stacked=True,
    figsize=(12, 7),
    colormap="Accent"
)

plt.ylabel("Percentage of Searches (%)", fontweight='bold')
plt.xlabel("Industry Sector", fontweight='bold')
plt.title("Opportunity Type Distribution Across Top Industry Sectors", fontsize=16, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.legend(title="Opportunity Type", bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()

Notable patterns:

Graduate Jobs account for the majority of searches across most industry sectors, suggesting that a large proportion of users are final-year students or recent graduates preparing for full-time roles.
Internships represent a significant share of searches in sectors such as Media & Entertainment, Business Process Outsourcing (BPO), and Retail & Consumer Goods, indicating industries where students may be looking for entry points to gain practical experience.
Virtual Experiences and Part-Time roles appear less frequently overall, but show small spikes in Education & Training and Consulting, where flexible or skill-based learning pathways may be relevant.

These patterns help illustrate how job-seeking intentions vary by industry, providing insight into user behavior that may support content prioritization, partner engagement strategy, and sector-level product improvements.

4.5. Weekly Activity Patterns of Job Searchers in Prosple

Understanding when users actively search for opportunities helps identify seasonal patterns in job-seeking behavior and may inform marketing timing, platform engagement strategies, and employer outreach cycles. Although the dataset covers only a short period (October to late November), it still offers useful indications of weekly search behavior.

To prepare the data, I first cleaned the timestamp field by dropping invalid entries and converting all valid values into datetime format. I then extracted the opportunity type IDs from the job search URLs and mapped them to their taxonomy labels. Rare opportunity types were grouped under “Others” to focus on the three major categories: Graduate Jobs or Programs, Internships/Clerkships/Placements, and Part-Time Jobs. Using the cleaned timestamps, I retrieved the ISO week number and calculated the weekly counts for each category, along with the total number of searches per week.

Run cancelled

import pandas as pd
import urllib.parse
import matplotlib.pyplot as plt

# Load datasets
df = pd.read_csv("associate_data_analyst_task_dataset.csv")
tax = pd.read_csv("taxonomy_data.csv")

# Keep only search job URLs
df["URL"] = df["URL"].astype(str).str.strip()
df_search = df[df["URL"].str.startswith("/search-jobs")].copy()

# Timestamp column
timestamp_cols = [col for col in df_search.columns if 'timestamp' in col.lower()]
timestamp_col = timestamp_cols[0]
df_search[timestamp_col] = pd.to_datetime(df_search[timestamp_col], errors="coerce")
df_search = df_search[df_search[timestamp_col].notna()].copy()  # drop invalid timestamps

# Extract opportunity_type_id from URL
def extract_opp_type(url):
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qs(parsed.query)
    opp = params.get("opportunity_types", [])
    return opp[0] if opp else None

df_search["opportunity_type_id"] = df_search["URL"].apply(extract_opp_type)
df_search = df_search[df_search["opportunity_type_id"].notna()].copy()
df_search["opportunity_type_id"] = pd.to_numeric(df_search["opportunity_type_id"], errors="coerce")

# Merge taxonomy labels
opp_tax = tax[tax["TAXONOMY_TYPE"] == "opportunity_type"][["TAXONOMY_ID", "LABEL"]]
df_search = df_search.merge(opp_tax, left_on="opportunity_type_id", right_on="TAXONOMY_ID", how="left")
df_search = df_search[df_search["LABEL"].notna()].copy()

# Group into major categories
major_categories = [
    "Internship, Clerkship or Placement",
    "Graduate Job or Program",
    "Part-Time Student Job"
]

df_search["Grouped"] = df_search["LABEL"].apply(
    lambda x: x if x in major_categories else "Others"
)

# Extract week number
df_search["Week"] = df_search[timestamp_col].dt.isocalendar().week

# Count searches by week and category
weekly_counts = df_search.groupby(["Grouped", "Week"]).size().reset_index(name="Count")

# Compute total searches per week
total_counts = df_search.groupby("Week").size().reset_index(name="Count")

# Plot line chart
plt.figure(figsize=(12, 6))

# Lines for each major category
for category in weekly_counts["Grouped"].unique():
    data = weekly_counts[weekly_counts["Grouped"] == category]
    plt.plot(
        data["Week"],
        data["Count"],
        label=category
    )

# Line for total searches
plt.plot(
    total_counts["Week"],
    total_counts["Count"],
    linewidth=3,
    label="Total Searches"
)

plt.xlabel("Week Number", fontweight='bold')
plt.ylabel("Number of Searches", fontweight='bold')
plt.title("Job Search Trend Over Time by Opportunity Category", fontsize=16, fontweight='bold')
plt.xticks(sorted(df_search["Week"].unique()))
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(title="Category")
plt.tight_layout()
plt.show()

‌
‌
‌