Project Introduction: Strategic HR Analytics on Employee Data
This project delivers an in-depth diagnostic analysis of historical employee data to provide actionable intelligence on workforce stability, compensation equity, and diversity. The primary objective is to move beyond descriptive reporting and isolate the core drivers of workforce dynamics, directly informing leadership decisions on retention strategies and fairness initiatives.
Analytical Focus and Core Methodology
The analysis is structured around three high-value business questions that require advanced feature engineering and statistical rigor. Each question is paired with its specific methodological approach:
1. Retention & Turnover Drivers (Churn Analysis)
Question: What are the primary factors influencing employee turnover, and how can we mitigate their impact?
Methodology: A Turnover Factor Analysis is executed by comparing Leavers against Actives. The primary challenge is addressed by feature engineering to retrieve the employee's exact state (manager, final salary, department) at the moment of exit.
- Key Factors: Compensation (Final Salary & Growth Rate), Role/Team (Job Title & Department), Leadership (Specific Manager), and Tenure (Employment Duration).
- Churn Definition: To ensure accuracy against administrative lag, a conservative definition is used: an employee is classified as a Leaver if their latest recorded employment end-date occurred at least one month prior to the dataset's cut-off date.
2. Compensation Equity and Variance
Question: Is there a significant variance in compensation across departments or job roles?
Methodology: We assess pay dispersion across Departments and Job Roles using rigorous statistical measures.
- Core Metrics: Calculation of Salary Variance (absolute spread) and the Coefficient of Variation (CV).
- CV Rationale: The CV is essential for comparative analysis, as it standardizes variance relative to the mean, helping determine if observed differences are structural or simply due to higher average salaries in one group.
- Goal: Determine if variances are within a reasonable organizational range or indicative of structural pay inequities.
3. Gender Diversity and Inclusion
Question: What is the gender diversity distribution across different levels of the organization?
Methodology: We analyze gender distribution across the organizational structure to identify potential gaps and measure progress.
- Organizational Dimensions: Distribution is analyzed across Departments and Job Roles (Titles).
- Assessment: Calculate gender ratios and percentages within each group, with a specific focus on leadership tiers (Manager roles) to assess representation at the highest levels.
- Goal: Provide insight into how diversity is distributed and identify areas that would benefit from targeted inclusion efforts.
Data Structure and Complexity
The underlying data is a standard relational model with six interconnected tables. A core focus of this project is managing the temporal dimension of the data, as three key tables track historical changes over time: Salaries, Titles, and Dept_emp. This necessitates advanced SQL feature engineering to consolidate historical records into precise, single analytical snapshots for accurate analysis (e.g., finding the manager and salary an employee had on their last day).
The dataset is composed of six tables providing historical employee data:
- Employees: Contains static biographical and hiring data, including employee number (
emp_no, int), birth date, gender (enum), and hire date. - Departments: Stores static departmental information, consisting of the department ID (
dept_no, char) and name. - Salaries: A time-varying table tracking salary changes and payment intervals. Key columns are employee number, salary (int), and the interval start (
from_date) and end (to_date). - Titles: A time-varying table recording the history of employee job titles. Key columns include employee number, the title (varchar), and the interval start (
from_date) and end (to_date). - Dept_emp: A time-varying table that records employee-to-department assignment history. Key columns are employee number, department ID, and the assignment interval start (
from_date) and end (to_date). - Dept_manager: A time-varying table that tracks the manager assigned to each department. Key columns are employee number (of the manager), department ID, and the assignment interval start (
from_date) and end (to_date).
Methodology: Data Cleaning and Exploration
The analysis follows a rigorous process to ensure data integrity and reliable results:
- Data Quality Check: Initial inspection of data types and verification of column formats.
- Identification of Issues: Detection and handling of data quality anomalies, including sentinel values (e.g., date placeholders like 9999-01-01), typos, and outliers.
- Data Cleaning: Addressing anomalies by correcting entries, standardizing variables, and removing problematic records.
- Post-Cleaning Exploration: Performing a final exploratory analysis, including summary statistics, distribution analysis, and frequency counts, before proceeding to the core analytical questions.
-- Looking at the data type of each column
SELECT
table_name,
column_name,
data_type
FROM
information_schema.columns
WHERE
table_schema NOT IN ('information_schema','pg_catalog')
-- Looking for missing values
WITH mv_dep_table AS (
SELECT
COUNT(*) AS mv_dep
FROM
departments
WHERE
dept_no IS NULL
OR
dept_name IS NULL
),
mv_dep_emp_table AS (
SELECT
COUNT(*) AS mv_dep_emp
FROM
dept_emp
WHERE
emp_no IS NULL
OR
dept_no IS NULL
OR
from_date IS NULL
OR
to_date IS NULL
),
mv_dep_man_table AS (
SELECT
COUNT(*) AS mv_dep_man
FROM
dept_manager
WHERE
emp_no IS NULL
OR
dept_no IS NULL
OR
from_date IS NULL
OR
to_date IS NULL
),
mv_emp_table AS (
SELECT
COUNT(*) AS mv_emp
FROM
employees
WHERE
emp_no IS NULL
OR
birth_date IS NULL
OR
first_name IS NULL
OR
last_name IS NULL
OR
gender IS NULL
OR
hire_date IS NULL
),
mv_sal_table AS (
SELECT
COUNT(*) AS mv_sal
FROM
salaries
WHERE
emp_no IS NULL
OR
salary IS NULL
OR
from_date IS NULL
OR
to_date IS NULL
),
mv_ti_table AS (
SELECT
COUNT(*) AS mv_ti
FROM
titles
WHERE
emp_no IS NULL
OR
title IS NULL
OR
from_date IS NULL
OR
to_date IS NULL
)
SELECT
mv_dep,
mv_dep_emp,
mv_dep_man,
mv_emp,
mv_sal,
mv_ti
FROM
mv_dep_table,
mv_dep_emp_table,
mv_dep_man_table,
mv_emp_table,
mv_sal_table,
mv_ti_table
-- Looking for non-expected values at the department table
SELECT
DISTINCT dept_name
FROM
departments-- Looking for non-expected values at the dept_emp table
SELECT
MIN(emp_no) AS min_emp_no,
MAX(emp_no) AS max_emp_no,
MIN(from_date) AS min_from_date,
MAX(from_date) AS max_from_date,
MIN(to_date) AS min_to_date,
MAX(to_date) AS max_to_date
FROM
dept_emp-- As the non-expected value in the to_date column means the last date of the data, we will see what was the last date resgistered in the data and we will add 1 day to that date, due to the fact that the interval between the other dates is 1 day
-- The last date registered in all tables was 2002-08-01 so for the non-expected values we will use the date 2002-08-02
SELECT
DISTINCT to_date
FROM
dept_emp
ORDER BY
to_date DESC-- As we found some non-expected values in the to_date columns we need to handle those
-- As we think that those values would represent the employees that worked until the last registered date we will use for those values the date that we calculate previouslly
SELECT
emp_no,
dept_no,
DATE_FORMAT(from_date, '%Y-%m-%d') AS from_date,
CASE
WHEN to_date = (
SELECT
MAX(to_date)
FROM
dept_emp
) THEN DATE_FORMAT('2002-08-02', '%Y-%m-%d')
ELSE DATE_FORMAT(to_date, '%Y-%m-%d')
END AS to_date
FROM
dept_emp-- Looking for non-expected values in the dept_no column in the dept_emp table
SELECT
DISTINCT dept_no
FROM
dept_emp-- Looking for non_expected values in the dept_manager table
SELECT
MIN(emp_no) AS min_emp_no,
MAX(emp_no) AS max_emp_no,
MIN(from_date) AS min_from_date,
MAX(from_date) AS max_from_date,
MIN(to_date) AS min_to_date,
MAX(to_date) AS max_to_date
FROM
dept_manager-- As we found some non-expected values in the to_date column we need to handle those
-- As the previous table we think that those values represent that the employee was a manager until the last date registered we will replace those values with the previouslly calculated date
SELECT
emp_no,
dept_no,
DATE_FORMAT(from_date, '%Y-%m-%d') AS from_date,
CASE
WHEN to_date = (
SELECT
MAX(to_date)
FROM
dept_manager
) THEN DATE_FORMAT('2002-08-02', '%Y-%m-%d')
ELSE DATE_FORMAT(to_date, '%Y-%m-%d')
END AS to_date
FROM
dept_manager-- Looking for non-expected values in the dept_no column in the dept_manager table
SELECT
DISTINCT dept_no
FROM
dept_manager-- Looking for non-expected values in the employees table
SELECT
MIN(emp_no) AS min_emp_no,
MAX(emp_no) AS max_emp_no,
MIN(birth_date) AS min_birth_date,
MAX(birth_date) AS max_birth_date,
MIN(hire_date) AS min_hire_date,
MAX(hire_date) AS max_hire_date
FROM
employees-- Looking for non-expected values in the first_name and last_name columns in the employees table
SELECT
COUNT(*)
FROM
employees
WHERE
first_name NOT REGEXP '^[A-Za-z]+$'
OR
last_name NOT REGEXP '^[A-Za-z]+$'-- Looking for non-expected values in the gender column in the employees table
SELECT
DISTINCT gender
FROM
employees-- Looking for non-expected values in the salaries table
SELECT
MIN(emp_no) AS min_emp_no,
MAX(emp_no) AS max_emp_no,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
MIN(from_date) AS min_from_date,
MAX(from_date) AS max_from_date,
MIN(to_date) AS min_to_date,
MAX(to_date) AS max_to_date
FROM
salaries