Skip to content

Project Introduction: Strategic HR Analytics on Employee Data

This project delivers an in-depth diagnostic analysis of historical employee data to provide actionable intelligence on workforce stability, compensation equity, and diversity. The primary objective is to move beyond descriptive reporting and isolate the core drivers of workforce dynamics, directly informing leadership decisions on retention strategies and fairness initiatives.

Analytical Focus and Core Methodology

The analysis is structured around three high-value business questions that require advanced feature engineering and statistical rigor. Each question is paired with its specific methodological approach:

1. Retention & Turnover Drivers (Churn Analysis)

Question: What are the primary factors influencing employee turnover, and how can we mitigate their impact?

Methodology: A Turnover Factor Analysis is executed by comparing Leavers against Actives. The primary challenge is addressed by feature engineering to retrieve the employee's exact state (manager, final salary, department) at the moment of exit.

  • Key Factors: Compensation (Final Salary & Growth Rate), Role/Team (Job Title & Department), Leadership (Specific Manager), and Tenure (Employment Duration).
  • Churn Definition: To ensure accuracy against administrative lag, a conservative definition is used: an employee is classified as a Leaver if their latest recorded employment end-date occurred at least one month prior to the dataset's cut-off date.

2. Compensation Equity and Variance

Question: Is there a significant variance in compensation across departments or job roles?

Methodology: We assess pay dispersion across Departments and Job Roles using rigorous statistical measures.

  • Core Metrics: Calculation of Salary Variance (absolute spread) and the Coefficient of Variation (CV).
  • CV Rationale: The CV is essential for comparative analysis, as it standardizes variance relative to the mean, helping determine if observed differences are structural or simply due to higher average salaries in one group.
  • Goal: Determine if variances are within a reasonable organizational range or indicative of structural pay inequities.

3. Gender Diversity and Inclusion

Question: What is the gender diversity distribution across different levels of the organization?

Methodology: We analyze gender distribution across the organizational structure to identify potential gaps and measure progress.

  • Organizational Dimensions: Distribution is analyzed across Departments and Job Roles (Titles).
  • Assessment: Calculate gender ratios and percentages within each group, with a specific focus on leadership tiers (Manager roles) to assess representation at the highest levels.
  • Goal: Provide insight into how diversity is distributed and identify areas that would benefit from targeted inclusion efforts.

Data Structure and Complexity

The underlying data is a standard relational model with six interconnected tables. A core focus of this project is managing the temporal dimension of the data, as three key tables track historical changes over time: Salaries, Titles, and Dept_emp. This necessitates advanced SQL feature engineering to consolidate historical records into precise, single analytical snapshots for accurate analysis (e.g., finding the manager and salary an employee had on their last day).

The dataset is composed of six tables providing historical employee data:

  • Employees: Contains static biographical and hiring data, including employee number (emp_no, int), birth date, gender (enum), and hire date.
  • Departments: Stores static departmental information, consisting of the department ID (dept_no, char) and name.
  • Salaries: A time-varying table tracking salary changes and payment intervals. Key columns are employee number, salary (int), and the interval start (from_date) and end (to_date).
  • Titles: A time-varying table recording the history of employee job titles. Key columns include employee number, the title (varchar), and the interval start (from_date) and end (to_date).
  • Dept_emp: A time-varying table that records employee-to-department assignment history. Key columns are employee number, department ID, and the assignment interval start (from_date) and end (to_date).
  • Dept_manager: A time-varying table that tracks the manager assigned to each department. Key columns are employee number (of the manager), department ID, and the assignment interval start (from_date) and end (to_date).

Methodology: Data Cleaning and Exploration

The analysis follows a rigorous process to ensure data integrity and reliable results:

  1. Data Quality Check: Initial inspection of data types and verification of column formats.
  2. Identification of Issues: Detection and handling of data quality anomalies, including sentinel values (e.g., date placeholders like 9999-01-01), typos, and outliers.
  3. Data Cleaning: Addressing anomalies by correcting entries, standardizing variables, and removing problematic records.
  4. Post-Cleaning Exploration: Performing a final exploratory analysis, including summary statistics, distribution analysis, and frequency counts, before proceeding to the core analytical questions.
Spinner
Queryas
query
variable
-- Looking at the data type of each column
SELECT
	table_name,
	column_name,
	data_type
FROM
	information_schema.columns
WHERE
	table_schema NOT IN ('information_schema','pg_catalog')
Spinner
Queryas
query1
variable
-- Looking for missing values

WITH mv_dep_table AS (
	SELECT 
		COUNT(*) AS mv_dep
	FROM
		departments
	WHERE
		dept_no IS NULL
		OR
		dept_name IS NULL
),
mv_dep_emp_table AS (
	SELECT
		COUNT(*) AS mv_dep_emp
	FROM
		dept_emp
	WHERE
		emp_no IS NULL
		OR
		dept_no IS NULL
		OR
		from_date IS NULL
		OR
		to_date IS NULL
),
mv_dep_man_table AS (
	SELECT
		COUNT(*) AS mv_dep_man
	FROM
		dept_manager
	WHERE
		emp_no IS NULL
		OR
		dept_no IS NULL
		OR
		from_date IS NULL
		OR
		to_date IS NULL
),
mv_emp_table AS (
	SELECT
		COUNT(*) AS mv_emp
	FROM
		employees
	WHERE
		emp_no IS NULL
		OR
		birth_date IS NULL
		OR
		first_name IS NULL
		OR
		last_name IS NULL
		OR
		gender IS NULL
		OR
		hire_date IS NULL
),
mv_sal_table AS (
	SELECT
		COUNT(*) AS mv_sal
	FROM
		salaries
	WHERE
		emp_no IS NULL
		OR
		salary IS NULL
		OR
		from_date IS NULL
		OR
		to_date IS NULL
),
mv_ti_table AS (
	SELECT
		COUNT(*) AS mv_ti
	FROM
		titles
	WHERE
		emp_no IS NULL
		OR
		title IS NULL
		OR
		from_date IS NULL
		OR
		to_date IS NULL
)

SELECT
	mv_dep,
	mv_dep_emp,
	mv_dep_man,
	mv_emp,
	mv_sal,
	mv_ti
FROM
	mv_dep_table,
	mv_dep_emp_table,
	mv_dep_man_table,
	mv_emp_table,
	mv_sal_table,
	mv_ti_table
	
Spinner
Queryas
query2
variable
-- Looking for non-expected values at the department table

SELECT
	DISTINCT dept_name
FROM
	departments
Spinner
Queryas
query3
variable
-- Looking for non-expected values at the dept_emp table

SELECT
	MIN(emp_no) AS min_emp_no,
	MAX(emp_no) AS max_emp_no,
	MIN(from_date) AS min_from_date,
	MAX(from_date) AS max_from_date,
	MIN(to_date) AS min_to_date,
	MAX(to_date) AS max_to_date
FROM
	dept_emp
Spinner
Queryas
query4
variable
-- As the non-expected value in the to_date column means the last date of the data, we will see what was the last date resgistered in the data and we will add 1 day to that date, due to the fact that the interval between the other dates is 1 day
-- The last date registered in all tables was 2002-08-01 so for the non-expected values we will use the date 2002-08-02

SELECT
	DISTINCT to_date
FROM
	dept_emp
ORDER BY
	to_date DESC
Spinner
Queryas
clean_dept_emp
variable
-- As we found some non-expected values in the to_date columns we need to handle those
-- As we think that those values would represent the employees that worked until the last registered date we will use for those values the date that we calculate previouslly

SELECT
	emp_no,
	dept_no,
	DATE_FORMAT(from_date, '%Y-%m-%d') AS from_date,
	CASE
		WHEN to_date = (
			SELECT
				MAX(to_date)
			FROM
				dept_emp
		) THEN DATE_FORMAT('2002-08-02', '%Y-%m-%d')
		ELSE DATE_FORMAT(to_date, '%Y-%m-%d')
	END AS to_date
FROM
	dept_emp
Spinner
Queryas
query5
variable
-- Looking for non-expected values in the dept_no column in the dept_emp table

SELECT
	DISTINCT dept_no
FROM
	dept_emp
Spinner
Queryas
query6
variable
-- Looking for non_expected values in the dept_manager table

SELECT
	MIN(emp_no) AS min_emp_no,
	MAX(emp_no) AS max_emp_no,
	MIN(from_date) AS min_from_date,
	MAX(from_date) AS max_from_date,
	MIN(to_date) AS min_to_date,
	MAX(to_date) AS max_to_date
FROM
	dept_manager
Spinner
Queryas
clean_dept_manager
variable
-- As we found some non-expected values in the to_date column we need to handle those
-- As the previous table we think that those values represent that the employee was a manager until the last date registered we will replace those values with the previouslly calculated date

SELECT
	emp_no,
	dept_no,
	DATE_FORMAT(from_date, '%Y-%m-%d') AS from_date,
	CASE
		WHEN to_date = (
			SELECT
				MAX(to_date)
			FROM
				dept_manager
		) THEN DATE_FORMAT('2002-08-02', '%Y-%m-%d')
		ELSE DATE_FORMAT(to_date, '%Y-%m-%d')
	END AS to_date
FROM
	dept_manager
Spinner
Queryas
query7
variable
-- Looking for non-expected values in the dept_no column in the dept_manager table

SELECT
	DISTINCT dept_no
FROM
	dept_manager
Spinner
Queryas
query8
variable
-- Looking for non-expected values in the employees table

SELECT
	MIN(emp_no) AS min_emp_no,
	MAX(emp_no) AS max_emp_no,
	MIN(birth_date) AS min_birth_date,
	MAX(birth_date) AS max_birth_date,
	MIN(hire_date) AS min_hire_date,
	MAX(hire_date) AS max_hire_date
FROM
	employees
Spinner
Queryas
query9
variable
-- Looking for non-expected values in the first_name and last_name columns in the employees table

SELECT
	COUNT(*)
FROM
	employees
WHERE
	first_name NOT REGEXP '^[A-Za-z]+$'
	OR
	last_name NOT REGEXP '^[A-Za-z]+$'
Spinner
Queryas
query10
variable
-- Looking for non-expected values in the gender column in the employees table

SELECT
	DISTINCT gender
FROM
	employees
Spinner
Queryas
query11
variable
-- Looking for non-expected values in the salaries table

SELECT
	MIN(emp_no) AS min_emp_no,
	MAX(emp_no) AS max_emp_no,
	MIN(salary) AS min_salary,
	MAX(salary) AS max_salary,
	MIN(from_date) AS min_from_date,
	MAX(from_date) AS max_from_date,
	MIN(to_date) AS min_to_date,
	MAX(to_date) AS max_to_date
FROM
	salaries