PostgreSQL Data Cleaning and Analysis for Super Store
Introduction
The aim of this project was to apply advanced SQL skills in a practical scenario using a dataset from a hypothetical Super Store. This project focused on two key objectives: identifying top-performing products across various categories based on total sales, and addressing missing values in the dataset, particularly in the quantity column. Through this process, the project helped enhance SQL proficiency in data cleaning, aggregation, and data transformation. By utilizing SQL functions such as aggregation, ranking, and calculation of missing values, the project provided valuable insights into the store's performance and data integrity.
Data Overview
The dataset consits of the following tables:
orders:
orders:| Column | Definition | Data type | Comments |
|---|---|---|---|
row_id | Unique Record ID | INTEGER | |
order_id | Identifier for each order in table | TEXT | Connects to order_id in returned_orders table |
order_date | Date when order was placed | TEXT | |
market | Market order_id belongs to | TEXT | |
region | Region Customer belongs to | TEXT | Connects to region in people table |
product_id | Identifier of Product bought | TEXT | Connects to product_id in products table |
sales | Total Sales Amount for the Line Item | DOUBLE PRECISION | |
quantity | Total Quantity for the Line Item | DOUBLE PRECISION | |
discount | Discount applied for the Line Item | DOUBLE PRECISION | |
profit | Total Profit earned on the Line Item | DOUBLE PRECISION |
returned_orders:
returned_orders:| Column | Definition | Data type |
|---|---|---|
returned | Yes values for Order / Line Item Returned | TEXT |
order_id | Identifier for each order in table | TEXT |
market | Market order_id belongs to | TEXT |
people:
people:| Column | Definition | Data type |
|---|---|---|
person | Name of Salesperson credited with Order | TEXT |
region | Region Salesperson in operating in | TEXT |
products:
products:| Column | Definition | Data type |
|---|---|---|
product_id | Unique Identifier for the Product | TEXT |
category | Category Product belongs to | TEXT |
sub_category | Sub Category Product belongs to | TEXT |
product_name | Detailed Name of the Product | TEXT |
Data Cleaning and Transformation Process
1. Data Type Adjustments
The dataset presented some initial challenges, particularly with incorrect data types:
Numeric Fields: The columns sales, profit, and others were stored as Double Precision. These were handled appropriately during calculations to ensure accurate numeric processing.
2. Handling Missing Values
The quantity column contained missing values, which required attention. SQL techniques were employed to identify the rows with missing values and determine appropriate strategies for filling them. By using existing data on sales, unit prices, and applicable discounts, I was able to estimate the missing quantities.
SQL Queries and Analysis
The SQL queries used to achieve the goal of the analysis are as follows:
-- Finding the top 5 products from each category based on highest total sales, sorted by category in ascending order and by sales in descending order within each category, i.e. within each category product with highest margin should sit on the top.
SELECT *
FROM (SELECT p.product_name,
p.category,
SUM(CAST(sales AS numeric)) AS product_total_sales,
SUM(CAST(profit AS numeric)) AS product_total_profit,
ROW_NUMBER()
OVER (PARTITION BY p.category
ORDER BY SUM(CAST(sales AS numeric)) DESC) AS product_rank
FROM orders AS o
INNER JOIN products AS p
ON o.product_id = p.product_id
GROUP BY p.category, p.product_name) AS ranks
WHERE product_rank <= 5-- Calculating the quantity for orders with missing values in the quantity column by determining the unit price for each product_id using available order data, considering relevant pricing factors such as discount, market, or region. Then, using this unit price to estimate the missing quantity values.
-- impute_missing_values
WITH item_prices AS (SELECT product_id,
AVG(CAST(sales AS numeric)/CAST(quantity AS numeric)) AS unit_price
FROM orders
GROUP BY product_id)
SELECT o.product_id,
discount,
market,
region,
sales,
quantity,
ROUND(CAST(o.sales AS numeric)/CAST(ip.unit_price AS numeric)) AS calculated_quantity
FROM orders AS o
INNER JOIN item_prices AS ip
ON o.product_id = ip.product_id
WHERE o.quantity IS NULL