Skip to content

PostgreSQL Data Cleaning and Analysis for Super Store

Introduction

The aim of this project was to apply advanced SQL skills in a practical scenario using a dataset from a hypothetical Super Store. This project focused on two key objectives: identifying top-performing products across various categories based on total sales, and addressing missing values in the dataset, particularly in the quantity column. Through this process, the project helped enhance SQL proficiency in data cleaning, aggregation, and data transformation. By utilizing SQL functions such as aggregation, ranking, and calculation of missing values, the project provided valuable insights into the store's performance and data integrity.

Data Overview

The dataset consits of the following tables:

orders:

ColumnDefinitionData typeComments
row_idUnique Record IDINTEGER
order_idIdentifier for each order in tableTEXTConnects to order_id in returned_orders table
order_dateDate when order was placedTEXT
marketMarket order_id belongs toTEXT
regionRegion Customer belongs toTEXTConnects to region in people table
product_idIdentifier of Product boughtTEXTConnects to product_id in products table
salesTotal Sales Amount for the Line ItemDOUBLE PRECISION
quantityTotal Quantity for the Line ItemDOUBLE PRECISION
discountDiscount applied for the Line ItemDOUBLE PRECISION
profitTotal Profit earned on the Line ItemDOUBLE PRECISION

returned_orders:

ColumnDefinitionData type
returnedYes values for Order / Line Item ReturnedTEXT
order_idIdentifier for each order in tableTEXT
marketMarket order_id belongs toTEXT

people:

ColumnDefinitionData type
personName of Salesperson credited with OrderTEXT
regionRegion Salesperson in operating inTEXT

products:

ColumnDefinitionData type
product_idUnique Identifier for the ProductTEXT
categoryCategory Product belongs toTEXT
sub_categorySub Category Product belongs toTEXT
product_nameDetailed Name of the ProductTEXT

Data Cleaning and Transformation Process

1. Data Type Adjustments

The dataset presented some initial challenges, particularly with incorrect data types:

Numeric Fields: The columns sales, profit, and others were stored as Double Precision. These were handled appropriately during calculations to ensure accurate numeric processing.

2. Handling Missing Values

The quantity column contained missing values, which required attention. SQL techniques were employed to identify the rows with missing values and determine appropriate strategies for filling them. By using existing data on sales, unit prices, and applicable discounts, I was able to estimate the missing quantities.

SQL Queries and Analysis

The SQL queries used to achieve the goal of the analysis are as follows:

Spinner
DataFrameas
top_five_products_each_category
variable
-- Finding the top 5 products from each category based on highest total sales,  sorted by category in ascending order and by sales in descending order within each category, i.e. within each category product with highest margin should sit on the top. 


SELECT *
FROM (SELECT p.product_name,
		p.category,
		SUM(CAST(sales AS numeric)) AS product_total_sales,
		SUM(CAST(profit AS numeric)) AS product_total_profit,
		ROW_NUMBER() 
			OVER (PARTITION BY p.category
			  	ORDER BY SUM(CAST(sales AS numeric)) DESC) AS product_rank
	FROM orders AS o 
	INNER JOIN products AS p 
	ON o.product_id  = p.product_id
	GROUP BY p.category, p.product_name) AS ranks
WHERE product_rank <= 5
Spinner
DataFrameas
impute_missing_values
variable
-- Calculating the quantity for orders with missing values in the quantity column by determining the unit price for each product_id using available order data, considering relevant pricing factors such as discount, market, or region. Then, using this unit price to estimate the missing quantity values.


-- impute_missing_values
WITH item_prices AS (SELECT product_id, 
			AVG(CAST(sales AS numeric)/CAST(quantity AS numeric)) AS unit_price
		  FROM orders
		   GROUP BY product_id) 


SELECT o.product_id,
	discount,
	market,
	region,
	sales,
	quantity,
	ROUND(CAST(o.sales AS numeric)/CAST(ip.unit_price AS numeric)) AS calculated_quantity
FROM orders AS o
INNER JOIN item_prices AS ip
ON o.product_id = ip.product_id
WHERE o.quantity IS NULL