Project: PostgreSQL Data Cleaning and Analysis for Super Store

PostgreSQL Data Cleaning and Analysis for Super Store

Introduction

The aim of this project was to apply advanced SQL skills in a practical scenario using a dataset from a hypothetical Super Store. This project focused on two key objectives: identifying top-performing products across various categories based on total sales, and addressing missing values in the dataset, particularly in the quantity column. Through this process, the project helped enhance SQL proficiency in data cleaning, aggregation, and data transformation. By utilizing SQL functions such as aggregation, ranking, and calculation of missing values, the project provided valuable insights into the store's performance and data integrity.

Data Overview

The dataset consits of the following tables:

`orders`:

Column	Definition	Data type	Comments
`row_id`	Unique Record ID	`INTEGER`
`order_id`	Identifier for each order in table	`TEXT`	Connects to `order_id` in `returned_orders` table
`order_date`	Date when order was placed	`TEXT`
`market`	Market order_id belongs to	`TEXT`
`region`	Region Customer belongs to	`TEXT`	Connects to `region` in `people` table
`product_id`	Identifier of Product bought	`TEXT`	Connects to `product_id` in `products` table
`sales`	Total Sales Amount for the Line Item	`DOUBLE PRECISION`
`quantity`	Total Quantity for the Line Item	`DOUBLE PRECISION`
`discount`	Discount applied for the Line Item	`DOUBLE PRECISION`
`profit`	Total Profit earned on the Line Item	`DOUBLE PRECISION`

`returned_orders`:

Column	Definition	Data type
`returned`	Yes values for Order / Line Item Returned	`TEXT`
`order_id`	Identifier for each order in table	`TEXT`
`market`	Market order_id belongs to	`TEXT`

`people`:

Column	Definition	Data type
`person`	Name of Salesperson credited with Order	`TEXT`
`region`	Region Salesperson in operating in	`TEXT`

`products`:

Column	Definition	Data type
`product_id`	Unique Identifier for the Product	`TEXT`
`category`	Category Product belongs to	`TEXT`
`sub_category`	Sub Category Product belongs to	`TEXT`
`product_name`	Detailed Name of the Product	`TEXT`

Data Cleaning and Transformation Process

1. Data Type Adjustments

The dataset presented some initial challenges, particularly with incorrect data types:

Numeric Fields: The columns sales, profit, and others were stored as Double Precision. These were handled appropriately during calculations to ensure accurate numeric processing.

2. Handling Missing Values

The quantity column contained missing values, which required attention. SQL techniques were employed to identify the rows with missing values and determine appropriate strategies for filling them. By using existing data on sales, unit prices, and applicable discounts, I was able to estimate the missing quantities.

SQL Queries and Analysis

The SQL queries used to achieve the goal of the analysis are as follows:

DataFrameas

top_five_products_each_category

variable

-- Finding the top 5 products from each category based on highest total sales,  sorted by category in ascending order and by sales in descending order within each category, i.e. within each category product with highest margin should sit on the top. 


SELECT *
FROM (SELECT p.product_name,
		p.category,
		SUM(CAST(sales AS numeric)) AS product_total_sales,
		SUM(CAST(profit AS numeric)) AS product_total_profit,
		ROW_NUMBER() 
			OVER (PARTITION BY p.category
			  	ORDER BY SUM(CAST(sales AS numeric)) DESC) AS product_rank
	FROM orders AS o 
	INNER JOIN products AS p 
	ON o.product_id  = p.product_id
	GROUP BY p.category, p.product_name) AS ranks
WHERE product_rank <= 5