Skip to content
Superstore Sales Data Exploration & Cleaning
Superstore Sales Data Exploration & Cleaning
Due to the nature of sales and company privacy, I wasn't able to pull any real sales data from online, so I had to look to Kaggle for a dataset I could use to clean and build a dashboard.
I landed on the Superstore Dataset.
To start, I separated the single file into Sales Info, Customer Info, and Product Info .csv files in order to practice a little SQL before building the dashboard.
We'll start by simply joining all three tables to get a look at the full dataset:
DataFrameas
df
variable
SELECT * FROM 'superstore_sales_info.csv'
LEFT JOIN 'superstore_customer_info.csv'
ON "Customer ID" = "customer_id"
LEFT JOIN join 'superstore_product_info.csv'
ON "Product ID" = "product_id";Next, we'll find the top-5 products by quantity sold:
DataFrameas
df1
variable
SELECT product_name, sum(Quantity) as SumQuant
FROM 'superstore_sales_info.csv'
LEFT JOIN 'superstore_product_info.csv'
ON "Product ID" = "product_id"
GROUP BY product_name
ORDER BY SumQuant DESC
LIMIT 5;Let's also find the top-5 products by profit:
DataFrameas
df2
variable
SELECT product_name, sum(Profit) as SumProfit
FROM 'superstore_sales_info.csv'
LEFT JOIN 'superstore_product_info.csv'
ON "Product ID" = "product_id"
GROUP BY product_name
ORDER BY SumProfit DESC
LIMIT 5;Now let's combine all of the data sources into a single dataset that includes columns for the:
- Year of the order_date
- Month of the order_date
- Season of the order_date
DataFrameas
df3
variable
select *, date_part('year', "Order Date") as Year, date_part('month', "Order Date") as Month,
case when date_part('month', "Order Date") in (12,1,2) then 'Winter'
when date_part('month', "Order Date") in (3,4,5) then 'Spring'
when date_part('month', "Order Date") in (6,7,8) then 'Summer'
when date_part('month', "Order Date") in (9,10,11) then 'Fall'
else 'Unknown' end as Season
from "superstore_sales_info.csv"
left join "superstore_customer_info.csv"
on "Customer ID" = customer_id
left join "superstore_product_info.csv"
on "Product ID" = product_id;