Task 1
From taking a quick look at the data, you are pretty certain it isn't quite as it should be. You need to make sure all of the data is clean before you start your analysis. The table below shows what the data should look like.
Write a query to return a table that matches the description provided.
Do not update the original table.
They have provided you with access to their database, which contains the following table called sales:
Sales
| Column | Data type | Description |
|---|---|---|
order_number | VARCHAR | Unique order number. |
date | DATE | Date of the order, from June to August 2021. |
warehouse | VARCHAR | The warehouse that the order was made from— North, Central, or West. |
client_type | VARCHAR | Whether the order was Retail or Wholesale. |
product_line | VARCHAR | Type of product ordered. |
quantity | INT | Number of products ordered. |
unit_price | FLOAT | Price per product (dollars). |
total | FLOAT | Total price of the order (dollars). |
payment | VARCHAR | Payment method—Credit card, Transfer, or Cash. |
payment_fee | FLOAT | Percentage of total charged as a result of the payment method. |
SELECT *
FROM public.sales;Project Outline:
The goal of this project is to conduct data cleaning and perform Exploratory Data Analysis (EDA) on the motorcycle parts sales dataset, alongside the SQL queries required to execute these tasks. The specific activities include:
-
Data Cleaning and Transformation Procedures: This step ensures that the dataset is accurate, consistent, and prepared for subsequent analysis.
-
Standardization of Categorical Data: This process involves rectifying inconsistencies in text-based columns (e.g., client_type, product_line) by eliminating extra spaces and standardizing text casing.
-
Exploratory Data Analysis (EDA): EDA facilitates a comprehensive understanding of the dataset's primary characteristics, allowing for the identification of underlying patterns and anomalies.
DATA CLEANING AND TRANSFORMATION
--The goal of this step is to STEP 1. Identify data types STEP 2. Create a staging table using a Common Table Expression (CTE) STEP 3. Data type conversion STEP 4. Handle missing values
--STEP 1: Data Types Identification
--To determine the types of operations that can be performed on the dataset.
--To ensure data integrity
SELECT * FROM information_schema.columns WHERE table_name = 'sales';NOTE:
- From the previous step, we determined that the warehouse, client type, product line, and payment are all categorized as 'character varying'. However, this classification does not clearly specify the data types for these variables. To resolve this ambiguity, we will proceed with data conversion.
- However, before converting the variables to the appropriate data types, we will first create a staging table using a CTE, which will be utilized for data transformation and analysis.
-- Using a Common Table Expression (CTE) to work with the data without creating a permanent table
WITH sales_staging AS (
SELECT * FROM sales
)
SELECT * FROM sales_staging;--STEP 3: Data Type CONVERSION
--This process ensures that columns are stored with the appropriate data types (e.g., date as DATE, total as NUMERIC, and payment_fee as NUMERIC).
--This is essential for accurate calculations and date functions.
--Converting TOTAL AND PAYMENT FEES columns to NUMERIC
--Utilised CAST function
WITH sales_staging AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
FROM sales
)
SELECT
CAST(total AS NUMERIC) AS total_conv,
CAST(payment_fee AS NUMERIC) AS payment_fee_conv
FROM sales_staging
WHERE rn = 1;--STEP 4 Identifying DUPLICATES
--Using SELECT * and a CTE statement
WITH sales_staging AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
FROM sales
)
SELECT *
FROM sales_staging
WHERE rn > 1;--STEP 5 Addressing DUPLICATES
--Using a SELECT ONLY UNIQUE ROWS query
WITH sales_staging AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
FROM sales
)
SELECT
*
FROM sales_staging
WHERE rn = 1;STANDARDIZING CATEGORICAL DATA
- This step involves addressing inconsistencies in text-based columns, such as client_type and product_line, by removing extra spaces (whitespaces) or standardizing casing.
- Utilised:
a CTE; TRIM function, CASE statement, WHERE, WHEN, GROUP BY, and ORDER BY clauses, LOWER function
--Example 1: Utilising DISTINCT, TRIM and LOWER functions
WITH sales_staging AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
FROM sales
)
SELECT
DISTINCT TRIM(LOWER(client_type))
FROM sales_staging
WHERE rn = 1;--Example 2: Using a DISTINCT and TRIM functions
WITH sales_staging AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
FROM sales
)
SELECT
DISTINCT TRIM(product_line)
FROM sales_staging
WHERE rn = 1;--Example 3: Utilising TRIM function; CASE statement; GROUP BY, ORDER BY, WHEN, and WHERE clauses
WITH sales_staging AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
FROM sales
)
SELECT
product_line,
TRIM(TO_CHAR(date, 'Month')) AS month,
warehouse,
SUM(total - payment_fee) AS net_revenue
FROM sales_staging
WHERE rn = 1
AND TRIM(TO_CHAR(date, 'Month')) IN ('June', 'July', 'August')
AND client_type = 'Wholesale'
GROUP BY
product_line,
TRIM(TO_CHAR(date, 'Month')),
warehouse
ORDER BY
product_line,
CASE TRIM(TO_CHAR(date, 'Month'))
WHEN 'June' THEN 1
WHEN 'July' THEN 2
WHEN 'August' THEN 3
ELSE 4
END,
net_revenue DESC;EXPLORATORY DATA ANALYSIS (EDA)
- The goal of EDA is to
Help us understand the main characteristics of the data, Uncover patterns, and Identify anomalies.