Project: Data Cleaning and EDA, SQL

Task 1

From taking a quick look at the data, you are pretty certain it isn't quite as it should be. You need to make sure all of the data is clean before you start your analysis. The table below shows what the data should look like.

Write a query to return a table that matches the description provided.

Do not update the original table.

They have provided you with access to their database, which contains the following table called sales:

Sales

Column	Data type	Description
`order_number`	`VARCHAR`	Unique order number.
`date`	`DATE`	Date of the order, from June to August 2021.
`warehouse`	`VARCHAR`	The warehouse that the order was made from— `North`, `Central`, or `West`.
`client_type`	`VARCHAR`	Whether the order was `Retail` or `Wholesale`.
`product_line`	`VARCHAR`	Type of product ordered.
`quantity`	`INT`	Number of products ordered.
`unit_price`	`FLOAT`	Price per product (dollars).
`total`	`FLOAT`	Total price of the order (dollars).
`payment`	`VARCHAR`	Payment method—`Credit card`, `Transfer`, or `Cash`.
`payment_fee`	`FLOAT`	Percentage of `total` charged as a result of the `payment` method.

DataFrameas

df

variable

SELECT *
FROM public.sales;

Project Outline:

The goal of this project is to conduct data cleaning and perform Exploratory Data Analysis (EDA) on the motorcycle parts sales dataset, alongside the SQL queries required to execute these tasks. The specific activities include:

Data Cleaning and Transformation Procedures: This step ensures that the dataset is accurate, consistent, and prepared for subsequent analysis.
Standardization of Categorical Data: This process involves rectifying inconsistencies in text-based columns (e.g., client_type, product_line) by eliminating extra spaces and standardizing text casing.
Exploratory Data Analysis (EDA): EDA facilitates a comprehensive understanding of the dataset's primary characteristics, allowing for the identification of underlying patterns and anomalies.

DATA CLEANING AND TRANSFORMATION

--The goal of this step is to STEP 1. Identify data types STEP 2. Create a staging table using a Common Table Expression (CTE) STEP 3. Data type conversion STEP 4. Handle missing values

DataFrameas

df7

variable

--STEP 1: Data Types Identification
--To determine the types of operations that can be performed on the dataset.
--To ensure data integrity

SELECT * FROM information_schema.columns WHERE table_name = 'sales';

NOTE:

From the previous step, we determined that the warehouse, client type, product line, and payment are all categorized as 'character varying'. However, this classification does not clearly specify the data types for these variables. To resolve this ambiguity, we will proceed with data conversion.
However, before converting the variables to the appropriate data types, we will first create a staging table using a CTE, which will be utilized for data transformation and analysis.

DataFrameas

df8

variable

-- Using a Common Table Expression (CTE) to work with the data without creating a permanent table

WITH sales_staging AS (
  SELECT * FROM sales
)
SELECT * FROM sales_staging;

DataFrameas

df5

variable

--STEP 3: Data Type CONVERSION
--This process ensures that columns are stored with the appropriate data types (e.g., date as DATE, total as NUMERIC, and payment_fee as NUMERIC). 
--This is essential for accurate calculations and date functions.
--Converting TOTAL AND PAYMENT FEES columns to NUMERIC
--Utilised CAST function


WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
    FROM sales
)
SELECT
    CAST(total AS NUMERIC) AS total_conv,
    CAST(payment_fee AS NUMERIC) AS payment_fee_conv
FROM sales_staging
WHERE rn = 1;

DataFrameas

df9

variable

--STEP 4 Identifying DUPLICATES 
--Using SELECT * and a CTE statement

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
    FROM sales
)
SELECT *
FROM sales_staging
WHERE rn > 1;

DataFrameas

df10

variable

--STEP 5 Addressing DUPLICATES
--Using a SELECT ONLY UNIQUE ROWS query

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
    FROM sales
)
SELECT
    *
FROM sales_staging
WHERE rn = 1;

STANDARDIZING CATEGORICAL DATA

This step involves addressing inconsistencies in text-based columns, such as client_type and product_line, by removing extra spaces (whitespaces) or standardizing casing.
Utilised:

a CTE; TRIM function, CASE statement, WHERE, WHEN, GROUP BY, and ORDER BY clauses, LOWER function

DataFrameas

df2

variable

--Example 1: Utilising DISTINCT, TRIM and LOWER functions

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
    FROM sales
)
SELECT
    DISTINCT TRIM(LOWER(client_type))
FROM sales_staging
WHERE rn = 1;

DataFrameas

df3

variable

--Example 2: Using a DISTINCT and TRIM functions

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
    FROM sales
)
SELECT
	DISTINCT TRIM(product_line) 
FROM sales_staging
WHERE rn = 1;

DataFrameas

df4

variable

--Example 3: Utilising TRIM function; CASE statement; GROUP BY, ORDER BY, WHEN, and WHERE clauses 

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
    FROM sales
)
SELECT
    product_line,
    TRIM(TO_CHAR(date, 'Month')) AS month,
    warehouse,
    SUM(total - payment_fee) AS net_revenue
FROM sales_staging
WHERE rn = 1
  AND TRIM(TO_CHAR(date, 'Month')) IN ('June', 'July', 'August')
  AND client_type = 'Wholesale'
GROUP BY
    product_line,
    TRIM(TO_CHAR(date, 'Month')),
    warehouse
ORDER BY
    product_line,
    CASE TRIM(TO_CHAR(date, 'Month'))
        WHEN 'June' THEN 1
        WHEN 'July' THEN 2
        WHEN 'August' THEN 3
        ELSE 4
    END,
    net_revenue DESC;

EXPLORATORY DATA ANALYSIS (EDA)

The goal of EDA is to

Help us understand the main characteristics of the data, Uncover patterns, and Identify anomalies.

‌
‌
‌

Project: Data Cleaning and EDA, SQL

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Sales

DATA CLEANING AND TRANSFORMATION

NOTE:

STANDARDIZING CATEGORICAL DATA

Sales