Skip to content

Task 1

From taking a quick look at the data, you are pretty certain it isn't quite as it should be. You need to make sure all of the data is clean before you start your analysis. The table below shows what the data should look like.

Write a query to return a table that matches the description provided.

Do not update the original table.

They have provided you with access to their database, which contains the following table called sales:

Sales

ColumnData typeDescription
order_numberVARCHARUnique order number.
dateDATEDate of the order, from June to August 2021.
warehouseVARCHARThe warehouse that the order was made from— North, Central, or West.
client_typeVARCHARWhether the order was Retail or Wholesale.
product_lineVARCHARType of product ordered.
quantityINTNumber of products ordered.
unit_priceFLOATPrice per product (dollars).
totalFLOATTotal price of the order (dollars).
paymentVARCHARPayment method—Credit card, Transfer, or Cash.
payment_feeFLOATPercentage of total charged as a result of the payment method.
Spinner
DataFrameas
df
variable
SELECT *
FROM public.sales;

Project Outline:

The goal of this project is to conduct data cleaning and perform Exploratory Data Analysis (EDA) on the motorcycle parts sales dataset, alongside the SQL queries required to execute these tasks. The specific activities include:

  1. Data Cleaning and Transformation Procedures: This step ensures that the dataset is accurate, consistent, and prepared for subsequent analysis.

  2. Standardization of Categorical Data: This process involves rectifying inconsistencies in text-based columns (e.g., client_type, product_line) by eliminating extra spaces and standardizing text casing.

  3. Exploratory Data Analysis (EDA): EDA facilitates a comprehensive understanding of the dataset's primary characteristics, allowing for the identification of underlying patterns and anomalies.

DATA CLEANING AND TRANSFORMATION

--The goal of this step is to STEP 1. Identify data types STEP 2. Create a staging table using a Common Table Expression (CTE) STEP 3. Data type conversion STEP 4. Handle missing values

Spinner
DataFrameas
df7
variable
--STEP 1: Data Types Identification
--To determine the types of operations that can be performed on the dataset.
--To ensure data integrity

SELECT * FROM information_schema.columns WHERE table_name = 'sales';

NOTE:

  • From the previous step, we determined that the warehouse, client type, product line, and payment are all categorized as 'character varying'. However, this classification does not clearly specify the data types for these variables. To resolve this ambiguity, we will proceed with data conversion.
  • However, before converting the variables to the appropriate data types, we will first create a staging table using a CTE, which will be utilized for data transformation and analysis.
Spinner
DataFrameas
df8
variable
-- Using a Common Table Expression (CTE) to work with the data without creating a permanent table

WITH sales_staging AS (
  SELECT * FROM sales
)
SELECT * FROM sales_staging;
Spinner
DataFrameas
df5
variable
--STEP 3: Data Type CONVERSION
--This process ensures that columns are stored with the appropriate data types (e.g., date as DATE, total as NUMERIC, and payment_fee as NUMERIC). 
--This is essential for accurate calculations and date functions.
--Converting TOTAL AND PAYMENT FEES columns to NUMERIC
--Utilised CAST function


WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
    FROM sales
)
SELECT
    CAST(total AS NUMERIC) AS total_conv,
    CAST(payment_fee AS NUMERIC) AS payment_fee_conv
FROM sales_staging
WHERE rn = 1;
Spinner
DataFrameas
df9
variable
--STEP 4 Identifying DUPLICATES 
--Using SELECT * and a CTE statement

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
    FROM sales
)
SELECT *
FROM sales_staging
WHERE rn > 1;
Spinner
DataFrameas
df10
variable
--STEP 5 Addressing DUPLICATES
--Using a SELECT ONLY UNIQUE ROWS query

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
    FROM sales
)
SELECT
    *
FROM sales_staging
WHERE rn = 1;

STANDARDIZING CATEGORICAL DATA

  • This step involves addressing inconsistencies in text-based columns, such as client_type and product_line, by removing extra spaces (whitespaces) or standardizing casing.
  • Utilised:

    a CTE; TRIM function, CASE statement, WHERE, WHEN, GROUP BY, and ORDER BY clauses, LOWER function

Spinner
DataFrameas
df2
variable
--Example 1: Utilising DISTINCT, TRIM and LOWER functions

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
    FROM sales
)
SELECT
    DISTINCT TRIM(LOWER(client_type))
FROM sales_staging
WHERE rn = 1;
Spinner
DataFrameas
df3
variable
--Example 2: Using a DISTINCT and TRIM functions

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) as rn
    FROM sales
)
SELECT
	DISTINCT TRIM(product_line) 
FROM sales_staging
WHERE rn = 1;
Spinner
DataFrameas
df4
variable
--Example 3: Utilising TRIM function; CASE statement; GROUP BY, ORDER BY, WHEN, and WHERE clauses 

WITH sales_staging AS (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY warehouse, client_type, product_line, quantity ORDER BY warehouse) AS rn
    FROM sales
)
SELECT
    product_line,
    TRIM(TO_CHAR(date, 'Month')) AS month,
    warehouse,
    SUM(total - payment_fee) AS net_revenue
FROM sales_staging
WHERE rn = 1
  AND TRIM(TO_CHAR(date, 'Month')) IN ('June', 'July', 'August')
  AND client_type = 'Wholesale'
GROUP BY
    product_line,
    TRIM(TO_CHAR(date, 'Month')),
    warehouse
ORDER BY
    product_line,
    CASE TRIM(TO_CHAR(date, 'Month'))
        WHEN 'June' THEN 1
        WHEN 'July' THEN 2
        WHEN 'August' THEN 3
        ELSE 4
    END,
    net_revenue DESC;

EXPLORATORY DATA ANALYSIS (EDA)

  • The goal of EDA is to

    Help us understand the main characteristics of the data, Uncover patterns, and Identify anomalies.