π Project Overview
Title: "SQL Data Cleaning, Maintenance, and Visualization using the LEGO Dataset"
Objective
This project aims to demonstrate data engineering and analytical skills using PostgreSQL. We will:
- Clean the data β handle missing values, duplicates, and inconsistencies.
- Optimize and maintain the database β create indexes, use triggers, and implement stored procedures.
- Perform exploratory data analysis (EDA) β generate reports and insights using SQL queries.
- Prepare data for visualization β export data for Power BI/Tableau dashboards.
π Step 1: Exploring the Database
π― Goal: Before running any query, we need to understand the structure of the dataset.
π Query Explanation: We'll list all available tables in the database.
-- List all tables in the database
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';
π Step 2: Identify Relationships Between Tables
We need to check how tables are connected using foreign keys.
π Query Explanation This query finds foreign key relationships in the database.
SELECT
conrelid::regclass AS table_name,
a.attname AS column_name,
confrelid::regclass AS referenced_table,
af.attname AS referenced_column
FROM pg_constraint c
JOIN pg_attribute a ON a.attnum = ANY(c.conkey) AND a.attrelid = c.conrelid
JOIN pg_attribute af ON af.attnum = ANY(c.confkey) AND af.attrelid = c.confrelid
WHERE c.contype = 'f';π Step 3: Exploring the Data
Before cleaning, we need to analyze:
The number of rows in each table. A sample of the data to identify inconsistencies. Potential missing values. π Query: Count Rows in Each Table π― Goal: Get an overview of how much data we have in each table.
-- Count the number of rows in each table
SELECT 'sets' AS table_name, COUNT(*) AS row_count FROM sets
UNION ALL
SELECT 'themes', COUNT(*) FROM themes
UNION ALL
SELECT 'parts', COUNT(*) FROM parts
UNION ALL
SELECT 'part_categories', COUNT(*) FROM part_categories
UNION ALL
SELECT 'colors', COUNT(*) FROM colors
UNION ALL
SELECT 'inventories', COUNT(*) FROM inventories
UNION ALL
SELECT 'inventory_parts', COUNT(*) FROM inventory_parts
UNION ALL
SELECT 'inventory_sets', COUNT(*) FROM inventory_sets;
π Step 4: Check for Missing Values
We now check for NULL values in all tables.
π― Goal: Identify columns with missing data.
df4π Step 5: Checking for Duplicate Entries
Since primary keys should be unique, let's verify that:
set_num in sets is unique. id in themes is unique. part_num in parts is unique. id in inventories is unique. π Query: Find Duplicate Primary Keys
-- Check for duplicate primary keys in major tables
SELECT 'sets' AS table_name, set_num::TEXT AS duplicate_value, COUNT(*) AS count
FROM sets
GROUP BY set_num
HAVING COUNT(*) > 1
UNION ALL
SELECT 'themes', id::TEXT, COUNT(*)
FROM themes
GROUP BY id
HAVING COUNT(*) > 1
UNION ALL
SELECT 'parts', part_num::TEXT, COUNT(*)
FROM parts
GROUP BY part_num
HAVING COUNT(*) > 1
UNION ALL
SELECT 'inventories', id::TEXT, COUNT(*)
FROM inventories
GROUP BY id
HAVING COUNT(*) > 1;
π Next Step: Start Data Analysis
Now that we know the data is clean and structured, we can move to exploratory analysis.
π What are the Most Popular LEGO Themes?
π― Goal: Find the top 10 most released LEGO themes based on the number of sets.
β
β