Skip to content

πŸ“Œ Project Overview

Title: "SQL Data Cleaning, Maintenance, and Visualization using the LEGO Dataset"

Objective

This project aims to demonstrate data engineering and analytical skills using PostgreSQL. We will:

  • Clean the data – handle missing values, duplicates, and inconsistencies.
  • Optimize and maintain the database – create indexes, use triggers, and implement stored procedures.
  • Perform exploratory data analysis (EDA) – generate reports and insights using SQL queries.
  • Prepare data for visualization – export data for Power BI/Tableau dashboards.

Spinner
Unknown table

πŸ“Œ Step 1: Exploring the Database

🎯 Goal: Before running any query, we need to understand the structure of the dataset.

πŸ” Query Explanation: We'll list all available tables in the database.

Spinner
DataFrameas
df
variable
-- List all tables in the database
SELECT table_name 
FROM information_schema.tables 
WHERE table_schema = 'public';

πŸ“Œ Step 2: Identify Relationships Between Tables

We need to check how tables are connected using foreign keys.

πŸ” Query Explanation This query finds foreign key relationships in the database.

Spinner
DataFrameas
df2
variable
SELECT 
    conrelid::regclass AS table_name, 
    a.attname AS column_name, 
    confrelid::regclass AS referenced_table, 
    af.attname AS referenced_column
FROM pg_constraint c
JOIN pg_attribute a ON a.attnum = ANY(c.conkey) AND a.attrelid = c.conrelid
JOIN pg_attribute af ON af.attnum = ANY(c.confkey) AND af.attrelid = c.confrelid
WHERE c.contype = 'f';

πŸ“Œ Step 3: Exploring the Data

Before cleaning, we need to analyze:

The number of rows in each table. A sample of the data to identify inconsistencies. Potential missing values. πŸ” Query: Count Rows in Each Table 🎯 Goal: Get an overview of how much data we have in each table.

Spinner
DataFrameas
df3
variable
-- Count the number of rows in each table
SELECT 'sets' AS table_name, COUNT(*) AS row_count FROM sets
UNION ALL
SELECT 'themes', COUNT(*) FROM themes
UNION ALL
SELECT 'parts', COUNT(*) FROM parts
UNION ALL
SELECT 'part_categories', COUNT(*) FROM part_categories
UNION ALL
SELECT 'colors', COUNT(*) FROM colors
UNION ALL
SELECT 'inventories', COUNT(*) FROM inventories
UNION ALL
SELECT 'inventory_parts', COUNT(*) FROM inventory_parts
UNION ALL
SELECT 'inventory_sets', COUNT(*) FROM inventory_sets;

πŸ“Œ Step 4: Check for Missing Values

We now check for NULL values in all tables.

🎯 Goal: Identify columns with missing data.

Hidden code df4

πŸ“Œ Step 5: Checking for Duplicate Entries

Since primary keys should be unique, let's verify that:

set_num in sets is unique. id in themes is unique. part_num in parts is unique. id in inventories is unique. πŸ” Query: Find Duplicate Primary Keys

Spinner
DataFrameas
df5
variable
-- Check for duplicate primary keys in major tables
SELECT 'sets' AS table_name, set_num::TEXT AS duplicate_value, COUNT(*) AS count
FROM sets 
GROUP BY set_num 
HAVING COUNT(*) > 1

UNION ALL

SELECT 'themes', id::TEXT, COUNT(*) 
FROM themes 
GROUP BY id 
HAVING COUNT(*) > 1

UNION ALL

SELECT 'parts', part_num::TEXT, COUNT(*) 
FROM parts 
GROUP BY part_num 
HAVING COUNT(*) > 1

UNION ALL

SELECT 'inventories', id::TEXT, COUNT(*) 
FROM inventories 
GROUP BY id 
HAVING COUNT(*) > 1;

πŸ“Œ Next Step: Start Data Analysis

Now that we know the data is clean and structured, we can move to exploratory analysis.

πŸ“Œ What are the Most Popular LEGO Themes?

🎯 Goal: Find the top 10 most released LEGO themes based on the number of sets.

β€Œ
β€Œ
β€Œ