Skip to content
Competition - Lego sets
  • AI Chat
  • Code
  • Report
  • Understanding Lego sets popularity

    📖 Background

    We are going use the provided dataset to understand the popularity of different Lego sets and themes. The idea is to become familiarized with the data to be ready for an interview with a business stakeholder.

    We will be utilizing PostgreSQL queries to perform data analysis and gain valuable insights from our data set. This will enable us to make informed decisions and improve our operations.

    💾 The data

    We got a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):
    inventory_parts
    • "inventory_id" - id of the inventory the part is in (as in the inventories table)
    • "part_num" - unique id for the part (as in the parts table)
    • "color_id" - id of the color
    • "quantity" - the number of copies of the part included in the set
    • "is_spare" - whether or not it is a spare part
    parts
    • "part_num" - unique id for the part (as in the inventory_parts table)
    • "name" - name of the part
    • "part_cat_id" - part category id (as in part_catagories table)
    part_categories
    • "id" - part category id (as in parts table)
    • "name" - name of the category the part belongs to
    colors
    • "id" - id of the color (as in inventory_parts table)
    • "name" - color name
    • "rgb" - rgb code of the color
    • "is_trans" - whether or not the part is transparent/translucent
    inventories
    • "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
    • "version" - version number
    • "set_num" - set number (as in sets table)
    inventory_sets
    • "inventory_id" - id of the inventory the part is in (as in the inventories table)
    • "set_num" - set number (as in sets table)
    • "quantity" - the quantity of sets included
    sets
    • "set_num" - unique set id (as in inventory_sets and inventories tables)
    • "name" - the name of the set
    • "year" - the year the set was published
    • "theme_id" - the id of the theme the set belongs to (as in themes table)
    • "num_parts" - the number of parts in the set
    themes
    • "id" - the id of the theme (as in the sets table)
    • "name" - the name of the theme
    • "parent_id" - the id of the larger theme, if there is one

    Acknowledgments: Rebrickable.com

    1. What is the average number of Lego sets released per year?

    Unknown integration
    DataFrameavailable as
    df
    variable
    WITH set_count_per_year AS (
        SELECT year, COUNT(set_num) AS set_count
        FROM sets
        GROUP BY year
    )
    SELECT AVG(set_count) AS avg_sets_per_year
    FROM set_count_per_year;
    
    /* This query first uses a Common Table Expression (CTE) 
    to group the sets by year and count the number of sets per year. 
    Then it uses the AVG function to calculate the average number of 
    sets across all the years. */
    
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    2. What is the average number of Lego parts per year?

    Unknown integration
    DataFrameavailable as
    avg_num_parts
    variable
    SELECT 
        year, 
        AVG(num_parts) as avg_num_parts 
    FROM sets 
    GROUP BY year
    ORDER BY AVG(num_parts) DESC, year;
    
    /* The query will return the average number of Lego parts 
    for each year and it will be ordered by the average number of parts 
    in descending order and then by year. */
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    3. Create a visualization for item 2.

    Current Type: Bar
    Current X-axis: year
    Current Y-axis: avg_num_parts
    Current Color: None

    The average number of Lego parts per year

    4. What are the 5 most popular colors used in Lego parts?

    Unknown integration
    DataFrameavailable as
    popular_colors
    variable
    WITH color_count AS
      (SELECT color_id,
              COUNT(color_id) AS color_count
       FROM inventory_parts
       GROUP BY color_id)
    SELECT c.name,
           cc.color_count
    FROM color_count cc
    JOIN colors c ON cc.color_id = c.id
    ORDER BY cc.color_count DESC
    LIMIT 5;
    
    /*   This query first uses a Common Table Expression (CTE) to group the parts by color_id 
    and count the number of parts per color. Then it joins the color_count CTE with the colors table 
    to retrieve the name of the color. Finally, it sorts the results by color_count in descending order 
    and limits the results to the top 5. This query will give us the 5 most popular colors 
    and the count of parts used in that color. */
    
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    5. What proportion of Lego parts are transparent?

    Unknown integration
    DataFrameavailable as
    df
    variable
    WITH transparent_parts AS (
        SELECT COUNT(DISTINCT ip.part_num)::float as transparent_parts
        FROM inventory_parts ip
        LEFT JOIN colors c ON ip.color_id = c.id
        WHERE c.is_trans = true
    ),
    all_parts AS (
        SELECT COUNT(DISTINCT part_num)::float as all_parts
        FROM inventory_parts
    )
    SELECT (transparent_parts/all_parts * 100)::numeric(5,1) || '%' as percentage_transparent
    FROM transparent_parts, all_parts;
    
    /*     This query first uses two Common Table Expressions (CTEs) to count the number of distinct parts 
    that are transparent and the number of all distinct parts. By using the '::float' operator, it ensures 
    that the columns data type are converted to float. Then it selects the proportion of transparent parts 
    by dividing the number of transparent parts by the total number of parts, then it multiplies it by 100, 
    rounds it to 1 decimal place and concatenates the percent sign. */
    
    
    
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.