Skip to content
Competition - Lego sets - Exploratory analysis - Thomas KLEIN
0
  • AI Chat
  • Code
  • Report
  • Understanding Lego sets popularity

    Now let's now move on to the competition and challenge.

    📖 Background

    You recently applied to work as a data analyst intern at the famous Lego Group in Denmark. As part of the job interview process, you received the following take-home assignment:

    You are asked to use the provided dataset to understand the popularity of different Lego sets and themes. The idea is to become familiarized with the data to be ready for an interview with a business stakeholder.

    💾 The data

    You received access to a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):
    inventory_parts
    • "inventory_id" - id of the inventory the part is in (as in the inventories table)
    • "part_num" - unique id for the part (as in the parts table)
    • "color_id" - id of the color
    • "quantity" - the number of copies of the part included in the set
    • "is_spare" - whether or not it is a spare part
    parts
    • "part_num" - unique id for the part (as in the inventory_parts table)
    • "name" - name of the part
    • "part_cat_id" - part category id (as in part_catagories table)
    part_categories
    • "id" - part category id (as in parts table)
    • "name" - name of the category the part belongs to
    colors
    • "id" - id of the color (as in inventory_parts table)
    • "name" - color name
    • "rgb" - rgb code of the color
    • "is_trans" - whether or not the part is transparent/translucent
    inventories
    • "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
    • "version" - version number
    • "set_num" - set number (as in sets table)
    inventory_sets
    • "inventory_id" - id of the inventory the part is in (as in the inventories table)
    • "set_num" - set number (as in sets table)
    • "quantity" - the quantity of sets included
    sets
    • "set_num" - unique set id (as in inventory_sets and inventories tables)
    • "name" - the name of the set
    • "year" - the year the set was published
    • "theme_id" - the id of the theme the set belongs to (as in themes table)
    • num-parts - the number of parts in the set
    themes
    • "id" - the id of the theme (as in the sets table)
    • "name" - the name of the theme
    • "parent_id" - the id of the larger theme, if there is one

    Acknowledgments: Rebrickable.com

    1. Calculating the average number of Lego sets released per year

    First we will count the sets released each year using the table "sets". Based on that, we can average the result of the query to obtain the average number of sets per year.

    The query below shows that approximately 177 sets are released each year.

    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT avg(COUNT_SETS) AS AVERAGE_NUMBER_OF_SETS_PER_YEAR FROM (
    SELECT count(set_num) AS COUNT_SETS, year FROM sets group by year) SUBQ

    2. and 3. Calculating the average number of Lego parts per year

    From the table "invertory_parts" we can get the quantity of parts using the column "quantity". However, we need to join the table with inventories, and then with sets, to get the information of the year.

    The query below shows the average number of parts per year in the datasource between 1950 and 2018 is 29,224 parts per year but the total quantity of parts has greatly increased in the last 15 years.

    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT AVG(QUANTITY_OF_PARTS) AS AVERAGE_NUMBER_OF_PARTS_PER_YEAR FROM 
    (SELECT C.year, SUM(A.quantity) AS QUANTITY_OF_PARTS
    FROM inventory_parts A 
    LEFT JOIN inventories B ON B.id = A.inventory_id 
    LEFT JOIN sets C ON C.set_num = B.set_num
    Group by C.year
    Order by year) SUBQ
    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT C.year, SUM(A.quantity) AS QUANTITY_OF_PARTS
    FROM inventory_parts A 
    LEFT JOIN inventories B ON B.id = A.inventory_id 
    LEFT JOIN sets C ON C.set_num = B.set_num
    Group by C.year
    Order by year

    4. Calculating the most popular colors in LEGO parts

    To do this we will assess the most popular colors based on each individual part released. '[No color]' or null color will not be counted. Using the join on the table colors to get the color name, we are able to establish that the top 5 colors is as follows:

    • White (4,711 parts)
    • Black (4,372 parts)
    • Yellow (2,934 parts)
    • Red (2,879 parts)
    • Blue (1,831 parts)
    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT  A.color_id, B.name, COUNT(DISTINCT A.part_num) AS NUMBER_OF_PARTS 
    from  inventory_parts A 
    LEFT JOIN colors B ON B.id = A.color_id
    WHERE B.name != '[No Color]' AND B.name is not null
    GROUP BY A.color_id, B.name
    ORDER BY COUNT(DISTINCT A.part_num) desc
    LIMIT 5

    5. Calculating the propotion of transparent LEGO parts

    A very similar query to the previous one enables us to calculate the number of total parts (24,104) and the number of transparent parts, using the is_trans field (1,455), and therefore the proportion, in 1-decimal-rounded percentage of transparent parts: 6.3%.

    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT  COUNT(DISTINCT CASE WHEN B.is_trans = TRUE THEN A.part_num END) AS NUMBER_OF_TRANSPARENT_PARTS
    , COUNT(DISTINCT  A.part_num ) AS NUMBER_OF_PARTS
    , ROUND(100.0 * COUNT(DISTINCT CASE WHEN B.is_trans = TRUE THEN A.part_num END) / COUNT(DISTINCT  A.part_num ) ,1) AS PERCENTAGE_OF_TRANSPARENT_PARTS
    from  inventory_parts A 
    LEFT JOIN colors B ON B.id = A.color_id

    6. Which are the 5 rarest LEGO bricks

    A quick assessment of the part_categories table lets us observe that "Bricks" can be identified as a specific category of parts. There are several types of bricks, but if we only keep "Bricks" we can limit our query to the part_cat_id 11. Then we will estimate the rarity by calculating the total quantity among all inventories per part and take the 5 parts which have the lowest quantity. We will link the inventories (and quantities) to the part category using the parts table. The query below shows that the following 5 parts are the rarest:

    • Part number 733ex: Brick 10 x 10 without Bottom Tubes, with '+' Cross Support (early Baseplate) (1 part)
    • Part number 700e: Brick 10 x 20 without Bottom Tubes, with '+' Cross Support and 4 Side Supports (early Baseplate) (2 parts)
    • Part number 3003a: Brick 2 x 2 without Bottom Tubes (2 parts)
    • Part number 3009b: Brick 1 x 6 with Bottom Tubes, with 2 Lowered Cross Supports (2 parts)
    • Part number 3010a: Brick 1 x 4 with Bottom Tubes, with 1 Lowered Cross Support (2 parts)
    Unknown integration
    DataFrameavailable as
    df
    variable
    SELECT A.part_num, B.name, C.name AS CATEGORY, SUM(A.quantity) AS TOTAL_QUANTITY
    FROM inventory_parts A
    LEFT JOIN parts B ON B.part_num = A.part_num
    INNER JOIN part_categories C ON C.id = B.part_cat_id AND C.id = 11
    GROUP BY A.part_num, B.name, C.name
    ORDER BY SUM(QUANTITY)
    LIMIT 5

    7. Summary of the findings

    Around 177 sets are released each year, with an average quantity of parts per year of around 29,000 every year since 1950. However, over the last 15-20 years, the quantity of parts released has greatly increased to even over 100,000 parts in 2013-2016. the main colors are not very surprising: white and black are by far the most popular, while yellow, red and blue are also quite popular. Only around 6% of all parts are actually transparent while some blocks are extremely rare and are only featured in 1 or 2 inventories.