Understanding Lego sets popularity
Now let's now move on to the competition and challenge.
📖 Background
You recently applied to work as a data analyst intern at the famous Lego Group in Denmark. As part of the job interview process, you received the following take-home assignment:
You are asked to use the provided dataset to understand the popularity of different Lego sets and themes. The idea is to become familiarized with the data to be ready for an interview with a business stakeholder.
💾 The data
You received access to a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):
inventory_parts
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "part_num" - unique id for the part (as in the parts table)
- "color_id" - id of the color
- "quantity" - the number of copies of the part included in the set
- "is_spare" - whether or not it is a spare part
parts
- "part_num" - unique id for the part (as in the inventory_parts table)
- "name" - name of the part
- "part_cat_id" - part category id (as in part_catagories table)
part_categories
- "id" - part category id (as in parts table)
- "name" - name of the category the part belongs to
colors
- "id" - id of the color (as in inventory_parts table)
- "name" - color name
- "rgb" - rgb code of the color
- "is_trans" - whether or not the part is transparent/translucent
inventories
- "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
- "version" - version number
- "set_num" - set number (as in sets table)
inventory_sets
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "set_num" - set number (as in sets table)
- "quantity" - the quantity of sets included
sets
- "set_num" - unique set id (as in inventory_sets and inventories tables)
- "name" - the name of the set
- "year" - the year the set was published
- "theme_id" - the id of the theme the set belongs to (as in themes table)
- num-parts - the number of parts in the set
themes
- "id" - the id of the theme (as in the sets table)
- "name" - the name of the theme
- "parent_id" - the id of the larger theme, if there is one
Acknowledgments: Rebrickable.com
1. Calculating the average number of Lego sets released per year
First we will count the sets released each year using the table "sets". Based on that, we can average the result of the query to obtain the average number of sets per year.
The query below shows that approximately 177 sets are released each year.
SELECT avg(COUNT_SETS) AS AVERAGE_NUMBER_OF_SETS_PER_YEAR FROM (
SELECT count(set_num) AS COUNT_SETS, year FROM sets group by year) SUBQ
2. and 3. Calculating the average number of Lego parts per year
From the table "invertory_parts" we can get the quantity of parts using the column "quantity". However, we need to join the table with inventories, and then with sets, to get the information of the year.
The query below shows the average number of parts per year in the datasource between 1950 and 2018 is 29,224 parts per year but the total quantity of parts has greatly increased in the last 15 years.
SELECT AVG(QUANTITY_OF_PARTS) AS AVERAGE_NUMBER_OF_PARTS_PER_YEAR FROM
(SELECT C.year, SUM(A.quantity) AS QUANTITY_OF_PARTS
FROM inventory_parts A
LEFT JOIN inventories B ON B.id = A.inventory_id
LEFT JOIN sets C ON C.set_num = B.set_num
Group by C.year
Order by year) SUBQ
SELECT C.year, SUM(A.quantity) AS QUANTITY_OF_PARTS
FROM inventory_parts A
LEFT JOIN inventories B ON B.id = A.inventory_id
LEFT JOIN sets C ON C.set_num = B.set_num
Group by C.year
Order by year
4. Calculating the most popular colors in LEGO parts
To do this we will assess the most popular colors based on each individual part released. '[No color]' or null color will not be counted. Using the join on the table colors to get the color name, we are able to establish that the top 5 colors is as follows:
- White (4,711 parts)
- Black (4,372 parts)
- Yellow (2,934 parts)
- Red (2,879 parts)
- Blue (1,831 parts)
SELECT A.color_id, B.name, COUNT(DISTINCT A.part_num) AS NUMBER_OF_PARTS
from inventory_parts A
LEFT JOIN colors B ON B.id = A.color_id
WHERE B.name != '[No Color]' AND B.name is not null
GROUP BY A.color_id, B.name
ORDER BY COUNT(DISTINCT A.part_num) desc
LIMIT 5
5. Calculating the propotion of transparent LEGO parts
A very similar query to the previous one enables us to calculate the number of total parts (24,104) and the number of transparent parts, using the is_trans field (1,455), and therefore the proportion, in 1-decimal-rounded percentage of transparent parts: 6.3%.
SELECT COUNT(DISTINCT CASE WHEN B.is_trans = TRUE THEN A.part_num END) AS NUMBER_OF_TRANSPARENT_PARTS
, COUNT(DISTINCT A.part_num ) AS NUMBER_OF_PARTS
, ROUND(100.0 * COUNT(DISTINCT CASE WHEN B.is_trans = TRUE THEN A.part_num END) / COUNT(DISTINCT A.part_num ) ,1) AS PERCENTAGE_OF_TRANSPARENT_PARTS
from inventory_parts A
LEFT JOIN colors B ON B.id = A.color_id
6. Which are the 5 rarest LEGO bricks
A quick assessment of the part_categories table lets us observe that "Bricks" can be identified as a specific category of parts. There are several types of bricks, but if we only keep "Bricks" we can limit our query to the part_cat_id 11. Then we will estimate the rarity by calculating the total quantity among all inventories per part and take the 5 parts which have the lowest quantity. We will link the inventories (and quantities) to the part category using the parts table. The query below shows that the following 5 parts are the rarest:
- Part number 733ex: Brick 10 x 10 without Bottom Tubes, with '+' Cross Support (early Baseplate) (1 part)
- Part number 700e: Brick 10 x 20 without Bottom Tubes, with '+' Cross Support and 4 Side Supports (early Baseplate) (2 parts)
- Part number 3003a: Brick 2 x 2 without Bottom Tubes (2 parts)
- Part number 3009b: Brick 1 x 6 with Bottom Tubes, with 2 Lowered Cross Supports (2 parts)
- Part number 3010a: Brick 1 x 4 with Bottom Tubes, with 1 Lowered Cross Support (2 parts)
SELECT A.part_num, B.name, C.name AS CATEGORY, SUM(A.quantity) AS TOTAL_QUANTITY
FROM inventory_parts A
LEFT JOIN parts B ON B.part_num = A.part_num
INNER JOIN part_categories C ON C.id = B.part_cat_id AND C.id = 11
GROUP BY A.part_num, B.name, C.name
ORDER BY SUM(QUANTITY)
LIMIT 5
7. Summary of the findings
Around 177 sets are released each year, with an average quantity of parts per year of around 29,000 every year since 1950. However, over the last 15-20 years, the quantity of parts released has greatly increased to even over 100,000 parts in 2013-2016. the main colors are not very surprising: white and black are by far the most popular, while yellow, red and blue are also quite popular. Only around 6% of all parts are actually transparent while some blocks are extremely rare and are only featured in 1 or 2 inventories.