Skip to content
0

Understanding Lego sets popularity

📖 Background

You recently applied to work as a data analyst intern at the famous Lego Group in Denmark. As part of the job interview process, you received the following take-home assignment:

You are asked to use the provided dataset to understand the popularity of different Lego sets and themes. The idea is to become familiarized with the data to be ready for an interview with a business stakeholder.

💾 The data

You received access to a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):
inventory_parts
  • "inventory_id" - id of the inventory the part is in (as in the inventories table)
  • "part_num" - unique id for the part (as in the parts table)
  • "color_id" - id of the color
  • "quantity" - the number of copies of the part included in the set
  • "is_spare" - whether or not it is a spare part
parts
  • "part_num" - unique id for the part (as in the inventory_parts table)
  • "name" - name of the part
  • "part_cat_id" - part category id (as in part_catagories table)
part_categories
  • "id" - part category id (as in parts table)
  • "name" - name of the category the part belongs to
colors
  • "id" - id of the color (as in inventory_parts table)
  • "name" - color name
  • "rgb" - rgb code of the color
  • "is_trans" - whether or not the part is transparent/translucent
inventories
  • "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
  • "version" - version number
  • "set_num" - set number (as in sets table)
inventory_sets
  • "inventory_id" - id of the inventory the part is in (as in the inventories table)
  • "set_num" - set number (as in sets table)
  • "quantity" - the quantity of sets included
sets
  • "set_num" - unique set id (as in inventory_sets and inventories tables)
  • "name" - the name of the set
  • "year" - the year the set was published
  • "theme_id" - the id of the theme the set belongs to (as in themes table)
  • num-parts - the number of parts in the set
themes
  • "id" - the id of the theme (as in the sets table)
  • "name" - the name of the theme
  • "parent_id" - the id of the larger theme, if there is one

Acknowledgments: Rebrickable.com

💪 Challenge

Create a report to summarize your findings. Include:

  1. What is the average number of Lego sets released per year?
  2. What is the average number of Lego parts per year?
  3. Create a visualization for item 2.
  4. What are the 5 most popular colors used in Lego parts?
  5. [Optional] What proportion of Lego parts are transparent?
  6. [Optional] What are the 5 rarest lego bricks?
  7. Summarize your findings.

⌛️ Time is ticking. Good luck!

Spinner
DataFrameas
df
variable
-- Compute the average number of Lego sets released per year
SELECT 
    ROUND(AVG(sets_released),0) AS avg_sets_released_per_year
FROM (
    SELECT 
        year,
        COUNT(set_num) AS sets_released
    FROM sets
    GROUP BY year) sets_per_year

-- Result: 177
Spinner
DataFrameas
df
variable
-- Compute the average number of Lego parts released per year
SELECT 
    ROUND(AVG(parts_released),0) AS avg_parts_released_per_year
FROM (
    SELECT 
        year,
        SUM(num_parts) AS parts_released
    FROM sets
    GROUP BY year) parts_per_year

-- Result: 28698
Spinner
DataFrameas
df_released
variable
-- Prepare dataset for visualization
SELECT 
    year,
    COUNT(set_num) AS sets_released,
    SUM(num_parts) AS parts_released
FROM sets
GROUP BY year
ORDER BY year ASC
# visualize parts released

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

fig,ax = plt.subplots()
# plot data
ax.plot(df_released["year"], df_released["parts_released"], marker="*")
ax.axhline(df_released["parts_released"].mean(), linestyle="--")
# style the figure and add information
ax.set_title("Lego parts released per year")
ax.set_xlabel("Year")
ax.set_ylabel("Parts released")
ax.annotate("Avg parts released per year", xy=(1950, df_released["parts_released"].mean()+5000))
ax.grid(linewidth=0.3)

plt.show()

Spinner
DataFrameas
df
variable
-- identify popular colors for parts by counting unique part types in inventory
SELECT
    c.name AS color_name,
    COUNT(DISTINCT ip.part_num) AS parts_count,
    COUNT(DISTINCT CASE WHEN NOT ip.is_spare THEN ip.part_num END) AS parts_count_exc_spare_parts
FROM inventory_parts ip
LEFT JOIN colors c
ON ip.color_id = c.id
GROUP BY c.name
ORDER BY parts_count DESC
LIMIT 10

-- White is the most popular color according to this ranking criterion
Spinner
DataFrameas
df
variable
-- identify popular colors for parts by quantity of parts in inventory
SELECT
    c.name AS color_name,
    SUM(ip.quantity) AS parts_quantity,
    SUM(CASE WHEN NOT ip.is_spare THEN ip.quantity ELSE 0 END) AS parts_quantity_exc_spare_parts
FROM inventory_parts ip
LEFT JOIN colors c
ON ip.color_id = c.id
GROUP BY c.name
ORDER BY parts_quantity DESC
LIMIT 10

-- Black is the most popular color according to this ranking criterion
Spinner
DataFrameas
df
variable
-- compute proportion of transparent parts in the inventory (by quantity only)
SELECT
    ROUND((SUM(CASE 
               WHEN c.is_trans 
                   THEN ip.quantity 
               ELSE 0 END)::numeric / SUM(ip.quantity)),4) * 100 AS pct_trans_parts,
    ROUND((SUM(CASE 
               WHEN c.is_trans AND NOT ip.is_spare 
                   THEN ip.quantity 
               ELSE 0 END)::numeric / SUM(ip.quantity)),4) * 100 AS pct_trans_parts_exc_spare_parts
FROM inventory_parts ip
LEFT JOIN colors c
ON ip.color_id = c.id

-- Results: 4.94% including spare parts; 4.57% excluding spare parts
Spinner
DataFrameas
df
variable
-- Identify rarest lego bricks in the inventory
SELECT
    pc.name AS brick_category,
    SUM(ip.quantity) AS parts_quantity
FROM inventory_parts ip
LEFT JOIN parts p
ON ip.part_num = p.part_num
LEFT JOIN part_categories pc
ON p.part_cat_id = pc.id
WHERE LOWER(pc.name) LIKE '%brick%'
GROUP BY pc.name
ORDER BY parts_quantity ASC

-- Result: Bricks Printed

Summary

  • On average, Lego released 177 sets and 28,698 parts per year from 1950 to 2017.
  • The number of parts released has been exponentially increasing during those years, reaching a peak of 151k in 2016.
  • Looking at the current inventory, the highest number of unique parts are white (4,711 inc. spare parts), while the highest quantity of parts are black (396,416 units inc. spare parts).
  • 4.94% of the total parts quantity is transparent (inc. spare parts).
  • "Bricks Printed" is the rarest category of bricks in the inventory with only 4,580 units left.