Exploring the Popularity of Lego Sets and its Parts
Picture by Omar Albeik
2 hidden cells
📖 Background
The Lego Group, a renowned toy manufacturing company located in Denmark, is seeking a talented Data Analyst Intern. As part of the job interview process, applicants are required to complete a take-home assignment designed to assess their ability to analyze and understand data.
The assignment involves using a provided dataset to examine the popularity of various Lego sets and themes. This assignment serves as an introduction to the data and is intended to prepare the candidate for an interview with a business stakeholder.
The ultimate goal of this assignment is to gain insights into the Lego market and identify trends and patterns in the popularity of different Lego sets and themes. The successful completion of this assignment will demonstrate the candidate's ability to work with data, make informed decisions, and effectively communicate their findings.
🎯 Goals & Objectives
Create a report to summarize your findings. Include:
- What is the average number of Lego sets released per year?
- What is the average number of Lego parts per year?
- Create a visualization for item 2.
- What are the 5 most popular colors used in Lego parts?
- [Optional] What proportion of Lego parts are transparent?
- [Optional] What are the 5 rarest lego bricks?
- Summarize your findings.
💾 The Data
You received access to a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):
inventory_parts
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "part_num" - unique id for the part (as in the parts table)
- "color_id" - id of the color
- "quantity" - the number of copies of the part included in the set
- "is_spare" - whether or not it is a spare part
parts
- "part_num" - unique id for the part (as in the inventory_parts table)
- "name" - name of the part
- "part_cat_id" - part category id (as in part_catagories table)
part_categories
- "id" - part category id (as in parts table)
- "name" - name of the category the part belongs to
colors
- "id" - id of the color (as in inventory_parts table)
- "name" - color name
- "rgb" - rgb code of the color
- "is_trans" - whether or not the part is transparent/translucent
inventories
- "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
- "version" - version number
- "set_num" - set number (as in sets table)
inventory_sets
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "set_num" - set number (as in sets table)
- "quantity" - the quantity of sets included
sets
- "set_num" - unique set id (as in inventory_sets and inventories tables)
- "name" - the name of the set
- "year" - the year the set was published
- "theme_id" - the id of the theme the set belongs to (as in themes table)
- num-parts - the number of parts in the set
themes
- "id" - the id of the theme (as in the sets table)
- "name" - the name of the theme
- "parent_id" - the id of the larger theme, if there is one
🧬 Data Schema
Notes for the schema:
The data scheme we're working with is quite complex, with sets of data that can't be easily combined into a single table. To help us better understand the whole dataset, let's group the tables based on their relationships into three categories: outermost, middle, and deepest (not as a lookup table).
| category | tables | explanation |
|---|---|---|
| outermost | colors, part_categories, themes | First, let's take a look at the outermost tables. These are the tables that stand alone and don't require any lookup to other tables. They are independent and provide a good starting point for understanding the data scheme. |
| middle | sets, inventories, parts | Next, we have the middle tables. These tables are somewhat connected to the outermost tables, but still require some lookup to fully understand their relationships. |
| deepest | inventory_sets, inventory_parts | Finally, we have the deepest tables. These tables are the most interconnected and require the most lookup to fully understand their relationships with the other tables. |
Understanding the data scheme can be challenging, but by grouping the tables into these categories, it becomes much easier to navigate and make sense of the whole dataset.
Executive Summary
- The average number of sets releases per year from 1975 until 2017 is
sets - The average number of lego parts per year from 1975 until 2017 is
parts. Below is the line plot: - The 5 Most popular colors used in LEGO parts from most to least are Black, White, Light Bluish Gray, Red, Dark Bluish Gray
- The proportion of Lego parts are transparent is only about
% from overall parts ( / ) - The 5 rarest LEGO brick from most to least are
- Duplo Support Brick 2 x 4 x 3,
- Brick 2 x 2 No Studs, Sloped with 3 Side Pistons Raised and '30', Silver Stars Print [4591]
- Panel 4 x 6 x 6 Sloped with Amazon Brick and Vines Print
- Brick, Modified 2 x 2 No Studs, Sloped with ... (continious with variation),
- Slope Brick (45°, 33, 18, or 30) (2 x 2, 4 x 2, 3 x 4, 1 x 2 x 2/3, or 3 x 6) with ... (continious with variation)
Methodology
- Cleaning error data for each table (early cleaning)
- Defining what are LEGO sets and then reprocessing the dataset that can represent LEGO sets because the "Lego sets" table is not available. The main table used is the sets table.
- Aggregate the LEGO sets data by year to measure the average number of sets released per year and make it visual using a line plot.
- Defining what are LEGO parts and then reprocessing the dataset that can represent LEGO parts because the "lego parts" table is not available. The main table used is the parts table.
- Aggregate the LEGO parts data by year to measure the average number of parts released per year and make it visual using a line plot.
- Group the LEGO parts data by color to measure the total parts per color based on part release that released alongside its set
- Group the LEgO parts data by its id and name to measure the quantity of the part based on the part release released alongside its set
1. Explanatory Data Analysis
1.1 Early Data Cleaning
| Process | Outline |
|---|---|
| Missing Value | The null value is found in themes table, etc. Not all null value must be replaced, removed, or ignored depends on the analysis process |
| Duplicate value/record | We can check all tables but it will take so much time and so much space. We should check when it is needed to achieve our goal or find something abnormal in the analysis process |
| Incorrect value | The incorrect value is found in sets table num_parts column |
df1.1.1 Abnormal Value in sets Table num_parts column
It appears that the num_parts column is meant to record the number of components in each set. However, There are some unusual values such as 0 and -1 in this column. These values do not make sense as it is impossible for a set to have a negative or zero number of components. These strange values could potentially impact our analysis, so it is important to handle them with caution.
So, what to do about it? remove, replace, or ignore the values?
- To remove a value from the column, one option is to delete the entire row. However, it's important to consider the other values in the row as well. The num_parts column may be incorrectly recorded as zero, null, or a negative number, but the values in other columns could still be accurate.
- How can we determine the appropriate value to use when replacing the current values? It's challenging to identify the correct replacement because it must be calculated precisely and we can't rely on estimates or other techniques like using the mean value.
- The best way to treat the values is to ignore it because based on the objective in this analysis, the number of parts is not used but instead, the count of the rows is used, so to throw away that row, is like throwing away valuable information. But when we want to calculate descriptive statistic, we should remove the values.
Here below is the rows with abnormal value:
df1.2 Lego Sets
1.2.1 Defining term of Lego Sets
The datasets have no table or column named "Lego sets". So first we need to define what Lego sets are for our analysis. There are two tables (sets and inventory_sets) that contain "sets" in their name but it is not enough information. The best way to find out is to search the table information in the source of the datasets - rebrickable.com. Based on datasets source, sets consist of parts that make up things like houses, animals, people, land, people riding in cars, and so on. While inventory_sets record how many sets are in the inventory.
1.2.2 The Data
So, the sets table will be used to represent Lego Sets. But the sets table needs to be merged with the themes table to add necessary information about the theme of the sets. Below is the combined tables (only displays 100 of 11,673 rows).