Exploring the Popularity of Lego Sets and its Parts

2 hidden cells

📖 Background

The Lego Group, a renowned toy manufacturing company located in Denmark, is seeking a talented Data Analyst Intern. As part of the job interview process, applicants are required to complete a take-home assignment designed to assess their ability to analyze and understand data.

The assignment involves using a provided dataset to examine the popularity of various Lego sets and themes. This assignment serves as an introduction to the data and is intended to prepare the candidate for an interview with a business stakeholder.

The ultimate goal of this assignment is to gain insights into the Lego market and identify trends and patterns in the popularity of different Lego sets and themes. The successful completion of this assignment will demonstrate the candidate's ability to work with data, make informed decisions, and effectively communicate their findings.

🎯 Goals & Objectives

Create a report to summarize your findings. Include:

What is the average number of Lego sets released per year?
What is the average number of Lego parts per year?
Create a visualization for item 2.
What are the 5 most popular colors used in Lego parts?
[Optional] What proportion of Lego parts are transparent?
[Optional] What are the 5 rarest lego bricks?
Summarize your findings.

💾 The Data

You received access to a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):

inventory_parts

"inventory_id" - id of the inventory the part is in (as in the inventories table)
"part_num" - unique id for the part (as in the parts table)
"color_id" - id of the color
"quantity" - the number of copies of the part included in the set
"is_spare" - whether or not it is a spare part

parts

"part_num" - unique id for the part (as in the inventory_parts table)
"name" - name of the part
"part_cat_id" - part category id (as in part_catagories table)

part_categories

"id" - part category id (as in parts table)
"name" - name of the category the part belongs to

colors

"id" - id of the color (as in inventory_parts table)
"name" - color name
"rgb" - rgb code of the color
"is_trans" - whether or not the part is transparent/translucent

inventories

"id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
"version" - version number
"set_num" - set number (as in sets table)

inventory_sets

"inventory_id" - id of the inventory the part is in (as in the inventories table)
"set_num" - set number (as in sets table)
"quantity" - the quantity of sets included

sets

"set_num" - unique set id (as in inventory_sets and inventories tables)
"name" - the name of the set
"year" - the year the set was published
"theme_id" - the id of the theme the set belongs to (as in themes table)
num-parts - the number of parts in the set

themes

"id" - the id of the theme (as in the sets table)
"name" - the name of the theme
"parent_id" - the id of the larger theme, if there is one

🧬 Data Schema

Notes for the schema:

The data scheme we're working with is quite complex, with sets of data that can't be easily combined into a single table. To help us better understand the whole dataset, let's group the tables based on their relationships into three categories: outermost, middle, and deepest (not as a lookup table).

category	tables	explanation
outermost	colors, part_categories, themes	First, let's take a look at the `outermost` tables. These are the tables that stand alone and don't require any lookup to other tables. They are independent and provide a good starting point for understanding the data scheme.
middle	sets, inventories, parts	Next, we have the `middle` tables. These tables are somewhat connected to the outermost tables, but still require some lookup to fully understand their relationships.
deepest	inventory_sets, inventory_parts	Finally, we have the `deepest` tables. These tables are the most interconnected and require the most lookup to fully understand their relationships with the other tables.

Understanding the data scheme can be challenging, but by grouping the tables into these categories, it becomes much easier to navigate and make sense of the whole dataset.

Executive Summary

The average number of sets releases per year from 1975 until 2017 is sets
The average number of lego parts per year from 1975 until 2017 is parts. Below is the line plot:
The 5 Most popular colors used in LEGO parts from most to least are Black, White, Light Bluish Gray, Red, Dark Bluish Gray
The proportion of Lego parts are transparent is only about % from overall parts (/)
The 5 rarest LEGO brick from most to least are
- Duplo Support Brick 2 x 4 x 3,
- Brick 2 x 2 No Studs, Sloped with 3 Side Pistons Raised and '30', Silver Stars Print [4591]
- Panel 4 x 6 x 6 Sloped with Amazon Brick and Vines Print
- Brick, Modified 2 x 2 No Studs, Sloped with ... (continious with variation),
- Slope Brick (45°, 33, 18, or 30) (2 x 2, 4 x 2, 3 x 4, 1 x 2 x 2/3, or 3 x 6) with ... (continious with variation)

Methodology

Cleaning error data for each table (early cleaning)
Defining what are LEGO sets and then reprocessing the dataset that can represent LEGO sets because the "Lego sets" table is not available. The main table used is the sets table.
Aggregate the LEGO sets data by year to measure the average number of sets released per year and make it visual using a line plot.
Defining what are LEGO parts and then reprocessing the dataset that can represent LEGO parts because the "lego parts" table is not available. The main table used is the parts table.
Aggregate the LEGO parts data by year to measure the average number of parts released per year and make it visual using a line plot.
Group the LEGO parts data by color to measure the total parts per color based on part release that released alongside its set
Group the LEgO parts data by its id and name to measure the quantity of the part based on the part release released alongside its set

1. Explanatory Data Analysis

1.1 Early Data Cleaning

Process	Outline
Missing Value	The null value is found in themes table, etc. Not all null value must be replaced, removed, or ignored depends on the analysis process
Duplicate value/record	We can check all tables but it will take so much time and so much space. We should check when it is needed to achieve our goal or find something abnormal in the analysis process
Incorrect value	The incorrect value is found in sets table num_parts column

Hidden code df

1.1.1 Abnormal Value in sets Table num_parts column

It appears that the num_parts column is meant to record the number of components in each set. However, There are some unusual values such as 0 and -1 in this column. These values do not make sense as it is impossible for a set to have a negative or zero number of components. These strange values could potentially impact our analysis, so it is important to handle them with caution.

So, what to do about it? remove, replace, or ignore the values?

To remove a value from the column, one option is to delete the entire row. However, it's important to consider the other values in the row as well. The num_parts column may be incorrectly recorded as zero, null, or a negative number, but the values in other columns could still be accurate.
How can we determine the appropriate value to use when replacing the current values? It's challenging to identify the correct replacement because it must be calculated precisely and we can't rely on estimates or other techniques like using the mean value.
The best way to treat the values is to ignore it because based on the objective in this analysis, the number of parts is not used but instead, the count of the rows is used, so to throw away that row, is like throwing away valuable information. But when we want to calculate descriptive statistic, we should remove the values.

Here below is the rows with abnormal value:

Hidden code df

1.2 Lego Sets

1.2.1 Defining term of Lego Sets

The datasets have no table or column named "Lego sets". So first we need to define what Lego sets are for our analysis. There are two tables (sets and inventory_sets) that contain "sets" in their name but it is not enough information. The best way to find out is to search the table information in the source of the datasets - rebrickable.com. Based on datasets source, sets consist of parts that make up things like houses, animals, people, land, people riding in cars, and so on. While inventory_sets record how many sets are in the inventory.

1.2.2 The Data

So, the sets table will be used to represent Lego Sets. But the sets table needs to be merged with the themes table to add necessary information about the theme of the sets. Below is the combined tables (only displays 100 of 11,673 rows).

‌
‌
‌

Exploring the Popularity of Lego Sets and its Parts

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Exploring the Popularity of Lego Sets and its Parts

📖 Background

🎯 Goals & Objectives

💾 The Data

inventory_parts

parts

part_categories

colors

inventories

inventory_sets

sets

themes

🧬 Data Schema

Executive Summary

Methodology

1. Explanatory Data Analysis

1.1 Early Data Cleaning

1.1.1 Abnormal Value in sets Table num_parts column

1.2 Lego Sets

1.2.1 Defining term of Lego Sets

1.2.2 The Data

Exploring the Popularity of Lego Sets and its Parts