Skip to content
0

Introduction

This report presents the results of a data analysis of Lego sets and themes. The analysis was conducted to identify trends and patterns in the data for the business stakeholders. The information on lego sets and its parts was gathered over the course of roughly 70 years.

The Data

The data on Lego sets is organized into 8 tables, as shown below. We can also see below visualization of how the tables are related to each other.

The name of each tables and its attributes was listed with the explanation of each attribute.

inventory_parts
  • "inventory_id" - id of the inventory the part is in (as in the inventories table)
  • "part_num" - unique id for the part (as in the parts table)
  • "color_id" - id of the color
  • "quantity" - the number of copies of the part included in the set
  • "is_spare" - whether or not it is a spare part
parts
  • "part_num" - unique id for the part (as in the inventory_parts table)
  • "name" - name of the part
  • "part_cat_id" - part category id (as in part_catagories table)
part_categories
  • "id" - part category id (as in parts table)
  • "name" - name of the category the part belongs to
colors
  • "id" - id of the color (as in inventory_parts table)
  • "name" - color name
  • "rgb" - rgb code of the color
  • "is_trans" - whether or not the part is transparent/translucent
inventories
  • "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
  • "version" - version number
  • "set_num" - set number (as in sets table)
inventory_sets
  • "inventory_id" - id of the inventory the part is in (as in the inventories table)
  • "set_num" - set number (as in sets table)
  • "quantity" - the quantity of sets included
sets
  • "set_num" - unique set id (as in inventory_sets and inventories tables)
  • "name" - the name of the set
  • "year" - the year the set was published
  • "theme_id" - the id of the theme the set belongs to (as in themes table)
  • num-parts - the number of parts in the set
themes
  • "id" - the id of the theme (as in the sets table)
  • "name" - the name of the theme
  • "parent_id" - the id of the larger theme, if there is one

Acknowledgments: Rebrickable.com

Data Validation

Sets Table

The sets table contains of 11673 rows and does not have any missing and duplicated values. In year column, the minimum value is 195, and the maximum value is 2017. In the table, there are 10370 distinct set names.

Parts Table

The parts table contains of 25993 rows and does not have any missing and duplicated values. In the table, there are 25779 distinct part names.

Part Categories Table

The parts table contains of 57 rows and does not have any missing and duplicated values. In the table, there are 57 distinct part category names.

Colors Table

The parts table contains of 135 rows and does not have any missing and duplicated values. In the table, there are 135 distinct color names.There are 28 translucent hues in all.

Themes Table

The parts table contains of 614 rows and does not have any missing and duplicated values. In the table, there are 402 themes name.

Other Tables

The junction tables that connect the other two tables are inventory parts, inventory sets, and inventory sets.

Spinner
DataFrameas
df
variable
select *
from themes
df["name"].nunique()
df.info()

What is the average number of Lego sets released per year?

The average number of lego sets released per year is 176.86. The standard deviation is equal to 189.20. Due to high standard deviation, the number of lego sets released per year is more spread out. We also show the median with boxplot and it is equal to 88.5

Spinner
DataFrameas
df
variable
WITH
   countOfSetsPerYear AS (
select year, count(*) as totalCount
from sets
group by year
order by year ASC
)
select STDDEV(totalcount) as standard_deviation,avg(totalcount) as average
from countOfSetsPerYear
Spinner
DataFrameas
df
variable
select year, count(*) as totalCount
from sets
group by year
order by year ASC
df["totalcount"].median()
df.boxplot(column=['totalcount'],showmeans=True)