
1. Introduction

Everyone loves Lego (unless you have ever stepped on a brick). Did you know, by the way, that "Lego" is derived from the Danish phrase leg godt, which means "play well"? Unless you speak Danish, probably not.

In this project, we will analyze a fascinating dataset on every single Lego block that has ever been built!


2. Reading Data

Rebrickable provides a comprehensive database of Lego blocks. The data is available as CSV files, and the schema is shown below.

[Figure: schema of the Rebrickable Lego database]

Let us start by reading in the colors data to get a sense of the diversity of Lego sets!

import pandas as pd

# Read the CSV file into a DataFrame
colors = pd.read_csv('datasets/colors.csv')

# Inspect the first five rows of the DataFrame
print(colors.head())
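
Before exploring further, it can help to check the shape and column types of what was loaded. The sketch below uses a small in-memory sample instead of the real file; the three rows and the 't'/'f' encoding of is_trans are assumptions made purely for illustration, while the column names follow those used later in this project.

```python
import io
import pandas as pd

# Hypothetical sample rows mimicking the assumed layout of colors.csv
# (columns id, name, rgb, is_trans); the values are illustrative only.
sample_csv = io.StringIO(
    "id,name,rgb,is_trans\n"
    "0,Black,05131D,f\n"
    "1,Blue,0055BF,f\n"
    "36,Trans-Red,C91A09,t\n"
)

# dtype=str on rgb is defensive: it guarantees that hex codes stay strings
# even if every value in the column happened to look numeric.
sample_colors = pd.read_csv(sample_csv, dtype={"rgb": str})

# Number of rows and columns, then the inferred type of each column
print(sample_colors.shape)
print(sample_colors.dtypes)
```

The same two print calls can be run on the real colors DataFrame to confirm that the file was parsed as expected.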

3. Exploring Colors

Now that we have read the colors data, we can start exploring it! Let us start by understanding the number of colors available.

colors = pd.read_csv('datasets/colors.csv')

# Find the number of distinct colors
num_colors = colors['name'].nunique()

# Print the number of distinct colors
print(num_colors)
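
Note that nunique() counts distinct non-missing values, which is not necessarily the same as the number of rows. The toy DataFrame below is hypothetical and exists only to show the difference; whether the real colors data contains duplicate or missing names is not something this project states.

```python
import pandas as pd

# Hypothetical toy data: one duplicated name and one missing name
toy_colors = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "name": ["Black", "Blue", "Blue", None],
})

# Total number of rows
n_rows = len(toy_colors)

# Distinct names, ignoring missing values (the default behaviour)
n_names = toy_colors["name"].nunique()

# Distinct names, counting the missing value as its own category
n_names_with_na = toy_colors["name"].nunique(dropna=False)

print(n_rows, n_names, n_names_with_na)
```

If the row count and the distinct-name count differ on the real data, that is a hint to look for duplicates or gaps before trusting the total.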

4. Transparent Colors in Lego Sets

The colors data has a column named is_trans that indicates whether a color is transparent or not. It would be interesting to explore the distribution of transparent vs. non-transparent colors.

import pandas as pd

# Read the csv file located in the path 'datasets/colors.csv' into a DataFrame named colors
colors = pd.read_csv('datasets/colors.csv')

# Summarize colors based on their transparency
colors_summary = colors.groupby('is_trans').agg({
    'id': 'nunique',
    'name': 'nunique',
    'rgb': 'nunique'
})

# agg() already names the columns id, name, and rgb; set them explicitly to match the expected output
colors_summary.columns = ['id', 'name', 'rgb']

# Print out colors_summary
print(colors_summary)
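
For a quick view of the split itself, value_counts() is a possible alternative to the grouped summary. This is only a sketch on hypothetical toy data, and it assumes is_trans is encoded as 't' and 'f'; it counts rows rather than distinct values, so it matches the summary above only when each row is a distinct color.

```python
import pandas as pd

# Hypothetical toy data: three opaque colors and one transparent color
toy_colors = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "is_trans": ["f", "f", "f", "t"],
})

# Absolute number of rows per transparency flag
trans_counts = toy_colors["is_trans"].value_counts()

# Same split expressed as a fraction of all rows
trans_shares = toy_colors["is_trans"].value_counts(normalize=True)

print(trans_counts)
print(trans_shares)
```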

5. Explore Lego Sets

Another interesting dataset available in this database is the sets data. It contains a comprehensive list of sets over the years and the number of parts that each of these sets contained.


Let us use this data to explore how the average number of parts in Lego sets has varied over the years.



# Read the CSV file at 'datasets/sets.csv' into a DataFrame named sets
sets = pd.read_csv('datasets/sets.csv')

# Create a summary of the average number of parts per year
parts_by_year = sets.groupby('year')['num_parts'].mean().reset_index()

# Rename the columns for clarity
parts_by_year.columns = ['year', 'average_num_parts']

# Print out the first few rows of parts_by_year
print(parts_by_year.head())
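
A yearly average can be misleading when a year contains only a handful of sets. One way to keep that in view, sketched here on hypothetical toy data, is to use pandas named aggregation so the number of sets appears next to each mean. The column name num_sets is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy data; only the year and num_parts columns are used
toy_sets = pd.DataFrame({
    "year": [1950, 1950, 1951],
    "num_parts": [10, 20, 30],
})

# Named aggregation: the mean alongside the number of sets behind each mean
toy_parts_by_year = (
    toy_sets.groupby("year")
    .agg(
        average_num_parts=("num_parts", "mean"),
        num_sets=("num_parts", "count"),
    )
    .reset_index()
)

print(toy_parts_by_year)
```

Named aggregation also sets the output column names directly, so no separate renaming step is needed.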

6. Lego Themes Over Years

Lego blocks ship under multiple themes. Let us try to get a sense of how the number of themes shipped has varied over the years.

themes_by_year = sets.groupby('year')['theme_id'].nunique().reset_index()

# Rename the columns for clarity
themes_by_year.columns = ['year', 'num_distinct_themes']

# Print the first five rows of themes_by_year
print(themes_by_year.head())
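
Once the per-year counts exist, a natural follow-up is to find the year with the most distinct themes. The sketch below does this with idxmax() on a hypothetical toy version of the table; the numbers are made up and say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy version of themes_by_year
toy_themes_by_year = pd.DataFrame({
    "year": [1998, 1999, 2000],
    "num_distinct_themes": [3, 5, 4],
})

# idxmax() returns the index label of the largest value;
# .loc then retrieves the full row for that label
peak_row = toy_themes_by_year.loc[
    toy_themes_by_year["num_distinct_themes"].idxmax()
]

print(peak_row["year"], peak_row["num_distinct_themes"])
```

If several years tie for the maximum, idxmax() returns the first one it encounters.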

7. Wrapping It All Up!

Lego blocks offer an unlimited amount of fun across ages. We explored some interesting trends around colors, parts, and themes. Before we wrap up, let's take a closer look at the themes_by_year DataFrame you created in the previous step.

# Filter the themes_by_year DataFrame for the year 1999 and assign the number of unique themes to num_themes
num_themes = themes_by_year[themes_by_year['year'] == 1999]['num_distinct_themes'].iloc[0]

# Print num_themes
print(num_themes)
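
The lookup above raises an IndexError if the requested year is absent, because .iloc[0] is applied to an empty selection. A more forgiving variant, sketched here on hypothetical toy data, indexes the counts by year and uses Series.get(), which returns None for a missing label.

```python
import pandas as pd

# Hypothetical toy version of themes_by_year
toy_themes_by_year = pd.DataFrame({
    "year": [1998, 1999, 2000],
    "num_distinct_themes": [3, 5, 4],
})

# Turn the table into a Series keyed by year for label-based lookup
themes_lookup = toy_themes_by_year.set_index("year")["num_distinct_themes"]

# Present year: returns the stored count
num_themes_1999 = themes_lookup.get(1999)

# Absent year: returns None instead of raising an error
num_themes_1900 = themes_lookup.get(1900)

print(num_themes_1999, num_themes_1900)
```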