(Python) Project: Examining the History of Lego Sets
  • Examining the History of Lego Sets

    Use a variety of data manipulation techniques to explore different aspects of Lego's history.

    Lego is a household name across the world, supported by a diverse toy line, hit movies, and a series of successful video games. In this project, we are going to explore a key development in the history of Lego: the introduction of licensed sets such as Star Wars, Super Heroes, and Harry Potter.

    The introduction of its first licensed series, Star Wars, was a hit that sparked a series of collaborations on further themed sets. The partnerships team has asked you to analyze this success. Before diving in, they suggest reading the descriptions of the two datasets, given below.

    The Data

    The Rebrickable dataset includes data on every LEGO set that has ever been sold: the names of the sets, what bricks they contain, and so on. The bricks might be small, but this is big data! In this project, you will use this dataset together with the pandas library to dig into the history of Lego's licensed sets, including uncovering the percentage of all licensed sets that are Star Wars themed.

    You have been provided with two datasets to use. A summary and preview are provided below.

    lego_sets.csv

    Column            Description
    "set_num"         A code that is unique to each set in the dataset. This column is critical, and a missing value indicates the set is a duplicate or invalid!
    "name"            The name of the set.
    "year"            The date the set was released.
    "num_parts"       The number of parts contained in the set. This column is not central to our analyses, so missing values are acceptable.
    "theme_name"      The name of the sub-theme of the set.
    "parent_theme"    The name of the parent theme the set belongs to. Matches the name column of the parent_themes csv file.

    parent_themes.csv

    Column            Description
    "id"              A code that is unique to every theme.
    "name"            The name of the parent theme.
    "is_licensed"     A Boolean column specifying whether the theme is a licensed theme.

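Because "parent_theme" in lego_sets.csv matches "name" in parent_themes.csv, the two tables can be joined to attach the "is_licensed" flag to each set. The following is a minimal sketch on made-up rows, not the real files; the suffix "_theme" and the toy values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the two files (made-up rows, not the real data)
lego_sets = pd.DataFrame({
    "set_num": ["0001-1", "0002-1", "0003-1"],
    "name": ["X-wing", "Fire Station", "Hogwarts"],
    "year": [1999, 2005, 2001],
    "parent_theme": ["Star Wars", "Town", "Harry Potter"],
})
parent_themes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Star Wars", "Town", "Harry Potter"],
    "is_licensed": [True, False, True],
})

# Left join keeps every set; the suffix separates the two "name" columns
merged = lego_sets.merge(
    parent_themes,
    left_on="parent_theme",
    right_on="name",
    how="left",
    suffixes=("", "_theme"),
)

# Keep only sets whose parent theme is flagged as licensed
licensed = merged[merged["is_licensed"].eq(True)]
print(licensed[["set_num", "name", "parent_theme", "is_licensed"]])
```

Using a left join means a set whose parent theme has no match still appears in the result, with a missing "is_licensed" value, which makes such rows easy to spot.
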
    The team responsible for the Star Wars partnership has asked for specific information in preparation for their meeting:

    • What percentage of all licensed sets ever released were Star Wars themed? Save your answer as a variable the_force, as an integer.

    • In which year was the highest number of Star Wars sets released? Save your answer as a variable new_era, as an integer.
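One way the two questions could be answered, once each set carries an "is_licensed" flag, is sketched below on made-up rows. Truncating the percentage with int() is an assumption; the brief only says the answer should be an integer.

```python
import pandas as pd

# Made-up rows that already carry the is_licensed flag (not the real data)
sets = pd.DataFrame({
    "set_num": ["a", "b", "c", "d", "e"],
    "year": [1999, 2016, 2016, 2012, 2001],
    "parent_theme": ["Star Wars", "Star Wars", "Star Wars", "Super Heroes", "Town"],
    "is_licensed": [True, True, True, True, False],
})

licensed = sets[sets["is_licensed"]]
star_wars = licensed[licensed["parent_theme"] == "Star Wars"]

# Share of licensed sets that are Star Wars, truncated to an integer percentage
the_force = int(len(star_wars) / len(licensed) * 100)

# Year in which the most Star Wars sets were released
new_era = int(star_wars["year"].value_counts().idxmax())

print(the_force, new_era)  # prints: 75 2016
```
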

    1. Import the data

    Reading data into a pandas DataFrame

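    # Import pandas, then read and inspect the datasets
    import pandas as pd
    
    lego_sets = pd.read_csv('data/lego_sets.csv')
    parent_themes = pd.read_csv('data/parent_themes.csv')
    
    # Preview both tables (a bare .head() only renders on the last line of a cell)
    display(lego_sets.head())
    display(parent_themes.head())
    
    # Themes that are not licensed
    display(parent_themes[parent_themes['is_licensed'] == False])
    
    # .info() prints its own summary and returns None, so call it directly
    lego_sets.info()
    parent_themes.info()
    print(lego_sets.columns, '\n\n', parent_themes.columns)
    print(parent_themes.shape, lego_sets.shape)
    
    # Missing values per column, then the full boolean mask
    display(lego_sets.isna().sum())
    display(lego_sets.isna())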
    # Analyze the 'year' column of lego_sets
    
    print(lego_sets['year'])
    display(lego_sets['year'].values)          # raw NumPy array
    display(lego_sets['year'].unique())        # distinct years
    display(len(lego_sets['year'].unique()))   # number of distinct years
    # Sets released per year, in chronological order
    display(lego_sets.groupby('year')['year'].count())
    # Same counts, busiest years first
    display(lego_sets.groupby('year')['year'].count().sort_values(ascending=False))
    # Analyze the 'parent_theme' column of lego_sets
    
    display(lego_sets['parent_theme'])
    display(lego_sets['parent_theme'].unique())        # distinct parent themes
    display(lego_sets['parent_theme'].values)          # raw NumPy array
    display(len(lego_sets['parent_theme'].unique()))   # number of distinct parent themes
    # Sets per parent theme, largest themes first
    display(lego_sets.groupby('parent_theme')['parent_theme'].count().sort_values(ascending=False))

    Drop relevant missing rows/values
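Since a missing "set_num" marks a duplicate or invalid set while a missing "num_parts" is acceptable, one reasonable approach is to drop rows only where "set_num" is missing. The sketch below uses made-up rows, and the variable name cleaned is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing set_num (invalid) and one missing num_parts (acceptable)
lego_sets = pd.DataFrame({
    "set_num": ["0001-1", None, "0003-1"],
    "name": ["X-wing", "X-wing", "Hogwarts"],
    "num_parts": [263.0, 263.0, np.nan],
})

# Count missing values per column before deciding what to drop
print(lego_sets.isna().sum())

# Drop only rows lacking set_num; rows missing num_parts are kept
cleaned = lego_sets.dropna(subset=["set_num"])
print(cleaned.shape)
```

Restricting dropna to subset=["set_num"] avoids discarding valid sets that merely lack a part count.
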