Skip to content
The introduction of Lego's first licensed series, Star Wars, was a hit that sparked a series of collaborations with more themed sets. The partnerships team has asked you to perform an analysis of this success, and before diving into the analysis, they have suggested reading the descriptions of the two datasets to use, reported below.

Please note:

This was very much a working notebook, in which I've tried to carry my understanding of basic pandas to tackle the tasks. There will be a few snippets that are either highly inefficient, non-pythonic or down-right incorrect. That said, I wanted to include these snippets for my own retrospective learning.

The Data

You have been provided with two datasets to use. A summary and preview are provided below.

lego_sets.csv

ColumnDescription
"set_num"A code that is unique to each set in the dataset. This column is critical, and a missing value indicates the set is a duplicate or invalid!
"name"The name of the set.
"year"The date the set was released.
"num_parts"The number of parts contained in the set. This column is not central to our analyses, so missing values are acceptable.
"theme_name"The name of the sub-theme of the set.
"parent_theme"The name of the parent theme the set belongs to. Matches the name column of the parent_themes csv file.

parent_themes.csv

ColumnDescription
"id"A code that is unique to every theme.
"name"The name of the parent theme.
"is_licensed"A Boolean column specifying whether the theme is a licensed theme.

Task

  • What percentage of all licensed sets ever released were Star Wars themed? Save your answer as a variable the_force, as an integer.

  • In which year was the highest number of Star Wars sets released? Save your answer as a variable new_era, as an integer (e.g. 2012).

import pandas as pd

lego_sets = pd.read_csv('data/lego_sets.csv')
lego_sets.head()
parent_themes = pd.read_csv('data/parent_themes.csv')
parent_themes.head()

What percentage of all licensed sets ever released were Star Wars themed?

Save your answer as a variable the_force, as an integer.

# firstly I will merge the datasets, joining on df1['parent name'] and df2['name']
merged_df = lego_sets.merge(parent_themes, left_on='parent_theme', right_on='name')
# expecting '4 juniors' to be the first parent theme if I order values from A-Z...
merged_df.sort_values(by='parent_theme',ascending=True).head()
# confirm the total number of unique sets in the entire list
total_set_count = merged_df['set_num'].value_counts().sum()
total_set_count
# get a total count of all the rows that feature 'Star Wars' as the parent_theme
star_wars_themed = merged_df['parent_theme'].str.contains('Star Wars', na=False).sum()
star_wars_themed
# What percentage (integer) of all licensed sets ever released were Star Wars themed?
the_force = (star_wars_themed / total_set_count * 100)
the_force

This doesn't look right, even as a float...

On review of the data-set, I realise now that not all Star Wars 'sub-themed' lego sits under the parent_theme: Star Wars.

# get a total count of all the rows that feature 'Star Wars' as the name
star_wars_themed = merged_df['theme_name'].str.contains('Star Wars', na=False).sum()
star_wars_themed

# What percentage (integer) of all licensed sets ever released were Star Wars themed?
the_force = (star_wars_themed / total_set_count * 100)
the_force

We've gone dramatically the other way... and this was my error - I misread the question.

It's the percentage of all licensed sets only. I need to get a total count of all sets, excluding all non-licensed.

# double-checking that I'm dealing with boolean in the licensed column
print(merged_df['is_licensed'].apply(type).unique())
# filter out the non-licensed sets
licensed_only = merged_df[merged_df['is_licensed'] != False]

# confirm the total number of unique sets in the entire licensed list
licensed_only = merged_df[merged_df['is_licensed'] == True]
total_set_count = licensed_only['set_num'].nunique()
total_set_count