ℹ️ Introduction to data science notebooks
You can skip this section if you are already familiar with data science notebooks.
Data science notebooks
A data science notebook is a document containing text cells (what you're reading now) and code cells. What is unique about a notebook is that it's interactive: You can change or add code cells and then run a cell by selecting it and then clicking the Run button on the right ( ▶, or Run All ) or hitting control + enter
.
The result will be displayed directly in the notebook.
# Modify any of the numbers and rerun the cell
100 * 1.75 * 21
Data science notebooks & data analysis
Notebooks are great for interactive data analysis. You can add a Text, Python or SQL cell by clicking on the Add Text, Add Code, and Add SQL buttons that appear as you move the mouse pointer near the bottom of any cell.
Here at DataCamp, we call our interactive notebook Workspace. You can find out more about Workspace here.
We will use a SQL cell to load the Lego database containing a wealth of information on Lego sets, themes, colors, and much, much more.
We will use the commands SELECT
and * to load the full sets table. We use the command LIMIT
to only show the first 10 rows.
Data analysis example:
Find the average number of parts per set.
We can use AVG()
to get the average of the column num_parts
A note on SQL cells
You can click the "Browse tables" button in the upper right-hand corner of the SQL cell above to view the available tables. They will show on the left of the notebook.
Data science notebooks & charts
Visualizations are very helpful to summarize data and gain insights. A well-crafted chart often conveys information much better than a table.
It is very straightforward to include charts in a data science notebook. For example, let's take a look at the number of sets released per year.
We will first get the information using COUNT
of the column set_num
and grouping by year with 'GROUP BY':
SELECT year, COUNT(set_num) as set_count
FROM sets
GROUP BY year;
Workspace has built in chart cells (create one by clicking on Add Chart). We use one to build the chart using the sets_year
table we created in the cell above.
Understanding Lego sets popularity
Now let's now move on to the competition and challenge.
📖 Background
You recently applied to work as a data analyst intern at the famous Lego Group in Denmark. As part of the job interview process, you received the following take-home assignment:
You are asked to use the provided dataset to understand the popularity of different Lego sets and themes. The idea is to become familiarized with the data to be ready for an interview with a business stakeholder.
💾 The data
You received access to a database with the following tables. You can also see above a visualization of how the tables are related to each other. (source):
inventory_parts
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "part_num" - unique id for the part (as in the parts table)
- "color_id" - id of the color
- "quantity" - the number of copies of the part included in the set
- "is_spare" - whether or not it is a spare part
parts
- "part_num" - unique id for the part (as in the inventory_parts table)
- "name" - name of the part
- "part_cat_id" - part category id (as in part_catagories table)
part_categories
- "id" - part category id (as in parts table)
- "name" - name of the category the part belongs to
colors
- "id" - id of the color (as in inventory_parts table)
- "name" - color name
- "rgb" - rgb code of the color
- "is_trans" - whether or not the part is transparent/translucent
inventories
- "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
- "version" - version number
- "set_num" - set number (as in sets table)
inventory_sets
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "set_num" - set number (as in sets table)
- "quantity" - the quantity of sets included
sets
- "set_num" - unique set id (as in inventory_sets and inventories tables)
- "name" - the name of the set
- "year" - the year the set was published
- "theme_id" - the id of the theme the set belongs to (as in themes table)
- num-parts - the number of parts in the set
themes
- "id" - the id of the theme (as in the sets table)
- "name" - the name of the theme
- "parent_id" - the id of the larger theme, if there is one
Acknowledgments: Rebrickable.com
💪 Challenge
Create a report to summarize your findings. Include:
- What is the average number of Lego sets released per year?
- What is the average number of Lego parts per year?
- Create a visualization for item 2.
- What are the 5 most popular colors used in Lego parts?
- [Optional] What proportion of Lego parts are transparent?
- [Optional] What are the 5 rarest lego bricks?
- Summarize your findings.
💡 Learn more
The following DataCamp courses can help review the skills needed for this challenge:
✅ Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the introduction to data science notebooks, so the workbook is focused on your story.
- Check that all the cells run without error.