ℹ️ Introduction to Data Science Notebooks
You can skip this section if you are already familiar with data science notebooks.
Data science notebooks
A data science notebook is a document that contains text cells (what you're reading right now) and code cells. What makes a notebook unique is that it's interactive: you can change or add code cells, then run a cell by selecting it and either clicking the Run button above (▶, or Run All) or pressing Control + Enter.
The result will be displayed directly in the notebook.
Try running the cell below:
# Run this cell to see the result
100 * 1.75 * 17
Modify any of the numbers and rerun the cell.
Data science notebooks & data analysis
Notebooks are great for interactive data analysis. Let's create a pandas DataFrame using the read_csv() function.
We will load the dataset "chocolate_bars.csv" containing different characteristics of over 2,500 chocolate bars and their reviews.
By using the .head() method, we display the first five rows of data:
# Importing the pandas module
import pandas as pd
# Reading in the data
df = pd.read_csv('data/chocolate_bars.csv')
# Take a look at the first datapoints
df.head()
Data analysis example:
Find the average rating for chocolate bars with different numbers of ingredients.
We can use .groupby() to group the information by the column "num_ingredients". Then we select the column "rating" and use .mean() to get the average rating for each group:
df.groupby('num_ingredients')[['rating']].mean()
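To see what the grouping does without loading the full file, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration only; they are not rows from chocolate_bars.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for chocolate_bars.csv
toy = pd.DataFrame({
    "num_ingredients": [2, 2, 3, 3, 3, 4],
    "rating": [3.0, 3.5, 3.25, 3.75, 3.5, 2.75],
})

# Double brackets select a DataFrame, so the result keeps "rating" as a column
avg_rating = toy.groupby("num_ingredients")[["rating"]].mean()
print(avg_rating)

# Single brackets select a Series instead; the numbers are the same
avg_series = toy.groupby("num_ingredients")["rating"].mean()
print(avg_series)
```

Each distinct value of "num_ingredients" becomes one row of the result, indexed by that value. The double-bracket form used in the cell above returns a DataFrame, which renders as a formatted table in a notebook.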
Data science notebooks & visualizations
Visualizations are very helpful to summarize data and gain insights. A well-crafted chart often conveys information much better than a table.
It is very straightforward to include plots in a data science notebook. For example, let's take a look at the relationship between review year and rating.
We are using the seaborn library for this example. We will run the scatterplot() function and include the variables we want to display.
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x='year_reviewed', y='rating', data=df)
plt.show()
We can also plot the average rating for chocolate bars with different numbers of ingredients. By default, barplot() computes the mean of "rating" for each value of "num_ingredients", so we can pass the full DataFrame directly:
sns.barplot(x='num_ingredients', y='rating', data=df)
plt.show()
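If you would rather plot a table of averages you have already computed, one possible approach is to draw the bars from that table directly. This is a sketch under assumptions: the `toy` DataFrame is made up for illustration, and plain matplotlib is used here in place of seaborn.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for chocolate_bars.csv
toy = pd.DataFrame({
    "num_ingredients": [2, 2, 3, 3, 3, 4],
    "rating": [3.0, 3.5, 3.25, 3.75, 3.5, 2.75],
})

# Compute the averages first, then turn the group labels back into a column
avg_rating = toy.groupby("num_ingredients")[["rating"]].mean().reset_index()

# One bar per group; labels are cast to strings so they are treated as categories
fig, ax = plt.subplots()
ax.bar(avg_rating["num_ingredients"].astype(str), avg_rating["rating"])
ax.set_xlabel("num_ingredients")
ax.set_ylabel("average rating")
```

In a notebook, calling plt.show() after this (as in the cells above) renders the figure. Because the averages are computed beforehand, each bar's height is exactly the value in the table.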
Finding the best chocolate bars
Now let's move on to the competition and challenge.
📖 Background
You work at a specialty foods import company that wants to expand into gourmet chocolate bars. Your boss needs your team to research this market to inform your initial approach to potential suppliers.
After finding valuable chocolate bar ratings online, you need to explore whether the chocolate bars with the highest ratings share any characteristics that could help you narrow your search for suppliers (e.g., cacao percentage or bean country of origin).
💾 The data
Your team created a file with the following information (source):
- "id" - id number of the review
- "manufacturer" - Name of the bar manufacturer
- "company_location" - Location of the manufacturer
- "year_reviewed" - From 2006 to 2021
- "bean_origin" - Country of origin of the cacao beans
- "bar_name" - Name of the chocolate bar
- "cocoa_percent" - Cocoa content of the bar (%)
- "num_ingredients" - Number of ingredients
- "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
- "review" - Summary of most memorable characteristics of the chocolate bar
- "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
Acknowledgments: Brady Brelinski, Manhattan Chocolate Society
df.head()
💪 Challenge
Create a report to summarize your research. Include:
- What is the average rating by country of origin?
- How many bars were reviewed for each of those countries?
- Create plots to visualize the findings for the first two questions.
- Is the cacao bean's origin an indicator of quality?
- [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
- [Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
- Summarize your findings.
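As a starting point for the questions above, the following is a minimal sketch of the pandas operations that might be useful. The `bars` table is made up for illustration and contains no real review data. Two things here are assumptions rather than facts about the file: that rows with missing ingredients should be excluded from the lecithin comparison, and that ingredient codes are stored as comma-separated strings.

```python
import pandas as pd

# Hypothetical rows that mimic the columns described above; not real review data
bars = pd.DataFrame({
    "bean_origin": ["Peru", "Peru", "Ghana", "Ghana", "Ghana", "Fiji"],
    "cocoa_percent": [70.0, 75.0, 70.0, 80.0, 72.0, 68.0],
    "ingredients": ["B,S,C", "B,S,C,L", "B,S", "B,S,C,V,L", "B,S,C", None],
    "rating": [3.5, 3.0, 3.75, 2.75, 4.0, 3.25],
})

# Average rating and number of reviewed bars per country of origin
by_origin = (
    bars.groupby("bean_origin")
    .agg(avg_rating=("rating", "mean"), n_bars=("rating", "count"))
    .sort_values("avg_rating", ascending=False)
)
print(by_origin)

# Average cocoa content of bars rated above 3.5
high_rated_cocoa = bars.loc[bars["rating"] > 3.5, "cocoa_percent"].mean()
print(high_rated_cocoa)

# Lecithin comparison: drop rows with unknown ingredients (an assumption),
# then flag bars whose ingredient string contains the code "L".
# No other ingredient code (B, S, S*, C, V, Sa) contains a capital L.
known = bars.dropna(subset=["ingredients"])
lecithin = (
    known["ingredients"]
    .str.contains("L", regex=False)
    .map({True: "with lecithin", False: "without lecithin"})
)
by_lecithin = known.groupby(lecithin)["rating"].mean()
print(by_lecithin)
```

Keeping the review count next to the average matters when judging origin as a quality indicator: an average based on one or two bars says much less than one based on dozens, so you may want to filter `by_origin` on `n_bars` before ranking countries.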