Competition - Chocolate Bars - MSCH
The dataset, "chocolate_bars.csv", contains different characteristics of over 2,500 chocolate bars and their reviews.
# Importing the pandas module
import pandas as pd
# Reading in the data
df = pd.read_csv('data/chocolate_bars.csv')
# Take a look at the first datapoints
df.head(10)
💾 The data
Your team created a file with the following information (source):
- "id" - id number of the review
- "manufacturer" - Name of the bar manufacturer
- "company_location" - Location of the manufacturer
- "year_reviewed" - From 2006 to 2021
- "bean_origin" - Country of origin of the cacao beans
- "bar_name" - Name of the chocolate bar
- "cocoa_percent" - Cocoa content of the bar (%)
- "num_ingredients" - Number of ingredients
- "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
- "review" - Summary of most memorable characteristics of the chocolate bar
- "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
Acknowledgments: Brady Brelinski, Manhattan Chocolate Society
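The rating bands listed above can be attached to each review as a label, which makes it easier to count bars per band. The following is a minimal sketch, not part of the original notebook: `label_ratings`, `RATING_BINS`, and `RATING_LABELS` are hypothetical names, the toy values are made up for illustration, and the open-ended top band (anything from 4.0 upward) is an assumption about how to treat the 4.0-5.0 range.

```python
import pandas as pd

# Band edges taken from the "rating" description; each band is closed on
# the left, so a rating of exactly 3.5 falls in "Highly Recommended".
# Assumption: the top band is left open-ended (4.0 and above).
RATING_BINS = [1.0, 2.0, 3.0, 3.5, 4.0, float("inf")]
RATING_LABELS = ["Unpleasant", "Disappointing", "Recommended",
                 "Highly Recommended", "Outstanding"]


def label_ratings(ratings: pd.Series) -> pd.Series:
    """Map numeric ratings to their descriptive band (hypothetical helper)."""
    return pd.cut(ratings, bins=RATING_BINS, labels=RATING_LABELS, right=False)


# Toy values for illustration only, not taken from the dataset.
toy_ratings = pd.Series([1.5, 2.75, 3.0, 3.5, 4.0])
print(label_ratings(toy_ratings).tolist())
```

On the real data this could be used as `df['rating_band'] = label_ratings(df['rating'])`, assuming `rating` is numeric.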
df.head(30)
💪 Challenge
Create a report to summarize your research. Include:
- What is the average rating by country of origin?
- How many bars were reviewed for each of those countries?
- Create plots to visualize findings for questions 1 and 2.
- Is the cacao bean's origin an indicator of quality?
- [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
- [Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
- Summarize your findings.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv('data/chocolate_bars.csv')
df_rating = pd.DataFrame(df.groupby('bean_origin')[['rating']].mean().sort_values(by='rating', ascending=False)).reset_index()
print(df_rating.head())
plt.figure(figsize=(12,4))
chart = sns.barplot(x='bean_origin', y='rating', data=df_rating)
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
# Annotate each bar with its average rating
for x, y in zip(df_rating.index, df_rating['rating']):
    label = "{:.2f}".format(y)
    plt.annotate(label,
                 (x, y),
                 textcoords="offset points",
                 xytext=(0, 1),
                 ha='center', size=5, rotation=90)
chart.set_title('Average Rating by Country')
plt.show()
df_bars = pd.DataFrame(df.groupby('bean_origin').agg({'id':'nunique'}).sort_values(by='id', ascending=False)).reset_index()
print(df_bars.head())
plt.figure(figsize=(10,4))
chart = sns.barplot(x='bean_origin', y='id', data=df_bars)
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
# Annotate each bar with the number of bars reviewed
for x, y in zip(df_bars.index, df_bars['id']):
    label = "{:.0f}".format(y)
    plt.annotate(label,
                 (x, y),
                 textcoords="offset points",
                 xytext=(0, 1),
                 ha='center', size=5, rotation=90)
chart.set_title('Number of bars reviewed by Country')
plt.show()
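The two tables above (average rating and number of bars per origin) can also be built in one pass, which helps when judging whether origin indicates quality: an average based on one or two reviews says little. This is a sketch under assumptions, not the notebook's method: `summarize_by_origin` and its `min_reviews` parameter are hypothetical, and the toy frame uses made-up values.

```python
import pandas as pd


def summarize_by_origin(df: pd.DataFrame, min_reviews: int = 1) -> pd.DataFrame:
    """Average rating and bar count per bean origin (hypothetical helper).

    Origins with fewer than `min_reviews` bars are dropped so that
    averages based on very few reviews do not dominate the ranking.
    """
    summary = (
        df.groupby("bean_origin")
          .agg(avg_rating=("rating", "mean"), num_bars=("id", "nunique"))
          .reset_index()
    )
    summary = summary[summary["num_bars"] >= min_reviews]
    return summary.sort_values("avg_rating", ascending=False).reset_index(drop=True)


# Toy data for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "bean_origin": ["A", "A", "A", "B", "C"],
    "rating": [3.0, 3.5, 4.0, 4.0, 2.5],
})
print(summarize_by_origin(toy, min_reviews=2))
```

With the real data, `summarize_by_origin(df, min_reviews=...)` would take a threshold chosen by the analyst; no particular cutoff is implied by the source.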
# Bean origin is not an indicator of quality
df_rating = df_rating.sort_values(by='rating', ascending=False)
print(df_rating.head())
df_high = df[df['rating'] > 3.5]
print(df_high.head())
avg_high = df_high['cocoa_percent'].mean()
print("The average cocoa content for bars with higher ratings (above 3.5) is " + str(avg_high.round(2)) + "%")
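The challenge also asks how cocoa content relates to rating in general. One possible way to look at that, sketched here under assumptions, is a Pearson correlation alongside the average cocoa content for bars above and at or below 3.5. `cocoa_rating_summary` is a hypothetical helper, the toy values are made up, and `cocoa_percent` is assumed to be numeric.

```python
import pandas as pd


def cocoa_rating_summary(df: pd.DataFrame) -> dict:
    """Relate cocoa content to rating (hypothetical helper)."""
    high = df["rating"] > 3.5
    return {
        # Series.corr uses the Pearson coefficient by default.
        "pearson_r": df["cocoa_percent"].corr(df["rating"]),
        "avg_cocoa_high": df.loc[high, "cocoa_percent"].mean(),
        "avg_cocoa_rest": df.loc[~high, "cocoa_percent"].mean(),
    }


# Toy data for illustration only, not taken from the dataset.
toy_cocoa = pd.DataFrame({
    "cocoa_percent": [60.0, 70.0, 80.0, 90.0],
    "rating": [4.0, 3.75, 3.0, 2.5],
})
print(cocoa_rating_summary(toy_cocoa))
```

A coefficient near zero would suggest little linear relationship; the sign and size on the real data are not stated in the source and would need to be computed.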
df_L = df[df['ingredients'].str.contains('L', na=False)]
print(df_L.head())
L_mean = df_L['rating'].mean()
print('The average rating of bars with lecithin is ' + str(L_mean.round(2)))
# na=True so that rows with missing ingredients are not counted as lecithin-free
df_NL = df[~df['ingredients'].str.contains('L', na=True)]
print(df_NL.head())
NL_mean = df_NL['rating'].mean()
print('The average rating of bars without lecithin is ' + str(NL_mean.round(2)))
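The lecithin comparison can also be done in a single grouped step that reports group sizes and leaves out bars whose ingredient list is missing. This is a sketch under assumptions: `lecithin_comparison` is a hypothetical helper, the toy rows are made up, and the ingredient strings in the toy data (such as "4- B,S,C,L") only illustrate an assumed format. Matching on "L" relies on L being the only ingredient code that contains that letter, which holds for the codes listed in the data description.

```python
import pandas as pd


def lecithin_comparison(df: pd.DataFrame) -> pd.DataFrame:
    """Mean rating and count for bars with and without lecithin (hypothetical helper)."""
    # Bars with no ingredient information cannot be classified, so drop them.
    known = df.dropna(subset=["ingredients"])
    has_lecithin = known["ingredients"].str.contains("L", regex=False)
    return (
        known.groupby(has_lecithin)["rating"]
             .agg(["mean", "count"])
             .rename(index={True: "with lecithin", False: "without lecithin"})
    )


# Toy data for illustration only, not taken from the dataset.
toy_bars = pd.DataFrame({
    "ingredients": ["4- B,S,C,L", "3- B,S,C", None, "2- B,S"],
    "rating": [3.0, 3.5, 1.0, 3.75],
})
print(lecithin_comparison(toy_bars))
```

Reporting the count next to each mean makes it easier to judge whether a small difference in average rating is meaningful.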
print('The average rating of bars without lecithin is slightly higher than the average rating of bars with lecithin')
💡 Learn more
The following DataCamp courses can help review the skills needed for this challenge:
✅ Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the introduction to data science notebooks, so the workbook is focused on your story.
- Check that all the cells run without error.