The dataset, "chocolate_bars.csv", contains different characteristics of over 2,500 chocolate bars and their reviews.

# Importing the pandas module
import pandas as pd

# Reading in the data
df = pd.read_csv('data/chocolate_bars.csv')

# Take a look at the first datapoints
df.head(10)

💾 The data

Your team created a file with the following information (source):
  • "id" - id number of the review
  • "manufacturer" - Name of the bar manufacturer
  • "company_location" - Location of the manufacturer
  • "year_reviewed" - From 2006 to 2021
  • "bean_origin" - Country of origin of the cacao beans
  • "bar_name" - Name of the chocolate bar
  • "cocoa_percent" - Cocoa content of the bar (%)
  • "num_ingredients" - Number of ingredients
  • "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
  • "review" - Summary of most memorable characteristics of the chocolate bar
  • "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding

Acknowledgments: Brady Brelinski, Manhattan Chocolate Society

df.head(30)
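Before aggregating, a quick look at missing values in the columns used later can be useful, since the lecithin comparison further down relies on "ingredients". The sketch below uses a hypothetical helper, `missing_counts`, and toy rows that are not taken from the dataset; the ingredient string format shown is an assumption.

```python
import pandas as pd


def missing_counts(df, columns):
    """Number of missing values in each of the given columns."""
    return df[columns].isna().sum()


# Toy rows for illustration only (not taken from chocolate_bars.csv)
toy = pd.DataFrame({
    'bean_origin': ['A', 'B', 'C'],
    'cocoa_percent': [70.0, 72.0, 85.0],
    'ingredients': ['3- B,S,C', None, '2- B,S'],
    'rating': [3.5, 3.0, 3.25],
})
counts = missing_counts(toy, ['bean_origin', 'cocoa_percent', 'ingredients', 'rating'])
print(counts)
```

On the real data the same call would be `missing_counts(df, [...])` with the column names listed in the data description.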

💪 Challenge

Create a report to summarize your research. Include:

  1. What is the average rating by country of origin?
  2. How many bars were reviewed for each of those countries?
  3. Create plots to visualize findings for questions 1 and 2.
  4. Is the cacao bean's origin an indicator of quality?
  5. [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
  6. [Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
  7. Summarize your findings.
# Importing the plotting and data modules
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Reading in the data
df = pd.read_csv('data/chocolate_bars.csv')


# Average rating per bean origin, highest first
df_rating = (
    df.groupby('bean_origin')[['rating']]
    .mean()
    .sort_values(by='rating', ascending=False)
    .reset_index()
)
print(df_rating.head())
plt.figure(figsize=(12,4))
chart = sns.barplot(x='bean_origin', y='rating', data=df_rating)
# Rotate the country labels so they do not overlap
chart.tick_params(axis='x', labelrotation=90)
for x,y in zip(df_rating.index,df_rating['rating']):
    label = "{:.2f}".format(y)
    plt.annotate(label,
                 (x,y),
                 textcoords="offset points",
                 xytext=(0,1),
                 ha='center', size=5, rotation=90)
chart.set_title('Average Rating by Country')
plt.show()

# Number of distinct reviews per bean origin, highest first
df_bars = (
    df.groupby('bean_origin')
    .agg({'id': 'nunique'})
    .sort_values(by='id', ascending=False)
    .reset_index()
)
print(df_bars.head())
plt.figure(figsize=(10,4))
chart = sns.barplot(x='bean_origin', y='id', data=df_bars)
# Rotate the country labels so they do not overlap
chart.tick_params(axis='x', labelrotation=90)
for x,y in zip(df_bars.index,df_bars['id']):
    label = "{:.0f}".format(y)
    plt.annotate(label,
                 (x,y),
                 textcoords="offset points",
                 xytext=(0,1),
                 ha='center', size=5, rotation=90)
chart.set_title('Number of bars reviewed by Country')
plt.show()

# Bean origin is not an indicator of quality
df_rating = df_rating.sort_values(by='rating', ascending=False)
print(df_rating.head())
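The conclusion in the comment above is stated without supporting numbers. One hedged way to check it is to look at the mean rating per origin alongside the review count and the spread, keeping only origins with enough reviews for the mean to be meaningful. The helper name `summarize_origins`, the `min_reviews` threshold, and the toy rows are assumptions for illustration, not part of the original notebook or dataset.

```python
import pandas as pd


def summarize_origins(df, min_reviews=2):
    """Return count, mean and std of rating per bean origin.

    Origins with fewer than `min_reviews` reviews are dropped, because a
    mean based on one or two bars says little about the origin itself.
    """
    summary = (
        df.groupby('bean_origin')['rating']
        .agg(['count', 'mean', 'std'])
        .reset_index()
    )
    summary = summary[summary['count'] >= min_reviews]
    return summary.sort_values('mean', ascending=False).reset_index(drop=True)


# Toy rows for illustration only (not taken from chocolate_bars.csv)
toy = pd.DataFrame({
    'bean_origin': ['A', 'A', 'B', 'B', 'B', 'C'],
    'rating': [3.0, 3.5, 3.5, 3.5, 3.5, 4.0],
})
origin_summary = summarize_origins(toy, min_reviews=2)
print(origin_summary)
```

On the real data, `summarize_origins(df, min_reviews=...)` could be used the same way. If the means of well-sampled origins differ by less than their standard deviations, that would be consistent with origin alone being a weak indicator of quality.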

# Bars rated above 3.5 count as "higher rated"
df_high = df[df['rating'] > 3.5]
print(df_high.head())
avg_high = df_high['cocoa_percent'].mean()

print(f"The average cocoa content for bars with higher ratings is {avg_high:.2f}%")
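The average above answers only the second half of question 5. To describe how cocoa content relates to rating, one option is the Pearson correlation between the two columns. The sketch below uses a hypothetical helper, `cocoa_rating_corr`, and toy rows that are not from the dataset; it assumes "cocoa_percent" is stored as a number.

```python
import pandas as pd


def cocoa_rating_corr(df):
    """Pearson correlation between cocoa content and rating (pairs with NaN are ignored)."""
    return df['cocoa_percent'].corr(df['rating'])


# Toy rows for illustration only (not taken from chocolate_bars.csv)
toy = pd.DataFrame({
    'cocoa_percent': [60.0, 70.0, 80.0, 90.0],
    'rating': [3.5, 3.5, 3.0, 2.5],
})
corr = cocoa_rating_corr(toy)
print(f"Correlation between cocoa content and rating: {corr:.2f}")
```

A value near 0 would suggest little linear relationship, while a clearly negative value would suggest that bars with more cocoa tend to be rated lower.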

# Bars whose ingredient list includes lecithin (L)
df_L = df[df['ingredients'].str.contains('L', na=False)]
print(df_L.head())
L_mean = df_L['rating'].mean()

print(f'The average rating of bars with lecithin is {L_mean:.2f}')
# Bars without lecithin; because of na=False, bars with a missing ingredient list also land here
df_NL = df[~df['ingredients'].str.contains('L', na=False)]
print(df_NL.head())
NL_mean = df_NL['rating'].mean()

print(f'The average rating of bars without lecithin is {NL_mean:.2f}')

print('The average rating of bars without lecithin is slightly higher than the average rating of bars with lecithin')
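The lecithin comparison can also be done in one pass by grouping on a label column, so the conclusion can be read from the numbers instead of a fixed print statement. This sketch drops rows with a missing ingredient list first, which is an assumption that differs from the cells above. The helper name `rating_by_lecithin` and the toy rows are illustrative only.

```python
import pandas as pd


def rating_by_lecithin(df):
    """Mean rating for bars with and without lecithin, ignoring rows with no ingredient list."""
    known = df.dropna(subset=['ingredients'])
    has_lecithin = known['ingredients'].str.contains('L', regex=False)
    label = has_lecithin.map({True: 'with lecithin', False: 'without lecithin'})
    return known.groupby(label.rename('lecithin'))['rating'].mean()


# Toy rows for illustration only (not taken from chocolate_bars.csv)
toy = pd.DataFrame({
    'ingredients': ['3- B,S,C', '4- B,S,C,L', '5- B,S,C,V,L', None, '2- B,S'],
    'rating': [3.5, 3.0, 3.25, 2.0, 3.75],
})
means = rating_by_lecithin(toy)
for label, value in means.items():
    print(f"Average rating {label}: {value:.2f}")
```

Whether missing ingredient lists should be dropped or counted as lecithin-free is a judgment call; comparing both versions shows how sensitive the result is to that choice.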
