# Importing the pandas and plotting modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Reading in the data
df = pd.read_csv('data/chocolate_bars.csv')

Summary of Conclusions

The market for gourmet chocolate bars was targeted for evaluation prior to entry. Identifying characteristics of highly rated chocolate bars will allow the company to provide a successful offering.

Rating data was evaluated for 1605 chocolate bars with beans originating in 62 countries. Ratings use a five-point scale, and the highest rating in the data is 4 out of 5. The influence of bean origin on rating was inconclusive, but bars made from a blend of beans had lower ratings. Bars rated 4 had intermediate cocoa content, and over half of them had a cocoa content of 70-72%. Bars containing lecithin, salt, or vanilla tended to have lower ratings.

💾 The data

Your team created a file with the following information (source):
  • "id" - id number of the review
  • "manufacturer" - Name of the bar manufacturer
  • "company_location" - Location of the manufacturer
  • "year_reviewed" - From 2006 to 2021
  • "bean_origin" - Country of origin of the cacao beans
  • "bar_name" - Name of the chocolate bar
  • "cocoa_percent" - Cocoa content of the bar (%)
  • "num_ingredients" - Number of ingredients
  • "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
  • "review" - Summary of most memorable characteristics of the chocolate bar
  • "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
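
The rating bands listed above can be turned into a categorical column. The following is a minimal sketch using `pd.cut`; the sample ratings are made up for illustration, and the names `ratings`, `band_edges`, `band_labels`, and `rating_category` are hypothetical and do not appear in the original analysis.

```python
import pandas as pd

# Hypothetical ratings for illustration only; not taken from the data set.
ratings = pd.Series([1.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0], name="rating")

# Left-closed bands: [1, 2), [2, 3), [3, 3.5), [3.5, 4), [4, inf).
# The open-ended top band is an assumption so that a rating of 5.0 is not dropped.
band_edges = [1.0, 2.0, 3.0, 3.5, 4.0, float("inf")]
band_labels = [
    "Unpleasant",
    "Disappointing",
    "Recommended",
    "Highly Recommended",
    "Outstanding",
]

# right=False makes each band include its lower edge and exclude its upper edge.
rating_category = pd.cut(ratings, bins=band_edges, labels=band_labels, right=False)
print(pd.concat([ratings, rating_category.rename("category")], axis=1))
```

Applied to the real data, the same call could take `df['rating']` in place of the toy series.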

Acknowledgments: Brady Brelinski, Manhattan Chocolate Society

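display(df.head())
df.info()  # prints its summary directly and returns None, so it is not wrapped in display()
display(df.nunique())
print('Duplicate rows:', df.duplicated().sum())
display(df.describe().transpose())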
# create a dataframe with avg rating by bar to deweight by number of reviews
bar_ratings = df.groupby('bar_name')['rating'].agg(['mean','count','max'])
print(bar_ratings.shape)
bar_ratings.head()

💪 Challenge

Create a report to summarize your research. Include:

  1. What is the average rating by country of origin?
  2. How many bars were reviewed for each of those countries?
  3. Create plots to visualize findings for questions 1 and 2.
  4. Is the cacao bean's origin an indicator of quality?
  5. [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
  6. [Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
  7. Summarize your findings.
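
For question 6, one possible approach is to split the ingredients string into individual codes before checking for lecithin, so that codes such as `S*` and `Sa` are not confused with `S`. The sketch below runs on made-up rows; the `"<count>- <codes>"` layout of the ingredients string is an assumption, and `ingredient_codes`, `toy`, and `avg_by_lecithin` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the data set.
# The "<count>- <codes>" layout of the ingredients string is an assumption.
toy = pd.DataFrame({
    "ingredients": ["3- B,S,C", "4- B,S,C,L", "2- B,S", "5- B,S,C,V,L", "4- B,S*,C,Sa"],
    "rating": [3.5, 3.0, 3.75, 2.75, 3.25],
})


def ingredient_codes(raw):
    """Return the set of ingredient codes found in a raw ingredients string."""
    if not isinstance(raw, str):
        return set()  # missing values carry no ingredient information
    codes = raw.split("-", 1)[-1]  # drop the leading count, if present
    return {code.strip() for code in codes.split(",") if code.strip()}


# Label each bar by whether the exact code "L" is among its ingredients.
toy["lecithin"] = toy["ingredients"].apply(
    lambda raw: "with" if "L" in ingredient_codes(raw) else "without"
)

# Average rating for bars with and without lecithin.
avg_by_lecithin = toy.groupby("lecithin")["rating"].mean()
print(avg_by_lecithin)
```

Matching whole codes instead of substrings is the main design choice here: a plain substring search would work for `L` but would misclassify bars if reused for `S`.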

Bars by country

The data contains 62 countries, which is too many to visualize cleanly. To reduce visual clutter, the plot was limited to the ten highest-rated origins and the ten origins with the most bars.

The seven bean origins with the highest average ratings each have fewer than 50 bars, while origins with large numbers of bars tended to have ratings close to the overall average. Blends are a notable outlier, with a high number of bars and a relatively low average rating.

Origin was inconclusive as an indicator of quality: many highly rated origins have small sample sizes, and the averages of origins with large numbers of bars regress toward the overall mean.
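
The small-sample concern can be made concrete with the standard error of each origin's mean. This is a rough sketch on made-up ratings, not a result from the data set; the origin names and the `summary` table are hypothetical, and the interval uses a normal approximation that is only indicative for very small groups.

```python
import pandas as pd

# Hypothetical per-bar average ratings for illustration only.
toy = pd.DataFrame({
    "bean_origin": ["Origin A"] * 3 + ["Origin B"] * 12,
    "rating": [3.75, 3.0, 3.5]
    + [3.25, 3.0, 3.5, 3.25, 3.0, 3.5, 3.25, 3.25, 3.0, 3.5, 3.25, 3.25],
})

# Mean, number of bars, and standard error of the mean for each origin.
summary = toy.groupby("bean_origin")["rating"].agg(["mean", "count", "sem"])

# Approximate 95% interval around each mean (normal approximation).
summary["lower"] = summary["mean"] - 1.96 * summary["sem"]
summary["upper"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)
```

An origin with few bars gets a much wider interval than one with many bars, so a high average from a small group is weaker evidence of quality.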

# bean_origin: average rating, number of bars, and highest rating for each bean origin
# Keep one row per (bar_name, bean_origin) pair so a bar with many reviews is counted once
bar_origins = df[['bean_origin', 'bar_name']].drop_duplicates()
bean_origin = bar_ratings.reset_index().merge(bar_origins, on='bar_name', how='left')
# Highest_Rating uses each bar's maximum rating rather than the maximum of the per-bar averages
bean_origin = bean_origin.groupby('bean_origin').agg(
    Avg_Rating=('mean', 'mean'), Number_of_Bars=('mean', 'count'), Highest_Rating=('max', 'max'))
bean_origin = bean_origin.sort_values('Avg_Rating', ascending=False)
display(bean_origin.head())
# Ten highest-rated and ten most common origins; an origin that appears in both lists is kept once
bean_origin_rev = pd.concat([bean_origin.head(10), bean_origin.sort_values('Number_of_Bars', ascending=False).head(10)])
print('Duplicates:', bean_origin_rev.index.duplicated().sum())
bean_origin_rev = bean_origin_rev[~bean_origin_rev.index.duplicated()]
bean_origin_rev = bean_origin_rev.sort_values('Avg_Rating', ascending=False)
display(bean_origin_rev)
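fig, ax = plt.subplots(figsize=(8, 5))
# Bars: average rating per origin (left axis)
sns.barplot(x=bean_origin_rev.index, y=bean_origin_rev.Avg_Rating, color='royalblue', ax=ax, alpha=0.7, label='Avg Rating')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor')
ax.set_xlabel('Bean Origin')
ax.set_ylabel('Average Rating')
# Line: number of bars per origin (right axis)
ax2 = ax.twinx()
sns.lineplot(x=bean_origin_rev.index, y=bean_origin_rev.Number_of_Bars, color='brown', ax=ax2, linewidth=3, label='# Bars', legend=False)
ax2.set_ylabel('Number of Bars')
ax.set_title('Average Rating and Number of Bars by Origin')
# Combine the legend entries from both axes; passing the Axes objects themselves as handles does not work
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
_ = ax.legend(handles1 + handles2, labels1 + labels2, loc='upper right')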
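# Distribution of the number of bars per origin, drawn on its own figure
fig, ax = plt.subplots()
sns.histplot(bean_origin['Number_of_Bars'], binwidth=10, ax=ax)
ax.set_title('Number of Bars per Bean Origin')
_ = ax.set_xlabel('Number of Bars')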