Identifying the best chocolate bars with data
  • Finding the best chocolate bars

    📖 Background

    You work at a specialty foods import company that wants to expand into gourmet chocolate bars. Your boss needs your team to research this market to inform your initial approach to potential suppliers.

    After finding valuable chocolate bar ratings online, you need to explore whether the chocolate bars with the highest ratings share any characteristics that could help you narrow your search for suppliers (e.g., cacao percentage, bean country of origin).

    💾 The data

    Your team created a file with the following information (source):
    • "id" - id number of the review
    • "manufacturer" - Name of the bar manufacturer
    • "company_location" - Location of the manufacturer
    • "year_reviewed" - From 2006 to 2021
    • "bean_origin" - Country of origin of the cacao beans
    • "bar_name" - Name of the chocolate bar
    • "cocoa_percent" - Cocoa content of the bar (%)
    • "num_ingredients" - Number of ingredients
    • "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
    • "review" - Summary of most memorable characteristics of the chocolate bar
    • "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
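
    The rating scale above can be turned into a categorical column, which makes it easier to count bars per quality tier. The snippet below is a minimal sketch using `pd.cut` on made-up ratings; the names `ratings`, `bins`, `labels`, and `rating_label` are illustrative, and the left-closed bin edges are an assumption about how values falling between the listed ranges (e.g. 3.49 to 3.5) should be assigned.

```python
import pandas as pd

# Made-up ratings for illustration; not taken from the real file.
ratings = pd.Series([1.5, 2.75, 3.0, 3.5, 4.0, 5.0])

# Left-closed bins: [1, 2), [2, 3), [3, 3.5), [3.5, 4), [4, inf)
bins = [1.0, 2.0, 3.0, 3.5, 4.0, float("inf")]
labels = ["Unpleasant", "Disappointing", "Recommended",
          "Highly Recommended", "Outstanding"]

# right=False makes each bin include its lower edge and exclude its upper edge
rating_label = pd.cut(ratings, bins=bins, labels=labels, right=False)
print(rating_label.tolist())
```

    On the real data, the same call could be applied to `df['rating']` once the file is loaded.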

    Acknowledgments: Brady Brelinski, Manhattan Chocolate Society

    💪 Challenge

    Create a report to summarize your research. Include:

    1. What is the average rating by country of origin?
    2. How many bars were reviewed for each of those countries?
    3. Create plots to visualize findings for questions 1 and 2.
    4. Is the cacao bean's origin an indicator of quality?
    5. [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
    6. [Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
    7. Summarize your findings.

    Imports

    # Importing the data analysis and plotting libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    
    # Reading in the data
    df = pd.read_csv('data/chocolate_bars.csv')
    df.head()

    Q1. What is the average rating by country of origin?

    avg_rating = df.groupby("bean_origin")['rating'].mean().sort_values(ascending = False).reset_index()
    avg_rating.columns = ['country', 'avg_rating']
    avg_rating

    Q2. How many bars were reviewed for each of those countries?

    bar_count = df.groupby("bean_origin")['rating'].count().sort_values(ascending = False).reset_index()
    bar_count.columns = ['country', 'count']
    bar_count
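
    Q1 and Q2 group by the same column, so both statistics can also be computed in a single pass. The following is a sketch using pandas named aggregation on a small made-up table (the rows and the name `origin_stats` are illustrative, not from the real file); it produces the same `country`, `avg_rating`, and `count` columns as the two cells above.

```python
import pandas as pd

# Toy stand-in for the real reviews file; these rows are made up for illustration.
toy = pd.DataFrame({
    "bean_origin": ["Peru", "Peru", "Ghana", "Ghana", "Ghana", "Fiji"],
    "rating": [3.5, 3.0, 2.5, 3.0, 3.5, 4.0],
})

# Named aggregation: one groupby yields both the mean rating and the review count.
origin_stats = (
    toy.groupby("bean_origin")["rating"]
    .agg(avg_rating="mean", count="count")
    .sort_values("count", ascending=False)
    .reset_index()
    .rename(columns={"bean_origin": "country"})
)
print(origin_stats)
```

    Keeping both columns in one frame also removes the need for a later merge before plotting one against the other.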

    Q3. Create plots to visualize Q1 & Q2

    fig, axs = plt.subplots(1, 2, figsize = (16, 10))
    
    sns.barplot(data = avg_rating, y = 'country', x = 'avg_rating', ax = axs[0])
    sns.barplot(data = bar_count, y = 'country', x = 'count', ax = axs[1])
    
    # Annotate the first subplot
    for p in axs[0].patches:
        axs[0].annotate(format(p.get_width(), '.1f'),  # Use p.get_width() for horizontal bars
                     (p.get_x() + p.get_width(), p.get_y() + p.get_height()/2.), 
                     ha='left', va='center', 
                     xytext=(5, 0), 
                     textcoords='offset points')
    
    # Annotate the second subplot
    for p in axs[1].patches:
        axs[1].annotate(format(p.get_width(), '.0f'),  # Counts are whole numbers, so show no decimals
                     (p.get_x() + p.get_width(), p.get_y() + p.get_height()/2.), 
                     ha='left', va='center', 
                     xytext=(5, 0), 
                     textcoords='offset points')
    
    axs[0].set_title("Average rating by country")
    axs[1].set_title("Total review count by country")
    
    plt.tight_layout()

    Q4. Is the cacao bean's origin an indicator of quality?

    plt.figure(figsize = (16, 8))
    
    # Combine the average rating and the review count for each country
    merged = pd.merge(avg_rating, bar_count, on = 'country', how = 'left').sort_values(by = 'count', ascending = False)
    ax = sns.scatterplot(data = merged, x = 'count', y = 'avg_rating')
    
    # Annotate each point in the scatter plot with its country name
    for x, y, name in zip(merged['count'], merged['avg_rating'], merged['country']):
        ax.text(x + 0.1,  # slightly offset the text in the x direction for clarity
                y,  # keep the label at the same height as the point
                name,  # text to display
                horizontalalignment='left',
                size='small', color='black', alpha = 0.6)
    
    plt.xlabel('rating count')
    plt.ylabel('avg rating')
    plt.title('Scatter Plot of rating count with average rating')
    
    plt.tight_layout()
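
    When reading the scatter plot, keep in mind that an average built from one or two reviews is much noisier than one built from hundreds. One possible way to account for this, sketched below on made-up numbers, is to compare the spread of average ratings before and after dropping countries with few reviews. The cutoff `min_reviews` and the frame `summary` are hypothetical; a sensible threshold depends on the real data.

```python
import pandas as pd

# Made-up per-country summary with country, avg_rating, and count columns.
summary = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "avg_rating": [3.75, 3.2, 3.3, 2.5],
    "count": [1, 120, 80, 2],
})

min_reviews = 30  # hypothetical cutoff; tune it to the real data

# Keep only countries whose average rests on enough reviews
reliable = summary[summary["count"] >= min_reviews]

# Range of average ratings with and without the sparsely reviewed countries
spread_all = summary["avg_rating"].max() - summary["avg_rating"].min()
spread_reliable = reliable["avg_rating"].max() - reliable["avg_rating"].min()

print(f"Spread across all countries: {spread_all:.2f}")
print(f"Spread across countries with >= {min_reviews} reviews: {spread_reliable:.2f}")
```

    If the spread shrinks sharply after filtering, most of the apparent differences between origins may come from small samples rather than from the beans themselves.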

    Q5.1 How does cocoa content relate to rating?

    # Pearson correlation: a unitless value between -1 and 1, not a change in rating per percentage point
    corr_val = df[['cocoa_percent','rating']].corr().iloc[0, 1]
    print(f"Pearson correlation between cocoa content and rating: {round(corr_val, 2)}")
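
    A correlation coefficient measures how tightly two variables move together, not how much the rating changes per percentage point of cocoa. To estimate that per-point change, one option is the slope of a least-squares line. The sketch below uses `np.polyfit` on made-up values (the frame `toy` is illustrative, not the real data) to show that the two numbers differ.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; not taken from the real file.
toy = pd.DataFrame({
    "cocoa_percent": [60.0, 70.0, 80.0, 90.0],
    "rating": [3.5, 3.5, 3.0, 2.5],
})

# Pearson correlation: strength and direction of the linear relationship
r = toy["cocoa_percent"].corr(toy["rating"])

# Least-squares line: slope is the estimated rating change per 1-point increase in cocoa
slope, intercept = np.polyfit(toy["cocoa_percent"], toy["rating"], deg=1)

print(f"Correlation: {r:.2f}")
print(f"Estimated change in rating per 1-point increase in cocoa: {slope:.3f}")
```

    On the real data, the same two calls could be applied to `df['cocoa_percent']` and `df['rating']`, assuming neither column contains missing values.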