Finding the Best Chocolate Bars
Background
A specialty food import company researches the gourmet chocolate bar market in order to know how best to approach potential suppliers. In particular, we want to know what characteristics determine the highest ratings so we can find suppliers with chocolate bars that get high ratings.
The data file contains the following information (source):
- "id" - id number of the review
- "manufacturer" - Name of the bar manufacturer
- "company_location" - Location of the manufacturer
- "year_reviewed" - From 2006 to 2021
- "bean_origin" - Country of origin of the cacao beans
- "bar_name" - Name of the chocolate bar
- "cocoa_percent" - Cocoa content of the bar (%)
- "num_ingredients" - Number of ingredients
- "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
- "review" - Summary of most memorable characteristics of the chocolate bar
- "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
Acknowledgments: Brady Brelinski, Manhattan Chocolate Society
Summary of Findings
In general, cocoa content does have some relation to overall chocolate bar rating. Bars with a Recommended or higher rating are more likely to have a cocoa content between 70 and 75%. In fact, around three quarters of the Outstanding rated bars have a cocoa content in this range. According to the collected data, the average cocoa content for chocolate bars with ratings greater than 3.5 is about 71%, so this is the percentage to look for when searching for a highly rated chocolate bar.
The cacao bean's country of origin is not a clear, automatic indicator of quality. The majority of countries with many (more than 25) chocolate bar ratings in the data set have average ratings in the Recommended rating category: 3.125 to 3.375. The only average ratings in the Highly Recommended category come from countries with only 1 or 2 chocolate bars in the dataset (Tobago, China, and Sao Tome & Principe) so it may be the case that these countries only submitted their very best bars for the ratings.
Research indicates that some consumers want to avoid chocolate bars that contain lecithin. However, the absence of lecithin does not appear to make a large difference in the rating of a chocolate bar. The average rating of bars that do NOT contain lecithin is 3.225, which is less than a tenth of a point higher than the average rating of bars that do contain this ingredient.
The number of ingredients is also a weak indicator of quality. Most of the well-rated chocolate bars contain 2 to 4 ingredients, and bars with exactly 3 ingredients have the highest average rating (3.2688). Bars with 1 or 5+ ingredients should probably be avoided since they have average ratings at or below the Recommended level.
Overall, when determining which suppliers will have highly rated chocolate bars, we should look for a cocoa content between 70 and 75%, with no more than 4 ingredients and beans that originated in a country that appears often in our data.
# Loading the tidyverse package
suppressPackageStartupMessages(library(tidyverse))
# Reading in the dataset "chocolate_bars.csv" containing different characteristics of over 2,500 chocolate bars and their reviews.
df <- readr::read_csv('data/chocolate_bars.csv', show_col_types = FALSE)
Data Exploration
head(df)
summary(df)
Average Rating by Country of Bean Origin
Average ratings range from 2.7143 to 3.625, with Puerto Rico beans having the lowest average rating and Tobago beans having the highest average rating.
###Average rating by country of origin
mean_ratings_by_origin <- df %>% group_by(bean_origin) %>% summarize(mean_rating = as.numeric(mean(rating)))
mean_ratings_by_origin %>% arrange(desc(mean_rating))
mean_ratings_by_origin %>% ggplot(aes(reorder(bean_origin, mean_rating), mean_rating)) + geom_col() + coord_flip() + xlab("Bean Country of Origin")
Number of Bars Reviewed by Country
The country with the lowest average rating (Puerto Rico) submitted 7 bars. The country with the highest average rating (Tobago) submitted only 2 bars for review. Ten countries submitted only 1 chocolate bar for review. Venezuela submitted the most (253 chocolate bars), and three other countries also submitted over 200 bars (Ecuador, Dominican Republic, Peru).
###How many bars were reviewed for each country
number_ratings_by_origin <- df %>% group_by(bean_origin) %>% count()
number_ratings_by_origin %>% arrange(desc(n))
ggplot(number_ratings_by_origin, aes(reorder(bean_origin, n), n)) + geom_col() + coord_flip() + xlab("Bean Country of Origin") + ylab("Number of Ratings")
Is the cacao bean's origin an indicator of quality?
No. When we group by the bean's country of origin and also consider how many chocolate bars were submitted for review from that country, it is clear that bean origin alone is not a clear indicator of quality.
For example, consider the following comparison. The highest mean rating in our data (3.625) comes from the two bars with beans that originated in Tobago. Individually, the Tobago bars had ratings of 3.25 and 4. The fourth highest mean rating in our data (3.45) comes from the ten bars with beans that originated in the Solomon Islands. Individually, only one of these bars is rated lower than a Tobago bean bar, and five of the Solomon Islands bars are actually rated at 3.5 or higher. So, if those 5 bars had been the only ones submitted for review instead of all 10, the mean rating for the Solomon Islands would have been 3.7, making it the country with the highest average rating.
Thus, it could be argued that beans from the Solomon Islands seem to consistently produce better rated chocolate bars.
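The "what if only the best bars had been submitted" comparison can be checked with a small helper. This is a minimal sketch, not part of the original analysis: `top_k_mean` is a hypothetical function name, and it assumes the data has the `bean_origin` and `rating` columns described in the Background section.

```r
library(dplyr)

# Hypothetical helper (an assumption, not from the original notebook):
# mean rating of the k highest-rated bars for a single bean origin.
top_k_mean <- function(data, origin, k) {
  data %>%
    filter(bean_origin == origin) %>%
    # keep exactly k rows even when several bars share the same rating
    slice_max(rating, n = k, with_ties = FALSE) %>%
    summarize(n_bars = n(), mean_rating = mean(rating))
}

# Example call, assuming `df` has been loaded as shown earlier:
# top_k_mean(df, "Solomon Islands", k = 5)
```

Comparing the result for different values of `k` shows how strongly a country's average depends on which bars happen to be in the sample.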
#Is the cacao bean's origin an indicator of quality?
mean_ratings_by_origin %>% inner_join(number_ratings_by_origin, by = "bean_origin") %>% arrange(desc(mean_rating)) %>% head(5)
df %>% filter(bean_origin %in% c("Tobago", "Solomon Islands")) %>% select(c("id", "bean_origin", "bar_name", "rating")) %>% arrange(desc(bean_origin), rating)
Mean Rating by Number of Chocolate Bar Submissions
On the scatterplot below, it is easier to see that some of the countries with the highest and lowest average ratings had only a few chocolate bars submitted for review. In contrast, the countries that submitted 50 or more chocolate bars tend to have "middle of the road" average ratings, which fall in the Recommended category shown between the red lines.
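The same pattern can be summarized numerically. The sketch below is an illustration under stated assumptions: `spread_by_sample_size` is a hypothetical helper, and the default cutoff of 50 bars simply mirrors the threshold mentioned in the text.

```r
library(dplyr)

# Hypothetical helper (an assumption, not from the original notebook):
# compare the range of per-origin mean ratings for origins with fewer than
# `threshold` bars against origins with at least `threshold` bars.
spread_by_sample_size <- function(data, threshold = 50) {
  data %>%
    group_by(bean_origin) %>%
    summarize(n_bars = n(), mean_rating = mean(rating), .groups = "drop") %>%
    mutate(sample_size = if_else(n_bars >= threshold, "large", "small")) %>%
    group_by(sample_size) %>%
    summarize(
      origins = n(),
      lowest_mean = min(mean_rating),
      highest_mean = max(mean_rating),
      .groups = "drop"
    )
}

# Example call, assuming `df` has been loaded as shown earlier:
# spread_by_sample_size(df, threshold = 50)
```

If small samples really drive the extremes, the "small" row should show a wider gap between `lowest_mean` and `highest_mean` than the "large" row.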
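# Red lines mark the Recommended range (3.0 to 3.49); color is set outside aes() so the lines are actually drawn in red
mean_ratings_by_origin %>% inner_join(number_ratings_by_origin, by = "bean_origin") %>% ggplot(aes(n, mean_rating)) + geom_point() + geom_hline(yintercept = 3, color = "red") + geom_hline(yintercept = 3.49, color = "red")
How does cocoa content relate to rating?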
In general, cocoa content does seem to have some relation to overall chocolate bar rating. Unpleasant ratings include bars with a wide variety of cocoa content (anywhere from 55 to 100% cocoa). Outstanding ratings include bars from a narrower cocoa content range (60 to 90%). The average cocoa content for chocolate bars with ratings greater than 3.5 is about 71%. However, there are outliers in many categories, so it is possible that a higher or lower cocoa content could still produce a well-rated chocolate bar.
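One way to produce the per-category cocoa ranges quoted above is to bin each rating into its category and summarize `cocoa_percent` within each bin. This is a minimal sketch rather than the original analysis code: `cocoa_by_rating_category` is a hypothetical helper, the bin edges are taken from the rating scale in the Background section, and `cocoa_percent` is assumed to be numeric (for example, 70 for 70%).

```r
library(dplyr)

# Hypothetical helper (an assumption, not from the original notebook):
# label each bar with its rating category, then summarize cocoa content.
cocoa_by_rating_category <- function(data) {
  data %>%
    mutate(
      rating_category = cut(
        rating,
        breaks = c(1, 2, 3, 3.5, 4, 5),
        labels = c("Unpleasant", "Disappointing", "Recommended",
                   "Highly Recommended", "Outstanding"),
        right = FALSE,         # intervals are [low, high)
        include.lowest = TRUE  # so the top bin also includes 5.0
      )
    ) %>%
    group_by(rating_category) %>%
    summarize(
      n_bars = n(),
      min_cocoa = min(cocoa_percent),
      mean_cocoa = mean(cocoa_percent),
      max_cocoa = max(cocoa_percent),
      .groups = "drop"
    )
}

# Example call, assuming `df` has been loaded as shown earlier:
# cocoa_by_rating_category(df)
```

The `min_cocoa` and `max_cocoa` columns give the range for each category, and `mean_cocoa` can be compared across categories to see whether higher-rated bars cluster near a particular cocoa content.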