Skip to content
Epicurious Flavor Keyword Analysis
Epicurious.com Flavor Keyword Analysis - The Unsung Heroes
Epicurious.com is a popular food blog and cooking site that features fun cooking videos, home cooking advice columns, and over 33,000 recipes. In a recent analysis, I discovered and downloaded a dataset from Kaggle with over 7,000 unique recipe names scraped from the blog site (last updated 2019). The dataset also included arrays of flavors associated with the recipes, an average rating, and an average user success score (average percent chance a user would make the recipe again).
I wanted to see if I could establish a correlation between distinct flavors, how many times they were included in the list of recipes, and average rating and user success scores. In later parts, I also wanted to try and establish which flavor combinations were associated with higher ratings and user success scores. The primary problem I faced when cleaning this dataset was that the flavor column was inconsistent with its keywords, as some recipes had no flavor keywords listed, some had only one, while others had up to eight distinct flavors listed in a single cell, listing even the same ones multiple times in some cases. To complete an accurate analysis, I utilized some non-standard SQL functions such as "SPLIT" and "UNNEST", which will help seperate field strings listing multiple values and concatenate distinct flavors into their own fields.
SQL functions such as "SPLIT" and "UNNEST", which will help seperate field strings listing multiple values and concatenate distinct flavors into their own fields.Part I: Seperating Distinct Flavors
First, we will run a basic query to preview the dataset as it is.
DataFrameas
recipes
variable
SELECT *
FROM 'recipes.csv'
ORDER BY rating DESC, success DESC;Our first issue is recognizing that in some cases, the flavors in this original dataset list the same one multiple times within the same field. We'll need to query and save a new, referenceable dataset that removes the duplicates within the same field in the flavors column. This query also removes records where the flavor field is listed as null.
DataFrameas
df4
variable
SELECT
recipe_name,
TRIM(CASE
WHEN COUNT(DISTINCT word) > 1 THEN STRING_AGG(DISTINCT word, ' ')
ELSE MAX(word) -- Preserving single-word flavors
END) AS flavors,
ROUND(AVG(rating), 2) AS rating,
ROUND(AVG(success), 2) AS success
FROM (
SELECT
recipe_name,
flavors,
UNNEST(SPLIT(flavors, ' ')) AS word,
rating,
success
FROM
recipes.csv
WHERE
flavors IS NOT NULL
)
GROUP BY
recipe_name, flavors
ORDER BY
rating DESC, success DESC;
Saved to new dataset:
DataFrameas
new_data
variable
SELECT *
FROM 'new_data.csv'
ORDER BY rating DESC, success DESC;Now that we have a new dataset with the individual field duplicates removed, we can continue running our analysis. This time, we will run a query that will split every word mentioned in the flavors field with a single space as the delimeter, then count how many times every distinct word is mentioned as it's own field. There is one typo with the word "tomatoe," which we have also fixed. As you can see with the query, of the 97 distinct flavors, "vanilla" is still the most mentioned flavor by a large margin, but now it is only being compared to other individual flavors.
DataFrameas
df
variable
SELECT
REPLACE(word, 'tomatoe', 'tomato') AS word,
COUNT(*) AS word_count
FROM (
SELECT
UNNEST(SPLIT(flavors, ' ')) AS word
FROM
new_data.csv
) AS t
WHERE word IS NOT NULL
GROUP BY
REPLACE(word, 'tomatoe', 'tomato')
ORDER BY word_count DESC;
Now let's calculate the average ratings and user success scores for all recipes that mention distinct flavors to try and establish a correlation between associated flavors and these averages.
DataFrameas
flavor_ratings
variable
SELECT
REPLACE(flavor_name, 'tomatoe', 'tomato') AS flavor,
COUNT(*) AS flavor_count,
ROUND(AVG(rating),2) AS avg_rating,
ROUND(AVG(success),2) AS avg_success
FROM (
SELECT
recipe_name,
UNNEST(SPLIT(REPLACE(flavors, 'tomatoe', 'tomato'), ' ')) AS flavor_name,
rating,
success
FROM
new_data.csv
)
GROUP BY
REPLACE(flavor_name, 'tomatoe', 'tomato')
ORDER BY
flavor_count DESC;
By this analysis, we can conclude that though the flavor "vanilla" is the most frequently mentioned flavor in this list of recipes, recipes actually mentioning either garlic or cola are associated with a higher average rating (3.75) despite only having been mentioned twice each. Furthermore, recipes having mentioned tomatoes as a flavor are associated with a higher average user success rating (0.84). In Part II we will dive deeper into this analysis by dividing the top 6 flavors with the highest count (vanilla, chocolate, lemon, cinnamon, orange, and almond) into their distinct flavor combinations to determine how additions to these flavors compare to each other and may affect ratings/success scores.
As a fun addition to this analysis, I decided to utilize Python code to quickly create a node map of every distinct flavor to illustrate all of the flavor combinations that were represented in the original dataset. The cool to warm color map key spreads from flavors LEAST included or mentioned in a combination (dark blue) to the most listed flavor choice (dark red) which, as is confirmed by our SQL analysis, is vanilla.
import pandas as pd
import itertools
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
# Load Data
df = pd.read_csv('new_data.csv')
# Handle NaN values in 'flavors' column
df['flavors'] = df['flavors'].fillna('')
# Preprocess Data
flavor_combinations = []
for flavors in df['flavors']:
flavor_list = flavors.split(' ') # Split by single space
flavor_list = [flavor.strip() for flavor in flavor_list if flavor.strip()] # Remove empty strings
if len(flavor_list) < 2:
continue # Skip if less than 2 flavors
flavor_combinations.extend(list(itertools.combinations(flavor_list, 2)))
# Create Network Graph
G = nx.Graph()
G.add_edges_from(flavor_combinations)
# Calculate node colors based on degree
node_degrees = dict(G.degree())
max_degree = max(node_degrees.values())
min_degree = min(node_degrees.values())
# Define a color map based on degree using coolwarm colormap
color_map = []
for node in G.nodes():
degree = node_degrees[node]
normalized_degree = (degree - min_degree) / (max_degree - min_degree) # Normalize degree between 0 and 1
color = plt.cm.coolwarm(normalized_degree) # Use coolwarm colormap
color_map.append(color)
# Calculate node sizes based on degree
node_sizes = [75 * degree for node, degree in node_degrees.items()] # Adjust multiplier as needed
# Create a blank lavender-colored figure
fig = plt.figure(figsize=(14,16), facecolor='peachpuff')
# Visualize Graph with Fruchterman-Reingold layout on the lavender-colored figure
pos = nx.spring_layout(G, k=5.8, iterations=125, seed=42) # Adjust k and iterations
# Draw nodes and edges with adjusted sizes
nx.draw(G, pos, with_labels=False, node_color=color_map, node_size=node_sizes, edge_color='gray', alpha=0.7)
# Draw node labels with dynamically adjusted font size
for node, (x, y) in pos.items():
degree = node_degrees[node]
font_size = min(max(degree, 6), 12) # Adjust font size based on node degree (clamped between 8 and 16)
plt.text(x, y, node, fontsize=font_size, ha='center', va='center', color='black')
plt.title('Flavor Combinations Web')
plt.axis('off')
plt.show()