Epicurious Flavor Keyword Analysis

Epicurious.com Flavor Keyword Analysis - The Unsung Heroes

Epicurious.com is a popular food blog and cooking site that features fun cooking videos, home cooking advice columns, and over 33,000 recipes. In a recent analysis, I discovered and downloaded a dataset from Kaggle with over 7,000 unique recipe names scraped from the blog site (last updated 2019). The dataset also included arrays of flavors associated with the recipes, an average rating, and an average user success score (average percent chance a user would make the recipe again).

I wanted to see if I could establish a correlation between distinct flavors, how many times they were included in the list of recipes, and average rating and user success scores. In later parts, I also wanted to try and establish which flavor combinations were associated with higher ratings and user success scores. The primary problem I faced when cleaning this dataset was that the flavor column was inconsistent with its keywords, as some recipes had no flavor keywords listed, some had only one, while others had up to eight distinct flavors listed in a single cell, listing even the same ones multiple times in some cases. To complete an accurate analysis, I utilized some non-standard `SQL` functions such as "SPLIT" and "UNNEST", which will help seperate field strings listing multiple values and concatenate distinct flavors into their own fields.

Part I: Seperating Distinct Flavors

First, we will run a basic query to preview the dataset as it is.

DataFrameas

recipes

variable

SELECT *
FROM 'recipes.csv'
ORDER BY rating DESC, success DESC;

Our first issue is recognizing that in some cases, the flavors in this original dataset list the same one multiple times within the same field. We'll need to query and save a new, referenceable dataset that removes the duplicates within the same field in the flavors column. This query also removes records where the flavor field is listed as null.

DataFrameas

df4

variable

SELECT 
  recipe_name,
  TRIM(CASE 
         WHEN COUNT(DISTINCT word) > 1 THEN STRING_AGG(DISTINCT word, ' ')
         ELSE MAX(word) -- Preserving single-word flavors
       END) AS flavors,
  ROUND(AVG(rating), 2) AS rating,
  ROUND(AVG(success), 2) AS success
FROM (
  SELECT 
    recipe_name,
	flavors,
    UNNEST(SPLIT(flavors, ' ')) AS word,
    rating,
    success
  FROM 
    recipes.csv
  WHERE 
    flavors IS NOT NULL
)
GROUP BY 
  recipe_name, flavors
ORDER BY 
  rating DESC, success DESC;

Saved to new dataset:

DataFrameas

new_data

variable

SELECT *
FROM 'new_data.csv'
ORDER BY rating DESC, success DESC;

Now that we have a new dataset with the individual field duplicates removed, we can continue running our analysis. This time, we will run a query that will split every word mentioned in the flavors field with a single space as the delimeter, then count how many times every distinct word is mentioned as it's own field. There is one typo with the word "tomatoe," which we have also fixed. As you can see with the query, of the 97 distinct flavors, "vanilla" is still the most mentioned flavor by a large margin, but now it is only being compared to other individual flavors.

DataFrameas

df

variable

SELECT
  REPLACE(word, 'tomatoe', 'tomato') AS word,
  COUNT(*) AS word_count
FROM (
  SELECT
    UNNEST(SPLIT(flavors, ' ')) AS word
  FROM
    new_data.csv
) AS t
WHERE word IS NOT NULL
GROUP BY
  REPLACE(word, 'tomatoe', 'tomato')
ORDER BY word_count DESC;

Now let's calculate the average ratings and user success scores for all recipes that mention distinct flavors to try and establish a correlation between associated flavors and these averages.

DataFrameas

flavor_ratings

variable

SELECT
  REPLACE(flavor_name, 'tomatoe', 'tomato') AS flavor,
  COUNT(*) AS flavor_count,
  ROUND(AVG(rating),2) AS avg_rating,
  ROUND(AVG(success),2) AS avg_success
FROM (
  SELECT
    recipe_name,
    UNNEST(SPLIT(REPLACE(flavors, 'tomatoe', 'tomato'), ' ')) AS flavor_name,
    rating,
    success
  FROM
    new_data.csv
)
GROUP BY
  REPLACE(flavor_name, 'tomatoe', 'tomato')
ORDER BY
  flavor_count DESC;

By this analysis, we can conclude that though the flavor "vanilla" is the most frequently mentioned flavor in this list of recipes, recipes actually mentioning either garlic or cola are associated with a higher average rating (3.75) despite only having been mentioned twice each. Furthermore, recipes having mentioned tomatoes as a flavor are associated with a higher average user success rating (0.84). In Part II we will dive deeper into this analysis by dividing the top 6 flavors with the highest count (vanilla, chocolate, lemon, cinnamon, orange, and almond) into their distinct flavor combinations to determine how additions to these flavors compare to each other and may affect ratings/success scores.

As a fun addition to this analysis, I decided to utilize Python code to quickly create a node map of every distinct flavor to illustrate all of the flavor combinations that were represented in the original dataset. The cool to warm color map key spreads from flavors LEAST included or mentioned in a combination (dark blue) to the most listed flavor choice (dark red) which, as is confirmed by our SQL analysis, is vanilla.

import pandas as pd
import itertools
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Load Data
df = pd.read_csv('new_data.csv')

# Handle NaN values in 'flavors' column
df['flavors'] = df['flavors'].fillna('')

# Preprocess Data
flavor_combinations = []
for flavors in df['flavors']:
    flavor_list = flavors.split(' ')  # Split by single space
    flavor_list = [flavor.strip() for flavor in flavor_list if flavor.strip()]  # Remove empty strings
    if len(flavor_list) < 2:
        continue  # Skip if less than 2 flavors
    flavor_combinations.extend(list(itertools.combinations(flavor_list, 2)))

# Create Network Graph
G = nx.Graph()
G.add_edges_from(flavor_combinations)

# Calculate node colors based on degree
node_degrees = dict(G.degree())
max_degree = max(node_degrees.values())
min_degree = min(node_degrees.values())

# Define a color map based on degree using coolwarm colormap
color_map = []
for node in G.nodes():
    degree = node_degrees[node]
    normalized_degree = (degree - min_degree) / (max_degree - min_degree)  # Normalize degree between 0 and 1
    color = plt.cm.coolwarm(normalized_degree)  # Use coolwarm colormap
    color_map.append(color)

# Calculate node sizes based on degree
node_sizes = [75 * degree for node, degree in node_degrees.items()]  # Adjust multiplier as needed

# Create a blank lavender-colored figure
fig = plt.figure(figsize=(14,16), facecolor='peachpuff')

# Visualize Graph with Fruchterman-Reingold layout on the lavender-colored figure
pos = nx.spring_layout(G, k=5.8, iterations=125, seed=42)  # Adjust k and iterations

# Draw nodes and edges with adjusted sizes
nx.draw(G, pos, with_labels=False, node_color=color_map, node_size=node_sizes, edge_color='gray', alpha=0.7)

# Draw node labels with dynamically adjusted font size
for node, (x, y) in pos.items():
    degree = node_degrees[node]
    font_size = min(max(degree, 6), 12)  # Adjust font size based on node degree (clamped between 8 and 16)
    plt.text(x, y, node, fontsize=font_size, ha='center', va='center', color='black')

plt.title('Flavor Combinations Web')
plt.axis('off')
plt.show()

‌
‌
‌

Epicurious Flavor Keyword Analysis

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Epicurious.com Flavor Keyword Analysis - The Unsung Heroes

Part I: Seperating Distinct Flavors

First, we will run a basic query to preview the dataset as it is.

Saved to new dataset:

Now let's calculate the average ratings and user success scores for all recipes that mention distinct flavors to try and establish a correlation between associated flavors and these averages.

Epicurious.com Flavor Keyword Analysis - The Unsung Heroes