Recipe Site Traffic

Tasty Bytes is a growing subscription-based meal planning and delivery company that started as a recipe search engine during the pandemic. Now, they're looking to use data science to improve homepage recipe selection—specifically, to predict which recipes will drive high traffic and increase subscriptions. My task is to develop a model that can accurately identify popular recipes, with a target of at least 80% prediction accuracy.

Data Validation

The dataset has 947 rows and 8 columns. I validated every variable against the data dictionary and made changes where the data did not match the description:

  • recipe: 947 recipes without missing values, same as the description. No cleaning is needed.
  • calories: 52 missing values, corresponding to 5.5% of the dataset. I dropped these rows.
  • carbohydrate: 52 missing values, corresponding to 5.5% of the dataset. I dropped these rows.
  • sugar: 52 missing values, corresponding to 5.5% of the dataset. I dropped these rows.
  • protein: 52 missing values, corresponding to 5.5% of the dataset. I dropped these rows.
  • category: 11 distinct categories with no missing values, rather than the 10 stated in the original description. I combined the "Chicken Breast" category with "Chicken," as they represent the same type of item.
  • servings: 6 distinct string values with no missing values. I replaced the non-numeric entries ("4 as a snack", "6 as a snack") with their numeric values (4, 6) and converted the column to int.
  • high_traffic: 373 missing values. I converted NaN values to "Low", giving a binary column ("High" = 1, "Low" = 0).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

df = pd.read_csv("recipe_site_traffic_2212.csv")

# Make a copy to keep the original
df_clean = df.copy()
print(df_clean.info())
df_clean.head()
df_clean["recipe"].nunique()
print(df_clean["high_traffic"].isnull().sum())

# Fill missing high_traffic values with "Low" so the column has two classes
df_clean['high_traffic'] = df_clean['high_traffic'].fillna("Low")

print(df_clean['high_traffic'].unique())
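
The fill step above leaves the column as the strings "High" and "Low". The 1/0 encoding described in the validation list could be applied with a mapping along these lines. This is a minimal sketch on a toy Series (the values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the high_traffic column (hypothetical values)
high_traffic = pd.Series(["High", np.nan, "High", np.nan])

# Step 1: treat missing entries as "Low"
filled = high_traffic.fillna("Low")

# Step 2: encode the two classes as integers ("High" = 1, "Low" = 0)
binary = filled.map({"High": 1, "Low": 0}).astype(int)

print(binary.tolist())
```

Using an explicit dictionary keeps the encoding readable and makes any unexpected label surface as NaN before the integer cast.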
# Number of missing values
df_clean.isnull().sum()
# Rows with at least one missing value
missing_values = df_clean[df_clean.isnull().any(axis=1)]

# Calculate percent of missing values
percent_missing = (missing_values.isnull().sum() / len(df_clean)) * 100
print(percent_missing)

The missing values in calories, carbohydrate, sugar, and protein all fall in the same 52 rows, which account for only 5.5% of the dataset. Because this is a small share, dropping these rows is a reasonable way to get complete records without significantly affecting the analysis.
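
The claim that the gaps share the same rows can be checked directly rather than read off the counts. The following is a minimal sketch on a toy frame (the values and the name `toy` are hypothetical, not the real data); the same logic would apply to `df_clean`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the recipe data (hypothetical values)
toy = pd.DataFrame({
    "recipe": [1, 2, 3, 4],
    "calories": [35.5, np.nan, 914.3, np.nan],
    "carbohydrate": [38.6, np.nan, 42.7, np.nan],
    "sugar": [0.7, np.nan, 3.1, np.nan],
    "protein": [0.9, np.nan, 2.9, np.nan],
})

nutrient_cols = ["calories", "carbohydrate", "sugar", "protein"]
missing_mask = toy[nutrient_cols].isnull()

# If the gaps share the same rows, every row is either fully missing
# or fully present across the four nutrient columns
rows_any_missing = missing_mask.any(axis=1)
rows_all_missing = missing_mask.all(axis=1)
same_rows = bool((rows_any_missing == rows_all_missing).all())

# Share of rows that a dropna() would remove, in percent
pct_dropped = rows_any_missing.mean() * 100
print(same_rows, pct_dropped)
```

If `same_rows` came back False, the four columns would have gaps in different rows and dropping them would remove more than the per-column count suggests.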

# Remove rows with missing values
df_clean.dropna(axis=0,inplace=True)
df_clean.shape
df_clean["category"].unique()

I combined the "Chicken Breast" category with "Chicken," as they represent the same type of item.

# Merge category Chicken Breast into Chicken
df_clean['category'] = df_clean['category'].replace('Chicken Breast', 'Chicken')
print(df_clean["servings"].unique())
df_clean["servings"] = df_clean["servings"].replace({'4 as a snack': '4', '6 as a snack': '6'}).astype(int)
print(df_clean["servings"].unique())
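
The explicit replacement above works because only two non-numeric labels occur. As an alternative sketch, assuming every entry starts with its serving count, the leading digits could be extracted with a regular expression so that other suffixes would be handled too. The Series below is a toy example:

```python
import pandas as pd

# Toy stand-in for the servings column (hypothetical ordering of values)
servings = pd.Series(["6", "4", "1", "2", "4 as a snack", "6 as a snack"])

# Keep only the leading run of digits, then convert to integers
servings_int = servings.str.extract(r"^(\d+)", expand=False).astype(int)

print(servings_int.tolist())
```

The trade-off is that a regex silently accepts any text after the number, whereas the explicit dictionary fails loudly on an unexpected label.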
# Check summary statistics (min in particular) for negative values in numeric variables
df_clean.describe()
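
Reading the `min` row of `describe()` is a manual check. A programmatic version might count negative entries per numeric column, as in this minimal sketch on a toy frame (hypothetical values, not the real data):

```python
import pandas as pd

# Toy stand-in for the cleaned numeric columns (hypothetical values)
toy = pd.DataFrame({
    "calories": [35.5, 914.3, 97.0],
    "carbohydrate": [38.6, 42.7, 30.6],
    "sugar": [0.7, 3.1, 38.6],
    "protein": [0.9, 2.9, 0.02],
    "servings": [4, 1, 2],
})

# Count negative entries in every numeric column
numeric_cols = toy.select_dtypes(include="number").columns
negative_counts = (toy[numeric_cols] < 0).sum()
has_negative = bool(negative_counts.any())

print(negative_counts)
print(has_negative)
```

A nonzero count in any column would point to rows worth inspecting before modelling.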