Recipe Site Traffic Prediction: A Data-Driven Approach
1. Executive Summary
Introduction
This report details a data science project undertaken for Tasty Bytes, a subscription-based internet business where increased website traffic translates directly into higher revenue. The Product Manager observed that featuring popular recipes on the homepage boosts site traffic but struggled to identify such recipes consistently. This report describes how that challenge was addressed by:
- automating the prediction of high-traffic recipes, with the specific goal of achieving an 80% correct prediction rate
- minimizing the display of unpopular recipes on the homepage.
Methods
To achieve the project goals, a dataset of 947 recipes with 7 features was analyzed. The high_traffic column was identified as the target variable for predicting high or low traffic. Initial analysis revealed a class imbalance, with 61% of recipes classified as "High" traffic, which was addressed through data balancing techniques. Exploratory data analysis (EDA) demonstrated relationships between high_traffic and other features like category and servings, indicating their predictive potential.
Two models were developed: Logistic Regression (LR) as a baseline and Random Forest (RF) for comparison. Model evaluation focused on the Area Under the Curve - Receiver Operating Characteristic (AUC-ROC) score and, crucially, Precision for the "High" traffic class, as it directly aligns with the business objective of minimizing false alarms (unpopular recipes displayed as high traffic).
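The two evaluation metrics named above can be computed with scikit-learn as follows. The labels and scores here are hypothetical, chosen only to show the calls; they are not the project's results.

```python
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

# Hypothetical ground truth and predicted probabilities (1 = "High" traffic)
y_true  = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.2, 0.85, 0.3, 0.4, 0.1, 0.75])
y_pred  = (y_score >= 0.5).astype(int)

# Precision for the "High" class: of recipes predicted High, how many truly are.
# This is the metric tied to the 80% business target.
print(precision_score(y_true, y_pred))

# AUC-ROC measures ranking quality across all thresholds, so it takes the
# raw scores rather than the thresholded predictions
print(roc_auc_score(y_true, y_score))
```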
Results
Logistic Regression outperformed Random Forest, achieving an AUC-ROC score of 79.9% and a precision of 82% for the "High" traffic class. This precision surpasses the business target of 80%, indicating that when LR predicts a recipe as high traffic, 82% of the time it genuinely is.
Furthermore, LR produced fewer false positives (18) compared to RF (27), directly addressing the business need to minimize unpopular recipes on the homepage.
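The false-positive counts above come from each model's confusion matrix. A small sketch of how that count is read off, using toy labels rather than the project's predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy labels only. With labels ordered ["Low", "High"], ravel() returns
# tn, fp, fn, tp, where fp counts Low-traffic recipes wrongly flagged as
# High -- the costly error for the homepage
y_true = ["High", "High", "Low", "Low", "High", "Low"]
y_pred = ["High", "Low",  "Low", "High", "High", "Low"]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["Low", "High"]).ravel()
print(fp)  # → 1
```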
The feature importance analysis for the Logistic Regression model highlighted category_Beverages as the most influential feature, followed by other category features, protein, and sugar. Numerical features like servings, carbohydrates, and calories had negligible contributions.
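For a linear model like Logistic Regression, feature importance is typically read from the magnitude of the fitted coefficients (meaningful once features are on comparable scales). The sketch below uses a tiny synthetic frame with hypothetical column names echoing those above; it shows the ranking recipe, not the project's actual fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: the target is driven almost entirely by the
# first column, so its coefficient should dominate
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["category_Beverages", "protein", "sugar"])
y = (X["category_Beverages"] + 0.3 * rng.normal(size=200) > 0).astype(int)

lr = LogisticRegression().fit(X, y)

# Rank features by absolute coefficient size
importance = pd.Series(np.abs(lr.coef_[0]), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```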
Recommendations
Based on these findings, it is recommended that Tasty Bytes deploy the Logistic Regression model into production. Ongoing monitoring of precision in production, coupled with A/B testing against the Product Manager's current selection method, is crucial. Further considerations include feature engineering, model parameter optimization, exploring alternative algorithms, understanding the cost of false positives versus false negatives, and integrating external factors like seasonality and trends into the content strategy.
2. Data Validation and Cleaning
The original data has 947 rows and 8 columns. After validation and cleaning, both counts were unchanged. The following describes the checks and cleaning applied to each column:
- Recipe: Integer data type as expected; holds the unique ID of each recipe, running from 1 to 947 as expected. No cleaning was needed.
- Calories: Float data type as expected; contained 52 missing values, which I imputed with the median of each recipe's category.
- Carbohydrate: Float data type as expected; contained 52 missing values, imputed with the category median.
- Sugar: Float data type as expected; contained 52 missing values, imputed with the category median.
- Protein: Float data type as expected; contained 52 missing values, imputed with the category median.
- Category: Object data type as expected; initially had 11 groups, so I merged 'Chicken Breast' into 'Chicken' to restore the expected 10 groups.
- Servings: Initially object data type; 3 rows held string entries (e.g. '4 as a snack') while the rest were integers. I dropped the 'as a snack' part, kept the numbers, and converted the column to integer with 4 unique values, preserving the row count of 947.
- High_traffic: Object data type as expected; contained only the value 'High' plus 373 missing (NaN) entries, which I converted to 'Low'.
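The servings cleanup described above can be sketched with a regular expression that keeps only the leading digits. This is a minimal illustration on made-up entries, not the project's exact code:

```python
import pandas as pd

# Made-up entries mimicking the three string rows described above
servings = pd.Series(["4", "6", "4 as a snack", "2", "6 as a snack"])

# Keep only the leading digits, then cast to integer; every row keeps a value,
# so the overall row count is preserved
servings_clean = servings.str.extract(r"(\d+)")[0].astype(int)
print(servings_clean.tolist())  # → [4, 6, 4, 2, 6]
```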
# importing all the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, confusion_matrix
from imblearn.over_sampling import SMOTE
# reading the csv file and naming the dataframe df
df = pd.read_csv('recipe_site_traffic_2212.csv')
# visualising the first 10 rows of the data
df.head(10)

# visualising the data types and missing values
df.info()

# validating that column recipe has no duplicates:
# the number of unique values should equal the number of rows
df['recipe'].nunique(), df.shape[0]

# validating that column recipe runs from 1 to 947
df['recipe'].unique()

# checking for missing values in the column calories
df['calories'].isnull().sum()

# imputing missing values with the median of each recipe's category;
# the median is robust against outliers, preserving the original distribution
df['calories'] = df.groupby('category')['calories'].transform(lambda x: x.fillna(x.median()))

# validating no more missing values in column calories
df['calories'].isnull().sum()

# checking for missing values in the column carbohydrate
df['carbohydrate'].isnull().sum()

# imputing missing values with the category median, as above
df['carbohydrate'] = df.groupby('category')['carbohydrate'].transform(lambda x: x.fillna(x.median()))

# validating no more missing values
df['carbohydrate'].isnull().sum()

# checking for missing values in the column sugar
df['sugar'].isnull().sum()

# imputing missing values with the category median, as above
df['sugar'] = df.groupby('category')['sugar'].transform(lambda x: x.fillna(x.median()))

# validating no more missing values
df['sugar'].isnull().sum()

# checking for missing values in the column protein
df['protein'].isnull().sum()