Recipe Site Traffic Analysis
Objective:
This analysis aims to develop a predictive model to identify which recipes on the Tasty Bytes website will bring high traffic to the site once they are posted on the home page. The business targets to get the popular recipes right at least 80 percent of the time, and reduce the chances of displaying unpopular recipes.
1. Data Validation and Cleaning
Dataset Overview:
- Total Records: 947
- Total Columns: 8
- Target Variable: high_traffic
- Features: Calories, Carbohydrate, Sugar, Protein, Category, Servings
Initial Findings:
- Missing values were present in the numerical features (calories, carbohydrate, sugar, protein) and the target variable (high_traffic).
- The category and servings columns contained consistent categorical values.
Cleaning Actions:
-
Converted all numeric columns to numeric type and imputed missing values with category-wise median.
-
Standardized category entries by trimming spaces and converting to lowercase.
-
Converted high_traffic into a binary label:
-
1 - โhighโ
-
0 - non-high or missing
-
-
Outliers were identified and removed using the Z-score method.
Results After Cleaning
-
Data shape after outlier removal: (874, 9)
-
Target distribution:
-
1 (High Traffic): 524
-
0 (Low Traffic): 350
-
-
No missing values remained in the cleaned dataset.
-
Final cleaned dataset shape: (874, 9)
2. Exploratory Data Analysis (EDA)
Single-Variable Insights:
-
Calories Distribution: Positively skewed, with most recipes between 100โ600 calories per serving.
-
Category Distribution: Most recipes belonged to Breakfast, Chicken Breast, and Beverages categories.
Multi-Variable Insight:
A boxplot of calories versus high traffic revealed that high-traffic recipes generally had moderate-to-high calorie content, suggesting that richer meals may attract more engagement.
3. Model Development
Problem Type:
Binary classification โ predicting whether a recipe leads to High (1) or Low (0) traffic.
Data Split:
-
Training set: 80%
-
Testing set: 20%
-
Stratified by target variable to maintain class balance.
Models Used:
-
Logistic Regression (Baseline): Chosen for interpretability and as a linear benchmark.
max_iter=1000 ensures full convergence.
-
Random Forest Classifier (Comparison): Selected to capture non-linear relationships and feature interactions.
n_estimators=100 provides stable and accurate predictions through ensemble averaging.
4. Model Evaluation and Comparison
| Metric | Logistic Regression | Random Forest |
|---|---|---|
| Accuracy | 0.7886 | 0.7029 |
| Precision | 0.8091 | 0.7677 |
| Recall | 0.8476 | 0.7238 |
| F1 Score | 0.8279 | 0.7451 |
| ROC AUC | 0.8497 | 0.8213 |
Interpretation Both models were evaluated on identical test data, revealing Logistic Regression as the stronger performer across nearly all key metrics:
-
Logistic Regression achieved a higher accuracy (78.9%) compared to Random Forest (70.3%).
-
It also exhibited superior precision (80.9%) and recall (84.8%), indicating that it correctly identifies a larger proportion of truly positive (high-traffic) recipes while keeping false positives under control.
-
The ROC AUC score of 0.85 further confirms its stronger discriminative power, meaning it can better distinguish between high and low traffic recipes.
-
Random Forest, while still respectable, underperformed in both accuracy and recall, suggesting potential overfitting or less sensitivity to feature interactions in this dataset.
Confusion Matrix (Logistic Regression):
| Predicted Low Traffic | Predicted High Traffic | |
|---|---|---|
| Actual Low Traffic | 49 | 21 |
| Actual High Traffic | 16 | 89 |
Confusion Matrix (Random Forest):
| Predicted Low Traffic | Predicted High Traffic | |
|---|---|---|
| Actual Low Traffic | 47 | 23 |
| Actual High Traffic | 29 | 76 |
Interpretation Logistic Regression
-
Correctly classifies 89 high-traffic recipes, maintaining a high recall rate (84.8%).
-
Keeps false positives (21) at a manageable level.
-
Strikes an excellent balance between precision and recall โ making it highly suitable for predicting recipes for homepage promotion, where both accuracy and confidence matter.
Random Forest
-
Detects fewer true positives (76) and slightly more false positives (23).
-
Its lower recall (72.4%) indicates that it misses more genuinely popular recipes compared to Logistic Regression.
5. Business Metrics
| Model | Precision | Recall | False Positive Rate |
|---|---|---|---|
| Logistic Regression | 0.8091 | 0.8476 | 0.3000 |
| Random Forest | 0.7677 | 0.7238 | 0.3286 |
Interpretation
-
Precision (80.9%) โ When Logistic Regression predicts a recipe will be popular, itโs correct 81% of the time.
-
Recall (84.8%) โ Exceeds the business target (โฅ80%), ensuring the model identifies nearly all high-traffic recipes.
-
False Positive Rate (30%) โ Indicates an acceptable level of incorrect positive predictions for operational use.
6. Feature Importance (Random Forest)
| Rank | Feature | Importance |
|---|---|---|
| 1 | Protein | 0.1691 |
| 2 | Calories | 0.1521 |
| 3 | Carbohydrate | 0.1478 |
| 4 | Sugar | 0.1461 |
| 5 | Category: Beverages | 0.0787 |
| 6 | Servings | 0.0624 |
| 7 | Category: Vegetable | 0.0525 |
| 8 | Category: Breakfast | 0.0451 |
| 9 | Category: Potato | 0.0332 |
| 10 | Category: Pork | 0.0270 |
Interpretation
The Random Forest feature importance analysis highlights the most influential variables driving recipe popularity:
- Nutritional Attributes Dominate Predictions
-
Protein, Calories, Carbohydrates, and Sugar rank as the top four predictors.
-
Recipes rich in protein and moderate in calories tend to attract more users โ possibly due to growing interest in high-protein diets and balanced meals.
- Category-Level Factors Remain Strong Secondary Influencers
-
Categories like Beverages, Vegetables, and Breakfast exhibit notable importance, suggesting these groups have consistent consumer appeal.
-
Niche categories such as Potato and Pork also contribute meaningfully, indicating diversified user interest across food types.
- Operational Insight
-
Marketing and recommendation strategies can prioritize high-protein and nutrient-dense recipes for greater engagement.
-
Product teams can focus on optimizing presentation and availability of Beverage and Vegetable recipes to capture repeat user attention.
7. Summary and Recommendations
Findings:
-
The cleaned dataset contained 874 complete records with balanced class distribution.
-
Logistic Regression and Random Forest both achieved strong predictive performance.
-
Logistic Regression demonstrated superior precision and ROC-AUC, making it ideal for deployment.
-
Feature importance analysis confirmed that nutritional content and category type play major roles in recipe popularity.
Recommendations:
-
Deploy Logistic Regression for homepage recipe selection due to its higher precision and interpretability.
-
Use Random Forest for exploratory scenarios when maximizing recall is preferred (finding more high-traffic recipes).
-
Prioritize high-protein, moderately caloric recipes in homepage recommendations to increase engagement.
-
Monitor Precision and Recall as business KPIs to ensure continued reliability.
-
Retrain the model periodically as new recipe data becomes available to maintain predictive performance.
8. Business Impact
By deploying the Logistic Regression model, Tasty Bytes can significantly enhance its recipe promotion strategy through data-driven insights and predictive targeting:
-
Accurately identify approximately 85% of high-traffic recipes (Recall = 0.8476), ensuring that the majority of popular recipes are featured prominently on the homepage.
-
Maintain precision above 80% (0.8091) โ meaning that when the model predicts a recipe will be popular, it is correct 8 out of 10 times.
-
Limit false promotions to roughly 30% of total recommendations, minimizing wasted homepage exposure on low-engagement content.
-
Boost overall user engagement and subscription conversions by consistently showcasing recipes that are most likely to attract views, clicks, and saves.
-
Optimize marketing and content planning โ the model provides actionable insights into which nutritional and categorical factors drive user interest, guiding the creation of more engaging recipes.
Conclusion
Both models successfully meet the business objective of predicting popular recipes, but Logistic Regression offers the best balance of performance, interpretability, and precision. Feature importance insights further guide recipe strategy, highlighting the influence of nutritional richness and category type on user engagement.
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix
)
from scipy import stats# Load and Inspect Dataset
df = pd.read_csv("recipe_site_traffic_2212.csv")
print("Dataset shape:", df.shape)
print("\nPreview:\n", df.head())
print("\nMissing values:\n", df.isna().sum())
# DATA VALIDATION AND CLEANING
# Convert 'recipe' to numeric and ensure uniqueness
df['recipe'] = pd.to_numeric(df['recipe'], errors='coerce')
print("Unique recipe IDs:", df['recipe'].nunique())
# Convert numeric columns to numeric data types
numeric_cols = ['calories', 'carbohydrate', 'sugar', 'protein', 'servings']
for col in numeric_cols:
df[col] = pd.to_numeric(df[col], errors='coerce')
# Impute missing numeric values by category median (context-aware imputation)
if 'category' in df.columns:
for col in ['calories', 'carbohydrate', 'sugar', 'protein']:
df[col] = df.groupby('category')[col].transform(lambda x: x.fillna(x.median()))
# Standardize text variables
text_cols = ['category', 'high_traffic']
for col in text_cols:
if col in df.columns:
df[col] = df[col].astype(str).str.strip().str.lower()
# Clean and convert the 'servings' column to numeric
df['servings'] = (
df['servings']
.astype(str)
.str.extract('(\d+)') # Extract numeric part (ignore text like "as a snack")
.astype(float)
)
# Impute missing 'servings' by category median
df['servings'] = df.groupby('category')['servings'].transform(lambda x: x.fillna(x.median()))
# Final validation checks
print("\nโ
Data Validation Summary")
print("Dataset Shape:", df.shape)
print("\nColumn Data Types:\n", df.dtypes)
print("\nMissing Values per Column:\n", df.isna().sum())
print("\nDuplicate Recipe IDs:", df['recipe'].duplicated().sum())
print("\nPreview of Cleaned Dataset:")
print(df.head())
from scipy import stats
# Detect and remove extreme outliers using z-score
z = np.abs(stats.zscore(df[numeric_cols]))
df = df[(z < 3).all(axis=1)]
print("\nData shape after outlier removal:", df.shape)
# Fix Target Column ('high_traffic')
# 'high_traffic' contains 'high' and NaN -> treat NaN as 'low' (not high traffic)
df['high_traffic'] = df['high_traffic'].apply(lambda x: 'high' if str(x).lower() == 'high' else 'low')
# Encode binary labels: high = 1, low = 0
df['high_traffic_label'] = df['high_traffic'].map({'low': 0, 'high': 1})
# Confirm encoding
print("\nTarget variable distribution:\n", df['high_traffic_label'].value_counts())
# Check for Missing Values
print("\nFinal Missing Values:\n", df.isna().sum())
print("\nData Types:\n", df.dtypes)
print("\nPreview of Cleaned Data:\n", df.head())
# Assign cleaned data to variable for further analysis
clean_df = df.copy()
print("\nโ
clean_df successfully created. Shape:", clean_df.shape)# Exploratory Data Analysis (EDA)
# Calories distribution (Single Variable)
plt.figure(figsize=(8,4))
plt.hist(clean_df['calories'], bins=30, color='skyblue', edgecolor='black')
plt.title("Distribution of Calories")
plt.xlabel("Calories")
plt.ylabel("Count")
plt.show()
# Category frequency (Single Variable)
plt.figure(figsize=(10,4))
clean_df['category'].value_counts().plot(kind='bar', color='orange')
plt.title("Recipe Counts by Category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha='right')
plt.show()
# Calories vs High Traffic (Multi-varibale)
plt.figure(figsize=(6,5))
clean_df.boxplot(column='calories', by='high_traffic')
plt.title("Calories vs Traffic Type")
plt.suptitle('')
plt.xlabel("High Traffic (0 = Low, 1 = High)")
plt.ylabel("Calories")
plt.show()# MODEL DEVELOPMENT
# Classification (Predicting High Traffic or Not)
# Prepare Features and Target
X = clean_df[['calories', 'carbohydrate', 'sugar', 'protein', 'servings', 'category']]
y = clean_df['high_traffic_label']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Preprocessing Pipelines
numeric_features = ['calories', 'carbohydrate', 'sugar', 'protein', 'servings']
categorical_features = ['category']
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
# Define Models
lr_model = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression(max_iter=1000))
])
rf_model = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train both Models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)# MODEL EVALUATION
def evaluate_model(model, X_test, y_test, name):
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
metrics = {
'Accuracy': accuracy_score(y_test, y_pred),
'Precision': precision_score(y_test, y_pred),
'Recall': recall_score(y_test, y_pred),
'F1 Score': f1_score(y_test, y_pred),
'ROC AUC': roc_auc_score(y_test, y_proba)
}
print(f"\n{name} Performance Metrics:")
for k, v in metrics.items():
print(f"{k}: {v:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
return metrics
# Evaluate Models
metrics_logreg = evaluate_model(lr_model, X_test, y_test, "Logistic Regression")
metrics_rf = evaluate_model(rf_model, X_test, y_test, "Random Forest")
# Compare Results
results_df = pd.DataFrame([metrics_logreg, metrics_rf], index=['Logistic Regression', 'Random Forest'])
print("\nModel Comparison:\n")
print(results_df)
# Confusion matrices for both models
from sklearn.metrics import ConfusionMatrixDisplay
models = {
"Logistic Regression": lr_model,
"Random Forest": rf_model
}
for name, model in models.items():
y_pred = model.predict(X_test)
disp = ConfusionMatrixDisplay.from_estimator(
model, X_test, y_test,
display_labels=["Low Traffic", "High Traffic"],
cmap='Blues',
colorbar=True
)
plt.title(f"Confusion Matrix - {name}")
plt.show()โ
โ