
Recipe Site Traffic Analysis

Objective:

This analysis develops a predictive model to identify which recipes on the Tasty Bytes website will bring high traffic once they are posted on the home page. The business goal is to correctly identify popular recipes at least 80 percent of the time while reducing the chance of displaying unpopular recipes.

1. Data Validation and Cleaning

Dataset Overview:

  • Total Records: 947
  • Total Columns: 8
  • Target Variable: high_traffic
  • Features: Calories, Carbohydrate, Sugar, Protein, Category, Servings

Initial Findings:

  • Missing values were present in the numerical features (calories, carbohydrate, sugar, protein) and the target variable (high_traffic).
  • The category column contained consistent categorical values, while servings included occasional text entries (e.g., "4 as a snack") that required cleaning.

Cleaning Actions:

  • Converted all numeric columns to numeric types and imputed missing values with the category-wise median.

  • Standardized category entries by trimming spaces and converting to lowercase.

  • Converted high_traffic into a binary label:

    • 1 = "high"

    • 0 = non-high or missing

  • Outliers were identified and removed using the Z-score method.
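
A condensed sketch of the imputation and outlier steps above, drawn from the full code listing at the end of this report (numeric_cols is the list of numeric feature columns defined there):

# Fill numeric gaps with the median of each recipe's category
for col in ['calories', 'carbohydrate', 'sugar', 'protein']:
    df[col] = df.groupby('category')[col].transform(lambda x: x.fillna(x.median()))

# Z-score rule: drop a row if any numeric feature lies more than
# three standard deviations from its column mean (|z| >= 3)
z = np.abs(stats.zscore(df[numeric_cols]))
df = df[(z < 3).all(axis=1)]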

Results After Cleaning:

  • Data shape after outlier removal: (874, 9)

  • Target distribution:

    • 1 (High Traffic): 524

    • 0 (Low Traffic): 350

  • No missing values remained in the cleaned dataset.


2. Exploratory Data Analysis (EDA)

Single-Variable Insights:

  • Calories Distribution: Positively skewed, with most recipes between 100–600 calories per serving.

  • Category Distribution: Most recipes belonged to Breakfast, Chicken Breast, and Beverages categories.

Multi-Variable Insight:

A boxplot of calories versus high traffic revealed that high-traffic recipes generally had moderate-to-high calorie content, suggesting that richer meals may attract more engagement.

3. Model Development

Problem Type:

Binary classification: predicting whether a recipe leads to High (1) or Low (0) traffic.

Data Split:

  • Training set: 80%

  • Testing set: 20%

  • Stratified by target variable to maintain class balance.
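
The split mirrors the train_test_split call in the code listing at the end; stratify=y preserves the roughly 60/40 high/low class ratio in each subset, and 20% of 874 records yields a 175-row test set, matching the confusion-matrix totals reported below.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)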

Models Used:

  • Logistic Regression (Baseline): Chosen for interpretability and as a linear benchmark.

    max_iter=1000 gives the solver enough iterations to fully converge.

  • Random Forest Classifier (Comparison): Selected to capture non-linear relationships and feature interactions.

    n_estimators=100 provides stable and accurate predictions through ensemble averaging.

4. Model Evaluation and Comparison

Metric      Logistic Regression   Random Forest
Accuracy    0.7886                0.7029
Precision   0.8091                0.7677
Recall      0.8476                0.7238
F1 Score    0.8279                0.7451
ROC AUC     0.8497                0.8213
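
As a consistency check, F1 is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R). For Logistic Regression this gives 2 × 0.8091 × 0.8476 / (0.8091 + 0.8476) ≈ 0.8279, matching the table.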

Interpretation:

Both models were evaluated on identical test data, revealing Logistic Regression as the stronger performer across nearly all key metrics:

  • Logistic Regression achieved a higher accuracy (78.9%) compared to Random Forest (70.3%).

  • It also exhibited superior precision (80.9%) and recall (84.8%), indicating that it correctly identifies a larger proportion of truly positive (high-traffic) recipes while keeping false positives under control.

  • The ROC AUC score of 0.85 further confirms its stronger discriminative power, meaning it can better distinguish between high and low traffic recipes.

  • Random Forest, while still respectable, underperformed in both accuracy and recall, suggesting potential overfitting or less sensitivity to feature interactions in this dataset.

Confusion Matrix (Logistic Regression):

                      Predicted Low Traffic   Predicted High Traffic
Actual Low Traffic              49                      21
Actual High Traffic             16                      89

Confusion Matrix (Random Forest):

                      Predicted Low Traffic   Predicted High Traffic
Actual Low Traffic              47                      23
Actual High Traffic             29                      76

Interpretation:

Logistic Regression

  • Correctly classifies 89 high-traffic recipes, maintaining a high recall rate (84.8%).

  • Keeps false positives (21) at a manageable level.

  • Strikes an excellent balance between precision and recall, making it highly suitable for predicting recipes for homepage promotion, where both accuracy and confidence matter.

Random Forest

  • Detects fewer true positives (76) and slightly more false positives (23).

  • Its lower recall (72.4%) indicates that it misses more genuinely popular recipes compared to Logistic Regression.

5. Business Metrics

Model                 Precision   Recall   False Positive Rate
Logistic Regression   0.8091      0.8476   0.3000
Random Forest         0.7677      0.7238   0.3286

Interpretation:

  • Precision (80.9%): when Logistic Regression predicts a recipe will be popular, it is correct about 81% of the time.

  • Recall (84.8%): exceeds the business target (≥ 80%), ensuring the model identifies nearly all high-traffic recipes.

  • False Positive Rate (30%): an acceptable level of incorrect positive predictions for operational use.
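
All three business metrics follow directly from the confusion matrices above; a minimal verification sketch, with cell values copied from the matrices reported earlier:

# Rows = actual (low, high); columns = predicted (low, high)
cm_lr = np.array([[49, 21], [16, 89]])
cm_rf = np.array([[47, 23], [29, 76]])

for name, cm in [("Logistic Regression", cm_lr), ("Random Forest", cm_rf)]:
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)   # share of promoted recipes that are truly popular
    recall = tp / (tp + fn)      # share of truly popular recipes that get promoted
    fpr = fp / (fp + tn)         # share of unpopular recipes wrongly promoted
    print(f"{name}: precision={precision:.4f}, recall={recall:.4f}, FPR={fpr:.4f}")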

6. Feature Importance (Random Forest)

Rank   Feature               Importance
1      Protein               0.1691
2      Calories              0.1521
3      Carbohydrate          0.1478
4      Sugar                 0.1461
5      Category: Beverages   0.0787
6      Servings              0.0624
7      Category: Vegetable   0.0525
8      Category: Breakfast   0.0451
9      Category: Potato      0.0332
10     Category: Pork        0.0270
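
The code listing at the end stops short of this step; a minimal sketch of how the ranking could be produced from the fitted rf_model pipeline (note that the raw names carry transformer prefixes such as num__ and cat__, which were tidied for the table above):

# Pair the transformed feature names with the forest's impurity-based importances
feature_names = rf_model.named_steps['preprocessor'].get_feature_names_out()
importances = rf_model.named_steps['classifier'].feature_importances_

importance_df = (
    pd.DataFrame({'Feature': feature_names, 'Importance': importances})
    .sort_values('Importance', ascending=False)
    .head(10)
)
print(importance_df.to_string(index=False))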

Interpretation:

The Random Forest feature importance analysis highlights the most influential variables driving recipe popularity:

  1. Nutritional Attributes Dominate Predictions
  • Protein, Calories, Carbohydrate, and Sugar rank as the top four predictors.

  • Recipes rich in protein and moderate in calories tend to attract more users, possibly reflecting growing interest in high-protein diets and balanced meals.

  2. Category-Level Factors Remain Strong Secondary Influencers
  • Categories like Beverages, Vegetable, and Breakfast exhibit notable importance, suggesting these groups have consistent consumer appeal.

  • Niche categories such as Potato and Pork also contribute meaningfully, indicating diversified user interest across food types.

  3. Operational Insight
  • Marketing and recommendation strategies can prioritize high-protein and nutrient-dense recipes for greater engagement.

  • Product teams can focus on optimizing presentation and availability of Beverage and Vegetable recipes to capture repeat user attention.

7. Summary and Recommendations

Findings:

  • The cleaned dataset contained 874 complete records with a reasonably balanced class distribution (524 high-traffic vs. 350 low-traffic).

  • Logistic Regression and Random Forest both achieved strong predictive performance.

  • Logistic Regression demonstrated superior precision and ROC-AUC, making it ideal for deployment.

  • Feature importance analysis confirmed that nutritional content and category type play major roles in recipe popularity.

Recommendations:

  • Deploy Logistic Regression for homepage recipe selection due to its higher precision and interpretability.

  • Use Random Forest for exploratory analysis, such as the feature importance ranking above; its lower recall (72.4%) makes it less suited than Logistic Regression for finding high-traffic recipes.

  • Prioritize high-protein, moderately caloric recipes in homepage recommendations to increase engagement.

  • Monitor Precision and Recall as business KPIs to ensure continued reliability.

  • Retrain the model periodically as new recipe data becomes available to maintain predictive performance.

8. Business Impact

By deploying the Logistic Regression model, Tasty Bytes can significantly enhance its recipe promotion strategy through data-driven insights and predictive targeting:

  • Accurately identify approximately 85% of high-traffic recipes (Recall = 0.8476), ensuring that the majority of popular recipes are featured prominently on the homepage.

  • Maintain precision above 80% (0.8091), meaning that when the model predicts a recipe will be popular, it is correct roughly 8 out of 10 times.

  • Keep the false positive rate near 30%, so only about three in ten genuinely low-engagement recipes would be wrongly promoted, minimizing wasted homepage exposure.

  • Boost overall user engagement and subscription conversions by consistently showcasing recipes that are most likely to attract views, clicks, and saves.

  • Optimize marketing and content planning: the model provides actionable insights into which nutritional and categorical factors drive user interest, guiding the creation of more engaging recipes.

Conclusion

Logistic Regression meets the business objective of predicting popular recipes, with its recall of 84.8% clearing the 80% target and the best balance of performance, interpretability, and precision. Feature importance insights further guide recipe strategy, highlighting the influence of nutritional richness and category type on user engagement.

# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)
from scipy import stats
# Load and Inspect Dataset

df = pd.read_csv("recipe_site_traffic_2212.csv")

print("Dataset shape:", df.shape)
print("\nPreview:\n", df.head())
print("\nMissing values:\n", df.isna().sum())
# DATA VALIDATION AND CLEANING

#  Convert 'recipe' to numeric and ensure uniqueness
df['recipe'] = pd.to_numeric(df['recipe'], errors='coerce')
print("Unique recipe IDs:", df['recipe'].nunique())

#  Convert numeric columns to numeric data types
numeric_cols = ['calories', 'carbohydrate', 'sugar', 'protein', 'servings']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

#  Impute missing numeric values by category median (context-aware imputation)
if 'category' in df.columns:
    for col in ['calories', 'carbohydrate', 'sugar', 'protein']:
        df[col] = df.groupby('category')[col].transform(lambda x: x.fillna(x.median()))

#  Standardize text variables
text_cols = ['category', 'high_traffic']
for col in text_cols:
    if col in df.columns:
        df[col] = df[col].astype(str).str.strip().str.lower()

#  Clean and convert the 'servings' column to numeric
df['servings'] = (
    df['servings']
    .astype(str)
    .str.extract(r'(\d+)')   # Extract numeric part (ignore text like "as a snack")
    .astype(float)
)

#  Impute missing 'servings' by category median
df['servings'] = df.groupby('category')['servings'].transform(lambda x: x.fillna(x.median()))

#  Final validation checks
print("\nโœ… Data Validation Summary")
print("Dataset Shape:", df.shape)
print("\nColumn Data Types:\n", df.dtypes)
print("\nMissing Values per Column:\n", df.isna().sum())
print("\nDuplicate Recipe IDs:", df['recipe'].duplicated().sum())
print("\nPreview of Cleaned Dataset:")
print(df.head())

# Detect and remove extreme outliers using z-score
z = np.abs(stats.zscore(df[numeric_cols]))
df = df[(z < 3).all(axis=1)]
print("\nData shape after outlier removal:", df.shape)
#  Fix Target Column ('high_traffic')

# 'high_traffic' contains 'high' and NaN -> treat NaN as 'low' (not high traffic)
df['high_traffic'] = df['high_traffic'].apply(lambda x: 'high' if str(x).lower() == 'high' else 'low')

# Encode binary labels: high = 1, low = 0
df['high_traffic_label'] = df['high_traffic'].map({'low': 0, 'high': 1})
# Confirm encoding
print("\nTarget variable distribution:\n", df['high_traffic_label'].value_counts())
# Check for Missing Values
print("\nFinal Missing Values:\n", df.isna().sum())
print("\nData Types:\n", df.dtypes)
print("\nPreview of Cleaned Data:\n", df.head())
# Assign cleaned data to variable for further analysis 
clean_df = df.copy()
print("\nโœ… clean_df successfully created. Shape:", clean_df.shape)
# Exploratory Data Analysis (EDA)

#  Calories distribution (Single Variable)
plt.figure(figsize=(8,4))
plt.hist(clean_df['calories'], bins=30, color='skyblue', edgecolor='black')
plt.title("Distribution of Calories")
plt.xlabel("Calories")
plt.ylabel("Count")
plt.show()

#  Category frequency (Single Variable)
plt.figure(figsize=(10,4))
clean_df['category'].value_counts().plot(kind='bar', color='orange')
plt.title("Recipe Counts by Category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha='right')
plt.show()

#  Calories vs High Traffic (Multi-variable)
# Note: DataFrame.boxplot with `by` creates its own figure, and the groups
# are the string labels 'high'/'low' from the cleaned high_traffic column
clean_df.boxplot(column='calories', by='high_traffic', figsize=(6,5))
plt.title("Calories vs Traffic Type")
plt.suptitle('')
plt.xlabel("Traffic Type")
plt.ylabel("Calories")
plt.show()
#  MODEL DEVELOPMENT
#  Classification (Predicting High Traffic or Not)

# Prepare Features and Target

X = clean_df[['calories', 'carbohydrate', 'sugar', 'protein', 'servings', 'category']]
y = clean_df['high_traffic_label']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


#  Preprocessing Pipelines

numeric_features = ['calories', 'carbohydrate', 'sugar', 'protein', 'servings']
categorical_features = ['category']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)


# Define Models

lr_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])


# Train both Models

lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
#  MODEL EVALUATION

def evaluate_model(model, X_test, y_test, name):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    metrics = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred),
        'ROC AUC': roc_auc_score(y_test, y_proba)
    }
    
    print(f"\n{name} Performance Metrics:")
    for k, v in metrics.items():
        print(f"{k}: {v:.4f}")
    
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    return metrics


# Evaluate Models
metrics_logreg = evaluate_model(lr_model, X_test, y_test, "Logistic Regression")
metrics_rf = evaluate_model(rf_model, X_test, y_test, "Random Forest")
# Compare Results
results_df = pd.DataFrame([metrics_logreg, metrics_rf], index=['Logistic Regression', 'Random Forest'])
print("\nModel Comparison:\n")
print(results_df)
# Confusion matrices for both models
from sklearn.metrics import ConfusionMatrixDisplay
models = {
    "Logistic Regression": lr_model,
    "Random Forest": rf_model
}

for name, model in models.items():
    # from_estimator predicts internally, so no separate predict call is needed
    ConfusionMatrixDisplay.from_estimator(
        model, X_test, y_test,
        display_labels=["Low Traffic", "High Traffic"],
        cmap='Blues',
        colorbar=True
    )
    plt.title(f"Confusion Matrix - {name}")
    plt.show()