Recipe Site Traffic Prediction
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as mnso
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
Import the dataset:
df = pd.read_csv('recipe_site_traffic_2212.csv')
df.head()
Data Validation:
The original dataset has 947 rows and 8 columns. After validating each column, I made the changes listed below.
- Recipe: 947 unique values without missing values, similar to the description. After data cleaning, 52 rows were removed because of missing values in other columns.
- Calories: 895 non-null values, similar to the description. 52 missing values were replaced by the average value of the "calories" column grouped by "category" and "servings".
- Carbohydrate: 895 non-null values, similar to the description. 52 missing values were replaced by the average value of the "carbohydrate" column grouped by "category" and "servings".
- Sugar: 895 non-null values, similar to the description. 52 missing values were replaced by the average value of the "sugar" column grouped by "category" and "servings".
- Protein: 895 non-null values, similar to the description. 52 missing values were replaced by the average value of the "protein" column grouped by "category" and "servings".
- Category: 11 unique categories without missing values instead of 10 categories provided by the description. I merged category "Chicken Breast" into category "Chicken" because they belong to the same category.
- Servings: 6 unique categories without missing values. According to the description, "servings" must be a numeric column, not a character column. Two extra categories "4 as a snack" and "6 as a snack" were merged into "4" and "6", respectively. The column type was changed to integer.
- High-traffic: 1 unique non-null value, similar to the description. The 373 missing values were replaced with "Low".
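The group-mean replacement mentioned in the nutrient bullets can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual cell; the `calories_filled` column name is hypothetical, and the same pattern would apply to carbohydrate, sugar, and protein.

```python
import numpy as np
import pandas as pd

# Toy data (values are invented for illustration only).
toy = pd.DataFrame({
    "category": ["Pork", "Pork", "Pork", "Dessert", "Dessert"],
    "servings": [4, 4, 4, 2, 2],
    "calories": [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Mean of "calories" within each (category, servings) group,
# broadcast back to the original row order; NaN values are skipped.
group_mean = toy.groupby(["category", "servings"])["calories"].transform("mean")

# Fill only the missing entries with their group's mean.
toy["calories_filled"] = toy["calories"].fillna(group_mean)
print(toy)
```

If a whole group has no observed value, its mean is NaN and the gap stays unfilled, so a fallback (for example the overall column mean) may still be needed.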
Data Cleaning:
- Remove rows with null values in calories, carbohydrate, sugar, protein to maintain data integrity.
- Category: The "Chicken Breast" category was merged into the "Chicken" category to ensure consistency.
- Servings: The extra values "4 as a snack" and "6 as a snack" were merged into "4" and "6", respectively, and the column type was changed to integer.
- High-traffic: Replace null values with "Low".
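The cleaning steps listed above can be sketched as one small function. This is an illustrative version run on a made-up toy frame; the helper name `clean_recipes` is hypothetical and the notebook's own cells may differ.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the dataset (values are invented).
toy = pd.DataFrame({
    "recipe": [1, 2, 3, 4],
    "calories": [35.5, np.nan, 914.3, 97.0],
    "carbohydrate": [38.6, np.nan, 42.7, 30.6],
    "sugar": [0.7, np.nan, 3.1, 38.6],
    "protein": [0.9, np.nan, 2.9, 0.02],
    "category": ["Potato", "Pork", "Chicken Breast", "Beverages"],
    "servings": ["4", "6", "4 as a snack", "6 as a snack"],
    "high_traffic": ["High", "High", np.nan, "High"],
})

def clean_recipes(df):
    """Hypothetical helper applying the four cleaning steps in order."""
    nutrient_cols = ["calories", "carbohydrate", "sugar", "protein"]
    # 1. Drop rows with missing nutrient values.
    out = df.dropna(subset=nutrient_cols).copy()
    # 2. Merge "Chicken Breast" into "Chicken".
    out["category"] = out["category"].replace({"Chicken Breast": "Chicken"})
    # 3. Strip the " as a snack" suffix and convert servings to integer.
    out["servings"] = (
        out["servings"].str.replace(" as a snack", "", regex=False).astype(int)
    )
    # 4. Treat missing high_traffic as "Low".
    out["high_traffic"] = out["high_traffic"].fillna("Low")
    return out.reset_index(drop=True)

cleaned = clean_recipes(toy)
print(cleaned)
```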
Furthermore, I created four more columns that illustrate the total nutrients of each recipe. After validating and cleaning, the dataset has 895 rows and 11 columns.
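The text does not say how the total-nutrient columns are computed. One plausible reading, assumed here, is that each nutrient value is per serving and the total is that value multiplied by `servings`. The `total_*` column names and the toy values below are hypothetical.

```python
import pandas as pd

# Toy data (invented values); nutrients are assumed to be per serving.
toy = pd.DataFrame({
    "calories": [100.0, 250.0],
    "carbohydrate": [10.0, 20.0],
    "sugar": [1.0, 2.5],
    "protein": [5.0, 8.0],
    "servings": [4, 2],
})

nutrient_cols = ["calories", "carbohydrate", "sugar", "protein"]

# Assumption: total nutrient = per-serving nutrient * number of servings.
for col in nutrient_cols:
    toy[f"total_{col}"] = toy[col] * toy["servings"]

print(toy)
```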
Overview the dataset:
# Check size of dataset
df.shape
print("The dataset has {} rows and {} columns.".format(df.shape[0],df.shape[1]))
The dataset has 947 rows and 8 columns.
# Overview the dataset
df.info()
- The dataset has 8 columns: recipe, calories, carbohydrate, sugar, protein, category, servings, and high_traffic.
- Numerical columns: recipe, calories, carbohydrate, sugar, protein.
- Categorical columns: category, servings, high_traffic.
df.describe()
# Number of missing values in each column
df.isnull().sum()
The columns calories, carbohydrate, sugar, protein, and high_traffic have missing values.
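To see the share of missing values as well as the counts, a small summary table can help. The sketch below uses a made-up toy frame; the `n_missing` and `pct_missing` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with gaps in two of three columns.
toy = pd.DataFrame({
    "calories": [35.5, np.nan, 914.3, 97.0],
    "high_traffic": ["High", np.nan, np.nan, "High"],
    "servings": ["4", "6", "4", "1"],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have missing values.
missing = missing[missing["n_missing"] > 0]
print(missing)
```

The same two expressions can be applied to `df` directly to summarize the full dataset.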
Categorical columns: