
Recipe Site Traffic

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
# Load the data
recipes = pd.read_csv('recipe_site_traffic_2212.csv')
recipes.head()
recipes.shape

Data Validation

The dataset has 947 rows and 8 columns. I validated every variable and made changes where needed; not all columns matched the data dictionary.

  • recipe: 947 unique numeric values without missing values, same as the description. No cleaning is needed.
  • calories: Numeric values with 52 missing values. Missing values imputed with the mean.
  • carbohydrate: Numeric values with 52 missing values. Missing values imputed with the mean.
  • sugar: Numeric values with 52 missing values. Missing values imputed with the mean.
  • protein: Numeric values with 52 missing values. Missing values imputed with the mean.
  • category: 11 categories without missing values, one more than described. Replaced Chicken Breast with Chicken to reduce to the 10 required recipe types.
  • servings: Object values without missing values. Replaced '4 as a snack' and '6 as a snack' with 4 and 6 respectively, and converted the data type to numeric (int64).
  • high_traffic: Character values with 373 missing values. Missing values imputed with the value Low.
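The cleaning steps listed above can be sketched as follows. This is a minimal illustration on a toy frame with hypothetical values, assuming the column names and messy entries match the data dictionary described above; the same operations would apply to the full dataset.

```python
import pandas as pd

# Toy frame mimicking the raw columns (hypothetical values for illustration)
recipes = pd.DataFrame({
    'calories': [100.0, None, 300.0],
    'carbohydrate': [10.0, 20.0, None],
    'sugar': [None, 5.0, 7.0],
    'protein': [3.0, None, 9.0],
    'category': ['Chicken Breast', 'Dessert', 'Pork'],
    'servings': ['4 as a snack', '6 as a snack', '2'],
    'high_traffic': ['High', None, 'High'],
})

# Impute missing nutrition values with each column's mean
for col in ['calories', 'carbohydrate', 'sugar', 'protein']:
    recipes[col] = recipes[col].fillna(recipes[col].mean())

# Collapse 'Chicken Breast' into 'Chicken' to get the 10 expected categories
recipes['category'] = recipes['category'].replace('Chicken Breast', 'Chicken')

# Strip the ' as a snack' suffix and convert servings to int64
recipes['servings'] = (recipes['servings']
                       .str.replace(' as a snack', '', regex=False)
                       .astype('int64'))

# Treat missing high_traffic as Low
recipes['high_traffic'] = recipes['high_traffic'].fillna('Low')
```

After these steps the frame has no missing values, `servings` is numeric, and `high_traffic` contains only High/Low.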
# Check variable data types
recipes.info()
# Check for missing values
recipes.isnull().sum()
recipes['high_traffic'].unique()
# Percentage of missing values
print("Missing values for calories: {:.2f}%".format(100 * recipes['calories'].isnull().sum() / len(recipes)))
print("Missing values for high_traffic: {:.2f}%".format(100 * recipes['high_traffic'].isnull().sum() / len(recipes)))
# Check for duplicates
recipes.duplicated().sum()
# Check for outliers
recipes.describe()
recipes['servings'].dtype
recipes['servings'].value_counts()
recipes['category'].value_counts()
recipes['high_traffic'].unique()