Data Validation and Cleaning
The dataset has 947 rows and 8 columns. All variables were validated and cleaned where appropriate:
- recipe: 947 unique numeric values, matching the description; no missing data, so no cleaning was needed.
- calories: numeric, with 52 missing values filled using the mean per 'category'.
- carbohydrate: numeric, with 52 missing values filled using the mean per 'category'.
- sugar: numeric, with 52 missing values filled using the mean per 'category'.
- protein: numeric, with 52 missing values filled using the mean per 'category'.
- category: 11 categories found instead of the 10 in the description; the extra 'Chicken Breast' entries were recoded to 'Chicken'.
- servings: removed the ' as a snack' text from entries so that only the numeric serving counts described remain.
- high_traffic: 574 'High' values marking recipes with high traffic; missing entries were filled with 'Not High'.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score
plt.style.use('ggplot')

# Load recipe dataset
url = 'https://s3.amazonaws.com/talent-assets.datacamp.com/recipe_site_traffic_2212.csv'
df = pd.read_csv(url)
df.head()

print(df.info())
print(df.isnull().sum())

# Assigning features with numeric values
numeric_cols = df[['calories', 'carbohydrate', 'sugar', 'protein']]
# fill missing values in numeric columns
for col in numeric_cols.columns:
    mean_dict = df.groupby('category')[col].mean().to_dict()
    df[col] = df[col].fillna(df['category'].map(mean_dict))  # 'category' is still object dtype here; it is converted to category dtype later
# Check the variance of each numeric feature
for feature in numeric_cols.columns:
    print(df[feature].var())
# Validate 'recipe' column
df['recipe'].nunique()

# Validate that 'category' contains the 10 expected values and clean if not
print(df['category'].unique())
df.loc[df['category'] == 'Chicken Breast', 'category'] = 'Chicken'
df['category'] = df['category'].astype('category')
print(df['category'].nunique())

# Validate 'servings' column and clean if necessary
df['servings'] = df['servings'].str.replace(' as a snack', '', regex=False).astype('category')
df['servings'].unique()

# Validate 'high_traffic' column and clean if necessary
df['high_traffic'] = df['high_traffic'].fillna('Not High').astype('category')
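With the cleaning done, a quick sanity check can confirm the figures reported above (a minimal sketch; the expected values are taken directly from the summary at the top of this section).
# Confirm the cleaning results described above
assert df.shape == (947, 8)                                 # 947 rows, 8 columns
assert df['recipe'].nunique() == 947                        # every recipe id is unique
assert df[numeric_cols.columns].isnull().sum().sum() == 0   # no missing numeric values remain
assert df['category'].nunique() == 10                       # 11 raw categories reduced to 10
assert (df['high_traffic'] == 'High').sum() == 574          # 574 recipes with high traffic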
print(df['high_traffic'].value_counts())
print(df.describe())
print(df.info())

Exploratory Analysis
We now investigate the target variable, the recipe features, and the relationships between the two.
Target Variable - high_traffic
The dataset contains more recipes marked 'High' than 'Not High', as shown in the plot below.
fig, axes = plt.subplots(figsize = (15,8))
sns.countplot(data=df, x='high_traffic', color='gray').set(title='Count of recipes having High Traffic vs Unknown')
plt.xlabel('Traffic')
plt.show()

Numeric features - calories, carbohydrate, sugar, protein
The distributions of the numeric features appear to be right-skewed, and the variables cover a wide range of values.
However, after logarithmic scaling, all numeric features appear close to normally distributed, which supports filling their missing values with the mean per 'category'.
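The log-scaled distributions can be reproduced with a short sketch like the one below (an illustrative sketch, not the original plotting code; np.log1p is assumed here to avoid issues with zero values).
# Plot each numeric feature on a logarithmic scale
fig, axes = plt.subplots(2, 2, figsize=(15, 8))
for ax, col in zip(axes.flatten(), numeric_cols.columns):
    sns.histplot(np.log1p(df[col]), ax=ax, color='gray')  # log1p handles zeros safely
    ax.set_title(f'log(1 + {col})')
plt.tight_layout()
plt.show()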
All numeric features have only weak correlations with each other, as shown in the heatmap.
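A correlation heatmap such as the following sketch (assuming the default Pearson correlation; the color map is an arbitrary choice) illustrates this.
# Correlation heatmap of the numeric features
fig, ax = plt.subplots(figsize=(8, 6))
corr = df[numeric_cols.columns].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=ax)
ax.set_title('Correlation between numeric features')
plt.show()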