Datacamp Data Scientist Professional Python Exam

Recipe Site Traffic Analysis: Predicting Popular Recipes for Maximum Engagement

Data Scientist Professional Practical Exam

Weiyuan Liu

Data Validation

The dataset has 947 rows and 8 columns. Data validation is carried out in two parts. The first part supports visualization and exploratory data analysis: missing values and inconsistent entries are cleaned according to column type, while the original structure of the data is kept intact. The second part supports the predictive models: numeric columns are standardized (zero mean, unit variance), and categorical columns are encoded as numeric values.

recipe: Unique identifier. 947 numeric values. Recipe is set as the index.
calories: Continuous numeric variable. 947 numeric values after cleaning. The original dataset has 52 missing values, around 5.49%. Missing values are imputed with the mean of the variable, so the resulting variable has no missing values.
carbohydrate: Continuous numeric variable. 947 numeric values after cleaning. The original dataset has 52 missing values, around 5.49%. Missing values are imputed with the mean of the variable, so the resulting variable has no missing values.
sugar: Continuous numeric variable. 947 numeric values after cleaning. The original dataset has 52 missing values, around 5.49%. Missing values are imputed with the mean of the variable, so the resulting variable has no missing values.
protein: Continuous numeric variable. 947 numeric values after cleaning. The original dataset has 52 missing values, around 5.49%. Missing values are imputed with the mean of the variable, so the resulting variable has no missing values.
category: 947 categorical values in 11 categories: 'Pork', 'Potato', 'Breakfast', 'Beverages', 'One Dish Meal', 'Chicken Breast', 'Lunch/Snacks', 'Chicken', 'Vegetable', 'Meat', 'Dessert'. The original data are strings and have been converted to the categorical type.
servings: 947 categorical values in 4 categories: '6', '4', '2', '1'. The original data contain unnecessary string characters, which have been removed.
high_traffic: Target variable with values 'High' and 'Normal'. 947 categorical values. Only high traffic is recorded in the original data; normal traffic appears as missing values, which are filled with "Normal".

# Packages with standard aliases
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, precision_score
from sklearn.ensemble import RandomForestClassifier

# Import the data and inspect the numeric columns
df = pd.read_csv('recipe_site_traffic_2212.csv')
df = df.set_index('recipe')  # recipe is the unique identifier
df.head()
df.info()
df.describe()
df.isna().sum() / len(df)  # share of missing values per column
df[df.isna().any(axis=1)]  # rows with at least one missing value


# Glimpse non-numeric data details
print(df['servings'].unique())
print(df['category'].unique())
print(df['high_traffic'].unique())


# Clean non-numeric data
df['high_traffic'] = df['high_traffic'].fillna('Normal').astype('category')
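# regex=True treats '[A-Za-z]' as a pattern; pandas >= 2.0 matches it literally by default
df['servings'] = df['servings'].str.replace('[A-Za-z]', '', regex=True).str.strip().astype('category')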
df['category'] = df['category'].astype('category')
df.info()
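
The servings cleanup above can be checked on a few toy values. This is a minimal sketch: the "as a snack" suffix is an assumed example of the stray text, not taken from the report, and `raw` and `cleaned` are illustrative names.

```python
import pandas as pd

# Toy values; the "as a snack" suffix is an assumed example of the stray characters.
raw = pd.Series(["4", "6 as a snack", "2", "1", "4 as a snack"], name="servings")

# Remove letters with a regular expression, trim leftover spaces,
# then store the result as a categorical column.
cleaned = (
    raw.str.replace(r"[A-Za-z]", "", regex=True)
       .str.strip()
       .astype("category")
)

print(cleaned.tolist())
print(list(cleaned.cat.categories))
```

Without `regex=True`, recent pandas versions would search for the literal text `[A-Za-z]` and leave the column unchanged, so the flag is worth stating explicitly.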


# Clean numeric data
si = SimpleImputer(strategy = 'mean')
numeric = ['calories','carbohydrate','sugar','protein']
df[numeric] = si.fit_transform(df[numeric]);
df.info()
df.isna().sum()
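
To illustrate what mean imputation does, here is a small sketch on made-up numbers (the `toy` frame and its values are hypothetical). It also reproduces the reported missing share of 52 out of 947 rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Share of missing rows reported in the validation section: 52 of 947.
reported_share = round(52 / 947 * 100, 2)

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({
    "calories": [100.0, np.nan, 300.0],
    "protein": [10.0, 20.0, np.nan],
})
cols = ["calories", "protein"]

# Each missing entry is replaced by the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
toy[cols] = imputer.fit_transform(toy[cols])

print(reported_share)
print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance slightly, which is a reasonable trade-off when only a small share of rows is affected.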


Exploratory Data Analysis

Distribution of target variable: high_traffic against categorical variables

The target variable is plotted against the categorical variables servings and category. There are more high-traffic records than normal-traffic records overall, and high traffic tends to be associated with the Dessert, Lunch/Snacks, Meat, One Dish Meal, Pork, Potato and Vegetable categories.

# Plot style settings
sns.set(style="white",palette = 'muted')

fig,ax = plt.subplots(1,2,figsize = (15,5))
sns.despine(left=True)
sns.countplot(x='servings', data=df, hue = 'high_traffic', ax = ax[0]).set(xlabel = 'Number of servings', ylabel = 'Count', title = 'Number of High Traffic and Normal Traffic by Number of Servings')
sns.countplot(x='category', data=df, hue = 'high_traffic', ax = ax[1]).set(xlabel = 'Recipe type', ylabel = 'Count', title='Number of High Traffic and Normal Traffic by Recipe Type')
ax[0].legend().remove()
ax[1].legend().remove()
ax[1].tick_params(axis='x', labelrotation=45)
handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', ncol = 2)
fig.show()
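
The association between recipe type and traffic can also be read from a table of proportions rather than from bar heights. The sketch below uses hypothetical toy rows, not the real data, and the names `share` and `high_share` are illustrative.

```python
import pandas as pd

# Hypothetical toy rows standing in for the cleaned dataset.
toy = pd.DataFrame({
    "category": ["Pork", "Pork", "Pork", "Beverages", "Beverages", "Dessert", "Dessert"],
    "high_traffic": ["High", "High", "Normal", "Normal", "Normal", "High", "Normal"],
})

# normalize="index" turns each row of counts into shares that sum to 1 per category.
share = pd.crosstab(toy["category"], toy["high_traffic"], normalize="index")

# Rank categories by their share of high-traffic recipes.
high_share = share["High"].sort_values(ascending=False)
print(high_share)
```

Applied to the real `df`, the same call would show which categories have a high-traffic share above one half, which is harder to judge from grouped counts alone.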

Distribution of target variable: high_traffic against numeric variables

The pattern of more high-traffic than normal-traffic records holds across all numeric variables. Nutrient values are generally concentrated at the low end of their ranges.

fig,ax = plt.subplots(2,2,figsize = (15,10))
sns.despine(left=True)
sns.histplot(x='calories',data=df,ax = ax[0,0],bins=20, hue ='high_traffic',multiple='dodge',kde= True).set(xlabel = 'Calories')
sns.histplot(x='carbohydrate',data=df,ax = ax[0,1],bins=20, hue ='high_traffic',multiple='dodge',kde= True).set(xlabel = 'Carbohydrate (g)')
sns.histplot(x='sugar',data=df,ax = ax[1,0],bins=20, hue ='high_traffic',multiple='dodge',kde= True).set(xlabel = 'Sugar (g)')
sns.histplot(x='protein',data=df,ax = ax[1,1],bins=20, hue ='high_traffic',multiple='dodge',kde= True).set(xlabel = 'Protein (g)')
for a in (0,1):
    for b in (0,1):
        ax[a,b].set_ylabel(None)
        ax[a,b].legend().remove()
fig.legend(handles, labels, loc='upper center', ncol = 2,bbox_to_anchor=(0.5,0.92))
fig.text(0.09,0.5,'Count',rotation='vertical',horizontalalignment='left',verticalalignment='center',fontsize = 15)
fig.text(0.5,0.05,'Nutrition type',horizontalalignment='center',fontsize = 15)
fig.text(0.5,0.93,'Distribution of Website Traffic by Nutrition Type: High Traffic vs. Normal Traffic',horizontalalignment='center',fontsize = 15);
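
One way to quantify how strongly values cluster at the low end is to compare mean and median and to compute the sample skewness. The numbers below are made up for illustration; only the pandas calls carry over to the real data.

```python
import pandas as pd

# Hypothetical toy values: most are small, one is very large.
toy = pd.DataFrame({
    "calories": [10.0, 20.0, 30.0, 40.0, 400.0],
    "sugar": [1.0, 1.0, 2.0, 3.0, 43.0],
})

# A mean well above the median and a positive skew both indicate a long right tail.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary)
```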

Distribution of numeric variables by categorical variables: servings

The boxplots below show nutrient values grouped by number of servings. The distributions do not differ much across serving sizes.

fig,ax = plt.subplots(2,2,figsize = (15,10))
sns.despine(left=True)
sns.boxplot(hue='high_traffic',x='calories',y ='servings',data=df,ax = ax[0,0],dodge=True,palette='muted',boxprops={'edgecolor': 'None'},medianprops=dict(linewidth=1, color='black'),linewidth=0.9,order=['6','4','2','1']).set(xlabel = 'Calories')
sns.boxplot(hue='high_traffic',x='carbohydrate',data=df,ax = ax[0,1],y ='servings',dodge = True,palette='muted',boxprops={'edgecolor': 'None'},medianprops=dict(linewidth=1, color='black'),linewidth=0.9,order=['6','4','2','1']).set(xlabel = 'Carbohydrate (g)')
sns.boxplot(hue='high_traffic',x='sugar',data=df,ax = ax[1,0],y ='servings',dodge = True,palette='muted',boxprops={'edgecolor': 'None'},medianprops=dict(linewidth=1, color='black'),linewidth=0.9,order=['6','4','2','1']).set(xlabel = 'Sugar (g)')
sns.boxplot(hue='high_traffic',x='protein',data=df,ax = ax[1,1],y ='servings',dodge = True,palette='muted',boxprops={'edgecolor': 'None'},medianprops=dict(linewidth=1, color='black'),linewidth=0.9,order=['6','4','2','1']).set(xlabel = 'Protein (g)')
for a in (0,1):
    for b in (0,1):
        ax[a,b].set_ylabel(None)
        ax[a,b].legend().remove()
handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', ncol = 2,bbox_to_anchor=(0.5,0.92))
fig.text(0.09,0.5,'Number of servings',rotation='vertical',horizontalalignment='left',verticalalignment='center',fontsize = 15)
fig.text(0.5,0.05,'Nutrition type',horizontalalignment='center',fontsize = 15)
fig.text(0.5,0.93,'Distribution of Nutritions by Number of Servings: High Traffic vs. Normal Traffic',horizontalalignment='center',fontsize = 15);
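
The comparison across serving sizes can be backed by group medians, which is the statistic the boxplots draw as the center line. This is a sketch on hypothetical toy rows.

```python
import pandas as pd

# Hypothetical toy rows; the real data has four serving sizes.
toy = pd.DataFrame({
    "servings": ["1", "1", "4", "4"],
    "calories": [100.0, 300.0, 200.0, 400.0],
    "protein": [5.0, 15.0, 10.0, 20.0],
})

# Median of each nutrient per serving size.
medians = toy.groupby("servings")[["calories", "protein"]].median()
print(medians)
```

If `servings` is stored as a categorical column, as in the cleaned data, passing `observed=True` to `groupby` restricts the result to categories that actually occur.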

Distribution of numeric variables by categorical variables: category

The distributions of nutrients differ markedly between recipe types, which might influence model performance. Furthermore, some recipe types do not contain certain nutrients. Standardization is suggested to improve model performance.
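
The standardization suggested here, together with the numeric encoding of categorical columns mentioned in the validation section, could look roughly as follows. This is a sketch under assumptions: the report does not say which encoding is used, so one-hot encoding via `pd.get_dummies` is an assumed choice, and the toy data is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows standing in for the cleaned dataset.
toy = pd.DataFrame({
    "calories": [100.0, 200.0, 300.0, 400.0],
    "sugar": [1.0, 3.0, 5.0, 7.0],
    "category": ["Pork", "Dessert", "Pork", "Meat"],
    "servings": ["4", "2", "6", "1"],
})
numeric = ["calories", "sugar"]

# Standardize numeric columns to zero mean and unit variance.
scaled = toy.copy()
scaled[numeric] = StandardScaler().fit_transform(toy[numeric])

# Assumed encoding: one indicator column per category level.
encoded = pd.get_dummies(scaled, columns=["category", "servings"], dtype=int)
print(encoded)
```

When a train/test split is involved, the scaler is usually fitted on the training rows only and then applied to the test rows, so that test information does not leak into the model.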

