
DataCamp Data Scientist Professional Certification

Practical Exam Submission

Below, I analyze a dataset of ~1,000 recipes that were featured on the homepage of a fictitious company's website and whose web traffic was categorized as either "high-traffic" or not. I create several graphs to examine each variable and their interrelationships, and I build two predictive models: a Logistic Regression and a Random Forest Classifier.

The company's Product Team requested a model that predicts which recipes will be high-traffic, correctly identifying high-traffic recipes at least 80% of the time. Ultimately I recommend the Random Forest Classifier because its True Positive Rate is superior, which aligns better with that priority.

Data Validation

Below I import useful libraries, read the recipes dataset into a dataframe, and examine each field's completeness and number of unique entries.

Summary of findings: the dataset has 947 rows and seven columns, in addition to the recipe index column. I have validated that all variables contain what the data dictionary says they should, but a few data cleaning steps are needed to align the data types with the data dictionary and prepare the data for further analysis.

Data fields:

  • calories: float type numeric values with a mean of 435, missing data for 52 rows
  • carbohydrate: float type numeric values with a mean of 35, missing data for 52 rows
  • sugar: float type numeric values with a mean of 9, missing data for 52 rows
  • protein: float type numeric values with a mean of 25, missing data for 52 rows
  • category: object type text data with 11 unique categories, missing no data. Cleaning is needed in order to consolidate into the 10 categories listed in the data dictionary.
  • servings: object type text data with 6 unique categories, missing no data. Cleaning is needed in order to convert to numeric type per the data dictionary.
  • high_traffic: object type with one unique value, "High". It appears to be missing a lot of data, but the nulls are intentional: a null value represents "Not High". Cleaning is needed to convert this to the data type it actually represents, a boolean.

Data cleaning steps:

  1. Set recipe to be the dataframe index.
  2. Convert high_traffic to Boolean by replacing null values with FALSE and "High" with TRUE. It is redundant to use the label "High" since the variable name already indicates that a non-null data point is a high-traffic recipe.
  3. Replace "Chicken Breast" with "Chicken" in category to align with the groupings list in the data dictionary.
  4. Convert servings to numeric type and remove the "as a snack" string appended to a minimal number of entries in this field because it is redundant with the information in category.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from statsmodels.stats.outliers_influence import variance_inflation_factor
plt.style.use('ggplot')
df = pd.read_csv('recipe_site_traffic_2212.csv', index_col = 'recipe')
df.info()
df.head()
df.describe()
# Count the number of unique values in each column to gauge diversity of data.
for col in df.columns:
    print(f"{col} has {df[col].nunique()} unique values.")
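Since all four nutrition columns report the same missing count (52), it is worth checking whether the nulls fall on the same rows before choosing an imputation strategy. A minimal sketch on a toy frame (column names from the dataset, values invented):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the four nutrition columns, with one fully null row
toy = pd.DataFrame({
    "calories": [431.0, np.nan, 120.0],
    "carbohydrate": [35.0, np.nan, 12.0],
    "sugar": [3.0, np.nan, 1.0],
    "protein": [53.0, np.nan, 5.0],
})

per_col = toy.isna().sum()                     # nulls per column
rows_all_null = toy.isna().all(axis=1).sum()   # rows missing every nutrition value
```

If `rows_all_null` matches the per-column count on the real data, the 52 nulls co-occur on the same rows, which informs whether to drop or impute them.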

Data Cleaning

# Convert high_traffic to boolean: "High" -> True, null -> False.
# A direct comparison handles both cases at once, since NaN != "High",
# and yields a true bool dtype (fillna/replace would leave the column as object).
df['high_traffic'] = df['high_traffic'] == 'High'
df['high_traffic'].dtype
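The reason a single comparison suffices can be seen on a toy Series (values invented): null entries compare unequal to "High" and therefore map to False automatically, and the result has a genuine bool dtype.

```python
import pandas as pd

# Toy Series mimicking the raw high_traffic column: "High" or null
s = pd.Series(["High", None, "High", None])

# Nulls compare unequal to "High", so they become False
flag = s == "High"
```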
# Plot count of each unique entry in 'category' column
df['category'].value_counts(dropna = False).plot(kind='bar')
plt.title("Count of Recipes by Category")
plt.xticks(rotation=60)
plt.show()

In the above, there are 11 categories, but the dictionary only includes 10. The category that is present above but not in the dictionary is "Chicken Breast". Given there is already a "Chicken" category, I choose to consolidate "Chicken" and "Chicken Breast".

# Replace "Chicken Breast" with "Chicken" in `category`
df['category'] = df['category'].replace({"Chicken Breast": "Chicken"})
# Plot count of each unique entry in 'servings' column
df['servings'].value_counts(dropna = False).plot(kind='bar')
plt.title("Count of Recipes by Servings")
plt.xticks(rotation=60)
plt.show()

The above shows that only a small fraction of recipes have "as a snack" in the servings field. Before discarding this information, I check whether it is useful and might need to be moved to a categorical field such as category, by looking at the category of recipes whose servings entry ends in "as a snack". The output below shows that all of these recipes are in the "Lunch/Snacks" category, so "as a snack" appears to be redundant information and should simply be removed.

# Count the 'category' types for recipes where 'servings' includes "as a snack"
df[df['servings'].str.contains('as a snack')]['category'].value_counts()
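With the suffix confirmed redundant, cleaning step 4 can be completed by stripping it and casting to an integer type. A minimal sketch on a toy Series (values invented; on the real data this would be applied to df['servings']):

```python
import pandas as pd

# Toy Series mimicking the raw 'servings' column
servings_raw = pd.Series(["4", "6 as a snack", "2", "4 as a snack", "1"])

# Strip the redundant suffix, then convert to numeric type
servings_clean = servings_raw.str.replace(" as a snack", "", regex=False).astype(int)
```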