DataCamp Data Scientist Professional Certification
Practical Exam Submission
In the below, I analyze a dataset of ~1,000 recipes that were hosted on the front page of a fictitious company's website, with each recipe's web traffic categorized as either "high-traffic" or not. I create several graphs to examine each variable and their interrelationships, and I build two predictive models: a Logistic Regression and a Random Forest Classifier.
The request from the company's Product Team was to build a model that could predict which recipes would be high-traffic, and correctly predict high-traffic recipes 80% of the time. Ultimately I recommend the Random Forest Classifier because its True Positive Rate is superior, which aligns better with the priorities shared in the request.
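The Product Team's 80% target corresponds to the True Positive Rate (recall): of all truly high-traffic recipes, what share does the model flag? A minimal sketch of how that metric falls out of a confusion matrix, using made-up labels purely for illustration (the real values come from the models fitted below):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only -- not results from the actual models.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)  # True Positive Rate: share of high-traffic recipes caught
print(f"TPR = {recall:.2f}")
```

A model meeting the brief would need this value to be at least 0.80 on held-out data.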
Data Validation
Below I import useful libraries, read the recipes dataset into a dataframe, and examine each field's completeness and number of unique entries.
Summary of findings: the dataset has 947 rows and seven columns, in addition to the index `recipe`. I have validated that all variables contain what they are expected to per the data dictionary, but there are a few data cleaning steps needed to align the data types with the data dictionary and prepare the data for further analysis.
Data fields:
- calories: float type numeric values with a mean of 435, missing data for 52 rows
- carbohydrate: float type numeric values with a mean of 35, missing data for 52 rows
- sugar: float type numeric values with a mean of 9, missing data for 52 rows
- protein: float type numeric values with a mean of 25, missing data for 52 rows
- category: object type text data with 11 unique categories, missing no data. Cleaning is needed in order to consolidate into the 10 categories listed in the data dictionary.
- servings: object type text data with 6 unique categories, missing no data. Cleaning is needed in order to convert to numeric type per the data dictionary.
- high_traffic: object type with one unique value, "High". It appears to be missing many values, but it is actually intentionally incomplete: null values represent "Not High". Cleaning is needed to convert this to the data type it actually represents, a Boolean.
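The rows with missing nutrient values will need to be handled before modeling. A minimal sketch of one option, KNN imputation, run on a toy frame standing in for the four nutrient columns (the real data comes from the CSV, and the final imputation strategy is decided later):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the nutrient columns, with gaps like those in the real data.
toy = pd.DataFrame({
    "calories":     [435.0, np.nan, 120.0, 900.0],
    "carbohydrate": [35.0,  22.0,   np.nan, 80.0],
    "sugar":        [9.0,   4.0,    1.0,   np.nan],
    "protein":      [25.0,  3.0,    np.nan, 60.0],
})

# Fill each gap from the 2 most similar rows (by the non-missing columns).
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled.isna().sum().sum())  # no missing values remain
```

Dropping the affected rows outright is the simpler alternative, at the cost of ~5% of the data.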
Data cleaning steps:
- Set `recipe` to be the dataframe index.
- Convert `high_traffic` to Boolean by replacing null values with False and "High" with True. The label "High" is redundant since the variable name already indicates that a non-null data point is a high-traffic recipe.
- Replace "Chicken Breast" with "Chicken" in `category` to align with the groupings listed in the data dictionary.
- Convert `servings` to numeric type and remove the "as a snack" string appended to a minimal number of entries in this field, because it is redundant with the information in `category`.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from statsmodels.stats.outliers_influence import variance_inflation_factor
plt.style.use('ggplot')
df = pd.read_csv('recipe_site_traffic_2212.csv', index_col = 'recipe')
df.info()
df.head()
df.describe()
# Count the number of unique values in each column to gauge diversity of data.
for col in df.columns:
    print(col, "has", df[col].nunique(), "unique values.")
Data Cleaning
# Convert 'high_traffic' to Boolean: NAs become False, "High" becomes True
df['high_traffic'] = df['high_traffic'].fillna(False)
df['high_traffic'] = df['high_traffic'].replace("High", True).astype(bool)
df['high_traffic'].dtype
# Plot count of each unique entry in 'category' column
df['category'].value_counts(dropna = False).plot(kind='bar')
plt.title("Count of Recipes by Category")
plt.xticks(rotation=60)
plt.show()
In the above, there are 11 categories, but the dictionary only includes 10. The category that is present above but not in the dictionary is "Chicken Breast". Given there is already a "Chicken" category, I choose to consolidate "Chicken" and "Chicken Breast".
# Replace "Chicken Breast" with "Chicken" in `category`
df['category'] = df['category'].replace({"Chicken Breast": "Chicken"})
# Plot count of each unique entry in 'servings' column
df['servings'].value_counts(dropna = False).plot(kind='bar')
plt.title("Count of Recipes by Servings")
plt.xticks(rotation=60)
plt.show()
The above shows that an extremely small fraction of recipes have "as a snack" in the servings field. Before I throw out this information, I check if it is useful and may need to be moved to a categorical field like `category`. I look at the `category` of the recipes whose data in `servings` ends in "as a snack". The below shows that the `category` of all these recipes is "Lunch/Snacks", so it seems that "as a snack" is redundant information and should just be removed.
# Count the 'category' types for recipes where 'servings' includes "as a snack"
df[df['servings'].str.find('snack')>0]['category'].value_counts()
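With the suffix confirmed redundant, the conversion itself can be sketched as below on a toy Series standing in for `df['servings']` (the real column is cleaned with the same two steps, assuming the suffix is always " as a snack"):

```python
import pandas as pd

# Toy stand-in for the 'servings' column; values mimic the real entries.
servings = pd.Series(["4", "6", "4 as a snack", "2", "6 as a snack"])

# Strip the literal suffix (regex=False: plain substring match), then cast to int.
cleaned = servings.str.replace(" as a snack", "", regex=False).astype(int)
print(cleaned.tolist())
```

After this step, `servings` is numeric and ready for the visual and model-based analysis that follows.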