Data Scientist Professional Practical Exam Submission
Check out my video presentation: https://youtu.be/htqnv6uulPM?si=GpYIN-_kzTkpciTL
📝 Task List
Your written report should include code, output, and written summaries of the following:
- Data Validation:
- Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
- Model Development
- Include your reasons for selecting the models you use as well as a statement of the problem type
- Code to fit the baseline and comparison models
- Model Evaluation
- Describe the performance of the two models based on an appropriate metric
- Business Metrics
- Define a way to compare your model performance to the business
- How should the business monitor what they want to achieve?
- Describe how your models perform using this approach
- Estimate the initial value(s) for the metric based on the current data
- Initial accuracy of high-traffic recipes
- Final summary including recommendations that the business should undertake
Recipe Site Traffic
# libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set_style('whitegrid')
from matplotlib.colors import LinearSegmentedColormap
# custom palette
palette = ["#E4E0E1", "#D6C0B3", "#AB886D", "#493628"]
palette_reversed = palette[::-1]
# Create a continuous colormap from the custom palette
cmap = LinearSegmentedColormap.from_list("custom_food_cmap", palette)
print(palette, palette_reversed)
# import dataset
recipe_site_traffic = pd.read_csv('recipe_site_traffic_2212.csv')
recipe_site_traffic.head()
Data Validation
Data Validation Summary
The original dataset has 947 rows and 8 columns. After dropping rows with missing values, 895 rows remain.
- recipe is numeric, 947 unique values, no missing values. No cleaning is needed.
- calories is numeric, 52 missing values, no negative values. I'll drop rows with missing values.
- carbohydrate is numeric, 52 missing values, no negative values. I'll drop rows with missing values.
- sugar is numeric, 52 missing values, no negative values. I'll drop rows with missing values.
- protein is numeric, 52 missing values, no negative values. I'll drop rows with missing values.
- category is string with 11 observed values (one more than the 10 expected), no missing values. I'll fold the extra Chicken Breast category into Chicken.
- servings is string, 6 possible values, no missing values. I'll convert it to numeric type but treat it as ordinal categories.
- high_traffic is string with 373 missing values. I'll convert it to boolean; missing values represent low-traffic recipes (False).
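The cleaning steps above can be sketched as follows, using a small toy frame in place of the real CSV (the values here are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real recipe_site_traffic frame (values are illustrative)
df = pd.DataFrame({
    'recipe': [1, 2, 3],
    'calories': [120.5, np.nan, 300.0],
    'carbohydrate': [10.0, np.nan, 25.0],
    'sugar': [2.0, np.nan, 5.0],
    'protein': [8.0, np.nan, 12.0],
    'category': ['Chicken Breast', 'Dessert', 'Beverages'],
    'servings': ['4', '6', '2'],
    'high_traffic': ['High', None, 'High'],
})

# Drop rows with missing nutrition values
df = df.dropna(subset=['calories', 'carbohydrate', 'sugar', 'protein'])

# Fold the extra 'Chicken Breast' category into 'Chicken'
df['category'] = df['category'].replace('Chicken Breast', 'Chicken')

# Convert servings to a numeric type (treated as ordinal categories later)
df['servings'] = pd.to_numeric(df['servings'])

# high_traffic: 'High' -> True, missing -> False (low-traffic recipes)
df['high_traffic'] = df['high_traffic'].eq('High')
```

The same four transformations apply unchanged to the full 947-row frame loaded below.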
# validate for column types
recipe_site_traffic.info()
# validate for missing values
recipe_site_traffic.isna().sum()
# validate for duplicate data
recipe_site_traffic.duplicated().sum()
# validate recipe id 947 unique values
recipe_site_traffic['recipe'].nunique()
# validate for negative and extreme values in calories, carbohydrate, sugar, protein
recipe_site_traffic[['calories', 'carbohydrate', 'sugar', 'protein']].describe()
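describe() surfaces the column minimums, but an explicit count of negative entries makes the "no negative values" claim checkable in one line. A minimal sketch, with a toy frame standing in for the loaded dataset:

```python
import pandas as pd

# Toy nutrition columns standing in for the loaded dataset (values illustrative)
df = pd.DataFrame({
    'calories': [120.5, 300.0],
    'carbohydrate': [10.0, 25.0],
    'sugar': [2.0, 5.0],
    'protein': [8.0, 12.0],
})

# Count negative entries per nutrition column; all zeros confirms the claim
negatives = (df[['calories', 'carbohydrate', 'sugar', 'protein']] < 0).sum()
print(negatives)
```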
# validate category 10 possible values (food groupings)
categories = ['Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal']
print(recipe_site_traffic['category'].nunique())
print(recipe_site_traffic['category'].unique())
# find extra category
set(recipe_site_traffic['category'].unique()) - set(categories)
# validate servings
recipe_site_traffic['servings'].unique()
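If any servings entries carry trailing text (the values below, such as '4 as a snack', are a hypothetical illustration, not confirmed by the output above), extracting the leading digits converts every string to a number in one pass:

```python
import pandas as pd

# Hypothetical servings values; the entries with trailing text are an assumption
servings = pd.Series(['1', '2', '4', '6', '4 as a snack', '6 as a snack'])

# Pull out the leading digits, then convert to a numeric dtype
servings_clean = pd.to_numeric(servings.str.extract(r'(\d+)', expand=False))
print(sorted(servings_clean.unique()))
```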