Data Scientist Professional Practical Exam Submission

Check out my video presentation: https://youtu.be/htqnv6uulPM?si=GpYIN-_kzTkpciTL

📝 Task List

Your written report should include code, output, and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
      • How should the business monitor what they want to achieve?
    • Describe how your models perform using this approach
      • Estimate the initial value(s) for the metric based on the current data
      • Initial accuracy of high-traffic recipes
  • Final summary including recommendations that the business should undertake

Recipe Site Traffic

# libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from matplotlib.colors import LinearSegmentedColormap
sns.set_style('whitegrid')
# custom palette
palette = ["#E4E0E1", "#D6C0B3", "#AB886D", "#493628"]
palette_reversed = palette[::-1]
# Create a continuous colormap from the custom palette
cmap = LinearSegmentedColormap.from_list("custom_food_cmap", palette)
print(palette, palette_reversed)
# import dataset
recipe_site_traffic = pd.read_csv('recipe_site_traffic_2212.csv')
recipe_site_traffic.head()

Data Validation

Data Validation Summary

The original dataset has 947 rows and 8 columns. After dropping the 52 rows with missing nutritional values, 895 rows remain.

  • recipe is numeric, with 947 unique values and no missing values. No cleaning is needed.
  • calories is numeric, with 52 missing values and no negative values. I'll drop the rows with missing values.
  • carbohydrate is numeric, with 52 missing values and no negative values. I'll drop the rows with missing values.
  • sugar is numeric, with 52 missing values and no negative values. I'll drop the rows with missing values.
  • protein is numeric, with 52 missing values and no negative values. I'll drop the rows with missing values.
  • category is a string, with 11 possible values (one more than the 10 expected) and no missing values. I'll convert the extra 'Chicken Breast' category to 'Chicken'.
  • servings is a string, with 6 possible values and no missing values. I'll convert it to a numeric type but use it as ordinal categories.
  • high_traffic is a string, with 373 missing values. I'll convert it to boolean, treating missing values as low-traffic recipes.
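
The row counts above (947 rows, 52 missing in each nutrition column, 895 remaining) only add up if the missing values fall on the same rows. A minimal sketch of how that could be checked, using a toy frame with hypothetical values rather than the real file:

```python
import pandas as pd

nutrition = ["calories", "carbohydrate", "sugar", "protein"]

# Toy frame with hypothetical values; the second row lacks all four nutrition fields.
toy = pd.DataFrame({
    "calories": [35.5, None, 914.3],
    "carbohydrate": [38.6, None, 42.7],
    "sugar": [0.7, None, 3.1],
    "protein": [0.9, None, 2.9],
})

# Rows missing at least one nutrition value vs. rows missing all of them.
missing_any = toy[nutrition].isna().any(axis=1).sum()
missing_all = toy[nutrition].isna().all(axis=1).sum()

# If the two counts match, every incomplete row is missing all four values,
# so dropping them removes exactly that many rows.
rows_after_drop = len(toy.dropna(subset=nutrition))
print(missing_any, missing_all, rows_after_drop)
```

On the real data, the same two counts would both need to equal 52 for the 895-row figure to hold.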
# validate for column types
recipe_site_traffic.info()
# validate for missing values
recipe_site_traffic.isna().sum()
# validate for duplicate data
recipe_site_traffic.duplicated().sum()
# validate that the recipe id column has 947 unique values
recipe_site_traffic['recipe'].nunique()
# validate for negative and extreme values in calories, carbohydrate, sugar, protein
recipe_site_traffic[['calories', 'carbohydrate', 'sugar', 'protein']].describe()
# validate category against the 10 expected values (food groupings)
categories = ['Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal']
print(recipe_site_traffic['category'].nunique())
print(recipe_site_traffic['category'].unique())
# find extra category
set(recipe_site_traffic['category'].unique()) - set(categories)
# validate servings
recipe_site_traffic['servings'].unique()
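
The cleaning steps listed in the Data Validation Summary could be sketched roughly as follows. This is an illustration on a toy frame, not the exact code used in the report: the helper name `clean_recipes`, the sample rows, the `'High'` label, and the assumption that non-numeric servings look like `'4 as a snack'` are all hypothetical.

```python
import pandas as pd

# Toy rows standing in for recipe_site_traffic; every value here is hypothetical.
toy = pd.DataFrame({
    "recipe": [1, 2, 3, 4],
    "calories": [35.5, None, 914.3, 97.0],
    "carbohydrate": [38.6, None, 42.7, 30.6],
    "sugar": [0.7, None, 3.1, 38.6],
    "protein": [0.9, None, 2.9, 0.02],
    "category": ["Potato", "Pork", "Chicken Breast", "Beverages"],
    "servings": ["4", "6", "4 as a snack", "1"],  # assumed string format
    "high_traffic": ["High", "High", None, None],  # assumed label
})

def clean_recipes(df):
    """Apply the cleaning decisions from the validation summary (hypothetical helper)."""
    nutrition = ["calories", "carbohydrate", "sugar", "protein"]
    # Drop rows with missing nutritional values.
    out = df.dropna(subset=nutrition).copy()
    # Fold the extra category into the expected one.
    out["category"] = out["category"].replace({"Chicken Breast": "Chicken"})
    # Keep the leading number of servings and store it as an integer.
    out["servings"] = (
        out["servings"].astype(str).str.extract(r"(\d+)", expand=False).astype(int)
    )
    # Any non-missing label counts as high traffic; missing means low traffic.
    out["high_traffic"] = out["high_traffic"].notna()
    return out

cleaned = clean_recipes(toy)
print(cleaned)
```

Returning a copy leaves the raw frame untouched, so the validation checks above can be re-run against the original data.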