
Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

📝 Task List

Your written report should include both code, output and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake

Start writing report here..

Data Validation

The dataset contains 947 rows and 8 columns. I reviewed all of the variables and made the following changes based on the validation results:

  • calories: numeric value, with 52 missing values. I imputed the missing values with the mean calories of their respective categories.
  • carbohydrate: numeric value, with 52 missing values. I imputed the missing values with the mean carbohydrate of their respective categories.
  • sugar: numeric value, with 52 missing values. I imputed the missing values with the mean sugar of their respective categories.
  • protein: numeric value, with 52 missing values. I imputed the missing values with the mean protein of their respective categories.
  • category: 11 categories, without missing values. I renamed the category 'Chicken Breast' to 'Chicken', which reduces it to the 10 categories listed in the data dictionary.
  • servings: non-numeric value, without missing values. I replaced the expressions '4 as a snack' and '6 as a snack' with 4 and 6, respectively, then converted the column to integer type to make it numeric.
  • high_traffic: same as the description, but I replaced the null values with 'Low' to avoid ambiguity, so no missing values remain.
  • recipe: numeric value, no missing values. No cleaning needed.

Once I had validated each column, I removed the duplicate rows that arose as a result of the imputations and set the 'recipe' column as the index. This resulted in a final dataset of 922 rows and 7 columns.

# Data discovery: load the raw data and inspect its structure

import pandas as pd
recipe_site_traffic_2212 = pd.read_csv('recipe_site_traffic_2212.csv')
recipe_site_traffic_2212.info()  # info() prints its summary directly and returns None
display(recipe_site_traffic_2212.describe())
display(recipe_site_traffic_2212.head(10))
# Proceed with the data validation steps
# Import the necessary libraries
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Convert the 'servings' column to int after replacing the 'as a snack' entries.
# Assigning the result back avoids relying on in-place changes to a column selection.
recipe_site_traffic_2212['servings'] = (
    recipe_site_traffic_2212['servings']
    .replace({'4 as a snack': '4', '6 as a snack': '6'})
    .astype(int)
)

# Merge the 'Chicken Breast' category into 'Chicken'
recipe_site_traffic_2212['category'] = recipe_site_traffic_2212['category'].replace({'Chicken Breast': 'Chicken'})

# Missing values
# Convert the null values in the traffic column to 'Low'
recipe_site_traffic_2212['high_traffic'] = recipe_site_traffic_2212['high_traffic'].fillna('Low')

# Impute the missing values in the numeric columns with the mean of their respective category
nutrient_cols = ['calories', 'carbohydrate', 'sugar', 'protein']
category_means = (
    recipe_site_traffic_2212.groupby('category')[nutrient_cols]
    .transform('mean')
    .round(2)
)
recipe_site_traffic_2212_updated = recipe_site_traffic_2212.copy()
recipe_site_traffic_2212_updated[nutrient_cols] = recipe_site_traffic_2212_updated[nutrient_cols].fillna(category_means)
recipe_site_traffic_2212_updated = recipe_site_traffic_2212_updated.sort_values('recipe')

#display(recipe_site_traffic_2212_updated.isna().sum())

# Remove the duplicate rows created by the imputation
recipe_site_traffic_2212_updated = recipe_site_traffic_2212_updated.drop_duplicates(
    subset=['calories', 'carbohydrate', 'sugar', 'protein', 'category', 'servings', 'high_traffic']
)
recipe_site_traffic_2212_updated = recipe_site_traffic_2212_updated.set_index('recipe')

display(recipe_site_traffic_2212_updated)
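
A few quick checks can confirm that the cleaning steps behave as described in the summary. The following is only a minimal sketch on a small made-up frame: the helper name `check_cleaned`, the toy values, and the assumption that the non-null traffic label is 'High' are illustrative and not part of the original notebook.

```python
import pandas as pd

# The 10 categories listed in the data dictionary
EXPECTED_CATEGORIES = {
    'Pork', 'Meat', 'Chicken', 'Dessert', 'Potato', 'Lunch/Snacks',
    'One Dish Meal', 'Vegetable', 'Beverages', 'Breakfast',
}

def check_cleaned(df):
    """Return a list of problems found in a cleaned frame (empty if none)."""
    problems = []
    if df.isna().any().any():
        problems.append('missing values remain')
    unexpected = set(df['category']) - EXPECTED_CATEGORIES
    if unexpected:
        problems.append(f'unexpected categories: {sorted(unexpected)}')
    if not pd.api.types.is_integer_dtype(df['servings']):
        problems.append('servings is not an integer column')
    # Assumption: the only labels after cleaning are 'High' and 'Low'
    if not set(df['high_traffic']) <= {'High', 'Low'}:
        problems.append('high_traffic has values other than High/Low')
    return problems

# Toy data (made-up values), shaped like the cleaned dataset
toy = pd.DataFrame({
    'calories': [120.5, 300.0],
    'carbohydrate': [10.0, 42.1],
    'sugar': [1.2, 5.5],
    'protein': [20.0, 3.3],
    'category': ['Chicken', 'Dessert'],
    'servings': [4, 6],
    'high_traffic': ['High', 'Low'],
})
print(check_cleaned(toy))
```

The same function could be called on `recipe_site_traffic_2212_updated` at the end of the validation cell.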

Exploratory Analysis

I examined the target variable and the features of the recipe data, and analyzed the relationships between them. The investigation showed that no further changes to the cleaned dataset were necessary. Details of my findings are provided below.

Target Variable - high_traffic

Since our goal is to predict whether a recipe will be popular, and popularity implies more traffic and subscriptions, the target variable for our analysis is 'high_traffic'.
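
One way to look at a target variable like this is to compute and plot the share of each class. The sketch below uses made-up labels rather than the real data, and assumes the two labels are 'High' and 'Low'.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy labels (made-up); the real column comes from the cleaned dataset
toy = pd.DataFrame({'high_traffic': ['High', 'High', 'Low', 'High', 'Low']})

# Share of each class in the target variable
shares = toy['high_traffic'].value_counts(normalize=True)
print(shares)

# Single-variable graphic: bar chart of the class shares
fig, ax = plt.subplots()
ax.bar(list(shares.index), shares.values)
ax.set_xlabel('high_traffic')
ax.set_ylabel('share of recipes')
ax.set_title('Target variable balance')
plt.show()
```

A strongly unbalanced target would affect which evaluation metric is appropriate later on, which is why this check is worth doing early.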

Hidden code

Numeric Variables - Calories, Carbohydrate, Sugar, Protein, Servings

The heatmap suggests that there is a slight positive linear relationship between two pairs of variables: protein and calories, as well as sugar and carbohydrates.
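
A correlation heatmap of this kind can be sketched as follows. The data here is randomly generated purely to make the example runnable, so the correlations it shows say nothing about the recipe dataset; only the column names are taken from the report.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

numeric_cols = ['calories', 'carbohydrate', 'sugar', 'protein', 'servings']

# Synthetic stand-in data (made-up values, independent columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.gamma(2.0, 10.0, size=(200, 5)), columns=numeric_cols)

# Pairwise Pearson correlation, which measures linear association only
corr = toy[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1, ax=ax)
ax.set_title('Correlation between numeric variables')
plt.show()
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable across datasets, so a weak correlation is not visually exaggerated.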

Hidden code

Relationship between Calories, Carbohydrate, Sugar, Protein and high_traffic

To look deeper into these relationships, I created boxplots of calories, carbohydrate, sugar, and protein against our target variable, high_traffic. Based on the boxplots below, recipes rich in calories and carbohydrates appear more likely to attract traffic.
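
Boxplots of each nutrient split by the target could be produced along these lines. This is a sketch on randomly generated data with assumed 'High'/'Low' labels, so the plotted distributions are not the report's findings.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

nutrient_cols = ['calories', 'carbohydrate', 'sugar', 'protein']

# Synthetic stand-in data (made-up values and labels)
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.gamma(2.0, 10.0, size=(100, 4)), columns=nutrient_cols)
toy['high_traffic'] = rng.choice(['High', 'Low'], size=100)

# One boxplot per nutrient, split by the target variable
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, col in zip(axes, nutrient_cols):
    sns.boxplot(data=toy, x='high_traffic', y=col, ax=ax)
    ax.set_title(col)
fig.tight_layout()
plt.show()

# Group medians give the same comparison numerically
medians = toy.groupby('high_traffic')[nutrient_cols].median()
print(medians)
```

Printing the group medians alongside the plots makes it easier to state how large the difference between the two groups actually is.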

Hidden code

To provide further clarity, I included scatter plots that depict the food composition by category. These scatter plots reinforce the previous observation and provide additional insights, such as: pork, meat, vegetables, and potatoes are popular meals, while breakfast and beverages are less likely to generate traffic.
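
The category-level observation can also be checked numerically by computing the share of high-traffic recipes per category. The sketch below uses a handful of made-up rows and assumes the 'High'/'Low' labels; the resulting shares are illustrative only.

```python
import pandas as pd

# Toy data (made-up rows), using category names from the data dictionary
toy = pd.DataFrame({
    'category': ['Pork', 'Pork', 'Pork', 'Beverages', 'Beverages', 'Beverages', 'Potato'],
    'high_traffic': ['High', 'High', 'Low', 'Low', 'Low', 'High', 'High'],
})

# Share of high-traffic recipes within each category, highest first
high_share = (
    toy['high_traffic'].eq('High')
    .groupby(toy['category'])
    .mean()
    .sort_values(ascending=False)
)
print(high_share)
```

Because the comparison is a proportion within each category, it is not distorted by some categories simply having more recipes than others.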

Hidden code

Categorical Variables - Servings, Category

Looking at the bar plot below, we can see that chicken is the most common recipe category, followed by breakfast and beverages. However, a category being frequent in the data does not imply that its recipes are popular. Additionally, the majority of recipes serve four people, which might prompt humorous thoughts of double dates.
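
Count plots for the two categorical variables could be drawn as in the following sketch. The rows are made up for illustration, so the counts do not reflect the actual dataset.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data (made-up rows)
toy = pd.DataFrame({
    'category': ['Chicken', 'Chicken', 'Breakfast', 'Beverages', 'Chicken', 'Breakfast'],
    'servings': [4, 4, 2, 6, 4, 1],
})

# Order the categories from most to least frequent
order = toy['category'].value_counts().index

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(data=toy, y='category', order=order, ax=axes[0])
axes[0].set_title('Recipes per category')
sns.countplot(data=toy, x='servings', ax=axes[1])
axes[1].set_title('Recipes per number of servings')
fig.tight_layout()
plt.show()
```

Sorting the bars by frequency makes the ranking of categories readable at a glance.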