
Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

📝 Task List

Your written report should include code, output, and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake


# Start coding here...
# Task 1: Data Validation

# import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# importing data
data = pd.read_csv('recipe_site_traffic_2212.csv')
print(data.info())
print()
data.head()
# Data cleaning
# count missing values per column (isnull() is an alias of isna())
data.isna().sum()
data['servings'].unique()

In the dataset, the 'servings' column should be of integer type, but it was read in as strings because some values were entered as text (e.g. '4 as a snack'). Let's transform it:

data['servings'] = data['servings'].replace(['4 as a snack', '6 as a snack'], ['4', '6'])
data['servings'].unique()
# Now the 'servings' column can be converted to integer, which is the correct data type
data['servings'] = data['servings'].astype(int)
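A more general version of this cleanup (a sketch on toy values, not the exam dataset) is to extract the leading digits of every entry before casting, which would also cover text variants not listed explicitly:

```python
import pandas as pd

# toy servings column mimicking the messy values in the dataset
servings = pd.Series(['1', '2', '4 as a snack', '6 as a snack'])

# keep only the leading digits of each entry, then cast to int
cleaned = servings.str.extract(r'^(\d+)')[0].astype(int)
print(cleaned.tolist())  # → [1, 2, 4, 6]
```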
print(data.info())
data.isna().sum()
# Handling missing values
mean_calories = np.round(data['calories'].mean(), 2)
data['calories'] = data['calories'].fillna(mean_calories).round(2)
mean_carbo = np.round(data['carbohydrate'].mean(), 2)
data['carbohydrate'] = data['carbohydrate'].fillna(mean_carbo).round(2)
data['sugar'] = data['sugar'].fillna(np.round(data['sugar'].mean(), 2)).round(2)
data['protein'] = data['protein'].fillna(np.round(data['protein'].mean(), 2)).round(2)

# Let's fill the null values in 'high_traffic' with 'Not high', since the remaining rows are marked 'High'
data['high_traffic'] = data['high_traffic'].fillna('Not high')

data.isna().sum()
data.head()
# expected recipe categories, for reference
print(['Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'])
data['category'].unique(), data['high_traffic'].unique()
# drop rows with the unexpected 'Chicken Breast' category
data = data[data['category'] != 'Chicken Breast']
data['category'].unique(), data['category'].nunique()
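The category check above can be made explicit by comparing against the ten expected groupings (a sketch on toy values; `observed` stands in for `data['category']`):

```python
import pandas as pd

expected = {'Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat',
            'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'}

# toy stand-in for data['category']
observed = pd.Series(['Chicken', 'Pork', 'Chicken Breast', 'Dessert'])

# any category not in the expected set is flagged for removal
unexpected = set(observed.unique()) - expected
print(unexpected)  # → {'Chicken Breast'}
```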
# Validate any negative value
data.describe().round(2)

DATA VALIDATION

The dataset contains 946 rows and 8 columns before data inspection, cleaning and validation. The columns are recipe, calories, carbohydrate, protein, sugar, servings, category, high_traffic.

After visualizing and inspecting the data, it was discovered that the data was messy. These are the steps taken to clean and prepare it:

  1. The 'servings' column was a character (string/object) type instead of a numeric (int) type, because some rows contained text rather than plain numbers (e.g. '4 as a snack'). The column was cleaned by making every row a plain number, then converting it to int.
  2. The 'category' column was supposed to contain 10 distinct values, but 11 were present: 'Chicken Breast' should not be there. Consequently, all rows with 'Chicken Breast' were dropped from the data.
  3. The 'high_traffic' column was supposed to contain 'High' or 'Not high' in each row. Upon inspection it only contained 'High' values, with about 373 rows missing. Since the column indicates whether or not a recipe drew high traffic, and this is a classification problem with exactly two labels, the missing values were filled with 'Not high'.
  4. 52 missing values were found in each of the following columns: calories, carbohydrate, sugar and protein. They were treated by filling them with the respective column means, a common imputation choice for a fairly large dataset.
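The per-column mean imputation in step 4 can be collapsed into one loop (a sketch on a toy frame; the column list would match the four numeric nutrition columns in the real data):

```python
import numpy as np
import pandas as pd

# toy frame with the same kind of gaps as the nutrition columns
df = pd.DataFrame({'calories': [100.0, np.nan, 300.0],
                   'sugar': [5.0, 10.0, np.nan]})

# fill each column's missing values with its own rounded mean
for col in ['calories', 'sugar']:
    df[col] = df[col].fillna(round(df[col].mean(), 2)).round(2)

print(df['calories'].tolist())  # → [100.0, 200.0, 300.0]
print(df['sugar'].tolist())     # → [5.0, 10.0, 7.5]
```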
After cleaning and validation, I rechecked the data and confirmed that each column had the right data type and no missing values. Below is a detailed report:

| Column Name | Details |
| --- | --- |
| recipe | Numeric (int). Unique identifier of the recipe. |
| calories | Numeric (float). Number of calories. Missing values were filled with the mean. |
| carbohydrate | Numeric (float). Amount of carbohydrates in grams. Missing values were filled with the mean. |
| sugar | Numeric (float). Amount of sugar in grams. Missing values were filled with the mean. |
| protein | Numeric (float). Amount of protein in grams. Missing values were filled with the mean. |
| category | Character (string). Type of recipe, listed in one of ten possible groupings ('Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'). |
| servings | Numeric (int). Number of servings for the recipe. |
| high_traffic | Character (string). Marked "High" if traffic to the site was high when this recipe was shown, otherwise "Not high". |
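These invariants can also be asserted programmatically so the validation is repeatable (a sketch on a toy frame standing in for the cleaned dataset):

```python
import pandas as pd

# toy stand-in for the cleaned dataset
df = pd.DataFrame({'servings': [1, 4],
                   'high_traffic': ['High', 'Not high']})

assert df.isna().sum().sum() == 0                       # no missing values remain
assert pd.api.types.is_integer_dtype(df['servings'])    # servings is numeric (int)
assert set(df['high_traffic']) <= {'High', 'Not high'}  # only the two allowed labels
print('validation checks passed')
```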