
Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

📝 Task List

Your written report should include code, output, and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake


# Start coding here...
# Task 1: Data Validation

# import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# importing data
data = pd.read_csv('recipe_site_traffic_2212.csv')
print(data.info())
print()
data.head()
# Data cleaning
# count missing values per column (isnull() is an alias of isna())
data.isna().sum()
data['servings'].unique()

In the dataset, the 'servings' column should be of integer type, but it was read in as strings because some values were entered as text (e.g. '4 as a snack'). Let's transform it:

data['servings'] = data['servings'].replace(['4 as a snack', '6 as a snack'], ['4', '6'])
data['servings'].unique()
# Now the 'servings' column can be converted to integer, which is the correct data type
data['servings'] = data['servings'].astype(int)
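A more general version of this cleanup (a sketch on toy values, not the exam dataset) is to extract the leading digits of every entry before casting, which would also cover text variants not listed explicitly:

```python
import pandas as pd

# toy servings column mimicking the messy values in the dataset
servings = pd.Series(['1', '2', '4 as a snack', '6 as a snack'])

# keep only the leading digits of each entry, then cast to int
cleaned = servings.str.extract(r'^(\d+)')[0].astype(int)
print(cleaned.tolist())  # → [1, 2, 4, 6]
```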
print(data.info())
data.isna().sum()
# Handling missing values
mean_calories = np.round(data['calories'].mean(), 2)
data['calories'] = data['calories'].fillna(mean_calories).round(2)
mean_carbo = np.round(data['carbohydrate'].mean(), 2)
data['carbohydrate'] = data['carbohydrate'].fillna(mean_carbo).round(2)
data['sugar'] = data['sugar'].fillna(np.round(data['sugar'].mean(), 2)).round(2)
data['protein'] = data['protein'].fillna(np.round(data['protein'].mean(), 2)).round(2)

# Let's fill the null values in 'high_traffic' with 'Not high', since the remaining rows are marked 'High'
data['high_traffic'] = data['high_traffic'].fillna('Not high')

data.isna().sum()
data.head()
# expected recipe categories, for reference
print(['Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'])
data['category'].unique(), data['high_traffic'].unique()
# drop rows with the unexpected 'Chicken Breast' category
data = data[data['category'] != 'Chicken Breast']
data['category'].unique(), data['category'].nunique()
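The category check above can be made explicit by comparing against the ten expected groupings (a sketch on toy values; `observed` stands in for `data['category']`):

```python
import pandas as pd

expected = {'Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat',
            'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'}

# toy stand-in for data['category']
observed = pd.Series(['Chicken', 'Pork', 'Chicken Breast', 'Dessert'])

# any category not in the expected set is flagged for removal
unexpected = set(observed.unique()) - expected
print(unexpected)  # → {'Chicken Breast'}
```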
# Validate any negative value
data.describe().round(2)

DATA VALIDATION

The dataset contains 946 rows and 8 columns before data inspection, cleaning and validation. The columns are recipe, calories, carbohydrate, protein, sugar, servings, category, high_traffic.

After visualizing and inspecting the data, it was discovered that the data was messy. These are the steps taken to clean and prepare it:

  1. The 'servings' column was a character (string/object) type instead of a numeric (int) type, because some rows contained text rather than plain numbers (e.g. '4 as a snack'). The column was cleaned by making every row a plain number, then converting it to int.
  2. The 'category' column was supposed to contain 10 distinct values, but 11 were present: 'Chicken Breast' should not be there. Consequently, all rows with 'Chicken Breast' were dropped from the data.
  3. The 'high_traffic' column was supposed to contain 'High' or 'Not high' in each row. Upon inspection it only contained 'High' values, with about 373 rows missing. Since the column indicates whether or not a recipe drew high traffic, and this is a classification problem with exactly two labels, the missing values were filled with 'Not high'.
  4. 52 missing values were found in each of the following columns: calories, carbohydrate, sugar and protein. They were treated by filling them with the respective column means, a common imputation choice for a fairly large dataset.
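The per-column mean imputation in step 4 can be collapsed into one loop (a sketch on a toy frame; the column list would match the four numeric nutrition columns in the real data):

```python
import numpy as np
import pandas as pd

# toy frame with the same kind of gaps as the nutrition columns
df = pd.DataFrame({'calories': [100.0, np.nan, 300.0],
                   'sugar': [5.0, 10.0, np.nan]})

# fill each column's missing values with its own rounded mean
for col in ['calories', 'sugar']:
    df[col] = df[col].fillna(round(df[col].mean(), 2)).round(2)

print(df['calories'].tolist())  # → [100.0, 200.0, 300.0]
print(df['sugar'].tolist())     # → [5.0, 10.0, 7.5]
```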
After cleaning and validation, I rechecked the data and confirmed that each column had the right data type and no missing values. Below is a detailed report:

| Column Name | Details |
| --- | --- |
| recipe | Numeric (int). Unique identifier of the recipe. |
| calories | Numeric (float). Number of calories. Missing values were filled with the mean. |
| carbohydrate | Numeric (float). Amount of carbohydrates in grams. Missing values were filled with the mean. |
| sugar | Numeric (float). Amount of sugar in grams. Missing values were filled with the mean. |
| protein | Numeric (float). Amount of protein in grams. Missing values were filled with the mean. |
| category | Character (string). Type of recipe, listed in one of ten possible groupings ('Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken', 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal'). |
| servings | Numeric (int). Number of servings for the recipe. |
| high_traffic | Character (string). Marked "High" if traffic to the site was high when this recipe was shown, otherwise "Not high". |
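These invariants can also be asserted programmatically so the validation is repeatable (a sketch on a toy frame standing in for the cleaned dataset):

```python
import pandas as pd

# toy stand-in for the cleaned dataset
df = pd.DataFrame({'servings': [1, 4],
                   'high_traffic': ['High', 'Not high']})

assert df.isna().sum().sum() == 0                       # no missing values remain
assert pd.api.types.is_integer_dtype(df['servings'])    # servings is numeric (int)
assert set(df['high_traffic']) <= {'High', 'Not high'}  # only the two allowed labels
print('validation checks passed')
```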