Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

📝 Task List

Your written report should include code, output and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake

BUSINESS GOAL

  • Predict which recipes will lead to high traffic
  • Correctly predict high traffic recipes 80% of the time
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
plt.style.use('ggplot')
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, make_scorer

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV, StratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

Data Validation

Before any changes, the dataset had 947 rows and 8 columns. In this step I examine, validate and clean each column in the dataset by working through the following tasks:

  1. Check for missing values that need to be replaced or removed.
  2. Inspect the dataset and check the type of each column to see whether any need to be changed.
  3. Check for duplicate values and remove them.
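
As a rough illustration of the three checks listed above, here is a minimal sketch on a small made-up frame. The rows, values and the `toy` name are assumptions for demonstration only and are not taken from the recipe dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not the real recipe data
toy = pd.DataFrame({
    "recipe": [1, 2, 3, 3],
    "calories": [35.5, np.nan, 914.3, 914.3],
    "servings": ["4", "2", "6 as a snack", "6 as a snack"],
})

# 1. Count missing values per column
missing_per_column = toy.isna().sum()

# 2. Inspect the type of each column
column_types = toy.dtypes

# 3. Count fully duplicated rows and drop them
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(missing_per_column)
print(column_types)
print(n_duplicates, len(deduplicated))
```

Running the same three calls on the real `df` gives the counts that the sections below discuss.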

Check Missing Values

The columns calories, carbohydrate, sugar and protein had 52 missing values. Since these represent a small percentage (about 5.5%) of the 947 recipes, I decided to drop the rows where these columns are null.

The column high_traffic has 373 nulls, which would have a big impact on our dataset. However, on closer inspection and according to the description of the dataset, the label High was assigned only to recipes that received high traffic, while the other recipes were not assigned any value. We can therefore conclude that the missing values correspond to recipes that did not have high traffic, so I replace the nulls with the label Low to deal with this large number of missing values.

After these changes, the dataset contains 895 recipes.

# Import the dataset
df = pd.read_csv('recipe_site_traffic_2212.csv')
df.info()
df.head()
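# Count missing values per column
print(df.isna().sum().sort_values())
# Drop rows with missing nutritional values
df.dropna(subset=['calories', 'carbohydrate', 'sugar', 'protein'], inplace=True)
print(df.isna().sum().sort_values())
# Nulls in high_traffic mean the recipe did not get high traffic: label them 'Low'
print(df['high_traffic'].unique())
# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
df['high_traffic'] = df['high_traffic'].fillna('Low')
print(df['high_traffic'].unique())
print(df.isna().sum().sort_values())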

Data types

After looking at the type of each column, we can immediately see that the column servings needs to be converted from object to integer type, while category and high_traffic need to be changed to categorical type.

Besides that, after a closer inspection of the columns category and servings, I concluded that some values must be changed to get the data into the correct format. For the column servings, two values, '4 as a snack' and '6 as a snack', need to be changed to their respective integer values, 4 and 6.

For the column category, 'Chicken Breast' needs to be changed to the correct category, which is just 'Chicken'.
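
One possible way to apply these conversions is sketched below on a few made-up rows. The `toy` frame and its values are assumptions for illustration, and stripping the ' as a snack' suffix is just one option; mapping the two strings explicitly would work equally well.

```python
import pandas as pd

# Made-up rows for illustration only; not the real recipe data
toy = pd.DataFrame({
    "servings": ["4", "6 as a snack", "4 as a snack", "2"],
    "category": ["Chicken Breast", "Chicken", "Dessert", "Chicken Breast"],
    "high_traffic": ["High", "Low", "High", "Low"],
})

# Remove the ' as a snack' suffix, then cast the remaining digits to integer
toy["servings"] = (
    toy["servings"].str.replace(" as a snack", "", regex=False).astype(int)
)

# Merge 'Chicken Breast' into 'Chicken', then store as categorical
toy["category"] = (
    toy["category"].replace("Chicken Breast", "Chicken").astype("category")
)

# high_traffic only holds the labels High and Low, so categorical fits as well
toy["high_traffic"] = toy["high_traffic"].astype("category")

print(toy.dtypes)
```

The same statements can be run against `df` once the missing values have been handled.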