
Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

📝 Task List

Your written report should include code, output, and written summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake


# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

1. Explore and Validate the Data

The dataset comprises 947 rows and 8 columns. Each column was validated, and the following modifications were applied:

The column recipe was removed from our data.

Because the data is skewed and contains outliers, the median is a more robust imputation choice than the mean; missing values were therefore filled with the column median for the following columns: [calories, carbohydrate, sugar, protein].

In the servings column, the text in the entries '4 as a snack' and '6 as a snack' was stripped, leaving only the numbers, and the column was then converted to a numeric type.

In the category column, "Chicken Breast" was replaced with "Chicken".

The target column high_traffic was sanitized by replacing all NaN entries with 'Low'. Below is a summary of the measures applied to each column:

  • Recipe: This identifier is unique for each recipe and has no missing values. It requires no cleaning but will be excluded from our model as it is irrelevant for prediction purposes.
  • Calories: This numeric field has 52 missing entries, which will be imputed with the median value.
  • Carbohydrate: Also numeric, this field has 52 missing entries, to be filled with the median.
  • Sugar: With 52 missing numeric values, these will be replaced by the median.
  • Protein: This numeric field has 52 missing entries, which will be filled with the median.
  • Category: Initially comprising 11 categorical values with no missing entries. One cleaning step is needed to match the data description: "Chicken Breast" will be replaced with "Chicken".
  • Servings: Initially containing 6 categorical values with no missing entries, it will be cleaned to include only 4 numeric values, with '4 as a snack' and '6 as a snack' converted to '4' and '6', respectively.
  • High_traffic: This column contains a single distinct value and 373 missing entries; the missing entries will be uniformly assigned to the "Low" category.
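
The median imputation described above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame; the actual report applies the same operation to the loaded dataset.

```python
# Minimal sketch of the median imputation step, on a hypothetical
# toy frame (the real code operates on the loaded dataset instead).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'calories': [100.0, np.nan, 900.0],
    'carbohydrate': [10.0, 20.0, np.nan],
    'sugar': [np.nan, 5.0, 50.0],
    'protein': [3.0, np.nan, 30.0],
})

# The median is robust to skew and outliers, unlike the mean.
num_cols = ['calories', 'carbohydrate', 'sugar', 'protein']
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
```

Note that `DataFrame.median()` skips NaN values by default, so each column is filled with the median of its observed values.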
# Data loading and displaying the first few elements
data = pd.read_csv('recipe_site_traffic_2212.csv')
data.head(5)
# Exploring the information of the data
data.info()
# The percentage of missing data by columns
print("% of missing values by column \n\n" + str(round(data.isna().mean() * 100, 2)))

In the high_traffic target column, a significant percentage of values are missing; however, these are not truly missing values but rather represent cases with no high traffic. Therefore, we will label all such instances as Low.
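
The relabelling can be sketched as follows, on a hypothetical toy frame standing in for the real data:

```python
# Minimal sketch: missing high_traffic entries are not truly missing,
# so they are labelled 'Low' (toy frame for illustration only).
import numpy as np
import pandas as pd

df = pd.DataFrame({'high_traffic': ['High', np.nan, 'High', np.nan]})
df['high_traffic'] = df['high_traffic'].fillna('Low')
```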

# Description of the Numerical data
data.select_dtypes(include=['int64', 'float64']).describe()
# Displaying the count of unique values in numerical columns
data.select_dtypes(include=['int64', 'float64']).nunique()
# Description of the Categorical data
data.select_dtypes(include=['object']).describe()

The data type of the servings column is currently categorical, which is incorrect. It will be converted to numeric.

The category column contains 11 categories; it will be reduced to the 10 categories listed in the data description.

# Displaying unique elements in categorical columns
for c in data.select_dtypes(include=['object']):
    print(f"The '{c}' column contains:")
    print(list(data[c].unique()))
    print('\n')

In the servings column, the entries '4 as a snack' and '6 as a snack' will be cleaned by stripping the text, leaving only the numeric value.

In the category column, 'Chicken Breast' entries will be replaced with 'Chicken'.
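
These two fixes can be sketched as follows, again on a hypothetical toy frame rather than the loaded dataset:

```python
# Minimal sketch of the servings and category fixes (toy frame).
import pandas as pd

df = pd.DataFrame({
    'servings': ['4', '6', '4 as a snack', '6 as a snack'],
    'category': ['Chicken Breast', 'Chicken', 'Dessert', 'Pork'],
})

# Strip the trailing text, then convert servings to a numeric type.
df['servings'] = (df['servings']
                  .str.replace(' as a snack', '', regex=False)
                  .astype(int))
# Collapse 'Chicken Breast' into 'Chicken'.
df['category'] = df['category'].replace('Chicken Breast', 'Chicken')
```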

2. Data Cleansing and Preparation for Analysis