Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

πŸ“ Task List

Your written report should include code, output, and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake


1. Data Validation

This section documents the validation and cleaning steps applied to each column of the recipe dataset, ensuring data accuracy and integrity before analysis.

# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PowerTransformer, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

1.1 Data Cleaning

In this section, I load the dataset and check for missing values. This initial inspection reveals significant numbers of missing entries in several columns, particularly calories, carbohydrate, sugar, protein, and high_traffic, highlighting the need for data cleaning before analysis.

df = pd.read_csv(r'./recipe_site_traffic_2212.csv')
df.head(n=10)
copy_df = df.copy()
# Checking for missing values in the dataset
missing_values = copy_df.isnull().sum()
print(missing_values)
copy_df.info()
  • Total Entries: 947
  • Columns:
    • recipe: 947 non-null (integer type)
    • calories: 895 non-null (float type) β†’ 52 missing values
    • carbohydrate: 895 non-null (float type) β†’ 52 missing values
    • sugar: 895 non-null (float type) β†’ 52 missing values
    • protein: 895 non-null (float type) β†’ 52 missing values
    • category: 947 non-null (object type)
    • servings: 947 non-null (object type)
    • high_traffic: 574 non-null (object type) β†’ 373 missing values
  • Missing Values: There are multiple columns with missing values, particularly the nutrient columns (calories, carbohydrate, sugar, and protein), which have 52 missing values each, and high_traffic, which has 373 missing values.
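The per-column counts above can also be expressed as percentages, which makes the scale of the problem easier to judge. The sketch below shows the idea on a tiny synthetic frame that mirrors part of the dataset's schema (the column names match the real data; the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame mirroring part of the dataset's schema; values are illustrative only
toy = pd.DataFrame({
    'recipe': [1, 2, 3, 4],
    'calories': [120.0, np.nan, 310.5, 95.0],
    'high_traffic': ['High', None, 'High', None],
})

# Missing counts and percentages per column
missing = toy.isnull().sum()
missing_pct = (missing / len(toy) * 100).round(1)
report = pd.DataFrame({'missing': missing, 'pct': missing_pct})
print(report)
```

On the real data, the same two lines applied to copy_df would show roughly 5.5% missing for each nutrient column and about 39% for high_traffic.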
# Checking the unique values in the 'high_traffic' column
high_traffic_values = copy_df['high_traffic'].unique()
print(high_traffic_values)
# Replacing NaN values in the 'high_traffic' column with 'Low'
# (assigning back rather than using inplace=True on a column, which is
# deprecated in recent pandas versions)
copy_df['high_traffic'] = copy_df['high_traffic'].fillna('Low')

# Checking the updated unique values in the 'high_traffic' column
updated_high_traffic_values = copy_df['high_traffic'].unique()
print(updated_high_traffic_values)
# Checking for missing values in the 'high_traffic'
missing_high_traffic = copy_df['high_traffic'].isnull().sum()
print(f"Missing values in 'high_traffic' column: {missing_high_traffic}")
  • Initial Check: I examined the unique values in the high_traffic column and found two values: 'High' and NaN.

  • Handling Missing Values: To address the NaN values, I replaced them with 'Low'. This ensures all entries in the column are accounted for.

  • Updated Values: After replacing NaNs, I checked the unique values again, confirming they are now 'High' and 'Low'.

  • Final Verification: I also checked for any remaining missing values in the high_traffic column and found none, which ensures the data is complete and ready for further analysis.
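For the modelling stage later in the report, the cleaned 'High'/'Low' labels can be mapped to a binary target. A minimal sketch, using a synthetic series with the same column name (the encoding choice of High = 1 as the positive class is an assumption consistent with the business focus on high traffic):

```python
import pandas as pd

# Synthetic stand-in for the cleaned high_traffic column
s = pd.Series(['High', 'Low', 'High', 'Low'], name='high_traffic')

# Encode High as 1 (positive class) and Low as 0
target = s.map({'High': 1, 'Low': 0})
print(target.tolist())  # [1, 0, 1, 0]
```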

# Dropping rows with missing values in specified float-type columns
copy_df_cleaned = copy_df.dropna(subset=['calories', 'carbohydrate', 'sugar', 'protein'])

# Checking the shape of the cleaned DataFrame
print(f"Original DataFrame shape: {copy_df.shape}")
print(f"Cleaned DataFrame shape: {copy_df_cleaned.shape}")
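Dropping these 52 rows loses only about 5.5% of the data, which is a defensible choice. An alternative worth noting, sketched below on synthetic data, is median imputation of the nutrient columns; this is not the approach taken in this report, just a comparison point:

```python
import pandas as pd
import numpy as np

# Toy nutrient columns with gaps; values are illustrative only
toy = pd.DataFrame({
    'calories': [100.0, np.nan, 300.0],
    'sugar': [5.0, 10.0, np.nan],
})

# Fill each gap with that column's median instead of dropping the row
imputed = toy.fillna(toy.median(numeric_only=True))
print(imputed)
```

Median imputation keeps the sample size intact and is robust to the skew typical of nutrient values, at the cost of slightly dampening variance in the imputed columns.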