Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

📝 Task List

Your written report should include code, output, and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake

1. Importing the required libraries

#importing relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import boxcox, yeojohnson
from scipy.stats.mstats import winsorize
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

2. Data Validation & Cleaning

Initial Data Inspection

The dataset is loaded and given a preliminary review. General information about its structure and contents is displayed to understand what kind of data is present. Summary statistics are also examined to get a sense of the distribution and possible anomalies. Finally, a quick preview of the first few rows helps to visualize how the data is organized.

recipe = pd.read_csv('recipe_site_traffic_2212.csv')
print(recipe.info())
print(recipe.describe())
recipe.head()

Checking for Missing Values

A check is performed across all columns to identify how many missing values are present in the dataset. This helps determine if any data cleaning or imputation is necessary before further analysis.

print(recipe.isnull().sum())
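If any columns do turn out to have gaps, one simple option mentioned above is imputation. A minimal sketch of median imputation on a toy frame is shown below; the nutrition column names here are illustrative stand-ins, not confirmed from the real file:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the recipe data; 'calories' and
# 'protein' are assumed example columns, not verified names.
df = pd.DataFrame({
    'calories': [100.0, np.nan, 250.0, 400.0],
    'protein':  [5.0, 12.0, np.nan, 20.0],
})

# Fill numeric gaps with the column median, which is robust to
# the right skew often seen in nutrition values.
for col in ['calories', 'protein']:
    df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum().sum())  # 0 remaining missing values
```

The same loop would apply to the real dataframe once the genuinely numeric columns are identified.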

Examining the 'servings' Value Distribution

The frequency of each unique value in the "servings" column is checked to understand how often different serving sizes appear and whether any entries are inconsistent and need cleaning.

print(recipe['servings'].value_counts())

Cleaning the 'servings' Column

The servings column contained entries like "4 as a snack" and "6 as a snack," where numeric values were combined with descriptive text. To make the data usable for analysis, only the numeric part was extracted and converted into integers. This ensured that the servings column was clean, consistent, and ready for numerical operations.

# Extract the leading digits and convert to integers in place.
# expand=False keeps the result a Series rather than a one-column DataFrame.
recipe['servings'] = recipe['servings'].astype(str).str.extract(r'(\d+)', expand=False).astype(int)

# Optional: check cleaned values
print(recipe['servings'].value_counts())
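To see the extraction in isolation, here is a small self-contained demonstration on sample values mirroring the raw column, including the mixed-text entries mentioned above:

```python
import pandas as pd

# Sample values mirroring the raw 'servings' column.
raw = pd.Series(['4', '6', '4 as a snack', '6 as a snack'])

# expand=False returns a Series, so assignment back to a column
# is unambiguous; the regex keeps only the leading digits.
cleaned = raw.str.extract(r'(\d+)', expand=False).astype(int)
print(cleaned.tolist())  # [4, 6, 4, 6]
```

Note that `.astype(int)` would fail if any entry contained no digits at all, so it is worth confirming the column has no such values first.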

Reviewing Data Structure

The structure of the dataset is reviewed again to confirm data types and ensure that previous cleaning steps were applied successfully. The overall shape of the dataset, including the number of rows and columns, is also examined to get a sense of its size.

recipe.info()
print(recipe.shape)
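The dtype confirmation can also be done programmatically rather than by eye. A minimal sketch on a stand-in frame (the column names here are illustrative; in the report this check would run on `recipe`):

```python
import pandas as pd

# Stand-in frame after cleaning; 'category' values are assumed examples.
df = pd.DataFrame({'servings': [4, 6, 2],
                   'category': ['Pork', 'Dessert', 'Potato']})

# Assert that the cleaned column is integer-typed, then report shape.
assert pd.api.types.is_integer_dtype(df['servings'])
print(df.shape)  # (3, 2)
```

An assertion like this fails loudly if a later cleaning step silently reverts the column to strings.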