Data Scientist Professional Practical Exam Submission
Use this template to write up your summary for submission. Code in Python or R needs to be included.
Task List
Your written report should include code, output, and written text summaries of the following:
- Data Validation:
  - Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
  - Include two different graphics showing single variables only to demonstrate the characteristics of data
  - Include at least one graphic showing two or more variables to represent the relationship between features
  - Describe your findings
- Model Development:
  - Include your reasons for selecting the models you use as well as a statement of the problem type
  - Code to fit the baseline and comparison models
- Model Evaluation:
  - Describe the performance of the two models based on an appropriate metric
- Business Metrics:
  - Define a way to compare your model performance to the business
  - Describe how your models perform using this approach
- Final summary including recommendations that the business should undertake
1. Data Validation
This section covers the validation and cleaning steps applied to the data to ensure accuracy and integrity before analysis.
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PowerTransformer, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

1.1 Data Cleaning
In this section, I load the dataset and check for missing values. Several columns, in particular calories, carbohydrate, sugar, protein, and high_traffic, contain a significant number of missing entries. This initial inspection highlights the need for data cleaning to ensure quality before analysis.
df = pd.read_csv(r'./recipe_site_traffic_2212.csv')
df.head(n=10)

# Make a working copy so the raw data stays untouched
copy_df = df.copy()

# Checking for missing values in the dataset
missing_values = copy_df.isnull().sum()
print(missing_values)

copy_df.info()

- Total Entries: 947
- Columns:
  - recipe: 947 non-null (integer type)
  - calories: 895 non-null (float type), 52 missing values
  - carbohydrate: 895 non-null (float type), 52 missing values
  - sugar: 895 non-null (float type), 52 missing values
  - protein: 895 non-null (float type), 52 missing values
  - category: 947 non-null (object type)
  - servings: 947 non-null (object type)
  - high_traffic: 574 non-null (object type), 373 missing values
- Missing Values: The nutrient columns (calories, carbohydrate, sugar, and protein) each have 52 missing values, and high_traffic has 373 missing values. A percentage view of these gaps is sketched below.
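To put these counts in context, the share of missing values per column can also be computed. This is a small illustrative check I have added to the report; the variable name missing_pct is my own and not from the original notebook.

# Percentage of missing values per column (illustrative check added to the report)
missing_pct = copy_df.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))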
# Checking the unique values in the 'high_traffic' column
high_traffic_values = copy_df['high_traffic'].unique()
print(high_traffic_values)

# Replacing NaN values in the 'high_traffic' column with 'Low'
copy_df['high_traffic'] = copy_df['high_traffic'].fillna('Low')
# Checking the updated unique values in the 'high_traffic' column
updated_high_traffic_values = copy_df['high_traffic'].unique()
print(updated_high_traffic_values)
# Checking for missing values in the 'high_traffic'
missing_high_traffic = copy_df['high_traffic'].isnull().sum()
print(f"Missing values in 'high_traffic' column: {missing_high_traffic}")
- Initial Check: I examined the unique values in the high_traffic column and found two values: 'High' and NaN.
- Handling Missing Values: To address the NaN values, I replaced them with 'Low'. This ensures all entries in the column are accounted for.
- Updated Values: After replacing NaNs, I checked the unique values again, confirming they are now 'High' and 'Low'.
- Final Verification: I also checked for any remaining missing values in the high_traffic column and found none, which confirms the column is complete and ready for further analysis (a quick class-count check is sketched after this list).
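As the quick check referenced above, the class counts of high_traffic can be printed to confirm that only 'High' and 'Low' remain and to see the class balance. This snippet is my own addition, not part of the original code.

# Count of 'High' vs 'Low' labels after imputing missing values with 'Low' (added check)
print(copy_df['high_traffic'].value_counts(dropna=False))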
# Dropping rows with missing values in specified float-type columns
copy_df_cleaned = copy_df.dropna(subset=['calories', 'carbohydrate', 'sugar', 'protein'])
# Checking the shape of the cleaned DataFrame
print(f"Original DataFrame shape: {copy_df.shape}")
print(f"Cleaned DataFrame shape: {copy_df_cleaned.shape}")