Data Scientist Professional Practical Exam Submission
Use this template to write up your summary for submission. Code in Python or R needs to be included.
Task List
Your written report should include code, output, and written text summaries of the following:
- Data Validation:
- Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
- Model Development
- Include your reasons for selecting the models you use as well as a statement of the problem type
- Code to fit the baseline and comparison models
- Model Evaluation
- Describe the performance of the two models based on an appropriate metric
- Business Metrics
- Define a way to compare your model performance to the business
- Describe how your models perform using this approach
- Final summary including recommendations that the business should undertake
When you have finished...
- Publish your Workspace using the option on the left
- Check the published version of your report:
- Can you see everything you want us to grade?
- Are all the graphics visible?
- Review the grading rubric. Have you included everything that will be graded?
- Head back to the Certification Dashboard to submit your practical exam report and record your presentation
Introduction
Tasty Bytes is an online recipe search engine that has built a subscription-based model in which they provide meal plans and even ship ingredients to their premium customers. One of their features is a daily recipe displayed on their home page. They have found that certain daily recipes can increase website traffic, and they want to use past data to predict which recipes will generate high traffic, which in turn leads to increased subscriptions.
The goal of this project is to take the data contained in the file recipe_site_traffic_2212.csv and prepare a model that can predict which recipes will generate high traffic while also minimizing the chances of uploading an unpopular recipe. The target is a model that is at least 80% accurate in predicting popularity.
Data Validation
Data Summary
The data was read into a pandas DataFrame and initial observations were performed to discern the properties of the data. It is important to determine the data type of each variable, assess its validity, and perform any cleaning as necessary.
#Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Read in the data
recipe_traffic_df = pd.read_csv('recipe_site_traffic_2212.csv')
#Display the first 5 rows
display(recipe_traffic_df.head())
#Display the data types of the different columns
display(recipe_traffic_df.dtypes)
#Display the dimensions of the data
display(recipe_traffic_df.shape)
Initial observations showed that there were 974 rows (or observations) split across eight columns in the DataFrame; however, only seven of them were viable for use in the model. The first column, `recipe`, was an indexing column that was not appropriate to use as a predictor variable. The seven remaining variables were:
- `calories` - A float data type describing the number of calories per serving.
- `carbohydrate` - A float data type listing the number of carbohydrates (in grams) per serving.
- `sugar` - A float data type listing the amount of sugar (in grams) per serving.
- `protein` - A float data type listing the amount of protein (in grams) per serving.
- `category` - An object data type listing the category the recipe belongs under.
- `servings` - An object data type listing the number of servings the recipe provides.
- `high_traffic` - An object data type describing whether or not the recipe garnered high traffic for the website. This will be the dependent variable for the model.
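To confirm these descriptions and identify which columns need cleaning, a quick check of missing values, summary statistics, and unique values can be run on the DataFrame created above. This is a minimal sketch of such checks; it introduces no new columns.
#Count missing values in each column to identify cleaning targets
display(recipe_traffic_df.isna().sum())
#Summary statistics for the numeric columns
display(recipe_traffic_df.describe())
#Unique values of the object columns to check for inconsistent entries
for col in ['category', 'servings', 'high_traffic']:
    print(col, recipe_traffic_df[col].unique())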
The data types for four of the seven variables were appropriate for data analysis; however, the `category`, `servings`, and `high_traffic` variables needed to be modified to appropriate data types after cleaning, in preparation for running the model.
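As an illustration of the kind of recasting this implies, the sketch below converts `category` to the pandas categorical type and coerces `servings` to a numeric type. The name `recast_sketch` is used here purely for illustration, and the sketch assumes `servings` stores numbers as text; the actual cleaning steps applied are described below.
#Illustrative recasting on a copy of the data (assumption: servings stores numbers as text)
recast_sketch = recipe_traffic_df.copy()
recast_sketch['category'] = recast_sketch['category'].astype('category')
#Non-numeric servings entries become NaN for later cleaning
recast_sketch['servings'] = pd.to_numeric(recast_sketch['servings'], errors='coerce')
display(recast_sketch.dtypes)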
To eliminate the `recipe` column as a predictor variable, it was set as the index of the DataFrame after confirming that all of its entries were unique.
#Check that the recipe column is unique
print("Number of duplicates: " + str(recipe_traffic_df.recipe.duplicated().sum()))
#Set the recipe column as the DataFrame index
recipe_traffic_df = recipe_traffic_df.set_index('recipe')
Data Cleaning
The dependent variable, `high_traffic`, initially contained two values: the string 'High', indicating that a recipe generated high traffic, and a null value, indicating that it did not. To prepare this column for modeling, it was transformed to a Boolean data type: the string 'High' was replaced with `True` and the null values were replaced with `False`, resulting in a Boolean column indicating whether or not a recipe generated high traffic.
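A sketch of this transformation is shown below; comparing the column against the string 'High' yields the Boolean column directly, since nulls compare as False. An equivalent approach, such as notna(), would give the same result if 'High' is the only non-null value.
#Convert high_traffic to Boolean: 'High' -> True, null -> False (sketch of the transformation described above)
recipe_traffic_df['high_traffic'] = recipe_traffic_df['high_traffic'] == 'High'
#Confirm the new data type and the class balance
display(recipe_traffic_df['high_traffic'].value_counts())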