Data Scientist Professional Practical Exam Submission
Use this template to write up your summary for submission. Code in Python or R needs to be included.
📝 Task List
Your written report should include code, output and written text summaries of the following:
- Data Validation:
- Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
- Model Development
- Include your reasons for selecting the models you use as well as a statement of the problem type
- Code to fit the baseline and comparison models
- Model Evaluation
- Describe the performance of the two models based on an appropriate metric
- Business Metrics
- Define a way to compare your model performance to the business
- Describe how your models perform using this approach
- Final summary including recommendations that the business should undertake
# loading required libraries
library(tidyverse)
library(caret)
library(stringr)
library(e1071)
#loading data
df = read.csv("https://s3.amazonaws.com/talent-assets.datacamp.com/recipe_site_traffic_2212.csv")
head(df)
Data validation
- Looking at the structure of the data
str(df)
- The data set has 947 rows and 8 variables. Several columns contain null values, which need to be handled before any analysis.
# checking for nulls
colSums(is.na(df))
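One caveat with this check: `read.csv` may keep blank fields in character columns as empty strings rather than `NA`, in which case `is.na()` alone would miss them. The sketch below counts both kinds of missing entry. It uses a small made-up data frame (`toy`) and a hypothetical helper (`count_missing`), neither of which is part of the original analysis.

```r
# Toy data frame standing in for the recipe data (values are made up)
toy <- data.frame(
  calories = c(120.5, NA, 310.0),
  high_traffic = c("High", "", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count entries that are NA or,
# for character columns, blank strings
count_missing <- function(x) {
  if (is.character(x)) {
    sum(is.na(x) | trimws(x) == "")
  } else {
    sum(is.na(x))
  }
}

missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```

Running the same helper over `df` with `sapply(df, count_missing)` would show whether any character column hides blanks that `colSums(is.na(df))` does not report.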
Handling nulls and undesirable values
# dropping rows with nulls in calories; nulls in carbohydrate, sugar and protein occur in the same rows
df = df %>%
filter(!is.na(calories))
# adding "Low" to high_traffic if value is missing (NA, or a blank string if read.csv kept it as "")
df = df %>%
mutate(high_traffic = ifelse(is.na(high_traffic) | high_traffic == "", "Low", high_traffic))
# Replacing 'Chicken Breast' with Chicken
df = df %>%
mutate(category = ifelse(category=='Chicken Breast', 'Chicken', category))
# solving issues in servings by keeping only the leading digits (drops the trailing ' as a snack')
df = df %>%
mutate(servings = str_extract(servings, pattern = "[0-9]+"))
# confirming no nulls exist
colSums(is.na(df))
# changing data types
df$recipe = as.numeric(df$recipe)
df$servings = as.numeric(df$servings)
# Ensuring recipe has unique values
df = df[!duplicated(df$recipe), ]
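As an illustration of the servings cleanup, here is a minimal base R sketch on made-up values (the vector `raw_servings` is hypothetical, not taken from the data). It shows that converting the raw text directly yields `NA` for the affected entries, while keeping only the leading digits first converts cleanly.

```r
# Made-up examples of the two forms the servings column can take
raw_servings <- c("4", "6 as a snack", "1", "4 as a snack")

# Direct conversion: entries with trailing text become NA (with a warning)
direct <- suppressWarnings(as.numeric(raw_servings))

# Keep only the leading digits, then convert
digits_only <- sub("^([0-9]+).*$", "\\1", raw_servings)
servings_num <- as.numeric(digits_only)

print(direct)
print(servings_num)
```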
In data validation:
i. Nulls are removed.
- Nulls are observed in the calories, carbohydrate, sugar, protein and high_traffic variables.
- Nulls in calories, carbohydrate, sugar and protein occur concurrently, so dropping the rows with nulls in one of these variables leaves the others free of nulls. These nulls are not imputed, because that would mean making more than one assumption in the same row. This is undesirable, as the analysis is likely to be affected by those assumptions, leading to wrong insights and conclusions.
- The high_traffic variable has nulls due to data entry, so replacing the nulls with "Low" leaves the variable free of nulls.
ii. Undesirable values are removed.
- The category variable has 11 categories but 10 are expected. The value 'Chicken Breast' is replaced with 'Chicken' to solve this issue.
- The servings variable has the text 'as a snack' appended to the digit in some rows. This text is dropped, because otherwise those rows would become NA when the variable is converted to numeric. stringr functions are used to perform this operation.
iii. Changing data types.
- recipe and servings are in undesirable data types: integer and character respectively. They are required to be numeric, so they are converted to this data type.
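The claim that the nutrient nulls occur concurrently can be checked directly. The sketch below is a minimal base R illustration on a made-up data frame (`toy`), not the original data: it counts the missing nutrient values per row and confirms that each row is either complete or missing all four.

```r
# Toy data: rows 2 and 4 are missing all four nutrient values (made-up numbers)
toy <- data.frame(
  calories     = c(35.5, NA, 914.3, NA),
  carbohydrate = c(38.6, NA, 42.7, NA),
  sugar        = c(0.7, NA, 3.1, NA),
  protein      = c(0.9, NA, 2.9, NA)
)

nutrient_cols <- c("calories", "carbohydrate", "sugar", "protein")

# Number of missing nutrient values in each row
na_per_row <- rowSums(is.na(toy[nutrient_cols]))

# Nulls are concurrent if every row has either 0 or all 4 values missing
nulls_concurrent <- all(na_per_row %in% c(0, length(nutrient_cols)))

# Dropping rows with missing calories then clears the other columns too
cleaned <- toy[!is.na(toy$calories), ]
remaining_na <- colSums(is.na(cleaned))

print(nulls_concurrent)
print(remaining_na)
```

If `nulls_concurrent` came back `FALSE` on the real data, filtering on calories alone would not be enough and the other nutrient columns would need their own filter.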
str(df)
- After cleaning, the data per column is as follows:
- recipe: it has unique values as per the data description
- calories: it has only numerical values as required
- carbohydrate: it has numeric values as required
- sugar: it has numeric values as required
- protein: it has numeric values as required
- category: it has a character data type and ten categories as expected
- servings: it has numeric values as expected
- high_traffic: it has character data type and two values "High" and "Low"
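The per-column summary above can also be expressed as explicit checks, so that a failed assumption stops the script instead of passing unnoticed. This is a minimal sketch on a made-up data frame (`cleaned`) with a subset of the columns; the values and the name `checks` are illustrative, not part of the original report.

```r
# Made-up stand-in for the cleaned data
cleaned <- data.frame(
  recipe = c(2, 3, 4),
  calories = c(35.5, 914.3, 97.0),
  category = c("Chicken", "Breakfast", "Beverages"),
  servings = c(4, 1, 6),
  high_traffic = c("High", "Low", "High"),
  stringsAsFactors = FALSE
)

# One named logical per expectation from the summary
checks <- c(
  recipe_unique     = !any(duplicated(cleaned$recipe)),
  no_missing        = !any(is.na(cleaned)),
  calories_numeric  = is.numeric(cleaned$calories),
  servings_numeric  = is.numeric(cleaned$servings),
  at_most_10_cats   = length(unique(cleaned$category)) <= 10,
  no_chicken_breast = !("Chicken Breast" %in% cleaned$category),
  traffic_levels    = all(cleaned$high_traffic %in% c("High", "Low"))
)

print(checks)

# Halt if any expectation fails
stopifnot(all(checks))
```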