
Data Scientist Professional Practical Exam Submission

Use this template to write up your summary for submission. Code in Python or R needs to be included.

📝 Task List

Your written report should include code, output and written text summaries of the following:

  • Data Validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Model Development
    • Include your reasons for selecting the models you use as well as a statement of the problem type
    • Code to fit the baseline and comparison models
  • Model Evaluation
    • Describe the performance of the two models based on an appropriate metric
  • Business Metrics
    • Define a way to compare your model performance to the business
    • Describe how your models perform using this approach
  • Final summary including recommendations that the business should undertake


# loading required libraries
library(tidyverse)
library(caret)
library(stringr)
library(e1071)
# loading data
df = read.csv("https://s3.amazonaws.com/talent-assets.datacamp.com/recipe_site_traffic_2212.csv")
head(df)

Data validation

  • Looking at the structure of the data
str(df)
  • The data set has 947 rows and 8 variables. Several columns contain null values, which need to be handled before analysis.
# checking for nulls
colSums(is.na(df))
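
One caveat when counting nulls this way: with its default settings, read.csv turns blank fields into NA only for numeric and logical columns, while blank fields in text columns are kept as empty strings. The sketch below illustrates this on a small, made-up CSV string (csv_text and its values are hypothetical, not taken from the recipe data), and shows the na.strings argument as one possible way to have blanks in text columns counted as well.

```r
# Hypothetical three-column CSV: row 1 has a blank numeric field,
# row 2 has a blank text field (values are made up for illustration)
csv_text <- "recipe,calories,high_traffic\n1,,High\n2,35.5,\n"

# Default settings: only the blank numeric field becomes NA
default_read <- read.csv(text = csv_text)

# Treating empty strings as missing: both blank fields become NA
blank_as_na <- read.csv(text = csv_text, na.strings = c("NA", ""))

# Compare the per-column null counts under the two settings
colSums(is.na(default_read))
colSums(is.na(blank_as_na))
```

If a text column reports zero nulls but is expected to have gaps, checking for empty strings (for example with sum(df$high_traffic == "")) is a quick way to confirm which case applies.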

Handling nulls and undesirable values

# dropping rows with nulls in calories; carbohydrate, sugar and protein
# are null in the same rows, so this clears those columns as well
df = df %>% 
	filter(!is.na(calories))



# adding "Low" to high_traffic if value is missing
df = df %>% 
	mutate(high_traffic = ifelse(is.na(high_traffic), "Low", high_traffic))

# Replacing 'Chicken Breast' with Chicken
df = df %>%
	mutate(category = ifelse(category=='Chicken Breast', 'Chicken', category))
	


# cleaning servings by keeping only the digits, which drops the trailing ' as a snack' text
df = df %>% 
  mutate(servings = str_extract(servings, pattern = "[0-9]+"))
# confirming no nulls exist
colSums(is.na(df))

# changing data types
df$recipe = as.numeric(df$recipe)
df$servings = as.numeric(df$servings)

# Ensuring recipe has unique values
df = df[!duplicated(df$recipe), ]
  • Data validation involved three steps:

    i. Nulls are removed.

    Nulls are found in the calories, carbohydrate, sugar, protein and high_traffic variables.

  • Nulls in calories, carbohydrate, sugar and protein occur in the same rows, so dropping the rows with nulls in one of these variables clears the others as well. These nulls are not imputed, because that would mean making several assumptions within a single row. Such assumptions are likely to affect the analysis and lead to wrong insights and conclusions.

  • The high_traffic variable has nulls due to data entry; replacing them with "Low" leaves the variable free of nulls.
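
The claim that the nutrient nulls occur concurrently can be checked directly. The following is a minimal sketch in base R on a made-up data frame (toy and its values are hypothetical stand-ins for the recipe data); the same row-wise count could be run on df before the rows are dropped.

```r
# Hypothetical toy data standing in for the recipe data (values are made up)
toy <- data.frame(
  calories     = c(100, NA, 250, NA),
  carbohydrate = c(20, NA, 35, NA),
  sugar        = c(5, NA, 12, NA),
  protein      = c(8, NA, 15, NA)
)

nutrient_cols <- c("calories", "carbohydrate", "sugar", "protein")

# Count how many of the four nutrient columns are missing in each row
missing_per_row <- rowSums(is.na(toy[, nutrient_cols]))

# If the nulls occur concurrently, every row has either 0 or 4 missing values
nulls_cooccur <- all(missing_per_row %in% c(0, length(nutrient_cols)))

table(missing_per_row)
nulls_cooccur
```

A row count other than 0 or 4 in the table would mean some rows are only partially missing, in which case filtering on a single column would not clear the others.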

    ii. Undesirable values are removed.

  • The category variable has 11 categories but 10 are expected. Replacing the value 'Chicken Breast' with 'Chicken' resolves this.

  • In the servings variable, the text 'as a snack' is appended to the digit in some rows. This text is removed because it would otherwise produce NA values (with a coercion warning) when the variable is converted to numeric. stringr functions are used to perform this operation.
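
The two fixes above can be sketched on small vectors. This example uses base R's sub() instead of stringr, purely so that it runs without any packages; the sample values in category and servings are hypothetical and only mimic the two issues described.

```r
# Hypothetical sample values mimicking the two issues described above
category <- c("Chicken Breast", "Chicken", "Pork", "Dessert")
servings <- c("4", "6 as a snack", "1", "4 as a snack")

# Collapse 'Chicken Breast' into 'Chicken'
category_clean <- ifelse(category == "Chicken Breast", "Chicken", category)

# Keep only the leading digits, dropping any trailing text,
# then convert the result to numeric
servings_clean <- as.numeric(sub("^([0-9]+).*$", "\\1", servings))

unique(category_clean)
servings_clean
```

Matching one or more digits, rather than a single digit, keeps the pattern correct if a value such as "12 as a snack" ever appears.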

    iii. Data types are changed.

  • recipe and servings are stored as integer and character respectively. Both are required to be numeric, so they are converted to that type.

str(df)
  • After cleaning, the state of each column is as follows:
    • recipe: has unique values, as per the data description
    • calories: has only numeric values, as required
    • carbohydrate: has numeric values, as required
    • sugar: has numeric values, as required
    • protein: has numeric values, as required
    • category: has a character data type and ten categories, as expected
    • servings: has numeric values, as expected
    • high_traffic: has a character data type and two values, "High" and "Low"
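
The per-column checks listed above can also be written as explicit assertions, so that they fail loudly if a later change breaks the cleaning. The following is a minimal sketch in base R; cleaned is a hypothetical data frame with made-up values and the same column names as the report, and in practice the checks would be run on df instead.

```r
# Hypothetical cleaned rows (made-up values) with the report's column names
cleaned <- data.frame(
  recipe       = c(1, 2, 3),
  calories     = c(100, 250, 75),
  carbohydrate = c(20, 35, 10),
  sugar        = c(5, 12, 2),
  protein      = c(8, 15, 4),
  category     = c("Chicken", "Pork", "Dessert"),
  servings     = c(4, 6, 1),
  high_traffic = c("High", "Low", "High"),
  stringsAsFactors = FALSE
)

nutrient_cols <- c("calories", "carbohydrate", "sugar", "protein")

# One named logical per validation rule
checks <- c(
  recipe_unique     = !any(duplicated(cleaned$recipe)),
  nutrients_numeric = all(sapply(cleaned[nutrient_cols], is.numeric)),
  servings_numeric  = is.numeric(cleaned$servings),
  category_is_text  = is.character(cleaned$category),
  traffic_values    = all(cleaned$high_traffic %in% c("High", "Low")),
  no_missing        = !anyNA(cleaned)
)

checks
stopifnot(all(checks))
```

For the full data, a further check such as length(unique(df$category)) == 10 would cover the expected number of categories.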