Data Scientist Associate Practical Exam Submission
Use this template to complete your analysis and write up your summary for submission.
Task 1
a)
- owned is ok
- make_model is not nominal but character
- review_month is not clean: it contains days in several cases and is character
- web_browser has 150 NAs and is character
- reviewer_age contains over 100 "-" values and is character
- primary_use is clean
- value_for_money is not integer and is written as a fraction, e.g. "5/10"
- overall_rating is clean
b)
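- Missing values (NA) occur only in web_browser (150). The "-" entries in reviewer_age are not stored as NA, but they become NA once the column is converted to numeric.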
c)
- I used a mutate() call to manipulate the data.
- To change all character columns into nominal values I used "as.factor(.)"
- For review_month, I used str_sub(review_month, -3, -1), which keeps only the last three characters and so drops the day
- For web_browser I used ifelse(is.na(web_browser), "unknown", web_browser)
- For reviewer_age I used ifelse(is.na(as.numeric(reviewer_age)), mean(as.numeric(reviewer_age), na.rm = T), as.numeric(reviewer_age)), which replaces the "-" values with the mean age
- For value_for_money I used str_sub(value_for_money, 1, -4), which drops the trailing "/10"
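The cleaning steps listed above can be sketched on a small toy data frame. This is only an illustration: the values are made up, the column formats (a day prefix such as "12-Jan", a "-" placeholder for age, a "/10" suffix) are assumptions based on the notes above, and base R `substr()` stands in for `str_sub()` so the sketch runs without extra packages.

```r
# Toy data with made-up values; the real column formats are assumed
toy <- data.frame(
  review_month    = c("12-Jan", "Feb", "03-Mar"),
  web_browser     = c("Chrome", NA, "Firefox"),
  reviewer_age    = c("30", "-", "40"),
  value_for_money = c("5/10", "10/10", "7/10"),
  stringsAsFactors = FALSE
)

# Keep the last three characters, like str_sub(x, -3, -1)
n_chars <- nchar(toy$review_month)
month <- substr(toy$review_month, n_chars - 2, n_chars)

# Replace missing browsers with an explicit "unknown" category
browser <- ifelse(is.na(toy$web_browser), "unknown", toy$web_browser)

# "-" cannot be parsed as a number, so it becomes NA (warning suppressed),
# and each NA is then replaced by the mean of the observed ages
age_num <- suppressWarnings(as.numeric(toy$reviewer_age))
age <- ifelse(is.na(age_num), mean(age_num, na.rm = TRUE), age_num)

# Drop the last three characters ("/10"), like str_sub(x, 1, -4)
vfm <- substr(toy$value_for_money, 1, nchar(toy$value_for_money) - 3)

cleaned <- data.frame(
  review_month    = as.factor(month),
  web_browser     = as.factor(browser),
  reviewer_age    = as.integer(age),
  value_for_money = as.integer(vfm)
)
print(cleaned)
```

Computing the mean once, before the `ifelse()`, avoids converting the column to numeric three times as in the one-line version.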
Task 2
a) I can see that owned = 1 for over 800 cases and owned = 0 for only around 600 cases (exact numbers: 890 and 1,500 - 890 = 610).
b) With 1,500 rows (observations) in the dataset, I would say that a roughly 60-40 split can still be regarded as balanced.
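The class shares behind the "roughly 60-40" statement can be checked with a few lines of arithmetic. The sketch below uses only the counts reported above (890 owned out of 1,500); the variable names are illustrative. The share of the larger class is also the accuracy that a model always predicting "owned" would reach, which gives a simple reference point for any later model.

```r
# Counts taken from the answer above
n_total <- 1500
n_owned <- 890
n_not_owned <- n_total - n_owned

# Share of each class
shares <- c(owned = n_owned, not_owned = n_not_owned) / n_total
print(round(shares, 3))

# Accuracy of always predicting the larger class
majority_baseline <- max(shares)
print(round(majority_baseline, 3))
```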
Task 3
As we can see, the distribution is multimodal: there are three peaks at around 12, 15 and 19. The highest peak is the last one, at around 19. This means that the overall ratings fall into three clusters, and most of the ratings (the highest peak) are at around 19, so rather high.
Task 4
The owned bikes (= 1) have some points (= observations) at ratings over 21, whereas the non-owned bikes do not. This suggests that reviewers who do not own the bike rarely rate it above 21. Both groups show a gap between 13 and 14, where neither group has any ratings. The plot starts at an overall rating of 12, so no one seems to be less happy with the bikes than that.
Task 5
This is a (binary) classification problem, as we want to predict into one of two groups. This could be done with a logistic regression for example.
Task 6
Not all packages are required for this step, but I used them in my own R environment to do the tasks and modify the data up until this point. Required for now is just tidyverse and caret.
library(tidyverse)
library(ggplot2)
#library(naniar)
library(stringr)
library(caret)
#Importing and cleaning the data
#data <- read.csv("filepath/filename.csv")
#checking for NAs
#vis_miss(data)
#data_fixed <- data%>%
# mutate(
# owned = as.factor(owned),
# make_model = as.factor(make_model),
# review_month = as.factor(str_sub(review_month, -3 , -1)),
# web_browser = as.factor(ifelse(is.na(web_browser), "unknown", web_browser)),
# reviewer_age = as.integer(ifelse(is.na(as.numeric(reviewer_age)), mean(as.numeric(reviewer_age), na.rm = T), as.numeric(reviewer_age))),
# primary_use = as.factor(primary_use),
# value_for_money = as.integer(str_sub(value_for_money, 1, -4))
# )
#The actual model:
##creating a train-test-split
data_tree <- data_fixed %>%
  mutate(train_index = sample(c("train", "test"), nrow(data_fixed), replace = T, prob = c(.8, .2)))
data_test <- data_tree %>% filter(train_index == "test")
data_train <- data_tree %>% filter(train_index == "train")
##The logistic regression
logisticmodel1 <- glm(data = data_train, owned~make_model+web_browser+as.numeric(reviewer_age)+primary_use+as.numeric(value_for_money)+overall_rating, family = "binomial")
##Making predictions
data_test$pred_log <- predict(logisticmodel1, newdata = data_test, type = c("response"))
data_test <- data_test %>%
  mutate(pred_log = as.factor(round(pred_log)))
#Getting the confusion matrix for evaluation later
logCM <- confusionMatrix(data = data_test$pred_log, reference = data_test$owned)
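The step from predicted probabilities to class labels can be illustrated separately. The sketch below uses made-up probabilities and labels, not the model output. It replaces `round()` with an explicit cut-off, because R rounds halves to the nearest even number (so `round(0.5)` is 0), and it fixes the factor levels so that both classes stay present even if only one of them is predicted. The 0.5 threshold is an assumption that mirrors the rounding used above.

```r
# Made-up predicted probabilities and true labels, for illustration only
prob  <- c(0.91, 0.45, 0.50, 0.12, 0.73)
truth <- factor(c(1, 0, 1, 0, 0), levels = c(0, 1))

# Explicit cut-off: a probability of exactly 0.5 is counted as class 1 here,
# whereas round(0.5) would give 0
threshold <- 0.5
pred <- factor(as.integer(prob >= threshold), levels = c(0, 1))

# Cross-tabulate predictions against the true labels
cm <- table(predicted = pred, actual = truth)
print(cm)
```

Keeping the threshold in a named variable also makes it easy to try other cut-offs if one kind of error matters more than the other.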
Task 7
library(rpart)
## Setting up the tree
tree <- rpart(owned~make_model+web_browser+as.numeric(reviewer_age)+primary_use+as.numeric(value_for_money)+overall_rating, data = data_train, cp = 0.01)
## Making predictions
data_test$pred_tree <- predict(tree, newdata = data_test, type = c("class"))
## Getting the confusion matrix for evaluation later
treeCM <- confusionMatrix(data = data_test$pred_tree, reference = data_test$owned)
Task 8
I chose logistic regression for the first model because this is a two-class classification problem, which logistic regression handles directly. It is also the more econometric way to approach the question. My comparison model, a classification tree, comes from the machine learning side (supervised learning), so this seemed like a good opportunity to compare the two approaches.
Task 9
The comparison is done by looking at both confusion matrices and seeing which model has the better scores.
logCM
treeCM
Task 10
To compare my models, I used the confusion matrices to check how many observations were classified correctly. For the logistic regression, the predicted probabilities first have to be converted into class predictions by rounding to 0 or 1. The results show that the accuracy of the logistic model is slightly better, at .77 versus .75 for the classification tree. The confidence interval and other measures such as sensitivity and specificity also show a slight advantage for the logistic regression. One thing to consider is that improving the tree further, or using a random forest or a boosted tree instead, might change the preferred model.
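As a reminder of what the reported measures mean, the sketch below computes accuracy, sensitivity and specificity by hand from a 2x2 table. The labels and predictions are made up for illustration and are not the results of either model. It treats "1" (owned) as the positive class, which is an assumption: caret's confusionMatrix() uses the first factor level as the positive class unless the `positive` argument is set, so with levels "0" and "1" its sensitivity refers to class "0".

```r
# Made-up labels and predictions, for illustration only
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1), levels = c(0, 1))

# Rows are predictions, columns are the true classes
cm <- table(predicted, actual)

# Share of all observations that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Treating "1" (owned) as the positive class:
# sensitivity = correctly predicted owners / all actual owners
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
# specificity = correctly predicted non-owners / all actual non-owners
specificity <- cm["0", "0"] / sum(cm[, "0"])

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```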