Datacamp Data Scientist Associate R Exam
Data Scientist Associate Practical Exam Submission
Use this template to complete your analysis and write up your summary for submission.
library(tidyverse)  # loads readr, dplyr, tidyr and ggplot2
data <- read_csv("electric_bike_ratings_2212.csv")
str(data)
head(data)
Task 1
owned: no
make_model: yes
review_month: no
web_browser: no, 150
reviewer_age: no, 105
primary_use: yes
value_for_money: no
overall_rating: yes
data$review_month <- gsub("[0-9-]","",data$review_month)
data$web_browser<-replace_na(data$web_browser,"unknown")
data$reviewer_age <- replace_na(data$reviewer_age,round(mean(data$reviewer_age, na.rm= TRUE)))
data$value_for_money <- data$value_for_money %>% gsub("/10","",.) %>% as.integer(.)
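The cleaning steps above can be checked on a few toy values. This is a minimal sketch in base R; the sample strings and ages are assumptions made up for illustration, not values taken from the file.

```r
# Toy values standing in for the real columns (assumed formats)
review_month    <- c("12-Jan", "Feb", "3-Mar")
value_for_money <- c("5/10", "10/10", "7/10")
reviewer_age    <- c(25, NA, 35)

# Strip digits and hyphens so only the month abbreviation remains
month_clean <- gsub("[0-9-]", "", review_month)

# Drop the "/10" suffix and convert the score to an integer
value_clean <- as.integer(gsub("/10", "", value_for_money, fixed = TRUE))

# Replace missing ages with the rounded mean of the observed ages
age_mean  <- round(mean(reviewer_age, na.rm = TRUE))
age_clean <- ifelse(is.na(reviewer_age), age_mean, reviewer_age)

print(month_clean)
print(value_clean)
print(age_clean)
```

Running the same checks with `sum(is.na(...))` on the real columns after cleaning is a quick way to confirm that no missing values remain.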
Task 2
Reviewers who own the moped form the larger category. The observations are not balanced across the two categories of owned.
# owned is a 0/1 indicator, so plot the category counts with a bar chart
ggplot(data, aes(x = as.factor(owned))) +
  geom_bar()
table(data$owned)
t.test(data$overall_rating[data$owned == 1],data$overall_rating[data$owned == 0])
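The balance question can also be answered numerically with class shares rather than raw counts. A minimal sketch, assuming a toy 0/1 vector in place of data$owned:

```r
# Toy labels (assumed for illustration): 1 = owner, 0 = non-owner
owned <- c(1, 1, 1, 0, 1, 0, 1, 1)

# Counts per category, then each category's share of all reviews
counts <- table(owned)
shares <- prop.table(counts)

print(counts)
print(round(shares, 2))
```

Shares far from 0.5 indicate an unbalanced variable, which is worth keeping in mind when reading accuracy figures later.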
Task 3
There are three clusters, with means approximately at 13, 16, and 19. The largest cluster of reviews centers around 19.
ggplot(data,aes(x=overall_rating))+
geom_histogram(bins = 100)
Task 4
For reviewers who do not own a moped, ratings cluster around 12, 15, and 19, and the three clusters have similar ranges. For reviewers who own a moped, ratings cluster around 16 and 20, with most ratings in the cluster around 20.
ggplot(data,aes(x=overall_rating, fill = as.factor(owned)))+
geom_histogram(position = "dodge",bins = 50)
Task 5
This is a classification problem: the target, owned, takes two values (owner or non-owner).
# Create the train set and the test set (0.75/0.25).
set.seed(123)  # arbitrary seed so the split is reproducible
index <- sample(nrow(data), round(0.75*nrow(data)))
data_train <- data[index,]
data_test <- data[-index,]
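Because owned is unbalanced, a stratified split is one possible alternative to the simple random split used here. The sketch below is not the method used in this submission; the toy data frame, the seed, and the names toy, train_idx, toy_train and toy_test are all assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Toy data (assumed): 12 owners and 8 non-owners
toy <- data.frame(id = 1:20, owned = rep(c(1, 0), times = c(12, 8)))

# Sample 75% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$owned)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), round(0.75 * length(rows)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class shares in the training set match the full data
print(prop.table(table(toy$owned)))
print(prop.table(table(toy_train$owned)))
```

Sampling within each class keeps the owner share the same in both sets, which a plain call to sample() only achieves on average.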
Task 6
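The baseline model is a logistic regression that predicts owned from all other variables. Reviews with a predicted probability above 0.50 are classified as owners, and accuracy is measured on the test set.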
model_base <- glm(owned ~., data = data_train, family = "binomial")
prob <- predict(model_base, data_test, type = "response")
pred <- ifelse(prob > 0.50, 1, 0)
table(data_test$owned, pred)
mean(pred==data_test$owned)
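With unbalanced classes, accuracy alone can look good even when one class is predicted poorly, so precision and recall are a useful complement. A minimal sketch in base R; the actual and pred vectors are toy values assumed for illustration, not model output.

```r
# Toy labels and predictions (assumed): 1 = owner, 0 = non-owner
actual <- c(1, 1, 1, 1, 0, 0, 1, 0)
pred   <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Fixing the factor levels keeps the table 2x2 even if one class is never predicted
cm <- table(actual = factor(actual, levels = c(0, 1)),
            pred   = factor(pred,   levels = c(0, 1)))

tp <- cm["1", "1"]  # owners predicted as owners
fp <- cm["0", "1"]  # non-owners predicted as owners
fn <- cm["1", "0"]  # owners predicted as non-owners

accuracy  <- mean(pred == actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

print(cm)
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

The same three lines for tp, fp and fn apply to the confusion table built from data_test$owned and pred, provided both are converted to factors with levels 0 and 1.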
Task 7