Datacamp Data Scientist Associate R Exam

Data Scientist Associate Practical Exam Submission

Use this template to complete your analysis and write up your summary for submission.

# tidyverse already attaches readr, dplyr, tidyr and ggplot2
library(tidyverse)
data <- read_csv("electric_bike_ratings_2212.csv")
str(data)
head(data)

Task 1

owned: no
make_model: yes
review_month: no
web_browser: no, 150
reviewer_age: no, 105
primary_use: yes
value_for_money: no
overall_rating: yes

data$review_month <- gsub("[0-9-]","",data$review_month)
data$web_browser<-replace_na(data$web_browser,"unknown")
data$reviewer_age <- replace_na(data$reviewer_age,round(mean(data$reviewer_age, na.rm= TRUE)))
data$value_for_money <- data$value_for_money %>% gsub("/10","",.) %>% as.integer(.)
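The cleaning steps above can be checked by counting missing values before and after. The following is a minimal sketch on a made-up data frame (the `toy` values are assumptions for illustration, not rows from the real file), using base R only:

```r
# Toy data standing in for the real file; the column names follow the
# cleaning steps above, the values are invented for illustration.
toy <- data.frame(
  review_month    = c("Oct-22", "11-Jan", NA),
  web_browser     = c("Chrome", NA, "Firefox"),
  reviewer_age    = c(23, NA, 41),
  value_for_money = c("5/10", "7/10", "10/10"),
  stringsAsFactors = FALSE
)

# Missing values per column before cleaning
missing_before <- colSums(is.na(toy))

# The same cleaning steps, written with base R
toy$review_month <- gsub("[0-9-]", "", toy$review_month)
toy$web_browser[is.na(toy$web_browser)] <- "unknown"
toy$reviewer_age[is.na(toy$reviewer_age)] <-
  round(mean(toy$reviewer_age, na.rm = TRUE))
toy$value_for_money <- as.integer(gsub("/10", "", toy$value_for_money))

# Missing values per column after cleaning
missing_after <- colSums(is.na(toy))
print(missing_before)
print(missing_after)
```

Note that `gsub()` leaves `NA` untouched, so a missing `review_month` would still be missing after this step.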

Task 2

The category with the most observations is reviewers who own the moped. The observations are not balanced across the two categories.

# owned is a 0/1 category, so count it with a bar chart rather than a histogram
ggplot(data, aes(x = as.factor(owned))) +
  geom_bar()

table(data$owned)
t.test(data$overall_rating[data$owned == 1],data$overall_rating[data$owned == 0])
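The imbalance can also be stated as shares rather than raw counts. A small sketch, assuming a made-up `owned` vector (the real proportions are not reproduced here):

```r
# Made-up 0/1 labels for illustration only
owned <- c(1, 1, 1, 0, 1, 0, 1, 1)

counts <- table(owned)        # observations per category
shares <- prop.table(counts)  # the same counts as proportions summing to 1

print(counts)
print(round(shares, 2))
```

On the real data, `prop.table(table(data$owned))` would give the corresponding shares.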

Task 3

There are three clusters, with means approximately at 13, 16, and 19. The largest cluster of reviews centers around 19.

ggplot(data,aes(x=overall_rating))+
  geom_histogram(bins = 100)

Task 4

For reviewers who do not own a moped, the ratings cluster around 12, 15, and 19, and the three clusters have similar ranges. For reviewers who own a moped, the ratings cluster around 16 and 20, with most ratings in the cluster around 20.

ggplot(data,aes(x=overall_rating, fill = as.factor(owned)))+
  geom_histogram(position = "dodge",bins = 50)
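A per-group summary can back up what is read off the histogram. The sketch below uses invented ratings (the `toy` values are assumptions, not the exam data) and computes the median rating for each value of `owned`:

```r
# Invented ratings for illustration only
toy <- data.frame(
  owned          = c(0, 0, 0, 1, 1, 1),
  overall_rating = c(12, 15, 19, 16, 20, 20)
)

# Median overall_rating within each ownership group
by_owned <- aggregate(overall_rating ~ owned, data = toy, FUN = median)
print(by_owned)
```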

Task 5

Classification: the target, owned, takes only two values (0 or 1).

# Create the train set and the test set (0.75/0.25).
set.seed(42)  # arbitrary seed so the split is reproducible
index <- sample(nrow(data), round(0.75 * nrow(data)))
data_train <- data[index, ]
data_test <- data[-index, ]
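Because the classes are not balanced, one option is to draw the 75% sample within each class so that train and test keep the same class shares. This is a sketch of that idea on made-up data, not the split used above; the seed and the `toy` frame are assumptions:

```r
set.seed(42)  # arbitrary seed, used here only for reproducibility

# Made-up data: 80 owners and 20 non-owners
toy <- data.frame(id = 1:100, owned = rep(c(1, 0), times = c(80, 20)))

# Row indices grouped by class, then 75% sampled within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$owned)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), round(0.75 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(table(toy_train$owned))
print(table(toy_test$owned))
```

Indexing with `sample.int()` avoids the surprise that `sample(x, n)` treats a length-one numeric `x` as the range `1:x`.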

Task 6

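The baseline model is a logistic regression fitted on the training set, with owned as the target and all other columns as predictors. Predicted probabilities above 0.50 on the test set are labelled as owners, and the model is scored by a confusion matrix and accuracy.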

model_base <- glm(owned ~., data = data_train, family = "binomial")
prob <- predict(model_base, data_test, type = "response")
pred <- ifelse(prob > 0.50, 1, 0)
table(data_test$owned, pred)
mean(pred==data_test$owned)
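With unbalanced classes, accuracy alone can look good even when one class is predicted poorly, so precision and recall are a useful complement. A minimal sketch on made-up labels (the `actual` and `pred` vectors are assumptions, not model output):

```r
# Made-up true labels and predictions for illustration only
actual <- c(1, 1, 1, 1, 0, 0, 1, 0)
pred   <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Fixing the factor levels keeps the table 2 x 2 even if a class is never predicted
cm <- table(actual = factor(actual, levels = c(0, 1)),
            pred   = factor(pred,   levels = c(0, 1)))

tp <- cm["1", "1"]  # owners predicted as owners
fp <- cm["0", "1"]  # non-owners predicted as owners
fn <- cm["1", "0"]  # owners predicted as non-owners

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

print(cm)
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

The same calculation applies to `table(data_test$owned, pred)` once both arguments are converted to factors with levels 0 and 1.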

Task 7
