Course notes: Cleaning Data in R

Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! The datasets used in this course are available in the datasets folder.

# Import any packages you want to use here
library(tidyverse)

Take Notes

# Load the bike share rides data and inspect its structure
bike_share_rides <- read_rds("datasets/bike_share_rides_ch1_1.rds")
str(bike_share_rides)

Clean the `duration` field

# Strip the " minutes" suffix and convert the remaining text to a number
bike_share_rides <- bike_share_rides %>% 
	mutate(duration_minutes = as.numeric(str_remove(string = duration, pattern = " minutes")))

class(bike_share_rides$duration_minutes)

head(bike_share_rides$duration_minutes)
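
As a quick check on the conversion above, here is a minimal sketch that applies the same `str_remove()` plus `as.numeric()` pattern to a small made-up vector and counts the values that failed to convert. The sample values (including the `"unknown"` entry) are assumptions for illustration and are not taken from `bike_share_rides`.

```r
library(stringr)

# Hypothetical sample values, not taken from the course dataset
duration <- c("12 minutes", "24 minutes", "8 minutes", "unknown")

# Strip the unit suffix, then convert; text that is not a number becomes NA
# (as.numeric() would normally warn about this, so the warning is silenced here)
duration_minutes <- suppressWarnings(as.numeric(str_remove(duration, " minutes")))

duration_minutes

# Number of entries that could not be converted
n_failed <- sum(is.na(duration_minutes))
n_failed
```

Running the same `sum(is.na(...))` check on `bike_share_rides$duration_minutes` would show whether any rows did not follow the "<number> minutes" pattern.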

What are the most popular stations for departure and arrival?

We want to find the most frequent departure stations, the most frequent arrival stations, and whether any station ranks near the top of both lists.

# departure stations: number of rides and mean duration per station
starts <- bike_share_rides %>% 
	group_by(station_A_id, station_A_name) %>% 
	summarize(ride_starts = n(), avg_duration = round(mean(duration_minutes), 2), .groups = "drop") %>% 
	arrange(desc(ride_starts))
# most popular departure stations
starts

options(repr.plot.width = 10)
# histogram of ride counts for departure stations
starts %>% ggplot(aes(x = ride_starts)) + 
	geom_histogram(bins = 5, color = "darkgreen", fill = "green")

# arrival stations: number of rides and mean duration per station
ends <- bike_share_rides %>% 
	group_by(station_B_id, station_B_name) %>% 
	summarize(ride_ends = n(), avg_duration = round(mean(duration_minutes), 2), .groups = "drop") %>% 
	arrange(desc(ride_ends))
# most popular arrival stations
head(ends, 10)

# histogram of ride counts for arrival stations, with the count printed inside each bar
ends %>% ggplot(aes(x = ride_ends)) + 
	geom_histogram(bins = 7, color = "red3", fill = "salmon") + 
	stat_bin(bins = 7, geom = "text", aes(label = after_stat(count)), position = position_stack(vjust = 0.5))

Are the most popular departure stations also the most popular arrival stations?

# join arrival and departure summaries on station id to compare the counts side by side
ends %>% 
	inner_join(starts, by = c("station_B_id" = "station_A_id"), suffix = c("_ends", "_starts"))
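
The join above matches every station that appears in both summaries, so it does not by itself restrict the comparison to the most popular ones. One possible way to narrow it down, sketched here on made-up counts, is to keep only the top few rows of each table before joining. The toy tibbles, the cutoff `top_n_stations`, and the names `top_starts`, `top_ends`, and `both_top` are assumptions for illustration, not part of the course material.

```r
library(dplyr)

# Hypothetical ride counts per station, for illustration only
starts <- tibble(station_A_id = c(1, 2, 3, 4), ride_starts = c(50, 40, 30, 20))
ends   <- tibble(station_B_id = c(3, 1, 5, 2), ride_ends   = c(60, 45, 35, 10))

# Assumed cutoff for what counts as "most popular"
top_n_stations <- 3

# Keep the stations with the highest counts in each table
top_starts <- starts %>% slice_max(ride_starts, n = top_n_stations)
top_ends   <- ends %>% slice_max(ride_ends, n = top_n_stations)

# Stations that appear in both top lists
both_top <- top_ends %>%
	inner_join(top_starts, by = c("station_B_id" = "station_A_id"))

both_top
```

If the real `starts` and `ends` tables are still grouped after `summarize()`, calling `ungroup()` on them first would be needed, since `slice_max()` otherwise picks rows within each group rather than across the whole table.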
