Linear Regression Analysis on Bike Sharing Demand in Seoul.

Introduction

Hello, everyone. My name is Ariel. Thank you for taking the time to look at my linear regression analysis. In this analysis, I explore the effect of temperature on bike rentals in Seoul during the winter season. To get started, the code below reads the Seoul Bike Sharing Demand data, hosted by the University of California, Irvine (UCI) Machine Learning Repository, from a local CSV file. The dataset is then filtered to keep only observations from the winter season. The filtered dataset has 2,160 rows and 14 columns, with each row recording one hour of rentals.


# Load the tidyverse (readr, dplyr, ggplot2, tibble)
library(tidyverse)

# Read the local copy of the data, keep the winter rows, and inspect the result
Bike_data <- readr::read_csv('data/SeoulBikeData.csv')

Winter_Bike_Count <- Bike_data %>%
  filter(Seasons == "Winter")

glimpse(Winter_Bike_Count)

Density Plot

The plot below shows the distribution of bike rentals during the winter. Because the raw counts are skewed, the x-axis uses a log (base 10) scale. The median is 203 rentals per hour (each row of the data covers one hour). During the winter season, the largest number of bikes rented in a single hour in Seoul was 937, and the smallest was 3. The standard deviation (SD) of the rental counts was 150, which is large relative to the median, so the number of bikes rented fluctuates greatly from hour to hour. The sections that follow explore how temperature affects bike rentals in Seoul.

ggplot(Winter_Bike_Count, aes(`Rented Bike Count`)) +
  geom_density() +  # geom_density() has no bins argument; use adjust to tune smoothing
  scale_x_log10() +
  theme_classic() +
  labs(
    title = "Winter Bike Rentals in Seoul",
    caption = "(https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand)"
  )
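
The summary figures quoted above (median, minimum, maximum, and SD) can be computed with summarise(). The sketch below is only an illustration: toy_rentals is a small made-up tibble standing in for Winter_Bike_Count, with hypothetical counts chosen so that the median, minimum, and maximum match the numbers in the text.

```r
library(tidyverse)

# Hypothetical stand-in for Winter_Bike_Count; these counts are made up
toy_rentals <- tibble(`Rented Bike Count` = c(3, 120, 203, 310, 937))

# One-row summary of the rental counts
rental_summary <- toy_rentals %>%
  summarise(
    median_rentals = median(`Rented Bike Count`),
    min_rentals    = min(`Rented Bike Count`),
    max_rentals    = max(`Rented Bike Count`),
    sd_rentals     = sd(`Rented Bike Count`)
  )

rental_summary
```

Replacing toy_rentals with Winter_Bike_Count should give the figures for the real winter data.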

Linear Model

The code chunk below fits a linear regression of the number of bikes rented per hour on temperature during the winter season in Seoul. The intercept was 252.3, and the coefficient for temperature was 10.5. At a significance level of α = .05, temperature is a significant predictor of the number of bike rentals. Based on the model, the equation for the line of best fit is Rented Bike Count = 252.3 + 10.5 * Temperature(°C). In other words, to predict the number of bike rentals in a given hour, multiply that hour's temperature by 10.5 and add 252.3.

lin_mod <- lm(`Rented Bike Count` ~ `Temperature(°C)`, data = Winter_Bike_Count)
summary(lin_mod)
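
The intercept, slope, and p-value discussed above can also be pulled out of the model programmatically instead of being read off the summary() printout. The sketch below is a minimal illustration on simulated data, not the Seoul dataset: toy, toy_mod, and the numbers used to generate the data are all assumptions made for the example.

```r
# Simulated stand-in data; NOT the Seoul dataset
set.seed(1)
toy <- data.frame(temp = runif(500, min = -15, max = 10))
toy$rentals <- 250 + 10 * toy$temp + rnorm(500, sd = 140)

toy_mod <- lm(rentals ~ temp, data = toy)

# Coefficient table: one row per term, with columns for the
# estimate, standard error, t value, and p-value
coef_table <- coef(summary(toy_mod))

slope_estimate <- coef_table["temp", "Estimate"]
slope_p_value  <- coef_table["temp", "Pr(>|t|)"]

# Temperature counts as significant when its p-value is below alpha
alpha <- 0.05
slope_p_value < alpha
```

For lin_mod, the temperature row is probably named with its backticks included, so it is worth checking rownames(coef_table) before indexing.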

Scatter Plot

This image is a scatter plot of hourly bike rentals in Seoul during the winter, with the linear fit drawn as a blue line and the model's predictions overlaid as blue points. Generally speaking, as temperature increases, the number of bike rentals also increases. To create the predictions, a tibble was built with temperatures ranging from -15 °C to 10 °C. The prediction data frame was then created from it and mapped onto the plot. At -15 °C, the model predicts about 94 rentals per hour. At 10 °C, it predicts about 358.

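# Temperatures to predict for, one row per degree from -15 °C to 10 °C
explanatory_data <- tibble(`Temperature(°C)` = -15:10)

# Add the model's predicted rentals for each temperature
prediction_data <- explanatory_data %>%
  mutate(`Rented Bike Count` = predict(lin_mod, explanatory_data))

# Observed rentals, the fitted line, and the predictions (blue points)
ggplot(Winter_Bike_Count, aes(`Temperature(°C)`, `Rented Bike Count`)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = prediction_data, color = "blue", size = 2.5) +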
  theme_classic() +
  labs(
    title = "Seoul winter bike rentals",
    subtitle = "Bike rentals = 252.3 + 10.5 * Temperature(°C)",
    caption = "(https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand)",
    x = "Temperature(°C)",
    y = "Bike Rentals"
  )
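
As a quick check, the two endpoint predictions can be reproduced by hand from the rounded coefficients. This is a sketch, not the code behind the reported numbers: predict_rentals is a hypothetical helper, and because 252.3 and 10.5 are rounded, its results (about 94.8 and 357.3) differ slightly from the 94 and 358 quoted above, which presumably come from predict() on the unrounded model.

```r
# Rounded coefficients as reported in the Linear Model section
intercept <- 252.3
slope <- 10.5

# Hypothetical helper: the line of best fit evaluated at a temperature in °C
predict_rentals <- function(temp_c) {
  intercept + slope * temp_c
}

# Predicted rentals at the two ends of the -15 °C to 10 °C range
endpoint_predictions <- predict_rentals(c(-15, 10))
round(endpoint_predictions, 1)
```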

Residuals

The plot below shows the residuals, the differences between the observed rental counts and the values the model predicts. Based on the linear model, the residuals have a median of -13.82, a first quartile of -86.37, and a third quartile of 61.86. Half of the residuals fall outside that range, so half of the predictions miss by more than about 62 rentals. When I see these figures, I believe the model above does a poor job of predicting the number of bikes rented during the winter. The residual standard error supports this assessment. The smaller this number, the more accurate the model, and here it is 138.9, only slightly below the standard deviation of 150 reported for the rental counts themselves. Therefore, this model does a poor job of predicting the number of bike rentals.
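
One way a residual plot of this kind could be produced is sketched below. It is an illustration under stated assumptions: the data are simulated (toy, toy_mod, and the generating numbers are made up), so the picture will not match the real residuals. The same steps should apply if toy_mod is swapped for lin_mod.

```r
library(tidyverse)

# Simulated stand-in data; NOT the Seoul dataset
set.seed(42)
toy <- tibble(
  temp    = runif(300, min = -15, max = 10),
  rentals = 250 + 10 * temp + rnorm(300, sd = 140)
)
toy_mod <- lm(rentals ~ temp, data = toy)

# Attach each observation's fitted value and residual (observed minus fitted)
toy_resid <- toy %>%
  mutate(
    fitted_value = fitted(toy_mod),
    residual     = residuals(toy_mod)
  )

# Residuals against fitted values, with a dashed reference line at zero
ggplot(toy_resid, aes(fitted_value, residual)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_classic() +
  labs(x = "Fitted rentals", y = "Residual")

# Residual quartiles and residual standard error, as reported by summary()
quantile(toy_resid$residual)
sigma(toy_mod)
```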

The image also tells a more detailed story. The positive residuals are randomly dispersed, which suggests the model performs reasonably well when predicting larger values. However, the model begins to weaken when trying to predict the number of bikes rented in slower periods. Toward the bottom of the chart, the values are more compact and reveal a trend. We know that temperature is a significant predictor of bike rentals in Seoul. However, using only the temperature variable is not the best method to predict the number of bike rentals in the capital. An alternative would be to use more variables in a multiple linear regression. Based on this analysis, it is clear that more than one variable affects how many bikes people rent.
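
The multiple linear regression suggested above could look something like the following sketch. Everything here is an assumption made for illustration: the data are simulated, hour is a hypothetical second predictor treated as a plain number, and the effect sizes are invented. The point is only to show how a second variable is added to the formula and how the two models can be compared.

```r
library(tidyverse)

# Simulated stand-in data; NOT the Seoul dataset
set.seed(7)
n <- 500
toy <- tibble(
  temp    = runif(n, min = -15, max = 10),
  hour    = sample(0:23, n, replace = TRUE),
  rentals = 150 + 10 * temp + 8 * hour + rnorm(n, sd = 60)
)

# Temperature only, then temperature plus the second predictor
simple_mod <- lm(rentals ~ temp, data = toy)
multi_mod  <- lm(rentals ~ temp + hour, data = toy)

# A smaller residual standard error indicates a closer fit
c(simple = sigma(simple_mod), multiple = sigma(multi_mod))
```

With the real data, the backticked column names would go in the formula in the same way as in lin_mod.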