Regression: Bike Sharing Demand

This dataset contains the number of public bikes rented in Seoul's bike sharing system in each hour. It also includes information about the weather and the calendar, such as whether the day was a public holiday. Source of dataset.

Attribute Information:

Column | Explanation
Date | month-day
Rented Bike count | Count of bikes rented at each hour
Hour | Hour of the day
Temperature | Temperature in Celsius
Humidity | %
Windspeed | m/s
Visibility | 10 m
Dew point temperature | Celsius
Solar radiation | MJ/m2
Rainfall | mm
Snowfall | cm
Seasons | Winter, Spring, Summer, Autumn
Holiday | Holiday/No holiday
Functional Day | NoFunc (Non Functional Hours), Fun (Functional hours)

Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # built in inside Jupyter; explicit import for scripts

# Plot styling: small tick labels and thin axes
sns.set(style='ticks', palette='magma')
plt.rc('xtick', labelsize=9)
plt.rc('xtick.major', width=0.5)
plt.rc('ytick', labelsize=9)
plt.rc('ytick.major', width=0.5)
plt.rc('axes', linewidth=0.5)

# Load the data; the Date column is not used in this analysis
data = pd.read_csv("data/SeoulBikeData.csv").drop('Date', axis=1)
# Give the remaining 13 columns short, readable names
data.columns = ['Rented Bike Count', 'Hour', 'Temperature', 'Humidity',
       'Wind speed', 'Visibility', 'Dew point temperature',
       'Solar Radiation', 'Rainfall', 'Snowfall', 'Seasons',
       'Holiday', 'Functioning Day']
# Convert the names to snake_case so columns can be accessed as attributes
data.columns = data.columns.str.replace(' ', '_').str.lower()
print('Data:')
display(data)
print('Data statistics:')
display(data.describe())

Exploratory data analysis

Target variable

Let's first take a closer look at the variable of interest, Rented Bike Count:

print('Rented Bike Count: mean {:.2f}, median {}, std {:.2f}'.format(data.rented_bike_count.mean(),
                                                                     data.rented_bike_count.median(),
                                                                     data.rented_bike_count.std()))
print('.'*75)
figure, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(ax=ax[0], data=data, x='rented_bike_count', kde=True)
ax[0].set_title('Histogram')
sns.boxplot(ax=ax[1], data=data, x='rented_bike_count')
ax[1].set_title('Boxplot')
plt.show()

Most values lie in the range of 0 to 1200. The distribution is notably right skewed: a long tail of high counts pulls the mean well above the median. Values above 2400 are probably outliers, but this needs further investigation.
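The skew and the outlier threshold can also be checked numerically. Below is a minimal sketch, assuming the usual 1.5 × IQR rule (the default whisker length in seaborn's boxplot); the helper name `skew_and_upper_fence` and the sample values are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def skew_and_upper_fence(values: pd.Series, k: float = 1.5):
    """Return (sample skewness, upper Tukey fence) for a numeric series."""
    q1, q3 = values.quantile([0.25, 0.75])
    upper_fence = q3 + k * (q3 - q1)  # points above this are drawn as outliers
    return values.skew(), upper_fence

# Made-up sample, only to show the call; with the notebook's data
# you would pass data.rented_bike_count instead.
sample = pd.Series([0, 100, 200, 300, 400, 500, 3000])
skewness, fence = skew_and_upper_fence(sample)
print(f'skewness {skewness:.2f}, upper fence {fence:.0f}')
```

A positive skewness means a longer right tail, and the upper fence gives a data-driven value to compare with the threshold read off the boxplot.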

Target vs independent variables

Now let's see how the other variables in the dataset are distributed and how they relate to Rented Bike Count.

Hour

We start with Hour, the hour of the day in which the bikes were rented:

figure, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(ax=ax[0], data=data, x='hour', y='rented_bike_count')
ax[0].set_title('Histogram')
sns.boxplot(ax=ax[1], data=data, x='hour', y='rented_bike_count', palette='magma')
ax[1].set_title('Boxplot')
plt.show()

There is a strong (and not really surprising) non-linear relationship between Hour and Rented Bike Count. Demand rises from 6 am to 6 pm.
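To put numbers on the hourly pattern, the target can be aggregated per hour. This is a small sketch rather than part of the original analysis: the helper `hourly_profile` and the toy frame are hypothetical, and the column names assume the snake_case renaming done when the data was loaded.

```python
import pandas as pd

def hourly_profile(df, hour_col='hour', target_col='rented_bike_count'):
    """Median and mean of the target for each hour of the day."""
    return df.groupby(hour_col)[target_col].agg(['median', 'mean'])

# Toy data, only to demonstrate the call; use hourly_profile(data)
# with the real dataset.
toy = pd.DataFrame({'hour': [6, 6, 12, 12, 18, 18],
                    'rented_bike_count': [100, 200, 400, 600, 900, 1100]})
profile = hourly_profile(toy)
print(profile)
print('Peak hour by median:', profile['median'].idxmax())
```

The median is reported next to the mean because the target is skewed, so the two can differ noticeably within an hour.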