Regression: Bike Sharing Demand
This dataset consists of the number of public bikes rented in Seoul's bike sharing system at each hour. It also includes information about the weather and the time, such as whether it was a public holiday. Source of dataset.
Attribute Information:
Column | Explanation |
---|---|
Date | month-day |
Rented Bike count | Count of bikes rented at each hour |
Hour | Hour of the day |
Temperature | Temperature in Celsius |
Humidity | % |
Windspeed | m/s |
Visibility | 10m |
Dew point temperature | Celsius |
Solar radiation | MJ/m2 |
Rainfall | mm |
Snowfall | cm |
Seasons | Winter, Spring, Summer, Autumn |
Holiday | Holiday/No holiday |
Functional Day | NoFunc(Non Functional Hours), Fun(Functional hours) |
Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks', palette='magma')
plt.rc('xtick', labelsize=9)
plt.rc('xtick.major', width=0.5)
plt.rc('ytick', labelsize=9)
plt.rc('ytick.major', width=0.5)
plt.rc('axes', linewidth=0.5)
data = pd.read_csv("data/SeoulBikeData.csv").drop('Date', axis=1)
data.columns = ['Rented Bike Count', 'Hour', 'Temperature', 'Humidity',
                'Wind speed', 'Visibility', 'Dew point temperature',
                'Solar Radiation', 'Rainfall', 'Snowfall', 'Seasons',
                'Holiday', 'Functioning Day']
data.columns = data.columns.str.replace(' ', '_').str.lower()
print('Data:')
display(data)
print('Data statistics:')
display(data.describe())
Exploratory data analysis
Target variable
Let's first take a closer look at the variable of interest, Rented Bike Count:
print('Rented Bike Count: mean {:.2f}, median {}, std {:.2f}'.format(
    data.rented_bike_count.mean(),
    data.rented_bike_count.median(),
    data.rented_bike_count.std()))
print('.'*75)
figure, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(ax=ax[0], data=data, x='rented_bike_count', kde=True)
ax[0].set_title('Histogram')
sns.boxplot(ax=ax[1], data=data, x='rented_bike_count')
ax[1].set_title('Boxplot')
plt.show()
Most values lie in the range 0 to 1200. The distribution is notably right-skewed (a long tail toward high counts), with a considerable difference between the mean and the median. Values above 2400 are probably outliers, but this needs further investigation.
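One way to put numbers on the skew and the outlier remark is to compute the sample skewness and the upper whisker limit that a boxplot uses by default (Q3 + 1.5 × IQR). The sketch below is only illustrative: the helper name `skew_and_upper_fence` is hypothetical, and the toy values are synthetic, not taken from the dataset.

```python
import pandas as pd

def skew_and_upper_fence(counts):
    """Return the sample skewness and the Tukey upper fence Q3 + 1.5 * IQR."""
    q1, q3 = counts.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return counts.skew(), upper_fence

# Synthetic toy values (an assumption for illustration, NOT real rental counts)
toy = pd.Series([0, 100, 200, 300, 400, 500, 700, 1000, 3500])

skewness, fence = skew_and_upper_fence(toy)
flagged = toy[toy > fence].tolist()

# A positive skewness indicates a long right tail
print(f"skewness: {skewness:.2f}, upper fence: {fence:.1f}")
print("values above the fence:", flagged)
```

In the notebook, the same helper could be applied to `data.rented_bike_count` to see which counts lie beyond the boxplot whisker.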
Target vs independent variables
Now let's see how the other variables in the dataset are distributed and how they correlate with Rented Bike Count.
Hour
Start with Hour, the hour of the day at which the bikes were rented:
figure, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(ax=ax[0], data=data, x='hour', y='rented_bike_count')
ax[0].set_title('Histogram')
sns.boxplot(ax=ax[1], data=data, x='hour', y='rented_bike_count', palette='magma')
ax[1].set_title('Boxplot')
plt.show()
There is a strong (and not really surprising) non-linear relationship between Hour and Rented Bike Count. Demand rises from 6 am to 6 pm.
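The hourly pattern can also be summarised numerically by grouping on the hour and aggregating the target. This is a minimal sketch: the helper name `hourly_profile` is hypothetical, the column names follow the lower-cased names used above, and the toy frame is synthetic rather than real data.

```python
import pandas as pd

def hourly_profile(df, hour_col="hour", target_col="rented_bike_count"):
    """Mean and median of the target for each hour of the day."""
    return df.groupby(hour_col)[target_col].agg(["mean", "median"])

# Synthetic toy frame (an assumption for illustration, NOT real rental counts)
toy = pd.DataFrame({
    "hour": [6, 6, 12, 12, 18, 18],
    "rented_bike_count": [100, 200, 500, 700, 1400, 1600],
})

profile = hourly_profile(toy)
peak_hour = profile["mean"].idxmax()  # hour with the highest mean demand

print(profile)
print("peak hour:", peak_hour)
```

Calling `hourly_profile(data)` in the notebook would give one row per hour, which makes it easy to check where demand peaks without reading it off the plot.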