Bike Sharing Demand

This dataset consists of the number of public bikes rented in Seoul's bike sharing system at each hour. It also includes information about the weather and the time, such as whether it was a public holiday.

Not sure where to begin? The challenges below offer a starting point.

import pandas as pd
seoul = pd.read_csv("data/SeoulBikeData.csv")
seoul.head(100)

Source of dataset.

Citations:

  • Sathishkumar V E, Jangwoo Park, and Yongyun Cho. 'Using data mining techniques for bike sharing demand prediction in metropolitan city.' Computer Communications, Vol. 153, pp. 353-366, March 2020.
  • Sathishkumar V E and Yongyun Cho. 'A rule-based model for Seoul Bike sharing demand prediction using weather data.' European Journal of Remote Sensing, pp. 1-18, Feb. 2020.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • 🗺️ Explore: Compare the average number of bikes rented by the time of day (morning, afternoon, and evening) across the four different seasons.
  • 📊 Visualize: Create a plot to visualize the relationship between temperature and the number of bikes rented. Differentiate between seasons within the plot.
  • 🔎 Analyze: Which variables correlate most with the number of bikes rented, and how strong are these relationships?
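
As a starting point for the Explore challenge, here is a minimal sketch of one way to bucket hours into times of day and average rentals per season. The 'Hour' column name and the bin edges (morning 6-11, afternoon 12-17, evening 18-23) are assumptions, not something the dataset description fixes, and the small frame at the end is made up purely to show the call.

```python
import pandas as pd


def mean_rentals_by_time_of_day(df, hour_col="Hour", season_col="Seasons",
                                count_col="Rented Bike Count"):
    """Return mean rentals per season (rows) and time-of-day bucket (columns)."""
    # Assumed bucket edges: [6, 12) morning, [12, 18) afternoon, [18, 24) evening.
    # Hours 0-5 fall outside every bucket and are left out of the averages.
    time_of_day = pd.cut(
        df[hour_col],
        bins=[6, 12, 18, 24],
        labels=["morning", "afternoon", "evening"],
        right=False,
    ).rename("time_of_day")
    # Rows with a missing bucket are skipped by groupby's default dropna=True.
    return (
        df.groupby([season_col, time_of_day], observed=True)[count_col]
          .mean()
          .unstack()
    )


# Made-up rows for illustration only; these are not real Seoul data.
toy = pd.DataFrame({
    "Hour": [3, 8, 9, 14, 20, 8, 15, 21],
    "Seasons": ["Winter"] * 5 + ["Summer"] * 3,
    "Rented Bike Count": [10, 100, 200, 300, 400, 500, 600, 700],
})
table = mean_rentals_by_time_of_day(toy)
print(table)
```

On the real data the call would be mean_rentals_by_time_of_day(seoul), provided the hour column is actually named 'Hour'; adjust hour_col and the bucket edges to whatever definition of morning, afternoon, and evening suits the analysis.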

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

A bike-sharing startup has just hired you as their data analyst. The business is scaling quickly, but the demand fluctuates a lot. This means that there are not enough usable bikes available on some days, and on other days there are too many bikes. If the company could predict demand in advance, it could avoid these situations.

The founder of the company has asked you whether you can predict the number of bikes that will be rented based on information such as predicted weather, the time of year, and the time of day.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your steps, findings, and conclusions.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'seoul' DataFrame is already defined and loaded with data

# Print info
print(seoul.info())

# Inspect the categorical columns and encode them as numbers.
# Seasons gets an ordinal code, which imposes the order Winter < Spring < Summer < Autumn.
print(seoul['Seasons'].unique())
seoul['seasons_ord'] = seoul['Seasons'].map({'Winter': 1, 'Spring': 2, 'Summer': 3, 'Autumn': 4})

print(seoul['Holiday'].unique())
seoul['holiday_nom'] = np.where(seoul['Holiday'] == 'Holiday', 1, 0)

print(seoul['Functioning Day'].unique())
seoul['func_day_nom'] = np.where(seoul['Functioning Day'] == 'Yes', 1, 0)

# Parse dates written as day/month/year, then add year and month columns
seoul['Date'] = pd.to_datetime(seoul['Date'], format='%d/%m/%Y')
seoul['year'] = seoul['Date'].dt.year
seoul['month'] = seoul['Date'].dt.month

# Calculate correlation matrix and print Rented bike count correlation with other variables
seoul_corr = seoul.corr(numeric_only=True)
print(seoul_corr['Rented Bike Count'])
sns.heatmap(seoul_corr)
plt.show()

# Plot temperature vs bike count with linear regression line
# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
coef1, intercept = np.polyfit(seoul['Temperature(C)'], seoul['Rented Bike Count'], 1)
print(intercept)
xp = np.arange(-20, 40, 1)
yp = intercept + xp * coef1

sns.scatterplot(data=seoul, x='Temperature(C)', y='Rented Bike Count')
sns.lineplot(x=xp, y=yp, color='red')
plt.title('Bike rent count vs temperature (C)')
plt.show()

# Plot temperature vs bike count with a separate regression line per season
sns.lmplot(data=seoul, x='Temperature(C)', y='Rented Bike Count', order=1, hue='Seasons')
plt.title("Bike rent count vs temperature (C) for different seasons")
plt.show()

# Imports for the regression models
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Prepare for regression analysis
# Drop the original text and date columns, which are now encoded numerically
seoul_num = seoul.drop(['Seasons', 'Holiday', 'Functioning Day', 'Date'], axis=1)
print(seoul_num.info())

# Generate predictors and dependent variables
X = seoul_num.drop('Rented Bike Count', axis=1)
feature_cols = X.columns
X = X.values
y = seoul_num['Rented Bike Count'].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a regression model and calculate r-squared and rmse
reg = LinearRegression()
reg.fit(X_train, y_train)
rsq = reg.score(X_test, y_test)
y_pred = reg.predict(X_test)
rmse = MSE(y_test, y_pred) ** 0.5
# Format inside the f-string so this works for Python floats and NumPy scalars alike
print(f' rsq {rsq:.2f}')
print(f' rmse {rmse:.2f}')

# Try KFold and print
kf = KFold(n_splits=6, shuffle=True, random_state=42)
cv_scores = cross_val_score(reg, X_train, y_train, cv=kf)
print('CV scores\n')
print(cv_scores.round(2))
print(f'Mean CV score: {cv_scores.mean().round(2)}')
print(f'Best CV score: {cv_scores.max().round(2)}')

# Try Ridge regression
ridge = Ridge(alpha=0.3)
ridge.fit(X_train, y_train)
ridge_score = ridge.score(X_test, y_test)
print(f' ridge score {ridge_score:.2f}')

# Try Lasso and plot coefficients
lasso = Lasso(alpha=0.3)
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_
plt.bar(feature_cols, lasso_coef)
plt.xticks(rotation=90)
plt.show()

# Import for the random forest model
from sklearn.ensemble import RandomForestRegressor

# Assumes X_train, y_train, X_test, y_test, and feature_cols are already defined

# Set up and fit a random forest regressor
rf = RandomForestRegressor(n_estimators=25, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Calculate and print r-squared and RMSE
rmse = MSE(y_test, y_pred) ** 0.5
score = rf.score(X_test, y_test)
# Format inside the f-string so this works for Python floats and NumPy scalars alike
print(f' rsq {score:.2f}')
print(f' rmse {rmse:.2f}')

# Calculate and plot importance of features
importances = pd.Series(data=rf.feature_importances_, index=feature_cols)
importances_sorted = importances.sort_values()
importances_sorted.plot(kind='barh')
plt.title('Feature importances')
plt.show()