Bike Sharing Demand EDA & Linear Regression

Bike Sharing Demand

This dataset consists of the number of public bikes rented in Seoul's bike sharing system at each hour. It also includes information about the weather and the time, such as whether it was a public holiday.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/SeoulBikeData.csv")
df

Source of dataset.

Citations:

Sathishkumar V E, Jangwoo Park, and Yongyun Cho. 'Using data mining techniques for bike sharing demand prediction in metropolitan city.' Computer Communications, Vol. 153, pp. 353-366, March 2020.
Sathishkumar V E and Yongyun Cho. 'A rule-based model for Seoul Bike sharing demand prediction using weather data.' European Journal of Remote Sensing, pp. 1-18, Feb. 2020.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

🗺️ Explore: Compare the average number of bikes rented by the time of day (morning, afternoon, and evening) across the four different seasons.
📊 Visualize: Create a plot to visualize the relationship between temperature and the number of bikes rented. Differentiate between seasons within the plot.
🔎 Analyze: Which variables correlate most with the number of bikes rented, and how strong are these relationships?
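One possible starting point for the Explore challenge is sketched below. It is not part of the original analysis: the helper names `time_of_day` and `mean_rentals_by_period_and_season` are hypothetical, the hour cut-offs for morning, afternoon, and evening are an assumption, the `Seasons` column name is assumed, and the small `toy` frame holds made-up values so the example runs without the CSV.

```python
import pandas as pd


def time_of_day(hour: int) -> str:
    """Map an hour (0-23) to a coarse label. The cut-offs are an assumption."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def mean_rentals_by_period_and_season(data: pd.DataFrame) -> pd.DataFrame:
    """Average rentals per season (rows) and time of day (columns)."""
    labelled = data.assign(period=data["Hour"].map(time_of_day))
    return labelled.pivot_table(
        index="Seasons",
        columns="period",
        values="Rented Bike Count",
        aggfunc="mean",
    )


# Made-up rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "Hour": [8, 9, 14, 19, 8, 20],
    "Seasons": ["Summer", "Summer", "Summer", "Summer", "Winter", "Winter"],
    "Rented Bike Count": [100, 300, 400, 900, 50, 150],
})

table = mean_rentals_by_period_and_season(toy)
print(table)
```

On the real data, the same function could be called as `mean_rentals_by_period_and_season(df)`, provided the column names match.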

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

A bike-sharing startup has just hired you as their data analyst. The business is scaling quickly, but the demand fluctuates a lot. This means that there are not enough usable bikes available on some days, and on other days there are too many bikes. If the company could predict demand in advance, it could avoid these situations.

The founder of the company has asked you whether you can predict the number of bikes that will be rented based on information such as predicted weather, the time of year, and the time of day.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your steps, findings, and conclusions.

df.info()

df.describe()

print(df.isnull().sum())

Distributions of Rented Bike Count, Temperature(C), Humidity(%), Visibility (10m), and Rainfall(mm)

The distribution of Rented Bike Count is positively skewed: most hours see relatively few rentals, while a small number of hours see very high demand.

plt.hist(df['Rented Bike Count'], bins=20)
plt.xlabel('Rented Bike Count')
plt.ylabel('Frequency')
plt.title('Distribution of Rented Bike Count')
plt.show()
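The skew visible in the histogram can also be checked numerically. The sketch below is an illustration rather than part of the original notebook: `skew_summary` is a hypothetical helper and `toy_counts` contains made-up values. A positive skewness, together with a mean above the median, is what a right-skewed distribution looks like in numbers.

```python
import pandas as pd


def skew_summary(counts: pd.Series) -> dict:
    """Return the mean, median, and sample skewness of a numeric series."""
    return {
        "mean": counts.mean(),
        "median": counts.median(),
        "skewness": counts.skew(),
    }


# Made-up counts with one very busy hour; not values from the dataset.
toy_counts = pd.Series([10, 20, 30, 40, 50, 400])

summary = skew_summary(toy_counts)
print(summary)
```

Applied to the real data, this would be `skew_summary(df['Rented Bike Count'])`.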

# Select the fields for which you want to create histograms
fields = ['Temperature(C)', 'Humidity(%)', 'Visibility (10m)','Rainfall(mm)']

# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Iterate over each field and create a histogram in the corresponding subplot
for i, field in enumerate(fields):
    row = i // 2  # Row index of the subplot
    col = i % 2   # Column index of the subplot
    ax = axes[row][col]
    ax.hist(df[field], bins=20, alpha=0.7)
    ax.set_title(field)
    ax.set_xlabel(field)
    ax.set_ylabel('Frequency')

# Adjust spacing between subplots
plt.tight_layout()

# Show the histograms
plt.show()

Average Demand per Hour

The hourly averages are bimodal, with peaks during the morning and evening rush hours.

import numpy as np

# Calculate the average rented bike count for each hour
hourly_average = df.groupby('Hour')['Rented Bike Count'].mean().reset_index()

# Create a polar plot using matplotlib
fig = plt.figure(figsize=(8, 6))
ax = plt.subplot(111, polar=True)

# Convert hour values to radians (24 hours span one full circle);
# subtracting from 24 makes the hours run clockwise
theta = (24 - hourly_average['Hour']) * 2 * np.pi / 24

# Plot the average rented bike count using Seaborn
sns.lineplot(x=theta, y=hourly_average['Rented Bike Count'], sort=False, ax=ax)

# Customize the plot
hour_labels = ['12 AM', '1 AM', '2 AM', '3 AM', '4 AM', '5 AM', '6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM',
               '12 PM', '1 PM', '2 PM', '3 PM', '4 PM', '5 PM', '6 PM', '7 PM', '8 PM', '9 PM', '10 PM', '11 PM']
ax.set_xticks(theta)
ax.set_xticklabels(hour_labels, fontsize=8)
ax.set_yticklabels([])  # Hide y-axis labels
ax.set_title('Average Rented Bike Count by Hour', pad=20)

# Show the polar plot
plt.show()
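The peak hours can also be read off programmatically instead of by eye. The following is a minimal sketch under stated assumptions: `peak_hours` is a hypothetical helper that treats an hour as a peak when its average exceeds both neighbouring hours (wrapping around midnight), and the `profile` values are invented to mimic a two-peak day.

```python
import pandas as pd


def peak_hours(hourly_mean: pd.Series) -> list:
    """Return hours whose mean is higher than both neighbouring hours.

    The series is indexed by hour; hour 23 and hour 0 count as neighbours.
    """
    ordered = hourly_mean.sort_index()
    hours = ordered.index.tolist()
    values = ordered.tolist()
    n = len(values)
    return [
        hours[i]
        for i in range(n)
        # values[i - 1] wraps to the last hour when i == 0
        if values[i] > values[i - 1] and values[i] > values[(i + 1) % n]
    ]


# Invented hourly profile (hours 0-23) with two rush-hour bumps.
profile = [50, 40, 30, 20, 20, 40, 100, 300, 600, 400, 300, 350,
           400, 420, 440, 500, 600, 800, 1000, 700, 500, 400, 300, 150]

# Two made-up observations per hour so that the groupby has something to average.
toy = pd.DataFrame({
    "Hour": list(range(24)) * 2,
    "Rented Bike Count": [v - 10 for v in profile] + [v + 10 for v in profile],
})

toy_hourly = toy.groupby("Hour")["Rented Bike Count"].mean()
peaks = peak_hours(toy_hourly)
print(peaks)  # [8, 18] for this invented profile
```

With the notebook's data, `peak_hours(df.groupby('Hour')['Rented Bike Count'].mean())` would list the local maxima of the curve plotted above. Small bumps in noisy data would also be reported, so the result is best read alongside the plot.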

Box Plot of Demand per Hour

The box plot also shows peaks at 9 AM and 7 PM.

# Create a box plot of the rented bike count distribution for each hour
fig = plt.figure(figsize=(8, 6))
ax = plt.subplot(111)

# Plot the box plot using Seaborn
sns.boxplot(x='Hour', y='Rented Bike Count', data=df, ax=ax)

# Customize the plot
hour_labels = ['12 AM', '1 AM', '2 AM', '3 AM', '4 AM', '5 AM', '6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM',
               '12 PM', '1 PM', '2 PM', '3 PM', '4 PM', '5 PM', '6 PM', '7 PM', '8 PM', '9 PM', '10 PM', '11 PM']
# Rotate the labels so that all 24 fit without overlapping
ax.set_xticklabels(hour_labels, fontsize=8, rotation=45)
ax.set_xlabel('Hour')
# Each box summarizes the raw hourly counts, not averages
ax.set_ylabel('Rented Bike Count')
ax.set_title('Box Plot of Rented Bike Count by Hour', pad=20)

# Show the box plot
plt.show()

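# Restrict to numeric columns: recent pandas versions raise an error
# when corr() is called on a frame that contains text columns
correlation_matrix = df.select_dtypes(include='number').corr()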
plt.figure(figsize=(10, 8))
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Matrix Heatmap')
plt.show()
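A heatmap gives an overview, but a sorted list is easier to read when the question is which variables relate most strongly to demand. The sketch below is one way to do that, not the notebook's own method: `rank_correlations` is a hypothetical helper, and the `toy` frame uses made-up numbers so the example runs on its own.

```python
import pandas as pd


def rank_correlations(data: pd.DataFrame,
                      target: str = "Rented Bike Count") -> pd.Series:
    """Pearson correlation of each numeric column with the target,
    ordered from strongest to weakest by absolute value."""
    numeric = data.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up values for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Rented Bike Count": [100, 200, 300, 400, 500],
    "Temperature(C)": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Humidity(%)": [80, 60, 70, 40, 50],
    "Seasons": ["Winter"] * 5,  # text column, ignored by the helper
})

ranked = rank_correlations(toy)
print(ranked)
```

On the real data, `rank_correlations(df)` would return the signed correlations, so both the strength and the direction of each relationship are visible. Pearson correlation only captures linear association, which is worth keeping in mind when interpreting the ranking.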