Bike Sharing Demand
This dataset consists of the number of public bikes rented in Seoul's bike sharing system at each hour. It also includes information about the weather and the time, such as whether it was a public holiday.
Not sure where to begin? Scroll to the bottom to find challenges!
import pandas as pd
# Load the dataset
# The dataset is read from the CSV file 'data/SeoulBikeData.csv'
df = pd.read_csv("data/SeoulBikeData.csv")
df.head(100)
Source of dataset.
Citations:
- Sathishkumar V E, Jangwoo Park, and Yongyun Cho. 'Using data mining techniques for bike sharing demand prediction in metropolitan city.' Computer Communications, Vol. 153, pp. 353-366, March 2020.
- Sathishkumar V E and Yongyun Cho. 'A rule-based model for Seoul Bike sharing demand prediction using weather data.' European Journal of Remote Sensing, pp. 1-18, Feb. 2020.
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- 🗺️ Explore: Compare the average number of bikes rented by the time of day (morning, afternoon, and evening) across the four different seasons.
- 📊 Visualize: Create a plot to visualize the relationship between temperature and the number of bikes rented. Differentiate between seasons within the plot.
- 🔎 Analyze: Which variables correlate most with the number of bikes rented, and how strong are these relationships?
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
A bike-sharing startup has just hired you as their data analyst. The business is scaling quickly, but the demand fluctuates a lot. This means that there are not enough usable bikes available on some days, and on other days there are too many bikes. If the company could predict demand in advance, it could avoid these situations.
The founder of the company has asked you whether you can predict the number of bikes that will be rented based on information such as predicted weather, the time of year, and the time of day.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your steps, findings, and conclusions.
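As a starting point for the scenario above, here is a minimal baseline sketch. It is not the method of the cited papers: it fits an ordinary least-squares model with NumPy on synthetic data, and every value, column name (`hour`, `temp`, `count`), and the helper `design_matrix` is an assumption made for illustration. The split is chronological, so the model is only scored on hours that come after the ones it was fitted on.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for the real dataset (hypothetical values).
rng = np.random.default_rng(0)
n_hours = 24 * 60
hours = np.arange(n_hours) % 24
temp = 10 + 10 * np.sin(2 * np.pi * np.arange(n_hours) / n_hours)
count = (500 + 30 * temp + 200 * np.sin(2 * np.pi * hours / 24)
         + rng.normal(0, 50, n_hours))
demo = pd.DataFrame({"hour": hours, "temp": temp, "count": count})


def design_matrix(frame):
    """Intercept, temperature, and a cyclical (sin/cos) encoding of the hour."""
    angle = 2 * np.pi * frame["hour"].to_numpy() / 24
    return np.column_stack([
        np.ones(len(frame)),
        frame["temp"].to_numpy(),
        np.sin(angle),
        np.cos(angle),
    ])


# Chronological split: fit on the first 80% of hours, evaluate on the rest.
split = int(len(demo) * 0.8)
train, test = demo.iloc[:split], demo.iloc[split:]

# Ordinary least-squares fit.
coef, *_ = np.linalg.lstsq(design_matrix(train), train["count"].to_numpy(), rcond=None)
pred = design_matrix(test) @ coef

# Compare against a baseline that always predicts the training mean.
actual = test["count"].to_numpy()
model_mae = np.mean(np.abs(actual - pred))
baseline_mae = np.mean(np.abs(actual - train["count"].mean()))
print(f"model MAE: {model_mae:.1f}, mean-only baseline MAE: {baseline_mae:.1f}")
```

Comparing against a mean-only baseline gives the report a reference point: a model is only worth deploying if it clearly beats the simplest possible guess.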
print(df.columns)
# Rename columns
df = df.rename(columns={'Rented Bike Count': 'count', 'Temperature(C)': 'temp'})
print(df.columns)
print(df['temp'].describe())
print(df['count'].describe())
df['temp_normalized'] = (df['temp'] - df['temp'].min()) / (df['temp'].max() - df['temp'].min())
df['count_normalized'] = (df['count'] - df['count'].min()) / (df['count'].max() - df['count'].min())
# Check formats of 'Date' and 'Hour'
print(df['Date'].head())
print(df['Hour'].head())
# Combine the 'Date' and 'Hour' columns to create a 'datetime' column
# The 'Date' column appears to use the 'dd/mm/yyyy' format, so the format string is given explicitly
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Hour'].astype(str), format='%d/%m/%Y %H')
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Convert the datetime column to pandas datetime type
df['datetime'] = pd.to_datetime(df['datetime'])
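The timestamp construction above depends on the explicit day-first format: without it, a string such as '01/12/2017' could be read as 12 January. The sketch below shows an alternative way to build the same kind of column, parsing the date on its own and adding the hour as a time offset. The rows are hypothetical and only mimic the assumed 'dd/mm/yyyy' layout.

```python
import pandas as pd

# Hypothetical rows in the same 'dd/mm/yyyy' layout assumed for the 'Date' column.
sample = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017", "02/12/2017"],
    "Hour": [0, 13, 23],
})

# Parse the date with an explicit day-first format, then add the hour as an offset.
sample["datetime"] = (
    pd.to_datetime(sample["Date"], format="%d/%m/%Y")
    + pd.to_timedelta(sample["Hour"], unit="h")
)
print(sample["datetime"])
```

This variant avoids string concatenation and keeps the hour numeric throughout.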
# Extract time of day and season from the datetime column
df['hour'] = df['datetime'].dt.hour
# 'Seasons' already stores the season names as strings in this dataset
# (Winter, Spring, Summer, Autumn), so an integer-keyed map would yield only NaN
df['season'] = df['Seasons']
# Define time of day
def time_of_day(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'
df['time_of_day'] = df['hour'].apply(time_of_day)
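The same hour-to-label mapping can also be written without a row-by-row `apply`. The sketch below uses `np.select` with the same boundaries as the function above; the sample hours and the name `time_of_day_vec` are made up for illustration.

```python
import numpy as np
import pandas as pd

hours = pd.Series([0, 4, 5, 11, 12, 16, 17, 20, 21, 23])  # hypothetical sample hours
h = hours.to_numpy()

conditions = [
    (h >= 5) & (h < 12),
    (h >= 12) & (h < 17),
    (h >= 17) & (h < 21),
]
labels = ["Morning", "Afternoon", "Evening"]

# Hours matching no condition (21-23 and 0-4) fall through to 'Night'.
time_of_day_vec = pd.Series(np.select(conditions, labels, default="Night"),
                            index=hours.index)
print(time_of_day_vec.tolist())
```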
# Compare the average number of bikes rented by the time of day across the four different seasons
avg_bikes_time_season = df.groupby(['season', 'time_of_day'])['count'].mean().unstack()
print(avg_bikes_time_season)
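By default, `groupby` followed by `unstack` sorts the labels alphabetically, which puts 'Afternoon' before 'Morning'. The sketch below reorders the table chronologically with `reindex`. The data are invented, and `season_order` is an assumption that should be adjusted to whatever labels the season column actually holds.

```python
import pandas as pd

# Hypothetical mini-dataset; the counts are invented.
demo = pd.DataFrame({
    "season": ["Winter", "Winter", "Spring", "Summer", "Summer", "Autumn"],
    "time_of_day": ["Morning", "Night", "Evening", "Morning", "Morning", "Afternoon"],
    "count": [100, 50, 400, 600, 800, 300],
})

season_order = ["Spring", "Summer", "Autumn", "Winter"]  # assumed labels
tod_order = ["Morning", "Afternoon", "Evening", "Night"]

table = (
    demo.groupby(["season", "time_of_day"])["count"].mean()
        .unstack()
        .reindex(index=season_order, columns=tod_order)
)
print(table)
```

Combinations that never occur in the data show up as NaN rather than zero, which keeps "no observations" distinct from "no rentals".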
# Create a plot to visualize the relationship between temperature and the number of bikes rented
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='temp_normalized', y='count_normalized', hue='season')
plt.title('Relationship between Normalized Temperature and Number of Bikes Rented')
plt.xlabel('Normalized Temperature')
plt.ylabel('Normalized Number of Bikes Rented')
plt.legend(title='Season')
plt.show()
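The plot above uses min-max normalized columns. Note that this rescaling only changes the axis units: it is a linear transformation, so the shape of the scatter and the Pearson correlation stay the same. A small sketch with invented values (the helper `min_max` is hypothetical) illustrates this:

```python
import pandas as pd


def min_max(series):
    """Rescale a numeric Series linearly onto the interval [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


# Hypothetical values, not taken from the dataset.
demo = pd.DataFrame({
    "temp": [-5.0, 0.0, 10.0, 25.0, 30.0],
    "count": [120.0, 200.0, 650.0, 1400.0, 1100.0],
})

scaled = demo.apply(min_max)
raw_corr = demo["temp"].corr(demo["count"])
scaled_corr = scaled["temp"].corr(scaled["count"])
print(scaled)
print(raw_corr, scaled_corr)
```

For a report aimed at a broad audience, plotting the raw temperature and raw counts may therefore be easier to read than the normalized versions.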
# Analyze which variables correlate most with the number of bikes rented
# Select only numeric columns for correlation calculation
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr()
correlation_with_count = correlation_matrix['count'].sort_values(ascending=False)
print(correlation_with_count)
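One caveat with the ranking above: the numeric columns include the target itself and columns derived from it (such as `count_normalized`), which correlate perfectly with `count` and crowd the top of the list. Sorting by the signed value also pushes strong negative correlations to the bottom. The sketch below, on synthetic data with an invented relationship, drops derived columns and ranks by absolute correlation instead.

```python
import numpy as np
import pandas as pd

# Synthetic data with a made-up relationship, for illustration only.
rng = np.random.default_rng(42)
n = 500
temp = rng.normal(15, 10, n)
humidity = rng.uniform(20, 90, n)
count = 40 * temp - 5 * humidity + rng.normal(0, 100, n)
demo = pd.DataFrame({"count": count, "temp": temp, "humidity": humidity})
demo["count_normalized"] = (
    (demo["count"] - demo["count"].min()) / (demo["count"].max() - demo["count"].min())
)

derived = ["count_normalized"]  # columns computed from the target
corr = (
    demo.select_dtypes(include="number")
        .drop(columns=derived)
        .corr()["count"]
        .drop("count")  # the target always correlates 1.0 with itself
)

# Rank by strength of the relationship, keeping the sign for interpretation.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```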
# Draw Conclusions
conclusions = """
Key Insights:
1. The average number of bikes rented varies significantly by time of day and season.
2. There is a noticeable relationship between temperature and the number of bikes rented, with different patterns observed in different seasons.
3. Variables such as temperature, humidity, and wind speed show varying degrees of correlation with the number of bikes rented.
These insights can help the bike-sharing startup predict demand more accurately and manage their fleet more efficiently.
"""
print(conclusions)