Bay Wheels 2022 Bike Sharing

Source: https://www.lyft.com/bikes/bay-wheels/explore

How do casual and member users ride differently in San Francisco Bay Wheels bike sharing?

Introduction

Bay Wheels is a bike-sharing system operated by Lyft, Inc. in San Francisco, San Jose, and the East Bay in northern California, United States. It features hundreds of docking stations and thousands of classic bikes and e-bikes in the Bay Area.

The system offers good pricing flexibility, which attracts many customers all year. However, the director of marketing believes that annual subscriptions are fundamental to the company's future growth and success. To make the business more profitable, he needs to better understand the behavior of casual and member users to make appropriate decisions.

In this data analysis project, I will answer the business questions, such as:

How often do casual and member users ride compared to each other?
When do casuals and members prefer to use the service?
Who usually takes longer rides, casuals or members?
How does the number of rides change throughout a week, month, year?
What type of bike is more popular, classic or electric?

In this project, I used PostgreSQL for analysis, Markdown and MS Excel for visualization. The Datacamp Workspace is used for building and formatting.

This data analysis and visualizations are based on publicly available one-year period data of Bay Wheels trip data in the year 2022, under this license. The dataset can be downloaded here in CSV format for each month.

3 hidden cells

Data Preperation

Before I started coding, I needed to prepare the data in the best method. On the BayWheels website I downloaded publicly available bike trips data for each month in 2022.

The biggest challenge was to append 12 CSV downloaded files into one huge database. Luckily, there was a solution in the DataCamp workspace by writing a SQL code. First, I created a table with all the columns and field types the same as in each csv file. Then, I inserted data from each CSV file into the new table. In the end, I exported the appended data to a new big csv file. The SQL code below illustrates these steps:

-- Create a table to store the appended data
DROP TABLE IF EXISTS baywheels_appended_tripdata_2022;

CREATE TABLE baywheels_appended_tripdata_2022 (
	ride_id varchar,
	rideable_type varchar,
	started_at text,
	ended_at text,
	start_station_name varchar,
	start_station_id varchar,
	end_station_name varchar,
	end_station_id varchar,
	start_lat double precision,
	start_lng double precision,
	end_lat double precision,
	end_lng double precision,
	member_casual varchar
);

-- Insert data from each CSV file into the table
COPY baywheels_appended_tripdata_2022 FROM '202201-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202202-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202203-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202204-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202205-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202206-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202207-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202208-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202209-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202210-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202211-baywheeels-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY baywheels_appended_tripdata_2022 FROM '202212-baywheels-tripdata.csv' DELIMITER ',' CSV HEADER;
-- Repeat the above line for all 12 files

-- Export the data to a new CSV file
COPY baywheels_appended_tripdata_2022 TO 'baywheels_tripdata_2022.csv' DELIMITER ',' CSV HEADER;

The combined table contains 2,614,800 rows and 13 columns. It's a lot of data!

1 hidden cell

Data Exploration and Cleaning

Each row of data represents a single ride with date and time, location, type of bike, and user type. Here is how I validated the data before the analysis, column by column:

ride_id: The unique identifier of each ride that contains 16 characters. There are no duplicates or null values within over 2.6 million records.
rideable_type: There are three unique values - classic_bike, electric_bike, and docked_bike. The rides using the docked_bike didn't occur after March 24th. It's very likely that "docked_bike" is the old name for "classic_bike". As a result, all "docked_bike" values have been changed to "classic_bike", which affected 9250 rows. There were no null values.
started_at / ended_at: Timestamp fields that show start and end of each ride, respectively. For further analysis and visualization, I created two new columns - ride_duration (computed as ended_at - started_at) and day_of_week. Next, I noticed that in some records, the "ended_at" value was less than the "started_at" value (negative ride duration), as well as very short or long trips. Therefore, I considered all rides that have less than 1 minute and over 24 hours as outliers. 74057 rows were deleted.
start_station_name / start_station_id: According to the rules of Lyft, the bike-sharing system provider, all classic bike trips must start and end at a docking station. On the other hand, electric bikes have a lock option which makes it possible to start or finish a ride outside a docking station. Considering this, I found no null values in the case a ride was taken on a "classic_bike".
end_station_name / end_station_id: Similar to the previous columns, I validated all the rows with classic bikes. 439 rows with null values have been removed.
start_lat / start_lng: Represent geographical coordinates in decimal degrees of the starting location. Because each ride always has coordinates, both start and finish, there couldn't be any null values. Luckily, I didn't find any.
end_lat / end_lng: Here, no null values were found as well, but there were 2 records with 0-value coordinates. The geographic location of the research area makes it impossible to obtain such values, thus I deleted these rows.
member_casual: "Casual" and "member" are the only two allowed values. No null or other values were detected.

-- Create a table to hold the data
DROP TABLE IF EXISTS baywheels_tripdata_2022;

CREATE TABLE baywheels_tripdata_2022 (
    ride_id varchar,
    rideable_type varchar,
    started_at timestamp,
    ended_at timestamp,
    start_station_name varchar,
    start_station_id varchar,
    end_station_name varchar,
    end_station_id varchar,
    start_lat double precision,
    start_lng double precision,
    end_lat double precision,
    end_lng double precision,
    member_casual varchar
);

-- Import data from CSV into the table (replace 'your_file_path' with the actual path)
COPY baywheels_tripdata_2022 FROM './baywheels_tripdata_2022.csv' DELIMITER ',' CSV HEADER;

-- Replace docked_bike values with classic_bike
UPDATE baywheels_tripdata_2022
SET rideable_type = 'classic_bike'
WHERE rideable_type = 'docked_bike';

-- Add two new columns to our table
ALTER TABLE baywheels_tripdata_2022
ADD COLUMN ride_duration interval;
ALTER TABLE baywheels_tripdata_2022
ADD COLUMN day_of_week varchar;

-- Compute data about trip duration and day of the week
UPDATE baywheels_tripdata_2022
SET ride_duration = ended_at-started_at,
	day_of_week = DAYNAME(started_at)::varchar;
	
-- Remove outlying values based on trip duration
DELETE FROM baywheels_tripdata_2022
WHERE ride_duration < INTERVAL '1 minute'
	  OR ride_duration > INTERVAL '24 hours';

--Remove null values in end_station_name for classic bike trips
DELETE FROM baywheels_tripdata_2022
WHERE end_station_name IS NULL
	  AND rideable_type = 'classic_bike';

-- Remove zero values of end coordinates
DELETE FROM baywheels_tripdata_2022
WHERE end_lat=0;

-- Query the modified data
SELECT COUNT(*)
FROM baywheels_tripdata_2022;

In the end, I obtained 2,540,302 public bike rides that occurred throughout the year 2022 in the San Francisco Bay Area for analysis.

1 hidden cell

Analysis and Visualization

Thank you for staying this far through reading my project. This is where things get really interesting.

Distribution of rides by city

I started off by analyzing the distribution of all rides in major Bay Area cities. The fastest way to figure it out was by using the geographic coordinates and CASE WHEN structure.

The vast majority of records, 2,309,049 or 90.9%, belong to San Francisco. Only a fraction of rides started either in San Jose (2.69%) or in the East Bay (Oakland and Berkeley) (6.41%).

Member vs casual user behavior: general statistics

As a next step, I analyzed in general how member users and casual users behave differently. Users with a membership rented a bike 1,460,433 times, which is about 57% of all rides.

Then I compared the mean and the median of the ride duration for both types of users. The mean value (the average) turned out to be greater than the median (a middle value of all the values sorted from biggest to smallest) for both casual and member user types. Member's average ride time is about 3 minutes than its median, whereas for a casual user the difference is bigger - over 6.5 minutes.

The reason might be that there is a handful of very long rides (up to 24 hours) which bring the average up, especially in the case of casual rides.

User behavior during the day

The user activity varies significantly throughout a day. The peak activity of any user type usually occurs twice during the day: in the morning (8-9 am) and in the evening (5-6 pm). Especially in the case of members, who use bikes mostly to commute to work or school and return home. The lowest activity happens between midnight and 5 am, regardless of the user type.

)

Ride frequency during rush hours

Let's take a closer look at rush hours. In this binary analysis, I considered all rides taken between 8 and 9 in the morning and between 5 and 6 in the evening as rush hour. Saturdays and Sundays were excluded entirely, as well as federal holidays in 2022.

Indeed, users with a membership tend to ride more during rush hours compared to casual riders. However, about 3 out of 4 member rides occur outside the busy hours.

User behavior during the week

The analysis of the distribution of rides during the week among casual and member riders provided helpful insights to understand users' behavior.

The average ride duration casual users fluctuates from the minimum of around 16.5 minutes in mid-week to the maximum of 22.2 minutes on Sundays. Member riders remain more consistent in the amount of time for a ride. The value keeps steady from Monday to Friday and slightly increases on the weekend.

Casual users in general take longer rides and use bikes mostly for recreational purposes. This hypothesis makes even more sense when we look at the bar chart below:

Saturdays and Sundays are the only days when casual rides outnumber member ones. In fact, about every third trip taken by casual users occurred on the weekend. For members, the rate is only 22.9%.

User behavior throughout the year

Casual and member riders' preferences on a monthly basis

The similar code helped me to figure out the distribution of rides throughout the year:

The total number of rides gradually increases from January through spring and summer and reaches its highest point in September and October when the warm weather and clear skies are dominant. Then it falls drastically in November and December when it gets colder along with rainy days.

Casual and member users have a similar number of rides most of the year, only in November-December members outnumber casual users significantly. In every single month, there were more member rides than casual rides.

The average ride duration remains steady in different months for both casual and member users. Only in November-December period, people tend to ride shorter because of colder weather and longer nights.

Casual and member riders' preferences on a seasonal basis

In general, 25.1% of all rides took place in spring, 29.2% took place in summertime, 28.1% happened during fall, and only 17.6% occurred in winter.

Although for both member and casual user types, winter was the least active season, the high season was different. Casual users used the service mostly in summer, but for members, fall was the most popular season. That's the time when vacations are over and the weather is still warm and pleasant.

Bike type preferences among member & casual riders

General statistics

In the San Francisco Bay Area in 2022, electric bikes were almost twice as popular as classic bikes among all users. In fact, casual users chose an electric bike 68.1% of the time, which is a higher rate than members did (61.8% of rides via an electric bike).

Bike type choice during the day

During each hour within a 24-hour span, there were roughly twice as many rides via an electric bike than by a classic one, regardless of the user type. The similar ratio was also observed during rush hours.

Bike type preferences on weekdays & weekends/holidays

The data shows that members took 75.5% of all rides on weekdays and 24.5% on days off. On the other hand, casual users tend to spend more time biking on weekends and federal holidays than members do, with percentages of 64.4% and 35.6% respectively.

Bike type usage throughout the year

The way different bike types were used stays similar throughout the entire year, with small differences. In February-March, there were about 1.7 electric bike rides per 1 classic bike ride on average, but in July the ratio was 1.95:1, respectively.

September had the greatest number of rides for both classic and electric bikes. In contrast, December was the least popular month for all bike types.

Key Findings

The bike service in San Francisco accounted 90.9% of rides.
Member users had a median ride duration of 9 minutes 3 seconds, casual riders had a median ride duration of 12 minutes 6 seconds.
Members primarily used the service for commuting, while casual riders used it for recreational purposes.
Ride distribution during the day followed general human behavior and working habits, while seasonal variations were influenced by climate patterns.
Electric bikes were preferred over classic bikes regardless of the time of day, season, or user type.

Recommendations

Implement additional fees for bike usage during rush hours to incentivize casual users to switch to a membership.
Offer a year-round discount for an annual subscription to highlight potential cost savings for casual riders.
Provide a special discount for an annual subscription for rides over 1 hour during weekends to encourage conversion to membership.
Increase bike inventory in San Jose and East Bay to improve availability in these areas.

Thank you for reading till the very end! Your feedback and sharing are greatly appreciated.