Google Data Analytics Capstone: Bike Share
Scenario
I am assuming the role of a junior data analyst on Cyclistic's marketing team, tasked with analyzing differences in how casual riders and annual members use the company's bike-share service. The goal is to provide insights that will help design a marketing strategy to convert casual riders into annual members. The recommendations presented must be supported by strong data insights and professional visualizations to secure executive approval.
Stakeholders
Cyclistic
A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.
Lily Moreno
The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team
A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals—as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team
The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Table of Contents
- Ask
- Prepare
- Process
- Analysis of the dataset 4.1 Riders distribution 4.2 Average ride duration 4.3 Distribution of the total monthly rides 4.4 Distribution of weekly bike usage 4.5 Distribution of hourly bike usage 4.6 Average ride duration per hour 4.7 Most popular trip start stations 4.8 Most popular route 4.9 Bike type used by riders
- Share
- Act Acknowledgement
Ask
Three questions will guide the future marketing program:
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
Business Task
To analyze Cyclistic historical data to determine how annual members and casual riders are differnt in their use of Cyclistic bikes. Insights from this analysis will assist the team to design a marketing program targeting casual riders to become annual members in order to grow Cyclistic.
Prepare
Using Cyclistic’s historical trip data to analyze and identify trends. I downloaded the 2-19 data from here.
https://divvy-tripdata.s3.amazonaws.com/index.html
- Divvy_Trips_2019_Q1.zip
- Divvy_Trips_2019_Q2.zip
- Divvy_Trips_2019_Q3.zip
- Divvy_Trips_2019_Q4.zip
The data has been made available by Motivate International Inc. under this license.) This is public data that you can use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
Source Data
Data was pulled together by combing 4 csv files into a SQL server. The data was futher combined using SQL and loaded into a csv file (bike_data_all.csv)
WITH cte_bike_data AS ( SELECT [trip_id] ,[start_time] ,[end_time] ,[bikeid] ,[tripduration] ,[from_station_id] ,[from_station_name] ,[to_station_id] ,[to_station_name] ,[usertype] ,[gender] ,[birthyear] FROM [dbo].[Divvy_Trips_2019_Q1] UNION SELECT [trip_id] ,[start_time] ,[end_time] ,[bikeid] ,[tripduration] ,[from_station_id] ,[from_station_name] ,[to_station_id] ,[to_station_name] ,[usertype] ,[gender] ,[birthyear] FROM [dbo].[Divvy_Trips_2019_Q2] UNION SELECT [trip_id] ,[start_time] ,[end_time] ,[bikeid] ,[tripduration] ,[from_station_id] ,[from_station_name] ,[to_station_id] ,[to_station_name] ,[usertype] ,[gender] ,[birthyear] FROM [dbo].[Divvy_Trips_2019_Q3] UNION SELECT [trip_id] ,[start_time] ,[end_time] ,[bikeid] ,[tripduration] ,[from_station_id] ,[from_station_name] ,[to_station_id] ,[to_station_name] ,[usertype] ,[gender] ,[birthyear] FROM [dbo].[Divvy_Trips_2019_Q4] ) SELECT [trip_id] ,[start_time] ,[end_time] ,[bikeid] ,[tripduration] ,[from_station_id] ,[from_station_name] ,[to_station_id] ,[to_station_name] ,[usertype] ,[gender] ,[birthyear] ,DATEPART(weekday, start_time) AS day_of_week ,RIGHT('0' + CAST((DATEDIFF(SECOND, start_time, end_time) / 3600) AS VARCHAR), 2) + ':' + RIGHT('0' + CAST((DATEDIFF(SECOND, start_time, end_time) % 3600 / 60) AS VARCHAR), 2) + ':' + RIGHT('0' + CAST((DATEDIFF(SECOND, start_time, end_time) % 60) AS VARCHAR), 2) AS time_difference FROM cte_bike_data
Process
I will be using R to work with the data. I chose R because of its flexibility in data manipulation and visualization and also because I want to be proficient in using R for exploratory data analysis.
The cleaning and manipulation steps will be documented using this R notebook. I will be using the trip data from January 2019 to December 2019.
# Load packages
library(tidyverse)
library(lubridate)
library(janitor)
library(readr)# Load data
trips <- read_csv("bike_data.csv")# view the first few rows, an overview, and the last few rows
head(trips)
glimpse(trips)
tail(trips)# check for duplicates
nrow(distinct(trips)) == nrow(trips)# remove NAs
trips_clean <- drop_na(trips)
glimpse(trips_clean)# add additional fields
trips_clean <- trips_clean %>%
mutate(ride_duration = round(difftime(end_time, start_time, units = "mins"))) %>%
mutate(month = month(start_time, label = TRUE)) %>%
mutate(weekday = wday(start_time, label = TRUE)) %>%
mutate(start_hour = hour(start_time)) %>%
mutate(route = str_c(from_station_name, to_station_name, sep = " -to- "))
glimpse(trips_clean)# explore new fields
head(trips_clean)
summary(trips_clean)
# Checking for zero trip duration
trips_clean %>%
filter(ride_duration == 0)
# Checking for negative trip duration
trips_clean %>%
filter(ride_duration < 0)# remove zero and negative rides
bike_trips <- trips_clean %>%
filter(!ride_duration <= 0)
# review final data set
head(bike_trips)
summary(bike_trips)Dataset Analysis
I will now work out calculations and visuals for insights. We want to explain how casul riders are differnt form the member riders.
The following visuals will be created for this analyze.
Visuals
- Distribution of Riders
- Average Ride Duration
- Distribution of Rides Monthly
- Distribution of Rides per Week
- Distribution of Hourly Usage
- Average Ride per Hour
- Most Popular Start Stations
- Most Popular Bike Routes
- Type of Bike by Rider
Distribution of Riders
To beign, I will show a visual of the percentage of each category of rider.