Bikeshare Insights: Summer in the Windy City

This dataset contains information on Divvy Bikes, a bikeshare program that provides residents and visitors of Chicago with a convenient way to explore the city.

The workspace is set up with one data file containing bikeshare activity at the peak of summer (July 2023). Columns include ride ID, bike type, start and end times, station names and IDs, location coordinates, and member type. Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

Source: Divvy Bikes

🌎 Some guiding questions to help you explore this data:

  1. How many observations are in the dataset? Are there null values?
  2. How would you clean and prepare the data for analysis?
  3. Which bike types are popular and which ones aren't? Check if being a member or casual rider makes a difference in bike choice.
  4. Time check! What are the peak and off-peak riding times during the day?

📊 Visualization ideas

  • Bar chart: Display the number of times each bike type is used to identify the most and least used bikes.
  • Grouped bar chart: Compare bike usage by member type (member vs. casual) to see if it affects bike choice.
  • Heatmap: Vividly illustrate the popularity of bikes at different times during the day and week.
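As a sketch of the grouped bar chart idea, the ride counts per bike type can be split by member type and plotted directly from a pandas pivot. The toy frame below stands in for the real dataset; the column names `rideable_type` and `member_casual` are the standard Divvy trip-data names and are assumed to match.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame standing in for the Divvy data (column names assumed)
rides = pd.DataFrame({
    "rideable_type": ["classic_bike", "electric_bike", "classic_bike",
                      "electric_bike", "docked_bike", "classic_bike"],
    "member_casual": ["member", "member", "casual",
                      "casual", "casual", "member"],
})

# Count rides per bike type, split by member type
counts = (rides.groupby(["rideable_type", "member_casual"])
               .size()
               .unstack(fill_value=0))

# Grouped bar chart: one group per bike type, one bar per member type
counts.plot(kind="bar", figsize=(8, 4))
plt.ylabel("Number of rides")
plt.title("Bike usage by member type")
plt.tight_layout()
```

On the full dataset, the same `groupby(...).size().unstack()` pattern applied to `divvy` would produce the member-vs-casual comparison described above.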

Exploratory Analysis 🚵

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

divvy_jul2023 = pd.read_parquet("202307-divvy-tripdata.parquet")
divvy_jul2023.head()

# create a copy of the dataset
divvy = divvy_jul2023.copy()

# check the summary of the dataset
divvy.info()

Check the null value counts and what percentage of each column consists of null values.

# Null values, Non-null values, % of Null values, Unique Count breakdown


def divvy_info():
    temp = pd.DataFrame(index=divvy.columns)
    temp["Datatype"] = divvy.dtypes
    temp["Not Null Values"] = divvy.count()
    temp["Null Values"] = divvy.isnull().sum()
    temp["Percentage of Null Values"] = divvy.isnull().mean() * 100
    temp["Unique Count"] = divvy.nunique()
    return temp
divvy_info()

From this analysis, we can draw some observations.

  • start_station_name, start_station_id, end_station_name, end_station_id, end_lat, and end_lng have null values.

The null values in these columns make up only a small percentage of the dataset, so dropping the affected rows still leaves a sufficient sample for analysis. Therefore, the rows containing these nulls will be dropped.

  • started_at and ended_at are stored as the object datatype.

These columns need to be converted to datetime.

# drop rows with null values in the affected columns

divvy = divvy.dropna(subset=['start_station_name', 'start_station_id', 'end_station_name','end_station_id', 'end_lat', 'end_lng'])

# confirm if the null values have been dropped

divvy.info()
# convert `started_at` and `ended_at` columns to datetime datatype

divvy['started_at'] = pd.to_datetime(divvy['started_at'])
divvy['ended_at'] = pd.to_datetime(divvy['ended_at'])

#confirm that the datatype has changed

divvy.info()
# check if the end time is earlier than the start time
false_date = divvy.loc[divvy['started_at'] > divvy['ended_at']]

# (no. of rows with invalid dates, no. of columns)
false_date

Three rows have an end time earlier than the start time, which makes those records invalid.

I will be dropping these rows.
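A minimal sketch of the drop, shown on a toy frame standing in for `divvy` (one deliberately invalid row, where the ride ends before it starts):

```python
import pandas as pd

# Toy frame standing in for divvy; the second row ends before it starts
trips = pd.DataFrame({
    "started_at": pd.to_datetime(["2023-07-01 10:00",
                                  "2023-07-01 12:00",
                                  "2023-07-01 14:00"]),
    "ended_at":   pd.to_datetime(["2023-07-01 10:30",
                                  "2023-07-01 11:00",
                                  "2023-07-01 14:45"]),
})

# Keep only rows where the start time is not after the end time
trips = trips.loc[trips["started_at"] <= trips["ended_at"]].reset_index(drop=True)
```

In the notebook, the same boolean filter applied to the real frame, `divvy = divvy.loc[divvy['started_at'] <= divvy['ended_at']]`, drops the three invalid rows.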
