Bikeshare Insights: Summer in the Windy City
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Exploratory Data Analysis 🧭
First I inspect the dataframe to learn its main characteristics: how large it is, the data types it contains, and how many null values each column has.
raw_bikes_df = pd.read_csv("202307-divvy-tripdata.csv")
display(raw_bikes_df)
# Initial description and discovery of null values
print("Number of rows and columns: ", raw_bikes_df.shape, "\n\n")
print("Data types:")
raw_bikes_df.info()  # info() prints its own report and returns None, so it is not wrapped in print()
print("\n")
print("Number of Null values\n", raw_bikes_df.isnull().sum(), "\n\n")
Even though the missing values do not affect the analysis in this case, I handle them explicitly. Missing station names are filled with "Other", which stands for stations that are not listed in the dataframe, and missing station ids are filled with "ND00", which stands for "No Data". Missing end coordinates are filled with 0.
# Work on a copy so the raw dataframe stays unchanged
pure_bikes_df = raw_bikes_df.copy()
pure_bikes_df["start_station_name"] = pure_bikes_df["start_station_name"].fillna("Other")
pure_bikes_df["start_station_id"] = pure_bikes_df["start_station_id"].fillna('ND00')
pure_bikes_df["end_station_name"] = pure_bikes_df["end_station_name"].fillna("Other")
pure_bikes_df["end_station_id"] = pure_bikes_df["end_station_id"].fillna("ND00")
pure_bikes_df["end_lat"] = pure_bikes_df["end_lat"].fillna(0)
pure_bikes_df["end_lng"] = pure_bikes_df["end_lng"].fillna(0)
print(pure_bikes_df.isnull().sum())
After handling the missing values I started the analysis. First I looked at the proportion in which each type of user (casual or member) uses each of the available bike types. The initial result shows that users, whether members or not, are more likely to use the electric bike. It also shows that more than 56% of the recorded rides were made by members. Finally, one bike type is used exclusively by casual users: the docked bike.
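The preference for electric bikes within each user type can also be checked by normalizing the counts per user type rather than over all rides. The following is a minimal sketch under that assumption; the helper name `bike_type_share_by_user` and the toy data are hypothetical and are not taken from the Divvy file.

```python
import pandas as pd

def bike_type_share_by_user(df: pd.DataFrame) -> pd.DataFrame:
    """Return the percentage of rides per bike type within each user type."""
    # normalize="index" makes every row (one user type) sum to 1
    shares = pd.crosstab(df["member_casual"], df["rideable_type"], normalize="index")
    return (shares * 100).round(1)

# Toy data for illustration only; these are not Divvy figures
toy_rides = pd.DataFrame({
    "member_casual": ["member", "member", "member",
                      "casual", "casual", "casual", "casual"],
    "rideable_type": ["electric_bike", "electric_bike", "classic_bike",
                      "electric_bike", "classic_bike", "docked_bike", "electric_bike"],
})
toy_shares = bike_type_share_by_user(toy_rides)
print(toy_shares)
```

On the real data the call would be `bike_type_share_by_user(pure_bikes_df)`. Each row then sums to 100, so the two user types can be compared directly even though they have different numbers of rides.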
grouping_by_members = pure_bikes_df.groupby("member_casual")["rideable_type"].value_counts()
grouping_by_members = (grouping_by_members/grouping_by_members.sum()) * 100
print("Percentage of bike type usage: \n\n", grouping_by_members)
Also, when analyzing which times of the day are most common for renting a bike, the result shows that the evening (12 to 20) and the morning (6 to 12) are the most popular. Together these two periods account for more than 80% of the collected data. The "Out of ranges" category covers rides whose start and end hours fall in different periods of the day, for example a ride that started at 12 and was returned at 21.
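If the "Out of ranges" category is not wanted, one alternative is to label each ride by its start hour only. This is a different definition from the one used in this notebook, shown here only as a sketch. The helper name `label_start_period` and the toy timestamps are hypothetical; the period boundaries are the same ones described above.

```python
import pandas as pd

def label_start_period(started_at: pd.Series) -> pd.Series:
    """Label each ride with the period of the day in which it started."""
    hours = pd.to_datetime(started_at).dt.hour
    # right=False gives half-open bins: [0, 6), [6, 12), [12, 20), [20, 24)
    return pd.cut(
        hours,
        bins=[0, 6, 12, 20, 24],
        labels=["Early Morning", "Morning", "Evening", "Night"],
        right=False,
    )

# Toy timestamps for illustration only
toy_starts = pd.Series([
    "2023-07-01 05:59:00",
    "2023-07-01 06:00:00",
    "2023-07-01 12:30:00",
    "2023-07-01 23:15:00",
])
toy_periods = label_start_period(toy_starts)
print(toy_periods.tolist())
```

With this definition a ride that starts at 12 and ends at 21 counts as "Evening", so every ride gets exactly one label.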
pure_bikes_df["started_at"] = pd.to_datetime(pure_bikes_df["started_at"])
pure_bikes_df["ended_at"] = pd.to_datetime(pure_bikes_df["ended_at"])
pure_bikes_df["started_hour"] = pure_bikes_df["started_at"].dt.hour
pure_bikes_df["ended_hour"] = pure_bikes_df["ended_at"].dt.hour
pure_bikes_df["started_hour"] = pure_bikes_df["started_hour"].astype(float)
pure_bikes_df["ended_hour"] = pure_bikes_df["ended_hour"].astype(float)
conditions = [
    (pure_bikes_df["started_hour"] >= 0) & (pure_bikes_df["ended_hour"] <= 6),
    (pure_bikes_df["started_hour"] >= 6) & (pure_bikes_df["ended_hour"] <= 12),
    (pure_bikes_df["started_hour"] >= 12) & (pure_bikes_df["ended_hour"] <= 20),
    (pure_bikes_df["started_hour"] >= 20) & (pure_bikes_df["ended_hour"] <= 24),
]
times_of_the_day = ["Early Morning", "Morning", "Evening", "Night"]
pure_bikes_df["Time_day"] = np.select(conditions, times_of_the_day, default="Out of ranges")
print(pure_bikes_df["Time_day"].value_counts())
Visualization of the data 📊
After the exploratory analysis I made some plots for a better understanding. First I drew a countplot of the bike types. The first two conclusions drawn from it are:
- Electric bikes are the ones users prefer, although classic bike usage is very close.
- Docked bikes are not a significant part of the service the company offers.
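A minimal sketch of such a countplot is given below. The split by user type through `hue`, the axis labels, the title, the helper name `plot_bike_type_counts` and the toy data are all assumptions made for illustration; the original plot may have been configured differently.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_bike_type_counts(df: pd.DataFrame):
    """Draw the number of rides per bike type, split by user type."""
    fig, ax = plt.subplots(figsize=(8, 4))
    # countplot counts the rows in each category, so no prior aggregation is needed
    sns.countplot(data=df, x="rideable_type", hue="member_casual", ax=ax)
    ax.set_xlabel("Bike type")
    ax.set_ylabel("Number of rides")
    ax.set_title("Rides per bike type")
    return ax

# Toy data for illustration only; these are not Divvy figures
toy_rides = pd.DataFrame({
    "member_casual": ["member", "member", "casual", "casual", "casual"],
    "rideable_type": ["electric_bike", "classic_bike",
                      "electric_bike", "classic_bike", "docked_bike"],
})
toy_ax = plot_bike_type_counts(toy_rides)
```

On the real data the call would be `plot_bike_type_counts(pure_bikes_df)`.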