Optimizing Traffic Flow in New York City: Predictive Modeling and Strategic Insights from Telemetry Data
Executive Summary
We were provided with a dataset containing telemetry data from two taxi drivers in New York, covering the period from January 2016 to July 2016. Our goal was to predict trip duration based on known pickup and dropoff locations, as well as pickup time.
By integrating weather and geographical data with the telemetry dataset, we analyzed how various conditions impact trip performance. We found that an XGBoost model was the most effective predictive approach. On average, our predictions are expected to be within 346 seconds (less than 6 minutes) of the actual trip duration. For context, most trip durations range between 393 and 1063 seconds.
Key findings include:
- Adverse Driving Conditions: Heavy precipitation and snow significantly degrade trip performance. It is advisable to regularly clear major streets, especially in Manhattan, where most of the trips occurred.
- Public Transportation Investment: During rush hour, traffic speeds drop by an average of 18%. Despite this, trip durations do not appear to be affected. We suggest this is because fares rise with trip duration, so passengers keep rush-hour taxi trips short rather than pay higher fares. Consequently, improvements in public transportation are likely to be welcomed by New York residents, as they help alleviate traffic congestion.
- Temporary Connection Losses: Temporary losses of connection can result in recorded trip durations that are longer than the actual durations. Given their unavoidable nature, we commend the taxi company for documenting these instances; the records are valuable for understanding and mitigating the effect of connection issues on trip duration data.
- Trip Durations May Include Rest Stops: We recommend collecting additional features, such as waiting_time, to account for periods when the taxi is at rest waiting for a passenger. In this dataset, it was assumed that trip durations did not include intentional rest stops.
- Dropoff Location: It may be useful to gather data on both original_dropoff and actual_dropoff locations. This would help account for situations where passengers request to be dropped off earlier due to traffic conditions.
- Traffic Volume Prediction: Predictive models for traffic volume are crucial for anticipating congestion and informing traffic management. With more computational power, an approach focused on geographical data could likely provide better insights and anticipate congestion points.
These insights can enhance the accuracy of trip duration predictions, address significant traffic challenges, and refine operational strategies for the taxi company.
Data Sources
This report draws upon four distinct data sources:
1. Training Traffic Data
This dataset contains details of 60,554 taxi trips in New York City, spanning from January 1, 2016, to July 1, 2016, and covers trips by two taxi drivers. The data includes trip durations, pickup and dropoff locations and timestamps, passenger count, and information on whether telemetry data was transmitted to company servers in real time or if there were instances of temporary connection loss.
Although there are no missing values, data cleaning is required to address inconsistencies, such as trips where the pickup and dropoff locations are identical (a quick check is sketched after the data-loading code below).
Source: DataCamp.
2. Test Traffic Data
Similar to the training dataset, this dataset consists of 47,999 records, but excludes trip duration and dropoff timestamps. It is used for model testing and validation.
Source: DataCamp.
3. Weather Data
This dataset provides daily meteorological conditions in New York City. While the problem description specifies the metric system, the data is presented in the imperial system. It includes data on precipitation, snowfall, and snow depth, with trace amounts too small to measure labeled as "T".
Source: DataCamp.
4. Geographical Data
This dataset contains geographical boundaries and information about New York City boroughs.
Source: the geodatasets Python package (nybb dataset).
%%capture
!pip install geodatasets
# Import libraries
import pandas as pd
import numpy as np
from typing import Optional
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from pyproj import Proj
from scipy.spatial import distance, ConvexHull
import folium
import geopandas as gpd
import geodatasets
from shapely.geometry import Point
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
import xgboost as xgb
# Global variables
RANDOM_STATE = 42
TEST_SIZE = 0.25
PALETTE = "husl"
# Load datasets
traffic_train: pd.DataFrame = pd.read_csv("data/train.csv")
traffic_submission: pd.DataFrame = pd.read_csv("data/test.csv")
weather_data: pd.DataFrame = pd.read_csv("data/weather.csv")
nyc_boroughs: pd.DataFrame = gpd.read_file(geodatasets.get_path("nybb")).to_crs(epsg=4326)
nyc_boroughs["centroid"] = nyc_boroughs.centroid
# Map functions
def find_boundary_coords (pickup_latitude: pd.Series, dropoff_latitude: pd.Series,
pickup_longitude: pd.Series, dropoff_longitude: pd.Series
                           ) -> list[tuple[float, float]]:
# Concat coordinates
    lat_lon_points: np.ndarray = np.transpose((pd.concat([pickup_latitude, dropoff_latitude]).values,
pd.concat([pickup_longitude, dropoff_longitude]).values)
)
# Find boundary points
hull: ConvexHull = ConvexHull(lat_lon_points)
    boundary_points: list[np.ndarray] = [lat_lon_points[i] for i in hull.vertices]
# Close the polygon by starting and ending in same point
boundary_points.append(boundary_points[0])
return [(lat, lon) for lat, lon in boundary_points]
def plot_points_in_map (city_map: folium.folium.Map,
latitude: pd.Series,
longitude: pd.Series, color: str) -> folium.folium.Map:
for lat, lon in zip(latitude, longitude):
folium.CircleMarker(
location=[lat, lon],
radius=8,
color=color,
fill=True,
fill_color=color,
fill_opacity=0.7
).add_to(city_map)
return city_map
def overlay_boroughs_in_map (df_boroughs: pd.DataFrame,
city_map: folium.folium.Map,
                             label_centroids: bool,
centroid_col: str = "centroid",
label_col: str = "name"
) -> folium.folium.Map:
for _, row in df_boroughs.iterrows():
        sim_geo: gpd.GeoSeries = gpd.GeoSeries(row["geometry"]).simplify(tolerance=0.001)
        geo_j: str = sim_geo.to_json()
geo_j = folium.GeoJson(data=geo_j,
style_function=lambda x: {
"color": "lightblue",
"weight": 2.5,
"fillColor": "lightblue",
"fillOpacity": 0.2
})
folium.Popup(row[label_col]).add_to(geo_j)
geo_j.add_to(city_map)
if label_centroids:
folium.Marker(
location=[row[centroid_col].y, row[centroid_col].x], # Reverse to [latitude, longitude]
icon=folium.DivIcon(
icon_size=(150, 36),
icon_anchor=(0, 0),
html=f'<div style="font-size: 8pt; font-weight: bold; color: darkblue;">{row[label_col]}</div>'
)
).add_to(city_map)
return city_map
def overlay_boundary_coords_in_map (city_map: folium.folium.Map,
pickup_latitude: pd.Series,
dropoff_latitude: pd.Series,
pickup_longitude: pd.Series,
dropoff_longitude: pd.Series
) -> folium.folium.Map:
    boundary_coords: list[tuple[float, float]] = find_boundary_coords(pickup_latitude,
dropoff_latitude,
pickup_longitude,
dropoff_longitude
)
folium.PolyLine(boundary_coords, color="red", weight=2.5, opacity=1).add_to(city_map)
return city_map
Taxi Route and NYC Borough Map Visualization
Due to the large number of data points, plotting all individual trips in a reasonable time frame is impractical. However, we can still effectively visualize the general area where the taxi drivers operated, highlighted in red.
In addition, the map shows the five New York City boroughs in light blue. Given the high density of the city, these areas are expected to experience higher levels of traffic congestion.
By analyzing the red boundary, we observe that pickups and dropoffs occur both within the main city and in the surrounding suburbs, indicating diverse routes beyond the urban core.
map_center: list[float] = [np.mean(traffic_train["pickup_latitude"]), np.mean(traffic_train["pickup_longitude"])]
city_map: folium.folium.Map = folium.Map(location=map_center, zoom_start=9, tiles="Cartodb Positron")
city_map = overlay_boroughs_in_map(nyc_boroughs, city_map, label_centroids=False, label_col="BoroName")
city_map = overlay_boundary_coords_in_map(city_map,
traffic_train["pickup_latitude"],
traffic_train["dropoff_latitude"],
traffic_train["pickup_longitude"],
traffic_train["dropoff_longitude"]
)
city_map
Feature Generation and Data Cleaning
In this section, we performed the following tasks:
- Rectify Data Types: Ensure all features have the appropriate data types.
- Convert Coordinates: Transform latitude and longitude coordinates into UTM (Universal Transverse Mercator) coordinates. Calculate the straight-line distance between pickup and dropoff points.
- Calculate Maximum Speed: Use the previously calculated distance to determine the maximum possible speed of the vehicle during the trip.
- Categorize Weather Conditions: Classify weather conditions into predefined categories.
- Identify Rush Hours: Determine if a trip started during peak traffic hours.
- Assign Boroughs: Assign pickup and dropoff locations to their respective New York City boroughs.
These transformations are implemented through a pipeline, making it easy to apply the same preprocessing steps to the submission dataset (a short sketch follows the pipeline definition).
Note: It is assumed that trips did not involve any waiting periods.
# Weather data cleaning
def categorize_weather_conditions (df:pd.DataFrame = weather_data,
snow_fall_col: str = "snow fall",
snow_depth_col: str = "snow depth",
precipitation_col: str = "precipitation",
value_to_replace: str = "T"
) -> pd.DataFrame:
# Categorize snow fall
snow_fall_map: dict[float, Optional[str]] = {
-1.0: "No Snow (0 in)",
0.0: "Trace Fall (<0.1 in)",
0.1: "Light Fall (0.1-1.0 in)",
1.0: "Moderate Fall (1.0-2.0 in)",
2.0: "Heavy Fall (2.0-4.0 in)",
4.0: "Very Heavy Fall (>4.0 in)",
}
# Categorize snow depth
snow_depth_map: dict[float, Optional[str]] = {
-1.0: "No Snow (0 in)",
0.0: "Trace Depth (<0.5 in)",
0.5: "Shallow Depth (0.5-2.0 in)",
2.0: "Moderate Depth (2.0-6.0 in)",
6.0: "Deep Snow (6.0-12.0 in)",
12.0: "Very Deep Snow (>12.0 in)",
}
# Categorize precipitation
    precipitation_map: dict[float, Optional[str]] = {
-1.0: "No Precipitation",
0.0: "Trace Precipitation (<0.1 in)",
0.1: "Light Precipitation (0.1-0.5 in)",
0.5: "Moderate Precipitation (0.5-1.0 in)",
1.0: "Heavy Precipitation (1.0-2.0 in)",
2.0: "Very Heavy Precipitation (>2.0 in)",
}
# Add new features
df = df.assign(
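        # Bin edges are the category-map keys plus +inf; "T" (trace) readings are
        # temporarily mapped to 0.01 so they land in the "Trace ..." bin rather
        # than "No ..." (right=True makes each bin closed on the right).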
snow_fall_category = pd.cut(df[snow_fall_col].replace(value_to_replace, 0.01).astype(float),
bins = list(snow_fall_map.keys()) + [float('inf')],
labels = snow_fall_map.values(),
right = True),
snow_depth_category = pd.cut(df[snow_depth_col].replace(value_to_replace, 0.01).astype(float),
bins = list(snow_depth_map.keys()) + [float('inf')],
labels = snow_depth_map.values(),
right = True),
precipitation_category = pd.cut(df[precipitation_col].replace(value_to_replace, 0.01).astype(float),
bins = list(precipitation_map.keys()) + [float('inf')],
labels = precipitation_map.values(),
right = True)
)
# Replace T by 0.0
df[snow_fall_col] = df[snow_fall_col].replace(value_to_replace, 0.0).astype(float)
df[snow_depth_col] = df[snow_depth_col].replace(value_to_replace, 0.0).astype(float)
df[precipitation_col] = df[precipitation_col].replace(value_to_replace, 0.0).astype(float)
return df
def flag_bad_weather (df:pd.DataFrame,
snow_fall_col:str = "snow_fall_category",
precipitation_col:str ="precipitation_category",
snow_depth_col:str = "snow_depth_category",
temperature_col: str = "maximum temperature",
                      bad_weather_words: tuple[str, ...] = ("Heavy", "Deep"),
                      good_weather_words: tuple[str, ...] = ("No",),
max_temperature: float = 90.0,
min_temperature: float = 40.0) -> pd.DataFrame:
return df.assign(
bad_driving_weather = (df[snow_fall_col].apply(lambda val: any(x in val for x in bad_weather_words))) | (
df[precipitation_col].apply(lambda val: any(x in val for x in bad_weather_words))) | (
df[snow_depth_col].apply(lambda val: any(x in val for x in bad_weather_words))),
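        # Walking weather is flagged bad when any of the snow/precipitation
        # categories lacks a good-weather word ("No ..."), or when the maximum
        # temperature falls outside the comfortable range.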
bad_walking_weather = (df[snow_fall_col].apply(lambda val: any(x not in val for x in good_weather_words))) | (
df[precipitation_col].apply(lambda val: any(x not in val for x in good_weather_words))) | (
df[snow_depth_col].apply(lambda val: any(x not in val for x in good_weather_words))) | (
df[temperature_col].apply(lambda val: val <= min_temperature or val >= max_temperature))
)
weather_data = categorize_weather_conditions()
weather_data["date"] = pd.to_datetime(weather_data["date"], format="%d-%m-%Y").dt.date
weather_data = flag_bad_weather(weather_data)
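As a quick sanity check (a hypothetical inspection step, not required by the pipeline), the resulting weather flags can be tallied:
# Hedged sketch: count the days flagged as bad driving or bad walking weather.
print(weather_data[["bad_driving_weather", "bad_walking_weather"]].sum())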
# EDA Transformer for training data
class EDATransformer(BaseEstimator, TransformerMixin):
def __init__(self,
weather_data: pd.DataFrame,
boroughs_data: pd.DataFrame,
                 rush_hours: list[tuple[int, int]] = [(7, 10), (16, 19)],
weather_date_col: str = "date",
borough_geometry_col: str = "geometry",
borough_name_col: str = "BoroName",
pickup_datetime: str = "pickup_datetime",
dropoff_datetime: str = "dropoff_datetime",
lat_col_pickup: str = "pickup_latitude",
lon_col_pickup: str = "pickup_longitude",
lat_col_dropoff: str = "dropoff_latitude",
lon_col_dropoff: str = "dropoff_longitude",
easting_col_pickup: str = "easting_pickup",
northing_col_pickup: str = "northing_pickup",
easting_col_dropoff: str = "easting_dropoff",
northing_col_dropoff: str = "northing_dropoff",
maximum_velocity: str = "maximum_velocity",
trip_duration: str = "trip_duration",
store_and_fwd_flag: str = "store_and_fwd_flag",
vendor_id: str = "vendor_id"
):
self.rush_hours: list[tuple[float]] = rush_hours
self.weather_data: pd.DataFrame = weather_data
self.weather_date_col: str = weather_date_col
self.boroughs_data: pd.DataFrame = boroughs_data
self.borough_geometry_col: str = borough_geometry_col
self.borough_name_col: str = borough_name_col
# datetime columns
self.pickup_datetime: str = pickup_datetime
self.dropoff_datetime: str = dropoff_datetime
# degree coordinates columns
self.lat_col_pickup: str = lat_col_pickup
self.lon_col_pickup: str = lon_col_pickup
self.lat_col_dropoff: str = lat_col_dropoff
self.lon_col_dropoff: str = lon_col_dropoff
# UTM coordinates columns
self.easting_col_pickup: str = easting_col_pickup
self.northing_col_pickup: str = northing_col_pickup
self.easting_col_dropoff: str = easting_col_dropoff
self.northing_col_dropoff: str = northing_col_dropoff
# velocity column
self.maximum_velocity: str = maximum_velocity
self.trip_duration: str = trip_duration
# cols to set datatype
self.store_and_fwd_flag: str = store_and_fwd_flag
self.vendor_id: str = vendor_id
def fit(self, df, y=None):
return self
def transform(self, df: pd.DataFrame):
if not isinstance(df, pd.DataFrame):
raise ValueError("Input should be a pandas DataFrame")
# Perform transformations
df = self.handle_datetimes(df=df)
df = self.convert_lat_lon_to_utm(df=df,
lat_col=self.lat_col_pickup,
lon_col=self.lon_col_pickup,
col_suffix="pickup"
)
df = self.convert_lat_lon_to_utm(df=df,
lat_col=self.lat_col_dropoff,
lon_col=self.lon_col_dropoff,
col_suffix="dropoff"
)
df = self.calculate_travelled_distance_in_m(df=df)
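        # The test/submission data lacks dropoff timestamps (and trip_duration),
        # so velocity is only estimated when dropoff information is present.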
if self.dropoff_datetime in df.columns:
df = self.estimate_maximum_velocity(df=df)
df = self.merge_weather_data(df)
df = self.set_features_datatypes(df)
df = self.transform_lat_lon_to_points(df)
df = self.assign_borough_to_pickup_and_dropoff(df)
df = self.flag_trips_in_city (df)
return df
def _calculate_if_rush_hour(self, hours_series: pd.Series,
is_weekday_series: pd.Series) -> pd.Series:
return is_weekday_series & hours_series.apply(
lambda hour: any(start <= hour <= end for start, end in self.rush_hours)
)
def handle_datetimes (self, df: pd.DataFrame) -> pd.DataFrame:
cols = [self.pickup_datetime, self.dropoff_datetime]
if self.dropoff_datetime not in df.columns:
cols = [self.pickup_datetime]
for col in cols:
df[col] = pd.to_datetime(df[col])
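            # Derive date and hour columns: "pickup_datetime" -> "pickup_date"
            # and "pickup_hour" (likewise for dropoff).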
df[col.replace("time","")] = df[col].dt.date
df[col.replace("_datetime","_hour")] = df[col].dt.hour
df["is_pickup_weekday"] = df[self.pickup_datetime].dt.weekday < 5
df["is_pickup_during_rushhour"] = self._calculate_if_rush_hour(df["pickup_hour"],
df["is_pickup_weekday"])
return df
def convert_lat_lon_to_utm (self,
df: pd.DataFrame,
lat_col: str,
lon_col:str,
col_suffix: str = "pickup") -> pd.DataFrame:
# NYC is in zone 18
utm_proj = Proj(proj="utm", zone=18, ellps="WGS84")
df[f"easting_{col_suffix}"], df[f"northing_{col_suffix}"] = zip(
*df.apply(lambda row: utm_proj(row[lon_col], row[lat_col]), axis=1)
)
return df
def calculate_travelled_distance_in_m (self, df: pd.DataFrame) -> pd.DataFrame:
df = df.assign(
distance_in_m = df.apply(
lambda row: distance.euclidean(
(row[self.easting_col_pickup], row[self.northing_col_pickup]),
(row[self.easting_col_dropoff], row[self.northing_col_dropoff])
),
axis=1
)
)
return df
def estimate_maximum_velocity (self, df: pd.DataFrame) -> pd.DataFrame:
df[self.maximum_velocity] = df["distance_in_m"]/df[self.trip_duration]
return df
def merge_weather_data (self, df: pd.DataFrame) -> pd.DataFrame:
return pd.merge(left=df,
right=self.weather_data,
left_on= "pickup_date",
right_on=self.weather_date_col)
def set_features_datatypes (self, df: pd.DataFrame) -> pd.DataFrame:
df[self.store_and_fwd_flag] = df[self.store_and_fwd_flag].replace({'Y': True, 'N': False})
df[self.vendor_id] = df[self.vendor_id].astype("object")
return df.drop(columns=[self.weather_date_col])
def transform_lat_lon_to_points(self, df:pd.DataFrame) -> pd.DataFrame:
return df.assign(
pickup_point = df.apply(
lambda row: Point(
row[self.lon_col_pickup], row[self.lat_col_pickup]
),
axis=1
),
dropoff_point = df.apply(
lambda row: Point(
row[self.lon_col_dropoff], row[self.lat_col_dropoff]
),
axis=1
),
)
    def _find_borough_of_point (self, point: Point) -> str:
for _, row in self.boroughs_data.iterrows():
if point.within(row[self.borough_geometry_col]):
return row[self.borough_name_col]
return "Not NYC center"
    def assign_borough_to_pickup_and_dropoff (self, df: pd.DataFrame) -> pd.DataFrame:
df["pickup_point_borough"] = df["pickup_point"].apply(
lambda point: self._find_borough_of_point(point)
).astype("category")
df["dropoff_point_borough"] = df["dropoff_point"].apply(
lambda point: self._find_borough_of_point(point)
).astype("category")
return df
def flag_trips_in_city (self, df: pd.DataFrame) -> pd.DataFrame:
return df.assign(
trip_in_city = (df["pickup_point_borough"] != "Not NYC center") | (df["dropoff_point_borough"] != "Not NYC center")
)
eda_pipeline: Pipeline = Pipeline([
('EDATransformation', EDATransformer(weather_data=weather_data, boroughs_data=nyc_boroughs)),
])
df_train: pd.DataFrame = eda_pipeline.fit_transform(traffic_train)
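Since every step lives in a single Pipeline, the identical preprocessing can later be reused on the submission set. A minimal sketch (the actual application happens at prediction time):
# Hedged sketch: the fitted pipeline transforms the submission data the same way;
# dropoff-dependent features are skipped automatically (see transform()).
df_submission: pd.DataFrame = eda_pipeline.transform(traffic_submission)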
Exploratory Data Analysis
In this exploratory data analysis phase, we will:
- Resolve Data Inconsistencies: Address and clean any inconsistencies in the dataset.
- Analyze Factors Affecting Trip Duration and Speed, including:
- Number of Passengers: Assess the impact of passenger count on trip duration and speed.
- Connection Loss Delays: Evaluate delays caused by temporary losses of telemetry data connection.
- Rush Hour Patterns: Investigate how trip duration and speed change during peak traffic hours (a minimal sketch follows this list).
- Adverse Driving Conditions: Examine the effects of poor driving weather on trip performance.
- Adverse Walking Conditions: Analyze the influence of bad weather that impacts pedestrians, which may increase the number of vehicles in the streets.
- Individual Weather Conditions: Study the impact of specific weather factors (e.g., rain, snow) on trip outcomes.
- Pickup Time Impact: Explore how the time of day (pickup hour) influences trip performance.
- Distribution of Boroughs: Visualize the number of trips that involved pickups and/or dropoffs in the five New York City boroughs.
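As a first illustration of the rush-hour analysis, here is a minimal sketch using the flags and features created by the pipeline (the full analysis follows):
# Hedged sketch: compare mean estimated speed and trip duration for trips
# starting inside vs. outside weekday rush hours.
print(df_train.groupby("is_pickup_during_rushhour")[["maximum_velocity", "trip_duration"]].mean())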