Predicting NYC Traffic Patterns with Ride Data

Can you find a better way to manage traffic?

💪 Competition challenge

In this challenge, we will focus on the following key tasks:

Exploratory Data Analysis of Traffic Flow
Impact Analysis of Weather Conditions on Traffic
Development of a Traffic Volume Prediction Model

Create features to capture temporal dependencies and weather conditions.

Build and evaluate predictive models to forecast traffic volumes.

Compare the performance of different machine learning algorithms.

Strategic Recommendations for Traffic Management:

Provide actionable insights based on the analysis and predictive model. Recommend strategies for optimizing traffic flow in New York City.

Introduction
Data Loading and Inspection
Data Preprocessing
Exploratory Data Analysis
Impact Analysis of Weather Conditions on Traffic
Predictive Modeling
Strategic Recommendations for Traffic Management
Conclusion

1. Introduction

Background

Traffic congestion is a persistent problem in urban centers. Managing traffic grows more complex every year, and the resulting congestion increases fuel consumption and emissions. In New York City (NYC), the problem is compounded by an exceptionally complex road network, so managing traffic flow efficiently in such a bustling environment is crucial.

This challenge is an opportunity to help relieve traffic congestion in NYC. We analyze historical traffic and weather data with two goals: to identify the factors that contribute to congestion, and to develop a predictive model that forecasts traffic volumes based on those factors.

Objectives

The primary aim of this project is to analyze traffic congestion patterns in New York City and develop predictive models to forecast traffic volumes. By understanding the key factors contributing to traffic congestion, this study aims to provide actionable insights that can be used to optimize traffic flow and improve urban mobility.

The project will focus on the following specific objectives:

Exploratory Data Analysis of Traffic Flow: To explore and visualize traffic patterns over time and space, identifying trends, peaks, and anomalies.
Impact Analysis of Weather Conditions on Traffic: To understand how various weather conditions, such as temperature, precipitation, and snowfall, affect traffic volumes.
Feature Engineering: To create features that capture temporal dependencies and weather conditions, which will enhance the predictive capabilities of the models.
Predictive Modeling: To build and evaluate predictive models for forecasting traffic volumes using machine learning algorithms. This will involve comparing the performance of different models to determine the most accurate and reliable approach.
Strategic Recommendations for Traffic Management: To provide actionable insights based on the analysis and predictive models. Recommendations will focus on strategies for optimizing traffic flow in New York City, considering factors like peak congestion times and weather-related delays.
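To make the feature engineering objective more concrete, here is a minimal sketch of how temporal features could be derived from pickup_datetime. The function name add_temporal_features and the output column names (hour, day_of_week, is_weekend, hour_sin, hour_cos) are illustrative assumptions, not names defined by the competition data.

```python
import numpy as np
import pandas as pd


def add_temporal_features(df: pd.DataFrame, time_col: str = "pickup_datetime") -> pd.DataFrame:
    """Return a copy of df with simple calendar and cyclical time features."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])

    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)

    # Cyclical encoding so that 23:00 and 00:00 end up close together
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    return out


# Toy rows for illustration only (not taken from train.csv)
sample = pd.DataFrame({"pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"]})
features = add_temporal_features(sample)
```

The sine and cosine pair is one common way to let a model see that late-night and early-morning hours are adjacent; tree-based models may do just as well with the raw hour.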

Datasets

The analysis will be based on two primary datasets:

Traffic Data (train.csv): This dataset contains historical taxi trip records in New York City, including information such as trip start and end times, pickup and dropoff locations, number of passengers, and trip duration. It provides a detailed view of traffic flow patterns across the city.

Key Features: pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, trip_duration.

Weather Data (weather.csv): This dataset provides historical weather information for New York City, including daily records of temperature, precipitation, snowfall, and snow depth. Integrating weather data with traffic data will help analyze the impact of weather conditions on traffic congestion.

Key Features: date, maximum temperature, minimum temperature, average temperature, precipitation, snow fall, snow depth.

By combining these datasets, the project will perform a comprehensive analysis of traffic patterns and build predictive models that account for both temporal and weather-related factors.
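One way the two datasets could be combined is a left join of each trip onto the weather record for its pickup date. The sketch below uses tiny stand-in frames; the date string format, the pickup_date helper column, and the sample values are assumptions for illustration, so the parsing step may need adjusting for the real weather.csv.

```python
import pandas as pd

# Tiny stand-in frames; the real data would come from train.csv and weather.csv
trips = pd.DataFrame({
    "id": ["a", "b", "c"],
    "pickup_datetime": ["2016-01-23 08:15:00", "2016-01-23 18:40:00", "2016-01-24 09:05:00"],
})
weather = pd.DataFrame({
    "date": ["2016-01-23", "2016-01-24"],  # assumed format; adjust parsing for the real file
    "precipitation": [2.3, 0.0],
})

# Reduce each pickup timestamp to midnight of its day so it can match the daily weather rows
trips["pickup_datetime"] = pd.to_datetime(trips["pickup_datetime"])
trips["pickup_date"] = trips["pickup_datetime"].dt.normalize()
weather["date"] = pd.to_datetime(weather["date"])

# Left join keeps every trip; validate raises if a date appears twice in the weather table
merged = trips.merge(
    weather,
    left_on="pickup_date",
    right_on="date",
    how="left",
    validate="many_to_one",
)
```

A left join is used here so that trips without a matching weather record show up as missing values rather than silently disappearing.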

💾 The data

The data for this analysis is stored in three tables:

Traffic Data (train and test tables):

train table: Contains detailed information on individual taxi trips, including:

the start and end times
the number of passengers
the GPS coordinates of pickup and dropoff locations.

The target variable is the trip duration.

test table: This table is similar to the train table but does not include the trip duration. It will be used to test the predictive model.

Weather Data (weather table):

Historical weather data corresponding to the dates in the traffic data, including:

temperature
precipitation
snowfall
snow depth

The complete metadata for the tables follows. The data for this competition is stored in three tables: "train", "test", and "weather".

train

This table contains training data with features and target variable:

id: Unique identifier for each trip.
vendor_id: Identifier for the taxi vendor.
pickup_datetime: Date and time when the trip started.
dropoff_datetime: Date and time when the trip ended.
passenger_count: Number of passengers in the taxi.
pickup_longitude: Longitude of the pickup location.
pickup_latitude: Latitude of the pickup location.
dropoff_longitude: Longitude of the dropoff location.
dropoff_latitude: Latitude of the dropoff location.
store_and_fwd_flag: Indicates if the trip data was stored and forwarded.
trip_duration: Duration of the trip in seconds.
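Because trip_duration is described as the trip length in seconds, it should agree with the difference between dropoff_datetime and pickup_datetime. The following sketch shows one way to check that on toy rows; the duration_mismatch column name and the one-second tolerance are assumptions, and the third row is deliberately inconsistent to show what a flagged record looks like.

```python
import pandas as pd

# Toy rows for illustration; the third row has a deliberately wrong trip_duration
trips = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38", "2016-01-19 12:10:48"],
    "trip_duration": [455, 663, 3000],
})

pickup = pd.to_datetime(trips["pickup_datetime"])
dropoff = pd.to_datetime(trips["dropoff_datetime"])

# Elapsed time between pickup and dropoff, in seconds
computed_seconds = (dropoff - pickup).dt.total_seconds()

# Flag rows where the recorded duration differs from the computed one by more than 1 second
trips["duration_mismatch"] = (computed_seconds - trips["trip_duration"]).abs() > 1
```

Rows flagged this way are candidates for closer inspection rather than automatic removal.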

test

This table is very similar to the train table, but it does not contain the target variable.

weather

This table contains historical weather data for New York City.

date: Date of the weather record (should match the pickup and dropoff dates in the traffic data).
maximum temperature: Maximum temperature of the day in Celsius.
minimum temperature: Minimum temperature of the day in Celsius.
average temperature: Average temperature of the day in Celsius.
precipitation: Total precipitation of the day in millimeters.
snow fall: Snowfall of the day in millimeters.
snow depth: Snow depth of the day in millimeters.

A "T" in the snow depth field stands for a "trace" amount of snow: snowfall was observed, but the amount was too small to be measured accurately (less than 0.1 inches).
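Since a "T" makes the column non-numeric, it has to be handled before the weather data can be used in a model. The sketch below shows one possible approach: replace the trace marker with a small numeric value and keep a boolean flag so the information is not lost. The helper name trace_to_numeric, the default of 0.0, and the snow_depth_is_trace column are assumptions, and the sample values are made up.

```python
import pandas as pd


def trace_to_numeric(series: pd.Series, trace_value: float = 0.0):
    """Convert a column that mixes numbers and 'T' (trace) into floats.

    Returns the numeric series and a boolean mask marking the trace entries.
    trace_value is an assumption; a small positive number is another option.
    """
    text = series.astype(str).str.strip()
    is_trace = text == "T"
    # Swap the trace marker for the chosen value, then parse everything as numbers
    numeric = pd.to_numeric(text.mask(is_trace, str(trace_value)), errors="coerce")
    return numeric, is_trace


# Made-up sample values for illustration
weather = pd.DataFrame({"snow depth": ["0", "T", "2.5"]})
values, trace_mask = trace_to_numeric(weather["snow depth"])
weather["snow depth"] = values
weather["snow_depth_is_trace"] = trace_mask
```

The same helper could be applied to any other weather column in which the trace marker turns up.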

In the dataset, the store_and_fwd_flag field indicates whether a taxi trip record was held in the vehicle's memory before being sent to the server because of a temporary loss of connection. The value "N" stands for "No," meaning that the trip data was not stored and was sent directly to the server in real time. Conversely, the value "Y" stands for "Yes," indicating that the trip data was stored temporarily before being forwarded to the server.
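For modeling, the "N"/"Y" flag can be turned into a numeric column. A minimal sketch, assuming the only valid values are "N" and "Y"; the stored_and_forwarded column name is hypothetical and the rows are toy data.

```python
import pandas as pd

# Toy rows for illustration
trips = pd.DataFrame({"store_and_fwd_flag": ["N", "Y", "N"]})

# Map the flag to 0/1; any value other than "N" or "Y" becomes NaN
trips["stored_and_forwarded"] = trips["store_and_fwd_flag"].map({"N": 0, "Y": 1})

# Count unexpected flag values so they can be inspected before modeling
unexpected = int(trips["stored_and_forwarded"].isna().sum())
```

Using an explicit mapping instead of a generic encoder makes unexpected values visible as missing entries.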

2. Data Loading and Inspection

# core libraries
import pandas as pd
import numpy as np

# Import additional libraries for visualization
import matplotlib.dates as mdates
from matplotlib import pyplot as plt
import seaborn as sns

# Set the style of plots
sns.set(style='whitegrid')

# helpers
import helpers as h
import holidays
us_holidays = holidays.US()

# sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# statistics libraries
from scipy import stats
import statsmodels.api as sm

# folium
import folium

# warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)

%load_ext autoreload
%autoreload 2

Traffic

Training Set

# Load datasets
traffic_data = pd.read_csv('data/train.csv')

# Display the first few rows of the traffic data
traffic_data.head(5)

traffic_data.info()

traffic_data.isnull().sum()
