Can you find a better way to manage traffic?

πŸ“– Background

Traffic congestion is a persistent issue in urban centers, and the complexity of managing it grows every year, driving up fuel consumption and emissions. New York City (NYC) is a particularly hard case: its road network is exceptionally complex, and efficiently managing traffic flow in such a bustling environment is crucial.

This challenge gives you an opportunity to help relieve traffic congestion in NYC. You will analyze historical traffic and weather data, use data analytics to identify the factors that contribute to congestion, and develop a predictive model that forecasts traffic volumes based on those factors.

πŸ’Ύ The data

The data for this analysis is based on three key datasets:

Traffic Data (train and test tables):

train table: Contains detailed information on individual taxi trips, including:

  • the start and end times
  • the number of passengers
  • the GPS coordinates of pickup and dropoff locations.

The target variable is the trip duration.

test table: Similar to the train table, but without the trip duration; this data will be used to evaluate the predictive model.

Weather Data (weather table):

Historical weather data corresponding to the dates in the traffic data, including:

  • temperature
  • precipitation
  • snowfall
  • snow depth

The complete metadata for the tables is as follows. The data for this competition is stored in three tables: "train", "test", and "weather".

  • train

This table contains training data with features and target variable:

id: Unique identifier for each trip.

vendor_id: Identifier for the taxi vendor.

pickup_datetime: Date and time when the trip started.

dropoff_datetime: Date and time when the trip ended.

passenger_count: Number of passengers in the taxi.

pickup_longitude: Longitude of the pickup location.

pickup_latitude: Latitude of the pickup location.

dropoff_longitude: Longitude of the dropoff location.

dropoff_latitude: Latitude of the dropoff location.

store_and_fwd_flag: Indicates if the trip data was stored and forwarded.

trip_duration: Duration of the trip in seconds.

  • test

This table is very similar to the train table, but it does not contain the target variable.

  • weather

This table contains historical weather data for New York City.

date: Date of the weather record (should match the pickup and dropoff dates in the traffic data).

maximum temperature: Maximum temperature of the day in Celsius.

minimum temperature: Minimum temperature of the day in Celsius.

average temperature: Average temperature of the day in Celsius.

precipitation: Total precipitation of the day in millimeters.

snow fall: Snowfall of the day in millimeters.

snow depth: Snow depth of the day in millimeters.

A "T" in the snow depth field stands for a "trace" amount of snow. This means that snowfall was observed, but the amount was too small to be measured accurately, less than 0.1 inches.

In the dataset, the store_and_fwd_flag field indicates whether a taxi trip record was stored in the vehicle's memory before sending to the server due to a temporary loss of connection. The value "N" stands for "No," meaning that the trip data was not stored and was sent directly to the server in real-time. Conversely, the value "Y" stands for "Yes," indicating that the trip data was stored temporarily before being forwarded to the server.

import polars as pl

# Load datasets
traffic_data = pl.read_csv('data/train.csv')

# Convert date columns to datetime
traffic_data = traffic_data.with_columns([
    pl.col('pickup_datetime').str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias('pickup_datetime'),
    pl.col('dropoff_datetime').str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias('dropoff_datetime')
])

# Display the first few rows of the traffic data
traffic_data.head(5)

# Load and display the weather data
weather_data = pl.read_csv('data/weather.csv')
weather_data

πŸ’ͺ Competition challenge

In this challenge, we will focus on the following key tasks:

  • Exploratory Data Analysis of Traffic Flow
  • Impact Analysis of Weather Conditions on Traffic
  • Development of a Traffic Volume Prediction Model

Create features to capture temporal dependencies and weather conditions.

Build and evaluate predictive models to forecast traffic volumes.

Compare the performance of different machine learning algorithms.
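As a hedged sketch of what such a comparison might look like, the snippet below fits two candidate regressors on synthetic stand-in features and compares held-out RMSE. The features, target, and choice of scikit-learn estimators are all assumptions for illustration, not the competition's prescribed approach:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Synthetic stand-ins for engineered features (e.g. hour of day, precipitation, distance)
X = rng.uniform(0, 1, size=(500, 3))
# Synthetic trip duration in seconds: a noisy function of the features
y = 300 + 600 * X[:, 2] + 100 * X[:, 1] + rng.normal(0, 30, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare two candidate algorithms on held-out root-mean-squared error
results = {}
for name, model in [
    ("linear_regression", LinearRegression()),
    ("random_forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = mean_squared_error(y_test, preds) ** 0.5

print(results)
```

The same loop extends naturally to other estimators; the point is to evaluate every candidate on the same held-out split with the same metric.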

Strategic Recommendations for Traffic Management:

Provide actionable insights based on the analysis and predictive model. Recommend strategies for optimizing traffic flow in New York City.

πŸ§‘β€βš–οΈ Judging criteria

This competition is intended to help you understand how competitions work. It will not be judged.

βœ… Checklist before publishing into the competition

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
  • Make sure the workbook reads well and explains how you found your insights.
  • Try to include an executive summary of your recommendations at the beginning.
  • Check that all the cells run without error.

βŒ›οΈ Time is ticking. Good luck!

Exploratory Data Analysis

To begin, let us analyze our data and answer the following questions:

  • How does trip duration vary by which borough of New York City the ride took place in?
  • How does the distance travelled vary by borough and by local weather conditions?
  • Which borough of New York City is the most traffic dense and why?

First, let us import some libraries we will need, and then take a high level glance at our data.

# import libraries
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
traffic_data.schema
weather_data.schema
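The tables contain only GPS coordinates, so the distance travelled has to be approximated from them. A common sketch is the straight-line haversine distance (a lower bound on actual road distance; the function below is an illustrative helper, not part of the dataset):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Example: lower Manhattan to Times Square, roughly 5-6 km as the crow flies
d = haversine_km(40.7128, -74.0060, 40.7580, -73.9855)
print(round(d, 2))
```

Because it uses NumPy operations, the same function also works element-wise on whole coordinate columns.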

To stay organized, we will train our models and do our analysis on preprocessed copies of each dataset.

traffic_df = traffic_data.clone()
weather_df = weather_data.clone()

Now, let us fix datatypes in weather_df, create a date column in traffic_df so it can be joined to weather_df, handle all null values, and then create some new features in traffic_df for our model and our analysis. To do this, we will first need a few helper functions.
