Can you find a better way to manage traffic?
Background
Traffic congestion is a persistent problem for urban centers, and the complexity of managing it grows every year, driving up fuel consumption and emissions. New York City (NYC) is especially challenging because of its exceptionally complex road network, so efficiently managing traffic flow in such a bustling environment is crucial.
This challenge gives you an opportunity to help relieve traffic congestion in NYC. We analyze historical traffic and weather data, using data analytics to identify the factors that contribute to congestion and to develop a predictive model that forecasts traffic volumes based on those factors.
The data
The data for this analysis is based on three key datasets:
Traffic Data (train and test tables):
train table: Contains detailed information on individual taxi trips, including:
- the start and end times
- the number of passengers
- the GPS coordinates of pickup and dropoff locations.
The target variable is the trip duration.
test table: Similar to the train table, but without the trip duration; this data is used to evaluate the predictive model.
Weather Data (weather table):
Historical weather data corresponding to the dates in the traffic data, including:
- temperature
- precipitation
- snowfall
- snow depth
The complete metadata for the tables is as follows. The data for this competition is stored in three tables: "train", "test", and "weather".
- train
This table contains training data with features and target variable:
id: Unique identifier for each trip.
vendor_id: Identifier for the taxi vendor.
pickup_datetime: Date and time when the trip started.
dropoff_datetime: Date and time when the trip ended.
passenger_count: Number of passengers in the taxi.
pickup_longitude: Longitude of the pickup location.
pickup_latitude: Latitude of the pickup location.
dropoff_longitude: Longitude of the dropoff location.
dropoff_latitude: Latitude of the dropoff location.
store_and_fwd_flag: Indicates if the trip data was stored and forwarded.
trip_duration: Duration of the trip in seconds.
- test
This table is very similar to the train table but in the test data there is no target variable.
- weather
This table contains historical weather data for New York City.
date: Date of the weather record (should match the pickup and dropoff dates in the traffic data).
maximum temperature: Maximum temperature of the day in Celsius.
minimum temperature: Minimum temperature of the day in Celsius.
average temperature: Average temperature of the day in Celsius.
precipitation: Total precipitation of the day in millimeters.
snow fall: Snowfall of the day in millimeters.
snow depth: Snow depth of the day in millimeters.
A "T" in the snow depth field stands for a "trace" amount of snow. This means that snowfall was observed, but the amount was too small to be measured accurately (less than 0.1 inches).
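Because of these "T" entries, the snow columns load as strings rather than numbers. A minimal sketch of one way to handle them, treating a trace as effectively zero (illustrated on a tiny inline frame; the column name follows the metadata above and may need adjusting to the actual CSV headers):

```python
import pandas as pd

# Tiny inline stand-in for the weather table; a trace ("T") is below the
# measurable threshold, so we map it to 0.0 before converting to numeric.
weather = pd.DataFrame({
    'date': ['1-1-2016', '2-1-2016', '3-1-2016'],
    'snow depth': ['0', 'T', '2.5'],
})
weather['snow depth'] = pd.to_numeric(
    weather['snow depth'].replace('T', '0.0'), errors='coerce'
)
```

The same replacement applies to the precipitation and snowfall columns if they also contain "T" values.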
In the dataset, the store_and_fwd_flag field indicates whether a taxi trip record was stored in the vehicle's memory before sending to the server due to a temporary loss of connection. The value "N" stands for "No," meaning that the trip data was not stored and was sent directly to the server in real-time. Conversely, the value "Y" stands for "Yes," indicating that the trip data was stored temporarily before being forwarded to the server.
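For modeling, this categorical flag can be encoded as 0/1. A minimal sketch, assuming the column holds only the "N" and "Y" values described above:

```python
import pandas as pd

# Tiny inline stand-in for the train table: encode store_and_fwd_flag
# as 0 ("N", sent in real time) / 1 ("Y", stored and forwarded).
trips = pd.DataFrame({'store_and_fwd_flag': ['N', 'Y', 'N']})
trips['store_and_fwd_flag'] = trips['store_and_fwd_flag'].map({'N': 0, 'Y': 1})
```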
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load datasets
traffic_data = pd.read_csv('data/train.csv')
# Convert date columns to datetime
traffic_data['pickup_datetime'] = pd.to_datetime(traffic_data['pickup_datetime'])
traffic_data['dropoff_datetime'] = pd.to_datetime(traffic_data['dropoff_datetime'])
# Display the first few rows of the traffic data
traffic_data.head(5)
weather_data = pd.read_csv('data/weather.csv')
# Display the weather data
weather_data

Competition challenge
In this challenge, we will focus on the following key tasks:
- Exploratory Data Analysis of Traffic Flow
- Impact Analysis of Weather Conditions on Traffic
- Development of a Traffic Volume Prediction Model:
  - Create features to capture temporal dependencies and weather conditions.
  - Build and evaluate predictive models to forecast traffic volumes.
  - Compare the performance of different machine learning algorithms.
- Strategic Recommendations for Traffic Management:
  - Provide actionable insights based on the analysis and the predictive model.
  - Recommend strategies for optimizing traffic flow in New York City.
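The weather impact task above requires joining the weather table to the trips by calendar date. A minimal sketch on tiny inline frames (with the real data, the weather `date` column must first be parsed to datetime so the keys match):

```python
import pandas as pd

# Tiny inline stand-ins for the trips and weather tables.
trips = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2016-01-01 08:00', '2016-01-02 09:30']),
    'trip_duration': [600, 900],
})
weather = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-01-02']),
    'precipitation': [0.0, 5.1],
})
# Truncate each pickup timestamp to midnight, then left-join on the date.
trips['date'] = trips['pickup_datetime'].dt.normalize()
merged = trips.merge(weather, on='date', how='left')
```

A left join keeps every trip even on days with a missing weather record.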
Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights.
- Try to include an executive summary of your recommendations at the beginning.
- Check that all the cells run without error
Time is ticking. Good luck!
EXECUTIVE SUMMARY
The map distribution clearly shows that most pickups and dropoffs occur within downtown and its immediate periphery, with only occasional trips to and from the suburbs.
There is a notable correlation between distance and speed: trips out to the suburbs gain speed once clear of the congested downtown area.
Trip frequency varies conspicuously only with the hour of the day (correlation r = 0.81); there is little change across weekdays or from month to month.
Snowfall and snow depth do not affect traffic unless the snow is unusually heavy.
Temperature and precipitation have no effect on traffic.
The main cause of traffic congestion in downtown New York is that a large majority of trips carry a single passenger over a short distance within the congested downtown area; very few trips have 3 or more passengers.
Thus, one possible solution to the congestion is to encourage short-distance shared transportation: special taxi cabs that operate between fixed downtown locations and depart only once at least 3 passengers are aboard.
=====================
PRELIMINARY SURVEY
- Info
- Missing values
- Distribution of trip duration
- Hourly, weekly, and monthly variation in trip numbers
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import skew
#print info for traffic data
print('\n Information on traffic_data:')
print(traffic_data.info())
print('\nmissing values:')
print(traffic_data.isna().sum())
#extract weekday, hour and month to examine traffic volume by hour of day, weekday and month
traffic_data['weekday_dropoff'] = traffic_data['dropoff_datetime'].dt.weekday
#extract pickup and dropoff hours
traffic_data['hour_pickup'] = traffic_data['pickup_datetime'].dt.hour
traffic_data['hour_dropoff'] = traffic_data['dropoff_datetime'].dt.hour
traffic_data['month_pickup']=traffic_data['pickup_datetime'].dt.month
print('\nUnique months:')
print(traffic_data["month_pickup"].unique())
#generate a histogram of trip duration in minutes
#to see if the distribution follows a normal distribution
fig, ax = plt.subplots(2, 1) # Create a figure with 2 subplots (2 rows, 1 column)
fig.set_size_inches(5, 7, forward=True)
ax[0].hist(traffic_data['trip_duration']/60, bins=1000)
ax[0].set_xlim(0, 120)
fig.suptitle('Trip duration')
ax[0].set_xlabel('Minutes')
ax[0].set_ylabel('Frequency')
#find mean, median, skew and standard deviation of trip duration (in minutes)
artime = traffic_data['trip_duration'] / 60
trip_mean = np.mean(artime)
trip_median = np.median(artime)
trip_skew = skew(artime)
trip_stdv = np.std(artime)
print('\nStatistics of trip duration (minutes):')
print(f'mean {trip_mean}')
print(f'median {trip_median}')
print(f'skew {trip_skew}')
print(f'stdev {trip_stdv}')
#Generate a histogram of the logarithm of trip duration to check if it follows a log-normal distribution
ax[1].hist(np.log(traffic_data['trip_duration']/60), bins=100)
#ax[1].set_title('log trip duration')
ax[1].set_xlabel('log trip duration')
ax[1].set_ylabel('Frequency')
plt.show()
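The log-normal impression from the second histogram can be quantified: if durations are roughly log-normal, the skew of log(duration) should be near zero while the raw skew is strongly positive. A minimal sketch on synthetic log-normal data (the parameters are illustrative assumptions, not fitted to the taxi data):

```python
import numpy as np
from scipy.stats import skew

# Synthetic log-normal "durations" in minutes: log(duration) ~ Normal(2.7, 0.8).
rng = np.random.default_rng(0)
durations = np.exp(rng.normal(loc=2.7, scale=0.8, size=10_000))

raw_skew = skew(durations)          # strongly right-skewed
log_skew = skew(np.log(durations))  # close to zero if log-normal holds
```

Applying the same two-line check to `traffic_data['trip_duration']` gives a numerical companion to the histograms.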
#calculate and plot weekday variation of drop off as an indicator of traffic density
weekday_freq = traffic_data['weekday_dropoff'].value_counts().reset_index()
#rename columns
weekday_freq.columns = ['weekday', 'frequency'] # Rename columns for clarity
#print(weekday_freq['frequency'].sum())
#bar plot
plt.bar(weekday_freq['weekday'], weekday_freq['frequency'])
plt.xlabel('Weekday')
plt.ylabel('Frequency')
plt.title('Frequency of Dropoffs by Weekday')
plt.show()
#Calculate and plot hourly rate of pickups to see daylong traffic variation
day_hour = traffic_data['hour_pickup'].value_counts().reset_index()
day_hour.columns = ['day_hour', 'frequency']
plt.bar(day_hour['day_hour'], day_hour['frequency'])
plt.title('Pickup frequency per hour')
plt.xlabel('Hours of day')
plt.ylabel('Frequency')
plt.show()
#print(day_hour['frequency'].sum())
sns.regplot(data=day_hour,x='day_hour',y='frequency')
plt.title('Traffic density vs. hour of day')
plt.ylabel('Frequency')
plt.show()
rcor=day_hour['day_hour'].corr(day_hour['frequency'])
print('Correlation coefficient:',round(rcor,2))
#print(traffic_data['vendor_id'].value_counts())
month_freq=traffic_data['month_pickup'].value_counts().reset_index()
month_freq.columns=['month_pickup','pickups']
plt.bar(month_freq['month_pickup'],month_freq['pickups'])
plt.title('Monthly variation of pickup frequency')
plt.xlabel('Month (Jan-Jun)')
plt.ylabel('Frequency')
plt.show()
Preliminary survey - results: Trip duration is highly skewed, resembling a log-normal distribution. The average trip duration is 15 minutes; most trips last less than half an hour, with sparse long-duration trips. This suggests most trips are confined to downtown, with few trips to the suburbs. Trip frequency varies conspicuously only with the hour of the day (correlation r = 0.81); there is little change across weekdays or from month to month.
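Given the heavy right tail noted above, extreme durations can distort the later regressions. A minimal sketch of outlier filtering on a tiny inline frame; the 1-minute and 3-hour cutoffs are assumptions chosen for illustration, not values from the data card:

```python
import pandas as pd

# Tiny inline stand-in: durations in seconds, including one implausibly
# short and one implausibly long trip.
trips = pd.DataFrame({'trip_duration': [30, 600, 900, 20_000]})

# Keep trips between 1 minute and 3 hours (assumed plausibility bounds).
mask = trips['trip_duration'].between(60, 3 * 3600)
filtered = trips[mask]
```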
==================
PICKUP AND DROPOFF
Map distribution of pickup and dropoff locations.
from matplotlib.ticker import PercentFormatter
from matplotlib import colors
import matplotlib.pyplot as plt
#Get pickup and dropoff latitude and longitude
x = traffic_data['pickup_longitude']
y = traffic_data['pickup_latitude']
xd = traffic_data['dropoff_longitude']
yd = traffic_data['dropoff_latitude']
#Generate a 2D plot for pickup locations
fig,ax=plt.subplots()
ax.hist2d(x, y,
bins = 100,
norm = colors.LogNorm()
)
ax.set_title('Frequency of pickup locations', fontweight ="bold")
ax.set_xlim(-74.4,-73.2)
ax.set_ylim(40.5,41.0)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()
#Generate a 2D plot for dropoff locations
fig,ax=plt.subplots()
ax.hist2d(xd, yd,
bins = 100,
norm = colors.LogNorm()
)
ax.set_title('Frequency of dropoff locations', fontweight ="bold")
ax.set_xlim(-74.4,-73.2)
ax.set_ylim(40.5,41.0)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()
Pickup and dropoff locations - results:
The map distribution clearly shows that most pickups and dropoffs occur within downtown and its immediate periphery, with only occasional trips to and from the suburbs.
===================
DISTANCE AND SPEED
Plot distance and speed distributions and regression.
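The distance and speed used in this section can be derived from the GPS coordinates with the haversine (great-circle) formula. A minimal sketch, using an Earth radius of 6371 km; the coordinates and 50-minute duration in the example are illustrative, not taken from the dataset:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Example: roughly Times Square to JFK, then average speed for a 50-minute trip.
dist = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
speed_kmh = dist / (50 / 60)
```

Because the function is written with NumPy operations, it also works element-wise on the four coordinate columns of `traffic_data`, yielding a distance per trip in one vectorized call.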