Traffic data fluctuates constantly and is strongly affected by time of day, weather, and holidays. Predicting it can be challenging, but this task will help sharpen your time-series skills. With deep learning, you can exploit abstract patterns in the data that help boost predictive accuracy.
Your task is to build a system that predicts traffic volume, the number of vehicles passing a specific point at a specific time. Knowing this ahead of time can help reduce road congestion, support new designs for roads or intersections, improve safety, and more! Or you can use it to plan your commute and avoid traffic!
The dataset provided contains the hourly traffic volume on an interstate highway in Minnesota, USA. It also includes weather features and holidays, which often impact traffic volume.
Time to predict some traffic!
The data:
The dataset is hosted by the UCI Machine Learning Repository. The target variable is traffic_volume. The dataset contains the following columns and has already been normalized and saved into training and test sets:
train_scaled.csv, test_scaled.csv
| Column | Type | Description |
|---|---|---|
| temp | Numeric | Average temperature in kelvin |
| rain_1h | Numeric | Amount of rain (mm) that occurred in the hour |
| snow_1h | Numeric | Amount of snow (mm) that occurred in the hour |
| clouds_all | Numeric | Percentage of cloud cover |
| date_time | DateTime | Hour the data was collected, in local CST |
| holiday_ (11 columns) | Categorical | US national holidays plus a regional holiday, the Minnesota State Fair |
| weather_main_ (11 columns) | Categorical | Short textual description of the current weather |
| weather_description_ (35 columns) | Categorical | Longer textual description of the current weather |
| traffic_volume | Numeric | Hourly I-94 ATR 301 reported westbound traffic volume |
| hour_of_day | Numeric | The hour of the day |
| day_of_week | Numeric | The day of the week (0 = Monday, 6 = Sunday) |
| day_of_month | Numeric | The day of the month |
| month | Numeric | The number of the month |
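As a quick sanity check, you can load the two provided files and confirm their shape and columns before modeling. This is a minimal, optional sketch, separate from the pipeline below; it assumes train_scaled.csv and test_scaled.csv sit in the working directory.
import pandas as pd
# Load the pre-scaled train/test splits provided with the task
train_scaled = pd.read_csv('train_scaled.csv')
test_scaled = pd.read_csv('test_scaled.csv')
print(train_scaled.shape, test_scaled.shape)  # roughly (n_train, 65) and (n_test, 65) after the column expansion
print([c for c in train_scaled.columns if c.startswith('holiday_')][:5])  # a few of the one-hot holiday columns
assert 'traffic_volume' in train_scaled.columns  # the target column must be present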
Approach:
- Prepare the data
  - Dataset: Metro Interstate Traffic Volume (https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume)
  - Fill in missing data
  - Expand the "date_time" column into related columns: hour of day, day of week, day of month, etc.
  - One-hot encode categorical data (weather, holidays)
  - Apply a MinMax scaler and format to float
  - The end result expands the dataset from 8 to 65 columns
- Apply a time shift on the target column
  - The target column is shifted forward in time relative to the feature columns (see the sketch after this list)
- Split the dataset into train/test
  - The dataset is split 80% train / 20% test
- Train a Random Forest Regressor model
  - The Random Forest Regressor establishes an accuracy baseline
  - It also gives the relative importance of each feature
- Slice train/test into tensor sequences
  - The train and test datasets are sliced into tensor sequences: X_train, y_train / X_test, y_test
- Train LSTM and GRU models (CPU)
  - Long Short-Term Memory (LSTM)
  - Gated Recurrent Unit (GRU)
- Compare LSTM vs GRU vs baseline: accuracy and compute time
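To make the shift and slicing steps concrete before the full implementation, here is a toy sketch of the same pattern; the 2-row shift and 3-row window are made-up values for readability (the actual run below uses time_shift=7 and sequence_length=21).
import numpy as np
import pandas as pd
# Tiny frame standing in for the hourly data: one feature plus the target
toy = pd.DataFrame({'temp': np.arange(10), 'traffic_volume': np.arange(10) * 100})
# Shift the target by 2 rows and drop the rows left with NaN (same pattern as shiftData below)
toy['shifted_target'] = toy['traffic_volume'].shift(2)
toy = toy.dropna().drop(columns=['traffic_volume']).rename(columns={'shifted_target': 'traffic_volume'})
# Slice into overlapping 3-row windows (same pattern as slice_to_numpy_array_sequences below)
seq_len = 3
X = np.array([toy[['temp']].values[i:i + seq_len] for i in range(len(toy) - seq_len)])
y = np.array([toy['traffic_volume'].values[i:i + seq_len] for i in range(len(toy) - seq_len)])
print(X.shape, y.shape)  # (5, 3, 1) and (5, 3)
The full implementation follows.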
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
import math
import matplotlib.pyplot as plt
from tqdm import tqdm
from torch.utils.data import TensorDataset, DataLoader, Dataset
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
# Dataset: Metro Interstate Traffic Volume
# https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume
# Dataset Characteristics : Multivariate, Sequential, Time-Series
# Associated Tasks: Regression
# Feature Type : Integer, Real
# Nb of Features : 8
# Nb of Instances: 48204
def check_nan_values(data_df: pd.DataFrame):
    """
    Return True and print a warning if any NaN values exist in the DataFrame.
    """
    if data_df.isnull().values.any():
        print('NaN values in data')
        return True
    return False
def formatData(data:pd.DataFrame):
df_copy = data.copy()
# Adding Date related columns
df_copy['date_time'] = pd.to_datetime(df_copy['date_time'])
df_copy['month'] = df_copy['date_time'].dt.month
df_copy['day'] = df_copy['date_time'].dt.day
df_copy['hour'] = df_copy['date_time'].dt.hour
df_copy['weekday'] = df_copy['date_time'].dt.weekday
#df_copy['day_of_year'] = df_copy['date_time'].dt.dayofyear
df_copy = df_copy.drop(columns=['date_time'])
# Adding encoding columns
## replace spaces with underscores in categorical columns
df_copy['holiday'] = df_copy['holiday'].str.replace(' ', '_')
df_copy['weather_main'] = df_copy['weather_main'].str.replace(' ', '_')
df_copy['weather_description'] = df_copy['weather_description'].str.replace(' ', '_')
# One-hot encoding for categorical columns
df_copy = pd.get_dummies(df_copy, columns=['holiday', 'weather_main', 'weather_description'], drop_first=True, dtype='float64')
# Scaling
## Instantiate the scaler
scaler = MinMaxScaler(feature_range=(0, 1))
#scaler = StandardScaler()
## Apply the scaler to the numerical columns
numerical_cols_to_scale = ['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'traffic_volume']
df_copy[numerical_cols_to_scale] = scaler.fit_transform(df_copy[numerical_cols_to_scale]).astype('float64')
date_cols = ['month', 'day', 'hour', 'weekday']
df_copy[date_cols] = scaler.fit_transform(df_copy[date_cols]).astype('float64')
    if check_nan_values(df_copy):
        return None
    return df_copy
def shiftData(data: pd.DataFrame, time_shift: int, target_column: str):
    """
    Shift the target column by time_shift rows and drop the rows left with NaN values.
    """
    df_copy = data.copy()
    if check_nan_values(df_copy):
        return None
    df_copy['shifted_target'] = df_copy[target_column].shift(time_shift)  # Add shifted target column
    df_copy = df_copy.dropna()  # Drop the rows with NaN values introduced by the shift
    df_copy = df_copy.drop(columns=[target_column])
    df_copy = df_copy.rename(columns={'shifted_target': target_column})
    if (data.shape[0] - df_copy.shape[0]) != time_shift:
        print('Error in data shift')
        return None
    return df_copy
def slice_to_numpy_array_sequences(data: pd.DataFrame, sequence_length: int, target_column: str, feature_columns: list):
"""
Creates time sequences by slicing data using a while loop. Output data to NumPy arrays.
"""
    features = []
    targets = []
    df_copy = data.copy()
    if check_nan_values(df_copy):
        return None
    i = 0
    while i < (len(df_copy) - sequence_length):
        feature_window = df_copy.iloc[i:i + sequence_length][feature_columns].values
        features.append(feature_window)
        target_window = df_copy.iloc[i:i + sequence_length][target_column].values
        targets.append(target_window)
        i += 1
    # Convert the lists to NumPy arrays
    features = np.array(features)
    targets = np.array(targets)
    return features, targets, df_copy
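# Note: for a frame with N rows and F feature columns, slice_to_numpy_array_sequences returns
# features of shape (N - sequence_length, sequence_length, F) and targets of shape (N - sequence_length, sequence_length).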
def execute_model_training_loop(MODEL, modelName, OPTimizer, lossFunction, NUM_EPOCHS, trainLoader, testLoader):
    """
    Execute the train/test loop for the given model and return its evaluation metrics.
    """
    train_loader = trainLoader
    test_loader = testLoader
    model = MODEL
    optimizer = OPTimizer
    loss_function = lossFunction
    model_name = modelName
# Train/Test Loop
train_loss_global = []
test_loss_global = []
last_epoch_y_pred = []
last_epoch_y_act = []
print(f"[+][+] Start training {model_name} model...")
global_train_time = time.time()
start_time = time.time() # Start time for training
for epoch in range(NUM_EPOCHS):
train_loss = 0.0 # Initialize train loss for the epoch
test_loss = 0.0 # Initialize test loss for the epoch
epoch_y_pred = []
epoch_y_actual = []
# Training loop
for j, data in enumerate(train_loader):
X, y = data
optimizer.zero_grad()
y_pred = model(X)
lossA = loss_function(y_pred, y)
lossA.backward()
optimizer.step()
train_loss += lossA.item() # Accumulate training loss
train_loss /= len(train_loader) # Average training loss of Epoch
train_loss_global.append(train_loss)
# Test Loop
model.eval() # Set the model to evaluation mode
with torch.no_grad(): # Disable gradients during testing
for X_test_batch, y_test_batch in test_loader:
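                # Only evaluate full batches of size batch_size; the trailing partial batch is skipped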
if X_test_batch.shape[0] == test_loader.batch_size:
y_pred_batch = model(X_test_batch) # Get predictions for the batch
lossB = loss_function(y_pred_batch, y_test_batch)
test_loss += lossB.item() # Accumulate test loss
epoch_y_pred.extend(y_pred_batch.cpu().numpy())
epoch_y_actual.extend(y_test_batch.cpu().numpy())
epoch_r2 = r2_score(epoch_y_actual, epoch_y_pred)
# if last epoch
                    if epoch == NUM_EPOCHS - 1:
                        last_epoch_y_pred.extend(y_pred_batch.cpu().numpy())  # .cpu() makes the tensors available to sklearn metrics
last_epoch_y_act.extend(y_test_batch.cpu().numpy())
else:
pass
else:
pass
test_loss /= len(test_loader) # Average test loss
test_loss_global.append(test_loss)
model.train() # set model back to train before the next epoch
# Print the training loss for the epoch
print(f"Epoch: {epoch}, Train Loss: {train_loss:.9f}, Test Loss: {test_loss:.9f} , Accuracy: {epoch_r2:.9f} , Time: {time.time() - start_time:.2f} seconds")
start_time = time.time() # Reset start time for the next epoch
        if epoch == NUM_EPOCHS - 1 and (len(last_epoch_y_pred) > 0):
# Calculate metrics (example: MSE, RMSE, MAE, r2)
mae = mean_absolute_error(last_epoch_y_act, last_epoch_y_pred)
mse = mean_squared_error(last_epoch_y_act, last_epoch_y_pred)
rmse = np.sqrt(mean_squared_error(last_epoch_y_act, last_epoch_y_pred))
r2 = r2_score(last_epoch_y_act, last_epoch_y_pred)
else:
pass
print("Training complete.")
cpuTime = time.time() - global_train_time
metrics = {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'cpuTime(s)': cpuTime}
# plot train/test losses
plt.figure(figsize=(9, 6))
    train_loss_idx = [i for i in range(len(train_loss_global))]
    plt.plot(train_loss_idx, train_loss_global, label='Train Loss', color='blue', alpha=0.5)
    plt.plot(train_loss_idx, test_loss_global, label='Test Loss', color='red', alpha=0.5)
# Add text and box
ax = plt.gca()
bbox_props = dict(boxstyle='round, pad=0.5', facecolor='blue', alpha=0.8)
text_str = '\n'.join([f'{name}: {value:.4f}' for name, value in metrics.items()])
x, y = 0.98, 0.80
ha, va = 'right', 'top'
ax.text(x, y, text_str, transform=ax.transAxes, ha=ha, va=va, bbox=bbox_props, fontsize=10, color='white')
    plt.title('Train/Test Loss vs Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
return metrics
class TrafficDataset(Dataset):
def __init__(self, X, y):
# to tensor
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.float32)
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
return self.X[idx], self.y[idx]
class TrafficVolumeLSTM(nn.Module):
def __init__(self, num_features_col:int, hidden_layer_multiplier:float, output_size:int, num_lstm_layers:int, dropout:float):
super().__init__()
self.input_size = num_features_col # Set input_size based on num_features
self.hidden_layer_size = int(num_features_col * hidden_layer_multiplier) # Calculate hidden_layer_size
self.num_lstm_layers = num_lstm_layers
self.lstm = nn.LSTM(
input_size=self.input_size,
hidden_size=self.hidden_layer_size, # output size
num_layers=num_lstm_layers,
batch_first=True,
dropout=dropout,
)
self.train_dataloader = None
self.test_dataloader = None
self.fc1 = nn.Linear(in_features=self.hidden_layer_size, out_features= output_size)
self.fc2 = nn.Linear(in_features=output_size, out_features=output_size)
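        # Sigmoid keeps the output in [0, 1], matching the MinMax-scaled target range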
self.sigmoid = nn.Sigmoid()
def forward(self, x):
batch_size = x.size(0)
# Initialize hidden and cell states
h0 = torch.zeros(self.num_lstm_layers, batch_size, self.hidden_layer_size).to(x.device)
c0 = torch.zeros(self.num_lstm_layers, batch_size, self.hidden_layer_size).to(x.device)
out, _ = self.lstm(x, (h0, c0))
        out = out[:, -1, :]  # Take the output of the last LSTM time step
out = self.fc1(out)
out = self.fc2(out)
out = self.sigmoid(out)
return out
class TrafficVolumeGRU(nn.Module):
# model for GRU
def __init__(self, num_features_col:int, hidden_layer_multiplier:float, output_size:int, num_gru_layers:int, dropout:float):
super().__init__()
self.input_size = num_features_col
self.hidden_layer_size = int(num_features_col * hidden_layer_multiplier)
self.num_gru_layers = num_gru_layers
self.gru = nn.GRU(
input_size=self.input_size,
hidden_size=self.hidden_layer_size,
num_layers=num_gru_layers,
batch_first=True,
dropout=dropout,
)
self.train_dataloader = None
self.test_dataloader = None
self.fc1 = nn.Linear(in_features=self.hidden_layer_size, out_features=output_size)
self.fc2 = nn.Linear(in_features=output_size, out_features=output_size)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
batch_size = x.size(0)
# Initialize hidden state
h0 = torch.zeros(self.num_gru_layers, batch_size, self.hidden_layer_size).to(x.device)
out, _ = self.gru(x, h0)
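        # Keep only the last time step's output before the fully connected layers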
out = out[:, -1, :]
out = self.fc1(out)
out = self.fc2(out)
out = self.sigmoid(out)
return out
class TrafficVolumeRF():
def __init__(self):
super().__init__()
def trainRandomForestRegressor(self, train_df:pd.DataFrame, test_df:pd.DataFrame, target_col:str):
train = train_df.copy()
test = test_df.copy()
start = time.time()
# Separate features (X) and target variable (y)
y_train = train[target_col]
X_train = train.drop(columns=[target_col])
y_test = test[target_col]
X_test = test.drop(columns=[target_col])
# Create parameter grid for hyperparameter tuning
param_grid = {
'max_depth': [3, 6, 9],
'n_estimators': [10],
'max_features': ['sqrt', 'log2', None], # Explore different feature selection strategies
'bootstrap': [True, False], # Experiment with and without bootstrapping
'n_jobs': [-1] # Use all available cores for parallelization (if applicable)
}
# Create GridSearchCV object
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42), param_grid=param_grid, cv=10, scoring='neg_mean_squared_error')
# Fit GridSearchCV to training data (performs hyperparameter tuning)
grid_search.fit(X_train, y_train)
# Extract the best model found by GridSearchCV
best_model = grid_search.best_estimator_
# Use the best model for training and evaluation
model = best_model # Assign the best model to the 'model' variable
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Adjust figure size and other parameters as needed
# Get feature importances
importances = best_model.feature_importances_
# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]
# Create a DataFrame to visualize feature importances
feature_importances = pd.DataFrame({'feature': X_train.columns[indices], 'importance': importances[indices]})
        # Evaluate the model's performance using regression metrics (MAE, MSE, RMSE, R2)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
r2 = r2_score(y_test, y_pred)
stop = time.time()
cpuTime = stop - start
metrics = {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'cpuTime(s)': cpuTime}
# Plot feature importances
## Include metrics in the same plot
plt.figure(figsize=(4, 12))
plt.barh(feature_importances['feature'], feature_importances['importance'])
plt.yticks(fontsize=9)
plt.xticks(fontsize=9)
# Add text and box
ax = plt.gca()
bbox_props = dict(boxstyle='round,pad=0.5', facecolor='blue', alpha=0.8)
text_str = '\n'.join([f'{name}: {value:.4f}' for name, value in metrics.items()])
# Calculate box position and size
x, y = 0.95, 0.95
ha, va = 'right', 'top'
ax.text(x, y, text_str, transform=ax.transAxes, ha=ha, va=va, bbox=bbox_props, fontsize=10, color='white')
plt.xlabel('Importance')
plt.ylabel('Feature Name')
plt.title('Feature Importance')
plt.show()
# Create a DataFrame for predictions
        predictions_df = pd.DataFrame({'traffic_volume_predicted': y_pred}, index=X_test.index)
return model, metrics, predictions_df
def main(target_column:str, time_shift:int, sequence_length:int, num_epochs:int):
"""
Main function to execute the traffic volume prediction pipeline
"""
# Example Usage: MAIN START HERE
# Read the traffic dataset from CSV
data_df = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
# Format datasets
print('[+] Formatting Datasets...')
formatted_data_df = formatData(data=data_df.copy())
# Split to train/test datasets
    print('[+] Splitting into Train/Test Datasets...')
train_data_df, test_data_df = train_test_split(formatted_data_df.copy() , test_size=0.20, shuffle=False)
# Shift datasets
print('[+] Shifting, Slicing...')
#time_shift = 7
#sequence_length = 21
#target_column = 'traffic_volume'
feature_columns = formatted_data_df.columns.tolist()
feature_columns.remove(target_column)
feature_columns_count = len(feature_columns)
print(f" feature_columns_count: {feature_columns_count}")
print(f" target_column: {target_column}")
shifted_train_data_df = shiftData(data=train_data_df.copy(), time_shift=time_shift, target_column=target_column)
shifted_test_data_df = shiftData(data=test_data_df.copy(), time_shift=time_shift, target_column=target_column)
print(f" shifted_train_data_df.shape: {shifted_train_data_df.shape}")
print(f" shifted_test_data_df.shape: {shifted_test_data_df.shape}")
# Sequence Slicing
X_train, y_train, shiftedTrainDF = slice_to_numpy_array_sequences(data=shifted_train_data_df.copy(),
sequence_length=sequence_length,
target_column=target_column,
feature_columns=feature_columns)
X_test, y_test, shiftedTestDF = slice_to_numpy_array_sequences(data=shifted_test_data_df.copy(),
sequence_length=sequence_length,
target_column=target_column,
feature_columns=feature_columns)
# Instantiate DataLoaders
print('[+] DataLoaders...')
train_loader = DataLoader(TrafficDataset(X_train, y_train), batch_size=64, shuffle=False)
test_loader = DataLoader(TrafficDataset(X_test, y_test), batch_size=64, shuffle=False)
## Verify shape of train/test DataLoader
X_batch_train, y_batch_train = next(iter(train_loader))
print(" X_batch_train shape:", X_batch_train.shape)
print(" y_batch_train shape:", y_batch_train.shape)
    # Attach the train/test DataLoaders to the model classes as class attributes
TrafficVolumeLSTM.train_dataloader = train_loader
TrafficVolumeLSTM.test_dataloader = test_loader
TrafficVolumeGRU.train_dataloader = train_loader
TrafficVolumeGRU.test_dataloader = test_loader
# RandomForestRegression model
print('[+] RandomForestRegressor - Baseline')
model_instance = TrafficVolumeRF()
rf_model, rf_metrics, rf_predictions = model_instance.trainRandomForestRegressor(train_df=shifted_train_data_df.copy(),
test_df=shifted_test_data_df.copy(),
target_col=target_column)
# Instantiate LSTM model
print('[+] LSTM model')
LSTM_model = TrafficVolumeLSTM(num_features_col=feature_columns_count,
hidden_layer_multiplier=1.0,
output_size=sequence_length,
num_lstm_layers=2,
dropout=0.50)
# Train LSTM
LSTM_metrics = execute_model_training_loop(MODEL=LSTM_model,
modelName='LSTM',
OPTimizer=optim.Adam(LSTM_model.parameters(), lr=0.001),
lossFunction=nn.MSELoss(),
NUM_EPOCHS=num_epochs,
trainLoader=TrafficVolumeLSTM.train_dataloader,
testLoader=TrafficVolumeLSTM.test_dataloader)
# Instantiate GRU model
print('[+] GRU model')
GRU_model = TrafficVolumeGRU(num_features_col=feature_columns_count,
hidden_layer_multiplier=1.0,
output_size=sequence_length,
num_gru_layers=2,
dropout=0.50)
# Train GRU
GRU_metrics = execute_model_training_loop(MODEL=GRU_model,
modelName='GRU',
OPTimizer=optim.Adam(GRU_model.parameters(), lr=0.001),
lossFunction=nn.MSELoss(),
NUM_EPOCHS= num_epochs,
trainLoader=TrafficVolumeGRU.train_dataloader,
testLoader=TrafficVolumeGRU.test_dataloader)
return rf_metrics, LSTM_metrics, GRU_metrics
if __name__ == "__main__":
# Execute the main function
    RF_metrics, LSTM_metrics, GRU_metrics = main(target_column='traffic_volume', time_shift=7, sequence_length=21, num_epochs=100)
Compare Results (CPU training)
Parameters:
- Epochs: 100
- Learning rate: 0.001
- Optimizer: Adam
- Dropout: 0.50
- Time shift: 7
- Sequence length: 21
| Metric | RF | LSTM | GRU |
|---|---|---|---|
| MAE | 0.0940 | 0.0651 | 0.0721 |
| MSE | 0.0180 | 0.0113 | 0.0123 |
| RMSE | 0.1342 | 0.1061 | 0.1135 |
| R2 | 0.7585 | 0.8461 | 0.8239 |
| Max R2 | 0.7585 | 0.8621 | 0.8555 |
| Compute Time | 32.53s | 2568s | 6098s |
Results discussion
Overall Conclusion:
Based on the metrics, LSTM generally outperforms both RF and GRU, achieving the lowest error rates (MAE, MSE, RMSE) and the highest R-squared values. GRU performs better than RF across all performance metrics.
However, there is a substantial trade-off in terms of computation time. RF is significantly faster to train compared to both LSTM and GRU. LSTM and GRU require considerably longer training times, with GRU being the slowest among the three.
If high accuracy is the priority, LSTM appears to be the best choice. If speed is critical and slightly lower accuracy is acceptable, RF is a more suitable option. GRU offers a middle ground in accuracy but has the highest computational cost in this comparison.