
Traffic data fluctuates constantly and is strongly driven by time of day, day of week, and season. Predicting it can be challenging, but this task will help sharpen your time-series skills. With deep learning, you can exploit abstract patterns in the data that help boost predictive accuracy.

Your task is to build a system that predicts traffic volume, i.e., the number of vehicles passing a specific point at a given time. Knowing this can help reduce road congestion, support new designs for roads or intersections, improve safety, and more! Or you can simply use it to plan your commute and avoid traffic!

The dataset provided contains the hourly traffic volume on an interstate highway in Minnesota, USA. It also includes weather features and holidays, which often impact traffic volume.

Time to predict some traffic!

The data:

The dataset is collected and maintained by the UCI Machine Learning Repository. The target variable is traffic_volume. The data has already been normalized and saved into training and test sets, and contains the following columns:

train_scaled.csv, test_scaled.csv

Column                             | Type        | Description
temp                               | Numeric     | Average temperature in Kelvin
rain_1h                            | Numeric     | Amount of rain (mm) that occurred in the hour
snow_1h                            | Numeric     | Amount of snow (mm) that occurred in the hour
clouds_all                         | Numeric     | Percentage of cloud cover
date_time                          | DateTime    | Hour of the data collected, in local CST time
holiday_ (11 columns)              | Categorical | US national holidays plus a regional holiday, the Minnesota State Fair
weather_main_ (11 columns)         | Categorical | Short textual description of the current weather
weather_description_ (35 columns)  | Categorical | Longer textual description of the current weather
hour_of_day                        | Numeric     | The hour of the day
day_of_week                        | Numeric     | The day of the week (0 = Monday, 6 = Sunday)
day_of_month                       | Numeric     | The day of the month
month                              | Numeric     | The number of the month
traffic_volume                     | Numeric     | Hourly I-94 ATR 301 reported westbound traffic volume
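
Before modelling, the provided splits can be loaded and sanity-checked. A minimal sketch, assuming train_scaled.csv and test_scaled.csv are in the working directory and use the column names listed above:

import pandas as pd

# Load the pre-scaled train/test splits
train_df = pd.read_csv('train_scaled.csv')
test_df = pd.read_csv('test_scaled.csv')

# Quick sanity checks: shapes, missing values, and the target column
print(train_df.shape, test_df.shape)
print(train_df.isnull().values.any(), test_df.isnull().values.any())
print(train_df['traffic_volume'].describe())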

Approach:

  1. Prepare the data

    • Dataset: Metro Interstate Traffic Volume
      • https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume
    • Fill in missing data
    • Expand "Datetime" related columns --> hour of day, day of week, day of month, etc.
    • One-Hot encode categorical data (weather, holidays)
    • Apply MinMax scaler, and format to float

    The result is an expansion of the dataset from 8 to 65 columns

  2. Apply a time shift on the target column

    The target column is shifted forward in time relative to the feature columns (see the sketch after this list)

  3. Split dataset into train/test

    Dataset is split 80% train / 20% test

  4. Train Random Forest Regressor model

    • A Random Forest Regressor establishes an accuracy baseline
    • Determine the relative importance of each feature

    Accuracy baseline

  5. Slice train/test into tensor sequences

    • Train and test datasets are sliced into overlapping sequences and converted to tensors

    X_train, y_train / X_test, y_test

  6. Train LSTM and GRU models (CPU)

    • Long Short-Term Memory (LSTM)
    • Gated Recurrent Unit (GRU)

    Compare LSTM vs GRU vs Baseline: Accuracy and Compute time
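
To make steps 2 and 5 concrete, here is a minimal, self-contained sketch of the target shift and the sequence slicing on a toy frame. The column names, shift, and window length are illustrative only; the full pipeline below does this with shiftData and slice_to_numpy_array_sequences.

import numpy as np
import pandas as pd

# Toy frame: one feature plus the target (illustrative values only)
toy = pd.DataFrame({'feature': range(10), 'traffic_volume': range(100, 110)})

# Step 2: shift the target relative to the features, then drop the NaN rows
time_shift = 2
toy['shifted_target'] = toy['traffic_volume'].shift(time_shift)
toy = toy.dropna()

# Step 5: slice into overlapping windows of sequence_length rows
sequence_length = 3
X_windows, y_windows = [], []
for i in range(len(toy) - sequence_length):
    X_windows.append(toy.iloc[i:i + sequence_length][['feature']].values)
    y_windows.append(toy.iloc[i:i + sequence_length]['shifted_target'].values)

X, y = np.array(X_windows), np.array(y_windows)
print(X.shape, y.shape)  # (n_windows, sequence_length, n_features) and (n_windows, sequence_length)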

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
import math
import matplotlib.pyplot as plt
from tqdm import tqdm
from torch.utils.data import TensorDataset, DataLoader, Dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


# Dataset: Metro Interstate Traffic Volume
# https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume
# Dataset Characteristics : Multivariate, Sequential, Time-Series
# Associated Tasks: Regression
# Feature Type : Integer, Real 
# Nb of Features : 8
# Nb of Instances: 48204


def check_nan_values(data_df: pd.DataFrame) -> bool:
    """
    Return True if any NaN values exist in the DataFrame.
    """
    if data_df.isnull().values.any():
        print('NaN values in data')
        return True
    return False

def formatData(data: pd.DataFrame):
    """
    Add date-derived columns, one-hot encode categorical columns, and MinMax-scale numeric columns.
    """
    df_copy = data.copy()
    # Adding Date related columns
    df_copy['date_time'] = pd.to_datetime(df_copy['date_time'])
    df_copy['month'] = df_copy['date_time'].dt.month
    df_copy['day'] = df_copy['date_time'].dt.day
    df_copy['hour'] = df_copy['date_time'].dt.hour
    df_copy['weekday'] = df_copy['date_time'].dt.weekday
    #df_copy['day_of_year'] = df_copy['date_time'].dt.dayofyear
    df_copy = df_copy.drop(columns=['date_time'])

    # Adding encoding columns
    ## replace spaces with underscores in categorical columns
    df_copy['holiday'] = df_copy['holiday'].str.replace(' ', '_')
    df_copy['weather_main'] = df_copy['weather_main'].str.replace(' ', '_')
    df_copy['weather_description'] = df_copy['weather_description'].str.replace(' ', '_')

    # One-hot encoding for categorical columns
    df_copy = pd.get_dummies(df_copy, columns=['holiday', 'weather_main', 'weather_description'], drop_first=True, dtype='float64')

    # Scaling
    ## Instantiate the scaler
    scaler = MinMaxScaler(feature_range=(0, 1))
    #scaler = StandardScaler()
    
    ## Apply the scaler to the numerical columns
    numerical_cols_to_scale = ['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'traffic_volume']
    df_copy[numerical_cols_to_scale] = scaler.fit_transform(df_copy[numerical_cols_to_scale]).astype('float64')
    date_cols = ['month', 'day', 'hour', 'weekday']
    df_copy[date_cols] = scaler.fit_transform(df_copy[date_cols]).astype('float64')
    
    NaNcheck = check_nan_values(df_copy) 
    if NaNcheck == True:
        return None
    else:
        return df_copy

def shiftData(data:pd.DataFrame, time_shift:int, target_column:str):
    """
    Add a copy of target_column shifted by time_shift rows, drop the resulting NaN rows,
    and replace the original target with the shifted copy.
    """

    df_copy = data.copy()  
    NaNcheck = check_nan_values(df_copy) 
    if NaNcheck == True:
        return None
    
    else:
        df_copy['shifted_target'] = df_copy[target_column].shift(time_shift) # Add shifted target column
        df_copy = df_copy.dropna()# Drop rows with NaN values
        df_copy = df_copy.drop(columns=[target_column])
        df_copy = df_copy.rename(columns={'shifted_target': target_column})

        # The number of dropped rows must equal the time shift
        if (data.shape[0] - df_copy.shape[0]) != time_shift:
            print('Error in data shift')
            return None

        return df_copy

def slice_to_numpy_array_sequences(data: pd.DataFrame, sequence_length: int, target_column: str, feature_columns: list):
    """
    Creates time sequences by slicing data using a while loop. Output data to NumPy arrays.
    """

    features = []
    targets = []
    df_copy = data.copy()   
    NaNcheck = check_nan_values(df_copy) 
    if NaNcheck == True:
        return None

    else:
        i = 0
        while i < (len(df_copy) - sequence_length):
            feature_window = df_copy.iloc[i:i + sequence_length][feature_columns].values
            features.append(feature_window)
            target_value = df_copy.iloc[i:i + sequence_length][target_column].values
            #target_value = df_copy[target_column].iloc[i + sequence_length]
            targets.append(target_value)
            i += 1 

        # Convert the lists to NumPy arrays
        features = np.array(features)
        targets = np.array(targets)

        return features, targets, df_copy

def execute_model_training_loop(MODEL, modelName, OPTimizer, lossFunction, NUM_EPOCHS, trainLoader, testLoader):
    """
    Execute the train/test loop for the model and return the final metrics.
    """
    train_loader = trainLoader
    test_loader = testLoader
    model = MODEL
    optimizer = OPTimizer
    loss_function = lossFunction
    model_name = modelName

    # Train/Test Loop
    train_loss_global = []
    test_loss_global = []
    last_epoch_y_pred = []
    last_epoch_y_act = []
    print(f"[+][+] Start training {model_name} model...")
    global_train_time = time.time()
    start_time = time.time()  # Start time for training
    for epoch in range(NUM_EPOCHS):
        train_loss = 0.0  # Initialize train loss for the epoch
        test_loss = 0.0   # Initialize test loss for the epoch
        epoch_y_pred = []
        epoch_y_actual = []

        # Training loop
        for j, data in enumerate(train_loader):
            X, y = data
            optimizer.zero_grad()
            y_pred = model(X)
            lossA = loss_function(y_pred, y)
            lossA.backward()
            optimizer.step()
            train_loss += lossA.item()  # Accumulate training loss

        train_loss /= len(train_loader)  # Average training loss of Epoch
        train_loss_global.append(train_loss)

        # Test Loop
        model.eval()  # Set the model to evaluation mode 
        with torch.no_grad():  # Disable gradients during testing
            for X_test_batch, y_test_batch in test_loader:
                if X_test_batch.shape[0] == test_loader.batch_size:
                    y_pred_batch = model(X_test_batch) # Get predictions for the batch
                    lossB = loss_function(y_pred_batch, y_test_batch)
                    test_loss += lossB.item() # Accumulate test loss
                    epoch_y_pred.extend(y_pred_batch.cpu().numpy()) 
                    epoch_y_actual.extend(y_test_batch.cpu().numpy()) 
                    epoch_r2 = r2_score(epoch_y_actual, epoch_y_pred)

                    # if last epoch
                    if epoch == NUM_EPOCHS - 1:
                        last_epoch_y_pred.extend(y_pred_batch.cpu().numpy()) # .cpu() make available to sklearn metrics
                        last_epoch_y_act.extend(y_test_batch.cpu().numpy()) 
                    else:
                        pass
                else:
                    pass


            test_loss /= len(test_loader) # Average test loss
            test_loss_global.append(test_loss)

        model.train() # set model back to train before the next epoch

        # Print the training loss for the epoch
        print(f"Epoch: {epoch}, Train Loss: {train_loss:.9f}, Test Loss: {test_loss:.9f} , Accuracy: {epoch_r2:.9f} , Time: {time.time() - start_time:.2f} seconds")

        start_time = time.time()  # Reset start time for the next epoch

        if epoch == NUM_EPOCHS - 1 and len(last_epoch_y_pred) > 0:
            # Calculate metrics (example: MSE, RMSE, MAE, r2)
            mae = mean_absolute_error(last_epoch_y_act, last_epoch_y_pred)
            mse = mean_squared_error(last_epoch_y_act, last_epoch_y_pred)
            rmse = np.sqrt(mean_squared_error(last_epoch_y_act, last_epoch_y_pred))
            r2 = r2_score(last_epoch_y_act, last_epoch_y_pred)

        else:
            pass

    print("Training complete.")
    cpuTime = time.time() - global_train_time
    metrics = {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'cpuTime(s)': cpuTime}
    # plot train/test losses
    plt.figure(figsize=(9, 6))
    train_loss_idx = [i for i in range(len(train_loss_global))]
    plt.plot(train_loss_idx, train_loss_global, label='Train Loss', color='blue', alpha=0.5)
    plt.plot(train_loss_idx, test_loss_global, label='Test Loss', color='red', alpha=0.5)
    # Add text and box
    ax = plt.gca()
    bbox_props = dict(boxstyle='round, pad=0.5', facecolor='blue', alpha=0.8)
    text_str = '\n'.join([f'{name}: {value:.4f}' for name, value in metrics.items()])
    x, y = 0.98, 0.80
    ha, va = 'right', 'top'
    ax.text(x, y, text_str, transform=ax.transAxes, ha=ha, va=va, bbox=bbox_props, fontsize=10, color='white')
    plt.title('Train/Test Loss function of Epoch')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()
    
    return metrics

class TrafficDataset(Dataset):
    def __init__(self, X, y):
        # to tensor
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

class TrafficVolumeLSTM(nn.Module):
    def __init__(self, num_features_col:int, hidden_layer_multiplier:float, output_size:int, num_lstm_layers:int, dropout:float):
        super().__init__()
        self.input_size = num_features_col  # Set input_size based on num_features
        self.hidden_layer_size = int(num_features_col * hidden_layer_multiplier)  # Calculate hidden_layer_size
        self.num_lstm_layers = num_lstm_layers
        self.lstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_layer_size, # output size
            num_layers=num_lstm_layers,
            batch_first=True,
            dropout=dropout,
        )
        self.train_dataloader = None
        self.test_dataloader = None
    
        self.fc1 = nn.Linear(in_features=self.hidden_layer_size, out_features= output_size)
        self.fc2 = nn.Linear(in_features=output_size, out_features=output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        batch_size = x.size(0)
        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_lstm_layers, batch_size, self.hidden_layer_size).to(x.device)
        c0 = torch.zeros(self.num_lstm_layers, batch_size, self.hidden_layer_size).to(x.device)
        out, _ = self.lstm(x, (h0, c0))
        out = out[:, -1, :]  # Transform the output of the last time step of LSTM
        out = self.fc1(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

class TrafficVolumeGRU(nn.Module):
    # model for GRU
    def __init__(self, num_features_col:int, hidden_layer_multiplier:float, output_size:int, num_gru_layers:int, dropout:float):
        super().__init__()
        self.input_size = num_features_col
        self.hidden_layer_size = int(num_features_col * hidden_layer_multiplier)
        self.num_gru_layers = num_gru_layers
        self.gru = nn.GRU(
            input_size=self.input_size,
            hidden_size=self.hidden_layer_size,
            num_layers=num_gru_layers,
            batch_first=True,
            dropout=dropout,
        )
        self.train_dataloader = None
        self.test_dataloader = None

        self.fc1 = nn.Linear(in_features=self.hidden_layer_size, out_features=output_size)
        self.fc2 = nn.Linear(in_features=output_size, out_features=output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        batch_size = x.size(0)
        # Initialize hidden state
        h0 = torch.zeros(self.num_gru_layers, batch_size, self.hidden_layer_size).to(x.device)
        out, _ = self.gru(x, h0)
        out = out[:, -1, :]
        out = self.fc1(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

class TrafficVolumeRF():
    def __init__(self):
        super().__init__()

    def trainRandomForestRegressor(self, train_df:pd.DataFrame, test_df:pd.DataFrame, target_col:str):
        train = train_df.copy()
        test = test_df.copy()
        start = time.time()
        # Separate features (X) and target variable (y)
        y_train = train[target_col]
        X_train = train.drop(columns=[target_col])
        
        y_test = test[target_col]
        X_test = test.drop(columns=[target_col])

        # Create parameter grid for hyperparameter tuning
        param_grid = {
            'max_depth': [3, 6, 9],
            'n_estimators': [10],
            'max_features': ['sqrt', 'log2', None], # Explore different feature selection strategies
            'bootstrap': [True, False],             # Experiment with and without bootstrapping
            'n_jobs': [-1]                          # Use all available cores for parallelization (if applicable)
        }
        
        # Create GridSearchCV object
        grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42), param_grid=param_grid, cv=10, scoring='neg_mean_squared_error')

        # Fit GridSearchCV to training data (performs hyperparameter tuning)
        grid_search.fit(X_train, y_train)

        # Extract the best model found by GridSearchCV
        best_model = grid_search.best_estimator_

        # Use the best model for training and evaluation
        model = best_model  # Assign the best model to the 'model' variable
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Adjust figure size and other parameters as needed
        # Get feature importances
        importances = best_model.feature_importances_

        # Sort feature importances in descending order
        indices = np.argsort(importances)[::-1]

        # Create a DataFrame to visualize feature importances
        feature_importances = pd.DataFrame({'feature': X_train.columns[indices], 'importance': importances[indices]})

        # Evaluate the model's performance using regression metrics (MAE, MSE, RMSE, R2)
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = math.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        stop = time.time()
        cpuTime = stop - start
        metrics = {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'cpuTime(s)': cpuTime}

        # Plot feature importances
        ## Include metrics in the same plot
        plt.figure(figsize=(4, 12))
        plt.barh(feature_importances['feature'], feature_importances['importance'])
        plt.yticks(fontsize=9)
        plt.xticks(fontsize=9)

        # Add text and box
        ax = plt.gca()
        bbox_props = dict(boxstyle='round,pad=0.5', facecolor='blue', alpha=0.8)
        text_str = '\n'.join([f'{name}: {value:.4f}' for name, value in metrics.items()])
        # Calculate box position and size
        x, y = 0.95, 0.95
        ha, va = 'right', 'top'
        ax.text(x, y, text_str, transform=ax.transAxes, ha=ha, va=va, bbox=bbox_props, fontsize=10, color='white')

        plt.xlabel('Importance')
        plt.ylabel('Feature Name')
        plt.title('Feature Importance')
        plt.show()
        
    
        # Create a DataFrame for predictions
        predictions_df = pd.DataFrame({'predicted_traffic_volume': y_pred}, index=X_test.index)
        
        return model, metrics, predictions_df  


def main(target_column:str, time_shift:int, sequence_length:int, num_epochs:int):
    
    """
    Main function to execute the traffic volume prediction pipeline
    
    """
    
    # Read the raw traffic dataset from CSV
    data_df = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')

    # Format datasets
    print('[+] Formatting Datasets...')
    formatted_data_df = formatData(data=data_df.copy())

    # Split to train/test datasets
    print('[+] Splitting to Train/Test Datasets...')
    train_data_df, test_data_df = train_test_split(formatted_data_df.copy() , test_size=0.20, shuffle=False)

    # Shift datasets
    print('[+] Shifting, Slicing...')
    #time_shift = 7
    #sequence_length = 21
    #target_column = 'traffic_volume'

    feature_columns = formatted_data_df.columns.tolist()
    feature_columns.remove(target_column)
    feature_columns_count = len(feature_columns)
    print(f"     feature_columns_count: {feature_columns_count}")
    print(f"     target_column: {target_column}")

    shifted_train_data_df = shiftData(data=train_data_df.copy(), time_shift=time_shift, target_column=target_column)
    shifted_test_data_df = shiftData(data=test_data_df.copy(), time_shift=time_shift, target_column=target_column)
    print(f"     shifted_train_data_df.shape: {shifted_train_data_df.shape}")
    print(f"     shifted_test_data_df.shape: {shifted_test_data_df.shape}")

   

    # Sequence Slicing
    X_train, y_train, shiftedTrainDF = slice_to_numpy_array_sequences(data=shifted_train_data_df.copy(), 
                                                                    sequence_length=sequence_length, 
                                                                    target_column=target_column, 
                                                                    feature_columns=feature_columns)

    X_test, y_test, shiftedTestDF = slice_to_numpy_array_sequences(data=shifted_test_data_df.copy(), 
                                                                sequence_length=sequence_length, 
                                                                target_column=target_column, 
                                                                feature_columns=feature_columns)

    # Instantiate DataLoaders
    print('[+] DataLoaders...')
    train_loader = DataLoader(TrafficDataset(X_train, y_train), batch_size=64, shuffle=False)
    test_loader = DataLoader(TrafficDataset(X_test, y_test), batch_size=64, shuffle=False)
    ## Verify shape of train/test DataLoader
    X_batch_train, y_batch_train = next(iter(train_loader))
    print("     X_batch_train shape:", X_batch_train.shape)
    print("     y_batch_train shape:", y_batch_train.shape)

    # Attach the train/test DataLoaders to the model classes (class-level attributes)
    TrafficVolumeLSTM.train_dataloader = train_loader
    TrafficVolumeLSTM.test_dataloader = test_loader
    TrafficVolumeGRU.train_dataloader = train_loader
    TrafficVolumeGRU.test_dataloader = test_loader

    # RandomForestRegression model
    print('[+] RandomForestRegressor - Baseline')
    model_instance = TrafficVolumeRF()
    rf_model, rf_metrics, rf_predictions = model_instance.trainRandomForestRegressor(train_df=shifted_train_data_df.copy(), 
                                                                                    test_df=shifted_test_data_df.copy(), 
                                                                                    target_col=target_column)

    # Instantiate LSTM model
    print('[+] LSTM model')
    LSTM_model = TrafficVolumeLSTM(num_features_col=feature_columns_count, 
                            hidden_layer_multiplier=1.0, 
                            output_size=sequence_length, 
                            num_lstm_layers=2, 
                            dropout=0.50)

    # Train LSTM
    LSTM_metrics = execute_model_training_loop(MODEL=LSTM_model,
                                modelName='LSTM',
                                OPTimizer=optim.Adam(LSTM_model.parameters(), lr=0.001),
                                lossFunction=nn.MSELoss(),
                                NUM_EPOCHS=num_epochs,
                                trainLoader=TrafficVolumeLSTM.train_dataloader,
                                testLoader=TrafficVolumeLSTM.test_dataloader)

    # Instantiate GRU model
    print('[+] GRU model')
    GRU_model = TrafficVolumeGRU(num_features_col=feature_columns_count,
                                hidden_layer_multiplier=1.0, 
                                output_size=sequence_length, 
                                num_gru_layers=2, 
                                dropout=0.50)

    # Train GRU
    GRU_metrics = execute_model_training_loop(MODEL=GRU_model,
                                modelName='GRU',
                                OPTimizer=optim.Adam(GRU_model.parameters(), lr=0.001),
                                lossFunction=nn.MSELoss(),
                                NUM_EPOCHS= num_epochs,
                                trainLoader=TrafficVolumeGRU.train_dataloader,
                                testLoader=TrafficVolumeGRU.test_dataloader)

    return rf_metrics, LSTM_metrics, GRU_metrics

if __name__ == "__main__":
    # Execute the main function
    RF_metrics, LSTM_metrics, GRU_metrics = main(target_column='traffic_volume', time_shift=7, sequence_length=21, num_epochs=100)

Compare Results (CPU training)

Parameters:

  • Epochs: 100
  • Learning rate: 0.001
  • Optimizer: Adam
  • Dropout: 0.50
  • Time shift: 7
  • Sequence length: 21
Metric       | RF      | LSTM    | GRU
MAE          | 0.0940  | 0.0651  | 0.0721
MSE          | 0.0180  | 0.0113  | 0.0123
RMSE         | 0.1342  | 0.1061  | 0.1135
R2           | 0.7585  | 0.8461  | 0.8239
Max R2       | 0.7585  | 0.8621  | 0.8555
Compute Time | 32.53 s | 2568 s  | 6098 s
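
The comparison can be assembled directly from the metrics dictionaries returned by main(). A minimal sketch; the dictionary keys follow the code above, and the values will of course vary from run to run:

# Assemble the metrics dictionaries returned by main() into a single comparison table
comparison = pd.DataFrame({'RF': RF_metrics, 'LSTM': LSTM_metrics, 'GRU': GRU_metrics})
print(comparison.round(4))  # rows: MAE, MSE, RMSE, R2, cpuTime(s); columns: RF, LSTM, GRU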

Discussion of results

Overall Conclusion:

Based on the metrics, LSTM generally outperforms both RF and GRU, achieving the lowest error rates (MAE, MSE, RMSE) and the highest R-squared values. GRU performs better than RF across all performance metrics.

However, there is a substantial trade-off in computation time: RF trains far faster than either neural model, and GRU is the slowest of the three.

If high accuracy is the priority, LSTM appears to be the best choice. If speed is critical and slightly lower accuracy is acceptable, RF might be the more suitable option. GRU offers a middle ground in predictive performance but has the highest computational cost in this comparison.