Traffic data fluctuates constantly and is strongly shaped by time. Predicting it can be challenging, but this task will help sharpen your time-series skills. With deep learning, you can capture abstract patterns in the data that help boost predictive performance.
Your task is to build a system that predicts traffic volume: the number of vehicles passing a specific point at a specific time. Determining this can help reduce road congestion, support new designs for roads or intersections, improve safety, and more! Or, you can use it to help plan your commute and avoid traffic!
The dataset provided contains the hourly traffic volume on an interstate highway in Minnesota, USA. It also includes weather features and holidays, which often impact traffic volume.
Time to predict some traffic!
The data:
The dataset is maintained by the UCI Machine Learning Repository. The target variable is traffic_volume. The data has already been normalized and saved into training and test sets, and contains the following columns:
train_scaled.csv, test_scaled.csv
| Column | Type | Description |
|---|---|---|
| temp | Numeric | Average temperature in kelvin |
| rain_1h | Numeric | Amount of rain, in mm, that occurred in the hour |
| snow_1h | Numeric | Amount of snow, in mm, that occurred in the hour |
| clouds_all | Numeric | Percentage of cloud cover |
| date_time | DateTime | Hour the data was collected, in local CST time |
| holiday_ (11 columns) | Categorical | US national holidays plus a regional holiday, the Minnesota State Fair |
| weather_main_ (11 columns) | Categorical | Short textual description of the current weather |
| weather_description_ (35 columns) | Categorical | Longer textual description of the current weather |
| hour_of_day | Numeric | The hour of the day |
| day_of_week | Numeric | The day of the week (0=Monday, 6=Sunday) |
| day_of_month | Numeric | The day of the month |
| month | Numeric | The number of the month |
| traffic_volume | Numeric | Hourly I-94 ATR 301 reported westbound traffic volume |
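For orientation, the snippet below is a minimal sketch of loading the two provided files and confirming the schema described above; it assumes the CSVs sit in the working directory, and the variable names are illustrative.

import pandas as pd

# Assumed file locations; adjust paths as needed
train_df = pd.read_csv('train_scaled.csv')
test_df = pd.read_csv('test_scaled.csv')

print(train_df.shape, test_df.shape)
print(train_df.columns.tolist())          # should match the columns described above
print(train_df['traffic_volume'].head())  # target variable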
# Import the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.preprocessing import MinMaxScaler
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Read the dataset
traffic_df = pd.read_csv('traffic_enhanced.csv')

traffic_df.shape

Data Preprocessing
# Check for dtypes and missing data
traffic_df.info()

# Check for duplicates
print(f"There are {traffic_df.duplicated().sum()} duplicated rows in the dataset.")

# Drop unneeded column
traffic_df = traffic_df.drop(columns=['Unnamed: 0'])
traffic_df.columns

# Convert 'date_time' to datetime type
traffic_df['date_time'] = pd.to_datetime(traffic_df['date_time'])
# Convert 'holiday_' columns to int type
holiday_columns = [col for col in traffic_df.columns if col.startswith('holiday_')]
traffic_df[holiday_columns] = traffic_df[holiday_columns].astype(int)
# Convert 'weather_main_' columns to int type
weather_main_columns = [col for col in traffic_df.columns if col.startswith('weather_main_')]
traffic_df[weather_main_columns] = traffic_df[weather_main_columns].astype(int)
# Convert 'weather_description_' columns to int type
weather_description_columns = [col for col in traffic_df.columns if col.startswith('weather_description_')]
traffic_df[weather_description_columns] = traffic_df[weather_description_columns].astype(int)

traffic_df.describe().round(1).T

1. Create a day_id Column
During data inspection, I observed that the dataset contains an uneven number of hourly records per day — some days have only 16 or 20 hours instead of the expected 24. This irregularity is common in sensor-based data due to outages, maintenance, or missing entries.
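As a quick check, the number of hourly rows per calendar date can be tabulated directly from date_time; this is a minimal sketch of how the observation above can be verified, and the variable name rows_per_day is illustrative.

# Count hourly rows per calendar day and inspect the distribution
rows_per_day = traffic_df.groupby(traffic_df['date_time'].dt.date).size()
print(rows_per_day.value_counts().sort_index())   # how many days have 16, 20, 24, ... rows
print(f"Days with fewer than 24 rows: {(rows_per_day < 24).sum()}")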
To ensure the integrity of sequence creation (especially for time-based modeling), it’s essential to group and track data at the day level. I created a day_id column by extracting the date component from the date_time field. This allows me to:
- Group rows that belong to the same calendar day
- Filter out days with incomplete data (a sketch of this follows the day_id code below)
- Confidently create input sequences without blending data across different days
This step is foundational for generating consistent and meaningful sequences that reflect the true temporal structure of the data.
traffic_df = traffic_df.sort_values("date_time").reset_index(drop=True)
traffic_df['date'] = traffic_df['date_time'].dt.date
traffic_df['day_id'] = (traffic_df['date'] != traffic_df['date'].shift()).cumsum()

traffic_df['day_id'].describe().round(2)
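As referenced in the list above, the following is a minimal sketch of the remaining two steps: dropping incomplete days and building sliding-window sequences that never cross a day boundary. The 24-row completeness threshold, the make_sequences helper, and seq_length=12 are illustrative assumptions, not part of the original workflow.

# Keep only days with a full 24 hours of records
day_counts = traffic_df.groupby('day_id').size()
complete_days = day_counts[day_counts == 24].index
traffic_complete = traffic_df[traffic_df['day_id'].isin(complete_days)].reset_index(drop=True)

# Build sliding-window sequences within each day so no window crosses a day boundary
def make_sequences(df, feature_cols, target_col='traffic_volume', seq_length=12):
    X, y = [], []
    for _, day in df.groupby('day_id'):
        values = day[feature_cols].to_numpy(dtype='float32')
        targets = day[target_col].to_numpy(dtype='float32')
        for i in range(len(day) - seq_length):
            X.append(values[i:i + seq_length])      # seq_length consecutive hours of features
            y.append(targets[i + seq_length])       # traffic volume for the following hour
    return np.array(X), np.array(y)

Grouping by day_id before windowing keeps every sequence within a single calendar day, which is exactly the blending problem the day_id column was introduced to avoid, and yields arrays that can be wrapped in TensorDataset and DataLoader later on.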
EDA and Visualizations

1. Distribution Check