ARIMA for Time Series Forecasting: A Complete Guide
Let’s take a look at ARIMA, which is one of the most popular (if not the most popular) time series forecasting techniques. ARIMA is popular because it effectively models time series data by capturing both the autoregressive (AR) and moving average (MA) components, while also addressing non-stationarity through differencing (I). This combination makes ARIMA models especially flexible, which is why they are used across very different industries, like finance and weather prediction.
ARIMA models are highly technical, but I will break down the parts so you can develop a strong understanding. Before getting started, it's a good idea to familiarize yourself with some foundational tools. DataCamp offers a lot of good resources, such as our ARIMA Models in Python or ARIMA Models in R courses. You can choose either depending on the language you prefer.
Why Use ARIMA Forecasting?
ARIMA attracts wide interest in finance, economics, and the environmental sciences because it can capture many of the patterns in past observations and project them forward. Whether we are predicting stock prices, forecasting weather patterns, or estimating consumer demand, ARIMA is a well-established way to produce actionable forecasts.
By using ARIMA, we can analyze and forecast time series data in a way that accounts for trends and autocorrelation and, with seasonal extensions, for seasonality. This gives us a clearer view of the underlying dynamics for making informed decisions.
Key Components of ARIMA Models
In order to really understand ARIMA, we need to deconstruct its building blocks. Once we have the components down, it will become easier to understand how this time series forecasting method works as a whole. Here, I’ll give a detailed explanation of every component.
Autoregressive (AR) part
The Autoregressive (AR) component predicts the current value of the series from its own past values. It works like a regression model in which the regressors are lags of the time series itself, so an AR model of order p uses the previous p observations.
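To make the "regression on its own lags" idea concrete, here is a minimal sketch that simulates an AR(1) series and recovers its coefficient with a least-squares regression of each value on the previous one. The coefficient 0.7, the sample size, and the random seed are assumptions chosen only for illustration; they are not taken from the gold price data.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = 0.7   # assumed AR(1) coefficient, chosen for illustration
n = 5000

# Simulate y_t = phi * y_{t-1} + e_t with standard normal noise
y = np.zeros(n)
noise = rng.standard_normal(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + noise[t]

# Regress y_t on its own first lag (no intercept, since the series has mean zero)
lagged, current = y[:-1], y[1:]
phi_hat = float(lagged @ current / (lagged @ lagged))
print(f"Estimated AR(1) coefficient: {phi_hat:.3f}")
```

The estimate should land close to the coefficient used in the simulation, which is the sense in which an AR model "learns" from past values.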
Integrated (I) part
The Integrated (I) part involves differencing the time series so that it becomes stationary, which means that its mean and variance remain constant over time. To difference a series, we subtract the previous observation from the current one, which removes trends; seasonal patterns are removed by differencing at the seasonal lag. This step is necessary because the AR and MA components assume a stationary series.
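As a small illustration of differencing, the sketch below uses two made-up series: one with a linear trend and one with a quadratic trend. One round of differencing flattens the linear trend, while the quadratic trend needs two rounds.

```python
import numpy as np

t = np.arange(10)
linear = 5.0 + 2.0 * t          # made-up series with a linear trend
quadratic = 1.0 + 0.5 * t**2    # made-up series with a quadratic trend

first_diff = np.diff(linear)            # y_t - y_{t-1}
second_diff = np.diff(quadratic, n=2)   # difference of the differences

print(first_diff)    # constant: the linear trend is gone
print(second_diff)   # constant: the quadratic trend is gone
```

Each round of differencing shortens the series by one observation, which is why the results have 9 and 8 elements rather than 10.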
Moving average (MA) part
The moving average (MA) component focuses on the relationship between an observation and past forecast errors (residuals). An MA model of order q expresses the current observation as a combination of the current error and the previous q errors.
Because each error affects only a limited number of later observations, the MA part is particularly useful for capturing short-term changes in the data or random shocks. Modeling these effects gives us valuable information about the behavior of the series, which in turn helps us forecast with greater accuracy.
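The difference between how MA and AR terms treat a shock can be seen in a tiny deterministic sketch. A single unit shock is fed through an MA(1) model and, for contrast, an AR(1) model. Both coefficients (0.6) are assumptions picked for illustration.

```python
import numpy as np

theta = 0.6   # assumed MA(1) coefficient
phi = 0.6     # assumed AR(1) coefficient, for contrast
shocks = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # one unit shock at t = 2

# MA(1): y_t = e_t + theta * e_{t-1}
ma = shocks.copy()
ma[1:] += theta * shocks[:-1]

# AR(1): y_t = phi * y_{t-1} + e_t
ar = np.zeros_like(shocks)
for t in range(len(shocks)):
    ar[t] = shocks[t] + (phi * ar[t - 1] if t > 0 else 0.0)

print("MA(1) response:", ma)
print("AR(1) response:", np.round(ar, 3))
```

In the MA(1) case the shock is felt for exactly one extra step and then disappears, whereas in the AR(1) case it decays gradually.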
How to Build an ARIMA Model in Python
To build an ARIMA model for forecasting a series such as gold prices, you can follow these steps. Let’s break it down together.
Data collection
The first step is to tee up an appropriate dataset and prepare our environment.
Find a dataset
Collect or search for a dataset from data source platforms. You want one that has historical data over time. Here is a link to the Kaggle dataset related to gold future prices.
Install packages
We install the packages we need, including statsmodels and sklearn.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
Load the data
We then read the data into our local environment.
data = pd.read_csv("future-gc00-daily-prices.csv", index_col="Date", parse_dates=True)  # parse the index as dates
Data preprocessing
Our dataset is pretty clean, but in other contexts, we would have to handle indexing issues, which is important in time series forecasting. For example, if we were forecasting the opening value of a stock on a particular exchange, we would have to consider that the stock market is not open on weekends.
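To illustrate the kind of indexing issue described above, here is a hedged sketch using made-up prices (not the gold dataset). It places the observations on a business-day calendar, so weekends are skipped and a missing trading day shows up as an explicit gap that can then be filled.

```python
import pandas as pd

# Hypothetical prices observed on a Friday, the following Monday, and Wednesday
prices = pd.Series(
    [100.0, 101.0, 103.0],
    index=pd.to_datetime(["2024-01-05", "2024-01-08", "2024-01-10"]),
    name="Price",
)

# Re-index onto a business-day calendar: the weekend is skipped,
# while the missing Tuesday appears as NaN
business_days = prices.asfreq("B")

# Carry the last known price forward into the gap
filled = business_days.ffill()
print(filled)
```

Whether forward filling is appropriate depends on the data. It is used here only because it is one of the imputation options this guide mentions.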
Check for stationarity
Keeping things stationary makes the modeling task a lot easier, helps to improve our model accuracy, and in return provides us with more reliable predictions. While ARIMA models can deal with non-stationarity up to a point, they cannot effectively account for time-varying variance. Here we can use the Augmented Dickey-Fuller (ADF) test, which checks for a unit root: a small p-value (commonly below 0.05) is evidence that the series is stationary.
result = adfuller(data["Price"])
print(f"ADF Statistic: {result[0]}")
print(f"p-value: {result[1]}")
Handle missing values
As part of data preprocessing, we also have to consider how to handle missing values using an imputation method like forward filling or mean replacement.
data.ffill(inplace=True)  # forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
Perform differencing
If the stationarity test indicates that the data is not stationary, we apply differencing to transform it into a stationary series. The steps to take in performing differencing are:
- Subtract the previous observation from each observation to give us a new time series of first differences. This new series is one element shorter than the original.
- Test if the differenced series is now stationary. If not, then we can take the second difference by differencing the already differenced series again.
- Continue differencing the series until it is stationary. The order of differencing required is the minimum number of differences needed to get a stationary series.
if result[1] > 0.05:
    # First difference; drop the leading NaN row that diff() creates
    data["Price"] = data["Price"].diff()
    data = data.dropna(subset=["Price"])
    result = adfuller(data["Price"])
stationarity_interpretation = "Stationary" if result[1] < 0.05 else "Non-Stationary"
print(f"ADF Statistic after differencing: {result[0]}")
print(f"p-value after differencing: {result[1]}")
print(f"Interpretation: The series is {stationarity_interpretation}.")
ADF Statistic: -11.498371141896145
p-value: 4.5550962204394835e-21
Interpretation: The series is Stationary.
Model identification
When we build an ARIMA model, we have to consider the p, d, and q terms that go into our ARIMA model.
- The first parameter, p, is the number of lagged observations. By considering p, we effectively determine how far back in time we go when trying to predict the current observation. We choose it by looking at the partial autocorrelations of our time series, which measure the correlation with each previous lag after accounting for the lags in between.
- The second parameter, d, refers to the order of differencing. Differencing simply means finding the differences between consecutive timesteps. It is a way to make our data stationary by removing trends. d is the number of times the series has to be differenced to become stationary.
- The third parameter q refers to the order of the moving average (MA) part of the model. It represents the number of lagged forecast errors included in the model. Unlike a simple moving average, which smooths data, the moving average in ARIMA captures the relationship between an observation and the residual errors from a moving average model applied to lagged observations.
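Putting the three terms together, an ARIMA(p, d, q) model can be written as follows. The notation is an assumption introduced here for illustration: $y_t$ is the observed series, $B$ is the backshift operator ($B y_t = y_{t-1}$), $w_t$ is the series after differencing $d$ times, $c$ is a constant, and $\varepsilon_t$ is the forecast error at time $t$.

```latex
w_t = (1 - B)^d \, y_t
\qquad
w_t = c + \sum_{i=1}^{p} \phi_i \, w_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

The $\phi_i$ are the autoregressive coefficients and the $\theta_j$ are the moving average coefficients.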
Finding the ARIMA terms
We use tools like the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) to choose p and q. For a pure MA process, the ACF cuts off after lag q, and for a pure AR process, the PACF cuts off after lag p. When both AR and MA terms are present, both functions tail off gradually, so the plots are a guide rather than a strict rule. We choose d as the smallest order of differencing that makes the series stationary. For our data, we choose 1 for both p and q because we see a significant spike in the first lag for each.
plot_acf(data["Price"], lags=40)
plot_pacf(data["Price"], lags=40)
ACF and PACF plots used to determine ARIMA terms. Image by Author
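To show what the ACF plot is computing, here is a minimal sketch that calculates sample autocorrelations by hand and compares them with the approximate 95% band that such plots draw. The series is a simulated MA(1) process with an assumed coefficient of 0.6, not the gold price data, and the helper `sample_acf` is a hypothetical name introduced for this example.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = centered @ centered
    return np.array([
        (centered[k:] @ centered[:len(x) - k]) / denom
        for k in range(max_lag + 1)
    ])

rng = np.random.default_rng(42)
n, theta = 20000, 0.6                  # illustrative MA(1) setup
e = rng.standard_normal(n + 1)
series = e[1:] + theta * e[:-1]        # y_t = e_t + theta * e_{t-1}

acf = sample_acf(series, max_lag=5)
band = 1.96 / np.sqrt(n)               # approximate 95% band under white noise
for lag, r in enumerate(acf):
    flag = "significant" if lag > 0 and abs(r) > band else ""
    print(f"lag {lag}: {r:+.3f} {flag}")
```

For an MA(1) process, the lag-1 autocorrelation should be close to theta / (1 + theta^2) and the higher lags close to zero, which is the "cut off" pattern used to pick q.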
Parameter estimation
To be clear, the p, d, and q values in ARIMA represent the model's order (lags for autoregression, differencing, and moving average terms), but they are not the actual parameters being estimated. Once the p, d, and q values are chosen, the model estimates additional parameters, such as coefficients for the autoregressive and moving average terms, through Maximum Likelihood Estimation (MLE).
Model fitting
# Fit the ARIMA model
# Initial ARIMA Model parameters
p, d, q = 1, 0, 1
model = ARIMA(data["Price"], order=(p, d, q))
model_fit = model.fit()
model_summary = model_fit.summary()
model_summary
Model statistics and model diagnostics
We now check the residuals and make sure they act like white noise, which means they should have no patterns or trends. One option is to use our ACF and PACF plots again, but this time applied to the residuals. If no lag shows a large spike outside the confidence band, our residuals appear to be white noise. We can also plot the residuals of the overall model to make sure there are no obvious patterns, as we are doing here:
Residual plot
# plot residual errors
residuals = model_fit.resid
residuals.plot()
residuals.plot(kind='kde')
plt.show()
ARIMA model residuals. Image by Author
AIC and BIC
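We also check the model statistics relevant to model selection, AIC and BIC. Both reward goodness of fit and penalize the number of parameters, so lower values are better when we compare candidate models fitted to the same data. It is still worth comparing the results with those from simpler models to avoid overfitting.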
print(f"AIC: {model_fit.aic}")
print(f"BIC: {model_fit.bic}")
AIC: 41919.18902176751
BIC: 41937.18705062565
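For reference, the standard definitions of the two criteria are given below. The symbols are introduced here for illustration: $k$ is the number of estimated parameters, $n$ is the number of observations, and $\hat{L}$ is the maximized likelihood of the fitted model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}
```

Because $\ln(n)$ exceeds 2 for any series with more than about seven observations, BIC penalizes extra parameters more heavily than AIC and tends to favor simpler models.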
Forecasting
To forecast using an ARIMA model, start by using the fitted model to predict future values based on the data. Once predictions are made, it's helpful to visualize them by plotting the predicted values alongside the actual values. We can do this because we use a train/test workflow, where the data is split into training and testing sets, which lets us see how well the model performs on unseen data. Our Model Validation in Python course is a great resource to learn the ins and outs of model validation.
1. Use a train/test workflow
Our first step is to split the data into training and testing versions.
data = data["Price"]
train_size = int(len(data) * 0.8)
train, test = data[:train_size], data[train_size:]
# Fit the model to training data. Replace p, d, q with our ARIMA parameters
model = ARIMA(train, order=(p, d, q))
model_fit = model.fit()
# Forecast as many steps as there are observations in the test set
forecast = model_fit.forecast(steps=len(test))
2. Visualize our time series
Our next step is to visually inspect our time series forecast.
# Plotting
plt.figure(figsize=(10, 5))
plt.plot(data.index[:train_size], train, label='Train', color='blue')
plt.plot(data.index[train_size:], test, label='Test', color='green')
plt.plot(data.index[train_size:], forecast, label='Forecast', color='red')
plt.legend()
plt.title('ARIMA Forecast vs Actual')
plt.show()
ARIMA forecast actual vs. predicted values. Image by Author
3. Evaluate model statistics
We evaluate model statistics, particularly the root mean squared error (RMSE), to assess how well our model forecasts the test set. A lower RMSE indicates a better ARIMA model, reflecting smaller differences between actual and predicted values.
# Evaluate model performance on the test set
rmse = mean_squared_error(test, forecast) ** 0.5  # square root of the MSE
print(f"RMSE: {rmse}")
RMSE: 135.87678712210163
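To make the relationship between the mean squared error and the RMSE explicit, here is a small worked sketch with made-up numbers (three actual values and three forecasts, unrelated to the gold data).

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0])            # made-up observed values
forecast_values = np.array([2.0, 5.0, 9.0])   # made-up forecasts

errors = actual - forecast_values     # [1, 0, -2]
mse = np.mean(errors ** 2)            # mean squared error: (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                   # back in the units of the series
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```

Taking the square root puts the error back on the same scale as the series itself, which is why RMSE is usually easier to interpret than MSE.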
Common Uses of ARIMA Forecasting
Now, let’s discuss the applications of ARIMA in different industries. A variety of sectors, from economics and finance to weather forecasting and health, use ARIMA models to derive insights from data and improve predictive accuracy. Some big ones are as follows:
Economics and finance
ARIMA’s strength lies in its capacity to handle financial time series that often contain strong autocorrelation. Its ability to model lag effects and incorporate differencing makes it a common baseline for forecasting metrics like stock prices or exchange rates, although highly volatile series remain hard to predict.
- Financial Forecasting: Through predicting stock prices, exchange rates and other financial instruments, ARIMA can be used to support investment strategies.
- Economic Modeling: ARIMA models help predict the future of a country or global economy, informing economic policy decisions.
- Demand Planning: ARIMA predicts the demand for consumer goods and services, helping to optimize production planning to control inventory.
Weather forecasting
ARIMA models leverage historical weather patterns to provide forecasts, and they tend to be most reliable over shorter horizons.
- Predicting Temperature and Precipitation: ARIMA models are used in weather forecasts that rely on the historical record of the variable itself. Other factors, such as air-sea interactions, have to be brought in as external regressors.
- Climate Change Modeling: ARIMA models use historical weather data to better understand trends in the climate and predict what future climates will look like.
Supply chain management
ARIMA’s ability to model lag effects helps supply chain managers manage inventory or anticipate disruptions based on historical patterns and lead times.
- Demand Forecasting: ARIMA has the ability to predict the future demand of products and plan production schedules or stock levels.
- Inventory Management: ARIMA ensures that the right levels of stock are maintained on items so as to not have too much capital tied up in inventory investments and to reduce costs related to over- or understocking.
- Supply Chain Optimization: ARIMA can help anticipate supply chain disruptions by forecasting individual series such as transport delays or demand fluctuations.
Healthcare
In healthcare, ARIMA models are particularly valuable because they can predict patient admissions and other important trends.
- Disease Outbreak Prediction: ARIMA models are useful for predicting the spread of infectious diseases, which can pave the way for pre-emptive public health interventions.
- Hospital Admission Forecasting: ARIMA predicts hospital admission rates and helps optimize resources and staff schedules.
- Patient Monitoring: ARIMA is a useful tool for professionals who want to examine medical data to warn of early signs of health issues and tailor-fit treatment strategies.
Things to Consider for Better ARIMA Forecasting
Here are some common mistakes to avoid while working on building ARIMA models:
Overfitting and underfitting
If we choose incorrect p, d, and q values, it can lead to overfitting or underfitting. We overfit when our model is too complex and latches onto the noise in our data, so that it doesn’t generalize well to new observations. At the other extreme, underfitting means that our model is too simple to capture the underlying patterns.
To prevent overfitting, an approach could be to use fewer lag terms and also possibly fewer differencing terms. Underfitting can be fixed by increasing the number of autoregressive terms, if appropriate. One must strike a balance between complexity and simplicity. Techniques such as validation/cross-validation can help.
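For time series, cross-validation has to respect the order of the observations, so the training data must always come before the test data. The sketch below shows one common way to do this, an expanding-window split. The function name `expanding_window_splits` and its parameters are hypothetical, introduced only for this example.

```python
def expanding_window_splits(n_obs, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs that never look ahead."""
    first_test_start = n_obs - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough observations for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_obs=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx}")
```

Each candidate (p, d, q) could then be fitted on every training window and scored on the matching test window, with the average error used to compare candidates.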
Stationarity
Stationarity is a statistical assumption that deals with time-dependencies of data. Unreliable forecasts and spurious relationships can result from non-stationary data. Differencing or transformations such as log transformations or seasonal adjustments can be used to make non-stationary data stationary.
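As a quick illustration of why a log transformation can help, consider a made-up series that grows by 5% per step. Plain differencing leaves increments that keep growing, while differencing the logged series gives a constant.

```python
import numpy as np

t = np.arange(8)
series = 100.0 * 1.05 ** t            # hypothetical series growing 5% per step

plain_diff = np.diff(series)          # increments keep growing
log_diff = np.diff(np.log(series))    # constant: log(1.05) at every step

print(np.round(plain_diff, 2))
print(np.round(log_diff, 4))
```

This is why log differences are often used for series, such as prices, whose fluctuations scale with their level.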
Seasonality
The presence of seasonality is another vital component to take into account while dealing with time-series analysis. Daily, weekly, and yearly are some of the fixed intervals over which many real-world datasets exhibit repeated patterns. Disregarding these seasonal patterns can result in improper forecasting. In the context of seasonality, we need seasonal differences and seasonal AR and MA terms in addition to p and q values. Keep in mind that a series can have more than one kind of seasonality.
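Seasonal differencing works like ordinary differencing, except that each observation is compared with the one a full season earlier. The sketch below uses a made-up quarterly pattern with a period of 4 plus a linear trend. Both are assumptions for illustration only.

```python
import numpy as np

period = 4                                    # e.g. quarterly data
pattern = np.array([10.0, 20.0, 15.0, 5.0])   # made-up seasonal shape
trend = 0.5 * np.arange(12)                   # made-up linear trend
series = np.tile(pattern, 3) + trend          # three "years" of data

# Seasonal difference: y_t - y_{t-period}
seasonal_diff = series[period:] - series[:-period]
print(seasonal_diff)
```

The repeating pattern cancels out completely, leaving only the constant amount the trend adds over one season.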
Residual analysis
One of the most important steps in ARIMA modeling is to check whether the residuals behave like white noise. Residuals are the difference between observed values and those produced by a model. By looking at the residuals, we can check if our model is able to find and work with the dynamics in the data. The residuals should show a random scatter without indicating any trends or correlations.
When the residuals show patterns or correlations, it means that there is information somewhere that the model has not completely captured. Statistical tests and visual diagnostics, including the Ljung-Box test, as well as histograms and other diagnostic plots can be used to verify that the model is adequate.
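To show what the Ljung-Box test measures, here is a hedged sketch that computes its Q statistic by hand on two simulated series: white noise and a deliberately autocorrelated series standing in for "bad" residuals. The helper `ljung_box_q` is a hypothetical name. The critical value 18.307 is the 95% quantile of a chi-square distribution with 10 degrees of freedom; note that for residuals of a fitted ARIMA(p, d, q) model the degrees of freedom are usually reduced by p + q, which this sketch does not do.

```python
import numpy as np

def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    x = np.asarray(residuals, dtype=float)
    n = len(x)
    centered = x - x.mean()
    denom = centered @ centered
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = (centered[k:] @ centered[:n - k]) / denom
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

CHI2_95_DF10 = 18.307   # 95% quantile of chi-square with 10 degrees of freedom

rng = np.random.default_rng(1)
white = rng.standard_normal(2000)

# Deliberately autocorrelated "residuals": AR(1) with coefficient 0.7
correlated = np.zeros(2000)
for t in range(1, 2000):
    correlated[t] = 0.7 * correlated[t - 1] + white[t]

q_white = ljung_box_q(white, 10)
q_corr = ljung_box_q(correlated, 10)
print(f"white noise Q = {q_white:.1f}, correlated Q = {q_corr:.1f}")
print("correlated series exceeds critical value:", q_corr > CHI2_95_DF10)
```

A Q statistic far above the critical value signals that autocorrelation remains in the residuals, meaning the model has left structure unexplained.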
Next Steps with ARIMA and Related Models
In many cases, ARIMA is not the final step. Just as ARIMA is an evolution of autoregressive or moving average models, newer ideas have been developed as well. For one thing, ARIMA models are linear and do not handle seasonality on their own. If you want a seasonal forecast, consider SARIMA models, which can handle periodic patterns in our time series. SARIMA models are especially useful in areas where data has a recurring pattern or cyclic behavior, like sales forecasting and weather prediction. ARIMAX models are another popular option. ARIMAX models are ARIMA models that take an external variable, or exogenous regressor, which can improve the accuracy of our forecasts when that variable is informative.
Furthermore, moving into the realm of machine learning, we can dive deeper into time series analysis using tools such as recurrent neural networks (RNNs) and LSTMs, which can capture complex, non-linear temporal dependencies. As a final thought, Bayesian time series analysis is also worth exploring for the benefits it can bring to forecasting and decision-making.
Conclusion / Final Thoughts
As we have seen, ARIMA is a common statistical model that describes a time series and predicts future values by taking into account both autoregressive and moving average elements. It lets us generate forecasts from historical data across datasets with very different characteristics. This adaptability is what makes it such a widely used forecasting method.
Hands-on experience is important to master the basics of ARIMA. DataCamp offers complete courses based on your learning needs to improve and master the subject of ARIMA modeling. In these top online tutorials, learn the fundamentals of ARIMA modeling as well as the most practical tools and techniques for implementing analytical solutions that solve hard real-world problems with far less effort (and in less time) than you ever thought possible. By the end, you should feel comfortable applying ARIMA modeling in your future data science work. Check out the ARIMA modeling courses available on DataCamp and achieve your maximum potential in time series analysis: Forecasting in R, Time Series with R, ARIMA Models in Python, ARIMA Models in R.
I am a part-time lecturer in the Department of Computer Science Apprenticeship program at An-Najah National University, with 5 years of software engineering experience and 5+ years in data science. My current research focuses on education technology (EdTech), FATE in AI, and data science for social good. In addition to lecturing, I am an after-school coding educator teaching Python programming, a technical mentor sharing knowledge and insights into the latest trends and techniques, and a professional trainer who teaches university students job search skills for the tech world (interviewing, CV building, job selection, portfolios, and career coaching).
Frequently Asked Time Series Forecasting Questions
What is an ARIMA model?
An ARIMA (Autoregressive Integrated Moving Average) model is a popular statistical method for time series forecasting that predicts future values by combining past observations (AR), differencing to achieve stationarity (I), and past errors to refine predictions (MA).
What is the difference between ARIMA vs. exponential smoothing?
We can say that ARIMA models describe a series by its autocorrelations. It is a stronger and more flexible technique that tends to be used to provide stable, long-term forecasts. Exponential smoothing looks more at the trend and seasonality and is typically thought to perform better for short, volatile series.
How is ARIMA forecasting related to regression modeling?
ARIMA forecasting is related to regression modeling as it uses past values and errors to predict future data points, similar to how regression models predict dependent variables using independent ones. ARIMA also handles non-stationary time series by differencing, which aligns it with regression techniques used on stationary data. Unlike traditional regression models, ARIMA explicitly accounts for time-dependent effects, making it more suited for time series forecasting.