
ARIMA Models in Python

ARMA Models

Intro to time series and stationarity
# Creating dummy data for exercise including 3 examples of trends

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a date range
date_rng = pd.date_range(start='2020-01-01', end='2021-01-01', freq='D')

# Create a DataFrame with dummy time series data including different trends
df = pd.DataFrame(date_rng, columns=['date'])

# Linear trend
df['linear_trend'] = np.random.randn(len(date_rng)) + np.linspace(0, 10, len(date_rng))

# Exponential trend
df['exponential_trend'] = np.random.randn(len(date_rng)) + np.exp(np.linspace(0, 2, len(date_rng)))

# Seasonal trend (sinusoidal)
df['seasonal_trend'] = np.random.randn(len(date_rng)) + 5 * np.sin(np.linspace(0, 2 * np.pi, len(date_rng)))

# Set the date column as the index
df.set_index('date', inplace=True)

# Display the first few rows of the DataFrame
df.head()
# Trend

fig, ax = plt.subplots()
df.plot(ax=ax)
plt.show()

Seasonality, Cyclicality & White Noise

Seasonality

Seasonality refers to periodic fluctuations in time series data that occur at regular intervals due to seasonal factors. These patterns repeat over a specific period, such as daily, monthly, or yearly. For example, retail sales often increase during the holiday season, and electricity consumption may rise during summer due to air conditioning use. Seasonality is predictable and can be accounted for in time series models.

Cyclicality

Cyclicality refers to fluctuations in time series data that occur at irregular intervals and are usually influenced by economic or business cycles. Unlike seasonality, cyclic patterns do not have a fixed period and can vary in duration and amplitude. For example, economic recessions and expansions are cyclical but do not occur at regular intervals. Cyclicality is harder to predict and often requires more complex modeling techniques.

White Noise

White noise is a random signal with a constant power spectral density. In the context of time series analysis, white noise refers to a sequence of random variables that are uncorrelated and have a mean of zero and a constant variance. White noise is used as a benchmark to test the randomness of a time series. If a time series is purely white noise, it indicates that there is no discernible pattern or structure in the data.
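These properties are easy to check numerically. The sketch below generates Gaussian white noise and verifies that its sample mean is near zero, its variance is roughly constant (near 1 here), and its lag-1 autocorrelation is negligible.

```python
import numpy as np

# Simulated white noise: i.i.d. draws with mean 0 and constant variance
rng = np.random.default_rng(42)
white_noise = rng.normal(loc=0, scale=1, size=1000)

# Sample statistics should be close to the theoretical values
print(f"mean: {white_noise.mean():.3f}")     # close to 0
print(f"variance: {white_noise.var():.3f}")  # close to 1

# Consecutive values should be uncorrelated
lag1 = np.corrcoef(white_noise[:-1], white_noise[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.3f}")  # close to 0
```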

Summary
  • Seasonality: Regular, predictable patterns that repeat over fixed periods.
  • Cyclicality: Irregular, unpredictable patterns influenced by broader economic or business cycles.
  • White Noise: Random, uncorrelated data with no discernible pattern, mean of zero, and constant variance.

What is Stationarity?

In time series analysis, stationarity is a fundamental concept that refers to a time series whose statistical properties such as mean, variance, and autocorrelation are constant over time. A stationary time series is essential for many statistical modeling techniques because it simplifies the analysis and forecasting processes.

Criteria for Stationarity

A time series is considered stationary if it meets the following three criteria:

  1. Constant Mean:

    • The mean of the series should not change over time. This implies that the series fluctuates around a constant level.
  2. Constant Variance:

    • The variance of the series should remain constant over time. This means that the spread or dispersion of the series does not change.
  3. Constant Autocovariance:

    • The autocovariance (or autocorrelation) of the series should depend only on the lag between observations and not on the actual time at which the covariance is computed. This ensures that the relationship between values at different times is consistent.
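An informal check of the first two criteria is to compare summary statistics across different windows of the series. The sketch below, using simulated data, contrasts white noise (stationary: both halves have similar means) with a random walk (non-stationary: its mean drifts over time).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Stationary example: white noise fluctuating around a constant level
stationary = pd.Series(rng.normal(0, 1, 1000))

# Non-stationary example: a random walk, whose level drifts over time
random_walk = pd.Series(rng.normal(0, 1, 1000).cumsum())

# For a stationary series the two halves should have similar means
print(f"stationary halves:  mean {stationary[:500].mean():.2f} vs {stationary[500:].mean():.2f}")
print(f"random walk halves: mean {random_walk[:500].mean():.2f} vs {random_walk[500:].mean():.2f}")
```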

Why is Stationarity Important?

Stationarity is crucial because many time series models, such as ARIMA (AutoRegressive Integrated Moving Average), assume that the underlying time series is stationary. Non-stationary data can lead to unreliable and spurious results, making it difficult to make accurate predictions.

How to Test for Stationarity?

Several statistical tests can be used to check for stationarity, including:

  • Augmented Dickey-Fuller (ADF) Test: Tests the null hypothesis that a unit root is present in the time series.
  • Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test: Tests the null hypothesis that the time series is stationary around a deterministic trend.
  • Phillips-Perron (PP) Test: Similar to the ADF test but more robust to certain types of heteroscedasticity and autocorrelation.

If a time series is found to be non-stationary, techniques such as differencing, detrending, or transformation can be applied to achieve stationarity.

Conclusion

Understanding and ensuring stationarity is a critical step in time series analysis. By meeting the criteria of constant mean, constant variance, and constant autocovariance, analysts can apply various statistical models more effectively and make more reliable forecasts.

# Train-test split

# Train data - all data up to the end of 2018
df_train = df.loc[:'2018']

# Test data - all data from 2019 onwards
df_test = df.loc['2019':]
# Exercises

# Import modules
import pandas as pd
import matplotlib.pyplot as plt

# Load in the time series
# Updated the file path to the correct location
candy = pd.read_csv('datasets/candy_production.csv', 
            index_col='date',
            parse_dates=True)

# Plot and show the time series on axis ax1
fig, ax1 = plt.subplots()
candy.plot(ax=ax1)
plt.show()

# Split the data into a train and test set
candy_train = candy.loc[:'2006']
candy_test = candy.loc['2007':]

# Create an axis
fig, ax = plt.subplots()

# Plot the train and test sets on the axis ax
candy_train.plot(ax=ax)
candy_test.plot(ax=ax)
plt.show()
Making time series stationary
Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine whether a given time series is stationary or not. Stationarity is a crucial property for time series analysis and forecasting, as many models assume that the underlying time series is stationary.

Key Concepts
  • Stationarity: A time series is said to be stationary if its statistical properties such as mean, variance, and autocorrelation are constant over time.
  • Null Hypothesis (H0): The time series has a unit root (i.e., it is non-stationary).
  • Alternative Hypothesis (H1): The time series does not have a unit root (i.e., it is stationary).
Test Procedure
  1. Formulate Hypotheses:

    • Null Hypothesis (H0): The time series has a unit root (non-stationary).
    • Alternative Hypothesis (H1): The time series does not have a unit root (stationary).
  2. Calculate Test Statistic: The ADF test calculates a test statistic and compares it to critical values from the Dickey-Fuller distribution.

  3. Decision Rule:

    • If the test statistic is less than the critical value (equivalently, if the p-value is below the chosen significance level, e.g. 0.05), reject the null hypothesis: the series is stationary.
    • If the test statistic is greater than the critical value, fail to reject the null hypothesis: we cannot rule out that the series is non-stationary.
Python Implementation

You can perform the ADF test using the adfuller function from the statsmodels library in Python. Here is an example:

from statsmodels.tsa.stattools import adfuller

# Perform the ADF test
result = adfuller(candy['IPG3113N'])

# Extract and print the test statistic and p-value
test_statistic = result[0]
p_value = result[1]
print(f'Test Statistic: {test_statistic}')
print(f'p-value: {p_value}')

# Critical values for different confidence levels
for key, value in result[4].items():
    print(f'Critical Value ({key}): {value}')
# Creating dummy data for the exercise

import pandas as pd
import numpy as np

# Generate a date range
date_rng = pd.date_range(start='2020-01-01', end='2021-01-01', freq='D')

# Create a DataFrame with dummy data
df = pd.DataFrame(date_rng, columns=['date'])
df['close'] = np.random.randn(len(date_rng)) * 20 + 100  # Random data for 'close' column

# Applying the adfuller test

from statsmodels.tsa.stattools import adfuller

results = adfuller(df['close'])

print(results)
# Taking the first difference of the 'close' column and dropping the leading NaN

df_stationary = df['close'].diff().dropna()

print(df_stationary)
# Exercises

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

# Sample data creation for demonstration purposes
# Replace this with actual data loading in practice
earthquake = pd.DataFrame({
    'earthquakes_per_year': np.random.randint(0, 10, size=100)
})
city = pd.DataFrame({
    # A random walk around 10,000, so the series is non-stationary
    # and the differencing steps below actually have something to remove
    'city_population': 10000 + np.cumsum(np.random.randn(100)) * 100
})
amazon = pd.DataFrame({
    'close': np.random.rand(100) * 100
})

# Run test on earthquake data
result = adfuller(earthquake['earthquakes_per_year'])

# Print test statistic
print(f'Test Statistic: {result[0]}')

# Print p-value
print(f'p-value: {result[1]}')

# Print critical values
print(f'Critical Value (1%): {result[4]["1%"]}')
print(f'Critical Value (5%): {result[4]["5%"]}')
print(f'Critical Value (10%): {result[4]["10%"]}')

# Run test on city data
result = adfuller(city['city_population'])

# Print test statistic
print(result[0])

# Print p-value
print(result[1])

# Print critical values
print(result[4]) 

# Run the ADF test on the time series
result = adfuller(city['city_population'])

# Plot the time series
fig, ax = plt.subplots()
city.plot(ax=ax)
plt.show()

# Print the test statistic and the p-value
print('ADF Statistic:', result[0])
print('p-value:', result[1])

# Calculate the first difference of the time series
city_stationary = city.diff().dropna()

# Run ADF test on the differenced time series
result = adfuller(city_stationary['city_population'])

# Plot the differenced time series
fig, ax = plt.subplots()
city_stationary.plot(ax=ax)
plt.show()

# Print the test statistic and the p-value
print('ADF Statistic:', result[0])
print('p-value:', result[1])

# Calculate the second difference of the time series
city_stationary = city.diff().diff().dropna()

# Run ADF test on the differenced time series
result = adfuller(city_stationary['city_population'])

# Plot the differenced time series
fig, ax = plt.subplots()
city_stationary.plot(ax=ax)
plt.show()

# Print the test statistic and the p-value
print('ADF Statistic:', result[0])
print('p-value:', result[1])

# Calculate the first difference and drop the nans
amazon_diff = amazon.diff().dropna()

# Run test and print
result_diff = adfuller(amazon_diff['close'])
print(result_diff)

# Calculate log-return and drop nans
amazon_log = np.log(amazon / amazon.shift(1))
amazon_log = amazon_log.dropna()

# Run test and print
result_log = adfuller(amazon_log['close'])
print(result_log)
Intro to AR, MA and ARMA models