
Box-Jenkins Methodology for ARIMA Models

Last Updated : 23 Jul, 2025

Time series data records observations at successive points in time. Analyzing such data is important for recognizing patterns, making predictions and extracting useful insights. The Box-Jenkins method is a systematic approach to building ARIMA models that forecast a time series over a chosen horizon.

In this article, we take a closer look at the Box-Jenkins method for ARIMA modelling and how it helps us analyze and forecast time series data.


Let us first review what an ARIMA model is, so that the rest of the process is easier to follow.

ARIMA Modelling

ARIMA (Autoregressive Integrated Moving Average) is a time series analysis and forecasting method. An ARIMA model combines autoregression, differencing and a moving average to model a time series. Let's break it down and discuss the components one by one:

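Autoregressive (AR): the current value is modelled as a linear function of its previous p values.

X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + \epsilon_t

Where:

X_t is the value of the series at time t, c is a constant, \phi_1, \dots, \phi_p are the autoregressive coefficients, and \epsilon_t is the white-noise error at time t.

Integrated (I): the series is differenced d times to remove trends and make it stationary. One round of differencing replaces X_t with X_t - X_{t-1}.

Moving Average (MA): the current value is modelled as a linear function of the current and previous q error terms.

X_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}

Where:

\theta_1, \dots, \theta_q are the moving average coefficients, and \epsilon_{t-1}, \dots, \epsilon_{t-q} are the errors from previous time steps.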
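As a small illustration of the differencing step (the "I" in ARIMA), the sketch below applies first-order differencing d times to a plain Python list. The function name `difference` and the toy series are assumptions made for this example; they are not part of any library.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: X'_t = X_t - X_{t-1}."""
    values = list(series)
    for _ in range(d):
        # Each pass shortens the series by one observation.
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# A quadratic trend needs two rounds of differencing to become constant.
squares = [1, 4, 9, 16, 25]
print(difference(squares, d=1))  # [3, 5, 7, 9]
print(difference(squares, d=2))  # [2, 2, 2]
```

Each round of differencing removes one order of polynomial trend, which is why d is usually small (0, 1 or 2) in practice.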

ARIMA(p,d,q):

An ARIMA(p, d, q) model combines the AR, I and MA components. If X_t denotes the series after it has been differenced d times, the general form is:

X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}
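To make the general form concrete, here is a minimal sketch that generates a series directly from the recursion, using only the standard library. The helper name `simulate_arma`, the coefficient values and the choice to treat values before the start of the series as zero are all assumptions made for illustration.

```python
import random


def simulate_arma(phi, theta, c, n, seed=0):
    """Generate n values of
    X_t = c + sum_i phi_i * X_{t-i} + eps_t + sum_j theta_j * eps_{t-j}.

    Terms that would reach before the start of the series are skipped,
    which is equivalent to treating those values as zero.
    """
    rng = random.Random(seed)
    x, eps = [], []
    for t in range(n):
        eps.append(rng.gauss(0.0, 1.0))  # white-noise error eps_t
        ar_part = sum(p * x[t - i] for i, p in enumerate(phi, start=1) if t - i >= 0)
        ma_part = sum(q * eps[t - j] for j, q in enumerate(theta, start=1) if t - j >= 0)
        x.append(c + ar_part + eps[t] + ma_part)
    return x


# Hypothetical ARMA(1, 1) process with phi_1 = 0.6 and theta_1 = 0.3.
series = simulate_arma(phi=[0.6], theta=[0.3], c=0.0, n=200)
print(series[:5])
```

Passing empty lists for `phi` and `theta` reduces the process to a constant plus white noise, which is a quick way to check the recursion.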

The general ARIMA forecasting process involves selecting appropriate values for p, d, and q, estimating the model parameters, and using the model to make predictions. The Box-Jenkins methodology is often used for identifying and fitting ARIMA models to time series data.

Let's discuss the Box-Jenkins method in detail now.

Box-Jenkins Method

The Box-Jenkins method is a methodology for analyzing and forecasting time series data. It is classically described as three stages (identification, estimation and diagnostic checking), followed by model refinement where needed and, finally, forecasting. The method is iterative: the steps from identification to model refinement are often repeated until a suitable, well-diagnosed model is obtained. It is important to note that the method assumes the series is generated by a linear process that is stationary, or that can be made stationary by differencing. The stages are:

Identification:

Identification is the first step of the Box-Jenkins method. It determines the orders of the autoregressive (AR), differencing (I) and moving average (MA) components that are appropriate for a given time series, that is, the values of p, d and q. In practice this means checking the series for stationarity (for example with the Augmented Dickey-Fuller test), differencing it until it is stationary to fix d, and then reading the ACF and PACF plots of the stationary series to suggest values for q and p.
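The autocorrelation function used during identification can be computed by hand. The sketch below is a minimal, standard-library version of the sample ACF together with the usual approximate 95% bound for white noise; the names `sample_acf` and `significance_bound` are hypothetical helpers for this example, and the PACF is not covered here.

```python
import math


def sample_acf(series, max_lag):
    """Return the sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)  # n times the sample variance
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def significance_bound(n):
    """Approximate 95% bound for the autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)


series = [1, 2, 3, 4, 5]
print(sample_acf(series, max_lag=2))  # [1.0, 0.4, -0.1]
print(significance_bound(len(series)))
```

Lags whose autocorrelation lies outside the bound are the ones worth considering when choosing q. On a series this short the bound is very wide, so the toy data only demonstrates the arithmetic.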

Estimation:

Estimation is the second stage in the Box-Jenkins methodology for ARIMA modeling. In this stage, the coefficients of the identified ARIMA model, namely the autoregressive (AR) and moving average (MA) coefficients and the constant, are estimated from the historical time series data, typically by maximum likelihood or least squares. The differencing order d is fixed during identification rather than estimated. The primary goal is to fit the chosen ARIMA model to the observed data.
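To show what estimation means in the simplest case, the sketch below fits an AR(1) model by ordinary least squares, regressing each value on its predecessor. This is a simplified stand-in for the maximum likelihood estimation that ARIMA software typically performs, and `fit_ar1` is a hypothetical helper name.

```python
def fit_ar1(series):
    """Estimate c and phi in X_t = c + phi * X_{t-1} + eps_t by least squares."""
    lagged, current = series[:-1], series[1:]
    n = len(current)
    mean_x = sum(lagged) / n
    mean_y = sum(current) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, current))
    var = sum((x - mean_x) ** 2 for x in lagged)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


# Noise-free series built from X_t = 1 + 0.5 * X_{t-1}, so the fit is exact.
data = [10.0]
for _ in range(6):
    data.append(1 + 0.5 * data[-1])

c_hat, phi_hat = fit_ar1(data)
print(round(c_hat, 6), round(phi_hat, 6))  # 1.0 0.5
```

With real, noisy data the estimates only approximate the true coefficients, and models with MA terms need iterative optimization because the past errors are not observed directly.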

Diagnostic Checking:

Diagnostic checking is an important step in the Box-Jenkins methodology for ARIMA modeling. It evaluates the adequacy of the fitted ARIMA model by examining the residuals, which are the differences between the observed and fitted values. The goal is to ensure that the residuals behave like white noise and do not contain any remaining patterns or structure. Typical checks are a plot of the residuals over time, the ACF of the residuals, and a portmanteau test such as the Ljung-Box test.
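The Ljung-Box statistic used for residual checks is Q = n(n+2) Σ r_k² / (n − k), summed over lags k = 1..h, where r_k is the lag-k autocorrelation of the residuals. The sketch below computes it with the standard library only; `ljung_box_q` is a hypothetical name, and the p-value (which comes from a chi-square distribution) is left out.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Lag-k sample autocorrelation of the residuals
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Toy input: r_1 = 0.4 and n = 5, so Q = 5 * 7 * 0.16 / 4 = 1.4
print(round(ljung_box_q([1, 2, 3, 4, 5], max_lag=1), 6))  # 1.4
```

A large Q relative to the chi-square reference distribution suggests that autocorrelation remains in the residuals and the model should be refined.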

Model Refinement:

The model refinement stage in the Box-Jenkins method involves a thorough evaluation of the estimated ARIMA model to ensure that it meets the required statistical assumptions and adequately captures the patterns in the time series data. If the diagnostics reveal problems, the model is refined by changing the autoregressive, differencing or moving average orders, or by adding factors that were not considered earlier. After each change, the model is re-estimated and the diagnostic checks are performed again.

Once a satisfactory model has been identified and validated, it can be used to forecast future values of the time series. Now let's discuss the application of the Box-Jenkins method.
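Forecasting with a fitted model amounts to iterating the model equation forward with future errors set to their expected value of zero. The sketch below does this for an AR(1) model; the function name `forecast_ar1` and the coefficient values are assumptions chosen for illustration.

```python
def forecast_ar1(c, phi, last_value, steps):
    """Multi-step forecasts for X_t = c + phi * X_{t-1} + eps_t.

    Future errors are replaced by their expected value of zero.
    """
    forecasts = []
    prev = last_value
    for _ in range(steps):
        prev = c + phi * prev
        forecasts.append(prev)
    return forecasts


print(forecast_ar1(c=1.0, phi=0.5, last_value=4.0, steps=3))  # [3.0, 2.5, 2.25]
```

For a stationary AR(1) model the forecasts converge to the long-run mean c / (1 - phi), which is 2 in this example.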

Application of Box-Jenkins Methodology

Here we use Apple stock data from yfinance and apply the Box-Jenkins method to analyze it. The step-by-step code with explanations follows:

Importing Libraries:

The code imports the necessary libraries: yfinance for downloading stock price data, pandas for data manipulation, matplotlib.pyplot for plotting, statsmodels for time series analysis and ARIMA modeling, and warnings to suppress warnings during execution.

```python
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

import warnings
warnings.filterwarnings('ignore')
```

Function Definitions:

Next we define two helper functions: one checks stationarity using the Augmented Dickey-Fuller (ADF) test, and the other plots the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF).

```python
# Function to check stationarity using Augmented Dickey-Fuller test
def check_stationarity(ts):
    result = adfuller(ts)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    print(f'Critical Values: {result[4]}')

# Function to plot ACF and PACF
def plot_acf_pacf(ts):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    plot_acf(ts, ax=ax1, lags=20)
    plot_pacf(ts, ax=ax2, lags=20)
    plt.show()
```

Data Loading and Preprocessing:

Stock price data for Apple Inc. (AAPL) is downloaded using yfinance. The data is collected from the start of 2015 to the start of 2023. Log returns are calculated to stabilize variance and make the time series more suitable for modeling.

```python
import numpy as np  # pd.np was removed from pandas, so use NumPy directly

# Load daily closing prices
stock_symbol = "AAPL"
start_date = "2015-01-01"
end_date = "2023-01-01"
# squeeze() turns a single-column DataFrame into a Series if needed
stock_data = yf.download(stock_symbol, start=start_date, end=end_date)['Close'].squeeze()

# Log returns: log(1 + simple return) = log(P_t / P_{t-1})
log_returns = np.log(1 + stock_data.pct_change().dropna())
```

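Stationarity Check and Differencing:

The stationarity of the log returns is checked with the ADF test, then the series is differenced once and checked again. ACF and PACF plots are created for the differenced series to help determine the ARIMA orders. Note that if the first test already rejects a unit root, as it does in the output below, the log returns are stationary as they stand and d = 0 is appropriate; the extra differencing is then not needed.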

```python
# Check stationarity
check_stationarity(log_returns)

# Differencing to make the series stationary
log_returns_diff = log_returns.diff().dropna()

# Check stationarity after differencing
check_stationarity(log_returns_diff)

# Plot ACF and PACF after differencing
plot_acf_pacf(log_returns_diff)
```

Output:

ADF Statistic: -13.869148958528394
p-value: 6.51329302121344e-26
Critical Values: {'1%': -3.4336173133865064, '5%': -2.86298332472282, '10%': -2.5675383641200633}
ADF Statistic: -14.058039719328459
p-value: 3.091971442666415e-26
Critical Values: {'1%': -3.433648628001351, '5%': -2.8629971502062155, '10%': -2.5675457254979093}

Figure: ACF and PACF plots of the differenced log returns
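The ADF output is read by comparing the test statistic with the critical values: a statistic below (more negative than) the critical value rejects the null hypothesis of a unit root at that level. The sketch below encodes that rule using the numbers printed for the log returns; `rejects_unit_root` is a hypothetical helper name.

```python
def rejects_unit_root(adf_stat, critical_values, level="5%"):
    """True when the ADF statistic is below the critical value at `level`."""
    return adf_stat < critical_values[level]


# Values copied from the first ADF output above (log returns).
adf_stat = -13.869148958528394
critical_values = {
    "1%": -3.4336173133865064,
    "5%": -2.86298332472282,
    "10%": -2.5675383641200633,
}

print(rejects_unit_root(adf_stat, critical_values))        # True
print(rejects_unit_root(adf_stat, critical_values, "1%"))  # True
```

Since the undifferenced log returns already reject a unit root even at the 1% level, they can be treated as stationary.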

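Model Order Selection with AIC and BIC

The code iterates through candidate values of p, d and q and keeps a combination whenever it improves on both the AIC and the BIC of the best model found so far, which helps identify a suitable ARIMA order. Lower values of either criterion indicate a better trade-off between goodness of fit and model complexity. Information criteria are generally not comparable across different values of d, so in practice d is usually fixed from the stationarity check first.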

```python
# Find optimal values for p, d, q based on AIC and BIC
best_aic = float('inf')
best_bic = float('inf')
best_order = None

for p in range(3):          # Choose a range for p
    for d in range(2):      # Choose a range for d
        for q in range(3):  # Choose a range for q
            arima_model = ARIMA(log_returns, order=(p, d, q))
            arima_results = arima_model.fit()

            # Calculate AIC and BIC
            current_aic = arima_results.aic
            current_bic = arima_results.bic

            # Update best values only when both criteria improve
            if current_aic < best_aic and current_bic < best_bic:
                best_aic = current_aic
                best_bic = current_bic
                best_order = (p, d, q)

print(f'Best AIC: {best_aic}, Best BIC: {best_bic}, Best Order: {best_order}')
```

Output:

Best AIC: -10277.232291010881, Best BIC: -10260.410146733962, Best Order: (0, 0, 1)
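For reference, the two criteria are defined as AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of estimated parameters, n the number of observations and ln L the maximized log-likelihood. The sketch below computes both; the log-likelihoods and parameter counts are made-up numbers chosen to show that the criteria can disagree, not values from the AAPL fit.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical candidates fitted to n = 200 observations.
n = 200
model_a = {"log_likelihood": -100.0, "k": 2}  # simpler model
model_b = {"log_likelihood": -97.0, "k": 4}   # better fit, more parameters

for name, m in (("A", model_a), ("B", model_b)):
    print(name, aic(m["log_likelihood"], m["k"]), round(bic(m["log_likelihood"], m["k"], n), 2))
# AIC prefers B (202 vs 204), while BIC prefers A because it penalizes
# extra parameters more heavily when n is large.
```

This is one reason a selection rule has to decide which criterion takes priority when the two do not agree.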

ARIMA Model Fitting and Diagnostics:

The ARIMA model is fitted using the optimal orders obtained from the AIC and BIC selection process. Diagnostics are performed on the residuals, including checking for stationarity. The Ljung-Box test is conducted to assess the autocorrelation in residuals.

```python
# Fit ARIMA model with the best order
arima_model = ARIMA(log_returns, order=best_order)
arima_results = arima_model.fit()

# Diagnostics: the residuals should look like stationary white noise
residuals = arima_results.resid
check_stationarity(residuals)

# Ljung-Box test for autocorrelation in residuals.
# In recent statsmodels versions acorr_ljungbox returns a DataFrame with
# columns lb_stat and lb_pvalue (one row per lag), so print it directly
# instead of unpacking it into two variables.
lb_test = acorr_ljungbox(residuals, lags=20)
print(lb_test)
```

Output:

ADF Statistic: -13.478138873971695
p-value: 3.2812344010002946e-25
Critical Values: {'1%': -3.4336189466940414, '5%': -2.8629840458358933, '10%': -2.5675387480760885}

The Ljung-Box call then prints a table with the columns lb_stat and lb_pvalue, one row per lag from 1 to 20. p-values above 0.05 indicate no significant residual autocorrelation up to that lag.

Plotting Results:

Finally, the observed log returns and the fitted values from the ARIMA model are plotted to visualize the model's performance.

```python
# Plot the observed log returns against the in-sample fitted values.
# The model was fitted on log_returns, so that is the series to compare with.
plt.figure(figsize=(12, 6))
plt.plot(log_returns, label='Observed')
plt.plot(arima_results.fittedvalues, color='red', label='Fitted', alpha=0.7)
plt.legend()
plt.title(f'ARIMA{best_order} Model for {stock_symbol} Stock Returns')
plt.show()
```

Output:

Figure: Observed vs fitted model with best order

The code above walks through the Box-Jenkins methodology for stock returns: stationarity checks, differencing, order selection, model fitting, diagnostics and visualization of the fit. Adjustments to the model orders and parameters may be necessary based on the diagnostic results.