A significant part of the time series literature considers the ability of a model to predict the future behaviour of a variable with a reasonable degree of accuracy. This objective is important, as most decisions taken today are based on what we think will happen in the future. Depending on the decision under consideration, the future can of course be the next minute, day, month, year, etc. For example, a day-trader may be interested in the price of a share in the next minute, hour or day, while the governor of a central bank may be interested in the rate of inflation over the next year.

Forecasting the future is not an easy thing to do, especially when it comes to economic variables that reflect the complex interactions between individuals, firms and organizations. This has led to the development of a plethora of models, ranging from large, sophisticated mathematical constructs to those that are fairly simple. In many instances, we find that the forecasts from these models are comparable, and the tools that are used to evaluate them need to be carefully applied.

When evaluating forecasting models, we need to make use of ex-post data (i.e. data that is realised after the forecasts have been generated). In these instances, we would utilise an explicit loss function that considers whether a forecasting model provides a favourable estimate of the expected future value of a variable. For example, such a loss function would usually penalise forecasts that deviate from the values that are eventually realised.

When comparing the forecasts that have been generated by two models we may also be interested in determining whether or not they are significantly different from one another. Separate tests have been developed for these investigations where the models are either nested or non-nested. The application of these statistics should be an important part of any forecasting exercise.

In addition, there may be a large degree of uncertainty associated with the forecasts of various models, which is usually of interest to economists and financial analysts. If the uncertainty is an inherent property of the variable we are looking to forecast, this might be a good thing. However if it is not, then this could make the forecast less useful. In many respects, different loss functions emphasize either the accuracy of the point estimate or confidence around this point, and both of these topics will be discussed in what follows.

Lastly, many large institutions employ a suite of forecasting models, where the final forecast is some weighted average of the forecasts from many different models. Several empirical studies have suggested that by employing a combination of forecasting models, one is able to reduce the instances where there are large forecasting errors.

1 Forecasting notation

To describe the procedure for constructing forecasts, we firstly need to introduce some notation. By way of example, if we have quarterly data and want to obtain forecasts over the next eight quarters (i.e. two years), then we would want to generate eight successive forecasts. Therefore, we would want to generate a number of \(h\)-step ahead forecasts, where \(h=\{1,2, \ldots , 8\}\). The end of the forecasting horizon may be represented by \(H\), where in this case, \(H=8\).

In certain instances, we will refer to the information set, which is denoted \(I_t\). This term refers to the information, relating to the values of a variable, that is available at a particular point in time. Therefore, in most cases \(I_t\) would refer to information pertaining to all the past realised values of a particular variable or variables.

In an out-of-sample evaluation of a model we would want to compare the forecasts against future realisations of the data. To complete this exercise the researcher would divide the available sample of \(T + H\) observations into an in-sample portion of size \(R\) and an out-of-sample portion of size \(P\). They would then obtain a sequence of \(h\)-step-ahead out-of-sample forecasts of the variable of interest, which we denote \(y_t\) in the univariate case, using the information set \(I_t\), such that \(R + P - 1 + H = T + H\) and \(H < \infty\).

Figure 1: Notation for different forecasting schemes

The notation that is used in forecasting exercises differs slightly from what we had previously, as the entire sample of realised values is represented by \(T+H\). When looking to perform an out-of-sample forecasting evaluation one would usually generate a number of successive forecasts, where the information set would include an additional observation for each successive forecast, as the positioning of \(I_t\) moves towards \(T\). These out-of-sample exercises could take on one of two forecasting schemes, where a recursive estimation scheme would use all available information (from period \(t=0\) to \(I_t\)). In contrast, a rolling window forecasting scheme would use a fixed number of observations for the in-sample estimation period (such that the start of the estimation window moves forward with each successive forecast). The rolling forecast scheme would be more appropriate when you believe that there may be structural breaks in the data.
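To make the distinction between the two schemes concrete, the following Python sketch (with a hypothetical series and arbitrary sample sizes) shows how the estimation window evolves under each scheme as the forecast origin moves towards \(T\).

```python
import numpy as np

np.random.seed(0)
y = np.random.randn(100)      # hypothetical series of T + H = 100 observations
R = 60                        # end of the initial in-sample period

for origin in range(R, 70):
    recursive_window = y[:origin]            # always starts at the first observation
    rolling_window = y[origin - R:origin]    # fixed length, start moves forward
    # both windows end at the forecast origin; a model estimated on either
    # window would be used to forecast y[origin], y[origin + 1], ...
    print(origin, len(recursive_window), len(rolling_window))
```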

2 Forecasting with autoregressive models

In an autoregressive model, we are able to relate the current value of a process to previous values. For example, we may specify an AR(1) model as follows,

\[\begin{eqnarray} y_{t}=\phi_{1}y_{t-1}+\varepsilon_{t} \tag{2.1} \end{eqnarray}\]

where the residual is Gaussian white noise, \(\varepsilon_{t}\) \(\sim \mathsf{i.i.d.} \; \mathcal{N}(0,\sigma ^{2})\). In this case we can simply iterate the model forward over a number of steps to calculate future values of the process,

\[\begin{eqnarray} y_{t + 1} &=&\phi_{1}y_{t}+\varepsilon_{t + 1} \nonumber \\ y_{t + 2} &=&\phi_{1}y_{t + 1}+\varepsilon_{t + 2} \nonumber \\ \vdots &=& \vdots \nonumber \\ y_{t + H} &=&\phi_{1}y_{t + H-1}+\varepsilon_{t + H} \nonumber \end{eqnarray}\]

This expression may be summarised by inserting the first line into the second line to relate successive future values to the current observed realisation. Hence, for \(h=2\) we have,

\[\begin{eqnarray} y_{t+2} &=&\phi_{1}(\phi_{1}y_{t}+\varepsilon_{t+1})+\varepsilon_{t+2} \nonumber \\ &=&\phi_{1}^{2}y_{t}+\phi_{1}\varepsilon_{t+1}+\varepsilon_{t+2}, \nonumber \end{eqnarray}\]

and after doing this recursively for \(h=3\) we have,

\[\begin{eqnarray} y_{t + 3} & = & \phi_{1}\left(\phi_{1}\left(\phi_{1}y_{t}+\varepsilon_{t + 1} \right) +\varepsilon_{t + 2} \right)+\varepsilon_{t + 3} \nonumber \\ &=&\phi_{1}^{3}y_{t}+\phi_{1}^2\varepsilon_{t+1} + \phi_{1}\varepsilon_{t+2}+\varepsilon_{t+3}, \nonumber \end{eqnarray}\]

This gives rise to a pattern, where after sorting terms and simplifying,

\[\begin{eqnarray} y_{t + h}=\phi_{1}^{h}y_{t}+\overset{h - 1}{\underset{i = 0}{\sum }}\phi_{1}^{i}\varepsilon_{t + h - i} \tag{2.2} \end{eqnarray}\]

Thus \(y_{t + h}\) is a function of \(y_t\), which represents the available information set, \(I_t\), that contains information about all past realised observations of this variable. In this case, the actual observed (i.e. realised) future values of \(y_{t + h}\) will also contain the effects of future shocks. Unfortunately, information about the future realised values of these shocks is not included in \(I_t\). Hence, this part of the future realised value of \(y_{t+h}\) may give rise to a forecasting error.

2.1 Point forecasts

To compute the point forecast of \(y_{t + h}\), we take the conditional expectation, \(\mathbb{E}_t \left[ y_{t + h}|I_t \right]\). In the case of the simple AR(1) in (2.1), this is equivalent to \(\mathbb{E}_t \left[ y_{t + h}|y_{t}\right]\). In addition, since all the future error terms in this model have an expected mean value of zero, we can use (2.2) to calculate the one-step ahead forecast \(\mathbb{E}_t \left[ y_{t + 1}|y_{t} \right] =\phi_{1}y_{t}\). Similarly, the two-step ahead forecast would be derived from \(\mathbb{E}_t \left[ y_{t + 2}|y_{t}\right] =\phi_{1}^{2}y_{t}\), and the more general expression may then be given as,

\[\begin{eqnarray} \mathbb{E}_t \left[ y_{t + h} | y_{t}\right] =\phi_{1}^{h}y_{t} \tag{2.3} \end{eqnarray}\]

This expression is also occasionally referred to as a predictor, which we could denote \(\acute{y}_t(h)\). If the variable that we are seeking to forecast is described by a stable AR(1) (i.e. it is assumed that \(|\phi_{1}|<1\)), and the structure does not include a constant (as in (2.1)), then the term \(\phi_{1}^{h}y_{t}\) would tend towards zero as the forecast horizon increases. Hence,

\[\begin{eqnarray*} \mathbb{E}_t \left[ y_{t + h} | y_{t}\right] \rightarrow 0 \hspace{1cm} \text{when } \; h \rightarrow \infty \nonumber \end{eqnarray*}\]

Therefore, the effect of shocks that may be contained in \(y_{t}\) would dissipate, as the forecast horizon increases.

2.2 Intercept

If we assume that an intercept is included in the stable AR(1) equation, such that,

\[\begin{eqnarray} y_{t }=\mu +\phi_{1}y_{t - 1}+\varepsilon_{t} \tag{2.4} \end{eqnarray}\]

Using the same recursions as above, we are able to derive the expression,

\[\begin{eqnarray} \mathbb{E}_t \left[ y_{t + h} | y_{t}\right] =(1+\phi_{1}+\phi_{1}^{2}+ \ldots + \phi_{1}^{h -1})\mu +\phi_{1}^{h}y_{t} \tag{2.5} \end{eqnarray}\]

Hence, the one-step ahead forecast, where \(h=1\), is simply \(\mathbb{E}_t \left[ y_{t + 1} | y_{t}\right] =\mu +\phi_{1}y_{t}\). Similarly, the two-step ahead forecast, where \(h=2\), is \(\mathbb{E}_t \left[ y_{t + 2} | y_{t}\right] =(1+\phi_{1})\mu +\phi_{1}^{2}y_{t}\). Therefore, when \(h\) goes to infinity, we see that the forecast becomes the unconditional mean of the process,

\[\begin{eqnarray} \mathbb{E}_t \left[ y_{t + h} | y_{t}\right] \longrightarrow \frac{\mu }{1-\phi_{1}} \hspace{1cm} \text{when }h \rightarrow \infty \tag{2.6} \end{eqnarray}\]

This follows from the summation formula for an infinite geometric series, \(\left(1+\phi_1+\phi_1^{2}+\ldots\right) = 1 / (1-\phi_1)\), which holds when \(|\phi_1|<1\).

2.3 The AR(p) model

For completeness, the case of an AR(\(p\)) model with an intercept is provided below,

\[\begin{eqnarray} y_{t}=\mu +\overset{p}{\underset{i = 1}{\sum}} \phi_{i} y_{t - i}+\varepsilon_{t} \tag{2.7} \end{eqnarray}\]

where \(p\) is the maximum lag length, and the residual is Gaussian white noise. For simplicity, we take the conditional expectation at each forecast horizon when computing the forecast recursively.

\[\begin{eqnarray} \mathbb{E}_t \left[ y_{t + 1} | y_{t}\right] &=&\mu +\phi_{1}y_{t }+\phi_{2}y_{t-1 }+ \ldots +\phi_{p}y_{t -p+1} \nonumber \\ \mathbb{E}_t \left[ y_{t + 2} | y_{t}\right] &=&\mu +\phi_{1}\mathbb{E}_t \left[ y_{t + 1} | y_{t}\right] +\phi_{2}y_{t }+ \ldots +\phi_{p}y_{t -p+2} \nonumber \\ \mathbb{E}_t \left[ y_{t + 3} | y_{t}\right] &=&\mu +\phi_{1}\mathbb{E}_t \left[ y_{t + 2} | y_{t}\right] +\phi_{2}\mathbb{E}_t \left[ y_{t + 1} | y_{t}\right] +\phi_{3}y_{t }+ \ldots +\phi_{p}y_{t -p+3} \nonumber \\ \vdots &=& \vdots \nonumber \\ \mathbb{E}_t \left[ y_{t + h} | y_{t}\right] &=&\mu +\overset{h-1}{\underset{i = 1}{\sum}}\phi_{i}\mathbb{E}_t \left[ y_{t + h - i} | y_{t}\right] + \overset{p}{\underset{i = h}{\sum}}\phi_{i} y_{t + h - i} \tag{2.8} \end{eqnarray}\]

where the second summation in the last line applies when \(p \geq h\). When \(p < h\), all of the lagged values on the right-hand side are unobserved, so that the first summation runs from \(i=1\) to \(p\) and each lagged value is replaced by its conditional expectation.

2.4 Example of autoregressive forecasts

To illustrate how one is able to construct a forecast from an autoregressive model, we can work with hypothetical AR(1) and AR(2) examples. In this case we assume that the coefficients are as follows:

  • AR(1) with \(\mu =0.4\) and \(\phi_{1}=0.7\)
  • AR(2) with \(\mu =0.3\), \(\phi_{1}=0.6\) and \(\phi_{2}=0.1\).

For both processes we assume that they start from the same initial values, where \(y_{t}=2\) and \(y_{t-1}=1.5\). Using the formulae in (2.5) for the AR(1) model and (2.8) for the AR(2) model, the forecasts at horizons \(h=1, 2, 3, 5\) and \(10\) for both processes are reported in Table 1.

These results suggest that the forecasts converge on the unconditional mean of the processes as the forecast horizon increases, where for the AR(1) process, \(0.4/(1-0.7)\approx 1.33\). Similarly, for the AR(2) process, \(0.3/(1-0.6-0.1)=1\).

| Forecast horizon | 1-step | 2-step | 3-step | 5-step | 10-step |
|---|---|---|---|---|---|
| AR(1) | 1.80 | 1.66 | 1.56 | 1.45 | 1.35 |
| AR(2) | 1.65 | 1.49 | 1.36 | 1.19 | 1.04 |

Table 1: AR forecasts
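These figures are straightforward to reproduce by iterating the conditional expectations forward. The following Python sketch does so for both processes; the helper function `ar_forecast` is purely illustrative and simply implements the recursion in (2.8).

```python
import numpy as np

def ar_forecast(mu, phi, history, H):
    """Iterate an AR(p) model forward, replacing unknown future values
    with their conditional expectations (equation 2.8)."""
    path = list(history)                      # most recent observation last
    p = len(phi)
    for h in range(H):
        lags = path[::-1][:p]                 # y_t, y_{t-1}, ..., y_{t-p+1}
        path.append(mu + sum(f * l for f, l in zip(phi, lags)))
    return np.array(path[len(history):])      # the H point forecasts

ar1 = ar_forecast(0.4, [0.7], [1.5, 2.0], 10)
ar2 = ar_forecast(0.3, [0.6, 0.1], [1.5, 2.0], 10)
for h in [1, 2, 3, 5, 10]:
    print(f"h={h:2d}  AR(1): {ar1[h-1]:.2f}  AR(2): {ar2[h-1]:.2f}")
# h= 1  AR(1): 1.80  AR(2): 1.65
# ...
# h=10  AR(1): 1.35  AR(2): 1.04
```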

3 Forecast errors and uncertainty

To derive a measure for the forecast errors, we are going to make use of the notation for the predictor, \(\acute{y}_t(h)\equiv \mathbb{E}_t \left[ y_{t + h} | y_{t}\right]\). Hence, the forecast error, \(\acute{e}_t(h)\), in period \(t+h\) is denoted,

\[\begin{eqnarray} \acute{e}_t(h) = y_{t+h}-\acute{y}_t(h) \tag{3.1} \end{eqnarray}\]

where \(y_{t+h}\) is the ex-post realisation of the respective variable. Making use of this expression, together with the recursive formulas that are provided in (2.2) and (2.3), we can calculate the forecast error at different forecast horizons,

\[\begin{eqnarray} \acute{e}_t(1) = y_{t+1}-\acute{y}_t(1) &=&(\phi_{1}y_{t}+\varepsilon_{t+1})-\phi_{1}y_{t}=\varepsilon_{t+1} \nonumber \\ \acute{e}_t(2) = y_{t+2}-\acute{y}_t(2) &=&(\phi_{1}^{2}y_{t}+\phi_{1}\varepsilon_{t+1}+\varepsilon_{t+2})-\phi_{1}^{2}y_{t}=\phi_{1}\varepsilon_{t+1}+\varepsilon_{t+2} \nonumber \\ \vdots &=& \vdots \nonumber \\ \tag{3.2} \end{eqnarray}\]

where at horizon \(h\),

\[\begin{eqnarray} \acute{e}_t(h) &=& y_{t+h}-\acute{y}_t(h)=\left( \phi_{1}^{h}y_{t}+\overset{h -1}{\underset{i =0}{\sum }}\phi_{1}^{i}\varepsilon_{t+h-i}\right) -\phi_{1}^{h}y_{t} \nonumber \\ &=& \overset{h -1}{\underset{i =0}{\sum }}\phi_{1}^{i}\varepsilon_{t+h-i} \tag{3.3} \end{eqnarray}\]

This result suggests that the forecast error is a weighted sum of future shocks, where the weights are the coefficients of the MA representation of the AR(1) process (provided the process is stationary, so that the MA representation exists). When we assume that the errors in the autoregressive model are Gaussian white noise, the expected value of all future realisations of the error terms in (3.3) is zero. Therefore, when we assume that \(\mathbb{E}_t \left[ \varepsilon_{t + h} | I_t\right] = 0\), it would be the case that

\[\begin{eqnarray} \mathbb{E}_t \left[\acute{e}_t(h) \right]= \mathbb{E}_t \left[ y_{t+h}-\acute{y}_t(h)\right] =\mathbb{E}_t \left[ y_{t+h}\right] -\mathbb{E}_t \left[ \acute{y}_t(h)\right] =0 \tag{3.4} \end{eqnarray}\]

If this is the case then it would imply that the predictor is unbiased.

3.1 Mean square errors (MSE)

The MSE is a quadratic loss function that is widely used to evaluate the forecasting accuracy of a particular model. In addition, this statistic may also be used as a measure of the forecast error variance, which would usually need to be calculated when constructing forecast intervals. We may denote the MSE for the \(h\)-step ahead forecast error as \(\acute{\sigma}_{t}(h)\), which is derived from the forecast errors, where

\[\begin{eqnarray} \nonumber \acute{\sigma}_{t}(h) &=& \mathbb{E}_t \left[ \left(y_{t+h}-\acute{y}_t(h) \right)^{2}\right] \\ &=&\mathbb{E}_t \left[ \left( \overset{h -1}{\underset{i =0}{\sum }}\phi_{1}^{i}\varepsilon_{t+ h -i}\right) \left( \overset{h -1}{\underset{i=0}{\sum }}\phi_{1}^{i}\varepsilon_{t + h -i}\right) \right] \tag{3.5} \end{eqnarray}\]

Note that as the \(\phi_1\) terms are known a priori, they are not governed by the expectations operator and can be moved to the front of the expression. In addition, \(\mathbb{E}_t\left[ \varepsilon_{t + h -i}\varepsilon_{t + h -i}\right] =\sigma_{\varepsilon}^{2}\) for all \(h\), while the expectations of the cross-products of the errors at different points in time are zero. Hence,

\[\begin{eqnarray} \nonumber \acute{\sigma}_{t}(h)=\sigma_{\varepsilon }^{2}\sum_{i = 0}^{h -1}\phi_{1}^{2i}, \end{eqnarray}\]

which would allow us to derive values for \(h = 1,2,3,\ldots\), with the aid of the following expressions,

\[\begin{eqnarray} \acute{\sigma}_{t}(1) &=&\sigma_{\varepsilon }^{2} \nonumber \\ \acute{\sigma}_{t}(2) &=&\sigma_{\varepsilon }^{2}+\phi_{1}^{2}\sigma_{\varepsilon }^{2}=\acute{\sigma}_{t}(1)+\phi_{1}^{2}\sigma_{\varepsilon }^{2} \nonumber \\ \acute{\sigma}_{t}(3) &=&\sigma_{\varepsilon }^{2}+\phi_{1}^{2}\sigma_{\varepsilon }^{2}+\phi_{1}^{4}\sigma_{\varepsilon }^{2}=\acute{\sigma}_{t}(2)+\phi_{1}^{4}\sigma_{\varepsilon }^{2} \nonumber \\ &\vdots& \nonumber \\ \acute{\sigma}_{t}(h) &=&\sigma_{\varepsilon }^{2}(1+\phi_{1}^{2}+\phi_{1}^{4}+ \ldots +\phi_{1}^{2(h-1)})=\acute{\sigma}_{t}(h-1)+\phi_{1}^{2(h-1)}\sigma_{\varepsilon }^{2} \nonumber \\ \tag{3.6} \end{eqnarray}\]

If we let \(h\rightarrow \infty\), (3.6) implies that \(\sigma_{\varepsilon}^{2}(1+\phi_{1}^{2}+\phi_{1}^{4}+ \ldots )=\frac{\sigma_{\varepsilon }^{2}}{1 - \phi_{1}^{2}}\). Note that this limit is the unconditional variance of the process that was provided previously. Thus,

\[\begin{eqnarray} \acute{\sigma}_{t}(h)\rightarrow \frac{\sigma_{\varepsilon }^{2}}{1 -\phi_{1}^{2}} \;\;\;\; \text{where } \;\; h\rightarrow \infty \tag{3.7} \end{eqnarray}\]

Note that the variance of the forecast error converges to the unconditional variance of the process, which should not be surprising, as the forecast error is simply an accumulation of future shocks to the process. Hence, if we assume that the errors in the process are distributed \(\varepsilon_t \sim \mathcal{N}(0,\sigma^2_{\varepsilon})\), then the \(h\)-step ahead forecast error would also be normally distributed, with \(\acute{e}_t(h) \sim \mathcal{N}\left(0, \acute{\sigma}_{t}(h)\right)\).

3.2 Uncertainty

The previous section suggests that we can use the conditional expectation to derive predicted values of a variable that may be described by a stable autoregressive model that has Gaussian white noise errors. In this case, the forecast will on average be equal to the true value of the variable that we want to forecast, which would imply that the forecasts are unbiased. However, the forecasts will not necessarily be equal to the true value of the process at all periods of time. Therefore, the forecast errors will have a positive variance, which may be measured by the mean square error (MSE).

In many instances it is desirable to report on both the point forecast and some measure of uncertainty that relates to the forecast. For example, most central banks publish fan charts together with their inflation forecasts. These fan charts communicate the central bank's view on possible paths for future inflation. In addition, a number of central banks also publish fan charts when referring to the relative success of their past forecasts.

An example of this type of communication is presented in Figure 2, which was included in the South African Reserve Bank Monetary Policy Review (June, 2014). Note that towards the end of the forecast horizon, uncertainty increases as the bands become wider over time. This graph allows the central bank to suggest that although they largely expect inflation to initially increase to a point that is beyond the target range, it is expected to decline after a year whereupon it will return to a point that is within the range. It also suggests that there is quite a large degree of uncertainty surrounding the expected future values of inflation and as such we could realise a relatively wide array of values for the rate of inflation in the future.

Figure 2: South African inflation fan chart (SARB June 2014)

There are a number of different methods that may be used to construct these fan charts. Where all the coefficients in the model are point estimates, we could use the MSE to generate the distribution of the future forecasts.1 Note that if we assume that the residuals in the model are \(\varepsilon_{t}\sim \mathsf{i.i.d.} \;\; \mathcal{N}(0,\sigma_{\varepsilon }^{2})\), it implies that the forecast errors should also be normally distributed,

\[\begin{eqnarray} y_{t+h}-\acute{y}_t(h)\sim \;\; \mathcal{N}(0,\acute{\sigma}_{t}(h)) \tag{3.8} \end{eqnarray}\]

Such that,

\[\begin{eqnarray} \frac{y_{t+h}-\acute{y}_t(h)}{\sqrt{\acute{\sigma}_{t}(h)}}\sim \;\; \mathcal{N}(0,1) \tag{3.9} \end{eqnarray}\]

In this case we could make use of a normal distribution, with the critical value \(z_{\alpha/2}\) defining the upper and lower bounds that may be used to derive the forecast interval for the \(h\)-period ahead forecast,

\[\begin{eqnarray} \left[\acute{y}_t(h)-z_{\alpha /2}\sqrt{\acute{\sigma}_{t}(h)}\;\;\; , \;\;\; \acute{y}_t(h)+z_{\alpha /2}\sqrt{\acute{\sigma}_{t}(h)}\right] \tag{3.10} \end{eqnarray}\]

This allows for the construction of standard forecast intervals, where the predictor, \(\acute{y}_t(h)\), and the MSE, \(\acute{\sigma}_{t}(h)\), are used to derive the appropriate interval.

3.3 Example: Forecasts with confidence intervals

Let us show how this works with a numerical example for the AR(1) model used in the previous section. That is, we assumed that we have the model,

  • AR(1) with \(\mu =0.4\), \(\phi_{1}=0.7\), and \(y_{t}=2\)

Let \(\varepsilon_{t}\sim \mathcal{N}(0,0.1)\), and suppose that we elect to work with a 95% confidence interval, so that \(\alpha =0.05\). This implies that \(z_{\alpha /2}=1.96\) in large samples. For the forecast horizons \(h=1,5,10\) we can then use the point forecasts in Table 1 and (3.6) to derive the forecast error variance. Then lastly, with the use of (3.10) we can construct the forecast intervals that are provided in Table 2.

| | Point Estimate | MSE | Lower Bound | Upper Bound |
|---|---|---|---|---|
| \(\acute{y}_t(1)\) | 1.80 | 0.10 | 1.18 | 2.42 |
| \(\acute{y}_t(5)\) | 1.45 | 0.19 | 0.59 | 2.30 |
| \(\acute{y}_t(10)\) | 1.35 | 0.20 | 0.48 | 2.22 |

Table 2: Forecast intervals for AR(1) model

Note that the point forecasts are the same as those that are provided for the AR(1) model in the upper row of Table 1. The second column of Table 2 shows the MSE, which converges to the unconditional variance of the AR process as the forecast horizon increases. The confidence intervals suggest that if the errors continue to be distributed \(\varepsilon_{t}\sim \mathcal{N}(0,0.1)\), then there is a 95% probability that the intervals will contain the future value of the random variable, \(y_{t + h}\).
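The entries in Table 2 may be reproduced directly from (2.5), (3.6) and (3.10). A minimal Python sketch, using the same hypothetical parameter values, is provided below.

```python
import numpy as np
from scipy.stats import norm

mu, phi, y_t, sigma2 = 0.4, 0.7, 2.0, 0.1
z = norm.ppf(0.975)                               # approximately 1.96 for a 95% interval

for h in [1, 5, 10]:
    # point forecast from (2.5) and forecast error variance from (3.6)
    point = mu * sum(phi**i for i in range(h)) + phi**h * y_t
    mse = sigma2 * sum(phi**(2 * i) for i in range(h))
    lower, upper = point - z * np.sqrt(mse), point + z * np.sqrt(mse)
    print(f"h={h:2d}  point={point:.2f}  MSE={mse:.2f}  [{lower:.2f}, {upper:.2f}]")
# h= 1  point=1.80  MSE=0.10  [1.18, 2.42]
# h= 5  point=1.45  MSE=0.19  [0.59, 2.30]
# h=10  point=1.35  MSE=0.20  [0.48, 2.22]
```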

Figure 3: AR(1) fan chart from simulation

After putting together many forecast intervals for different significance levels (i.e. for different values of \(\alpha\) in (3.10)), a fan chart can be constructed for the 30, 50, 70 and 90 percentiles of the forecast distribution, as in Figure 3. This would provide a visual display of the results, similar to that of the central bank in Figure 2.

An alternative method of constructing a density forecast would be to simulate \(n\) forecasts from the normal distribution with mean \(\acute{y}_t(h)\) and variance \(\acute{\sigma}_{t}(h)\) at each horizon, \(h\). From the \((n \times 1)\) vector of forecasts for each \(h\), a forecast interval can be derived by sorting the numbers (e.g. from lowest to highest) and then choosing the percentiles of interest from this sorted vector. Of course, if we are primarily interested in a given forecast interval, then the method that uses (3.10) would be more efficient. Moreover, when approximating the forecast distribution with \(n\) draws, the results would be dependent on the size of \(n\). If \(n\) is not big enough, the simulated intervals will not coincide with those of the direct approach, which makes use of (3.10).

Figure 4: AR(1) histograms from simulation

Figure 4 shows two histograms of the forecast distribution for \(h=1\), given the values in Table 2, which are reported above. As we see, when \(n\) is small, simulating the forecast distribution provides a poor approximation to the normal distribution; however, when \(n\) is large, the simulation works well. If we were to construct a fan chart with the aid of this method we would need to repeat this simulation for each \(h\)-step.
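A sketch of this simulation approach is provided below, where the one-step ahead moments are taken from Table 2 and the values of \(n\) are arbitrary; it illustrates how a small number of draws provides a poor approximation to the analytic interval.

```python
import numpy as np

rng = np.random.default_rng(42)
point, mse = 1.80, 0.10            # one-step ahead values from Table 2

for n in [50, 100000]:
    draws = rng.normal(point, np.sqrt(mse), size=n)   # simulate the forecast density
    lower, upper = np.percentile(draws, [2.5, 97.5])  # empirical 95% interval
    print(f"n={n:6d}  interval: [{lower:.2f}, {upper:.2f}]")
# with a small n the simulated interval differs noticeably from the analytic
# interval [1.18, 2.42]; with a large n the two coincide closely
```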

4 Forecast evaluation

To evaluate the performance of a forecast we may consider the use of various different statistics. For example, we may wish to derive an estimate of the bias that is associated with the forecasts, to determine whether or not we are consistently underestimating or overestimating the expected future mean value. In addition, when comparing the forecasting performance of a number of different models, we would usually want to evaluate a number of successive \(h\)-step ahead forecasts that have been generated over time. After generating the forecasts we could then sum the forecasting errors to determine which model provides the smallest total forecasting error. However, as we would not wish for the positive and negative values of the forecast errors to cancel each other out, we may wish to take the square or absolute value of these errors. This has given rise to a number of different loss functions, which are considered below. It is important to note that different loss functions emphasize different characteristics of the forecast, and as such they may provide conflicting results. When evaluating the uncertainty associated with any forecast, such as the density forecast, other loss functions are considered, and we will briefly examine these methods as well.

4.1 Recursive and rolling window forecasts and errors

Suppose that the historical dataset for a particular variable extends over the period \(t=\{1,\ldots , R, \ldots , T+H\}\), where \(R\) represents the end of the initial in-sample period and \(\{ R+1, \ldots , T+H\}\) represents the observations that are used for the out-of-sample evaluation. From the previous discussion, we can estimate the AR(1) model over the initial in-sample period and construct the first forecast, \(\mathbb{E}[y_{R+h}| y_R]\), which may be represented by the predictor \(\acute{y}_R(h)\), given the information set \(I_R\).

The out-of-sample forecast error is simply the difference between the forecast and the realisation of \(y_{R+h}\),

\[\begin{eqnarray} \nonumber \acute{e}_R(h)= y_{R+h}-\acute{y}_R(h) \end{eqnarray}\]

Since we do not observe this value when we construct the forecast, the forecast error can only be evaluated ex-post. After we have obtained the first forecast error we can then extend the dataset by one observation, so that the information set ends at \(R + 1\), before generating a second \(h\)-step ahead forecast. This would allow for the generation of a vector of forecast errors,

\[\begin{eqnarray} \acute{\mathbf{e}}_{h}= \left[ \acute{e}_R(h), \acute{e}_{R+1}(h), \acute{e}_{R+2}(h), \ldots , \acute{e}_{T}(h)\right]^{\prime } \tag{4.1} \end{eqnarray}\]

For example, assume that our dataset extends from 1995q1 to 2014q4, and we want to generate a number of one-step ahead forecasts over the out-of-sample period 2010q1 to 2014q4. This would imply that \(R\) would be 2009q4, and the initial in-sample period would be 1995q1 to 2009q4, which would be used to generate a forecast for 2010q1. After calculating the forecast error (and storing it in the vector \(\acute{\mathbf{e}}_{h}\)), the revised in-sample period for a recursive scheme would span 1995q1 to 2010q1. When using a rolling window scheme the in-sample period would be 1995q2 to 2010q1. Either of these methods could be used to generate a forecast for 2010q2, which would be used to calculate a second forecast error. This procedure continues until we have a full set of forecast errors in the \(\acute{\mathbf{e}}_{h}\) vector.
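A sketch of such a pseudo out-of-sample loop is provided below for one-step ahead AR(1) forecasts, where the data are simulated and the model is re-estimated by ordinary least squares at each forecast origin under both the recursive and rolling schemes; the function name `fit_ar1` and the sample sizes are purely illustrative.

```python
import numpy as np

def fit_ar1(y):
    """OLS estimates of mu and phi_1 from y_t = mu + phi_1 * y_{t-1} + e_t."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    mu, phi = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return mu, phi

rng = np.random.default_rng(1)
y = np.zeros(80)
for t in range(1, 80):                         # simulate an AR(1) with mu=0.4, phi=0.7
    y[t] = 0.4 + 0.7 * y[t - 1] + rng.normal(0, np.sqrt(0.1))

R, T = 60, 79                                  # initial in-sample period ends at R
errors = {"recursive": [], "rolling": []}
for origin in range(R, T):                     # one-step ahead forecasts, h = 1
    for scheme, start in [("recursive", 0), ("rolling", origin - R)]:
        mu_hat, phi_hat = fit_ar1(y[start:origin + 1])
        forecast = mu_hat + phi_hat * y[origin]
        errors[scheme].append(y[origin + 1] - forecast)

print({k: np.round(v, 2) for k, v in errors.items()})
```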

In most cases we would want to evaluate both the short-term and long-term forecasting performance of a model. Hence, if we want to consider the forecasting performance of a model that uses quarterly data over a two year period, we would want to evaluate the one- to eight-step ahead forecasts. In these cases we would usually make use of a matrix for capturing the forecasts and the forecasting errors, where each of the vectors for the \(h\)-step ahead forecasting errors \(\acute{\mathbf{e}}_{h}\) are placed in a separate column.

\[\begin{equation} \acute{\mathbf{e}}_{H} = \left\{ \begin{array}{cccc} \acute{e}_R(1) & \acute{e}_R(2) & \ldots & \acute{e}_{R}(H)\\ \acute{e}_{R+1}(1) & \acute{e}_{R+1}(2) & \ldots & \acute{e}_{R+1}(H)\\ \vdots & \vdots & \ddots & \vdots \\ \acute{e}_{T}(1) & \acute{e}_{T}(2) & \ldots & \acute{e}_{T}(H) \\ \end{array} \right\} \tag{4.2} \end{equation}\]

Note that the respective columns or rows in the matrix would represent a time series variable. For example, the first column would represent the one-step ahead forecasts errors over the out-of-sample period. We will see that this will be particularly useful when calculating the respective statistics that are used to evaluate the performance of different forecasting models.

Before moving on it is perhaps worth noting that these forecasting experiments are usually termed pseudo (or quasi) out-of-sample forecasting experiments, where the use of the term pseudo reflects the following:

  • When conducting the evaluation we have at our disposal all the out-of-sample information, which are the realisations of the observed variable. This information could have been used in the formulation and specification of the model (i.e. we see that the out-of-sample portion of the dataset exhibits regime-switching behaviour and as a result we decide to make use of a regime-switching model).
  • It is unlikely that the data that we use would have been captured in real-time. The use of real-time data would describe the source that would have been available at the time when it was initially released. In economics, and specifically in macroeconomics, most data series are heavily revised over time. For example, when the statistical agency of a given country releases a value for gross domestic product for the first time, this number will typically only be an estimate of the true value of economic output. During subsequent periods, this estimate will be subject to a number of revisions as the agency obtains more information about what actually occurred over that period. Databases with true real-time data exist for some countries and where such data is utilised in a forecasting experiment, we refer to it as a real-time out-of-sample forecasting experiment.

Of course, a real-time out-of-sample forecasting experiment is subject to the same flaw as described in the first bullet point, which is of particular importance when the model specification has been revised over time using information that would not have been available at the end of the initial in-sample period.

4.2 Bias

Where we continue to use \(R\) as the last observation in the in-sample period, the expected value of the forecast error is a measure of the bias,

\[\begin{equation}\nonumber \mathbb{E} \left[\acute{e}_R(h)\right]= \mathbb{E} \left[ y_{R+h}-\acute{y}_R(h)\right] \end{equation}\]

The column vectors of the \(\acute{\mathbf{e}}_{H}\) matrix may then be used to derive an estimate of the mean bias, which could be represented as

\[\begin{eqnarray} \overline{\acute{\mathbf{e}}}_{h}=\frac{1}{\left(T -R + 1\right)} \; \overset{T}{\underset{t = R}{\sum }} \acute{e}_{t}(h) \tag{4.3} \end{eqnarray}\]

In most cases, the model with an estimated bias that is closest to zero is considered to be the preferred model according to this criterion. If our forecast is biased, we would either have consistently predicted too high or too low a value for the variable we are forecasting, or alternatively, we may have made a few particularly large forecasting errors that skew the result.2

4.3 Root mean squared error (RMSE)

The most widely used method for evaluating forecasts is the root mean squared error (RMSE). It is a measure of the size of the forecast errors, and is simply the square-root of the mean squared forecasting error. Hence, when calculating the RMSE for a single forecast error we would derive,

\[\begin{equation}\nonumber \sqrt{\mathbb{E}_t \Big[ \big\{ \acute{e}_R(h) \big\}^{2}\Big] }=\sqrt{\mathbb{E}_t \Big[ \big\{y_{R+h}-\acute{y}_R(h)\big\}^{2}\Big] } \end{equation}\]

The difference between this measure and what was calculated for the bias is that, after squaring all the errors, the positive and negative errors would no longer cancel each other out when calculating the average over successive forecasts. Therefore, with the use of the vector of out-of-sample forecast errors, we could calculate the RMSE as,

\[\begin{eqnarray} \text{RMSE}_{h}=\sqrt{\frac{1}{\left(T -R + 1 \right)} \; \overset{T}{\underset{t = R}{\sum }} \acute{e}^2_{t}(h) } \tag{4.4} \end{eqnarray}\]

where \(\acute{e}^2_{t}(h)\) is the squared forecast error as defined above. As such, the RMSE is a symmetric loss function, where forecasts that are either too high or too low are weighted equally. Naturally, smaller forecast errors are considered to be better than larger ones, and as such a low RMSE value indicates better forecasting performance.

| Time | Outcome | AR(1) \(\acute{y}(1)\) | \(\acute{e}(1)\) | \(\acute{e}^{2}(1)\) | AR(2) \(\acute{y}(1)\) | \(\acute{e}(1)\) | \(\acute{e}^{2}(1)\) |
|---|---|---|---|---|---|---|---|
| \(t-1\) | 1.5 | | | | | | |
| \(t\) | 2.0 | | | | | | |
| \(t+1\) | 1.8 | 1.80 | 0.00 | 0.00 | 1.65 | 0.15 | 0.02 |
| \(t+2\) | 1.5 | 1.66 | -0.16 | 0.03 | 1.58 | -0.08 | 0.01 |
| \(t+3\) | 1.2 | 1.45 | -0.25 | 0.06 | 1.38 | -0.18 | 0.03 |
| \(t+4\) | 1.4 | 1.24 | 0.16 | 0.03 | 1.17 | 0.23 | 0.05 |
| \(t+5\) | 1.6 | 1.38 | 0.22 | 0.05 | 1.26 | 0.34 | 0.12 |
| \(\overline{\acute{\mathbf{e}}}_{h}\) | | | -0.01 | | | 0.09 | |
| \(\text{RMSE}_h\) | | | | 0.18 | | | 0.21 |

Table 3: AR forecasts

Continuing with the two autoregressive examples introduced in Table 1, where the AR(1) model is given as \(y_{t}=0.4+0.7y_{t-1}+\varepsilon_{t},\) and the AR(2) is given as \(y_{t}=0.3+0.6y_{t-1}+0.1y_{t-2}+\varepsilon_{t}\), we have calculated the bias and the RMSE in Table 3. In this example, the forecasting horizon is set to \(h=1\), and the realised outcomes are assumed to take on the values that are provided in the second column of the table. Note that the initial forecasts are equivalent to those in Table 1. However, since we now have realised outcomes for subsequent periods, we use these to generate the subsequent one-step ahead forecasts.

With only five forecast errors to evaluate, the estimates of both the bias and the RMSE are of little use and are merely included to show how easy it is to compute these results. In an applied evaluation one would usually evaluate a larger set of out-of-sample forecasts.
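Nevertheless, the calculations are easily verified. The short Python sketch below recovers the bias and RMSE in Table 3 from the two vectors of one-step ahead forecast errors (and also reports the mean absolute error that is discussed in section 4.4).

```python
import numpy as np

# one-step ahead forecast errors from Table 3
e_ar1 = np.array([0.00, -0.16, -0.25, 0.16, 0.22])
e_ar2 = np.array([0.15, -0.08, -0.18, 0.23, 0.34])

for name, e in [("AR(1)", e_ar1), ("AR(2)", e_ar2)]:
    bias = e.mean()                      # mean bias, as in (4.3)
    rmse = np.sqrt((e**2).mean())        # root mean squared error, as in (4.4)
    mae = np.abs(e).mean()               # mean absolute error (section 4.4 below)
    print(f"{name}: bias={bias:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
# AR(1): bias=-0.01  RMSE=0.18  MAE=0.16
# AR(2): bias=0.09  RMSE=0.21  MAE=0.20
```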

4.3.1 Decomposing the RMSE

It is possible to show that the RMSE reflects two sources of potential errors, which may be illustrated with the following example. Where a one-step ahead forecast is generated by an AR(1) model, \(y_{t}=\mu +\phi_{1}y_{t-1}+\varepsilon_{t}\), the predictor takes the form \(\acute{y}_t (1)=\hat{\mu}+\hat{\phi}_{1}y_{t}\), where \(\hat{\mu}\) and \(\hat{\phi}_{1}\) are the estimated coefficients. The realised value of the process would be generated by \(y_{t +1}=\mu+\phi_{1}y_{t}+ \varepsilon_{t+1}\), where at period \(t\) the values of \(\mu\), \(\phi_{1}\) and \(\varepsilon_{t+1}\) are unknown. Therefore, we could express the forecast error as,

\[\begin{eqnarray} \acute{e}_t (1) = y_{t+1} - \acute{y}_t (1)=\varepsilon_{t+1}+\left[ (\mu -\hat{\mu})+\left( \phi_{1}-\hat{\phi}_{1}\right) y_{t}\right] \tag{4.5} \end{eqnarray}\]

While we would expect \(\varepsilon_{t+1}\) to take on a value that is close to zero on average, this would not be the case on each occasion after the subsequent values of the process have been realised. Using (4.5) to compute the mean squared error we get,

\[\begin{eqnarray} \mathbb{E}_t \left[ \left(y_{t+1}-\acute{y}_t \left(1\right)\right)^{2}\right] =\sigma_{\varepsilon,t+1}^{2}+\mathsf{var}\left[ (\mu -\hat{\mu})+\left( \phi_{1}-\hat{\phi}_{1}\right) y_{t}\right] \tag{4.6} \end{eqnarray}\]

This expression suggests that the MSE is composed of two parts. The first is a function of the error term, while the second is a function of the discrepancy between the estimated and true coefficient values. These two parts give rise to shock (or error) uncertainty and parameter uncertainty (where the latter is given by the variance term in (4.6)). If we assume that the parameters in the estimated autoregressive model are identical to their population counterparts, the last term in (4.6) would equal zero. In this case the mean squared error is then equal to \(\sigma_{\varepsilon }^{2}\), which is identical to the one-step ahead MSE we derived in the previous section.

Alternatively, if the parameters of the process that generates \(y_t\) and \(y_{t+1}\) differ from those that have been estimated, then the variance that is attributed to estimation error in (4.6) will not equal zero. It is worth noting that if the parameter values do not change over time, which in macroeconomic terms would arise when we have deep or structural parameters, then the out-of-sample RMSE should be close to the square root of the theoretical MSE. However, when the parameter values change over time, as would often occur when using reduced-form models, then the out-of-sample RMSE would differ from the square root of the theoretical MSE. When this is the case, we may infer that the model forecasts relatively poorly.

4.4 Mean absolute errors (MAE)

As an alternative to the quadratic loss function, which is used in the RMSE, one could make use of a linear loss function, such as the MAE. This statistic is calculated as

\[\begin{eqnarray} \nonumber \text{MAE} &=& \mathbb{E}_t \Big[ \left| y_{t+h}-\acute{y}_t(h) \right|\Big] \end{eqnarray}\]

Note that, relative to the MAE, the RMSE would impose larger penalties on extreme outliers (or very large forecast errors) as a result of the quadratic loss function.

4.5 Comparing out-of-sample and in-sample fit

When performing a forecasting evaluation we evaluate the out-of-sample fit of the model. In contrast we could also consider the in-sample fit of the model, which may be summarised by the coefficient of determination, information criteria, and various other measures.

It is not always the case that a model with a good in-sample fit will deliver suitable out-of-sample performance. The intuition behind this is provided by (4.6), where a good in-sample fit (as suggested by a high \(R^{2}\)) may be obtained by including a large number of regressors. This would typically lower the in-sample estimate of \(\sigma_{\varepsilon}^{2}\), but it would also usually increase the estimation uncertainty, which would increase the last part of (4.6) when performing an out-of-sample evaluation.

Thus, when trying to evaluate the fit of an econometric model, it is usually a good idea to assess both the in-sample and out-of-sample performance of the model.

4.6 Comparing different forecasts

Most institutions make use of a suite of forecasting models that may be used to generate respective forecasts for a specific variable. To determine which of these is responsible for the most accurate forecast, we can compare the value of our preferred loss function for each of the models, and simply choose the model that is associated with the best score (i.e. choose the model with the lowest RMSE or bias).

However, the difference in the respective results can be very small, and we may want to consider whether or not the respective forecasts are significantly different from one another. This can be tested with a number of procedures, the most widely used of which is the Diebold and Mariano (1995) test.

To explain this procedure, assume that we have two models with the associated squared forecasting errors,

\[\begin{eqnarray} \acute{e}^{2}_{t,1} (h) &=&\Big[y_{t+h}-\acute{y}_{t,1} \left(h \right)\Big]^{2} \nonumber \\ \acute{e}^{2}_{t,2} (h) &=&\Big[y_{t+h}-\acute{y}_{t,2} \left(h \right)\Big]^{2} \tag{4.7} \end{eqnarray}\]

These statistics may be arranged as in (4.2) to populate two matrices of forecast errors, \(\acute{\mathbf{e}}_{H,j}\), where \(j=\{1,2\}\). Now, where \(d_{t, h}\) is the difference between the forecasting errors from model one, \(\acute{e}^{2}_{t,1} (h)\), and the forecasting errors from model two, \(\acute{e}^{2}_{t,2} (h)\), we may write

\[\begin{equation} \nonumber d_{t, h}=\acute{e}^{2}_{t,1} (h) - \acute{e}^{2}_{t,2} (h) \end{equation}\]

This statistic could be applied along the rows or columns of the \(\acute{\mathbf{e}}_{H,j}\) matrices to see whether the forecast errors are significantly different at a particular time or over a particular forecasting horizon.

To determine whether this difference is significantly different from zero, we regress \(d_{t, h}\) on a constant in a model that may be specified as,

\[\begin{eqnarray} d_{t, h}=\beta_{0}+u_{t} \tag{4.8} \end{eqnarray}\]

This rather intuitive test was formulated in Diebold and Mariano (1995), which employs the hypothesis,3

\[\begin{eqnarray} H_{0}:\beta_{0}=0 \;\; \text{ vs } \;\; H_{1}:\beta_{0}\neq 0 \tag{4.9} \end{eqnarray}\]

In this case the null hypothesis implies no significant difference in forecasting performance. Therefore, if \(\beta_{0}=0\), it will be the case that \(\mathbb{E}_t \left[ d_{t, h}\right] =\mathbb{E}_t \left[ u_{t}\right] =0\) under the standard OLS assumptions. As such, a rejection of the null would imply that the forecasting performance of the two models is significantly different (at some given significance level). The appropriate test statistic to use in this case is the \(t\)-value that relates to the \(\beta_{0}\) coefficient in the OLS regression. In the usual manner, when this statistic is higher in absolute value than the critical value, we can reject the null hypothesis.

When performing this test it is worth noting that when \(h>1\), the values for \(d_{t, h}\) and the residuals \(u_{t}\) would usually be serially correlated. Therefore, one would usually employ methods for HAC standard errors, which are discussed in Newey and West (1987). For a more detailed treatment of how to compare different forecasts, the interested reader should consult West (2006), which provides an informative summary.
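As a rough illustration, the Diebold-Mariano statistic may also be computed directly from the two vectors of forecast errors, using a Bartlett-weighted (Newey-West) estimate of the long-run variance of \(d_{t,h}\) when \(h>1\). The sketch below uses simulated errors as placeholders for the output of an actual forecasting exercise.

```python
import numpy as np
from scipy.stats import norm

def dm_test(e1, e2, h=1):
    """Diebold-Mariano test on squared forecast errors, using a Newey-West
    estimate of the long-run variance of d_t (truncated at h-1 lags)."""
    d = e1**2 - e2**2
    n = len(d)
    d_bar = d.mean()
    u = d - d_bar
    lrv = (u @ u) / n                              # variance of d_t
    for lag in range(1, h):                        # Bartlett-weighted autocovariances
        w = 1 - lag / h
        lrv += 2 * w * (u[lag:] @ u[:-lag]) / n
    t_stat = d_bar / np.sqrt(lrv / n)
    p_value = 2 * (1 - norm.cdf(abs(t_stat)))
    return t_stat, p_value

rng = np.random.default_rng(0)
e1 = rng.normal(0, 1.0, 200)                       # hypothetical errors, model 1
e2 = rng.normal(0, 1.2, 200)                       # hypothetical errors, model 2
print(dm_test(e1, e2, h=1))
```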

As a final note, we would also need to correct the test statistics in the Diebold and Mariano (1995) test when dealing with nested models. This procedure is discussed in Clark and West (2007).

4.7 Displaying the forecasting results

It is usually a good idea to plot each of the forecasts against the observed data to visualise the results of this forecasting experiment. Such an example is provided in Figure 5, where we have made use of an autoregressive model to generate the forecasts. Note that the forecasts converge on the mean, and the time that is taken to converge on this mean value is dependent on the values of the coefficients.4

Figure 5: Hairy plot of realised and forecasted values

It is also usually a good idea to take a look at the RMSE statistics both over time and for the different \(h\)-step ahead forecasts. Examples of these diagrams are displayed in Figure 6 and Figure 7, which are taken from Balcilar, Gupta, and Kotzé (2015).

Figure 6: Average RMSE over time for different models

Figure 7: Comparing RMSE for one- to eight-step ahead forecasts

5 Model combination

In the previous section we described how forecasts for the same variable could be compared, when these forecasts are generated by different models. Implicitly we wanted to find the model that would provide the superior forecast, when comparing two alternative model specifications. An alternative approach would be to combine these forecasts from the set of all of the models under consideration. Model combination has a long history in economics, dating back to Bates and Granger (1969), and possibly even further.

The rationale for combining models as opposed to choosing one model among many candidate models is easily motivated from a decision-making perspective. As noted in Timmermann (2006), unless one can identify a particular forecasting model that generates smaller forecasting errors prior to observing any future data, forecast combinations offer potential diversification gains that may reduce the possibility of making large forecasting errors at particular points in time.5 This would certainly be the case where there are large outliers in the mis-specification errors of individual models, which may be reduced with the aid of a combined forecasting strategy.

Many different strategies to combine model forecasts exist, both for point and density forecasts. Covering all of these is outside the scope of this course, so we will only focus on two simple strategies for combining point forecasts using a linear aggregation function for the application of equal weighting and MSE weighting. To ensure that this discussion is relatively intuitive, and without loss of generality, we will assume that we only combine the forecasts of two models. In a more realistic setting, the number of models will typically be much larger.

5.1 Linear opinion pool and weights

In terms of notation, let \(\acute{y}^{c}(h)\) denote the combined forecast for \(h\)-steps ahead. The respective forecasts for each of the two models are denoted \(\acute{y}_{j}(h)\), for \(j=\{1,2\}\). The most simple way of aggregating \(\acute{y}_{j}(h)\) for \(j=1,2\) into one combined forecast \(\acute{y}^{c}(h)\) is to use a linear function,

\[\begin{eqnarray} \acute{y}^{c}(h) &=& w_{h, 1}\acute{y}_{1}(h)+w_{h, 2}\acute{y}_{2}(h) \nonumber \\ &=& w_{h}\acute{y}_{1}(h)+\left( 1-w_{h}\right) \acute{y}_{2}(h) \tag{5.1} \end{eqnarray}\]

where \(w_{h, j}\), for \(j=1,2\) is the weight attached to the model \(j\). Combining individual forecasts into one combined forecast in this manner is often called a linear opinion pool. Typically we normalize the weights so that they sum to unity, as reflected in the second line of the expression in (5.1).

Of course, it would be simple to compute the result in (5.1), if we knew the weight to attach to each model. Two of the simplest weighting schemes include equal weighting and weighting based on the MSE of the forecasts. Equal weights are particularly simple to compute, where

\[\begin{eqnarray} \text{Equal weights: } & \;\;\;\; & w_{h, j}=\frac{1}{2} \;\; \text{ for } \; j=1,2 \tag{5.2} \end{eqnarray}\]

Similarly, one could use the MSE estimates of the forecasts to penalise those models that are associated with greater uncertainty. Thus:

\[\begin{eqnarray} \text{MSE weights: } & \;\;\;\; & \left(1-w_{h, j}\right)=\frac{\text{MSE}_{h, j}}{\sum_{j=1}^{2}\text{MSE}_{h, j}} \;\; \text{for } \;\; j=1,2 \tag{5.3} \end{eqnarray}\]

where the terms for MSE\(_{h, j}\) can be derived according to the specification that has been provided earlier. Both of these weighting schemes are frequently employed in the forecasting literature as they can be computed without any difficulty. In addition, they also have some desirable theoretical properties that are similar to those that may be derived from diversification. For a further discussion of this topic see Timmermann (2006), Clemen (1989), and Stock and Watson (2004).
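A sketch of these two weighting schemes is provided below. Note that for the two-model case the expression in (5.3) is equivalent to weighting each model by the inverse of its MSE, and it is this inverse-MSE form that is implemented here; the function name `combine` is purely illustrative.

```python
import numpy as np

def combine(forecasts, mse=None):
    """Combine point forecasts with equal weights (5.2) or inverse-MSE weights,
    which coincide with (5.3) when only two models are combined."""
    forecasts = np.asarray(forecasts, dtype=float)
    if mse is None:
        weights = np.ones(len(forecasts)) / len(forecasts)
    else:
        inv = 1.0 / np.asarray(mse, dtype=float)
        weights = inv / inv.sum()                 # penalise models with a larger MSE
    return weights, weights @ forecasts

print(combine([1.80, 1.65]))                      # equal weights
print(combine([1.80, 1.65], mse=[0.10, 0.20]))    # inverse-MSE weights
```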

6 Alternative forecasting strategies

Throughout our discussion on forecasting we have focused on the use of autoregressive models as they provide an intuitive appeal when explaining the procedures that are involved in the generation and evaluation of various forecasts. In most cases, the discussion would also apply to other model specifications. However, in certain instances (and for some data) a slightly different approach may be more suitable.

6.1 Direct forecasting

Direct forecasting models are different from iterative forecasting models in that they are specified to forecast at a particular horizon. Thus a simple example of a direct forecasting model would be,

\[\begin{eqnarray} y_{t}=\mu +\phi_{1}x_{t-h}+\varepsilon_{t} \tag{6.1} \end{eqnarray}\]

If \(h=1\) and \(x=y\), this will just be an AR(1) model as described earlier. However, if \(h>1\) or \(x\neq y\), this describes a direct forecasting model. For example, if we assume that \(h=4\), then the equation for the model that is estimated would be \(y_{t}=\mu +\phi_{1}x_{t - 4}+\varepsilon_{t}\), and a direct \(4\)-period ahead forecast would be derived from

\[\begin{eqnarray} y_{t + 4}=\mu +\phi_{1}x_{t}+\varepsilon_{t + 4} \tag{6.2} \end{eqnarray}\]

An example of this forecasting strategy may be found in Stock and Watson (2004). In addition, Marcellino (2006) has noted that, in theory, iterated forecasts are more efficient when the model is correctly specified; however, in the presence of model mis-specification direct forecasts may be more robust.
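As a rough sketch of the direct approach, the regression in (6.1) may be estimated by ordinary least squares and the forecast in (6.2) computed from the last available observation of \(x\); the data below are simulated placeholders and the function name `direct_forecast` is illustrative.

```python
import numpy as np

def direct_forecast(y, x, h):
    """Estimate y_t = mu + phi_1 * x_{t-h} + e_t by OLS and return the direct
    h-step ahead forecast, mu_hat + phi_hat * x_T, as in equation (6.2)."""
    X = np.column_stack([np.ones(len(y) - h), x[:-h]])   # regressor dated t - h
    mu_hat, phi_hat = np.linalg.lstsq(X, y[h:], rcond=None)[0]
    return mu_hat + phi_hat * x[-1]

rng = np.random.default_rng(3)
T = 120
x = rng.normal(size=T)
y = np.zeros(T)
y[4:] = 0.5 + 0.8 * x[:-4] + rng.normal(scale=0.1, size=T - 4)   # y_t depends on x_{t-4}
print(direct_forecast(y, x, h=4))                                # forecast of y four periods ahead
```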

6.2 Autoregressive distributed lag model

Yet another time series model that is often applied in empirical work is the autoregressive distributed lag (ADL) model. The ADL model differs from the simple autoregressive model in that it includes other variables (i.e. other than the lags of the dependent variable). In a general form, the ADL model can be written as,

\[\begin{eqnarray} y_{t}=\mu + \sum^p_{i=1} \phi_{i}y_{t-i}+ \sum_{k=1}^K \sum^{J_{k}}_{j=1} \beta_{j, k }x_{t-j, k}+ \varepsilon_{t} \tag{6.3} \end{eqnarray}\]

where the second term on the right-hand side of (6.3) is the usual autoregressive term, while the third term summarizes the distributed lag component of the ADL model, with respect to \(x_t\). Therefore, in the ADL model we allow for \(k=\{1, \ldots , K\}\) additional regressors, which have \(J_k\) respective lags. In this way, we allow for \(x_{t-j, k}\) to be included in the model. Each of the \(x_k\) regressors is then allowed to take on a different number of lags. For example, if \(K=2\), where \(J_{k=1}=2\) and \(J_{k=2}=1\), we have

\[\begin{eqnarray} \nonumber y_{t}= \mu +\overset{p}{\underset{i=1}{\sum }}\phi_{i}y_{t-i}+ \beta_{1,1}x_{t-1, 1} + \beta_{2,1}x_{t-2, 1} + \beta_{1,2}x_{t-1, 2} + \varepsilon_{t} \end{eqnarray}\]

The rationale for including additional regressors is that there might be additional information in these variables that is not captured by lags of the dependent variable. However, the inclusion of additional regressors may result in a number of complications when employing an ADL model for forecasting more than one period ahead. For example, when \(h=3\) and \(p=1\), we have the model

\[\begin{eqnarray} y_{t + 3}=\mu +\phi_{1}y_{t + 2}+\beta_{1,1}x_{t + 2,1}+\beta_{2,1}x_{t + 1, 1}+\beta_{1,2}x_{t +2,2}+\varepsilon_{t + 3} \nonumber \end{eqnarray}\]

In this case, the \(y_{t + 2}\) term can be derived in the usual manner, by plugging in the value for \(y\) in the previous period. However, the \(x\) terms are not known, as at time \(t\) we do not necessarily know what the values of \(x_{t + 2,1}, x_{t +1,1}\) and \(x_{t + 2, 2}\) will be. As such, these values would usually need to be predicted outside of the model. To circumvent this problem, multivariate models are usually employed in such forecasting exercises.

7 Conclusion

To derive a forecast for an autoregressive process we can iterate the process forward, after employing the conditional expectation operator. Under the assumption that the process is stable (i.e. stationary), and that the errors are Gaussian white noise, the conditional expectation operator ensures that the variance of the forecast error is minimised. When the forecast horizon becomes very large, the mean of the autoregressive forecast converges towards the unconditional mean of the process, and the variance of the forecast converges towards the unconditional variance of the process. Density and interval forecasts can be constructed by making use of the normality assumptions of the errors. To evaluate the empirical performance of different forecasting models an out-of-sample forecasting experiment should be conducted. The bias and RMSE statistics are commonly used as evaluation criteria for forecasts. To determine whether the forecasts from different models are significantly different from one another, we could employ the Diebold-Mariano test. Combining the forecasts from many individual models into one combined forecast may produce forecast errors that have fewer outliers.

8 Appendix: Density forecasting and evaluation

In the most recent forecasting literature, there has been increased attention towards forecasting the whole distribution of future outcomes.6 If we assume that we know the distribution of the forecast errors, as we did in the previous section, density forecasts can easily be generated as described in that section. If we do not want to assume that we know the forecast distribution, other simulation-based methods have to be employed to construct the density forecasts. We will not cover these methods in this course; however, under the maintained normality assumptions, the evaluation of density forecasts is not overly complicated.

8.1 Evaluating density forecast

Whilst it is relatively simple to generate a density forecast, the evaluation of these densities is more involved. This is partly due to the fact that the true density is not observed, not even ex-post. In what follows we refer to two of the most widely used evaluation methods that are applied to this problem, which include the probability integral transform (PIT) and the log-score (lnS).7

For a particular forecast density, the log-score is simply the logarithm of the forecast density evaluated at the outcome of the random variable we forecast. The theoretical justification for using the log-score is somewhat complicated, but the implementation is rather easy, especially if we construct density forecasts according to the methods that were described above. That is, the estimates of \(\acute{y}_t(h)\) and \(\acute{\sigma}_{t}(h)\) can be used to evaluate a normal forecast density at the realised outcome. This is easily done with the aid of most software packages.

Therefore, if we were to conduct an out-of-sample forecasting experiment, we could compute the average log-score from the densities of all forecasts that cover the evaluation period. In the density forecasting literature this score is seen as an intuitive measure of density fit, or as the density counterpart to the RMSE. However, in contrast to the RMSE, a higher average log-score is considered better than a lower value, simply because it reflects a higher likelihood.

The PIT is used to test whether the predictive densities are correctly specified. For example, the 95% forecast interval is an interval that should contain the future value of the series in 95% of repeated applications. Likewise, a 30% forecast interval should contain the future value of the series in 30% of the repeated applications. Considering the whole forecast distribution, if the density forecast is correctly specified we would expect to see realised outcomes in all parts of this distribution, but with more outcomes centred around the peak of the distribution than in the tails. Therefore, if we run a forecasting experiment to construct a density, as described above, this could provide the essence of a repeated application. We could then evaluate whether actual outcomes fall within the forecast densities, which is what the PIT measures.

More formally, it can be shown that a random variable from any given continuous distribution can be converted to a random variable that has a uniform distribution using the PIT. Under the maintained normality assumption, a PIT can be computed by evaluating the normal cumulative distribution function, with mean \(\acute{y}_t(h)\) and variance \(\acute{\sigma}_{t}(h)\), at the realised outcome. If the density forecasts are correctly specified, and we do this for all forecasts covering the evaluation sample, \(T-R\), then the PITs should be more or less uniformly distributed (i.e. a relatively flat histogram with percentiles on the x-axis). Once again, this is easily done in most statistical computer software packages. Moreover, different test statistics exist and can be applied to determine whether the PITs are indeed uniformly distributed.
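Under the maintained normality assumption, both the log-score and the PITs reduce to evaluations of the normal density and distribution functions. The sketch below uses simulated point forecasts, forecast standard deviations and outcomes purely as placeholders for the output of an actual out-of-sample exercise.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
point = np.full(200, 1.33)              # hypothetical one-step ahead point forecasts
sd = np.full(200, np.sqrt(0.10))        # and forecast error standard deviations
outcomes = rng.normal(point, sd)        # realised outcomes (correctly specified here)

log_score = norm.logpdf(outcomes, loc=point, scale=sd).mean()   # average log-score
pits = norm.cdf(outcomes, loc=point, scale=sd)                  # probability integral transforms

print(f"average log-score: {log_score:.2f}")
hist, _ = np.histogram(pits, bins=10, range=(0, 1))
print("PIT histogram counts:", hist)    # roughly 20 per bin if correctly specified
```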

Figure 8: PIT for \(h=1\)

By way of example, we could compare the probability density of a particular \(h\)-step ahead forecast to the distribution of the data from the in-sample period. We would usually start with the evaluation of \(h=1\), as it gets slightly more complicated when \(h>1\), as a result of the potential serial correlation in the forecast errors. To complete this exercise we usually make use of a histogram to depict the empirical distribution of the PITs. In this example, the solid line represents the number of draws that are expected to be in each bin under a \(U(0,1)\) distribution. The dashed lines represent the 95% confidence interval constructed under the normal approximation of a binomial distribution.

The results from such an exercise are depicted in Figure 8, which would appear to be relatively favourable as there is no part of the transformed \(U(0,1)\) distribution that is beyond the confidence intervals.

9 References

Balcilar, Mehmet, Rangan Gupta, and Kevin Kotzé. 2015. “Forecasting Macroeconomic Data for an Emerging Market with a Nonlinear DSGE Model.” Economic Modelling 44: 215–28.

Bates, J., and Clive W. J. Granger. 1969. “The Combination of Forecasts.” Operations Research Quarterly 20(4): 451–68.

Clark, Todd E., and Kenneth D. West. 2007. “Approximately Normal Tests for Equal Predictive Accuracy in Nested Models.” Journal of Econometrics 138: 291–311.

Clemen, Robert T. 1989. “Combining Forecasts: A Review and Annotated Bibliography.” International Journal of Forecasting 5 (4): 559–83.

Corradi, V., and N. R. Swanson. 2006. “Handbook of Economic Forecasting.” In, edited by G. Elliott, C. W. J. Granger, and A. Timmermann, 1:197–284. Elsevier.

Diebold, Francis X., and Roberto S. Mariano. 1995. “Predictive Accuracy.” Journal of Business and Economic Statistics 13 (3): 253–63.

Hall, Stephen G., and James Mitchell. 2007. “Combining Density Forecasts.” International Journal of Forecasting 23 (1): 1–13.

Jore, Anne Sofie, James Mitchell, and Shaun Vahey. 2008. “Combining Forecast Densities from Vars with Uncertain Instabilities.” Reserve Bank of New Zealand Discussion Paper Series DP2008/18. Reserve Bank of New Zealand.

Kascha, Christian, and Francesco Ravazzolo. 2008. “Combining Inflation Density Forecasts.” Working Paper 2008/22. Norges Bank.

Marcellino, M. 2006. “Handbook of Economic Forecasting.” In, edited by G. Elliott, C. W. J. Granger, and A. Timmermann, 1:879–960. Elsevier.

Newey, W., and K. West. 1987. “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55: 703–8.

Stock, J. H., and M. W. Watson. 2004. “Combining Forecasts of Output Growth in a Seven-Country Data Set.” Journal of Forecasting 23: 405–30.

Timmermann, Allan. 2006. “Handbook of Economic Forecasting.” In, edited by G. Elliott, C. W. J. Granger, and A. Timmermann, 1:136–96. Elsevier.

Wallis, K. F. 2005. “Combining Density and Interval Forecasts: A Modest Proposal.” Oxford Bulletin of Economics and Statistics 67(1): 983–94.

West, Kenneth D. 1996. “Asymptotic Inference About Predictive Ability.” Econometrica 64(5): 1067–84.

———. 2006. “Handbook of Economic Forecasting.” In, edited by G. Elliott, C. W. J. Granger, and A. Timmermann, 1:99–134. Elsevier.


  1. For those models that generate random variables for the parameter estimates (i.e. where Bayesian parameter estimation methods are used) we could use the coefficient estimates to generate distributions for the forecast estimates.

  2. In a number of financial applications, one may choose to make use of an asymmetric loss function that penalises incorrect forecasts for negative returns more than those that forecast positive returns.

  3. In addition, the interested reader may also wish to consult West (1996).

  4. This is generally the case with autoregressive models.

  5. The individual forecasts need not come from a formal statistical model and may include subjective opinions about the future values of variables.

  6. See, Wallis (2005), Hall and Mitchell (2007), Kascha and Ravazzolo (2008), and Jore, Mitchell, and Vahey (2008), among others.

  7. A more detailed examination of different density evaluation procedures is given by Corradi and Swanson (2006).