
I have been working with forecasting for a short while, and one thing has been clear so far: each problem is unique because the data for each problem are unique. I find the variety of forecasting methods overwhelming, but because of a few restrictions imposed on me, I have been relying on the AR(I)MA formulation. With that said, I ask the following:

(1) What metrics, rules (of thumb or not), or tools should I rely on to decide on how much data I should use for my forecasting?

Rob Hyndman (and many others) has shared his thoughts on this. In his post about fitting models to long time series, Hyndman talks about some of the problems one might face if too much data is used. Conversely, he's talked about the problems of fitting models to short time series. Regardless of how long or short the time series is, the question rarely has an easy answer.

To drive my point home, I plotted two different datasets, as seen in the figure below. The data have a 5-second granularity and run from 9AM to 11AM. Data (B) have a sharp change in level: at around 09:30AM, the level jumps from 10 to 100. In a situation like this, if one wants to predict the next value 5, 10, or 15 seconds from now (or even 30 minutes from now), I'd argue one should abandon the data before 9:30AM, given that they don't reflect the recent behavior observed at the forecast origin.

The same can't be said about data (A). Data (A) might have extended periods with the same value, namely zero, but no such clear shift in level. There is no apparent pattern, either. It seems to me that data (A) are harder to model. The shift in (B) at least lets a forecaster try to fit a model to the data after 9:30AM; it offers a natural starting point. But for (A), I don't think we can abandon any part of the data, which leads me to question (2):

(2) What's the problem of too much/too little data?

While (1) and (2) are related, I only have a guess for (2). To me, the problem lies in the model's order-selection and parameter-estimation phases. If one has too little data, the chosen model order might be inadequate, since too little data do not preserve memory well. On top of that, the standard errors around the parameters will be high. And that's before mentioning the impossible problem of estimating more parameters than there are data points available.
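To illustrate the standard-error point, here's a minimal sketch (my own toy example, not from any of the cited posts): I simulate an AR(1) process with coefficient 0.7 and fit it by least squares on a short sample and on the full sample. The sample sizes and the coefficient value are arbitrary choices for illustration.

```python
import numpy as np

def fit_ar1(y):
    """OLS fit of y[t] = phi * y[t-1] + e[t]; returns (phi_hat, std_err)."""
    x, z = y[:-1], y[1:]
    phi = np.dot(x, z) / np.dot(x, x)
    resid = z - phi * x
    # classical OLS standard error of the slope (no-intercept regression)
    s2 = np.dot(resid, resid) / (len(z) - 1)
    se = np.sqrt(s2 / np.dot(x, x))
    return phi, se

rng = np.random.default_rng(0)
# simulate a long AR(1) series with phi = 0.7
n = 5000
e = rng.normal(size=n)
y = np.empty(n)
y[0] = e[0]
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + e[t]

phi_short, se_short = fit_ar1(y[:30])  # only ~30 observations
phi_long, se_long = fit_ar1(y)         # all 5000 observations
print(se_short, se_long)               # the standard error shrinks roughly like 1/sqrt(n)
```

With 30 points the standard error is several times larger than with the full series, which is exactly the "too little data" problem above.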

Too much data is trickier. I can't see how too much data would mess up a model's order. I'd assume there's a point beyond which more or less data won't have deep implications for a model's order, as long as we assume the "new data" keep being sampled from the same distribution. Sure, spuriously significant high lags might appear, but in most cases a simple look at the ACF will tell us that they don't matter. With parameter estimation, I can't immediately see the problem. The more data I have, the more precise the parameter estimates should be. No? One could argue that the data might be noisy. In that case, the problem is not data quantity but data quality, and that affects results regardless of how much data we have.
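The "spurious lags" point is easy to check numerically. A minimal sketch (with an arbitrary seed, series length, and lag count): compute the sample ACF of pure white noise and count how many lags fall outside the usual approximate 95% band. With 40 lags you expect a couple of exceedances by chance alone, which is why a visual sanity check of the ACF usually suffices.

```python
import numpy as np

def acf(y, nlags):
    """Sample autocorrelation function up to nlags."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    denom = np.dot(y, y)
    return np.array([1.0] + [np.dot(y[:-k], y[k:]) / denom
                             for k in range(1, nlags + 1)])

rng = np.random.default_rng(1)
y = rng.normal(size=2000)        # white noise: no true autocorrelation at any lag
r = acf(y, nlags=40)
band = 1.96 / np.sqrt(len(y))    # approximate 95% band under the white-noise null

# count lags that look "significant" purely by chance
spurious = int(np.sum(np.abs(r[1:]) > band))
print(spurious)
```

A handful of exceedances out of 40 lags is entirely consistent with noise, so they shouldn't drive order selection.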

What I do know is that if I choose to discard some of data (A) when estimating the model's order and parameters, the results can differ widely. Say I assume the data are too erratic and that only the last 45 minutes matter for forecasting; this will have a profound impact on the estimation procedures.

(3) Yet, if too much data are a possibility, where do I cut off? What do I do to have a defensible way of saying only X amount of data will be necessary?

[Figure: time series (A) and (B) at 5-second granularity, 9AM to 11AM; (B) shows a level shift from 10 to 100 at around 9:30AM.]


2 Answers


Regarding too much data: if the data generating process (DGP) evolves over time, old data will not be representative of the current DGP. Using this irrelevant data for model selection and estimation will be detrimental to the forecast accuracy.

How to know how much is too much? Either you have subject-matter knowledge on changes in the DGP (e.g. you are forecasting the profit of a company and you know the tax code changed at time $t_1$ or a new competitor showed up at time $t_2$ which has affected the profitability since then) or you may try pseudo out-of-sample forecasting by employing shorter and longer training time series and seeing which one works better.
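The second option can be sketched in a few lines. This is a toy illustration under my own assumptions: a simulated series with a level shift (like data (B) in the question), one-step-ahead forecasts from the mean of a rolling training window, and two arbitrary window lengths. The point is only the comparison mechanic, not the specific forecast rule.

```python
import numpy as np

def rolling_mae(y, window, start):
    """One-step-ahead MAE, forecasting each point by the mean of the
    previous `window` observations (a deliberately simple forecaster)."""
    errs = [abs(y[t] - np.mean(y[t - window:t])) for t in range(start, len(y))]
    return float(np.mean(errs))

rng = np.random.default_rng(2)
# series whose level shifts from 10 to 100 halfway through, as in data (B)
y = np.concatenate([10 + rng.normal(size=500),
                    100 + rng.normal(size=500)])

start = 800  # evaluate only well after the break
mae_short = rolling_mae(y, window=50, start=start)   # recent data only
mae_long = rolling_mae(y, window=600, start=start)   # reaches back before the break
print(mae_short, mae_long)
```

Here the long window drags in pre-break observations and its forecasts are badly biased, so the pseudo out-of-sample comparison picks the shorter training series; for a stable DGP the ranking would typically reverse.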

  • Hello, Richard. Thank you for your answer. Both points seem sound. (Commented Sep 1, 2023 at 17:35)
  • @Jxson99, I am glad you have found it helpful! (Commented Oct 26, 2023 at 6:24)

If you plan on restricting your space of candidate models to something like ARIMA, then I can see how this exercise of picking an appropriate subset of the data might make sense. Looking at the example plots, though, I would not model either of these with ARIMA.

However, I prefer a different approach which I'll suggest for your consideration: model the way the time series changes. The example plots from the OP show two different cases of non-stationarity.

One approach is to simply throw various machine learning models at features extracted from the time series. This can be made black-box and automatic via some hyperparameter tuning, and it can be successful in tackling the way that the stochastic process is non-stationary, but let's focus on white-box modelling.

The first time series appears to be a sequence of pulses. If any of these peaks can be reasonably assumed to be one-off events, I would just model them as explicit functions of time added to the apparent baseline. If these are events that you expect to recur according to some pattern, then I would model that pattern. Taking a probabilistic approach, you would sample the arrival times and 'sizes' (e.g. height and width) of the pulses first, then add them into the stochastic process as you calculate it forward in time. I'd have to explore the timing and sizes of the peaks in the data to better understand how to proceed.
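The generative side of that probabilistic approach might look like the sketch below. All the specifics (Poisson arrivals, gamma-distributed heights, Gaussian-shaped pulses, the rates themselves) are my own placeholder assumptions; in practice you'd pick distributions informed by the observed peaks.

```python
import numpy as np

rng = np.random.default_rng(3)

t = np.arange(0, 7200, 5.0)          # 9AM to 11AM at 5-second granularity, in seconds
baseline = np.zeros_like(t)          # flat baseline, as in data (A)

# sample pulse arrival times as a homogeneous Poisson process
rate = 1.0 / 600.0                   # assumed: ~1 pulse per 10 minutes
n_pulses = rng.poisson(rate * t[-1])
arrivals = rng.uniform(0, t[-1], size=n_pulses)
heights = rng.gamma(shape=2.0, scale=5.0, size=n_pulses)  # assumed pulse heights
widths = rng.uniform(20, 60, size=n_pulses)               # assumed widths, seconds

# build the process forward in time: baseline + pulses + observation noise
signal = baseline.copy()
for a, h, w in zip(arrivals, heights, widths):
    signal += h * np.exp(-0.5 * ((t - a) / w) ** 2)       # Gaussian-shaped pulse

y = signal + rng.normal(scale=0.2, size=t.size)
print(len(y), n_pulses)
```

Fitting would then invert this: estimate the arrival rate and size distributions from the detected peaks, and simulate forward for the forecast.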

The second time series appears to be a step change in the signal, which can readily be modelled with indicator variables. This is quite a simple regression model (onto which you could add distributional assumptions):

$$Y_t := \beta_1 \mathbb{I}(t \geq \tau) + \beta_0$$

where $\tau$ is a parameter which you could either assign once, or tune, or assign a prior distribution over.
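The "tune $\tau$" option is a one-dimensional search: for each candidate break point the least-squares solution is just two group means, so you can profile out $\beta_0, \beta_1$ and scan $\tau$. A minimal sketch on simulated data shaped like data (B) (the series length and noise level are my own choices):

```python
import numpy as np

def fit_step(y):
    """Fit Y_t = b0 + b1 * I(t >= tau) by exhaustive search over tau.
    Given tau, the least-squares estimates are the pre- and post-break means."""
    n = len(y)
    best = None
    for tau in range(1, n):                  # candidate break points
        b0 = y[:tau].mean()                  # pre-break level
        b1 = y[tau:].mean() - b0             # size of the step
        sse = (np.sum((y[:tau] - b0) ** 2)
               + np.sum((y[tau:] - (b0 + b1)) ** 2))
        if best is None or sse < best[0]:
            best = (sse, tau, b0, b1)
    return best[1], best[2], best[3]         # tau_hat, b0_hat, b1_hat

rng = np.random.default_rng(4)
# level 10 for the first 360 points, then 100, plus unit-variance noise
y = np.concatenate([10 + rng.normal(size=360),
                    100 + rng.normal(size=1080)])
tau_hat, b0_hat, b1_hat = fit_step(y)
print(tau_hat, round(b0_hat, 1), round(b1_hat, 1))
```

With a jump this large relative to the noise, the search recovers the break point essentially exactly; with the Bayesian option you'd put a prior over $\tau$ instead of taking the SSE minimizer.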

The main limitation of this white-box approach (in my experience) is that it is more time-consuming.

