I have been working with forecasting for a short while, and one thing has been clear so far: each problem is unique because the data for each problem are unique. I find the variety of forecasting methods overwhelming, but because of a few restrictions imposed on me, I have been relying on the AR(I)MA formulation. With that said, I ask the following (1):
(1) What metrics, rules (of thumb or not), or tools should I rely on to decide on how much data I should use for my forecasting?
Rob Hyndman (and many others) has shared his thoughts on this. In his post about fitting models to long time series, Hyndman talks about some of the problems one might face if too much data is used. Conversely, he has also talked about the problems of fitting models to short time series. Regardless of how long or short the time series is, the question rarely has an easy answer.
To drive my point home, I plotted two different datasets, as seen in the figure below. The data have a 5-second granularity and run from 9AM to 11AM. Data (B) have a sharp change in level: at around 9:30AM, the level goes from 10 to 100. In a situation like this, if one wants to predict the next value 5, 10, or 15 seconds from now (or even 30 minutes from now), I'd argue one should abandon the data before 9:30AM, given that they don't reflect the recent behavior observed at the forecast origin.
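To make the "abandon the data before the shift" idea concrete, here is a minimal sketch. It assumes a single shift in mean and uses a synthetic stand-in for data (B), not my actual data; the function name `single_level_shift` and the noise level are my own choices for illustration.

```python
import numpy as np

def single_level_shift(y):
    """Index k that best splits y into two constant-mean segments y[:k], y[k:]."""
    y = np.asarray(y, dtype=float)
    n = y.size
    csum = np.cumsum(y)
    k = np.arange(1, n)                        # candidate split points
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    # Between-segment sum of squares; the largest value gives the
    # least-squares split for a single shift in mean.
    gain = k * (n - k) / n * (left_mean - right_mean) ** 2
    return int(k[np.argmax(gain)])

# Synthetic stand-in for data (B): 1440 points (2 hours at 5-second steps),
# level 10 for the first 30 minutes (360 points), then level 100.
rng = np.random.default_rng(0)
y = np.concatenate([10 + rng.normal(0, 1, 360), 100 + rng.normal(0, 1, 1080)])

k = single_level_shift(y)
recent = y[k:]          # only the post-shift data would go into the model
print(k, recent.size)
```

This only works because (B) has one obvious break; it says nothing about a series like (A).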
The same can't be said about data (A). Data (A) might have extended periods with the same value, namely zero, but no such clear shift in level. There is no apparent pattern, either. It seems to me that data (A) are harder to model. The shift in (B) at least gives a forecaster a hint: a natural place to cut, so that a model can be fit to the data that occur after 9:30AM. But for (A), I don't think we can abandon any part of the data, which leads me to question (2):
(2) What's the problem with too much/too little data?
While (1) and (2) are related, I only have a guess for (2). To me, the problem lies in the model's order selection and parameter estimation phases. If one has too little data, the chosen order might be inadequate, since too little data do not preserve memory well. On top of that, the standard errors of the parameter estimates will be high. Here, I don't even mention the impossible problem of estimating more parameters than there are data points available.
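As a rough illustration of the standard-error point, the sketch below simulates an AR(1) process and fits it by least squares at three sample sizes. The coefficient 0.6, the sample sizes, and the helper names are assumptions chosen for the example; a real analysis would use a proper ARIMA routine rather than this hand-rolled fit.

```python
import numpy as np

def simulate_ar1(phi, n, rng, burn=200):
    """Simulate n points of y[t] = phi * y[t-1] + e[t] with unit-variance noise."""
    e = rng.normal(size=n + burn)
    y = np.zeros(n + burn)
    for t in range(1, n + burn):
        y[t] = phi * y[t - 1] + e[t]
    return y[burn:]                  # drop the burn-in

def fit_ar1(y):
    """Least-squares AR(1) coefficient and its approximate standard error."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    x, z = y[:-1], y[1:]
    phi_hat = float(x @ z / (x @ x))
    resid = z - phi_hat * x
    sigma2 = float(resid @ resid) / (z.size - 1)
    return phi_hat, float(np.sqrt(sigma2 / (x @ x)))

rng = np.random.default_rng(1)
results = {n: fit_ar1(simulate_ar1(0.6, n, rng)) for n in (30, 300, 3000)}
for n, (phi_hat, se) in results.items():
    print(f"n={n:5d}  phi_hat={phi_hat:.3f}  se={se:.3f}")
```

The standard error shrinks roughly like one over the square root of the sample size, which is why a short series leaves the coefficient poorly pinned down.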
Too much data is trickier. I can't see how too much data would mess up a model's order. I'd assume there's a point beyond which more or less data won't have deep implications for a model's order, as long as we assume the "new data" keep being sampled from the same distribution. Sure, spuriously significant high lags might appear, but in most cases, a simple look at the ACF will tell us that they don't matter. With parameter estimation, I can't immediately see the problem. The more data I have, the more precise the parameter estimates should be. No? One could argue that the data might be noisy. In this case, the problem is not about data quantity, but data quality, and that affects results regardless of how much data we have.
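To show what I mean by spurious high lags, here is a small sketch using pure white noise and the usual approximate 95% band of ±1.96/√n for the sample ACF. The sample sizes and the 40-lag cutoff are arbitrary choices for the example. The band narrows as n grows, so ever smaller autocorrelations get flagged, and for white noise about 5% of lags are expected to fall outside it by chance alone.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    denom = y @ y
    return np.array([1.0] + [(y[:-k] @ y[k:]) / denom for k in range(1, max_lag + 1)])

def white_noise_band(n):
    """Approximate 95% band for the ACF of white noise with n observations."""
    return 1.96 / np.sqrt(n)

rng = np.random.default_rng(2)
for n in (100, 1440, 100_000):
    acf = sample_acf(rng.normal(size=n), max_lag=40)
    outside = int(np.sum(np.abs(acf[1:]) > white_noise_band(n)))
    print(f"n={n:6d}  band=+/-{white_noise_band(n):.4f}  lags outside band: {outside}/40")
```

A lag that crosses a very narrow band can be statistically significant and still too small to matter for forecasting.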
What I do know is that if I choose to discard some of data (A) when estimating the model's order and the parameters, the results can differ widely. Let's say I assume the data are too erratic, so that only the last, say, 45 minutes matter for forecasting. This will have a profound impact on the estimation procedures.
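The sketch below shows how much the window choice can move an estimate. It uses a hypothetical series, not data (A), whose AR(1) coefficient switches from 0.2 to 0.8 for the last 45 minutes (540 points at 5-second steps); the switch is built in by construction purely to make the effect visible.

```python
import numpy as np

def ar1_coef(y):
    """Least-squares AR(1) coefficient of a mean-centered series."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return float(y[:-1] @ y[1:] / (y[:-1] @ y[:-1]))

# Hypothetical series, NOT data (A): 1440 points (2 hours at 5-second steps)
# whose AR(1) coefficient switches from 0.2 to 0.8 for the last 45 minutes.
rng = np.random.default_rng(3)
n, last_45_min = 1440, 540               # 45 min * 60 s / 5 s = 540 points
phi = np.where(np.arange(n) < n - last_45_min, 0.2, 0.8)
e = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi[t] * y[t - 1] + e[t]

phi_full = ar1_coef(y)                   # all two hours
phi_recent = ar1_coef(y[-last_45_min:])  # last 45 minutes only
print(f"full: {phi_full:.3f}  recent: {phi_recent:.3f}")
```

Here I know which window is "right" because I built the series; with real data I don't, which is exactly my difficulty.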
(3) Yet, if too much data is a possibility, where do I cut off? What do I do to have a defensible way of saying that only X amount of data is necessary?
