$\begingroup$

Here is a dataset I have:

    simulate_year_data <- function(year) {
      seasonal_pattern <- sin(2 * pi * (1:12) / 12) * 15 + 40
      trend <- year * 2
      rainfall <- seasonal_pattern + trend + rnorm(12, 0, 10)
      rainfall <- pmax(rainfall, 0)
      cumulative_percent <- cumsum(rainfall) / sum(rainfall)
      data.frame(year = year, month = 1:12, rainfall = rainfall,
                 cumulative_percent = cumulative_percent)
    }

    data_list <- lapply(1:10, simulate_year_data)
    final_data <- do.call(rbind, data_list)


I specifically want to make a model which can forecast the cumulative_percent variable each month for the next 2 years. This model should obey the following properties as this is cumulative:

  • For any month, the forecast can only be between 0 and 1
  • The forecasts in all months should sum to 1
  • In a given year, the forecast for each month should be greater than the previous month

I could simply take the overall monthly averages and assume that the future will have these same values:

    library(dplyr)

    monthly_averages <- final_data %>%
      group_by(month) %>%
      summarise(avg_cumulative_percent = mean(cumulative_percent))

But this will result in all future years having exactly the same predictions. I want something that accounts for trends, e.g., in recent years, the year's total is accumulating over shorter time periods.

Are there some regression models suitable for this task?

$\endgroup$
  • 1
    $\begingroup$ Micro-nitpick: your cumulative forecasts should not sum to 1, but end up at 1. I hope to be able to give a few thoughts in the coming days. In the meantime, forecasting entire curves is variously called "functional data forecasting" or "growth curve forecasting", either of which search terms will give you a number of hits. $\endgroup$ Commented Jul 24 at 13:48
  • 6
    $\begingroup$ The basic problem with directly modeling cumulative values is that over time the accumulation of random errors creates a series of strongly interdependent errors of ever increasing variability. It's often better to model the individual series and then accumulate the individually fit (or predicted) values. (Indeed, when presented with plots like these, an experienced analyst would reflexively consider differencing them at the outset--which exactly undoes the accumulation!) $\endgroup$ Commented Jul 24 at 13:59
  • 1
    $\begingroup$ You lose the trend when you take proportions of an annual total. If you were to use something like lm(rainfall ~ year + as.factor(month), data=final_data) you would have something to project forward and come close to capturing the trend as well as an approximation of the seasonal pattern. $\endgroup$ Commented Jul 25 at 0:43
  • $\begingroup$ @Henry's suggestion is a good start, but an immediate improvement would be to employ a cyclic spline of month rather than using twelve discrete indicators. $\endgroup$ Commented Jul 25 at 13:48

1 Answer

$\begingroup$

Per whuber's comment, it is often better to forecast month-over-month increments and take cumulative sums to obtain a YTD total.

In this case, the cumulative sums of your monthly forecasts will generally not hit your desired total. (Or, if you work in the interval $[0,1]$, the monthly increments will not sum to exactly 1.) The simplest fix is to rescale everything at the end. Alternatively, ask yourself whether you really want a percentage forecast: this sounds as if you are implicitly assuming that you already know the yearly total, but most likely you are forecasting that as well. If so, consider the MAPA algorithm, which will give you both yearly and monthly forecasts and ensure sum consistency.
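As a rough illustration of "forecast increments, accumulate, then rescale", here is a minimal sketch in base R. It assumes the simple `lm()` with a linear trend and monthly dummies suggested in the comments, and it truncates negative forecasts at zero; the seed and the names `future`, `rainfall_hat`, and `cumulative_percent_hat` are only for illustration.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible

# Same simulation as in the question (cumulative_percent omitted here)
simulate_year_data <- function(year) {
  seasonal_pattern <- sin(2 * pi * (1:12) / 12) * 15 + 40
  rainfall <- pmax(seasonal_pattern + year * 2 + rnorm(12, 0, 10), 0)
  data.frame(year = year, month = 1:12, rainfall = rainfall)
}
final_data <- do.call(rbind, lapply(1:10, simulate_year_data))

# Model the monthly increments: linear trend plus monthly dummies
fit <- lm(rainfall ~ year + factor(month), data = final_data)

# Forecast the next two years; month varies fastest, so rows are in time order
future <- expand.grid(month = 1:12, year = 11:12)
future$rainfall_hat <- pmax(predict(fit, newdata = future), 0)

# Accumulate within each year and rescale so that December ends at exactly 1
future$cumulative_percent_hat <- ave(
  future$rainfall_hat, future$year,
  FUN = function(x) cumsum(x) / sum(x)
)
```

Because the increments are nonnegative and each year is divided by its own forecast total, the resulting curve stays in $[0,1]$, never decreases within a year, and ends at 1.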

Of course, your original constraint of monotonically increasing cumulative forecasts then turns into a new constraint of nonnegative monthly forecasts. You can ensure this by using a suitable GAMLSS model. Ziel, 2022 does this for count data (also nonnegative), but you could do the same with a gamma or lognormal regression. Alternatively, you could forecast monthly quantities on a log scale to ensure positivity. (Make sure you bias-correct your expectation forecasts, see here.)
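To make the log-scale route concrete, here is a hedged sketch using a plain `lm()` on `log(rainfall)` rather than a GAMLSS model. It assumes approximately normal residuals on the log scale, in which case the usual lognormal mean correction is $\exp(\hat\mu + \hat\sigma^2/2)$; whether this is the exact correction the linked post recommends is an assumption. Zero-rainfall months are dropped before taking logs, which is another simplification.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible

# Same simulation as in the question (cumulative_percent omitted here)
simulate_year_data <- function(year) {
  seasonal_pattern <- sin(2 * pi * (1:12) / 12) * 15 + 40
  rainfall <- pmax(seasonal_pattern + year * 2 + rnorm(12, 0, 10), 0)
  data.frame(year = year, month = 1:12, rainfall = rainfall)
}
final_data <- do.call(rbind, lapply(1:10, simulate_year_data))

# log(0) is -Inf, so keep strictly positive months only (a simplification)
pos <- subset(final_data, rainfall > 0)
fit_log <- lm(log(rainfall) ~ year + factor(month), data = pos)

future <- expand.grid(month = 1:12, year = 11:12)
log_hat <- predict(fit_log, newdata = future)
s2 <- sigma(fit_log)^2  # residual variance on the log scale

# Naive back-transform: targets the median, so it is biased low for the mean
future$rainfall_naive <- exp(log_hat)
# Bias-corrected expectation forecast under lognormal errors
future$rainfall_hat <- exp(log_hat + s2 / 2)
```

Both back-transformed forecasts are strictly positive by construction, so their cumulative sums are automatically increasing.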

whuber recommends using cyclic splines rather than monthly dummies. Well... I would concur if you had daily data. For monthly data, monthly dummies can work. The standard methods for monthly forecasts, like SARIMA or ETS, will do something quite similar to monthly dummies.
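For comparison, the cyclic-spline alternative can be sketched with `mgcv::gam()`. The basis dimension `k = 8` and the boundary knots at 0.5 and 12.5 (so that December joins smoothly onto January) are illustrative choices, not something prescribed in this thread.

```r
library(mgcv)

set.seed(1)  # assumption: seed added only to make the sketch reproducible

# Same simulation as in the question (cumulative_percent omitted here)
simulate_year_data <- function(year) {
  seasonal_pattern <- sin(2 * pi * (1:12) / 12) * 15 + 40
  rainfall <- pmax(seasonal_pattern + year * 2 + rnorm(12, 0, 10), 0)
  data.frame(year = year, month = 1:12, rainfall = rainfall)
}
final_data <- do.call(rbind, lapply(1:10, simulate_year_data))

# Linear trend plus a cyclic cubic spline of month instead of 11 dummies
fit_cc <- gam(
  rainfall ~ year + s(month, bs = "cc", k = 8),
  data = final_data,
  knots = list(month = c(0.5, 12.5))  # period boundaries of the cycle
)

future <- expand.grid(month = 1:12, year = 11:12)
future$rainfall_hat <- as.numeric(predict(fit_cc, newdata = future))

# The seasonal effect takes the same value at both ends of the cycle
ends <- as.numeric(predict(fit_cc,
                           newdata = data.frame(year = 11, month = c(0.5, 12.5))))
```

With only twelve distinct month values, the spline and the dummies will often give similar fits; the spline mainly saves parameters and smooths across neighbouring months.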

Either way, you write that

in recent years, the year's total data is occurring in shorter time periods.

Neither cyclic splines, nor monthly dummies, nor SARIMA/ETS will model this kind of dynamics. One possibility would be to use cyclic splines or monthly dummies in a GAMLSS model and use an interaction with time. This will give you a trend in the monthly effects, e.g., with January forecasts increasing over time, and February decreasing. Be aware that this will require a lot more data.
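A minimal sketch of the month-by-time interaction, using a plain `lm()` as a stand-in for a GAMLSS model. Note that the simulated data in the question has the same trend in every month, so the interaction terms estimated here are pure noise; the sketch only shows the mechanics.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible

# Same simulation as in the question (cumulative_percent omitted here)
simulate_year_data <- function(year) {
  seasonal_pattern <- sin(2 * pi * (1:12) / 12) * 15 + 40
  rainfall <- pmax(seasonal_pattern + year * 2 + rnorm(12, 0, 10), 0)
  data.frame(year = year, month = 1:12, rainfall = rainfall)
}
final_data <- do.call(rbind, lapply(1:10, simulate_year_data))

# Additive model: one common trend for all months
fit_add <- lm(rainfall ~ year + factor(month), data = final_data)
# Interaction model: every month gets its own trend (24 coefficients)
fit_int <- lm(rainfall ~ year * factor(month), data = final_data)

# F-test of whether the month-specific trends improve on the common trend
comparison <- anova(fit_add, fit_int)

# Forecast, truncate at zero, then accumulate and rescale within each year
future <- expand.grid(month = 1:12, year = 11:12)
future$rainfall_hat <- pmax(predict(fit_int, newdata = future), 0)
future$cumulative_percent_hat <- ave(
  future$rainfall_hat, future$year,
  FUN = function(x) cumsum(x) / sum(x)
)
```

The interaction model doubles the parameter count from 13 to 24 with only ten observations per month, which is why it needs much more data to estimate the month-specific trends reliably.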

Finally, as I wrote in my initial comment, you could go a completely different route and forecast entire yearly curves using a functional data forecasting approach. Unfortunately, there is much less literature on this, and there are far fewer tools.

$\endgroup$