0
$\begingroup$

I wish you good health and the best in life (whatever that means for you in particular).

Context

I want to model aggregated (monthly) tornado counts in the United States for a question on Metaculus. Specifically, I want to predict the number of tornadoes that will occur in the US in June 2024.

The National Oceanic and Atmospheric Administration's (NOAA) Storm Prediction Center provides historical tornado count data by state, month, and year; I wish to constrain myself to only using this data.

Advice Desired

For this problem, I took a multi-level modelling approach. I am not familiar with multi-level modelling and am fairly immature, mathematically speaking. I would very much appreciate advice on the following questions:

  • How and where my attempted problem-formulation description is misaligned with the mathematics I espouse (i.e. my discrepancies / how I wrong; criticism is appreciated!)
  • Whether my model is best interpreted as something other than a multi-level model.
  • What other multi-level modelling approaches might be better suited to the nature of this specific problem.
  • General advice, including on the structure of this question or the associated repository

Attempt

The following can be found in this document (which is in the aforementioned repository):

Tornado counts vary across state, month, and year. One might imagine that climatic factors change over time (across years), that the formation of tornadoes alters within a year, over different months, and that the rate of tornado formation is also affected by local geographical features (state location).

To adequately incorporate information across these categories, we can create a multilevel model. For each relevant category (State, Month, Year) available to us, we want to estimate its intercept and internal variation (e.g., across each State, how does tornado count vary?).

Since we are estimating tornado counts, a Poisson distribution seems appropriate.

Proceeding, let $\alpha_i$ denote the random effect of the location of state $i$ on expected tornado count. Further, let $\gamma_j$ and $\delta_k$ denote the random of effects of month $j$ on and year $k$, respectively, on expected tornado count. Let the expected tornado count for state $i$, month $j$, and year $k$ be called $Y_{ijk}$.

The tornado count $T_{ijk}$ for state $i$, month $j$, and year $k$ can be modelled via $T_{ijk} \sim \text{Poisson}(Y_{ijk})$, where $\log (Y_{ijk}) = \alpha_{\text{state}[i]} + \gamma_{\text{month}[j]} + \delta_{\text{year}[k]}$, i.e. $Y_{ijk}$ is log-linear.

Note that the author is "uncertain" regarding what constitute "appropriate" priors distributions and parameter values to use for a model on tornadoes.

Adaptive priors (NOTE: 10 years of data are being used):

$$ \begin{aligned} \alpha_{i} &\sim \mathcal{N}(\overline{\alpha}, \sigma_{\alpha}) \quad \text{for} \quad i = 1, 2, \dotsc, 50 \\\\ \gamma_{j} &\sim \mathcal{N}(\overline{\gamma}, \sigma_{\gamma}) \quad \text{for} \quad j = 1, 2, \dotsc, 12 \\\\ \delta_{k} &\sim \mathcal{N}(\overline{\delta}, \sigma_{\delta}) \quad \text{for} \quad k = 1, 2, \dotsc, 10 \end{aligned} $$

where

$$ \begin{aligned} \overline{\alpha} &\sim \mathcal{N}(0, \sigma_{\overline{\alpha}}) \\\\ \overline{\gamma} &\sim \mathcal{N}(0, \sigma_{\overline{\gamma}}) \\\\ \overline{\delta} &\sim \mathcal{N}(0, \sigma_{\overline{\delta}}) \end{aligned} $$

and

$$ \begin{aligned} \sigma_{\alpha}, \sigma_{\gamma}, \sigma_{\delta} &\sim \text{Exponential}(1.0) \\\\ \sigma_{\overline{\alpha}}, \sigma_{\overline{\gamma}}, \sigma_{\overline{\delta}} &\sim \mathcal{U}(0.0, 6.0) \end{aligned} $$

Code

NOTE: this is only the part of the code most relevant to the above description.

import jax.numpy as jnp import numpy as np import numpyro as npro import numpyro.distributions as dist def model_01(states=None, months=None, years=None, tornadoes=None): # states, months, years, and tornadoes represented num_states = len(np.unique(states)) num_months = len(np.unique(months)) num_years = len(np.unique(years)) # state effect hyperparameters alpha_mu_sigma = npro.sample("alpha_mu_sigma", dist.Uniform(0.0, 6.0)) alpha_mu = npro.sample("alpha_mu", dist.Normal(0, alpha_mu_sigma)) alpha_sigma = npro.sample("alpha_sigma", dist.Exponential(1.0)) # state effect with npro.plate("states", num_states): alpha = npro.sample("alpha", dist.Normal(alpha_mu, alpha_sigma)) # month effect hyperparameters gamma_mu_sigma = npro.sample("gamma_mu_sigma", dist.Uniform(0.0, 6.0)) gamma_mu = npro.sample("gamma_mu", dist.Normal(0, gamma_mu_sigma)) gamma_sigma = npro.sample("gamma_sigma", dist.Exponential(1.0)) # month effect with npro.plate("months", num_months): gamma = npro.sample("gamma", dist.Normal(gamma_mu, gamma_sigma)) # year effect hyperparameters delta_mu_sigma = npro.sample("delta_mu_sigma", dist.Uniform(0.0, 6.0)) delta_mu = npro.sample("delta_mu", dist.Normal(0, delta_mu_sigma)) delta_sigma = npro.sample("delta_sigma", dist.Exponential(1.0)) # year effect with npro.plate("years", num_years): delta = npro.sample("delta", dist.Normal(delta_mu, delta_sigma)) # expected tornadoes Y = jnp.exp(alpha[states] + gamma[months] + delta[years]) # likelihood npro.sample("obs", dist.Poisson(Y), obs=tornadoes) 

Resources

Link to the above Python code

Multilevel model (Wikipedia)


Thank you very much and I hope you have a nice day.

—FilteredFrames

$\endgroup$
2
  • $\begingroup$ Do you have the data available somewhere? The site you've linked seems to only offer to download data year by year or state by state $\endgroup$ Commented May 25, 2024 at 1:26
  • $\begingroup$ Might want to account for el-nino and la-nina strength by year. Might also care about the sunspot/solar cycle because its the biggest driver of weather behavior. Might want year and month as sine/cosine pairs, so the learner can know that month 12 and month 1 are adjacent. Might want some stats on previous month high, low, mean temperatures. Might want to review how to do proper cross-validation for time-series. $\endgroup$ Commented May 25, 2024 at 1:31

1 Answer 1

1
$\begingroup$

What other multi-level modelling approaches might be better suited to the nature of this specific problem.

If you're just using the data you've listed, there is a lot that you can do to improve the model even without adding the variables that EngrStudent mention.

First, I first a relatively simple Poisson model to your data. I allowed for a trend in year and an effect of month, and I allowed these to vary by state using a hierarchical approach.

A blunt, but useful, tool for model criticism is the posterior predictive check. If my model can produce datasets similar to the one I have, then that is a good sign. I used brms in R to fit these data and here is the posterior predictive from the poisson model. Here is a plot of the posterior predictive versus the real data

enter image description here

As you can see, there is an excess of zeros in the real data that the Poisson likelihood can't handle. Additionally, the predicted counts are simply too high, likely due to the fact we under estimated how many zeros there would be. One way to account for this excess is to use a zero inflated poisson likelihood. Again, same model, but this time we fit a mixture of a point mass at 0 and a poisson. That leads to the following posterior predictive

enter image description here

That's better in so far as we are predicting more zeros! Let's take a different look at the data by plotting a selection of states.

enter image description here

Hmm, the model still looks deficient for these states. There are some periods in which the predictions are not extreme enough. Perhaps this could be addressed with a zero inflated negative binomial.

This process which I have just demonstrated -- in which we fit a model, examine the implications from the model in some way, criticize those implications, and then determine some improvement -- has been coined "continuous model expansion". When you ask about other multilevel modelling approaches, I would suggest you take this sort of approach. While we can give you possible extensions to your model, or offer other variables one might need to condition on, nothing is a replacement for fitting a model and criticizing said model by examining what the model implies.

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.