Problem statement
I’m trying to predict the “current” distribution of wood-burning fireplaces at ZIP code level across 9 California counties based on 15 years of surveys with presence/absence data on fireplaces, indexed by home type (single-unit or multi-unit building) and home ZIP code.
I’m having trouble understanding and expressing my options for formulating a model in “statistician language” in order to achieve my goals. I’m hoping this community can help me with that, so I can follow up with some more specific questions that will use your time effectively, and so this discussion can help others with similar needs.
I have tried to find and understand material like GLMMs: A Practical Guide for Ecology and Evolution and I recognize that one "cannot recommend a single, universal procedure because different methods are appropriate for different problems" (@BenBolker 2008). I have also tried to sharpen my understanding by investigating specific tools like sdmTMB, if only to try and build a sense of the space of available options. Unsurprisingly, the space seems very large and high-dimensional.
Therefore, if this post generates very specific/concrete advice, that’s great, but my primary goal is to improve my ability to describe my problem and goal concisely and accurately, so I can be a better help-seeker and responsibly evaluate different approaches. For example, it may make sense in my case to model Y using a spatial random field. But while geographic distance is a plausibly helpful predictor of similarity in some areas, in others (like city centers) that assumption breaks down quickly: adjacent neighborhoods can be very distinct, and I don’t want to over-smooth. So maybe, given my goals, it is better not to. I could use help both in articulating my loss function(s) and in weighing the possible trade-offs.
A helpful answer will be something like “your problem is basically a species distribution modeling problem with the following caveats, e.g. urban built-environment variables don’t behave like an ocean or a prairie and so you will encounter certain pitfalls if you take that approach; you should look specifically to imitate SDMs characterized by _____. At the same time, you can also reasonably model it with crossed effects for neighborhood and time and assume a temporal structure that is AR(1), ignoring any spatial autocorrelation; the trade-off between these two approaches will mainly be in terms of _____ and depend on _____.”
What I’m working with
There are nuances and messiness to the actual dataset I’m working with that I’m trying not to clutter this post with. What follows is the big picture:
Sampling design: repeated cross-sectional survey. I have inherited a long-running (~15 year) survey that has sampled about 1,000 to 3,000 individual households per year from a nine-county region with a universe of about 3 million households. Households are indexed by ZIP code and year. Most ZIP codes can be spatially indexed with ZCTA polygons (a few percent are PO Boxes). There are about 300 ZIP codes in the region.
Households are not intentionally sampled more than once. Some counties are oversampled by design relative to their population: within a given year, each of the nine counties has the same number of responses as every other county. I’m not sure what the original intent was, but I have design weights (srv_wt) to compensate. Unfortunately, this means that the areas of greatest interest tend to be the least sampled relative to population density. Some ZIP codes and housing types also appear to have systematically higher response rates; see the note on post-stratification below. Even in the most intensively sampled ZCTAs, the total number of responses to date is only about 1% of the ZCTA’s population (i.e., number of households), so repeat sampling of the same household is likely to be rare.
As can be seen from Figure 1, there is some zero- and one-inflation but, as explained in the next section, this may be less of a concern than if the end goal were really to estimate rates.
Primary response variable. The survey asks whether the household has a wood-burning fireplace (regardless of whether it is used). At the regional level, there has been a steady decline in the share of households that answer “Yes”, from ~40% at the start of the survey to ~25% in recent years. Figure 1 shows survey responses by housing structure type (single-unit or multi-unit, a key predictor). A few percent consistently “Don’t know” or decline to respond.
Auxiliary variables. The survey itself has some auxiliary measures that can help to predict Y; some can be used to post-stratify. For example, the survey also asks whether the household has a natural gas fireplace; this turns out to be anticorrelated with having a wood-burning fireplace. And, it asks about housing unit characteristics: home type (single unit vs multi unit) and approximate age. It turns out that post-stratifying on home type is important, as the survey has differential non-response favoring single-unit homes, which are ~3x as likely to have a wood-burning fireplace (Figure 1). For post-stratification, I am using ZCTA-level housing stock estimates from the American Community Survey, which are stratified by home type as well as the decade in which the home was constructed (pre- vs post-2000 is a natural break).
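In case concreteness helps, below is a minimal sketch of the cell-level post-stratification I have in mind. All object and column names (svy, acs, built_pre2000, n_households, final_wt) are placeholders rather than my real schema, and in practice sparse cells would need pooling or raking.

    library(dplyr)

    # acs: ACS household counts per ZCTA x home type x construction era (n_households)
    # svy: one row per survey response, carrying the design weight srv_wt
    cell_totals <- svy |>
      count(zcta, home_type, built_pre2000, wt = srv_wt, name = "wt_resp")

    svy <- svy |>
      left_join(cell_totals, by = c("zcta", "home_type", "built_pre2000")) |>
      left_join(acs,         by = c("zcta", "home_type", "built_pre2000")) |>
      mutate(ps_wt    = n_households / wt_resp,   # scale each cell up to the ACS housing stock
             final_wt = srv_wt * ps_wt)           # combined design x post-stratification weight

(survey::postStratify() or survey::rake() would be the more standard tooling, but the arithmetic is the same.)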
Statistical goal and data-generating process
Statistical goal: predict spatially resolved fireplace prevalence circa 2020. The year 2020 was not sampled, but years 2008 through 2019 and 2021 through 2023 were. I would like my model to generate predictions for 2020.
Because context should drive modeling decisions, here's a little more of the big picture. The predictions from this statistical modeling are to be used to calculate spatially resolved emission fluxes (g/s/km²) which will serve as inputs to a physical (atmospheric chemistry) simulation. Among the most important reasons I'm aiming for "circa 2020" predictions is that the atmospheric simulation to which this serves as input is parameterized for 2020 meteorology, and will be used to inform potential interventions. Note that this need for interpolation means I cannot model the survey year (i.e., wave) strictly as a categorical variable.
The canonical way to do this is to first estimate fireplace prevalence rates (#/home), and then scale by population (home/km²). This gives the spatial density (#/km²) of fireplaces, and a dasymetric approach can be used: multiply the ZCTA-level rates by a gridded population density at the resolution of the atmospheric simulation, which is 1x1 km. (Most ZCTAs are far larger than 1x1 km.) The population density comes from the Census and, for practical purposes, can be regarded here as being without error. After that there are some additional steps to estimate emissions per fireplace.
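For concreteness, the dasymetric step I have in mind looks roughly like the sketch below; grid (an sf object of 1x1 km cells with a Census-derived hh_per_km2 column), zcta_polys, and zcta_pred (a plain data frame of modeled ZCTA-level rates, wbfp_per_hh) are placeholder names.

    library(sf)
    library(dplyr)

    grid_fp <- grid |>
      st_join(zcta_polys["zcta"], largest = TRUE) |>   # ZCTA covering most of each cell
      left_join(zcta_pred, by = "zcta") |>             # attach modeled rate (#/home)
      mutate(wbfp_per_km2 = wbfp_per_hh * hh_per_km2)  # fireplaces per km^2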
So, although there is zero-inflation and one-inflation observable in Figure 1, it is also the case that this tends to happen in less populated areas. And the error that matters is on the scale of counts (population × rate), not rates. Maybe this means I should formulate a model to penalize error in terms of counts, rather than rates, but I do not know how to do that / have not seen examples. Insights welcome!
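To make that concrete, the kind of check I imagine is sketched below, where n_hh, p_obs, and p_hat are placeholder columns for ACS households, the direct survey estimate, and a model prediction for the same ZCTA.

    library(dplyr)

    zcta_check |>
      mutate(count_err = n_hh * (p_hat - p_obs)) |>   # error in number of fireplaces
      summarise(
        rmse_rate  = sqrt(mean((p_hat - p_obs)^2)),
        rmse_count = sqrt(mean(count_err^2)),         # dominated by populous ZCTAs
        bias_count = sum(count_err)                   # net regional over/under-count
      )

This is just a scoring device (equivalent to weighting squared rate errors by n_hh squared); whether and how to carry the same weighting into the model fit itself is part of what I’m asking.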
To cast this as a missing-data problem, I could say that the data I wish I had (but never will) is the true prevalence of wood-burning fireplaces among all occupied households in this nine-county region during 2020, at 1x1 km scale. An intermediate missing-data problem might be the set of responses (Yes / No / Don’t know) from all 3 million households if they had all been polled in 2020, and had all responded. The latter sets aside the issue of response bias, which might help in focusing answers to this post.
I am not interested in drawing inferences about causal effects of anything, just making good predictions about the unknown quantities (ZCTA-level counts) for a specific, unobserved year (2020). (Aside: when fitting a mixed-effects model, I understand it is important to center and scale a numerical predictor. I can do that, and the center is obvious, but I am wondering what should determine the degree of rescaling. I'm also not super confident in my ability to correctly account for AR(1) ... and if I'm just trying to generate accurate predictions and all of the parameter estimates are "nuisances" ... how much does it matter?)
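(For reference, the kind of transform I mean is below; my understanding, corrections welcome, is that the choice of scale mainly affects sampler behavior and prior interpretability rather than the predictions, as long as the identical transform is applied when building the 2020 prediction frame.)

    yr_mean <- mean(svy$year)                           # center near the middle of the series
    yr_sd   <- sd(svy$year)                             # or a round unit, e.g. 5 or 10 years
    svy$year_z          <- (svy$year - yr_mean) / yr_sd
    newdata_2020$year_z <- (2020     - yr_mean) / yr_sd # same transform at prediction time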
Mechanistic considerations. Since 2008, which is the earliest year represented in these survey data, wood-burning fireplaces have been prohibited in new construction in the Bay Area. We can hypothesize that the main driver of the temporal trend since 2008 is remodeling of existing housing, including retirement of wood-burning fireplaces or their replacement with gas-fueled fireplaces. Therefore one could bound the prior on the temporal trend in terms of counts at all spatial levels: it ought to be negative or zero. In terms of rates, the addition of new housing (which cannot have wood-burning fireplaces) will contribute somewhat to an overall decline in prevalence rates (#/home), but there is not much growth in new housing; this is the San Francisco Bay Area. As for the spatial trend(s), there is a wide range of housing stock both in terms of structure type and age.
I can say more about the survey instrument itself but, as I stated up front, I’d rather save that for follow-up.
Goal of this post
What are my top 2 or 3 options for formulating a spatially resolved (ZCTA-level) model with the goal of predicting 2020 wood-burning fireplace prevalence in terms of spatial densities (#/km² or #/ZCTA)? I see all kinds of acronyms in the literature and I don’t know how to choose among them or explore them efficiently. I am philosophically inclined toward a Bayesian framework, and interested in knowing how (and whether!) to specify an adequately formulated mixed-effects model given the above. It seems much easier to find school/student/item examples than site × time × … examples. Maybe I’m looking in the wrong places?
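To show the level at which I can currently operate, and to give answers something concrete to correct, one formulation I have been contemplating is sketched below in brms. The variable names continue the placeholders above (final_wt, year_z, newdata_2020), county is assumed to be a column, likelihood-weighting with survey weights is one common but debated choice in a Bayesian fit, and the prior numbers are purely illustrative (a soft version of the “negative or zero” trend argument above).

    library(brms)

    svy$wbfp <- as.integer(is_yes(svy$have_wbfp))   # 1 = has a wood-burning fireplace

    fit <- brm(
      wbfp | weights(final_wt) ~ home_type * year_z + (1 + year_z | zcta) + (1 | county),
      family = bernoulli(),
      prior  = set_prior("normal(-0.5, 0.5)", class = "b", coef = "year_z"),
      data   = svy
    )

    # newdata_2020: one row per ZCTA x home type, with year_z set to its 2020 value
    p2020 <- posterior_epred(fit, newdata = newdata_2020, allow_new_levels = TRUE)

The alternatives I can’t yet choose among are exactly what I’m asking about: an explicit AR(1) structure on the year effects, a CAR/ICAR term over ZCTA adjacency, or an SPDE-style random field via sdmTMB or INLA, versus the exchangeable ZCTA intercepts/slopes above.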
I’ve been using R for about 15 years and am much better equipped to discuss things in that “lingua franca” than in the equation-based notation that seems to be preferred here, although I understand why it is preferred and I’m trying to become more fluent. I can post data/reprexes if that is helpful. I don’t need help running code; I am able to fit classical or Bayesian logistic models with fixed and/or random effects using various packages, and to check the predictions visually, both on my data and on simulations, to confirm the code is not generating terribly implausible results. If there are specific things I can do to better compare the outcomes of different approaches, beyond comparing AICs or log-likelihoods, I would love to know.
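As an example of the kind of comparison I could run, continuing the placeholder sketch above, I could hold out a survey year adjacent to the target and score each candidate model on the count scale:

    heldout <- 2019
    fit_cv  <- update(fit, newdata = subset(svy, year != heldout))
    p_hat   <- colMeans(posterior_epred(fit_cv,
                                        newdata = subset(svy, year == heldout),
                                        allow_new_levels = TRUE))
    # ...then aggregate p_hat to the ZCTA level and score with rmse_count / bias_count
    # as sketched earlier, alongside graphical posterior predictive checks via pp_check(fit_cv)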
To restate the goal of this post: my understanding is that Cross Validated aims to be a generalist/abstract forum. So I’m looking for guidance that will help me understand my best options for building and checking mixed-effects models with space, time, and at least one more dimension (in this case, home type), for a dataset of roughly this size and sampling design. Some of that is inevitably subjective, but I’m hopeful there are worked examples out there of problems with a similar character, regardless of how various people have labeled them.
Figures
Figure 1. Trends in prevalence of wood-burning fireplaces among occupied housing units in (a) single-unit structures and (b) multi-unit structures such as apartments. Note that 2020 is missing. Points are weighted means of survey responses, using survey design weight srv_wt; ranges are calculated using 2 * wtd_prop_se(is_yes(have_wbfp), srv_wt) (code follows). The size of each point is proportional to the sum of srv_wt among the underlying aggregated observations.
Figure 2. Distribution of ZCTA-level wood-burning fireplace prevalence, aggregated across 2008-2022 (excluding 2020, which is missing). Points and ranges are calculated as in Fig 1.
Code
For weighted "margins of error" (ranges) in Figs 1 and 2:
    wtd_prop_var <- function(x, w = NULL, na.rm = TRUE, normwt = TRUE) {
      # Weighted variance of a 0/1 indicator (i.e., of a proportion), via Hmisc
      v <- Hmisc::wtd.var(x, weights = w, na.rm = na.rm, normwt = normwt)
      return(v)
    }

    wtd_prop_se <- function(x, w = NULL, ...) {
      v <- wtd_prop_var(x, w = w, ...)
      if (is.null(w)) {
        return(sqrt(v))
      } else {
        adj <- sum(w^2) / sum(w)^2   # ~ 1 / effective sample size under the weights
        return(sqrt(v * adj))
      }
    }
Thank you
Thank you in advance for any and all guidance. Simplest is best! As the old adage goes, “be half as clever as you think you can be, because you have to be twice as clever to debug it.”
If clarifications or additional details would be helpful, please don’t hesitate to ask. Thanks again!

