
I am looking to estimate an OLS regression model to gauge the relationship between various sociodemographic (Census) features and political data at the neighborhood level. As an example, the model will regress voter turnout on education level, income, age composition, and racial composition. Both the dependent and predictor variables are continuous. The model will include data from several cities, and I would like to estimate city-level differences to see whether the relationships between variables differ across cities. I gather that the best approach is to estimate a single regression model and include dummies for the cities.

The problem is that the sample size varies widely across cities (n = 200 for the largest city, but only n = 20 for the smallest). I am confident that the model will have sufficient statistical power to estimate the overall relationships between variables, but would estimating city-level differences be impossible given the disparity in subsample sizes?


1 Answer


I think you have two options here (which touches on my thoughts about fixed vs random effects in modeling):

  • If you are not interested in the cities in and of themselves (but just want to account for their individual differences), you could fit this as a linear mixed model with cities as random effects. In this way, you can still see each city's deviation from the overall voter turnout, as well as model the variance of these deviations; adding random slopes would additionally let the relationships between predictors and turnout vary by city, which is the kind of difference you describe. Mixed models also handle group imbalances better than standard regressions.
  • If you are explicitly testing some theoretical aspect of cities (perhaps urban vs. rural), then you can recode your data into these groups accordingly and fit this with an OLS regression (though see below about why this may not be useful). This would not only give you a theoretically motivated goal for fitting the model, but it may also rebalance the groups toward more accurate estimates (all the smaller cities would be combined under one level of the categorical variable).
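The first option (a linear mixed model with cities as random effects) can be sketched in Python with statsmodels; the city sizes, variable names, and effect sizes below are simulated purely for illustration and are not from the question:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical unbalanced data: one large city, one medium, one small
sizes = {"A": 200, "B": 80, "C": 20}
rows = []
for city, n in sizes.items():
    income = rng.normal(50, 10, n)       # e.g. median income in $1000s
    city_shift = rng.normal(0, 2)        # city-level deviation from the overall mean
    turnout = 40 + 0.3 * income + city_shift + rng.normal(0, 3, n)
    rows.append(pd.DataFrame({"city": city, "income": income, "turnout": turnout}))
df = pd.concat(rows, ignore_index=True)

# Random intercept per city; pass re_formula="~income" as well if you want
# city-specific slopes (i.e., relationships that differ across cities)
model = smf.mixedlm("turnout ~ income", df, groups=df["city"])
result = model.fit()
print(result.summary())
```

Because the model partially pools information across cities, the estimate for a 20-observation city is shrunk toward the overall mean rather than estimated from its 20 points alone, which is exactly what helps with the imbalance described in the question.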

Having made these two points: while your response is continuous, voter turnout is a proportion bounded between 0 and 1. You may need to fit a GLM or GLMM to respect this, depending on which option you choose above. Something like a quasibinomial logistic regression or beta regression may work (though the latter requires that no proportions are exactly 0 or 1).

  • Hi! Thank you very much for the response - this is very useful. I have a couple of follow-ups: (Commented Apr 5 at 0:55)
  • (1) I am interested in specific city-related effects, so the 2nd option intrigues me. I'd just have to identify a grouping aspect. This project uses redlining data, so practically all of the neighborhoods are urban. (2) Is the main issue that the smallest city only has 20 observations? Or is it the ratio between that number and the number of observations in the largest city? These are census tract data and I could change to block group data. This would increase the sample size - but, it would do so for all the cities. (Commented Apr 5 at 1:06)
  • The approach you take will depend on what you are trying to do with your model. If you are trying to test theory, then fit the model according to the theory you want to test. If you are just trying to predict things, then nothing stops you from taking any approach you want here; but cross-validation becomes more important when the categorical variables have a limited number of values to work with, so the grouping may or may not matter. (Commented Apr 5 at 12:34)
  • Validation (or cross-validation) is when you attempt to show how well your model works in the real world, usually with some form of data splitting and testing. Validation normally entails using a large portion of your original data to fit the model and then using the rest of the data to see how your model predicts new values. Cross-validation takes on many forms that are less dependent on the specific subset used, including LOOCV, k-fold cross-validation, and other methods. You can also simulate predictions from your model. All of this is to see how the model performs in the wild. (Commented Apr 6 at 6:10)
  • For more information on all of that, Introduction to Statistical Learning in R has a good section on this within Chapter 5. (Commented Apr 6 at 6:11)
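The k-fold cross-validation described in the comments can be sketched with scikit-learn; the data here are simulated stand-ins (in the actual application, X would hold the Census predictors and y the turnout measure):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)

# Hypothetical predictors (e.g. education, income, age) and response
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 1, size=200)

# 5-fold CV: each fold is held out once and predicted by a model
# fit on the remaining four folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores, scores.mean())
```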
