I have a "stacked panel" data set with ~600,000 rows. While the data are compiled from a survey, they are no longer in survey format: rather, these are person-year observations for ~100,000 people over ~10 years. Thus, I'm not able to use R's survey package in a strict sense, because I do not have survey data.
It is called mydata and it looks something like this:

    unique_pid  housing.category  agebucket  sample.weight  year
    1_1         0                 (30, 45]   29.9           1999
    1_1         0                 (30, 45]   29.9           2000
    1_1         1                 (30, 45]   19.9           2001
    1_1         1                 (30, 45]   39.9           2002
    1000_33     0                 (15, 30]   10             1982
    1000_33     1                 (15, 30]   10.2           1983
    1000_33     0                 (15, 30]   13             1984
    1000_33     1                 (15, 30]   12             1985
    1000_33     0                 (15, 30]   10             1986
    1000_33     1                 (15, 30]   12             1987
    88_2        0                 (30, 45]   0.99           1990
    88_2        0                 (30, 45]   0.89           1991
    88_2        1                 (30, 45]   1.99           1992
    88_2        0                 (30, 45]   2.99           1993

I am running a weighted logit with R's glm(). My call looks like this:

    glm(housing.category ~ agebucket, data = mydata, family = "binomial", weights = sample.weight)

I get the following warning:

    Warning message:
    In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

My weights are contained in the vector sample.weight in the data frame above. While they are not integers, they are designed to adjust for over- and under-sampling of certain groups in the survey: oversampled populations are assigned values < 1, while undersampled populations are assigned values > 1. In this sense, they are frequency weights.
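To make the problem reproducible, here is a minimal sketch that rebuilds only the rows shown above as a data frame and runs the same call. With so few rows the estimates are meaningless; the point is just to trigger the warning. The helper variable `warns` is introduced here for illustration and is not part of the original code.

```r
# Rebuild the sample rows shown above (14 rows only; illustrative, not the full data)
mydata <- data.frame(
  unique_pid = c(rep("1_1", 4), rep("1000_33", 6), rep("88_2", 4)),
  housing.category = c(0, 0, 1, 1,  0, 1, 0, 1, 0, 1,  0, 0, 1, 0),
  agebucket = factor(c(rep("(30, 45]", 4), rep("(15, 30]", 6), rep("(30, 45]", 4))),
  sample.weight = c(29.9, 29.9, 19.9, 39.9,
                    10, 10.2, 13, 12, 10, 12,
                    0.99, 0.89, 1.99, 2.99),
  year = c(1999:2002, 1982:1987, 1990:1993)
)

# Fit the weighted logit and collect any warnings it raises
# (warns is a hypothetical helper variable used only to record the messages)
warns <- character(0)
fit <- withCallingHandlers(
  glm(housing.category ~ agebucket, data = mydata,
      family = "binomial", weights = sample.weight),
  warning = function(w) {
    warns <<- c(warns, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warns)      # warning messages collected during the fit
print(coef(fit))  # the model is still fitted despite the warning
```

The warning comes from the binomial family's initialization step, which checks whether weight times response is (close to) an integer for every row.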
However, when I read the documentation for glm(), I see that these weights should represent the number of trials, i.e. they should be a vector of integers.
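To illustrate the "number of trials" reading, here is a small sketch on made-up toy data (the data frame `toy` and its columns are hypothetical, not from my survey). With a 0/1 response and integer weights, a weighted binomial glm() should give the same coefficient estimates, up to numerical tolerance, as an unweighted fit on data in which each row is repeated as many times as its weight.

```r
# Toy data (hypothetical): binary outcome y, one predictor x, integer weights w
toy <- data.frame(
  y = c(0, 1, 0, 1, 1, 0),
  x = c(1, 2, 3, 4, 5, 6),
  w = c(2, 1, 3, 2, 1, 2)
)

# Weighted fit: each weight is treated as a number of identical trials
fit_weighted <- glm(y ~ x, data = toy, family = binomial, weights = w)

# Equivalent unweighted fit: repeat each row w times
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$w), ]
fit_expanded <- glm(y ~ x, data = expanded, family = binomial)

# Compare the coefficient estimates from the two fits
print(coef(fit_weighted))
print(coef(fit_expanded))
```

My weights cannot be interpreted this way, since a row cannot be repeated 29.9 or 0.89 times, which is presumably why glm() complains.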
So far, I have worked with the survey R package, but my data are no longer in survey format.
Am I misinterpreting the weights argument? Is this a numerical error? If so, how can I run a logit weighted by sample weights?