I have data representing a population of individuals and a binary outcome of interest. The covariates themselves are often probabilities: for example, covariates 1 through 5 are estimates of the probability that the individual belongs to groups 1 through 5, respectively. There are other, non-probability covariates as well.
I have chosen group 1 as my reference group, and I want to test the hypotheses "belonging to group $j$ is associated with the outcome at a rate more than $d$ above the reference group's" for $j = 2, \ldots, 5$ and some predetermined threshold $d$. As an example, my outcome might be "arrested or not", my group-membership covariates might be demographic estimates, and my other covariates might be aspects of the alleged crime. If $d = 0.1$, then I'm trying to answer the question "does being a member of demographic $j$ increase one's arrest rate (compared to the reference demographic) by at least 10 percentage points?"
I am using statsmodels to run a logistic regression in Python, and I'm curious what the right process is to perform my full analysis. In particular:
Should I standardize (zero mean, unit standard deviation) my data? For the non-group-membership covariates I think the answer is "yes" -- I don't especially care about those coefficients. The group-membership covariates are already between 0 and 1 (probabilities), but my population is not evenly distributed: some of those covariates have means closer to 0.8, and others have means closer to 0.1.
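For concreteness, here is the kind of selective standardization I have in mind (the column names and data are made up):

```python
import numpy as np
import pandas as pd

# Made-up example: two group-membership probabilities plus one other covariate.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "group2": rng.beta(8, 2, n),             # probability, mean near 0.8
    "group3": rng.beta(1, 9, n),             # probability, mean near 0.1
    "crime_severity": rng.normal(5, 2, n),   # non-probability covariate
})

# Standardize only the non-probability covariates; leave the
# probability columns on their natural 0-1 scale.
other_cols = ["crime_severity"]
X[other_cols] = (X[other_cols] - X[other_cols].mean()) / X[other_cols].std()
```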
After the regression, do I need to transform the coefficients in some way in order to do my hypothesis testing? (E.g., moving away from log-odds toward... something else?)
After I have coefficients, how do I properly compare each group's outcome to the reference group's outcome plus $d$? Should I run a Wald test (or F test? or t test? or something else) along the lines of
regression_results.wald_test("group2 - const = 0.1") (with the numeric value of $d$ substituted in, since $d$ is not a model parameter)? I think I want tests for inferiority, superiority, and equivalence, because I have three outcomes of interest: (a) this group's outcome is probably less than the reference group's $+d$; (b) this group's outcome is probably more than the reference group's $+d$; (c) we don't know which is greater.
That last set of questions might be a little sloppy, and I apologize for not having a better footing in hypothesis testing. I come from the land of machine learning, and I'm imagining answering my questions with the 95% confidence interval of $\beta_j - const$, where $\beta_j$ is the coefficient for membership in group $j$. Visualizing the confidence interval, either the whole thing is less than $d$, or the whole thing is greater than $d$, or it contains $d$. Each of those outcomes will drive different decisions. For (a), we are probably happy with the system we have analyzed. For (b), we think the system we analyzed has a problem that needs to be fixed. For (c), we can't tell if there's a problem or not and might need to collect more data or take another approach.
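In code, the three-way decision I'm imagining is just this (names hypothetical; ci_low and ci_high would come from the 95% CI of $\beta_j - const$):

```python
def classify(ci_low, ci_high, d):
    """Map a 95% CI for (beta_j - beta_ref) to one of three decisions."""
    if ci_high < d:
        return "a"  # whole CI below d: system probably fine
    if ci_low > d:
        return "b"  # whole CI above d: probably a problem to fix
    return "c"      # CI contains d: inconclusive, gather more data
```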
Thanks.