
I have data representing a population of individuals and a binary outcome of interest. The covariates themselves are often probabilities. For example, covariates 1 through 5 are an estimate of the probability that this individual belongs to group 1 (through 5). There are other non-probability covariates as well.

I have chosen group 1 as my reference group, and I'm curious to test the hypotheses "belonging to group $j$ is associated with an outcome $d$ more than the reference group" for $j=2, ..., 5$ and some predetermined threshold $d$. As an example, my outcome might be "arrested or not", my group membership covariates might be demographic estimates, and my other covariates might be aspects of the alleged crime. If $d=0.1$, then I'm trying to answer the question "does being a member of demographic $j$ increase one's arrest rate (compared to the reference demographic) by at least 10 percentage points?"

I am using statsmodels to run a logistic regression in Python, and I'm curious what the right process is to perform my full analysis. In particular:

  1. Should I standardize (0 mean, unit standard deviation) my data? For the non-group-membership covariates I think the answer is "yes" -- I don't especially care about those coefficients. For group membership covariates, the outputs are between 0 and 1 (probabilities). My population is not evenly distributed: some of those group-membership covariates have means closer to 0.8, and others have means closer to 0.1.

  2. After the regression, do I need to transform the coefficients in some way in order to do my hypothesis testing? (E.g., moving away from log-odds toward... something else?)

  3. After I have coefficients, how do I properly compare each group's outcome to the reference group's outcome plus $d$? Should I run a Wald test (or F test? or T test? or something else) along the lines of regression_results.wald_test("(group2 + d = const)")? I think I want a test for inferiority, superiority, and equality because I have three outcomes of interest: (a) this group's outcome is probably less than the reference group's $+d$; (b) this group's outcome is probably more than the reference group's $+d$; (c) we don't know which is greater.

That last line of questions might be a little sloppy, and I apologize for not having a better footing in hypothesis testing. I come from the land of machine learning, and I'm imagining answering my questions with the 95% confidence interval of $\beta_j - const$ where $\beta_j$ is the coefficient for membership in group $j$. Visualizing the confidence interval, either the whole thing is less than $d$, or the whole thing is greater than $d$, or it contains $d$. Each of those outcomes will drive different decisions. For (a), we are probably happy with the system we have analyzed. For (b), we think the system we analyzed has a problem that needs to be fixed. For (c), we can't tell if there's a problem or not and might need to collect more data or take another approach.
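The CI-based decision rule I have in mind could be sketched roughly like this (synthetic data and column names are made up for illustration; note the coefficients and their confidence intervals are on the log-odds scale, so a probability-scale threshold like 10 percentage points would really require marginal effects rather than raw coefficients):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: two group-membership probabilities plus one other covariate
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "p_group2": rng.uniform(0, 1, n),
    "p_group3": rng.uniform(0, 1, n),
    "severity": rng.normal(0, 1, n),
})
logit = -1.0 + 0.8 * df["p_group2"] + 0.1 * df["p_group3"] + 0.5 * df["severity"]
df["arrested"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

res = smf.logit("arrested ~ p_group2 + p_group3 + severity", data=df).fit(disp=0)

# Threshold d, interpreted here on the log-odds scale purely for illustration
d = 0.1
for name in ["p_group2", "p_group3"]:
    lo, hi = res.conf_int().loc[name]
    if hi < d:
        verdict = "(a) effect below threshold"
    elif lo > d:
        verdict = "(b) effect above threshold"
    else:
        verdict = "(c) inconclusive"
    print(name, round(lo, 3), round(hi, 3), verdict)
```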

Thanks.

  • statsmodels currently does not allow one-sided alternatives in model hypothesis tests: github.com/statsmodels/statsmodels/issues/1193 However, t_test also provides a two-sided confidence interval for the linear restriction. Commented Nov 1, 2024 at 14:13

1 Answer


I'll go through your questions in order:

Should I standardize (0 mean, unit standard deviation) my data? For the non-group-membership covariates I think the answer is "yes" -- I don't especially care about those coefficients. For group membership covariates, the outputs are between 0 and 1 (probabilities). My population is not evenly distributed: some of those group-membership covariates have means closer to 0.8, and others have means closer to 0.1.

There is no need to standardize your covariates for logistic regression, and I'm not sure how it would help here other than for numerical stability during convergence, which I'm guessing is not an issue. I always prefer to work with logistic regression coefficients in their raw form (categorical or not) and transform them into odds ratios or probabilities depending on what information I want to convey.
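As a minimal sketch of that transformation (synthetic data and variable names invented here): exponentiating the raw coefficients gives odds ratios, and because the exponential is monotone, the confidence interval endpoints can be exponentiated the same way.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic binary-outcome data with a single continuous covariate
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({"x": rng.normal(0, 1, n)})
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.2 * df["x"]))))

res = smf.logit("y ~ x", data=df).fit(disp=0)

odds_ratios = np.exp(res.params)   # multiplicative change in odds per unit of x
ci_or = np.exp(res.conf_int())     # CI endpoints transform monotonically
print(odds_ratios)
print(ci_or)
```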

After the regression, do I need to transform the coefficients in some way in order to do my hypothesis testing? (E.g., moving away from log-odds toward... something else?)

See above point. Transformation also isn't necessary for hypothesis testing here.

After I have coefficients, how do I properly compare each group's outcome to the reference group's outcome plus $d$? Should I run a Wald test (or F test? or T test? or something else) along the lines of regression_results.wald_test("(group2 + d = const)")? I think I want a test for inferiority, superiority, and equality because I have three outcomes of interest:

  • (a) this group's outcome is probably less than the reference group's +d
  • (b) this group's outcome is probably more than the reference group's +d
  • (c) we don't know which is greater.

You could just compare the confidence intervals of each group's coefficient, or test linear combinations of coefficients directly with a contrast matrix, and see whether the differences between groups are meaningful.
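A hedged sketch of the contrast approach in statsmodels (synthetic data; variable names invented): `t_test` accepts either a linear restriction string or a contrast matrix, and its result carries the estimated difference, a two-sided confidence interval, and a p-value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with two group-membership probability covariates
rng = np.random.default_rng(2)
n = 2000
df = pd.DataFrame({
    "p_group2": rng.uniform(0, 1, n),
    "p_group3": rng.uniform(0, 1, n),
})
logit = -0.5 + 1.0 * df["p_group2"] + 0.2 * df["p_group3"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

res = smf.logit("y ~ p_group2 + p_group3", data=df).fit(disp=0)

# CI and test for the difference of two coefficients via a restriction string
tt = res.t_test("p_group2 - p_group3 = 0")
print(tt.effect, tt.conf_int())

# Equivalent contrast-matrix form: columns are [Intercept, p_group2, p_group3]
R = np.array([[0.0, 1.0, -1.0]])
tt2 = res.t_test(R)
print(tt2.effect)
```

Either form gives the same estimate; the string form is easier to read, while the matrix form scales to many simultaneous contrasts.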

  • Thanks for the response! "You could just compare the confidence intervals of each group": I was heading down this route as well. What test or formula would you recommend? The standard standard-error-of-differences, or something else? Commented Oct 31, 2024 at 15:45
