Please excuse any ignorance here: my background is weak, I can't find much about my specific question, and this is my first post here. I am trying to understand the validity of a method I am using at work to determine the true difference between groups.
Some background: I want to compare the difference in a binary response between three treatment groups, pairwise (A/B, A/C, and B/C). The data are broken down by stratum, where each stratum corresponds to a decile of the predicted probability of responding a certain way ("yes" or "no") on the binary response variable. For example, Stratum 1 holds all observations where an individual has a predicted probability of responding "yes" between 0 and 0.09. I have reason to believe that there is some association between an individual's predicted-probability decile and their response to the treatment. However, the model that scored the predicted probabilities was inaccurate at the time of the experiment. A significantly better model now provides much more accurate predictions of the response, and I want to resample with weights so that, within each stratum, the proportion of individuals with each response matches what the new model suggests. For decile 10, with predicted probabilities 0.9 - 0.99, the average proportion of individuals who actually respond "yes" is 0.95, and I would like to draw a new sample from each stratum that preserves these proportions. So if each stratum is sampled with replacement 1000 times, then for the 10th decile 95% of the draws would come from users in decile 10 who responded "yes", while the remaining 5% would come from those who responded "no".
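A minimal sketch of the within-stratum resampling described above, assuming each observation is a dict with a 0/1 `response` field. The function name `resample_stratum`, the field names, and the toy stratum are hypothetical and only for illustration. In practice this would presumably be run separately for each treatment group and decile.

```python
import random

def resample_stratum(rows, target_yes, n_draws, rng):
    """Draw n_draws rows with replacement so that a fraction target_yes
    of the draws come from "yes" responders and the rest from "no"."""
    yes_rows = [r for r in rows if r["response"] == 1]
    no_rows = [r for r in rows if r["response"] == 0]
    if not yes_rows or not no_rows:
        raise ValueError("stratum needs at least one 'yes' and one 'no' row")
    n_yes = round(target_yes * n_draws)  # e.g. 950 of 1000 for decile 10
    n_no = n_draws - n_yes
    # random.choices samples with replacement.
    return rng.choices(yes_rows, k=n_yes) + rng.choices(no_rows, k=n_no)

rng = random.Random(0)
# Hypothetical decile-10 stratum: 40 observed rows, 30 "yes" and 10 "no".
stratum_10 = [
    {"id": i, "decile": 10, "response": 1 if i < 30 else 0}
    for i in range(40)
]
resampled = resample_stratum(stratum_10, target_yes=0.95, n_draws=1000, rng=rng)
yes_share = sum(r["response"] for r in resampled) / len(resampled)
distinct_ids = len({r["id"] for r in resampled})
```

Here the resampled stratum has 1000 rows with a "yes" share of 0.95, but it still contains at most 40 distinct individuals.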
I am wondering whether it is valid to resample with replacement within each stratum using these new weights to create a sample larger than the original. Also, can I then draw bootstrap samples from this reweighted sample and compare their two-sample z-test for proportions test statistics to the test statistic of the reweighted sample, in order to obtain a valid p-value?
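To make the second question concrete, here is a rough sketch of one common way to get a bootstrap p-value from a two-sample z statistic for proportions: pool the two groups to mimic the null of no difference, resample, and count how often the bootstrap statistic is at least as extreme as the observed one. This is only one possible reading of the procedure, not necessarily the right one for reweighted data, and the function names and toy data are hypothetical.

```python
import math
import random

def two_prop_z(x1, n1, x2, n2):
    """Two-sample z statistic for proportions using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (p1 - p2) / se

def bootstrap_p_value(group_a, group_b, n_boot, rng):
    """Two-sided bootstrap p-value, resampling from the pooled data (null)."""
    z_obs = two_prop_z(sum(group_a), len(group_a), sum(group_b), len(group_b))
    pooled = group_a + group_b
    extreme = 0
    for _ in range(n_boot):
        a_star = rng.choices(pooled, k=len(group_a))
        b_star = rng.choices(pooled, k=len(group_b))
        z_star = two_prop_z(sum(a_star), len(a_star), sum(b_star), len(b_star))
        if abs(z_star) >= abs(z_obs):
            extreme += 1
    # The +1 terms keep the estimated p-value strictly above zero.
    return z_obs, (extreme + 1) / (n_boot + 1)

rng = random.Random(1)
# Hypothetical 0/1 responses for two treatment groups.
group_a = [1] * 60 + [0] * 40
group_b = [1] * 45 + [0] * 55
z_obs, p_value = bootstrap_p_value(group_a, group_b, n_boot=2000, rng=rng)
```

Note that this sketch treats every row as an independent observation, which is exactly the assumption in doubt once rows have been duplicated by resampling.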
I'm aware that the first step relates to post-stratification, and that it is often suggested that the reweighted sample be the same size as the original, but my sample size is somewhat small. I'm also concerned about whether the test statistic is a valid measure, since the standard error of my observations may be thrown off due to the sampling with replacement. Any help would be appreciated. Thank you.
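A small illustration of the standard error concern, using made-up numbers: the usual formula sqrt(p(1 - p) / n) depends only on the nominal sample size, so inflating n by resampling with replacement shrinks the reported standard error even though no new observations were collected. The values of `p_hat`, `n_original`, and `n_inflated` below are hypothetical.

```python
import math

def naive_se(p, n):
    """Standard error of a proportion, treating all n rows as independent."""
    return math.sqrt(p * (1 - p) / n)

p_hat = 0.5        # hypothetical observed proportion
n_original = 100   # hypothetical original stratum size
n_inflated = 1000  # size after resampling with replacement

se_original = naive_se(p_hat, n_original)   # about 0.05
se_inflated = naive_se(p_hat, n_inflated)   # about 0.0158
shrink_factor = se_inflated / se_original   # sqrt(100 / 1000), about 0.316
```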