$\begingroup$

I would like to know how to calculate the probability that two discrete samples come from the same distribution and, if they do, how to estimate that distribution.

Let's say we have 3 buckets, and 2 years.

| Bucket \ Year | 2019 | 2020 |
|---------------|------|------|
| A             | 76%  | 73%  |
| B             | 20%  | 22%  |
| C             | 4%   | 5%   |

I would like to claim that both years come from the same distribution (which I'd assume is close to [74.5%, 21%, 4.5%]) and that the variation between years is due to random chance alone (with what probability?). I think that to make this claim I have to calculate the probability that both samples come from the same distribution. (I have heard about 'power' for continuous random variables, but I don't know whether there is an equivalent for the discrete case.) Any hints on how to proceed?

Thanks a lot!

As a side note: the two periods have different numbers of data points, i.e. 2019 has 350 and 2020 has 400. Is that a problem?

$\endgroup$
  • $\begingroup$ Knowing the number of data points is essential (it makes a big difference to a significance test between them being hundreds or being millions) but them changing between years is not a problem. A chi-square test on the counts (preferably unrounded) rather than proportions may meet your needs $\endgroup$ Commented Sep 19, 2023 at 10:12
  • $\begingroup$ Thanks Henry! So the chi-square test needs a null hypothesis. What would it be in this scenario? I could assume that the 'real' distribution is that of 2019, that of 2020, or the global one (estimated using data from both years). $\endgroup$ Commented Sep 19, 2023 at 10:16
  • $\begingroup$ The third option is the appropriate one. Under the null hypothesis (no difference between 2019 and 2020) you can combine the data from the two years. $\endgroup$ Commented Sep 19, 2023 at 10:25
  • $\begingroup$ Unless you assume a prior probability, the answer is definitely no, no matter what. The kind of question you could answer with something like a chi-squared test concerns how consistent the data are with a hypothetical common distribution. Different numbers of data points are no problem, but the degree to which different data might be independent is a key consideration, especially when data are collected over time. $\endgroup$ Commented Sep 19, 2023 at 19:32
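
To make the suggestion in the comments concrete, here is a minimal sketch of a chi-square test of homogeneity in plain Python. The counts are an assumption: they are rebuilt from the rounded percentages and the sample sizes in the question (350 and 400), so the real, unrounded counts may differ slightly. The function names `chi_square_homogeneity` and `p_value_df2` are made up for this illustration. Note that the resulting p-value is not "the probability that both samples come from the same distribution"; it only indicates how consistent the data are with a common (pooled) distribution, and it assumes the observations are independent.

```python
import math


def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for two samples over the same categories.

    Under the null hypothesis (one common distribution) the category
    probabilities are estimated by pooling both samples.
    Assumes every category has at least one observation overall.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples must use the same categories")

    n_a, n_b = sum(counts_a), sum(counts_b)
    total = n_a + n_b

    stat = 0.0
    pooled = []
    for a, b in zip(counts_a, counts_b):
        p = (a + b) / total          # pooled estimate for this category
        pooled.append(p)
        for observed, n in ((a, n_a), (b, n_b)):
            expected = n * p         # expected count under the null
            stat += (observed - expected) ** 2 / expected

    df = len(counts_a) - 1           # (categories - 1) * (samples - 1)
    return stat, df, pooled


def p_value_df2(stat):
    """Upper-tail p-value of a chi-square variable with exactly 2 degrees
    of freedom, where the survival function is exp(-x / 2)."""
    return math.exp(-stat / 2.0)


# Assumed counts, reconstructed from the rounded percentages in the question.
counts_2019 = [266, 70, 14]   # 76%, 20%, 4% of 350
counts_2020 = [292, 88, 20]   # 73%, 22%, 5% of 400

stat, df, pooled = chi_square_homogeneity(counts_2019, counts_2020)
p_value = p_value_df2(stat)   # valid here because df == 2

print("pooled proportions:", [round(p, 4) for p in pooled])
print("chi-square:", round(stat, 3), "df:", df, "p-value:", round(p_value, 3))
```

With three buckets and two years there are 2 degrees of freedom, which is why the closed-form tail probability can be used here; for other table sizes a library routine such as `scipy.stats.chi2_contingency` would be the usual choice. The pooled proportions are weighted by sample size, so they differ slightly from the simple average of the two yearly percentages.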
