Justifying data samples are from different distribution

Question

Let $x \in \{0,1\}^N$, and

\begin{align} D &= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{M} \end{bmatrix} \end{align}

so that $D \in \{0,1\}^{M \times N}$, with one observation $x_i$ per row.

This is the original dataset. A 0 indicates the presence of a trait and a 1 indicates its absence. The order of the 0s and 1s within each $x$ matters. A new data sample $D'$ was generated using a different generative process (for example, a Boltzmann machine).

I am looking for a test statistic to show whether or not $D$ and $D'$ come from different distributions. For example, it would be possible to use the Kolmogorov-Smirnov test, but I am not certain it is appropriate for this data. Another candidate is a kernel two-sample test. Again, while this might work, I am wondering whether there are any caveats.

Or is there another statistical test that might be more relevant?

References:

Test for difference between 2 empirical discrete distributions

Method to justify claim that two samples come from the same distribution

Is Kolmogorov-Smirnov test valid with discrete distributions?

jblood94 · Accepted Answer · 2023-04-11 16:05:51Z

The hypotheses to test are:

$H_0$: $D$ and $D'$ are random draws from the same population.

$H_1$: The population from which $D$ was drawn is not the same population from which $D'$ was drawn.

A possible approach

Here I modify the notation so that $D \in \{0,1\}^{n \times k}$ and $D' \in \{0,1\}^{m \times k}$.

Combine $D$ and $D'$, then partition the observations in $k$ different ways, each corresponding to the presence of the $i^{\text{th}}$ trait, with $i\in\{1,\dots,k\}$. Under $H_0$, and conditional on the total number of pooled observations with the $i^{\text{th}}$ trait, the number of observations from $D'$ in the partition with the $i^{\text{th}}$ trait should follow the hypergeometric distribution with PMF:

$$p(a'_i|m,n,a_i)=\frac{\binom{m}{a'_i}\binom{n}{a_i}}{\binom{m+n}{a_i+a'_i}}$$ where $a_i$ and $a'_i$ are the numbers of observations in $D$ and $D'$, respectively, with the $i^{\text{th}}$ trait.
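As a quick sanity check of the PMF above, the sketch below evaluates the formula directly and compares it with R's built-in `dhyper`. The counts `a` and `a_prime` are made-up illustrative values, not taken from any real data.

```r
# Illustrative values (assumptions, not from the question's data)
n <- 100L       # observations in D
m <- 40L        # observations in D'
a <- 27L        # observations in D with the trait
a_prime <- 12L  # observations in D' with the trait

# PMF written exactly as in the formula above
pmf_formula <- choose(m, a_prime) * choose(n, a) / choose(m + n, a + a_prime)

# Same quantity via dhyper(x, m, n, k): x "white" items among k draws from
# an urn holding m "white" (D') and n "black" (D) items
pmf_builtin <- dhyper(a_prime, m, n, a + a_prime)

isTRUE(all.equal(pmf_formula, pmf_builtin))
```

Note that the last argument of `dhyper` (and `phyper`) is the pooled count $a_i + a'_i$, which is why the code in this answer passes `csxy[i]` rather than the count from $D$ alone.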

Further, if $S_i\sim U(F(a'_i-1;m,n,a_i),F(a'_i;m,n,a_i))$, where $F$ is the CDF of the hypergeometric distribution above, then $S_i\sim{U(0,1)}$, again assuming $H_0$. Test $H_0$ by testing whether $S_1,\dots,S_k$ look like draws from $U(0,1)$, using, e.g., Kolmogorov-Smirnov (which may require relaxing the requirement of independence between the $S_i$).
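For applying this to actual data matrices rather than simulations, a minimal sketch might look like the following. The function name `trait_ks_test` and the toy matrices are hypothetical and only for illustration; `D` is assumed to be an $n \times k$ 0/1 matrix and `D_prime` an $m \times k$ 0/1 matrix.

```r
# Hypothetical helper (illustration only): D is n x k, D_prime is m x k, both 0/1
trait_ks_test <- function(D, D_prime) {
  stopifnot(ncol(D) == ncol(D_prime))
  n <- nrow(D)
  m <- nrow(D_prime)
  a_prime <- colSums(D_prime)     # per-trait counts in D'
  total <- colSums(D) + a_prime   # per-trait counts in the pooled sample
  # Randomized probability integral transform: for each trait, draw
  # uniformly between F(a' - 1) and F(a') of the hypergeometric CDF
  lower <- phyper(a_prime - 1, m, n, total)
  upper <- phyper(a_prime, m, n, total)
  S <- runif(length(a_prime), lower, upper)
  list(S = S, ks = ks.test(S, "punif"))
}

# Toy data (made up): both samples share the same trait probability
set.seed(1)
D <- matrix(rbinom(100 * 9, 1, 0.5), nrow = 100)
D_prime <- matrix(rbinom(40 * 9, 1, 0.5), nrow = 40)
res <- trait_ks_test(D, D_prime)
res$ks$p.value
```

Because the statistic depends on the data only through the per-trait column sums, this sketch compares marginal trait frequencies; it would not by itself pick up differences in the dependence between traits.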

Demonstrating in R:

    set.seed(704776517)
    n <- 100L
    m <- 40L
    k <- 9L

    # simulate 1000 KS p-values for identically distributed observations
    p <- replicate(
      1e3,
      {
        x <- mapply(\(p) rbinom(n, 1, p), seq(0.1, 0.9, length.out = k))
        y <- mapply(\(p) rbinom(m, 1, p), seq(0.1, 0.9, length.out = k))
        csy <- colSums(y)
        csxy <- colSums(x) + csy
        S <- sapply(1:k, \(i) runif(
          1,
          phyper(csy[i] - 1, m, n, csxy[i]),
          phyper(csy[i], m, n, csxy[i])
        ))
        ks.test(S, punif)$p.value
      }
    )
    plot(ecdf(p), col = "blue")
    lines(0:1, 0:1)

    # simulate p-values when the distributions are slightly different
    p <- replicate(
      1e3,
      {
        x <- mapply(\(p) rbinom(n, 1, p), seq(0.1, 0.9, length.out = k))
        y <- mapply(\(p) rbinom(m, 1, p), seq(0.2, 0.8, length.out = k))
        csy <- colSums(y)
        csxy <- colSums(x) + csy
        S <- sapply(1:k, \(i) runif(
          1,
          phyper(csy[i] - 1, m, n, csxy[i]),
          phyper(csy[i], m, n, csxy[i])
        ))
        ks.test(S, punif)$p.value
      }
    )
    plot(ecdf(p), col = "blue")
    lines(0:1, 0:1)

    # simulate p-values when the distributions are even more different
    p <- replicate(
      1e3,
      {
        x <- mapply(\(p) rbinom(n, 1, p), seq(0.1, 0.9, length.out = k))
        y <- mapply(\(p) rbinom(m, 1, p), seq(0.3, 0.7, length.out = k))
        csy <- colSums(y)
        csxy <- colSums(x) + csy
        S <- sapply(1:k, \(i) runif(
          1,
          phyper(csy[i] - 1, m, n, csxy[i]),
          phyper(csy[i], m, n, csxy[i])
        ))
        ks.test(S, punif)$p.value
      }
    )
    plot(ecdf(p), col = "blue")
    lines(0:1, 0:1)
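One way to condense the ECDF plots above into a single number is the proportion of simulated p-values below a chosen level, here 0.05. The sketch below wraps the simulation in a hypothetical helper, `sim_pvalues`, which is not part of the original demonstration; the number of replicates and the seed are arbitrary choices.

```r
# Hypothetical helper (illustration only): simulate KS p-values for trait
# probabilities px (sample of size n) and py (sample of size m)
sim_pvalues <- function(reps, n, m, px, py) {
  replicate(reps, {
    x <- sapply(px, function(p) rbinom(n, 1, p))  # n x k matrix
    y <- sapply(py, function(p) rbinom(m, 1, p))  # m x k matrix
    csy <- colSums(y)
    csxy <- colSums(x) + csy
    S <- runif(length(csy), phyper(csy - 1, m, n, csxy), phyper(csy, m, n, csxy))
    ks.test(S, "punif")$p.value
  })
}

set.seed(1)
px <- seq(0.1, 0.9, length.out = 9)
p_null <- sim_pvalues(200, 100, 40, px, px)
p_alt  <- sim_pvalues(200, 100, 40, px, seq(0.3, 0.7, length.out = 9))

# Proportion of simulated p-values below 0.05 under each scenario
rate_null <- mean(p_null < 0.05)
rate_alt  <- mean(p_alt < 0.05)
```

If the test is well calibrated, `rate_null` should sit near 0.05, while `rate_alt` gives a rough estimate of power for that particular alternative.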

The final demonstration is in log-space: $S_i$ is replaced by its negative logarithm, so the test becomes whether $S\sim\text{Exp}(1)$. (This formulation is not actually needed here, but it demonstrates how to maintain numerical stability if the proportion of observations of $D$ or $D'$ in the partition is very unbalanced.)

    # simulate p-values when the distributions are very different
    p <- replicate(
      1e4,
      {
        x <- mapply(\(p) rbinom(n, 1, p), seq(0.1, 0.9, length.out = k))
        y <- mapply(\(p) rbinom(m, 1, p), seq(0.7, 0.3, length.out = k))
        csy <- colSums(y)
        csxy <- colSums(x) + csy
        S <- sapply(1:k, \(i) {
          # a is the negative log of the upper CDF value
          a <- -phyper(csy[i], m, n, csxy[i], log.p = TRUE)
          a - log1p(runif(1)*expm1(phyper(csy[i] - 1, m, n, csxy[i], log.p = TRUE) + a))
        })
        ks.test(S, pexp)$p.value
      }
    )
    plot(ecdf(p), col = "blue")
    lines(0:1, 0:1)
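To see why the log-space expression in the code above is equivalent to the earlier uniform draw, here is a short derivation. The shorthand $F_i = F(a'_i;m,n,a_i)$, $F_i^- = F(a'_i-1;m,n,a_i)$ and $U\sim U(0,1)$ is introduced here for convenience and does not appear elsewhere in the answer.

$$\begin{aligned} S_i &= -\log F_i - \log\left(1 + U\left(e^{\log F_i^- - \log F_i} - 1\right)\right) \\ &= -\log\left(F_i\left(1 + U\left(\frac{F_i^-}{F_i} - 1\right)\right)\right) \\ &= -\log\left(F_i + U\left(F_i^- - F_i\right)\right) \end{aligned}$$

The argument of the final logarithm is uniform on $(F_i^-, F_i)$, which is the same randomized quantity used in the earlier demonstrations. Since that quantity is $U(0,1)$ under $H_0$, its negative logarithm is $\text{Exp}(1)$.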
