Comparing the distribution

Question

I want to compare the distributions of 2 independent datasets. Measurements were performed on an experimental dataset (TEST) and compared with a completely independent reference dataset (REF). The idea is to determine whether the measurements in the experimental dataset follow the same distribution as the reference.

I looked at the Kolmogorov-Smirnov test (two-sided) but I am not sure it does exactly what I think it does.

Can anyone suggest a test more appropriate to test the difference/similarity in distribution of these 2 datasets?

Thanks!

You could measure the distance between the two distributions using the Kullback-Leibler divergence to get a better feel for the dissimilarity, and then use the Kolmogorov-Smirnov test just to compute the p-value.
– resnet, Commented Aug 15, 2019 at 15:27
Try also the Baumgartner-Weiss-Schindler test, cf. cran.r-project.org/web/packages/BWStest/index.html
– steveo'america, Commented Aug 15, 2019 at 16:23
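
The first comment's suggestion (quantify the dissimilarity with a KL divergence, then get a p-value from a two-sample KS test) might be sketched as follows in Python with SciPy. This is only an illustration under stated assumptions: the measurements are invented, the 2-fold dilution grid is a guess at the shape of the data, and the `binned_probs` helper and its pseudocount of 0.5 are hypothetical choices, not something from the thread. Note that the KS p-value is not exact when the data contain ties.

```python
import numpy as np
from scipy.stats import entropy, ks_2samp

# ASSUMPTION: invented measurements on a 2-fold dilution grid; not real data.
rng = np.random.default_rng(0)
grid = 2.0 ** np.arange(-6, 10)            # 0.015625, 0.03125, ..., 512
ref = rng.choice(grid, size=500)           # hypothetical REF measurements
test = rng.choice(grid[:8], size=300)      # hypothetical TEST measurements


def binned_probs(x, grid, pseudocount=0.5):
    """Relative frequency of each grid value in x.

    The pseudocount (an assumption) keeps every bin non-zero so that the
    KL divergence stays finite when one dataset has empty bins.
    """
    counts = np.array([(x == g).sum() for g in grid], dtype=float)
    counts += pseudocount
    return counts / counts.sum()


p_test = binned_probs(test, grid)
p_ref = binned_probs(ref, grid)

# scipy.stats.entropy(pk, qk) returns the KL divergence KL(pk || qk) in nats.
kl = entropy(p_test, p_ref)

# Two-sample, two-sided KS test on the raw measurements.
# With heavily tied (discrete) data the p-value is only approximate.
ks = ks_2samp(test, ref)

print(f"KL(TEST || REF) = {kl:.3f} nats")
print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3g}")
```

The KL divergence is asymmetric, so `entropy(p_ref, p_test)` would generally give a different number.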

Dave · Accepted Answer · 2019-08-15 15:49:16Z

I still want to hear your reasons for doubting KS as an appropriate method, but now that I've looked at your graph more, I say that KS does not apply. Your data are discrete, and KS does not apply to data drawn from discrete distributions. However, you could use a chi-squared test! I wrote about this yesterday. Instead of checking whether frequencies match the frequencies expected from a fair die, you'd be checking whether the TEST frequencies match the REF frequencies.
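
One way to read the chi-squared suggestion above is as a test of homogeneity on a 2 x k table of counts (one row per dataset, one column per recorded value), which treats both TEST and REF as samples. The sketch below uses SciPy's `chi2_contingency`; the categories and counts are invented for illustration and are not taken from the graph.

```python
import numpy as np
from scipy.stats import chi2_contingency

# ASSUMPTION: invented counts per recorded value; not the asker's data.
categories = [0.016, 0.03, 0.06, 0.125, 0.25, 0.5]
test_counts = np.array([300, 390, 20, 50, 400, 30])   # hypothetical TEST
ref_counts = np.array([120, 40, 35, 60, 150, 45])     # hypothetical REF

# 2 x k contingency table: rows are datasets, columns are categories.
table = np.vstack([test_counts, ref_counts])

# Drop categories that are empty in both datasets; they carry no information
# and would produce zero expected counts.
keep = table.sum(axis=0) > 0
table = table[:, keep]

# Null hypothesis: both rows share the same category proportions.
chi2, p, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.1f}, dof = {dof}, p-value = {p:.3g}")
print("smallest expected count:", expected.min())
```

The chi-squared approximation is usually considered unreliable when expected counts are small, so sparse categories at the low and high ends may need to be pooled before testing.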

Just looking at the graph, the answer is that the distributions are different. There are almost 400 blue observations (seems like it's about 30% of the blue observations) at 0.08, yet red does not get even one instance of 0.08.

For a discussion of KS on discrete distributions: Is Kolmogorov-Smirnov test valid with discrete distributions?

My description of the chi-squared test (turns out to be unrelated to skewness): How to identify if my data set is skewed or not?

Thanks! The data measured is biological, so in theory it is continuous: any sample can have a measurement between 0 and 512 (or, theoretically, even more), but the method used for the measurements only records values based on the exponential function. Additionally, we know that, due to biological (genetic) constraints, more samples show values around 0.016, 0.25 and 32. I will use a chi-square test; it seems more appropriate.
– Seb Matamoros, Commented Aug 19, 2019 at 9:44
Do you have ties in your data? The graph makes it look like you do, but your comment makes it sound like you don’t.
– Dave, Commented Aug 19, 2019 at 10:35
Yes, the test set has several "0" counts for the low and high values, which means I could not use the KS test on the whole range of measured values.
– Seb Matamoros, Commented Aug 20, 2019 at 9:54
Then it sounds like most of your distribution is continuous, so chi-squared would not apply. I’m curious... what process is giving you data, microarray?
– Dave, Commented Aug 20, 2019 at 10:03
The data represented is the minimum inhibitory concentration (MIC), a measure of bacterial antibiotic susceptibility, obtained using a method called microbroth dilution. The antibiotic is diluted 2-fold for each measurement point (hence the exponential function). We record how many bacterial strains can resist each concentration. Having looked more into the chi-square test (I don't use it often), I am indeed not sure it is appropriate.
– Seb Matamoros, Commented Aug 20, 2019 at 10:17
