Unsure what stats test to use for my data: ANOVA or chi-square

Question

I have a matrix of 17 features by 9 groups with the values being the counts for each feature appearing in that group, eg:

	Grp1	Grp2	Grp3	Grp4	Grp5	Grp6	Grp7	Grp8	Grp9
F1	2	9	0	1	0	0	1	2	0
F2	3	4	0	0	0	0	0	3	1
F3	4	3	0	0	0	0	0	8	1
F4	1	1	8	0	0	0	0	0	0
F5	0	1	0	0	0	0	3	9	1
F6	4	7	1	0	0	0	0	5	0
F7	1	7	0	1	0	0	0	8	2
F8	5	10	1	0	0	0	0	8	1
F9	1	2	0	0	0	0	0	4	2
F10	9	5	0	0	1	0	0	6	5
F11	3	6	0	0	1	0	0	8	3
F12	10	16	0	1	1	0	1	13	3
F13	10	25	1	1	3	0	0	11	10
F14	1	14	0	1	2	0	2	8	3
F15	11	13	0	0	1	1	0	12	1
F16	8	3	0	0	1	0	0	6	3
F17	5	10	0	0	0	0	0	5	4

I want to test whether each feature is significantly associated with 1 or more of the groups. For example, F1 might be associated with Grp1 and Grp2, while F4 might be associated with Grp3.

Would appreciate any advice on what test to use and how to properly format the analysis. If you're able to provide example R code that would be amazing and very appreciated.

Edit:

I wrote this to begin to analyse the data.

fisher.groups <- function(mat){
  res.mat <- matrix(data = NA, nrow = nrow(mat), ncol = ncol(mat),
                    dimnames = list(rownames(mat), colnames(mat)))
  for (i in rownames(mat)) {
    for (j in colnames(mat)) {
      idx <- which(rownames(mat) == i)
      jdx <- which(colnames(mat) == j)
      tmp.mat <- matrix(c(mat[idx,jdx], sum(mat[ idx,-jdx]),
                          sum(mat[-idx, jdx]), sum(mat[-idx,-jdx])),
                        nrow = 2, ncol = 2, byrow = TRUE)
      res.mat[i,j] <- fisher.test(tmp.mat, alternative = "greater")$p.value
    }
  }
  return(res.mat)
}

Does this make sense?

What are these groups and features? Are you thinking that the features chosen may depend on the group? — Peter Flom
– Peter Flom, Commented Dec 12, 2024 at 15:52
The features are cells labelled with a specific barcode, and the groups are programmes of expression. So a population of cells labelled with a specific barcode may be 50% programme 1 and 50% programme 2. Or another group might be 100% programme 3. — user22423300
– user22423300, Commented Dec 12, 2024 at 16:01
"F1 might be associated with Grp1 and Grp2": sorry, I don't get it. Why does F1 associate with Grp1 (2 counts) and Grp2 (9 counts), but not with, say, Grp8 (2 counts)? In general, could you elaborate on what you mean by "associate"? Are you looking for groups with high counts? If so, high relative to what? — dariober
– dariober, Commented Dec 12, 2024 at 16:01
@dariober Yes, you're right, that was a bad example on my part. I was just trying to suggest a situation where it could be '1 or more'. — user22423300
– user22423300, Commented Dec 12, 2024 at 16:03
To clarify: is this a cell biology study, with the sum over all columns for each "feature" (row) the number of cells with that barcode, and the "Group" representing some gene-expression or similar program? Can a single cell belong to more than one Group? — EdM
– EdM, Commented Dec 12, 2024 at 16:11

Moudhaffer Bouallegui · Accepted Answer · 2024-12-12 15:42:22Z


Consider Fisher’s Exact Test, which is particularly well-suited to tables with small counts. Note that it becomes computationally intensive for larger matrices.


There is a glaring problem with your first recommendation (a chi-squared test): many of the expected counts are sufficiently small as to call into question the appropriateness of the chi-squared test. The standardized residuals will be particularly misleading for the very small expected values (such as all the cells for Grp3, Grp4, Grp6, and Grp7).
– whuber ♦, Commented Dec 12, 2024 at 15:33
You are indeed correct, editing my answer appropriately. Thank you for pointing out the error in my answer!
– Moudhaffer Bouallegui, Commented Dec 12, 2024 at 15:39
Both chi-square and Fisher's give one statistic for the whole table. The OP asks about association with a particular group.
– Peter Flom, Commented Dec 12, 2024 at 15:51
Welcome to CV, Moudhaffer. We appreciate your efforts to improve on your post.
– whuber ♦, Commented Dec 12, 2024 at 16:15


jginestet · Accepted Answer · 2024-12-12 19:31:54Z

This is a bit of a conundrum; I do not think I can provide an "answer", but I will try to provide some insights.

You have categorical data, with counts, following a binomial/multinomial distribution (how many times is Fx present). That limits the types of analysis/tests we could use. Your counts are also not very high (in fact, many are 0), which limits things further. In addition, it means that your tests will not have very high power (e.g. Grp6 contains a single count, from one feature).

One could think of using multinomial regression, regressing the feature counts on the groups as predictors, but you only have 1 predictor per cell (a cell can belong to 1 and only 1 group), and you only have 1 observation of the multinomial distribution for each group... So I cannot think of ways other than contingency tables (maybe others can?).

The first, obvious (?) test is a Fisher exact test on the full 17x9 contingency table (your code snippet applies the same test to 2x2 sub-tables). If it is significant, the counts of features differ between the groups (i.e. the counts do not all come from the same distribution). But it will not tell you which counts differ between which group/feature. If it is not significant, then the counts could all come from the same distribution, and there is nothing to find here.
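A minimal sketch of that global test on the table from the question. The object name `counts` is my own. Exact enumeration of a 17x9 table may exceed the default workspace of `fisher.test()`, so this assumes a Monte Carlo p-value (`simulate.p.value = TRUE`) is acceptable; the seed and `B` are arbitrary choices.

```r
# Count table copied from the question (17 features x 9 groups).
counts <- matrix(
  c( 2,  9, 0, 1, 0, 0, 1,  2,  0,
     3,  4, 0, 0, 0, 0, 0,  3,  1,
     4,  3, 0, 0, 0, 0, 0,  8,  1,
     1,  1, 8, 0, 0, 0, 0,  0,  0,
     0,  1, 0, 0, 0, 0, 3,  9,  1,
     4,  7, 1, 0, 0, 0, 0,  5,  0,
     1,  7, 0, 1, 0, 0, 0,  8,  2,
     5, 10, 1, 0, 0, 0, 0,  8,  1,
     1,  2, 0, 0, 0, 0, 0,  4,  2,
     9,  5, 0, 0, 1, 0, 0,  6,  5,
     3,  6, 0, 0, 1, 0, 0,  8,  3,
    10, 16, 0, 1, 1, 0, 1, 13,  3,
    10, 25, 1, 1, 3, 0, 0, 11, 10,
     1, 14, 0, 1, 2, 0, 2,  8,  3,
    11, 13, 0, 0, 1, 1, 0, 12,  1,
     8,  3, 0, 0, 1, 0, 0,  6,  3,
     5, 10, 0, 0, 0, 0, 0,  5,  4),
  nrow = 17, byrow = TRUE,
  dimnames = list(paste0("F", 1:17), paste0("Grp", 1:9))
)

# Global test of independence between features and groups.
# The p-value is simulated (assumption: this is acceptable here),
# so it varies slightly with the seed and with B.
set.seed(1)
global <- fisher.test(counts, simulate.p.value = TRUE, B = 10000)
global$p.value
```

With a simulated p-value the smallest attainable value is 1/(B + 1), so `B` bounds how small a p-value you can report.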

Assuming the test is significant (most likely it will be), you could now look at pairs of features, across all groups; that will tell you which pairs of features are distributed significantly differently across the groups (but still not which specific groups are responsible for the significant result), or not (then these 2 features are "not that differently distributed"). That would require 136 tests (136 2x9 contingency tables). So you will have a (non-trivial) multiple comparison issue, and will need a multiple comparison correction (MCC) (Holm-Sidak?), which will further reduce your power.

You could also compare all pairs of groups across all features, to see which groups are different (or not) across all features. Again, that will not tell you which feature(s) is (are) responsible for the difference. That would require 36 tests (36 17x2 tables), facing an MCC issue again.
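As a hedged illustration of the pairwise group comparison, here is a sketch on a small subset of the question's table (features F1 to F5, groups Grp1, Grp2, Grp3, Grp8), so that it runs quickly. The names `sub`, `pairs`, `p.raw` and `p.holm` are my own. Base R's `p.adjust()` does not offer Holm-Sidak, so plain Holm is substituted here.

```r
# Subset of the question's table: 5 features x 4 groups.
sub <- matrix(c(2, 9, 0, 2,
                3, 4, 0, 3,
                4, 3, 0, 8,
                1, 1, 8, 0,
                0, 1, 0, 9),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("F", 1:5),
                              c("Grp1", "Grp2", "Grp3", "Grp8")))

# All pairs of groups: choose(4, 2) = 6 here, choose(9, 2) = 36 on the full table.
pairs <- combn(colnames(sub), 2)

# One Fisher exact test per pair, on the (features x 2 groups) sub-table.
p.raw <- apply(pairs, 2, function(g) {
  tab <- sub[, g]
  tab <- tab[rowSums(tab) > 0, , drop = FALSE]  # drop features absent from both groups
  fisher.test(tab)$p.value
})
names(p.raw) <- apply(pairs, 2, paste, collapse = " vs ")

# Multiple comparison correction (Holm, substituted for Holm-Sidak).
p.holm <- p.adjust(p.raw, method = "holm")
round(cbind(raw = p.raw, holm = p.holm), 4)
```

The same loop over pairs of rows instead of columns gives the per-feature comparison described earlier.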

Last, you could of course run all possible 2x2 tests (testing all pairs of features against all pairs of groups); that will (finally!) tell you which features behave differently between which groups. But... that would require 4896 tests; it would not only be tedious, but the MCC would eliminate any power you may have had.
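The test counts quoted above follow from simple combinatorics, which can be checked directly. The Bonferroni threshold at the end is only my illustration of how severe the correction becomes; the 0.05 level is an assumption.

```r
n.feature.pairs <- choose(17, 2)   # pairs of features: 136
n.group.pairs   <- choose(9, 2)    # pairs of groups: 36
n.2x2 <- n.feature.pairs * n.group.pairs   # all 2x2 tables: 4896

# Per-test threshold under a Bonferroni correction at an overall 0.05 level
# (illustrative assumption): roughly 1e-5.
alpha.bonf <- 0.05 / n.2x2
c(n.feature.pairs, n.group.pairs, n.2x2, alpha.bonf)
```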

So you need to reduce the number of comparisons. Is there a group which is a control group, or a baseline? Then you can limit comparisons to this group (the 36 comparisons become 8). Same for the features: is there a special feature to compare the others to? That reduces the 136 tests to 16... Or do you have a particular hypothesis, which concerns only a few features across a few groups (you collected the additional data because you might as well, but your hypothesis is more focused, as opposed to being a "fishing expedition")? Any such reduction will greatly reduce the MCC issue, but depends on the specifics of the experiment/research.

Based on the earlier "per feature" or "per group" tests, you may decide to combine features/groups. If 2 features are not that different between groups, then you could aggregate them (combine their counts) to create a feature $F'$ which is "Fx or Fy". Same for groups (combine groups which do not show a difference). That would again reduce the MCC issue and increase your power (because of the higher counts), but may or may not make sense in your context.
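Mechanically, combining features is just summing rows. A small sketch using three rows of the question's table; merging F2 with F3 is purely hypothetical here (nothing in the thread shows that they are similar), and the names `sub`, `grouping` and `merged` are mine.

```r
# Three features from the question's table, all 9 groups.
sub <- matrix(c(3, 4, 0, 0, 0, 0, 0, 3, 1,
                4, 3, 0, 0, 0, 0, 0, 8, 1,
                1, 1, 8, 0, 0, 0, 0, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("F2", "F3", "F4"), paste0("Grp", 1:9)))

# Hypothetical decision: treat F2 and F3 as one combined feature.
grouping <- c("F2_or_F3", "F2_or_F3", "F4")

# rowsum() adds up the rows that share a label.
merged <- rowsum(sub, group = grouping)
merged
```

Groups can be merged the same way by applying `rowsum()` to `t(sub)` and transposing back.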

Thank you for your suggestion. I have added a more detailed response in an additional answer. — user22423300
– user22423300, Commented Dec 12, 2024 at 23:09
@user22423300 Glad this worked. Feel free to upvote if the answer helped you. It will then also help others in a similar situation. — jginestet
– jginestet, Commented Dec 13, 2024 at 2:31
Thanks, I don't have enough forum reputation to upvote but I did press it! — user22423300
– user22423300, Commented Dec 13, 2024 at 11:18

user22423300 · Accepted Answer · 2024-12-12 23:19:53Z

Thanks to all who responded.

What I ended up doing was a combination of using the code above and additional steps along the lines of what @jginestet suggested.

Firstly, I reduced the number of groups. At least two of the groups appear to be closely associated with cell cycle / QC, so I removed them from the comparison: if a feature is significantly associated with (enriched in) one of those groups, I am less interested in that feature overall.

Secondly, I reduced the number of features. I was originally looking at all features that vary across different sets of sample groups, but have narrowed my comparison to specific groups (in this case, different timepoints).

Finally, the code that I shared sets up a series of contingency tables such that for every feature and every group the function cycles through variables and creates tables:

	Grp N	Other Grps
F n	...	...
Other F	...	...

I then apply a multiple testing correction using p.adjust(). This seems to give sensible results: the significant p-values correspond to groups I had assumed were significant, but not all the group/feature combinations I had assumed were significant actually ended up being statistically significant, owing to the sizes of the groups and the feature counts.
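For anyone following along, a compact sketch of that last step on a small subset of the original table (F1 to F5; Grp1, Grp2, Grp3, Grp8). The helper `one.vs.rest()` is a shortened stand-in for the function in the question, and the Holm method is an assumption (it is the `p.adjust()` default; the method actually used is not stated above).

```r
# Subset of the question's table: 5 features x 4 groups.
sub <- matrix(c(2, 9, 0, 2,
                3, 4, 0, 3,
                4, 3, 0, 8,
                1, 1, 8, 0,
                0, 1, 0, 9),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("F", 1:5),
                              c("Grp1", "Grp2", "Grp3", "Grp8")))

# One-sided Fisher test per cell: feature i vs other features,
# group j vs other groups (the 2x2 layout shown above).
one.vs.rest <- function(mat) {
  res <- matrix(NA_real_, nrow(mat), ncol(mat), dimnames = dimnames(mat))
  for (i in seq_len(nrow(mat))) {
    for (j in seq_len(ncol(mat))) {
      tab <- matrix(c(mat[i, j],       sum(mat[i, -j]),
                      sum(mat[-i, j]), sum(mat[-i, -j])),
                    nrow = 2, byrow = TRUE)
      res[i, j] <- fisher.test(tab, alternative = "greater")$p.value
    }
  }
  res
}

p.raw <- one.vs.rest(sub)

# p.adjust() returns a plain vector, so assign into a copy of the matrix
# to keep the feature/group labels. All cells are corrected jointly.
p.adj <- p.raw
p.adj[] <- p.adjust(p.raw, method = "holm")

# Feature/group combinations that stay below 0.05 after correction.
which(p.adj < 0.05, arr.ind = TRUE)
```

Correcting all cells jointly is the conservative choice; correcting within each feature or within each group would be a different, weaker guarantee.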

If anyone has any other comments, I will gladly welcome additional feedback.
