Summarizing factor counts by other variables in R

Question

I'm attempting to summarize the number of observations of each level of a factor variable by other variables in the same dataset. We are running a clinical training study where patients and controls describe pictures, and I'm conducting an analysis of the types of errors patients made. I want to see whether the specific training conditions and session types (baseline, training, post-testing, etc.) affect what errors are produced. The data look as follows:

| ParticipantID | Group | SessionType | TrainingCondition | ErrorType |
|---|---|---|---|---|
| p1 | Control | Baseline | Alternating | GE |
| p1 | Control | Baseline | Alternating | RR |
| p1 | Control | Post-Test | Alternating | NT |
| ... | ... | ... | ... | ... |
| p2 | Patient | Baseline | Single | GE |

There are three levels of the SessionType variable (Baseline, Immediate Post, 1 Week Post), two of the TrainingCondition variable (Alternating & Single), and 5 of the ErrorType variable (GE, NS, LE, NT, RR). What I need is a summary of how often each level of ErrorType occurred by Group, SessionType, and TrainingCondition. Ideally, I'd get something like this:

| Group | SessionType | TrainingCondition | ErrorType | Count |
|---|---|---|---|---|
| Control | Baseline | Alternating | GE | 5 |
| Control | Post-test | Alternating | GE | 10 |
| ... | ... | ... | ... | ... |
| Patient | Baseline | Single | NT | 7 |
| ... | ... | ... | ... | ... |

I've tried several possible solutions, but none have resulted in what I want. The closest is this code using the tidyverse:

error.sum <- df %>% group_by(trainingCondition, Group, SessionType, ErrorType) %>% summarise(Count = count(df, ErrorType)$n)

This produced something close, but not quite right: the same set of counts is repeated for every combination of the grouping variables.

| trainingCondition | Group | SessionType | ErrorType | Count |
|---|---|---|---|---|
| Alternating | Control | Baseline | GE | 596 |
| Alternating | Control | Baseline | GE | 46 |
| Alternating | Control | Baseline | GE | 79 |
| Alternating | Control | Baseline | GE | 187 |
| Alternating | Control | Baseline | GE | 500 |
| Alternating | Control | Baseline | GE | 1853 |
| Alternating | Control | Baseline | GE | 37 |
| Alternating | Control | Baseline | NT | 596 |
| Alternating | Control | Baseline | NT | 46 |
| Alternating | Control | Baseline | NT | 79 |
| Alternating | Control | Baseline | NT | 187 |
| Alternating | Control | Baseline | NT | 500 |
| Alternating | Control | Baseline | NT | 1853 |
| Alternating | Control | Baseline | NT | 37 |

I suspect count() counted the overall instances of each error type rather than counts of ErrorType by the other variables? I'm not sure. Any help would be greatly appreciated!

Try df %>% count(trainingCondition, Group, SessionType, ErrorType, name = "Count").
– stefan, Commented Nov 27, 2023 at 18:34
Inside summarise, you can use n() for the number of rows within a group, so df %>% group_by(trainingCondition, Group, SessionType, ErrorType) %>% summarise(Count = n()). However, this is such a common operation that count is intended as a helper function to make it easier, as stefan illustrates. Also, if you take a look at the ?count help page, the first line says that "df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())".
– Gregor Thomas, Commented Nov 27, 2023 at 18:53
Thank you both so much for your replies. This solution works brilliantly! I had no idea that count() had more extensive functionality like this.
– Artabanos, Commented Nov 27, 2023 at 19:35
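
For anyone landing here later, the two suggestions from the comments can be sketched on a small made-up data frame. The rows below are hypothetical and only mimic the column names used in the question's code (note the lowercase `trainingCondition`, taken from that code rather than from the data table). The likely reason the original attempt repeated the counts is that `count(df, ErrorType)$n` is computed on the whole, ungrouped `df`, so every group receives the same vector of overall totals instead of its own row count.

```r
library(dplyr)

# Hypothetical toy data; column names follow the code in the question
df <- data.frame(
  ParticipantID     = c("p1", "p1", "p1", "p1", "p2"),
  Group             = c("Control", "Control", "Control", "Control", "Patient"),
  SessionType       = c("Baseline", "Baseline", "Baseline", "Post-Test", "Baseline"),
  trainingCondition = c("Alternating", "Alternating", "Alternating", "Alternating", "Single"),
  ErrorType         = c("GE", "GE", "RR", "NT", "GE")
)

# Option 1 (stefan): count() groups by the listed columns and tallies rows
error.sum <- df %>%
  count(trainingCondition, Group, SessionType, ErrorType, name = "Count")

# Option 2 (Gregor Thomas): the explicit group_by() + summarise(n()) form
error.sum2 <- df %>%
  group_by(trainingCondition, Group, SessionType, ErrorType) %>%
  summarise(Count = n(), .groups = "drop")

print(error.sum)
```

Both objects should hold one row per observed combination of the four variables, with `Count` giving the number of rows in that combination. The `.groups = "drop"` argument in the second form just returns an ungrouped result.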
