I'm attempting to summarize the number of observations of each level of a factor variable by other variables in the same dataset. We are running a clinical training study where patients and controls describe pictures, and I'm conducting an analysis of the types of errors patients made. I want to see whether the specific training conditions and session types (baseline, training, post-testing, etc.) affect what errors are produced. The data look as follows:

| ParticipantID | Group | SessionType | TrainingCondition | ErrorType |
|---|---|---|---|---|
| p1 | Control | Baseline | Alternating | GE |
| p1 | Control | Baseline | Alternating | RR |
| p1 | Control | Post-Test | Alternating | NT |
| ... | | | | |
| p2 | Patient | Baseline | Single | GE |

There are three levels of the SessionType variable (Baseline, Immediate Post, 1 Week Post), two of the TrainingCondition variable (Alternating & Single), and 5 of the ErrorType variable (GE, NS, LE, NT, RR). What I need is a summary of how often each level of ErrorType occurred by Group, SessionType, and TrainingCondition. Ideally, I'd get something like this:

| Group | SessionType | TrainingCondition | ErrorType | Count |
|---|---|---|---|---|
| Control | Baseline | Alternating | GE | 5 |
| Control | Post-test | Alternating | GE | 10 |
| ... | | | | |
| Patient | Baseline | Single | NT | 7 |

&c. I've tried several possible solutions, but none have resulted in what I want. The closest is this code using the tidyverse:

    error.sum <- df %>%
      group_by(trainingCondition, Group, SessionType, ErrorType) %>%
      summarise(Count = count(df, ErrorType)$n)

This produced something close, but not quite there. All counts have been duplicated in the output:

    Alternating | Control | Baseline | GE | 596
    Alternating | Control | Baseline | GE | 46
    Alternating | Control | Baseline | GE | 79
    Alternating | Control | Baseline | GE | 187
    Alternating | Control | Baseline | GE | 500
    Alternating | Control | Baseline | GE | 1853
    Alternating | Control | Baseline | GE | 37
    Alternating | Control | Baseline | NT | 596
    Alternating | Control | Baseline | NT | 46
    Alternating | Control | Baseline | NT | 79
    Alternating | Control | Baseline | NT | 187
    Alternating | Control | Baseline | NT | 500
    Alternating | Control | Baseline | NT | 1853
    Alternating | Control | Baseline | NT | 37

I suspect `count()` counted the overall instances of each error type rather than counts of ErrorType by the other variables? I'm not sure. Any help would be greatly appreciated!

    df %>% count(trainingCondition, Group, SessionType, ErrorType, name = "Count")

Use `n()` for the number of rows within a group, so:

    df %>%
      group_by(trainingCondition, Group, SessionType, ErrorType) %>%
      summarise(Count = n())

However, this is such a common operation that `count()` is intended as a helper function to make it easier, as stefan illustrates. Also, if you take a look at the `?count` help page, the first line says that "`df %>% count(a, b)` is roughly equivalent to `df %>% group_by(a, b) %>% summarise(n = n())`", with `count()` including more extensive functionality like this.
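
To make the difference concrete, here is a minimal sketch on made-up data. The five rows below are hypothetical and only reuse the column names from the question; `.groups = "drop"` is my own addition (it needs dplyr 1.0.0 or later) so that the long form returns an ungrouped result like `count()` does. The last line computes what the original attempt evaluated inside every group: because `count(df, ErrorType)` refers to the whole of `df`, it ignores the grouping and returns the same vector of overall totals each time.

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the question.
df <- tibble(
  trainingCondition = c("Alternating", "Alternating", "Alternating", "Single", "Single"),
  Group             = c("Control", "Control", "Control", "Patient", "Patient"),
  SessionType       = c("Baseline", "Baseline", "Baseline", "Baseline", "Baseline"),
  ErrorType         = c("GE", "GE", "RR", "GE", "NT")
)

# Shortcut: one row per observed combination, row count stored in `Count`.
by_count <- df %>%
  count(trainingCondition, Group, SessionType, ErrorType, name = "Count")

# Long form: n() counts the rows of the current group.
by_summarise <- df %>%
  group_by(trainingCondition, Group, SessionType, ErrorType) %>%
  summarise(Count = n(), .groups = "drop")

# What the original attempt computed in each group: totals per ErrorType
# over the whole data frame, regardless of the grouping variables.
overall <- count(df, ErrorType)$n

print(by_count)
print(by_summarise)
print(overall)
```

With this toy data both `by_count` and `by_summarise` have four rows with `Count` equal to 2, 1, 1, 1, while `overall` is 3, 1, 1 (GE, NT, RR). That full vector of overall totals is what got repeated under every group in the output shown in the question.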