Summarize and count data in R with dplyr

Question

Goal: Summarize/count responses in the same row as the stimulus that occurred, using dplyr.

Background: I got some excellent help in another topic: Loop through dataframe in R and measure time difference between two values

Now I am working with the same (or a similar) dataset, and my goal is to count users' responses to perceived stimuli in the same row where the stimulus occurred. The dataset looks like this:

structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), StimuliA = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), StimuliB = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), R2 = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L ), R3 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), R4 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), R5 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), R6 = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), R7 = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("User", "StimuliA", "StimuliB", "R2", "R3", "R4", "R5", "R6", "R7"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), spec = structure(list( cols = structure(list(User = structure(list(), class = c("collector_integer", "collector")), StimuliA = structure(list(), class = c("collector_integer", "collector")), StimuliB = structure(list(), class = c("collector_integer", "collector")), R2 = structure(list(), class = c("collector_integer", "collector")), R3 = structure(list(), class = c("collector_integer", "collector")), R4 = structure(list(), class = c("collector_integer", "collector")), R5 = structure(list(), class = c("collector_integer", "collector")), R6 = structure(list(), class = c("collector_integer", "collector")), R7 = structure(list(), class = c("collector_integer", "collector"))), .Names = c("User", "StimuliA", "StimuliB", "R2", "R3", "R4", "R5", "R6", "R7")), default = structure(list(), class = c("collector_guess", "collector"))), .Names = c("cols", "default"), class = "col_spec"))

Desired output: a summarized table with all responses aggregated in the same row as the stimulus that preceded them:

U StimuliA StimuliB R2 R3 R4 R5 R6 R7
1        1        0  0  0  0  0  0  1
1        1        0  1  1  0  0  1  0
1        0        1  1  2  0  0  1  0
1        0        1  0  0  0  0  0  0
2        1        0  3  0  0  0  0  0
2        0        1  1  0  0  0  2  0

In the sample, row 1 records a stimulus for A and row 2 a 1 for R7. The desired result therefore has a row with a 1 in StimuliA and a 1 in R7. A new row then starts, because row 3 has a new 1 for StimuliA.

In the end, every stimulus gets one row that sums the responses (R2-R7) that followed it. The value of the stimulus column (A or B) stays 1.

Question: I feel I can achieve this with the dplyr package, but my previous attempts have not produced much useful output. How would I structure the syntax with the dplyr commands, or should I look for a solution in another direction? Would I mutate the existing dataframe or create a new one?

Thanks for all the inputs and help!

In base R, you could do aggregate(. ~ User + StimuliA + StimuliB, data=dat, sum). In dplyr syntax, maybe dat %>% group_by(., User, StimuliA, StimuliB) %>% summarize_all(sum).
– lmo, Commented Jul 17, 2017 at 14:55
This question isn't very clear but, as I understand it, there is one row with a stimulus, i.e. a 1 in either StimuliA or StimuliB, followed by several responses to that stimulus where StimuliA and StimuliB are 0 but one of the other variables is equal to 1. The question is, I think, asking how to aggregate the n rows following a stimulus into the row with the stimulus.
– Eumenedies, Commented Jul 17, 2017 at 14:58
df %>% group_by(User) %>% mutate(Sta = cumsum(StimuliA), Stb = cumsum(StimuliB)) %>% group_by(User, Sta, Stb) %>% summarise(StA = sum(StimuliA), StB = sum(StimuliB), R2 = sum(R2), R3 = sum(R3), R4 = sum(R4), R5 = sum(R5), R6 = sum(R6), R7 = sum(R7)) %>% select(-Sta, -Stb)
– Eumenedies, Commented Jul 17, 2017 at 15:03
@Eumenedies Yes, sorry, I will update the question. Once a stimulus has occurred (a 1 for either Stimuli A or B), I would like to summarize/count all the following responses R2-R7 in the same row.
– svnnf, Commented Jul 18, 2017 at 14:19
@Eumenedies I updated the information. Unfortunately, I don't fully understand your solution. What is the reason for calculating the cumsum for StimuliA?
– svnnf, Commented Jul 18, 2017 at 15:34

lmo · Accepted Answer · 2017-07-18 16:11:55Z

1

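Here is a two-line solution in base R. First, create an ID that is unique to each user-(new)stimulus combination. This is accomplished with paste and cumsum: each cumsum counter increases by one whenever a new stimulus of that type appears, so all response rows that follow a stimulus share its ID.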

dat$stims <- with(dat, paste(cumsum(StimuliA), cumsum(StimuliB), sep="_"))
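To illustrate why cumsum works as a block ID here, the following is a small sketch on toy vectors (the values are hypothetical, using the same 0/1 coding as the question). The counter steps up on each stimulus row and stays flat on the response rows that follow it.

```r
# Toy stimulus columns (hypothetical values, same 0/1 coding as the question)
StimuliA <- c(1, 0, 1, 0, 0, 0)
StimuliB <- c(0, 0, 0, 0, 1, 0)

# cumsum() steps up by one on each stimulus row and stays flat on response rows
a_id <- cumsum(StimuliA)   # 1 1 2 2 2 2
b_id <- cumsum(StimuliB)   # 0 0 0 0 1 1

# Pasting both counters gives one label per stimulus block
stims <- paste(a_id, b_id, sep = "_")
stims
# "1_0" "1_0" "2_0" "2_0" "2_1" "2_1"
```

Rows 1-2, 3-4 and 5-6 each end up with their own label, so grouping on that label collects every response with the stimulus that preceded it.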

Then use aggregate to sum the responses for each of the new IDs:

aggregate(. ~ User + stims, data=dat, sum)

  User stims StimuliA StimuliB R2 R3 R4 R5 R6 R7
1    1   1_0        1        0  0  0  0  0  0  1
2    1   2_0        1        0  1  1  0  0  1  0
3    1   2_1        0        1  1  2  0  0  1  0
4    1   2_2        0        1  0  0  0  0  0  0
5    2   3_2        1        0  3  0  0  0  0  0
6    2   3_3        0        1  1  0  0  0  2  0
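Since the question asked about dplyr, the same idea might be sketched there as follows. This is not part of the original answer. It assumes dplyr 1.0 or later (for across() and the .groups argument), and the helper column name block is made up for illustration. It uses a single counter over both stimulus columns, which gives the same blocks as the two pasted counters as long as every stimulus row has exactly one 1.

```r
library(dplyr)

# Sample data copied from the question (the readr "spec" attribute is omitted)
dat <- data.frame(
  User     = rep(1:2, times = c(12, 8)),
  StimuliA = c(1,0,1,0,0,0,0,0,0,0,0,0, 1,0,0,0,0,0,0,0),
  StimuliB = c(0,0,0,0,0,0,1,0,0,0,0,1, 0,0,0,0,1,0,0,0),
  R2 = c(0,0,0,0,0,1,0,0,0,0,1,0, 0,1,1,1,0,1,0,0),
  R3 = c(0,0,0,0,1,0,0,0,1,1,0,0, 0,0,0,0,0,0,0,0),
  R4 = 0,
  R5 = 0,
  R6 = c(0,0,0,1,0,0,0,1,0,0,0,0, 0,0,0,0,0,0,1,1),
  R7 = c(0,1,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0)
)

res <- dat %>%
  # "block" (hypothetical name) increases by one on every stimulus row
  mutate(block = cumsum(StimuliA + StimuliB)) %>%
  group_by(User, block) %>%
  # Sum the stimulus and response columns within each block
  summarise(across(c(StimuliA, StimuliB, R2:R7), sum), .groups = "drop") %>%
  select(-block)

res
```

On the sample data this should give the same six rows as the aggregate() call, without the stims column. It creates a new summarized data frame rather than modifying dat in place.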


3 Comments

svnnf Over a year ago

Follow-up question: In the original dataset I have a column with dates. When I try the method with this column included, R gives me an error because the dates are stored as a factor. How would I have to transform the values in this column so that it works with the dates as well? All I need is the date of the stimulus in the row where the responses (R2-R7) are aggregated.

lmo Over a year ago

You don't want to work with dates as factors. Transform the date to a Date variable using as.Date (many posts on this on SO). One method then would be to separately aggregate the date variable by User and stims similar to above, taking the min rather than the sum. Then merge the two resulting data.frames. If this does not make sense, it might be worth asking a new question that links to this question, adding the additional problem of the date variable. Also include an example dataset that includes this variable.
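A rough sketch of that suggestion, under stated assumptions: the column name Date, its "%Y-%m-%d" format, and the miniature data below are all hypothetical, since the real column is not shown here. Instead of aggregating the dates with min, this variant picks the first row of each block with duplicated(), which is the stimulus row, and then merges as described.

```r
# Hypothetical miniature of the data: a Date column stored as a factor
dat <- data.frame(
  User     = c(1, 1, 1, 1),
  StimuliA = c(1, 0, 1, 0),
  StimuliB = c(0, 0, 0, 0),
  R2       = c(0, 1, 0, 1),
  Date     = factor(c("2017-07-01", "2017-07-02", "2017-07-03", "2017-07-04"))
)

# Convert the factor to Date (the format string is an assumption)
dat$Date <- as.Date(as.character(dat$Date), format = "%Y-%m-%d")

# Block ID as in the answer
dat$stims <- with(dat, paste(cumsum(StimuliA), cumsum(StimuliB), sep = "_"))

# Sum only the numeric columns, leaving Date out of the formula
sums <- aggregate(cbind(StimuliA, StimuliB, R2) ~ User + stims, data = dat, sum)

# The first row of each block is the stimulus row, so keep its Date
first_rows <- dat[!duplicated(dat[c("User", "stims")]), c("User", "stims", "Date")]

# Combine the sums with the stimulus dates
res <- merge(sums, first_rows, by = c("User", "stims"))
res
```

Taking the first row relies on the rows being in chronological order within each block. If they are not, the minimum date per block would be the safer choice.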

svnnf Over a year ago

I posted a new question with the sample here: stackoverflow.com/questions/45322102/…
