0

I have a data set as I've shown below:

df <- tribble( ~id, ~price, ~type, ~number_of_book, "1", 10, "X", 3, "1", 2, "X", 1, "1", 5, "Y", 1, "2", 7, "X", 4, "2", 6, "X", 1, "2", 6, "Y", 2, "3", 2, "X", 4, "3", 8, "X", 2, "3", 1, "Y", 4, "3", 9, "Y", 5, ) 

Now, I want to answer this question: for each id and for each selected price group, what percentage of books is X and what percentage is Y? In other word, what is the distribution of the type of books for each id and price group?

To do this, first I need to have this data set as far as I visualize it in my mind:

agg_df <- tribble( ~type, ~id, ~less_than_two, ~two-five, ~five-six, ~more_than_six, "X", "1", 1, 0, 0, 3, "Y", "1", 0, 1, 0, 0, "X", "2", 0, 0, 1, 4, "Y", "2", 0, 0, 2, 2, "X", "3", 4, 0, 0, 2, "Y", "3", 4, 0, 0, 5, ) 

And then, this will be my desired data set:

desired_df <- tribble( ~type, ~id, ~less_than_two, ~three-five, ~five-six, ~more_than_six, "X", "1", "100%", "0%", "0%", "100%", "Y", "1", "0%", "100%", "0%", "0%", "X", "2", "0%", "0%", "33.3%", "66.6%", "Y", "2", "0%", "0%", "66.6%", "33.3%", "X", "3", "50%", "0%", "0%", "28.5%", "Y", "3", "50%", "0%", "0%", "71.4%", ) 

This desired data set shows me that when id is "3" and the price bin is more than six dollars there are two books in X type, but five books in Y type. So, here is the distribution: X(28.5%) and Y(71.4%).

Note: I had a similar question here, but now it is more complex manipulation that I could not manage to get it: How to manipulate (aggregate) the data in R?

I would appreciate if you could help me. Thanks in advance.

7
  • Sorry, I've just corrected the data. Commented Dec 8, 2019 at 23:48
  • @akrun unfortunately, the code did not give the correct answer. Commented Dec 8, 2019 at 23:52
  • @akrun I've tried both price and the number of book in your code, but it still does not give the result. :/ Commented Dec 8, 2019 at 23:58
  • 1
    Can you check if this helps ? df %>% mutate(price_group = c("less_than_two", "three_five", "five_six", "more_than_six")[findInterval(price, c(2, 5, 6), left.open = TRUE) + 1]) %>% group_by(id, type, price_group) %>% summarise(number_of_book = sum(number_of_book)) %>% group_by(id, price_group) %>% mutate(n = number_of_book/sum(number_of_book) * 100) %>% select(-number_of_book) %>% pivot_wider(names_from = price_group, values_from = n) Commented Dec 9, 2019 at 1:25
  • Note that the findInterval was already mentioned in the comments Commented Dec 9, 2019 at 3:05

3 Answers 3

2

We can divide price into different groups using findInterval, calculate the sum for number_of_book for each id, type and price_group and then calculate it's ratio for each id and price_group. Finally, we get the data in wider format using pivot_wider.

library(dplyr) df %>% mutate(price_group = c("less_than_two", "three_five", "five_six", "more_than_six") [findInterval(price, c(2, 5, 6), left.open = TRUE) + 1]) %>% group_by(id, type, price_group) %>% summarise(number_of_book = sum(number_of_book)) %>% group_by(id, price_group) %>% mutate(n = number_of_book/sum(number_of_book) * 100) %>% select(-number_of_book) %>% tidyr::pivot_wider(names_from = price_group, values_from = n, values_fill = list(n = 0)) # id type less_than_two more_than_six three_five five_six # <chr> <chr> <dbl> <dbl> <dbl> <dbl> #1 1 X 100 100 0 0 #2 1 Y 0 0 100 0 #3 2 X 0 100 0 33.3 #4 2 Y 0 0 0 66.7 #5 3 X 50 28.6 0 0 #6 3 Y 50 71.4 0 0 
Sign up to request clarification or add additional context in comments.

Comments

2

We can create a bin group with cut on the 'price' column, grouped by 'id', 'grp', create the percentage by dividing the 'number_of_book' with sum of 'number_of_book' and reshape into 'wide' format

library(dplyr) library(tidyr) df %>% group_by(id,grp = cut(price, breaks = c(-Inf, 2, 5, 6, Inf), c('less_than_two', 'three-five', 'five-six', 'more_than_six')), add = TRUE) %>% mutate(Perc = 100 *number_of_book/sum(number_of_book)) %>% select(-price, -number_of_book) %>% mutate(rn = row_number()) %>% pivot_wider(names_from = grp, values_from = Perc, values_fill = list(Perc = 0)) %>% select(-rn) # A tibble: 6 x 6 # Groups: id [3] # id type more_than_six less_than_two `three-five` `five-six` # <chr> <chr> <dbl> <dbl> <dbl> <dbl> #1 1 X 100 100 0 0 #2 1 Y 0 0 100 0 #3 2 X 100 0 0 33.3 #4 2 Y 0 0 0 66.7 #5 3 X 28.6 50 0 0 #6 3 Y 71.4 50 0 0 

5 Comments

Ops, in my real data set, it still does not give the accurate result. I don't know why but after I run the code, the price column remained in data.
@zineda If you check the line select(-price, -number_of_book) %>%, it is removing the price column. May be you have included price also as grouping variable?. It is also possible that you loaded some other package with select function. In that case, use specificailly dplyr::select(-price, -number_of_book)
@zineda Please check the str(df) with str(originaldata)
"id" is integer, "price" is numerical, "type" is character, "number_of_book" is integer. Even I add the dplyr:: I am getting this in the console: Adding missing grouping variables: price
@zineda Not clear though as I am not able to reproduce it
1

Maybe not the perfect solution, but an another way is to use case_when to define different categories:

library(tidyverse) df %>% group_by(id, type, price) %>% mutate(Less2 = case_when(price <= 2 ~ cumsum(number_of_book)), Three_Five = case_when(price %in% 3:5 ~ cumsum(number_of_book)), Five_six = case_when(price %in% 5:6 ~ cumsum(number_of_book)), More_six = case_when(price >6 ~ cumsum(number_of_book))) %>% replace(is.na(.),0) %>% ungroup(.) %>% group_by(id, type) %>% summarise_at(vars(Less2:More_six), ~sum(.)) %>% ungroup(.) %>% group_by(id) %>% mutate_at(vars(Less2:More_six), ~ replace_na(./sum(.), 0)) # A tibble: 6 x 6 # Groups: id [3] id type Less2 Three_Five Five_six More_six <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 1 X 100 0 0 100 2 1 Y 0 100 100 0 3 2 X 0 0 33.3 100 4 2 Y 0 0 66.7 0 5 3 X 50 0 0 28.6 6 3 Y 50 0 0 71.4 ``` 

6 Comments

I would use findInterval.. Regarding your last mutate, can it be shortened with mutate_at i.e. mutate_at(vars(Less2:More_six), ~ replace_na(./sum(.), 0))
Thanks for the update with the mutate_at, it makes things much nicer now. I did not know about findInterval`, I will take a look ;)
Similarly, you could also use summarise_at for this line, summarise(Less2 = sum(Less2).....
I've tried to code this, but it takes ages to run my real data set. :/
Thanks @RonakShah for the tip. I'm editing my answer
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.