Dataset collapsing/reducing

Question

I have a dataset ds with 3 columns named A, B, C. Columns A and B have repeated values. How can I obtain another dataset that contains a list of the values in C for each of the unique combinations of A and B?

For example,

ds = Dataset[ {<| "A" -> 2, "B" -> 3, "C" -> 100 |>, <| "A" -> 2, "B" -> 4, "C" -> 200 |>, <| "A" -> 2, "B" -> 3, "C" -> 300 |>}]

I want to get

Dataset[ {<| "A" -> 2, "B" -> 3, "Clist" -> {100, 300} |>, <| "A" -> 2, "B" -> 4, "Clist" -> {200} |>}]

How can I do that?

Slightly related: Reshaping associations, generalization of GroupBy
– Kuba, Commented Jul 26, 2016 at 9:12

WReach · Accepted Answer · 2016-07-26 14:42:32Z

To get a list of C values for each A/B combination:

ds[GroupBy[{#A, #B}&] /* Values
, <| "A" -> Query[First, "A"]
   , "B" -> Query[First, "B"]
   , "CList" -> Query[All, "C"]
   |>
]

This is not as succinct as SQL's GROUP BY operator, but it does allow us to easily perform multiple aggregations if desired:

ds[GroupBy[{#A, #B}&] /* Values
, <| "A" -> Query[First, "A"]
   , "B" -> Query[First, "B"]
   , "CList" -> Query[All, "C"]
   , "CMean" -> Query[Mean, "C"]
   , "CMin" -> Query[Min, "C"]
   , "CMax" -> Query[Max, "C"]
   |>
]
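For readers who want to see the same grouping without the Dataset query syntax, here is a minimal sketch on a plain list of associations. The names data and summarize are hypothetical helpers introduced only for this illustration; they are not part of the answer above.

```wolfram
(* Sample rows from the question, as a plain list of associations *)
data = {
  <|"A" -> 2, "B" -> 3, "C" -> 100|>,
  <|"A" -> 2, "B" -> 4, "C" -> 200|>,
  <|"A" -> 2, "B" -> 3, "C" -> 300|>
};

(* summarize is a hypothetical helper: it builds one output row per group *)
summarize[rows_List] := <|
  "A" -> First[rows]["A"],
  "B" -> First[rows]["B"],
  "CList" -> Lookup[rows, "C"],
  "CMean" -> Mean[Lookup[rows, "C"]]
|>;

(* Group rows by the {A, B} pair, reduce each group, and drop the group keys *)
result = Values[GroupBy[data, {#A, #B} &, summarize]]
(* {<|"A" -> 2, "B" -> 3, "CList" -> {100, 300}, "CMean" -> 200|>,
    <|"A" -> 2, "B" -> 4, "CList" -> {200}, "CMean" -> 200|>} *)
```

Wrapping result in Dataset[...] should give a dataset of the same shape as the query above, restricted to these two aggregations.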

I really think you should write a tutorial on Dataset[] at some point... I keep picking up new stuff from your answers on this topic. :)
– J. M.'s missing motivation, Commented Jul 26, 2016 at 14:44

Kuba · Accepted Answer · 2016-07-26 12:12:13Z


I failed to find a duplicate, so:

by = {"A", "B"}; Values @ GroupBy[ds, Query[by], MapAt[First, List /@ by] @* Merge[Identity]]

or:

ds // GroupBy[Query[{"A", "B"}] -> (#C &)] // KeyValueMap[<|#, "Clist" -> #2|> &]

ds // GroupBy[Query[{"A", "B"}]] // KeyValueMap[<|#, "Clist" -> #2[[;; , "C"]]|> &]
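To make the intermediate structure visible, the sketch below applies the same key -> value form of GroupBy to a plain list of associations. It uses KeyTake in place of Query to build the group key, which is an assumption made for this illustration; data, grouped and result are hypothetical names.

```wolfram
(* Sample rows from the question, as a plain list of associations *)
data = {
  <|"A" -> 2, "B" -> 3, "C" -> 100|>,
  <|"A" -> 2, "B" -> 4, "C" -> 200|>,
  <|"A" -> 2, "B" -> 3, "C" -> 300|>
};

(* Each group key is the sub-association of A and B; each value is the list of C's *)
grouped = GroupBy[data, KeyTake[{"A", "B"}] -> (#C &)]
(* <|<|"A" -> 2, "B" -> 3|> -> {100, 300}, <|"A" -> 2, "B" -> 4|> -> {200}|> *)

(* Fold each collected list back into its key to form one row per group *)
result = KeyValueMap[Append[#1, "Clist" -> #2] &, grouped]
(* {<|"A" -> 2, "B" -> 3, "Clist" -> {100, 300}|>,
    <|"A" -> 2, "B" -> 4, "Clist" -> {200}|>} *)
```

Passing a reducer as the third argument, for example GroupBy[data, KeyTake[{"A", "B"}] -> (#C &), Mean], collapses each list to a single value instead of keeping the whole list.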

edited Jul 26, 2016 at 12:12

answered Jul 26, 2016 at 9:03


Thanks for the accept, but it is always good to hold off for a day or two. Let's not discourage others.
– Kuba, Commented Jul 26, 2016 at 9:07

Awesome. What if I wanted to "reduce" the resulting dataset by computing, say, the mean of the list instead?
– amrods, Commented Jul 26, 2016 at 9:07

@amrods Then take a look at the first topic linked in the comments. It seems to be exactly that. Merge[Mean] is nice in that it automatically reduces the A and B lists to their repeated element. Here we have a more general case, where I have to apply First manually.
– Kuba, Commented Jul 26, 2016 at 9:09

Thanks. I really thought there would be a cleaner way of achieving this, since I think it is a fairly common operation on a dataset.
– amrods, Commented Jul 26, 2016 at 9:21


Edmund · Accepted Answer · 2016-07-26 10:38:01Z

You may use GroupBy and Merge.

ds[GroupBy[{#"A", #"B"} &] /* Values, Merge[Identity] /* Query[{"A" -> First, "B" -> First}]]
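As a rough illustration of what the Merge[Identity] step does, the sketch below applies it by hand to a single group of rows. The names group, merged and collapsed are hypothetical, and MapAt stands in here for the Query step used above.

```wolfram
(* One group of rows sharing the same A and B values *)
group = {
  <|"A" -> 2, "B" -> 3, "C" -> 100|>,
  <|"A" -> 2, "B" -> 3, "C" -> 300|>
};

(* Merge with Identity collects the values of every key into a list *)
merged = Merge[group, Identity]
(* <|"A" -> {2, 2}, "B" -> {3, 3}, "C" -> {100, 300}|> *)

(* A and B repeat within a group, so keep only their first element *)
collapsed = MapAt[First, merged, {{Key["A"]}, {Key["B"]}}]
(* <|"A" -> 2, "B" -> 3, "C" -> {100, 300}|> *)
```

Note that the list of values stays under the key "C" with this approach, rather than under a new key such as "Clist".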

Hope this helps
