18
$\begingroup$

Assume you import data from a Table source of the following format.

    << GeneralUtilities`;

    fields = {"Country", "Region", "BU", "Year", "Date", "Sales"};

    organization = {
       {"Argentina", "LATAM", "Americas"},
       {"SouthAfrica", "Africa", "EAME"},
       {"Brazil", "LATAM", "Americas"},
       {"Japan", "Japan", "APAC"},
       {"Australia", "ASEAN", "APAC"},
       {"Germany", "Europe", "EAME"}};

    SeedRandom[0];
    list = Flatten[
       Table[
        Join[organization[[i]],
         {year, DateObject[{year, month, 1}], RandomInteger[{100, 1000}]}],
        {i, 6}, {year, 2004, 2013}, {month, 1, 6, 5}],
       2];

    sales = Dataset[AssociationThread[fields, #] & /@ list]


I would like to summarize the data at the year level. If working with a database, an SQL command of the following format would allow you to create a dataset that is still flat.

    SELECT sales.Country, sales.Region, sales.BU, sales.Year,
           Sum(sales.Sales) AS SumOfSales
    FROM sales
    GROUP BY sales.Country, sales.Region, sales.BU, sales.Year;


Using

    sales[
     GroupBy@Key["Country"], GroupBy@Key["Region"], GroupBy@Key["BU"],
     GroupBy[Key["Year"]], Total, "Sales"]

creates a multilevel hierarchical data structure, which is not as simple to work with as a flat, table-like dataset.


Is there a way to aggregate a dataset (total, mean, median, etc.) by grouping on several keys of interest while keeping the result flat, the same way the SQL query does?

$\endgroup$
2
  • 1
    $\begingroup$ PatoCriollo: the GeneralUtilities package does not seem to be documented in V. 10. Could you please describe what it is used for and how you found it? $\endgroup$ Commented Aug 9, 2014 at 8:46
  • $\begingroup$ @magma See here. I believe that package was first introduced in that answer by Taliesin. $\endgroup$ Commented Aug 9, 2014 at 12:24

5 Answers 5

14
$\begingroup$

Probably far from ideal, but this works:

    sales[
      GroupBy[{#Country, #Region, #BU, #Year} & -> Key["Sales"]]
    ][Normal, Total
    ][All, Apply[Append]
    ]


(Thanks to WReach for the tip on the unusual but useful linebreak pattern.)


Update

This works too, and preserves the keys. Now if only I could specify Normal to be descending here ...

    sales[
      GroupBy[#, KeyTake[{"Country", "Region", "BU", "Year"}] -> KeyTake["Sales"], Total] &
    ][Normal
    ][All, Apply[Join]]


$\endgroup$
8
  • $\begingroup$ Brilliant! Dataset[AssociationThread[Drop[fields, {5}], #] & /@ sales[ GroupBy[{#Country, #Region, #BU, #Year} & -> Key["Sales"]] ][Normal, Total ][All, Apply[Append]] // Normal] puts the keys back again. $\endgroup$ Commented Aug 9, 2014 at 4:15
  • 1
    $\begingroup$ @PatoCriollo That came out really wordy though, I'm hoping for a shorter solution! It would be good to have something like Merge that doesn't merge each key using the same function. $\endgroup$ Commented Aug 9, 2014 at 4:20
  • $\begingroup$ @Szabolcs I'm using your answer for a similar problem that I have, but I don't really understand the syntax. I was hoping you could help. How are [Normal][All,Apply[Join]] applied to the result from Total? $\endgroup$ Commented Aug 20, 2014 at 20:17
  • 1
    $\begingroup$ @MitchellKaplan This dataset contains an association, the keys of which are also associations. Normal will convert this outer association, which has the form <| <|...|> -> <| "Sales" -> 1 |>, <|...|> -> <| "Sales" -> 2 |>, ... |> to a simple list of rules, { <|...|> -> <| "Sales" -> 1 |>, <|...|> -> <| "Sales" -> 2 |>, ... }. You can see e.g. the first element of this list using ...[Normal][1]. Now what we want to do is join the left-hand-side of -> together with its right-hand-side into a single association. Instead of having <| "a" -> 1 |> -> <| "b" -> 2 |> we want ... $\endgroup$ Commented Aug 20, 2014 at 20:26
  • 1
    $\begingroup$ ... <| "a" -> 1, "b" -> 2 |>. This is what Apply is good for: change the head Rule (i.e. ->) into something else. What we need here is to change it to Join. Apply[Join][...] is equivalent to Apply[Join, ...] so we can use the form Apply[Join]. I hope this clears it up a bit. Regarding this method, while it works, it's probably not efficient and I am not really happy with it. I find it too convoluted. I am hoping for something simpler ... I'm not yet experienced with Dataset, and even though my answer got accepted, I'm not too confident about it. $\endgroup$ Commented Aug 20, 2014 at 20:28
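
To make the explanation in the comments above concrete, here is a minimal sketch on a toy association. The keys "a" and "b", their values, and the names grouped, rules and joined are made up for illustration and are not taken from the sales data. It relies only on the behaviour described above: Normal turns the outer association into a list of rules, and applying Join at level 1 merges the two sides of each rule.

```mathematica
(* Toy stand-in for the grouped result: an association whose keys
   are themselves associations (hypothetical data) *)
grouped = <|
   <|"a" -> 1|> -> <|"b" -> 2|>,
   <|"a" -> 3|> -> <|"b" -> 4|>
   |>;

(* Normal converts only the outer association into a list of rules *)
rules = Normal[grouped]
(* {<|"a" -> 1|> -> <|"b" -> 2|>, <|"a" -> 3|> -> <|"b" -> 4|>} *)

(* Apply at level 1 replaces the head Rule of each element by Join *)
joined = Apply[Join, rules, {1}]
(* {<|"a" -> 1, "b" -> 2|>, <|"a" -> 3, "b" -> 4|>} *)
```

In the answer, [All, Apply[Join]] plays the role of the level-1 Apply: All visits every row, and Apply[Join] is applied to each rule.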
8
$\begingroup$

Here is an alternative:

    Query[Map[Total] /* Normal /* Map[Apply@Append]]@
     sales[GroupBy[{#Country, #Region, #BU, #Year} & -> Key["Sales"]]]


or

    sales[GroupBy[{#Country, #Region, #BU, #Year} & -> Key["Sales"]]][
     Map[Total] /* Normal /* Map[Apply@Append]]
$\endgroup$
7
$\begingroup$

Here is another possibility:

    sales[
      GroupBy[KeyTake[{"Country","Region","BU","Year"}]] /* Normal /* (Association@@@#&)
    , <| "Sales" -> Query[Total, "Sales"] |>
    ]

This approach has the interesting property that it can be "scaled up" to perform multiple aggregations at the same time:

    sales[
      GroupBy[KeyTake[{"Country","Region","BU"}]] /* Normal /* (Association@@@#&)
    , <| "Sales" -> Query[Total, "Sales"]
       , "MinYear" -> Query[Min, "Year"]
       , "MaxYear" -> Query[Max, "Year"]
       |>
    ]
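
For readers wondering how this works, here is a rough sketch of what I understand the query to do, written with plain lists of associations instead of Dataset. The toy rows and the names rows, grouped, aggregated and flat are hypothetical. The assumption is that the Dataset query groups first, applies the aggregating association to each group, and only then runs Normal and Association@@@ on the result.

```mathematica
(* Hypothetical toy rows standing in for the sales data *)
rows = {
   <|"Country" -> "Brazil", "Year" -> 2004, "Sales" -> 100|>,
   <|"Country" -> "Brazil", "Year" -> 2004, "Sales" -> 250|>,
   <|"Country" -> "Japan", "Year" -> 2004, "Sales" -> 300|>
   };

(* Step 1: group rows; each key is an association of the grouping columns *)
grouped = GroupBy[rows, KeyTake[{"Country", "Year"}]];

(* Step 2: aggregate every group into an association of results *)
aggregated = Map[<|"Sales" -> Total[Lookup[#, "Sales"]]|> &, grouped];

(* Step 3: turn the outer association into rules, then merge each
   rule's key association with its aggregated values *)
flat = Association @@@ Normal[aggregated]
(* {<|"Country" -> "Brazil", "Year" -> 2004, "Sales" -> 350|>,
    <|"Country" -> "Japan", "Year" -> 2004, "Sales" -> 300|>} *)
```

Because the grouping columns travel along as the keys of the grouped association, the column names survive without any extra bookkeeping.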


$\endgroup$
1
  • $\begingroup$ This is the best answer. It works great even though I am a bit puzzled about how it works :) $\endgroup$ Commented Mar 30, 2021 at 9:56
5
$\begingroup$

A possibility

    sales[
     GroupBy[KeyTake[{"Country", "Region", "BU", "Year"}] -> KeyDrop["Date"]] /* Values,
     merge[{"Sales" -> Total}, First]
    ]

merge is an operator that lets you specify a merging function for particular keys, plus a default one for all remaining keys:

    merge[r : {__Rule}, def_] :=
     Merge[Identity] /*
      Query[{Query[KeyDrop@Keys@r, def], Query[KeyTake[#], #2] & @@@ r} // Flatten] /*
      Merge[First]

or something along these lines:

    groupBy2D[groupby_, newCols : {__Rule}] :=
     With[{tr = Transpose[#, AllowedHeads -> All] &},
      Query[
       GroupBy[KeyTake[groupby]] /* Values,
       Query[{First, tr /* Query[<|newCols|>]}] /* Merge[First] /* KeyTake[groupby]]
      ]

so that

    sales[
     groupBy2D[
      {"Country", "Region", "BU", "Year"},
      {"SumOfSales" -> (Total@#Sales &)}
      ]
     ]

These are probably not very efficient.

$\endgroup$
1
  • $\begingroup$ groupBy2D[groupby_, newCols : {__Rule}] := With[{tr = Transpose[#, AllowedHeads -> All] &}, Query[GroupBy[KeyTake[groupby]] /* Values, Query[{First, tr /* Query[<|newCols|>]}] /* Merge[First] /* KeyTake[Join[groupby, Keys[newCols]]]] ], just changing last KeyTake from KeyTake[groupby] to KeyTake[Join[groupby, Keys[newCols]]] includes the newCols in the returned Dataset. Very handy construction, thank you @Rojo $\endgroup$ Commented Mar 14, 2015 at 17:52
2
$\begingroup$

It's a bit late to add a comment, but I found that GeneralUtilities` has some operators such as AssociationPairs and AssociationMapThread. I used them to adjust the internal Dataset format. Since a GroupBy leaves a single association in the Dataset whose keys are the grouping keys, you need to process what is essentially a single association and make each k -> v in that association a "row." I used

dsGroupByResult[AssociationPairs] 

to fix the structure. However it loses the column names and the "rows" are now lists.

To add columns back in, I use AssociationMapThread to "add" the columns back in and restructured back into a list of associations. My GroupBys usually output an association of values (e.g. mean, min, max for a numerical leaf column) so I just use ##2 since it already has keys on the values.

dsGroupByResult[AssocationMapThread[<|"theGroupingKeyColumnName"->#1 (*or whatever *), ##2|>]&] 
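
Since the GeneralUtilities` functions above are undocumented, here is a sketch of the same "one row per k -> v" idea using only the documented built-ins KeyValueMap and Join. This is a substitute technique, not what the code above does internally. The toy association and the names grouped, rows and "Country" are assumptions made for illustration.

```mathematica
(* Hypothetical result of a single-key GroupBy with aggregated values *)
grouped = <|
   "Brazil" -> <|"Mean" -> 175, "Max" -> 250|>,
   "Japan" -> <|"Mean" -> 300, "Max" -> 300|>
   |>;

(* KeyValueMap calls the function with each key and value; Join puts
   the grouping key back as a named column in front of the values *)
rows = KeyValueMap[Join[<|"Country" -> #1|>, #2] &, grouped]
(* {<|"Country" -> "Brazil", "Mean" -> 175, "Max" -> 250|>,
    <|"Country" -> "Japan", "Mean" -> 300, "Max" -> 300|>} *)
```

Wrapping rows in Dataset should then give a flat table again, assuming the grouped values are associations as in this toy input.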

I think both of these functions should be included in the standard package.

$\endgroup$
