I have an extremely large CSV file that contains only entries like the following:
    Nil,+1
    int,+1
    int,-1
    int,-1
    Nil,+1
    Nil,-1
    Dictionary,+1
    Dictionary,-1
    Array,+1
    Nil,+1
    String,+1

I have parsed the file in Wolfram via

    ds = Import["/path/to/large/file.txt", {"CSV", "Dataset"}, HeaderLines -> 0];
    listOfAssoc = (Association[Rule @@ #1]) & /@ (ds // Normal);
    Merge[listOfAssoc, Total]

which yields
<|"int" -> -6159, "Nil" -> 72282, "Array" -> -9, "Dictionary" -> -15, "String" -> -371, "bool" -> -266, "float" -> 15857, "RID" -> 0, "Rect2" -> 0, "Color" -> -23, "PoolVector2Array" -> 0, "PoolRealArray" -> 0, "PoolIntArray" -> 0, "Vector2" -> -10, "PoolStringArray" -> 0, "Transform" -> -2, "Transform2D" -> 0, "Object" -> 1042, "Vector3" -> 612, "PoolVector3Array" -> 0, "PoolColorArray" -> 0, "Plane" -> -4, "Quat" -> 0, "AABB" -> 0, "Basis" -> 0, "NodePath" -> 0, "PoolByteArray" -> 0|>
This looks correct as far as the actual computation goes: I'm adding up a bunch of +1's and -1's from log data to see whether certain types in another program are leaking memory, and this result was very helpful for that.
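
For illustration, here is the same construction applied to the eleven sample rows shown at the top; the commented result is what Merge[..., Total] returns for them:

    (* the sample rows from the top of the post, as {type, ±1} pairs *)
    sample = {{"Nil", 1}, {"int", 1}, {"int", -1}, {"int", -1}, {"Nil", 1},
       {"Nil", -1}, {"Dictionary", 1}, {"Dictionary", -1}, {"Array", 1},
       {"Nil", 1}, {"String", 1}};

    (* one single-entry association per row, then merge with Total *)
    Merge[Association[Rule @@ #] & /@ sample, Total]
    (* <|"Nil" -> 2, "int" -> -1, "Dictionary" -> 0, "Array" -> 1, "String" -> 1|> *)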
Problem.
The last line of this computation (Merge[listOfAssoc, Total]) takes several minutes (roughly 10 minutes). Am I doing something wrong?

Merge[list, Total] seems to have roughly $O(n^2)$ time complexity for a large number of non-empty associations in list. Maybe this is the source of the trouble? In that case you could perform the operation more efficiently with some sort of divide-and-conquer scheme, thanks to the properties of Total: replace Merge[listOfAssoc, Total] with

    First@NestWhile[Merge[Total] /@ Partition[#, UpTo@64] &, listOfAssoc, Length[#] > 1 &]

This changes the order of summation, which may or may not be relevant in your application, but at least yields much closer to linear time complexity.
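
As an additional sketch (not from the question or the comments above): since each CSV row contributes a single key/value pair, the per-type totals can also be computed directly from the imported rows with GroupBy, without building one association per row. This assumes ds is the two-column Dataset from the Import above.

    (* sketch: Normal[ds] is assumed to be a list of {type, ±1} rows *)
    rows = Normal[ds];

    (* group rows by the type in the first column, take the ±1 from the
       second column as the group values, and Total each group *)
    totals = GroupBy[rows, First -> Last, Total]

On the sample rows above this yields the same totals as the Merge pipeline; whether it is actually faster on the full file is something to check with AbsoluteTiming.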