Hello, I often need to use groupByKey in my code, but I know it's a very heavy operation. Since I'm working on improving performance, I was wondering whether my approach of removing all groupByKey calls is efficient.
I used to create an RDD from another RDD, producing pairs of type (Int, Int):
rdd1 = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
and since I needed to obtain something like this:
[(1, [2, 3]), (2, [3, 4]), (3, [5])]
what I used was out = rdd1.groupByKey(), but this approach can be very problematic with huge datasets.
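For context, here is a minimal sketch of that original approach (assuming sc is an existing SparkContext, e.g. in spark-shell):

val rdd1 = sc.parallelize(Seq((1, 2), (1, 3), (2, 3), (2, 4), (3, 5)))
// groupByKey shuffles every individual value across the network before
// grouping, which is why it gets expensive on large datasets.
val out = rdd1.groupByKey()   // RDD[(Int, Iterable[Int])]

So I thought of the following solution instead.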
Instead of creating my RDD rdd1 with pairs of type (Int, Int), I now create it with pairs of type (Int, List[Int]), so rdd1 looks like this:
rdd1 = [(1, [2]), (1, [3]), (2, [3]), (2, [4]), (3, [5])]
and this time, to reach the same result, I use reduceByKey(_ ::: _) to concatenate all the values by key, which is supposed to be faster. Do you think this approach might improve performance? My worry about the (Int, List[Int]) type: isn't it wasteful to create pairs whose values are lists containing only one element?
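For reference, here is roughly what this second approach looks like in Scala (a sketch, reusing sc and rdd1 from above):

val rdd1Lists = rdd1.mapValues(List(_))    // RDD[(Int, List[Int])]
// reduceByKey combines values map-side before the shuffle, so less data
// crosses the network than with groupByKey; the cost here is building
// and concatenating many small single-element lists.
val out = rdd1Lists.reduceByKey(_ ::: _)   // RDD[(Int, List[Int])]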
Do you think there is a faster way to reach the same result, using some other method? Thank you.
Use aggregateByKey or combineByKey instead, with an empty List as the initializer, and then list.add and list.addAll for the combiner and merger, respectively. This would avoid creating single-element lists in the first place. I would have believed that groupByKey is already optimized to work well in such cases, though.
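A minimal Scala sketch of that suggestion (using prepend :: and concatenation ::: as the immutable-List equivalents of add/addAll, and assuming rdd1 is an RDD[(Int, Int)] as in the question):

// aggregateByKey: start each key with an empty list
val out = rdd1.aggregateByKey(List.empty[Int])(
  (acc, v) => v :: acc,   // seqOp: prepend each value to the per-partition list
  (l1, l2) => l1 ::: l2   // combOp: concatenate partial lists across partitions
)

// The equivalent with combineByKey:
val out2 = rdd1.combineByKey(
  (v: Int) => List(v),                          // createCombiner
  (acc: List[Int], v: Int) => v :: acc,         // mergeValue
  (l1: List[Int], l2: List[Int]) => l1 ::: l2   // mergeCombiners
)

Both still shuffle one list entry per value in the worst case, but they avoid materializing a single-element list per record up front.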