How to group and add up in spark? [duplicate]

Question

I have an RDD like this:

{"key1" : "fruit" , "key2" : "US" , "key3" : "1" } {"key1" : "fruit" , "key2" : "US" , "key3" : "2" } {"key1" : "vegetable" , "key2" : "US" , "key3" : "1" } {"key1" : "fruit" , "key2" : "Japan" , "key3" : "3" } {"key1" : "vegetable" , "key2" : "Japan" , "key3" : "3" }

My goal is to first group by key1 and then group by key2 and finally add key3.

I am expecting final result like,

key1 key2 key3 "fruit" , "US" , 3 "vegetable" , "US" , 1 "fruit" , "Japan" , 3 "vegetable" , "Japan" , 3

My code begins as below ,

rdd_arm = rdd_arm.map(lambda x: x[1])

rdd_arm includes the above key : value format.

I am not sure where to go next. Could some one help me out?

Community · Accepted Answer · 2017-05-23 12:24:17Z

2

I solved it myself.

I had to create a key including multiple keys and then add up.

rdd_arm.map( lambda x : x[0] + ", " + x[1] , x[2] ).reduceByKey( lambda a,b : a + b )

Below question was useful.

How to group by multiple keys in spark?

edited May 23, 2017 at 12:24

CommunityBot

11 silver badge

answered Aug 19, 2016 at 7:39

Yu Watanabe

6514 silver badges20 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

gsamaras Over a year ago

Allow me to say that this didn't work for me, I was getting errors of undefined name, and after getting getting by them, I was not able to make it work. As a result I posted a new answer, hope you like it! I upvoted the question though, since it made me practice! Thanks!

gsamaras · Accepted Answer · 2016-08-19 18:58:56Z

Let's create your RDD:

In [1]: rdd_arm = sc.parallelize([{"key1" : "fruit" , "key2" : "US" , "key3" : "1" }, {"key1" : "fruit" , "key2" : "US" , "key3" : "2" }, {"key1" : "vegetable" , "key2" : "US" , "key3" : "1" }, {"key1" : "fruit" , "key2" : "Japan" , "key3" : "3" }, {"key1" : "vegetable" , "key2" : "Japan" , "key3" : "3" }]) In [2]: rdd_arm.collect() Out[2]: [{'key1': 'fruit', 'key2': 'US', 'key3': '1'}, {'key1': 'fruit', 'key2': 'US', 'key3': '2'}, {'key1': 'vegetable', 'key2': 'US', 'key3': '1'}, {'key1': 'fruit', 'key2': 'Japan', 'key3': '3'}, {'key1': 'vegetable', 'key2': 'Japan', 'key3': '3'}]

First, you have to create a new key, which will be the pair of key1 and key2. The value of it will be key3, so you want to do something like this:

In [3]: new_rdd = rdd_arm.map(lambda x: (x['key1'] + ", " + x['key2'], x['key3'])) In [4]: new_rdd.collect() Out[4]: [('fruit, US', '1'), ('fruit, US', '2'), ('vegetable, US', '1'), ('fruit, Japan', '3'), ('vegetable, Japan', '3')]

Then, we want to add the values of the keys that are duplicates, simply be calling reduceByKey(), like this:

In [5]: new_rdd = new_rdd.reduceByKey(lambda a, b: int(a) + int(b)) In [6]: new_rdd.collect() Out[6]: [('fruit, US', 3), ('fruit, Japan', '3'), ('vegetable, US', '1'), ('vegetable, Japan', '3')]

and we are done!

Of course, this could be one-liner, like this:

new_rdd = rdd_arm.map(lambda x: (x['key1'] + ", " + x['key2'], x['key3'])).reduceByKey(lambda a, b: int(a) + int(b))

Collectives™ on Stack Overflow

How to group and add up in spark? [duplicate]

2 Answers 2

1 Comment

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Linked

Related