Spark: use reduceByKey instead of groupByKey and mapValues

Question

I have an RDD with duplicate values in the following format:

[ {key1: A}, {key1: A}, {key1: B}, {key1: C}, {key2: B}, {key2: B}, {key2: D}, ..]

I would like the new RDD to have the following output, with the duplicates removed:

[ {key1: [A,B,C]}, {key2: [B,D]}, ..]

I have managed to do this with the following code, which puts the values in a set to get rid of duplicates:

RDD_unique = RDD_duplicates.groupByKey().mapValues(lambda x: set(x))

But I am trying to achieve this more elegantly in a single command with

RDD_unique = RDD_duplicates.reduceByKey(...)

I have not managed to come up with a lambda function for reduceByKey that gives me the same result.

To remove the duplicate, have you tried spark.apache.org/docs/1.1.1/api/python/… ? – ccheneson, Commented Jun 17, 2015 at 14:53

abalcerek · Accepted Answer · 2015-06-17 16:01:14Z

You can do it like this:

data = (sc.parallelize([ {key1: A}, {key1: A}, {key1: B}, {key1: C}, {key2: B}, {key2: B}, {key2: D}, ..]))

result = (data
    .mapValues(lambda x: {x})
    .reduceByKey(lambda s1, s2: s1.union(s2)))
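The reason for the extra mapValues step is that reduceByKey merges two values of the same type into one value of that type, so each value has to become a one-element set before the sets can be unioned. Below is a minimal sketch of that idea, assuming the data is stored as (key, value) tuples, which is what PySpark's pair-RDD operations expect; the `{key1: A}` notation in the question is treated as shorthand. The names `to_set`, `merge`, `dedupe_with_spark` and `dedupe_locally` are made up for illustration, and `dedupe_locally` is only a plain-Python stand-in that mimics the per-key merging so the logic can be checked without a cluster.

```python
# Sample (key, value) tuples standing in for the question's data (assumed shape).
pairs = [("key1", "A"), ("key1", "A"), ("key1", "B"), ("key1", "C"),
         ("key2", "B"), ("key2", "B"), ("key2", "D")]


def to_set(value):
    """Wrap a single value in a set so values for one key can be merged."""
    return {value}


def merge(s1, s2):
    """Union two sets; associative and commutative, as reduceByKey requires."""
    return s1 | s2


def dedupe_with_spark(sc, pairs):
    """Same pipeline as the answer; `sc` is an existing SparkContext."""
    return sc.parallelize(pairs).mapValues(to_set).reduceByKey(merge)


def dedupe_locally(pairs):
    """Plain-Python stand-in (not Spark) that applies the same two functions."""
    result = {}
    for key, value in pairs:
        single = to_set(value)
        result[key] = merge(result[key], single) if key in result else single
    return result


local_result = dedupe_locally(pairs)
```

With a running SparkContext, `dedupe_with_spark(sc, pairs).collect()` should return the same keys and sets as `local_result`, although the order of the collected pairs is not guaranteed.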
