I have an RDD with duplicate values in the following format:

[ {key1: A}, {key1: A}, {key1: B}, {key1: C}, {key2: B}, {key2: B}, {key2: D}, ..] 

I would like the new RDD to have the following output, with the duplicates removed:

[ {key1: [A,B,C]}, {key2: [B,D]}, ..] 

I have managed to do this with the following code, which puts the values in a set to get rid of duplicates:

RDD_unique = RDD_duplicates.groupByKey().mapValues(lambda x: set(x)) 

But I am trying to achieve this more elegantly, in a single command, with

RDD_unique = RDD_duplicates.reduceByKey(...) 

I have not managed to come up with a lambda function for reduceByKey that gives me the same result.

1 Answer

You can do it like this:

data = (sc.parallelize([ {key1: A}, {key1: A}, {key1: B}, {key1: C}, {key2: B}, {key2: B}, {key2: D}, ..]))
result = (data
    .mapValues(lambda x: {x})
    .reduceByKey(lambda s1, s2: s1.union(s2)))
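
To see why this works, here is a minimal plain-Python sketch of the same idea that runs without Spark. It assumes the records are (key, value) tuples, which is the shape mapValues and reduceByKey operate on. The sample data and the helpers map_values and reduce_by_key are hypothetical stand-ins written for illustration, not Spark APIs.

```python
# Hypothetical sample data: (key, value) tuples standing in for the RDD's pairs.
pairs = [
    ("key1", "A"), ("key1", "A"), ("key1", "B"), ("key1", "C"),
    ("key2", "B"), ("key2", "B"), ("key2", "D"),
]


def map_values(records, f):
    """Illustrative stand-in for mapValues: apply f to each value, keep the key."""
    return [(k, f(v)) for k, v in records]


def reduce_by_key(records, f):
    """Illustrative stand-in for reduceByKey: fold values that share a key with f."""
    acc = {}
    for k, v in records:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc


# Step 1: wrap every value in a one-element set, so that the reducer
#         receives and returns the same type (set, set) -> set.
# Step 2: merge sets per key; the union drops the duplicates.
result = reduce_by_key(
    map_values(pairs, lambda x: {x}),
    lambda s1, s2: s1.union(s2),
)

print(result)
```

The wrapping step matters because the function given to reduceByKey combines two values of the same type into one value of that type; a lambda that tried to turn raw values into a set inside the reducer would receive a mix of sets and plain values. With an existing SparkContext named sc (an assumption here), the same tuples could be passed through sc.parallelize(pairs).mapValues(lambda x: {x}).reduceByKey(lambda s1, s2: s1.union(s2)).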