Given

data = sc.parallelize([(1,'winter is coming'),(2,'ours is the fury'),(3,'the old the true the brave')]) 

My desired output is

[('fury',[2]),('true',[3]),('is',[1,2]),('old',[3]),('the',[2,3]),('ours',[2]),('brave',[3]),('winter',[1]),('coming',[1])]

I'm not sure how to map the data to the following intermediate output

[(1,'winter'),(1,'is'),(1,'coming'),(2,'ours'),(2,'is'),....etc.]

I tried using

data.flatMap(lambda x: [(x[0], v) for v in x[1]])

but this maps the key to each character of the string instead of to each word. Should I use flatMap, map, or split here?

After mapping, I plan to reduce the pair RDD by key and then swap each key and value using

data.reduceByKey(lambda a,b: a+b).map(lambda x:(x[1],x[0])).collect() 

Is my thinking correct?

1 Answer

You can flatMap and create tuples where keys are reused and an entry is created for each word (obtained using split()):

data.flatMap(lambda pair: [(pair[0], word) for word in pair[1].split()]) 

When collected, that outputs

[(1, 'winter'), (1, 'is'), (1, 'coming'), (2, 'ours'), (2, 'is'), (2, 'the'), (2, 'fury'), (3, 'the'), (3, 'old'), (3, 'the'), (3, 'true'), (3, 'the'), (3, 'brave')] 
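
To get from there to the word-to-ids output in the question, the word probably needs to be the key before reducing, since reduceByKey groups on the first element of each pair. Below is a minimal sketch of one way to do that, not the only one. The helper names word_doc_pairs and inverted_index are made up for illustration, and it assumes each id should appear at most once per word (as in ('the',[2,3]) in the desired output).

```python
def word_doc_pairs(pair):
    """Turn (doc_id, text) into [(word, [doc_id]), ...], one entry per distinct word.

    Hypothetical helper: set() drops repeated words within a line, and
    sorted() only makes the local result deterministic.
    """
    doc_id, text = pair
    return [(word, [doc_id]) for word in sorted(set(text.split()))]


def inverted_index(rdd):
    """Build (word, [doc_ids]) pairs from a pair RDD of (doc_id, text).

    Hypothetical helper; uses only RDD.flatMap, RDD.reduceByKey and RDD.mapValues.
    """
    return (
        rdd.flatMap(word_doc_pairs)           # word becomes the key
           .reduceByKey(lambda a, b: a + b)   # concatenate the id lists per word
           .mapValues(sorted)                 # ids in ascending order
    )
```

With the RDD from the question, inverted_index(data).collect() should give one (word, ids) pair per word; the order of the pairs in the collected list is not guaranteed.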