PySpark: How to Split String Value in Paired RDD and Map with Key

Question

Given

data = sc.parallelize([(1,'winter is coming'),(2,'ours is the fury'),(3,'the old the true the brave')])

My desired output is

[('fury',[2]),('true',[3]),('is',[1,2]),('old',[3]),('the',[2,3]),('ours',[2]),('brave',[3]),('winter',[1]),('coming',[1])]

I'm not sure how to produce the following intermediate output:

[(1,'winter'),(1,'is'),(1,'coming'),(2,'ours'),(2,'is'),....etc.]

I tried using

data.flatMap(lambda x: [(x[0], v) for v in x[1]])

but this mapped the key to each letter of the string instead of each word. Should flatMap, map, or split be used here?

After mapping, I plan to reduce the pairs that share a key and then swap key and value using

data.reduceByKey(lambda a,b: a+b).map(lambda x:(x[1],x[0])).collect()

Is my thinking correct?

ernest_k · Accepted Answer · 2020-04-25 14:41:44Z

You can flatMap and create tuples where keys are reused and an entry is created for each word (obtained using split()):

data.flatMap(lambda pair: [(pair[0], word) for word in pair[1].split()])

When collected, that outputs

[(1, 'winter'), (1, 'is'), (1, 'coming'), (2, 'ours'), (2, 'is'), (2, 'the'), (2, 'fury'), (3, 'the'), (3, 'old'), (3, 'the'), (3, 'true'), (3, 'the'), (3, 'brave')]
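
To get from there to the desired output, note that calling reduceByKey on (id, word) pairs would merge the words under each line number, so the word probably needs to become the key before reducing. Below is a minimal sketch of that idea in plain Python. The helpers flat_map and reduce_by_key are hypothetical local stand-ins (not PySpark functions) so the logic can run without a cluster, and the commented PySpark chain is an untested assumption about how the same steps would look on an RDD.

```python
def flat_map(func, records):
    """Hypothetical local stand-in for RDD.flatMap: apply func, flatten one level."""
    return [item for record in records for item in func(record)]


def reduce_by_key(func, pairs):
    """Hypothetical local stand-in for RDD.reduceByKey: merge values sharing a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


data = [(1, 'winter is coming'),
        (2, 'ours is the fury'),
        (3, 'the old the true the brave')]

# Emit (word, [line_id]) so the word is already the key.
# set() drops repeated words within one line, so 'the' yields [3] once, not three times.
word_pairs = flat_map(
    lambda pair: [(word, [pair[0]]) for word in set(pair[1].split())],
    data,
)

# Concatenate the id lists per word, then sort each list for a stable result.
index = reduce_by_key(lambda a, b: a + b, word_pairs)
index = {word: sorted(ids) for word, ids in index.items()}

# Assumed PySpark equivalent (not run here), with data as the RDD from the question:
# data.flatMap(lambda pair: [(word, [pair[0]]) for word in set(pair[1].split())]) \
#     .reduceByKey(lambda a, b: a + b) \
#     .mapValues(sorted) \
#     .collect()
```

Here index maps each word to the sorted list of line ids that contain it. The order of the words themselves is not meaningful, just as the order of the collected pairs would not be guaranteed in Spark.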
