Given

    data = sc.parallelize([(1,'winter is coming'),(2,'ours is the fury'),(3,'the old the true the brave')])

my desired output is

    [('fury',[2]),('true',[3]),('is',[1,2]),('old',[3]),('the',[2,3]),('ours',[2]),('brave',[3]),('winter',[1]),('coming',[1])]

I'm not sure how to map the data to the following intermediate output:

    [(1,'winter'),(1,'is'),(1,'coming'),(2,'ours'),(2,'is'),....etc.]

I tried using

    data.flatMap(lambda x: [(x[0], v) for v in x[1]])

but this ended up pairing the key with each letter of the string instead of each word. Should flatMap, map, or split be used here?
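The same behaviour shows up in plain Python, which makes me suspect the comprehension rather than flatMap itself. A minimal sketch without Spark (`doc_id` and `text` are names used only for this illustration): iterating over a string yields characters, while `str.split()` with no arguments splits on whitespace and yields words.

```python
doc_id, text = 1, 'winter is coming'

# Iterating over the string directly walks it character by character.
by_char = [(doc_id, v) for v in text]

# Splitting on whitespace first yields one pair per word.
by_word = [(doc_id, v) for v in text.split()]

print(by_char[:3])  # [(1, 'w'), (1, 'i'), (1, 'n')]
print(by_word)      # [(1, 'winter'), (1, 'is'), (1, 'coming')]
```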
After mapping, I plan to combine the pairs that share a key and swap each key and value, using

    data.reduceByKey(lambda a,b: a+b).map(lambda x:(x[1],x[0])).collect()

Is my thinking correct?
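For reference, here is a plain-Python sketch (no Spark) of the result I'm after, grouping the ids by word. The names `records` and `build_index` are only for illustration, and the result is a dict rather than a list of tuples, so its ordering differs from the desired output shown earlier.

```python
from collections import defaultdict

records = [(1, 'winter is coming'),
           (2, 'ours is the fury'),
           (3, 'the old the true the brave')]

def build_index(records):
    """Map each word to the sorted list of ids of the strings containing it."""
    index = defaultdict(set)
    for doc_id, text in records:
        for word in text.split():
            # A set keeps each id once, even when a word repeats in a string.
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

index = build_index(records)
print(index['is'])   # [1, 2]
print(index['the'])  # [2, 3]
```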