Given an RDD with several key-value pairs, where each value is actually a list of values, how do I split the value lists so that I end up with simple key-value pairs?

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

foo = sc.parallelize([(0, [1, 1, 4]), (1, [3, 5])])
bar = foo.map(magic)
bar.collect()
# desired output: [(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]

What would magic look like to achieve what I want?

2 Answers

Figured it out:

bar = foo.flatMap(lambda l: [(l[0], value) for value in l[1]])

I realize that this is a rather simple problem and solution, but I'll leave it up in case anyone else is struggling while starting out with pyspark.
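For completeness, a minimal end-to-end sketch of this approach (the local SparkContext setup and the RDD contents mirror the question; assume a fresh Python session):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf())
foo = sc.parallelize([(0, [1, 1, 4]), (1, [3, 5])])

# flatMap flattens the list returned for each element, so every
# (key, [values]) pair expands into several (key, value) pairs
bar = foo.flatMap(lambda l: [(l[0], value) for value in l[1]])
print(bar.collect())  # [(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]

The difference from map is exactly the flattening: map would return one list per key, while flatMap emits each element of the returned list as its own record.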


Python lets you chain arbitrarily many for clauses inside a single comprehension, essentially letting you "unwrap" a nested structure like this. Each "layer" of nesting becomes a new for _ in _ clause:

lambda l: [(key, value) for (key, values) in l for value in values]

>>> l = [(0, [1, 1, 4]), (1, [3, 5])]
>>> [(key, value) for (key, values) in l for value in values]
[(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]

Comments

This does not seem to work with pyspark. Using your lambda function for magic, I get TypeError: cannot unpack non-iterable int object.
That means that at some level of the nesting, you tried to "unwrap" an integer type rather than a container type. Can you paste exactly what you tried?
bar = foo.map(lambda l: [(key, value) for (key, values) in l for value in values])
Hmmm, I may be out of my depth here. I copied my second snippet from a python interpreter, so assuming you're using Python 3 I think we should see the same behavior. It's possible that pyspark is coming into play here in some way I can't see, in which case my answer probably isn't helpful: my answer is generic Python.
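A likely explanation, reading between the lines of this thread: RDD.map applies its function to each element separately, so inside the lambda, l is already a single (key, values) tuple rather than the whole list, and for (key, values) in l tries to unpack the integer key. Under that assumption, the comprehension style can still be used per element, for example with a small helper (the name split_pair is illustrative, not from the thread):

def split_pair(pair):
    # each RDD element is one (key, values) tuple
    key, values = pair
    return [(key, value) for value in values]

bar = foo.flatMap(split_pair)
print(bar.collect())  # [(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]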
