Given an RDD with several key-value pairs, where each value is actually a list of values, how do I split the value lists so that I end up with simple key-value pairs?

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

foo = sc.parallelize([(0, [1, 1, 4]), (1, [3, 5])])
bar = foo.map(magic)
bar.collect()
# desired output: [(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]

What would magic look like to achieve what I want?

2 Answers

Figured it out:

bar = foo.flatMap(lambda l: [(l[0], value) for value in l[1]])

I realize that this is a rather simple problem and solution, but I'll leave it up in case anyone else is struggling while starting out with pyspark.
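For completeness, a minimal end-to-end sketch of this approach (the local SparkContext setup and the RDD contents mirror the question; assume a fresh Python session):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf())
foo = sc.parallelize([(0, [1, 1, 4]), (1, [3, 5])])

# flatMap flattens the list returned for each element, so every
# (key, [values]) pair expands into several (key, value) pairs
bar = foo.flatMap(lambda l: [(l[0], value) for value in l[1]])
print(bar.collect())  # [(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]

The difference from map is exactly the flattening: map would return one list per key, while flatMap emits each element of the returned list as its own record.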


Python lets you chain arbitrarily many for clauses inside a single comprehension, essentially letting you "unwrap" a nested structure like this. Each "layer" of nesting becomes a new for _ in _ clause:

lambda l: [(key, value) for (key, values) in l for value in values]

>>> l = [(0, [1, 1, 4]), (1, [3, 5])]
>>> [(key, value) for (key, values) in l for value in values]
[(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]

Comments

This does not seem to work with pyspark. Using your lambda function for magic, I get TypeError: cannot unpack non-iterable int object.
That means that at some level of the nesting, you tried to "unwrap" an integer type rather than a container type. Can you paste exactly what you tried?
bar = foo.map(lambda l: [(key, value) for (key, values) in l for value in values])
Hmmm, I may be out of my depth here. I copied my second snippet from a python interpreter, so assuming you're using Python 3 I think we should see the same behavior. It's possible that pyspark is coming into play here in some way I can't see, in which case my answer probably isn't helpful: my answer is generic Python.
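A likely explanation, reading between the lines of this thread: RDD.map applies its function to each element separately, so inside the lambda, l is already a single (key, values) tuple rather than the whole list, and for (key, values) in l tries to unpack the integer key. Under that assumption, the comprehension style can still be used per element, for example with a small helper (the name split_pair is illustrative, not from the thread):

def split_pair(pair):
    # each RDD element is one (key, values) tuple
    key, values = pair
    return [(key, value) for value in values]

bar = foo.flatMap(split_pair)
print(bar.collect())  # [(0, 1), (0, 1), (0, 4), (1, 3), (1, 5)]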
