How to specify the partition for mapPartitions in Spark

Question

What I would like to do is process each list separately. For example, if I have 5 lists ([1,2,3,4,5,6],[2,3,4,5,6],[3,4,5,6],[4,5,6],[5,6]) and I want to get the 5 lists without the 6, I would do something like:

    data = [1,2,3,4,5,6] + [2,3,4,5,6,7] + [3,4,5,6,7,8] + [4,5,6,7,8,9] + [5,6,7,8,9,10]

    def function_1(iter_listoflist):
        final_iterator = []
        for sublist in iter_listoflist:
            final_iterator.append([x for x in sublist if x != 6])
        return iter(final_iterator)

    sc.parallelize(data, 5).glom().mapPartitions(function_1).collect()

Then I cut the lists so I get the original lists back. Is there a way to simply separate the computation? I don't want the lists to mix, and they might be of different sizes.

thank you

Philippe

No, it is not always the last element. I just want to compute the lists in parallel, independently of each other. The input to parallelize is whatever works, and this worked for lists of the same size. If there is a way to skip parallelize and specify the partitions directly, that would be great. I just want the different lists computed separately from each other, giving me the different results, which are also lists.
– Philippe C, Commented Nov 6, 2015 at 8:49
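On the request to place each list in a chosen partition, one possible approach is to key every list by its index and repartition by that key. The sketch below is only an illustration under assumptions: `keyed`, `partition_of`, and `one_list_per_partition` are hypothetical names, `sc` is assumed to be an existing SparkContext, and the placement rule (key modulo the number of partitions) is my understanding of how PySpark's `partitionBy` applies a partition function, so it is worth checking with `glom()` on a real cluster.

```python
def keyed(lists):
    """Pair every list with its index so the index can act as a partition key."""
    return list(enumerate(lists))


def partition_of(key, num_partitions):
    """Placement rule assumed here: the key modulo the number of partitions."""
    return key % num_partitions


def one_list_per_partition(sc, lists):
    """Build an RDD in which list i is meant to sit alone in partition i.

    `sc` is assumed to be an already created SparkContext.
    """
    n = len(lists)
    pairs = sc.parallelize(keyed(lists), n)
    # partitionBy routes each (key, value) pair using the given function of the key
    return pairs.partitionBy(n, lambda k: k).values()


# Lists of different sizes, as in the question text
lists = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [3, 4, 5, 6], [4, 5, 6], [5, 6]]

# Local check of the intended placement, no Spark required
placement = {k: partition_of(k, len(lists)) for k, _ in keyed(lists)}
```

With a SparkContext at hand, `one_list_per_partition(sc, lists).glom().collect()` would show which lists ended up in which partition.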

zero323 · Accepted Answer · 2015-11-06 08:59:06Z

As far as I understand your intentions, all you need here is to keep individual lists separate when you parallelize your data:

    data = [[1,2,3,4,5,6], [2,3,4,5,6,7], [3,4,5,6,7,8], [4,5,6,7,8,9], [5,6,7,8,9,10]]
    rdd = sc.parallelize(data)

    rdd.take(1)  # A single element of a RDD is a whole list
    ## [[1, 2, 3, 4, 5, 6]]

Now you can simply map using a function of your choice:

    def drop_six(xs):
        return [x for x in xs if x != 6]

    rdd.map(drop_six).take(3)
    ## [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]
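Since the question mentions lists of different sizes, here is a minimal sketch of the same idea on the ragged lists from the question text. The names `drop_value` and `run_on_spark` are made up for illustration, and `sc` is assumed to be an already created SparkContext. The local list comprehension only mirrors what `map` would do element by element.

```python
def drop_value(xs, value=6):
    """Return a copy of one list without the given value."""
    return [x for x in xs if x != value]


def run_on_spark(sc, lists):
    """Apply drop_value to every list independently.

    `sc` is assumed to be an existing SparkContext; each inner list
    is a single RDD element, so lists never get mixed together.
    """
    return sc.parallelize(lists).map(drop_value).collect()


# Lists of different sizes, taken from the question text
ragged = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [3, 4, 5, 6], [4, 5, 6], [5, 6]]

# Local equivalent of the per-element map, no Spark required
local_result = [drop_value(xs) for xs in ragged]
# local_result == [[1, 2, 3, 4, 5], [2, 3, 4, 5], [3, 4, 5], [4, 5], [5]]
```

Because the function only ever sees one list at a time, the lengths of the other lists do not matter.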

Actually, I just tried it with mapPartitions instead of map and it didn't work! It gave me the same list as the input. I think there is something I don't understand, even after reading stackoverflow.com/questions/21185092/… and the documentation.
Does map compute in parallel because of the parallelize call?
And it shouldn't work. Moreover, there is no reason to use mapPartitions here. map over an RDD works in parallel and, simplifying things a lot, you can think of that as a result of parallelize.
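One plausible explanation for the unchanged output reported in the comments: mapPartitions hands the function an iterator over all elements of a partition, so inside drop_six each x is a whole list rather than a number, and a list never equals 6. The sketch below imitates that in plain Python, with no Spark involved. `drop_six_in_partition` is a hypothetical name for a variant that would suit mapPartitions, and `partition` stands in for the contents of one partition.

```python
def drop_six(xs):
    return [x for x in xs if x != 6]


def drop_six_in_partition(lists):
    """Variant shaped for mapPartitions: takes an iterator of lists, yields lists."""
    for xs in lists:
        yield drop_six(xs)


# Stand-in for the elements held by a single partition
partition = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7]]

# What passing drop_six to mapPartitions amounts to: every x is a list,
# `x != 6` is always true, so nothing is filtered out
unchanged = drop_six(iter(partition))

# Iterating over the partition and filtering each list gives the intended result
fixed = list(drop_six_in_partition(iter(partition)))
# fixed == [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7]]
```

In this setup `rdd.map(drop_six)` and `rdd.mapPartitions(drop_six_in_partition)` should give the same lists, which is consistent with the remark that mapPartitions is not needed here.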
