
What I would like to do is compute each list separately. For example, if I have 5 lists ([1,2,3,4,5,6],[2,3,4,5,6],[3,4,5,6],[4,5,6],[5,6]) and I would like to get the 5 lists without the 6, I would do something like:

    data = [1,2,3,4,5,6] + [2,3,4,5,6,7] + [3,4,5,6,7,8] + [4,5,6,7,8,9] + [5,6,7,8,9,10]

    def function_1(iter_listoflist):
        final_iterator = []
        for sublist in iter_listoflist:
            final_iterator.append([x for x in sublist if x != 6])
        return iter(final_iterator)

    sc.parallelize(data, 5).glom().mapPartitions(function_1).collect()

and then cut the lists so that I get the original lists back. Is there a way to simply keep the computations separate? I don't want the lists to mix, and they might be of different sizes.

thank you

Philippe

  • No, not always the last element. I just want to compute the lists in parallel, independently of each other. The input to parallelize can be whatever works, and this worked for lists of the same size. If there is a way to skip parallelize and supply the partitions directly, that would be great. I just want it to compute the different lists separately from each other and give me the different results, which are also lists. Commented Nov 6, 2015 at 8:49

1 Answer

As far as I understand your intentions, all you need here is to keep the individual lists separate when you parallelize your data:

    data = [[1,2,3,4,5,6], [2,3,4,5,6,7], [3,4,5,6,7,8], [4,5,6,7,8,9], [5,6,7,8,9,10]]

    rdd = sc.parallelize(data)
    rdd.take(1)  # A single element of a RDD is a whole list
    ## [[1, 2, 3, 4, 5, 6]]

Now you can simply map using a function of your choice:

    def drop_six(xs):
        return [x for x in xs if x != 6]

    rdd.map(drop_six).take(3)
    ## [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]

3 Comments

Actually, I just tried it with mapPartitions instead of map and it didn't work! It gave me the same list as the input. I think there is something I don't understand, even after reading stackoverflow.com/questions/21185092/… and the documentation.
Does map compute in parallel because of the parallelize call?
And it shouldn't work. Moreover, there is no reason to use mapPartitions here. map over an RDD works in parallel and, simplifying things a lot, you can think of that as a result of parallelize.
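
To illustrate why swapping map for mapPartitions probably returned the input unchanged, here is a minimal sketch in plain Python, with no Spark involved. The helpers `fake_map` and `fake_map_partitions` are hypothetical stand-ins that only mimic what the function passed to each call receives, and the two-way split of the data is an assumption rather than what parallelize would necessarily produce. The lists are the different-sized ones from the question text.

```python
from itertools import chain

# Lists of different sizes, taken from the question text.
data = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [3, 4, 5, 6], [4, 5, 6], [5, 6]]

# Assumed partitioning, for illustration only (not Spark's actual slicing).
partitions = [data[:2], data[2:]]


def drop_six(xs):
    return [x for x in xs if x != 6]


def fake_map(f, parts):
    """Hypothetical stand-in for rdd.map(f).collect(): f sees one element at a time."""
    return [f(element) for part in parts for element in part]


def fake_map_partitions(f, parts):
    """Hypothetical stand-in for rdd.mapPartitions(f).collect():
    f sees an iterator over all elements of one partition."""
    return list(chain.from_iterable(f(iter(part)) for part in parts))


def drop_six_per_partition(lists):
    # A partition-level function has to loop over the lists itself.
    for xs in lists:
        yield drop_six(xs)


# Each element is a whole list, so drop_six filters inside every list.
mapped = fake_map(drop_six, partitions)

# Here drop_six receives an iterator of lists; no list equals 6, so nothing is dropped.
unchanged = fake_map_partitions(drop_six, partitions)

# Looping over the partition explicitly gives the same result as map.
fixed = fake_map_partitions(drop_six_per_partition, partitions)

print(mapped)
print(unchanged == data)
print(fixed == mapped)
```

Under these assumptions the per-element version handles lists of different lengths without any extra work, which is why plain map is the simpler choice here.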
