
I want to create a number of instances of a class based on values in a pandas.DataFrame. This I've got down.

```python
import itertools
import multiprocessing as mp
import pandas as pd

class Toy:
    id_iter = itertools.count(1)

    def __init__(self, row):
        self.id = next(self.id_iter)
        self.type = row['type']

if __name__ == "__main__":
    table = pd.DataFrame({
        'type': ['a', 'b', 'c'],
        'number': [5000, 4000, 30000]
    })

    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
```

Multiprocessing Attempts

I've been able to parallelize this (sort of) by adding the following:

```python
pool = mp.Pool(processes=mp.cpu_count())
m = mp.Manager()
q = m.Queue()

for index, row in table.iterrows():
    pool.apply_async([Toy(row) for _ in range(row['number'])])
```

It seems this would be faster if the values in row['number'] were substantially larger than the number of rows in table. But in my actual case, table is thousands of rows long, and each row['number'] is relatively small.

It seems smarter to break table up into cpu_count() chunks and iterate within each chunk. But now we're at the edge of my Python skills.
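A minimal sketch of that chunking idea (the table below is a small stand-in for the real thousands-of-rows one; np.array_split(table, n) achieves the same split in one call):

```python
import pandas as pd
from multiprocessing import cpu_count

# Hypothetical stand-in table; the real one is thousands of rows long.
table = pd.DataFrame({
    'type': ['a', 'b', 'c'] * 10,
    'number': [5, 4, 3] * 10,
})

n_chunks = cpu_count()
# Ceiling division so every row lands in exactly one chunk.
chunk_size = -(-len(table) // n_chunks)
chunks = [table.iloc[i:i + chunk_size] for i in range(0, len(table), chunk_size)]
```

Each chunk is itself a DataFrame, so a worker can iterate it with .iterrows() exactly as the sequential version does.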

I've tried things that the Python interpreter screams at me for, like:

```python
pool.apply_async(
    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
)
```

Also things that "can't be pickled":

```python
Parallel(n_jobs=4)(
    delayed(Toy)([row for _ in range(row['number'])])
    for index, row in table.iterrows()
)
```

Edit

This may have gotten me a little closer, but still not there. I create the class instances in a separate function:

```python
def create_toys(row):
    [Toy(row) for _ in range(row['number'])]

....

Parallel(n_jobs=4, backend="threading")(
    (create_toys)(row) for i, row in table.iterrows()
)
```

but I'm told `'NoneType' object is not iterable`.
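For what it's worth, that error plausibly comes from two things at once: create_toys has no return statement (so it yields None), and calling (create_toys)(row) in the generator runs it eagerly instead of handing joblib a lazy delayed call. A hedged sketch of both fixes, assuming joblib is installed and using a trimmed-down Toy:

```python
import pandas as pd
from joblib import Parallel, delayed

class Toy:
    def __init__(self, row):
        self.type = row['type']

def create_toys(row):
    # Explicit return -- without it the function returns None,
    # which is what triggers "'NoneType' object is not iterable".
    return [Toy(row) for _ in range(row['number'])]

table = pd.DataFrame({'type': ['a', 'b'], 'number': [3, 2]})

# delayed(create_toys)(row) builds a lazy (function, args) call for
# Parallel to schedule, instead of executing create_toys immediately.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(create_toys)(row) for _, row in table.iterrows()
)
```

results is then a list of per-row lists, one per DataFrame row, in order.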

  • Did you see this question? stackoverflow.com/questions/26784164/… Commented Jun 9, 2015 at 19:36
  • No I didn't; looking at it now. Commented Jun 9, 2015 at 19:44
  • I can see how that applies, but I can't quite coerce it to my problem. Commented Jun 9, 2015 at 20:15
  • You create a number of Toy instances, but it looks like you just throw them away. It's not clear why you're doing any of this, which makes it hard to suggest ways to do it better. Commented Jun 10, 2015 at 1:43
  • In my real case the class calls a write method that writes the instance to an xml tree. That's an entirely different question... Commented Jun 10, 2015 at 1:48

1 Answer


It's a little unclear to me what output you are expecting. Do you just want a big list of the form

```python
[Toy(row_1), ..., Toy(row_n)]
```

where each Toy(row_i) appears with multiplicity row_i.number?

Based on the answer mentioned by @JD Long, I think you could do something like this:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

def process(df):
    L = []
    # Iterate over the chunk this worker received, not the full table.
    for index, row in df.iterrows():
        L += [Toy(row) for _ in range(row['number'])]
    return L

table = pd.DataFrame({
    'type': ['a', 'b', 'c'] * 10,
    'number': [5000, 4000, 30000] * 10
})

p = mp.Pool(processes=8)
split_dfs = np.array_split(table, 8)
pool_results = p.map(process, split_dfs)
p.close()
p.join()

# merging parts processed by different processes
result = [a for L in pool_results for a in L]
```
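If the final flattening comprehension reads awkwardly, itertools.chain.from_iterable is an equivalent alternative (the pool_results value below is a stand-in for the per-process lists):

```python
import itertools

# Stand-in for what p.map returns: one list per worker process.
pool_results = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3']]

# Equivalent to: result = [a for L in pool_results for a in L]
result = list(itertools.chain.from_iterable(pool_results))
```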

3 Comments

This is exactly what I needed, though that last line took me a long time to figure out. I ended up on this question before seeing you had already covered what I needed!
Nice one, I actually quite dislike that syntax, I find it quite unreadable and I can never remember which order the loops run in. (not sure how I'd do it differently though)
can you please have a look at this question :- stackoverflow.com/questions/53561794/…
