I want to create a number of instances of a class based on values in a pandas.DataFrame. This I've got down.

    import itertools
    import multiprocessing as mp
    import pandas as pd

    class Toy:
        id_iter = itertools.count(1)

        def __init__(self, row):
            self.id = next(self.id_iter)  # next() works on Python 2.6+ and 3
            self.type = row['type']

    if __name__ == "__main__":
        table = pd.DataFrame({
            'type': ['a', 'b', 'c'],
            'number': [5000, 4000, 30000]
        })

        for index, row in table.iterrows():
            [Toy(row) for _ in range(row['number'])]

Multiprocessing Attempts

I've been able to parallelize this (sort of) by adding the following:

    pool = mp.Pool(processes=mp.cpu_count())
    m = mp.Manager()
    q = m.Queue()

    for index, row in table.iterrows():
        pool.apply_async([Toy(row) for _ in range(row['number'])])

It seems that this would be faster if the values in row['number'] were substantially larger than the number of rows in table. But in my actual case, table is thousands of rows long, and each row['number'] is relatively small.

It seems smarter to try to break table up into cpu_count() chunks and iterate within each chunk. But now we're at the edge of my Python skills.
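
To make the chunking idea concrete, here is a minimal sketch of what I have in mind. It rests on several assumptions: the rows are plain dicts (for example from `table.to_dict('records')`), the helper names `split_into_chunks`, `make_tasks`, `create_toys` and `build_toys_parallel` are made up for illustration, and `Toy` is a simplified stand-in that takes its id as an argument. The id is passed in because each worker process gets its own copy of the class-level counter, so ids drawn from `id_iter` inside the workers would not be unique across processes.

```python
import multiprocessing as mp


class Toy:
    """Simplified stand-in for the Toy class above (assumption: id is passed in)."""

    def __init__(self, toy_id, toy_type):
        self.id = toy_id
        self.type = toy_type


def split_into_chunks(records, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    n_chunks = max(1, min(n_chunks, len(records)))
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional record.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def make_tasks(records, n_chunks):
    """Pair every chunk with the first id its toys should receive."""
    tasks, first_id = [], 1
    for chunk in split_into_chunks(records, n_chunks):
        tasks.append((first_id, chunk))
        first_id += sum(row['number'] for row in chunk)
    return tasks


def create_toys(task):
    """Worker: build and return every Toy for one (first_id, rows) task."""
    next_id, rows = task
    toys = []
    for row in rows:
        for _ in range(row['number']):
            toys.append(Toy(next_id, row['type']))
            next_id += 1
    return toys


def build_toys_parallel(records, n_workers=None):
    """Create all toys with one task per worker process."""
    n_workers = n_workers or mp.cpu_count()
    tasks = make_tasks(records, n_workers)
    with mp.Pool(processes=n_workers) as pool:
        per_chunk = pool.map(create_toys, tasks)
    # Flatten the per-chunk lists into a single list of toys.
    return [toy for toys in per_chunk for toy in toys]
```

The intended call would be something like `build_toys_parallel(table.to_dict('records'))` under the `if __name__ == "__main__":` guard. Note that the instances have to be pickled to get back to the parent process, so I'm not sure this would actually be faster when each Toy is cheap to create.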
I've tried things that the Python interpreter screams at me for, like:

    pool.apply_async(
        for index, row in table.iterrows():
            [Toy(row) for _ in range(row['number'])]
    )

Also things that "can't be pickled":


    Parallel(n_jobs=4)(
        delayed(Toy)([row for _ in range(row['number'])]) \
        for index, row in table.iterrows()
    )

Edit

This may have gotten me a little bit closer, but I'm still not there. I create the class instances in a separate function,

    def create_toys(row):
        [Toy(row) for _ in range(row['number'])]

    ....

    Parallel(n_jobs=4, backend="threading")(
        (create_toys)(row) for i, row in table.iterrows()
    )

but I'm told 'NoneType' object is not iterable.

Toy instances, but it looks like you just throw them away. It's not clear why you're doing any of this, which makes it hard to suggest ways to do it better.

write method that writes the instance to an xml tree. That's an entirely different question...