
For more setup, see this question. I want to create many instances of the class Toy in parallel, and then write them to an XML tree.

    import itertools
    import pandas as pd
    import lxml.etree as et
    import numpy as np
    import sys
    import multiprocessing as mp

    def make_toys(df):
        l = []
        for index, row in df.iterrows():
            toys = [Toy(row) for _ in range(row['number'])]
            l += [x for x in toys if x is not None]
        return l

    class Toy(object):
        def __new__(cls, *args, **kwargs):
            if np.random.uniform() <= 1:
                return super(Toy, cls).__new__(cls, *args, **kwargs)

        def __init__(self, row):
            self.id = None
            self.type = row['type']

        def set_id(self, x):
            self.id = x

        def write(self, tree):
            et.SubElement(tree, "toy",
                          attrib={'id': str(self.id), 'type': self.type})

    if __name__ == "__main__":
        table = pd.DataFrame({
            'type': ['a', 'b', 'c', 'd'],
            'number': [5, 4, 3, 10]})
        n_cores = 2
        split_df = np.array_split(table, n_cores)
        p = mp.Pool(n_cores)
        pool_results = p.map(make_toys, split_df)
        p.close()
        p.join()
        l = [a for L in pool_results for a in L]

        box = et.Element("box")
        box_file = et.ElementTree(box)
        for i, toy in itertools.izip(range(len(l)), l):
            Toy.set_id(toy, i)
        [Toy.write(x, box) for x in l]
        box_file.write(sys.stdout, pretty_print=True)

This code runs beautifully. But I redefined the __new__ method so that there is only a random chance of instantiating the class. So if I set if np.random.uniform() < 0.5, I want to create half as many instances as I asked for, randomly determined. Doing this raises the following error:

    Exception in thread Thread-3:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 763, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 380, in _handle_results
        task = get()
    AttributeError: 'NoneType' object has no attribute '__dict__'

I don't know what this error means or how to avoid it. If I run the process monolithically, as in l = make_toys(table), it works for any random chance.

Another solution

By the way, I know that this can be solved by leaving the __new__ method alone and instead rewriting make_toys() as

    def make_toys(df):
        l = []
        for index, row in df.iterrows():
            prob = np.random.binomial(row['number'], 0.1)
            toys = [Toy(row) for _ in range(prob)]
            l += [x for x in toys if x is not None]
        return l

But I'm trying to learn about the error.

  • I don't know what happened to the previous answer, but note that I am removing the None elements from the list of objects. Commented Jun 11, 2015 at 12:46

1 Answer


I think you've uncovered a surprising "gotcha" caused by Toy instances becoming None as they are passed through the multiprocessing Pool's result Queue.

The multiprocessing.Pool uses multiprocessing queues (pipe-backed, not Queue.Queue) to pass results from the worker processes back to the main process.

Per the docs:

When an object is put on a queue, the object is pickled and a background thread later flushes the pickled data to an underlying pipe.
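One quick way to convince yourself that results really are serialized and rebuilt on the way back is to check object identity. This is a throwaway sketch (echo is just a helper name I made up), showing that what comes out of the pool is an equal but distinct copy:

    import multiprocessing as mp

    def echo(x):
        return x

    if __name__ == '__main__':
        p = mp.Pool(1)
        obj = {'type': 'a'}
        result = p.map(echo, [obj])[0]
        p.close()
        p.join()
        print(result == obj)   # True:  same contents...
        print(result is obj)   # False: ...but a different object, because it
                               # was pickled into the worker and pickled back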

The exact bytes on the wire may differ, but in spirit, pickling an instance of Toy produces a stream of bytes like this:

    In [30]: import pickle

    In [31]: pickle.dumps(Toy(table.iloc[0]))
    Out[31]: "ccopy_reg\n_reconstructor\np0\n(c__main__\nToy\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'type'\np6\nS'a'\np7\nsS'id'\np8\nNsb."

Notice that the module and class of the object are named in the stream of bytes: __main__\nToy.

The class itself is not pickled. There is only a reference to the name of the class.

When the stream of bytes is unpickled on the other side of the pipe, Toy.__new__ is called to instantiate a new instance of Toy. The new object's __dict__ is then reconstituted using unpickled data from the byte stream. When the new object is None, it has no __dict__ attribute, and hence the AttributeError is raised.

Thus, as a Toy instance is passed through the Queue, it might become None on the other side.
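You can reproduce the failure without multiprocessing at all by round-tripping an instance through pickle. This is a minimal sketch, assuming pickle protocol 2 (which is, as far as I can tell, what multiprocessing uses on the wire); with protocol 2 the unpickler rebuilds the object by calling Toy.__new__ directly:

    import pickle
    import numpy as np

    class Toy(object):
        def __new__(cls, *args, **kwargs):
            # Randomly refuse to create an instance -- the "gotcha" under test.
            if np.random.uniform() <= 0.5:
                return super(Toy, cls).__new__(cls)

        def __init__(self, type_):
            self.id = None
            self.type = type_

    toy = None
    while toy is None:            # retry until __new__ actually gives us an instance
        toy = Toy('a')

    data = pickle.dumps(toy, 2)   # protocol 2 records "call Toy.__new__ to rebuild"

    # Unpickling calls Toy.__new__ again. Roughly half the time it returns None,
    # and restoring the instance __dict__ then raises the familiar error.
    for attempt in range(10):
        try:
            pickle.loads(data)
        except AttributeError as e:
            print('attempt %d: %s' % (attempt, e))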

I believe this is the reason why using

    class Toy(object):
        def __new__(cls, *args, **kwargs):
            x = np.random.uniform() <= 0.5
            if x:
                return super(Toy, cls).__new__(cls, *args, **kwargs)
            logger.info('Returning None')

leads to

    AttributeError: 'NoneType' object has no attribute '__dict__'

If you add logging to your script,

    import itertools
    import pandas as pd
    import lxml.etree as et
    import numpy as np
    import sys
    import multiprocessing as mp
    import logging

    logger = mp.log_to_stderr(logging.INFO)

    def make_toys(df):
        result = []
        for index, row in df.iterrows():
            toys = [Toy(row) for _ in range(row['number'])]
            result += [x for x in toys if x is not None]
        return result

    class Toy(object):
        def __new__(cls, *args, **kwargs):
            x = np.random.uniform() <= 0.97
            if x:
                return super(Toy, cls).__new__(cls, *args, **kwargs)
            logger.info('Returning None')

        def __init__(self, row):
            self.id = None
            self.type = row['type']

        def set_id(self, x):
            self.id = x

        def write(self, tree):
            et.SubElement(tree, "toy",
                          attrib={'id': str(self.id), 'type': self.type})

    if __name__ == "__main__":
        table = pd.DataFrame({
            'type': ['a', 'b', 'c', 'd'],
            'number': [5, 4, 3, 10]})
        n_cores = 2
        split_df = np.array_split(table, n_cores)
        p = mp.Pool(n_cores)
        pool_results = p.map(make_toys, split_df)
        p.close()
        p.join()
        l = [a for L in pool_results for a in L]

        box = et.Element("box")
        box_file = et.ElementTree(box)
        for i, toy in itertools.izip(range(len(l)), l):
            toy.set_id(i)
        for x in l:
            x.write(box)
        box_file.write(sys.stdout, pretty_print=True)

you will find that the AttributeError only occurs after a logging message of the form

    [INFO/MainProcess] Returning None

Notice that the logging message comes from the MainProcess, not one of the PoolWorker processes. Since the Returning None message comes from Toy.__new__, this shows that Toy.__new__ was called by the main process. This corroborates the claim that unpickling is calling Toy.__new__ and transforming instances of Toy into None.


The moral of the story is that for Toy instances to be passed through a multiprocessing Pool's Queue, Toy.__new__ must always return an instance of Toy. And as you noted, the code can be fixed by instantiating only the desired number of Toys in make_toys:

    def make_toys(df):
        result = []
        for index, row in df.iterrows():
            # sample how many toys to actually create
            prob = np.random.binomial(row['number'], 0.1)
            result.extend([Toy(row) for _ in range(prob)])
        return result
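If you'd rather keep the random-creation logic attached to the class itself, one option might be a small factory classmethod instead of overriding __new__. This is just a sketch; the name maybe and the parameter p are mine, not part of the original code:

    import numpy as np

    class Toy(object):
        def __init__(self, row):
            self.id = None
            self.type = row['type']
        # set_id and write omitted; they are unchanged

        @classmethod
        def maybe(cls, row, p=0.5):
            # All the randomness lives here, so unpickling (which goes
            # through __new__) always gets back a real Toy instance.
            if np.random.uniform() <= p:
                return cls(row)
            return None

    def make_toys(df):
        result = []
        for index, row in df.iterrows():
            toys = [Toy.maybe(row) for _ in range(row['number'])]
            result += [x for x in toys if x is not None]
        return result

The downside, as you note in the comments, is that this still runs __init__ only for the instances that are actually kept, which is exactly what you want if __init__ is expensive.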

By the way, it is non-standard to call instance methods with Toy.write(x, box) when x is an instance of Toy. The preferred way is to use

    x.write(box)

Similarly, use toy.set_id(i) instead of Toy.set_id(toy, i).


4 Comments

  • Thanks for the style tips; I'm pretty new to Python. But I thought that the line [x for x in toys if x is not None] removed those non-objects.
  • Also, moving the chance to the for loop would be suboptimal, especially because __init__ in my actual code does quite a bit. I'd rather not go through all of that if I'm just going to throw the object away.
  • Oh dear, I missed that. The Nones must be coming from somewhere else.
  • I think you've uncovered a surprising "gotcha" which is caused by some Toy instances becoming None as they are passed through the multiprocessing Pool's result Queue.
