The question is about how memory is allocated in Python for this class; it is asked more out of curiosity than for practical value.

The following post-processing class handles huge amounts of data (numpy arrays). This is a simplified version of the real code, to keep the problem readable. The shared data from the base class is replaced here with the characteristic_length variable.

    class Reader(object):
        characteristic_length = 1

        def factory(type_name, file_name):
            if type_name == "OFReader":
                return OFReader(file_name)
            ...
            assert 0, "Bad reader instantiation: " + type_name
        ...

    class OFReader(Reader):
        def __init__(self, of_file_name):
            self.file_name = of_file_name
            self.read_data()
        ...

        def length(self):
            return self.data[:, 0] / self.characteristic_length
        ...

Being a stranger to the Python way of handling memory, I am curious whether placing characteristic_length in the subclass would noticeably improve performance.

Based on the Python documentation, I imagine that a derived class instance holds a reference to its base class: if the attribute lookup on the subclass fails, it continues in the base class. This makes using base class attributes to store data shared by the derived classes sound awfully expensive.
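The lookup chain described above can be observed directly. A minimal sketch (class names reused from the question, but with the bodies stripped down) showing that an attribute defined only on the base class is found through the subclass's method resolution order, and that assigning on an instance merely shadows it:

```python
class Reader(object):
    characteristic_length = 1

class OFReader(Reader):
    pass

# The lookup order Python walks when an attribute is not on the instance:
print(OFReader.__mro__)  # (OFReader, Reader, object)

r = OFReader()
print(r.characteristic_length)  # 1 -- found on Reader, not OFReader

# Assignment creates an instance attribute; the class attribute is untouched.
r.characteristic_length = 2
print(Reader.characteristic_length)  # still 1
```

Each extra step in the MRO is one more dictionary probe per failed lookup, which is cheap compared to the numpy work done on the data itself.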

In the real application the shared data is a somewhat large numpy array, common to all derived classes. Being lazy, I want to avoid reading the data separately for each subclass. Any pointers on how to solve this efficiently?

1 Answer

The simplest solution is to make the numpy array a global. If you want to prevent users from modifying the array, set its writeable flag to False.
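A minimal sketch of that idea (the module-level name SHARED and the sample data are placeholders for the real array loaded from disk):

```python
import numpy as np

# Loaded once at module import time; placeholder for np.load("data.npy").
SHARED = np.arange(10, dtype=float)

# Lock the array: any in-place write now raises ValueError.
SHARED.flags.writeable = False

try:
    SHARED[0] = 99.0
except ValueError as exc:
    print("write rejected:", exc)
```

Readers can still slice and compute with the array as usual; only mutation is blocked.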

Alternatively you could make it a class attribute of the Reader class.
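A sketch of the class-attribute variant, matching the Reader hierarchy from the question (the zeros array stands in for the real data read from file): the array is created once when the class body executes, and every subclass instance reaches the same object through ordinary attribute lookup.

```python
import numpy as np

class Reader(object):
    # Evaluated once, at class definition time; placeholder for the
    # real shared array, e.g. np.load("data.npy").
    shared_data = np.zeros((3, 3))

class OFReader(Reader):
    def scaled(self):
        # Lookup falls through OFReader to Reader.shared_data.
        return self.shared_data * 2

a = OFReader()
b = OFReader()
print(a.shared_data is b.shared_data)  # True -- one array, shared
```

No copy is made per instance or per subclass; only the lookup walks one extra class in the MRO.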


4 Comments

Sorry, I wasn't very clear in the formulation: the characteristic length is a placeholder for the shared data. A global seems a bit less evil in Python than in other languages. How do a global and a class attribute compare in terms of memory locality? I guess the global is one scope higher up?
Honestly, I've never worried about memory locality in Python, nor about memory usage as long as my OS's VM subsystem could handle it. That's one of the basic strengths of Python: not having to worry about memory management. But if you really care, measure both options by profiling them; that will give you a clear answer for your situation. (Although the answer might be that it doesn't matter.)
Thanks. Will measure and update. But, like you said, in reality, it does not matter. IO takes so much longer. Also, pypy will most likely be the most important optimization step. Although, the optimization of memory might become important there.
If I/O is the bottleneck (which it often is, given how much slower disk access is than RAM), look at mmap; that might help. And if your program is parallelizable, combine mmap with multiprocessing, where every instance reads the data by itself. Once the first mmap is done, the pages will be in the OS cache, making access faster for the other instances. And if your resources allow for it, you can always throw a big RAM disk or an SSD at the problem. :-)
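The memory-mapping idea from the last comment can be sketched with numpy's own mmap support; np.load(..., mmap_mode="r") maps an .npy file read-only so pages are pulled in on demand and shared across processes through the page cache (the temp-file path here is just for illustration):

```python
import os
import tempfile
import numpy as np

# Write a sample array to disk so there is something to map.
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, np.arange(1000, dtype=float).reshape(100, 10))

# Memory-map the file read-only; no full copy is loaded into RAM up front.
data = np.load(path, mmap_mode="r")
print(data.shape, data[0, :3])
```

Each worker process can open its own read-only map of the same file; after the first read, the shared pages make subsequent access fast.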
