
I have a class that primarily contains three dicts:

    from collections import defaultdict

    class KB(object):
        def __init__(self):
            # key: str, value: list of str
            self.linear_patterns = defaultdict(list)
            # key: str, value: list of str
            self.nonlinear_patterns = defaultdict(list)
            # key: str, value: dict
            self.pattern_meta_info = {}
            ...
            self.__initialize()

        def __initialize(self):
            # the 3 dicts are populated
            ...

The sizes of the 3 dicts are below:

    linear_patterns:    100,000
    nonlinear_patterns: 900,000
    pattern_meta_info:  700,000

After the program runs to completion, it takes about 15 seconds to release the memory. When I reduce the dict sizes by loading less data during initialization, the memory is released faster, so I conclude that the size of these dicts is what makes the release slow. The whole program uses about 8 GB of memory. Also, after the dicts are built, all operations are lookups; there are no modifications.

Is there a way to use Cython to optimize the three data structures above, especially in terms of memory usage? Is there a similar Cython dictionary that can replace the Python dicts?

  • If your dict is going to contain Python objects, then you probably aren't going to beat dict, which is already implemented in C, btw. You could potentially write your own hash-map implementation depending on the nature of your data. E.g., Python str and int objects are highly space-inefficient compared to the corresponding C primitives; compare sys.getsizeof('') and sys.getsizeof(1). Commented Feb 9, 2022 at 20:46
  • @juanpa.arrivillaga "own hash-map implementation": in Cython or in Python? I don't have much Cython experience, but I can learn it if it helps here. Commented Feb 9, 2022 at 21:00
  • If you don't need all the data all the time, you can use generator functions stackoverflow.com/questions/231767/… Commented Feb 9, 2022 at 21:20
  • You might be able to use C++ map or unordered_map instead of the Python dict (see the sketch after this comment list). The one thing that might catch you out: Python strings can be shared (so a string that appears multiple times need not take up more memory), while C++ strings won't be. Commented Feb 9, 2022 at 21:24
  • @DavidW How to use C++ map in python code? Commented Feb 9, 2022 at 21:58
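For illustration, a minimal Cython sketch of that suggestion (a hypothetical kb_map.pyx, not from the thread). It assumes keys and values can be encoded to bytes; the real KB maps keys to lists of strings, which would need unordered_map[string, vector[string]] instead:

    # kb_map.pyx -- build with: cythonize -i kb_map.pyx
    # distutils: language = c++
    from libcpp.string cimport string
    from libcpp.unordered_map cimport unordered_map

    cdef class CppDict:
        """dict-like wrapper storing bytes keys/values in a C++ hash map."""
        cdef unordered_map[string, string] _m

        def __setitem__(self, bytes key, bytes value):
            self._m[key] = value      # bytes are copied into C++ memory

        def __getitem__(self, bytes key):
            if self._m.count(key) == 0:
                raise KeyError(key)
            return self._m[key]       # copied back out as a new bytes object

        def __len__(self):
            return self._m.size()

As the comment warns, every insertion copies the string into C++ memory, so strings that Python would share are duplicated here.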

1 Answer


It seems unlikely that a different dictionary or object type would change much: destructor performance is dominated by the memory allocator, and that cost stays roughly the same unless you switch to a different malloc implementation.

If this is only about object destruction at the end of your program, most languages (but not Python) would let you simply call exit() while keeping the KB object alive. The OS releases the memory much more quickly when the process terminates, so why bother tearing everything down first? Unfortunately that doesn't work with Python's sys.exit(), since it merely raises a SystemExit exception and the interpreter still destroys every object on the way out.
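As the comments below point out, the one escape hatch is os._exit(), which terminates the process without running any Python-level cleanup. A minimal sketch (run_queries is a hypothetical stand-in for the lookup-only workload):

    import os
    import sys

    kb = KB()            # builds the three large dicts (~8 GB)
    run_queries(kb)      # hypothetical: lookups only

    sys.stdout.flush()   # os._exit() skips buffer flushing, so do it here
    os._exit(0)          # OS reclaims the memory at once; no 15 s teardown

Note that this also skips atexit handlers and finally/with cleanup, so flush anything important to disk first.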

Everything else relies on changing the data structure or algorithm. Are your strings highly redundant? Then you can reuse string objects by interning them: the interpreter keeps interned strings in a shared table so that the same object is used in every place it occurs. A simple string = sys.intern(string) is enough. Unlike in earlier versions of Python, this will not keep the string object alive beyond its last use, so you don't run the risk of leaking memory in a long-running process.
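A minimal sketch of interning at insertion time (add_pattern is a hypothetical helper, not part of the original class):

    import sys
    from collections import defaultdict

    linear_patterns = defaultdict(list)

    def add_pattern(key: str, value: str) -> None:
        # Intern both sides: a string that occurs as a key here and as a
        # value elsewhere is then stored in memory exactly once.
        linear_patterns[sys.intern(key)].append(sys.intern(value))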

You could also pool the strings in one large allocation. If access is relatively rare, you could change the class to use one large io.StringIO object for its contained strings and all dictionaries just deal with (offset, length) tuples into that buffer.
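A rough sketch of such a pool (names are made up). io.StringIO positions are character offsets, so an (offset, length) tuple is enough to recover a string:

    import io

    class StringPool:
        """Append-only string pool; entries are referenced by (offset, length)."""
        def __init__(self):
            self._buf = io.StringIO()

        def add(self, s):
            self._buf.seek(0, io.SEEK_END)   # reads in get() move the cursor
            offset = self._buf.tell()
            self._buf.write(s)
            return (offset, len(s))

        def get(self, ref):
            offset, length = ref
            self._buf.seek(offset)
            return self._buf.read(length)

The dicts then map keys to these small tuples instead of full str objects, and the whole pool disappears in a single deallocation.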

That still leaves many tuple and integer objects, but those use specialized allocators that may be faster to tear down. Also, lengths up to 256 come from CPython's shared cache of small integers and don't allocate a new object at all.

A final thought: that's 8 GB of mostly string data. Are you sure you don't want a small sqlite or dbm database? It could even live in a temporary file.
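A minimal sqlite3 sketch of that idea (file name and schema are invented; the real class would need one table per dict):

    import sqlite3

    con = sqlite3.connect("kb.sqlite")   # or a path from tempfile
    con.execute("CREATE TABLE IF NOT EXISTS linear_patterns (key TEXT, value TEXT)")
    con.execute("CREATE INDEX IF NOT EXISTS lp_key ON linear_patterns (key)")

    def lookup(key):
        # the lookup-only workload maps naturally onto an indexed SELECT
        rows = con.execute(
            "SELECT value FROM linear_patterns WHERE key = ?", (key,))
        return [v for (v,) in rows]

The strings then live in the OS page cache rather than on the Python heap, and "releasing the memory" is just closing a file.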


Comments

8 GB is all the memory used, not just the 3 dicts, but they should account for a large part of it. Do you have a link that explains how to make string objects shared? For example, the str 'test' appears in two dicts, where it could be a key and/or a value. How do I make it shared so it uses less memory? I think making strings shared is the first thing I should try.
@DavidW thanks for the pointer. So I only need to wrap each string with sys.intern(my_str) when adding it to the dict? That seems very simple. I thought I would first have to create some kind of hash from each string and then do more complex work.
@Homer512: You're incorrect to claim sys.exit will do anything to improve cleanup performance. sys.exit is implemented using the exception mechanism (it's a thin wrapper that invokes raise SystemExit), and the process still does all the normal cleanup (including cleaning up outstanding memory allocations). The only thing that would avoid that cost is os._exit (which directly terminates the process without running any outstanding except/finally/with block cleanup), and it's strongly discouraged in most cases (because it bypasses the normal cleanup guarantees).
@Homer512: Excellent, up-voted. I suppose if you really wanted to, you could combine sys.exit and os._exit to get the effect you're going for. Just wrap the top-level "main" code with try:/except SystemExit as e: then test e.code to figure out if it's int (use as os._exit argument), str (send to stderr) or None (exit normally with code 0). It would still bypass atexit handlers, but you'd unwind the stack normally (invoking except/finally/with cleanup), then os._exit would skip the final attempts to collect memory. Probably not worth the bother/risk even so. :-)
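A sketch of that combination (main is a hypothetical entry point; the code handling follows the SystemExit convention described in the comment above):

    import os
    import sys

    def main():
        kb = KB()    # hypothetical: builds the big dicts
        ...          # lookup-only workload
        sys.exit(0)

    if __name__ == "__main__":
        try:
            main()
            code = 0
        except SystemExit as e:       # stack unwinds normally first
            code = e.code
        if code is None:
            code = 0
        elif isinstance(code, str):   # str exit codes go to stderr
            print(code, file=sys.stderr)
            code = 1
        sys.stdout.flush()
        os._exit(code)                # skip the slow final deallocation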
