I have a class which primarily contains three dicts:
```python
from collections import defaultdict

class KB(object):
    def __init__(self):
        # key: str, value: list of str
        self.linear_patterns = defaultdict(list)
        # key: str, value: list of str
        self.nonlinear_patterns = defaultdict(list)
        # key: str, value: dict
        self.pattern_meta_info = {}
        ...
        self.__initialize()

    def __initialize(self):
        # the 3 dicts are populated here
        ...
```

The sizes of the three dicts are:
- `linear_patterns`: 100,000 entries
- `nonlinear_patterns`: 900,000 entries
- `pattern_meta_info`: 700,000 entries

After the program runs to completion, it takes about 15 seconds to release the memory. When I reduce the sizes of the dicts by loading less data during initialization, the memory is released faster, so I conclude that these dict sizes are what makes the memory release slow. The whole program uses about 8 GB of memory. Also, once the dicts are built, every operation is a lookup; nothing is modified.
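A rough way to sanity-check that the dicts dominate the footprint (the `rough_dict_size` helper below is just a throwaway illustration, and `sys.getsizeof` does not follow references, so the result is only a lower bound):

```python
import sys

def rough_dict_size(d):
    # Lower bound: sum the container itself, the keys, the values,
    # and one level of list items; deeper nesting is undercounted.
    total = sys.getsizeof(d)
    for key, value in d.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
        if isinstance(value, list):
            total += sum(sys.getsizeof(item) for item in value)
    return total

kb = KB()  # the class from the question
for name in ('linear_patterns', 'nonlinear_patterns', 'pattern_meta_info'):
    d = getattr(kb, name)
    print(f'{name}: ~{rough_dict_size(d) / 2**30:.2f} GiB, {len(d):,} keys')
```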
Is there a way to use Cython to optimize the three data structures above, especially in terms of memory usage? Is there a similar Cython dictionary that can replace the Python dicts?
Python's `dict` is already implemented in C, by the way. You could potentially write your own hash-map implementation depending on the nature of your data: Python `str` and `int` objects are highly space-inefficient compared to similar C primitives (compare `sys.getsizeof('')` and `sys.getsizeof(1)`, which are 49 and 28 bytes on a typical 64-bit CPython 3). One option is to use a C++ `map` or `unordered_map` instead of the Python dict, which Cython can wrap directly. The one thing that might catch you out: Python strings can be shared (so a string that appears multiple times need not take up more memory), while C++ strings won't be.
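A minimal sketch of that approach, assuming keys and values are UTF-8-encodable `str` (the `PatternMap` name and its methods are invented for illustration; it uses Cython's `libcpp` wrappers for `std::unordered_map`, `std::vector`, and `std::string`):

```cython
# distutils: language = c++
# patternmap.pyx -- sketch of a C++-backed replacement for
# defaultdict(list) with str keys and list-of-str values.
from cython.operator cimport dereference as deref
from libcpp.string cimport string
from libcpp.unordered_map cimport unordered_map
from libcpp.vector cimport vector

cdef class PatternMap:
    # All keys and values live in C++ containers, so there is no
    # per-entry Python object overhead (and no shared strings either).
    cdef unordered_map[string, vector[string]] _map

    def add(self, str key, str value):
        # operator[] default-constructs an empty vector on first use,
        # mimicking defaultdict(list).
        self._map[key.encode('utf-8')].push_back(value.encode('utf-8'))

    def get(self, str key):
        # Copy the values back out as a Python list of str.
        cdef unordered_map[string, vector[string]].iterator it
        it = self._map.find(key.encode('utf-8'))
        if it == self._map.end():
            return []
        return [v.decode('utf-8') for v in deref(it).second]

    def __contains__(self, str key):
        return self._map.count(key.encode('utf-8')) > 0

    def __len__(self):
        return self._map.size()
```

Note that the caveat above applies here: every stored value owns its own bytes, so if the same strings appear under many keys, this can end up using more memory than the shared Python originals, and it's worth measuring on a slice of the real data first. Also, `pattern_meta_info` maps str to dict, which would need its own C++ value type (e.g. a struct or a nested map) rather than `vector[string]`.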