@BaiBaiHi BaiBaiHi commented Apr 14, 2020

Add flag to clear index cache after reindex.

By default, reindex causes the index to cache values, which can significantly increase memory consumption in the case of MultiIndexes.

In [2]: idx = pd.MultiIndex.from_product([pd.date_range('2010-01-01', '2015-01-01'), range(1000)], names=['date', 'id'])
In [3]: idx2 = pd.MultiIndex.from_product([pd.date_range('2010-01-01', '2015-01-01'), range(500)], names=['date', 'id'])
In [4]: df = pd.DataFrame({'a': 1}, index=idx)
In [5]: df.memory_usage(deep=True, index=True)  # Original memory usage
Out[5]:
Index     7453600
a        14616000
dtype: int64
In [6]: df.reindex(idx2)  # df is still the same as the original.
In [7]: df.memory_usage(deep=True, index=True)  # Memory usage after reindex
Out[7]:
Index    91339680
a        14616000
dtype: int64

With clear_cache=True

In [20]: idx = pd.MultiIndex.from_product([pd.date_range('2010-01-01', '2015-01-01'), range(1000)], names=['date', 'id'])
In [21]: idx2 = pd.MultiIndex.from_product([pd.date_range('2010-01-01', '2015-01-01'), range(500)], names=['date', 'id'])
In [22]: df = pd.DataFrame({'a': 1}, index=idx)
In [23]: df.memory_usage(deep=True, index=True)  # Original memory usage
Out[23]:
Index     7453600
a        14616000
dtype: int64
In [24]: df.reindex(idx2)  # df is still the same as the original.
In [25]: df.memory_usage(deep=True, index=True)  # Memory usage after reindex
Out[25]:
Index     7453600
a        14616000
dtype: int64

pep8speaks commented Apr 14, 2020

Hello @BaiBaiHi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-14 04:22:14 UTC
@BaiBaiHi BaiBaiHi force-pushed the flag-to-clear-cache-on-reindex branch from da02dfd to 9752373 on April 14, 2020 00:30
Contributor

@jreback jreback left a comment

What's the point of a user option for this? If you can potentially clear the cache (assuming that an operation actually created it and it wasn't there before), then OK.
This would need to pass tests and benchmarks.

@BaiBaiHi BaiBaiHi force-pushed the flag-to-clear-cache-on-reindex branch from 9752373 to f25ffbc on April 14, 2020 02:33
@BaiBaiHi BaiBaiHi force-pushed the flag-to-clear-cache-on-reindex branch from f25ffbc to cee9565 on April 14, 2020 04:22
@BaiBaiHi
Author

@jreback I was trying to be cautious and preserve the default behavior, but after digging into it a bit more, it doesn't seem like keeping the cache there actually provides any benefits.
I've updated the PR so it automatically clears the mapping.

@BaiBaiHi BaiBaiHi changed the title from "ENH: Flag to clear index cache after reindex" to "ENH: Clear index cache after reindex" Apr 14, 2020
        else:
            indexer = self._engine.get_indexer(target)

        self._engine.clear_mapping()
Contributor

Why is this being done in get_indexer? This seems like the wrong place.

Author

The call to self._engine.get_indexer in the preceding lines is what causes the memory to blow up, so operations such as joins on MultiIndexes hit the same issue, since they also call get_indexer.

In [1]: idx = pd.MultiIndex.from_product([pd.date_range('2010-01-01', '2015-01-01'), range(1000)], names=['date', 'id'])
   ...: idx2 = pd.MultiIndex.from_product([pd.date_range('2010-01-01', '2015-01-01'), range(500)], names=['date', 'id'])
   ...: df = pd.DataFrame({'a': 1}, index=idx)
In [2]: df.memory_usage(deep=True, index=True)
Out[2]:
Index     7453600
a        14616000
dtype: int64
In [3]: df.join(pd.DataFrame(index=idx2))
In [4]: df.memory_usage(deep=True, index=True)
Out[4]:
Index    91339680
a        14616000
dtype: int64

Since this is a problem that seems specific to multiindexes and this is the only place that calls the get_indexer method of the MultiIndex engine, I thought this was the best place to address the issue. Lemme know if you think there's a better location though!
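
For readers unfamiliar with the mechanism, here is a rough toy model of a lazily built lookup table that grows on first use and can be dropped afterwards. It is only a sketch: ToyEngine and mapping_bytes are hypothetical names, and the real pandas engine is considerably more involved.

```python
# Toy model only: ToyEngine is a hypothetical stand-in, not pandas' real engine.
import sys


class ToyEngine:
    """Looks up positions of labels, building a hash map on first use."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.mapping = None  # built lazily, like a cached lookup table

    def get_indexer(self, target):
        # The first call pays for (and keeps) a full label -> position map.
        if self.mapping is None:
            self.mapping = {label: pos for pos, label in enumerate(self.labels)}
        # -1 marks labels that are not present.
        return [self.mapping.get(label, -1) for label in target]

    def clear_mapping(self):
        # Drop the map so its memory can be reclaimed.
        self.mapping = None

    def mapping_bytes(self):
        # Shallow size of the dict only; enough to show the trend.
        return 0 if self.mapping is None else sys.getsizeof(self.mapping)


# 100 "dates" x 50 "ids", loosely mirroring the two-level index above.
engine = ToyEngine((day, i) for day in range(100) for i in range(50))
before = engine.mapping_bytes()
indexer = engine.get_indexer([(0, 0), (99, 49), (100, 0)])
after_lookup = engine.mapping_bytes()
engine.clear_mapping()
after_clear = engine.mapping_bytes()
```

In this toy, a single lookup leaves the whole map resident until clear_mapping is called, which is the pattern the memory figures above suggest.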

Contributor

It's not clear to me that we want to do this though. Allocating the engine can be expensive so this will hurt the performance of things that would reuse the cached engine.
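
The trade-off raised here can be sketched with a toy counter. This is an illustration under stated assumptions, not a measurement of pandas: CountingEngine, clear_after_lookup, and count_builds are hypothetical names, and the real rebuild cost depends on the index type and size.

```python
# Toy model only: CountingEngine and its attributes are hypothetical names.
class CountingEngine:
    def __init__(self, labels, clear_after_lookup):
        self.labels = list(labels)
        self.clear_after_lookup = clear_after_lookup
        self.mapping = None
        self.builds = 0  # how many times the lookup table was rebuilt

    def get_indexer(self, target):
        if self.mapping is None:
            self.mapping = {label: pos for pos, label in enumerate(self.labels)}
            self.builds += 1
        result = [self.mapping.get(label, -1) for label in target]
        if self.clear_after_lookup:
            # Frees memory, but the next lookup must rebuild the table.
            self.mapping = None
        return result


def count_builds(clear_after_lookup, n_lookups=5):
    engine = CountingEngine(range(1000), clear_after_lookup)
    for _ in range(n_lookups):
        engine.get_indexer([1, 2, 3])
    return engine.builds


builds_cached = count_builds(clear_after_lookup=False)   # table built once
builds_cleared = count_builds(clear_after_lookup=True)   # table built on every lookup
```

Clearing after every lookup trades one build for one build per call, so code that repeatedly reuses the same index would pay the construction cost each time.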

@BaiBaiHi BaiBaiHi requested a review from jreback April 14, 2020 14:38
Contributor

@jreback jreback left a comment

Yeah, this would need performance checks (run all the indexing asv benchmarks).

the cache is there for a reason; changing this will need analysis

@simonjayhawkins
Member

@BaiBaiHi closing this PR as stale. maybe raise an issue for discussion.

@simonjayhawkins simonjayhawkins added the "Performance" (memory or execution speed performance) label May 22, 2020