Skip to content

Conversation

@ryansdowning
Copy link

@ryansdowning ryansdowning commented Mar 12, 2022

Hello, first time contributing to pandas so if I did something wrong, please let me know!

In the discussion for #46248 I proposed a very simple solution using collections.defaultdict. However, after implementing the change, I found a performance improvement of 10x, but that was still a far stretch from the 1000-10000x improvement I was expecting. After more testing I found that the majority of the time was now being spent by copying the dictionary to the defaultdict. To avoid unnecessarily copying the dictionary's data, I use dict.get with a default of np.nan for dictionaries that do not implement the __missing__ method added ReadOnlyNanDefaultDict which gives us the behavior of defaultdict(lambda: np.nan, dictionary_object) without copying the data. As given by the name, it is also read only since that is all that is needed by IndexOpsMixin._map_values. With this change, I see the huge performance boost of 1000-10000x.

This is a slight deviation from the defaultdict solution proposed in the issue, but it performs much better and behaves in the same way.

Lastly, I tried to add type hints for IndexOpsMixin._map_values but could not find the proper way of typing Series without causing a circular import. Is "Series" the way to go, or is there something better?

@jreback jreback added Performance Memory or execution speed performance Apply Apply, Aggregate, Transform, Map labels Mar 12, 2022
# necessary functionality, without copying the dictionary.
dict_with_default = ReadOnlyNanDefaultDict(mapper)

mapper = lambda x: dict_with_default[x]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just use the Series path?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify what you mean here?

@ryansdowning
Copy link
Author

ryansdowning commented Mar 14, 2022

I am running into a behavior inconsistency that is leading to some tests failing due to this change.

In [1]: import pandas as pd In [2]: import numpy as np In [3]: from collections import defaultdict In [4]: s = pd.Series([1, 2, np.nan]) In [5]: mapper = {1: "a", 2: "b", np.nan: "c"} In [6]: default_mapper = defaultdict(str, mapper) In [7]: s.map(mapper) Out[7]: 0 a 1 b 2 c dtype: object In [8]: s.map(default_mapper) Out[8]: 0 a 1 b 2 dtype: object

On the most recent main branch, mapping for dictionaries vs dictionaries with missing methods behave differently with np.nan as keys. I don't know enough about the inner workings of the map vectorization, but I suspect it has something to do with lib.map_infer which I cannot locate the source for. Nevertheless, I am unsure how to proceed from here given this inconsistent behavior.

@rhshadrach
Copy link
Member

@ryansdowning - not sure if this might help, but one has to be very careful with np.nan in dictionaries. E.g.

map = {np.nan: 1} arr = np.asarray([np.nan]) print(map.get(arr[0])) 

produces None. I'd guess that the internals are treating dict differently than other Mappings. The code for lib.map_infer is in pandas._libs.lib.pyx starting (roughly) on L2866.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 22, 2022
@mroeschke
Copy link
Member

Thanks for the effort on this PR so far, but it appears to have gone stale. If interested in continuing, please merge in main, address the failures in the CI, and we can reopen.

@mroeschke mroeschke closed this May 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Apply Apply, Aggregate, Transform, Map Performance Memory or execution speed performance Stale

4 participants