Series.map Performance Improvement When Using Dictionaries #46348

ryansdowning · 2022-03-12T22:05:25Z

closes PERF: pd.Series.map slow for huge dictionary #46248
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Hello, first time contributing to pandas so if I did something wrong, please let me know!

In the discussion for #46248 I proposed a very simple solution using collections.defaultdict. However, after implementing the change, I found a performance improvement of 10x, but that was still a far stretch from the 1000-10000x improvement I was expecting. After more testing I found that the majority of the time was now being spent by copying the dictionary to the defaultdict. To avoid unnecessarily copying the dictionary's data, I use dict.get with a default of np.nan for dictionaries that do not implement the __missing__ method added ReadOnlyNanDefaultDict which gives us the behavior of defaultdict(lambda: np.nan, dictionary_object) without copying the data. As given by the name, it is also read only since that is all that is needed by IndexOpsMixin._map_values. With this change, I see the huge performance boost of 1000-10000x.

This is a slight deviation from the defaultdict solution proposed in the issue, but it performs much better and behaves in the same way.

Lastly, I tried to add type hints for IndexOpsMixin._map_values but could not find the proper way of typing Series without causing a circular import. Is "Series" the way to go, or is there something better?

…to non-default-dictionaries

…ct which avoids copying the data

pandas/core/base.py

jreback · 2022-03-12T22:41:47Z

pandas/core/base.py

+ # necessary functionality, without copying the dictionary.
+ dict_with_default = ReadOnlyNanDefaultDict(mapper)
+
+ mapper = lambda x: dict_with_default[x]


can we just use the Series path?

Could you clarify what you mean here?

ryansdowning · 2022-03-14T16:47:30Z

I am running into a behavior inconsistency that is leading to some tests failing due to this change.

In [1]: import pandas as pd In [2]: import numpy as np In [3]: from collections import defaultdict In [4]: s = pd.Series([1, 2, np.nan]) In [5]: mapper = {1: "a", 2: "b", np.nan: "c"} In [6]: default_mapper = defaultdict(str, mapper) In [7]: s.map(mapper) Out[7]: 0 a 1 b 2 c dtype: object In [8]: s.map(default_mapper) Out[8]: 0 a 1 b 2 dtype: object

On the most recent main branch, mapping for dictionaries vs dictionaries with missing methods behave differently with np.nan as keys. I don't know enough about the inner workings of the map vectorization, but I suspect it has something to do with lib.map_infer which I cannot locate the source for. Nevertheless, I am unsure how to proceed from here given this inconsistent behavior.

rhshadrach · 2022-03-22T21:07:44Z

@ryansdowning - not sure if this might help, but one has to be very careful with np.nan in dictionaries. E.g.

map = {np.nan: 1} arr = np.asarray([np.nan]) print(map.get(arr[0]))

produces None. I'd guess that the internals are treating dict differently than other Mappings. The code for lib.map_infer is in pandas._libs.lib.pyx starting (roughly) on L2866.

github-actions · 2022-04-22T00:05:31Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2022-05-07T02:59:46Z

Thanks for the effort on this PR so far, but it appears to have gone stale. If interested in continuing, please merge in main, address the failures in the CI, and we can reopen.

ryansdowning added 4 commits March 12, 2022 15:22

perf(core/base): Fix issue pandas-dev#46248 by adding np.nan default …

85187d4

…to non-default-dictionaries

perf(core/base): improve performance by using a custom nan default di…

1beb8cf

…ct which avoids copying the data

fix(core/base): fixed flake8 warnings

85f2ba9

fix(core/base): Add typehint for _map_values na_action argument

8b57fb3

jreback added Performance Memory or execution speed performance Apply Apply, Aggregate, Transform, Map labels Mar 12, 2022

jreback requested changes Mar 12, 2022

View reviewed changes

fix(core/base): reimplement performance fix with dict.get

6995dd7

github-actions bot added the Stale label Apr 22, 2022

mroeschke closed this May 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Series.map Performance Improvement When Using Dictionaries #46348

Series.map Performance Improvement When Using Dictionaries #46348

Uh oh!

ryansdowning commented Mar 12, 2022 •

edited

Loading

Uh oh!

jreback Mar 12, 2022

ryansdowning Mar 15, 2022

ryansdowning commented Mar 14, 2022 •

edited

Loading

rhshadrach commented Mar 22, 2022

github-actions bot commented Apr 22, 2022

mroeschke commented May 7, 2022

Labels

4 participants

Uh oh!

Series.map Performance Improvement When Using Dictionaries #46348

Series.map Performance Improvement When Using Dictionaries #46348

Uh oh!

Conversation

ryansdowning commented Mar 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback Mar 12, 2022

Choose a reason for hiding this comment

ryansdowning Mar 15, 2022

Choose a reason for hiding this comment

ryansdowning commented Mar 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

rhshadrach commented Mar 22, 2022

github-actions bot commented Apr 22, 2022

mroeschke commented May 7, 2022

Labels

4 participants

ryansdowning commented Mar 12, 2022 •

edited

Loading

ryansdowning commented Mar 14, 2022 •

edited

Loading