Series.map Performance Improvement When Using Dictionaries #46348
Hello, first time contributing to pandas, so if I did something wrong, please let me know!
In the discussion for #46248 I proposed a very simple solution using `collections.defaultdict`. However, after implementing the change, I found a performance improvement of 10x, which was still far from the 1000-10000x improvement I was expecting. After more testing, I found that the majority of the time was now being spent copying the dictionary into the `defaultdict`. To avoid unnecessarily copying the dictionary's data, I ~~added `ReadOnlyNanDefaultDict`, which gives us the behavior of `defaultdict(lambda: np.nan, dictionary_object)` without copying the data. As given by the name, it is also read-only, since that is all that is needed by `IndexOpsMixin._map_values`.~~ use `dict.get` with a default of `np.nan` for dictionaries that do not implement the `__missing__` method. With this change, I see the huge performance boost of 1000-10000x.

This is a slight deviation from the `defaultdict` solution proposed in the issue, but it performs much better and behaves in the same way.

Lastly, I tried to add type hints for `IndexOpsMixin._map_values` but could not find the proper way of typing `Series` without causing a circular import. Is `"Series"` the way to go, or is there something better?
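
For reference, the difference between the `defaultdict` copy and the `dict.get` lookup described above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the function names are hypothetical, plain lists stand in for the Series values, and `math.nan` stands in for `np.nan` (both are float NaN) so the sketch needs only the standard library.

```python
import math
from collections import defaultdict


def lookup_with_defaultdict(mapper, keys):
    # Building the defaultdict copies every item of `mapper` first,
    # which dominates the runtime when the dictionary is large.
    nan_default = defaultdict(lambda: math.nan, mapper)
    return [nan_default[key] for key in keys]


def lookup_with_get(mapper, keys):
    # Dict subclasses that define __missing__ (e.g. defaultdict, Counter)
    # keep their own handling of missing keys.
    if hasattr(mapper, "__missing__"):
        return [mapper[key] for key in keys]
    # Plain dicts: no copy is made; missing keys fall back to NaN.
    return [mapper.get(key, math.nan) for key in keys]


mapper = {"a": 1, "b": 2}
keys = ["a", "c", "b"]
via_copy = lookup_with_defaultdict(mapper, keys)
via_get = lookup_with_get(mapper, keys)
```

Both functions return the mapped value for keys that are present and NaN for keys that are missing, and neither modifies `mapper`; only the second avoids the up-front copy.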