PERF: pd.Series.map slow for huge dictionary #46248

@ryansdowning

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

This issue re-reports #34717, which I do not believe has been resolved. Using the same example as that issue, I can still reproduce the poor performance of pd.Series.map with a large dictionary compared with pd.Series.apply and dict.get: 343 ms versus 81.4 µs for a three-element Series and a dictionary of 1,000,000 entries.

In [3]: pd.__version__
Out[3]: '1.4.1'

In [4]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:n = 1000000
:domain = np.arange(0, n)
:ranges = domain+10
:maptable = pd.Series(ranges, index=domain).sort_index().to_dict()
:query_vals = pd.Series([1,2,3])
:--

In [5]: %timeit query_vals.map(maptable)
343 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit query_vals.apply(lambda x: maptable.get(x))
81.4 µs ± 862 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
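For anyone who wants to run the comparison outside IPython, here is a rough standalone sketch of the same benchmark using the standard-library timeit module. The function names build_inputs and time_lookups are made up for this example, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def build_inputs(n=1_000_000):
    """Build the large lookup dict and the small query Series from the report."""
    domain = np.arange(0, n)
    ranges = domain + 10
    maptable = pd.Series(ranges, index=domain).sort_index().to_dict()
    query_vals = pd.Series([1, 2, 3])
    return maptable, query_vals


def time_lookups(maptable, query_vals, repeat=5):
    """Return the best single-call wall-clock time (seconds) per strategy."""
    t_map = min(
        timeit.repeat(lambda: query_vals.map(maptable), number=1, repeat=repeat)
    )
    t_apply = min(
        timeit.repeat(
            lambda: query_vals.apply(lambda x: maptable.get(x)),
            number=1,
            repeat=repeat,
        )
    )
    return {"map": t_map, "apply_get": t_apply}


if __name__ == "__main__":
    maptable, query_vals = build_inputs()
    for name, seconds in time_lookups(maptable, query_vals).items():
        print(f"{name}: {seconds * 1e3:.3f} ms")
```

Passing a smaller n to build_inputs should show whether the map timing grows with the size of the dictionary rather than with the size of the Series being mapped.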

At the very least I think it would be useful to include some disclaimer in the documentation for pd.Series.map explaining this drawback more clearly and presenting the alternative solution, but of course a fix for this would be ideal. I would be happy to take this issue if I can be pointed in the right direction.
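As a stopgap, the dict.get alternative could be wrapped in a small helper like the sketch below. The name lookup_with_dict is hypothetical and not part of pandas. My understanding, which is an assumption and not something confirmed in this report, is that map does work proportional to the size of the dictionary before looking anything up, whereas this helper only touches the keys that are actually queried. Missing keys are filled with NaN to roughly mirror map, but dtype inference and the handling of NaN keys may not match map exactly.

```python
import numpy as np
import pandas as pd


def lookup_with_dict(series, mapping, default=np.nan):
    """Look up each element of `series` in `mapping` with dict.get.

    Hypothetical helper: cost scales with len(series), not len(mapping).
    Keys absent from `mapping` are replaced by `default`.
    """
    values = [mapping.get(key, default) for key in series.tolist()]
    return pd.Series(values, index=series.index, name=series.name)


if __name__ == "__main__":
    mapping = {i: i + 10 for i in range(1_000_000)}
    query_vals = pd.Series([1, 2, 3])
    print(lookup_with_dict(query_vals, mapping))
```

For a Series with many elements this per-element Python loop is likely to lose its advantage, so it is only meant for the case described here: a small Series and a very large dictionary.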

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 06d230151e6f18fdb8139d09abf539867a8cd481
python           : 3.9.7.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.13.0-28-generic
Version          : #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.4.1
numpy            : 1.21.2
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.31.1
pandas_datareader: None
bs4              : None
bottleneck       : 1.3.2
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : 2.8.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

Prior Performance

No response

Labels

Performance (memory or execution speed performance)
