Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
This issue duplicates #34717, which I do not believe has been resolved. Using the same example as in that issue, I can reproduce the poor performance of pd.Series.map with a large dictionary compared with pd.Series.apply with dict.get.
In [3]: pd.__version__
Out[3]: '1.4.1'

In [4]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:n = 1000000
domain = np.arange(0, n)
ranges = domain+10
maptable = pd.Series(ranges, index=domain).sort_index().to_dict()
query_vals = pd.Series([1,2,3])
::::
:--

In [5]: %timeit query_vals.map(maptable)
343 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit query_vals.apply(lambda x: maptable.get(x))
81.4 µs ± 862 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

At the very least I think it would be useful to include some disclaimer in the documentation for pd.Series.map explaining this drawback more clearly and presenting the alternative solution, but of course a fix for this would be ideal. I would be happy to take this issue if I can be pointed in the right direction.
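For anyone who wants to reproduce this outside an IPython session, here is a minimal standalone sketch of the same comparison. The helper names `map_with_dict` and `map_with_get` are mine, for illustration only, and `n` is reduced from the 1,000,000 used above so the script finishes quickly. Absolute timings will differ by machine and pandas version, so I am not quoting any expected numbers.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a smaller table than the 1,000,000 entries in the report,
# chosen only so that this sketch runs quickly.
n = 100_000
domain = np.arange(0, n)
maptable = pd.Series(domain + 10, index=domain).to_dict()
query_vals = pd.Series([1, 2, 3])


def map_with_dict(values, table):
    """Look up each value by passing the dict directly to Series.map."""
    return values.map(table)


def map_with_get(values, table):
    """Look up each value with dict.get through Series.apply."""
    return values.apply(table.get)


via_map = map_with_dict(query_vals, maptable)
via_get = map_with_get(query_vals, maptable)

# Both approaches agree when every queried key is present in the dict.
assert via_map.tolist() == via_get.tolist()

# Average wall-clock time per call over a few repetitions.
repeats = 5
t_map = timeit.timeit(lambda: map_with_dict(query_vals, maptable), number=repeats) / repeats
t_get = timeit.timeit(lambda: map_with_get(query_vals, maptable), number=repeats) / repeats
print(f"Series.map(dict):       {t_map * 1e3:.3f} ms per call")
print(f"Series.apply(dict.get): {t_get * 1e3:.3f} ms per call")
```

Note that the two approaches are only checked for equivalence on keys that exist in the dictionary. For missing keys, `dict.get` returns `None` while `Series.map` produces `NaN`, so the workaround is not a drop-in replacement in every case.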
Installed Versions
INSTALLED VERSIONS
------------------
commit : 06d230151e6f18fdb8139d09abf539867a8cd481
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-28-generic
Version : #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.1
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.31.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance
No response