Open
Labels
- Arrow: pyarrow functionality
- Performance: Memory or execution speed performance
- Strings: String extension data type and string data
- hashing: hash_pandas_object
Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
While investigating Dask's shuffle performance on string[pyarrow] data, I observed that pd.util.hash_pandas_object was slightly slower on string[pyarrow] data than on object-dtype data holding regular Python objects (about 1.01 ms versus 859 µs in the example below). This surprised me, as I would have expected hashing pyarrow-backed data to be faster than hashing Python objects.
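The same comparison can also be run as a standalone script outside IPython. This is a minimal sketch, not part of the original report: the helper names `time_hash` and `build_inputs` and the repeat counts are illustrative assumptions, and the string[pyarrow] case is skipped if pyarrow is not installed. Absolute timings will vary by machine and library versions.

```python
import timeit

import pandas as pd


def time_hash(series, number=200, repeat=5):
    """Return the best observed time per call of hash_pandas_object, in seconds."""
    timer = timeit.Timer(lambda: pd.util.hash_pandas_object(series))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def build_inputs(n=2_000):
    """Build the object-dtype series and, if pyarrow is available, the string[pyarrow] one."""
    s = pd.Series(range(n))
    inputs = {"object": s.astype(object)}
    try:
        inputs["string[pyarrow]"] = s.astype("string[pyarrow]")
    except ImportError:
        # pyarrow is an optional dependency; skip this case if it is missing.
        pass
    return inputs


if __name__ == "__main__":
    for name, series in build_inputs().items():
        print(f"{name:>16}: {time_hash(series) * 1e6:.1f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes, which matters here because the reported difference between the two dtypes is small.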
    In [1]: import pandas as pd

    In [2]: s = pd.Series(range(2_000))

    In [3]: s_object = s.astype(object)

    In [4]: s_pyarrow = s.astype("string[pyarrow]")

    In [5]: %timeit pd.util.hash_pandas_object(s_object)
    859 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    In [6]: %timeit pd.util.hash_pandas_object(s_pyarrow)
    1.01 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Installed Versions
    INSTALLED VERSIONS
    ------------------
    commit           : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
    python           : 3.10.6.final.0
    python-bits      : 64
    OS               : Darwin
    OS-release       : 21.6.0
    Version          : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
    machine          : x86_64
    processor        : i386
    byteorder        : little
    LC_ALL           : None
    LANG             : en_US.UTF-8
    LOCALE           : en_US.UTF-8

    pandas           : 1.5.0
    numpy            : 1.23.3
    pytz             : 2022.4
    dateutil         : 2.8.2
    setuptools       : 65.4.1
    pip              : 22.2.2
    Cython           : None
    pytest           : None
    hypothesis       : None
    sphinx           : None
    blosc            : None
    feather          : None
    xlsxwriter       : None
    lxml.etree       : None
    html5lib         : None
    pymysql          : None
    psycopg2         : None
    jinja2           : None
    IPython          : 8.5.0
    pandas_datareader: None
    bs4              : None
    bottleneck       : None
    brotli           : None
    fastparquet      : None
    fsspec           : None
    gcsfs            : None
    matplotlib       : None
    numba            : None
    numexpr          : None
    odfpy            : None
    openpyxl         : None
    pandas_gbq       : None
    pyarrow          : 9.0.0
    pyreadstat       : None
    pyxlsb           : None
    s3fs             : None
    scipy            : None
    snappy           : None
    sqlalchemy       : None
    tables           : None
    tabulate         : None
    xarray           : None
    xlrd             : None
    xlwt             : None
    zstandard        : None
    tzdata           : None

Prior Performance
No response