PERF: pd.util.hash_pandas_object slower on string[pyarrow] than object dtypes #48964

@jrbourbeau

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

When investigating Dask's shuffle performance on string[pyarrow] data, I observed that pd.util.hash_pandas_object was slightly slower on string[pyarrow] data than on regular object-dtype data (Python objects). This surprised me, as I would have expected hashing pyarrow-backed data to be faster than hashing Python objects.

In [1]: import pandas as pd

In [2]: s = pd.Series(range(2_000))

In [3]: s_object = s.astype(object)

In [4]: s_pyarrow = s.astype("string[pyarrow]")

In [5]: %timeit pd.util.hash_pandas_object(s_object)
859 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit pd.util.hash_pandas_object(s_pyarrow)
1.01 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
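For anyone reproducing this outside IPython, here is a minimal sketch of a like-for-like comparison using the standard-library timeit module. Note that in the snippet above, s.astype(object) holds Python ints while s.astype("string[pyarrow]") holds strings, so this variant first converts the values to str so that both Series contain the same strings. The helper names time_hash and compare are hypothetical, not part of pandas, and the iteration counts are arbitrary choices; no timings are claimed here.

```python
import timeit

import pandas as pd


def time_hash(series, number=100, repeat=5):
    """Return the best per-call time (seconds) of hash_pandas_object on `series`."""
    timer = timeit.Timer(lambda: pd.util.hash_pandas_object(series))
    # Take the minimum over repeats to reduce noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number


def compare(n=2_000):
    """Time hashing of identical string values stored under different dtypes."""
    # Assumption: convert to str first so both Series hold the same string values.
    s_object = pd.Series(range(n)).astype(str).astype(object)
    results = {"object": time_hash(s_object)}
    try:
        s_pyarrow = s_object.astype("string[pyarrow]")
    except ImportError:
        # pyarrow is an optional dependency; skip this case if it is missing.
        return results
    results["string[pyarrow]"] = time_hash(s_pyarrow)
    return results


if __name__ == "__main__":
    for dtype, seconds in compare().items():
        print(f"{dtype:>16}: {seconds * 1e6:.1f} us per call")
```

Running the script prints one line per dtype; results will vary with the machine and with the pandas and pyarrow versions installed.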

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python           : 3.10.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.0
numpy            : 1.23.3
pytz             : 2022.4
dateutil         : 2.8.2
setuptools       : 65.4.1
pip              : 22.2.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.5.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 9.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Prior Performance

No response

    Labels

    Arrow: pyarrow functionality
    Performance: Memory or execution speed performance
    Strings: String extension data type and string data
    hashing: hash_pandas_object
