PERF: hash_pandas_object performance regression between 2.1.0 and 2.1.1

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Given the following example:

from hashlib import sha256 import numpy as np import pandas as pd import time size = int(1e4) while size <= int(3e5): df = pd.DataFrame(np.random.normal(size=(size, 50))) df_t = df.T start = time.process_time() hashed_rows = pd.util.hash_pandas_object(df_t.reset_index()).values t = (size, time.process_time() - start) print(f'{t[0}, {t[1]}') size = size + 10000

The runtimes between version 2.1.0 and 2.1.1 are drastically different, with 2.1.1 showing polynomial/exponential runtime.

Installed Versions

For 2.1.0

INSTALLED VERSIONS ------------------ commit : ba1cccd19da778f0c3a7d6a885685da16a072870 python : 3.9.18.final.0 python-bits : 64 OS : Linux OS-release : 5.15.49-linuxkit-pr Version : #1 SMP PREEMPT Thu May 25 07:27:39 UTC 2023 machine : aarch64 processor : byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.1.0 numpy : 1.26.0 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 67.8.0 pip : 23.2.1 Cython : 3.0.2 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : None pandas_datareader : None bs4 : None bottleneck : None dataframe-api-compat: None fastparquet : None fsspec : 2023.9.1 gcsfs : None matplotlib : 3.8.0 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 13.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.11.2 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

For 2.1.1

INSTALLED VERSIONS ------------------ commit : e86ed377639948c64c429059127bcf5b359ab6be python : 3.9.18.final.0 python-bits : 64 OS : Linux OS-release : 5.15.49-linuxkit-pr Version : #1 SMP PREEMPT Thu May 25 07:27:39 UTC 2023 machine : aarch64 processor : byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.1.1 numpy : 1.26.0 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 67.8.0 pip : 23.2.1 Cython : 3.0.2 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : None pandas_datareader : None bs4 : None bottleneck : None dataframe-api-compat: None fastparquet : None fsspec : 2023.9.1 gcsfs : None matplotlib : 3.8.0 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 13.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.11.2 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

Prior Performance

Same example as above. 2.1.0 is fine. 2.1.1 is not.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

PERF: hash_pandas_object performance regression between 2.1.0 and 2.1.1 #55245

Pandas version checks

Reproducible Example

Installed Versions

Prior Performance

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

PERF: hash_pandas_object performance regression between 2.1.0 and 2.1.1 #55245

Description

Pandas version checks

Reproducible Example

Installed Versions

Prior Performance

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions