Closed
Labels
- Copy / view semantics
- Performance: Memory or execution speed performance
- Regression: Functionality that used to work in a prior pandas version
- hashing: hash_pandas_object
Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Given the following example:
    from hashlib import sha256
    import numpy as np
    import pandas as pd
    import time

    size = int(1e4)
    while size <= int(3e5):
        # Build a (size x 50) frame, then transpose it to 50 rows x size columns.
        df = pd.DataFrame(np.random.normal(size=(size, 50)))
        df_t = df.T
        # Time only the hashing step (CPU time).
        start = time.process_time()
        hashed_rows = pd.util.hash_pandas_object(df_t.reset_index()).values
        t = (size, time.process_time() - start)
        print(f'{t[0]}, {t[1]}')
        size = size + 10000

Runtimes differ drastically between versions 2.1.0 and 2.1.1: on 2.1.1 the runtime appears to grow polynomially or exponentially with size.
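To make the "polynomial/exponential" observation easier to quantify, here is a minimal sketch that estimates the growth exponent from the `(size, seconds)` pairs the loop prints. The helper name `fit_loglog_slope` is hypothetical and not part of pandas; it fits a straight line to `log(seconds)` against `log(size)`. The numbers in the demo are synthetic and only illustrate the calculation; they are not measurements from either pandas version.

```python
import math


def fit_loglog_slope(sizes, seconds):
    """Least-squares slope of log(seconds) against log(sizes).

    A slope near 1 suggests linear scaling and a slope near 2 suggests
    quadratic scaling. If the slope keeps rising as larger sizes are
    added, growth is faster than any fixed polynomial.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # SYNTHETIC timings (perfectly quadratic), used only to show the fit.
    demo_sizes = [10_000, 20_000, 40_000, 80_000]
    demo_seconds = [1e-9 * n ** 2 for n in demo_sizes]
    print(f"estimated exponent: {fit_loglog_slope(demo_sizes, demo_seconds):.3f}")
```

To use it with the reproducer, collect each `t` tuple into a list instead of only printing it, then pass the sizes and elapsed times to the helper once per pandas version and compare the two slopes.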
Installed Versions
For 2.1.0
    INSTALLED VERSIONS
    ------------------
    commit : ba1cccd19da778f0c3a7d6a885685da16a072870
    python : 3.9.18.final.0
    python-bits : 64
    OS : Linux
    OS-release : 5.15.49-linuxkit-pr
    Version : #1 SMP PREEMPT Thu May 25 07:27:39 UTC 2023
    machine : aarch64
    processor :
    byteorder : little
    LC_ALL : None
    LANG : C.UTF-8
    LOCALE : en_US.UTF-8

    pandas : 2.1.0
    numpy : 1.26.0
    pytz : 2023.3.post1
    dateutil : 2.8.2
    setuptools : 67.8.0
    pip : 23.2.1
    Cython : 3.0.2
    pytest : None
    hypothesis : None
    sphinx : None
    blosc : None
    feather : None
    xlsxwriter : None
    lxml.etree : None
    html5lib : None
    pymysql : None
    psycopg2 : None
    jinja2 : 3.1.2
    IPython : None
    pandas_datareader : None
    bs4 : None
    bottleneck : None
    dataframe-api-compat: None
    fastparquet : None
    fsspec : 2023.9.1
    gcsfs : None
    matplotlib : 3.8.0
    numba : None
    numexpr : None
    odfpy : None
    openpyxl : None
    pandas_gbq : None
    pyarrow : 13.0.0
    pyreadstat : None
    pyxlsb : None
    s3fs : None
    scipy : 1.11.2
    sqlalchemy : None
    tables : None
    tabulate : None
    xarray : None
    xlrd : None
    zstandard : None
    tzdata : 2023.3
    qtpy : None
    pyqt5 : None

For 2.1.1
    INSTALLED VERSIONS
    ------------------
    commit : e86ed377639948c64c429059127bcf5b359ab6be
    python : 3.9.18.final.0
    python-bits : 64
    OS : Linux
    OS-release : 5.15.49-linuxkit-pr
    Version : #1 SMP PREEMPT Thu May 25 07:27:39 UTC 2023
    machine : aarch64
    processor :
    byteorder : little
    LC_ALL : None
    LANG : C.UTF-8
    LOCALE : en_US.UTF-8

    pandas : 2.1.1
    numpy : 1.26.0
    pytz : 2023.3.post1
    dateutil : 2.8.2
    setuptools : 67.8.0
    pip : 23.2.1
    Cython : 3.0.2
    pytest : None
    hypothesis : None
    sphinx : None
    blosc : None
    feather : None
    xlsxwriter : None
    lxml.etree : None
    html5lib : None
    pymysql : None
    psycopg2 : None
    jinja2 : 3.1.2
    IPython : None
    pandas_datareader : None
    bs4 : None
    bottleneck : None
    dataframe-api-compat: None
    fastparquet : None
    fsspec : 2023.9.1
    gcsfs : None
    matplotlib : 3.8.0
    numba : None
    numexpr : None
    odfpy : None
    openpyxl : None
    pandas_gbq : None
    pyarrow : 13.0.0
    pyreadstat : None
    pyxlsb : None
    s3fs : None
    scipy : 1.11.2
    sqlalchemy : None
    tables : None
    tabulate : None
    xarray : None
    xlrd : None
    zstandard : None
    tzdata : 2023.3
    qtpy : None
    pyqt5 : None

Prior Performance
Same example as above: it runs in acceptable time on 2.1.0 but not on 2.1.1.

