PERF: merge_ordered(how="inner") without prior filtering is slow and uses much more memory. #56115

@starhel

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

start, stop, step = 0, 100_000_000, 2
df1 = pd.DataFrame({"key": np.arange(start, stop, step),
                    "val1": np.arange(start, stop, step),
                    "val2": np.arange(start, stop, step)}, copy=False)
df1.info()

start, stop, step = 50_000_000, 70_000_000, 1
df2 = pd.DataFrame({"key": np.arange(start, stop, step),
                    "val3": np.arange(start, stop, step),
                    "val4": np.arange(start, stop, step)}, copy=False)
df2.info()

# This filtering should not be necessary for inner join as it only drops unused data
df1 = df1.query("key >= @df2.key.min() and key <= @df2.key.max()")
df2 = df2.query("key in @df1.key")

df = pd.merge_ordered(df1, df2, on="key", how="inner")
df.info()
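The comment in the example says the filtering only drops rows that an inner join would discard anyway. A minimal sketch to check this on scaled-down frames follows. The sizes and the names `left`, `right`, `plain`, and `filtered` are assumptions for illustration, and boolean masks stand in for the `query()` calls.

```python
import numpy as np
import pandas as pd

# Scaled-down stand-ins for df1 and df2 (sizes are an assumption so this runs quickly).
left = pd.DataFrame({"key": np.arange(0, 1_000, 2)})
left["val1"] = left["key"]
right = pd.DataFrame({"key": np.arange(500, 700, 1)})
right["val3"] = right["key"]

# Variant A: merge_ordered on the unfiltered frames.
plain = pd.merge_ordered(left, right, on="key", how="inner")

# Variant B: the same pre-filtering idea as in the report, then merge_ordered.
lo, hi = right["key"].min(), right["key"].max()
left_f = left[(left["key"] >= lo) & (left["key"] <= hi)]
right_f = right[right["key"].isin(left_f["key"])]
filtered = pd.merge_ordered(left_f, right_f, on="key", how="inner")

# Compare the two results after normalising the index.
same = plain.reset_index(drop=True).equals(filtered.reset_index(drop=True))
print(len(plain), same)
```

If both variants produce the same frame, the pre-filtering is purely a performance workaround and does not change the result.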

Installed Versions

INSTALLED VERSIONS
------------------
commit              : 2a953cf80b77e4348bf50ed724f8abc0d814d9dd
python              : 3.10.9.final.0
python-bits         : 64
OS                  : Linux
OS-release          : 6.2.0-36-generic
Version             : #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2
machine             : x86_64
processor           : x86_64
byteorder           : little
LC_ALL              : None
LANG                : pl_PL.UTF-8
LOCALE              : pl_PL.UTF-8

pandas              : 2.1.3
numpy               : 1.26.2
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 65.5.0
pip                 : 22.3.1
Cython              : None
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : None
jinja2              : None
IPython             : None
pandas_datareader   : None
bs4                 : None
bottleneck          : None
dataframe-api-compat: None
fastparquet         : None
fsspec              : None
gcsfs               : None
matplotlib          : 3.8.2
numba               : None
numexpr             : None
odfpy               : None
openpyxl            : None
pandas_gbq          : None
pyarrow             : None
pyreadstat          : None
pyxlsb              : None
s3fs                : None
scipy               : None
sqlalchemy          : None
tables              : None
tabulate            : None
xarray              : None
xlrd                : None
zstandard           : None
tzdata              : 2023.3
qtpy                : None
pyqt5               : None

Prior Performance

Plot generated by memory_profiler:

  • black line: with the two query() calls before merge_ordered
  • blue line: without the two query() calls before merge_ordered
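A rough way to reproduce this comparison without the plot is sketched below. It uses `tracemalloc` and `time.perf_counter` on scaled-down frames. The helpers `make_frames`, `measure`, `merge_plain`, and `merge_prefiltered` are hypothetical names introduced here, and the frame size is an assumption. `tracemalloc` only reports allocations registered with Python's tracer, so its numbers will not match the process-level figures from memory_profiler.

```python
import time
import tracemalloc

import numpy as np
import pandas as pd


def make_frames(n=200_000):
    """Build small frames shaped like df1/df2 in the report (n is an assumption)."""
    df1 = pd.DataFrame({"key": np.arange(0, n, 2)})
    df1["val1"] = df1["key"]
    df2 = pd.DataFrame({"key": np.arange(n // 2, n * 7 // 10, 1)})
    df2["val3"] = df2["key"]
    return df1, df2


def measure(func):
    """Return (result, elapsed seconds, peak traced bytes) for one call of func."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def merge_plain(df1, df2):
    # merge_ordered on the unfiltered frames (the slow case in the report).
    return pd.merge_ordered(df1, df2, on="key", how="inner")


def merge_prefiltered(df1, df2):
    # Same idea as the two query() calls: drop rows that cannot match, then merge.
    lo, hi = df2["key"].min(), df2["key"].max()
    df1 = df1[(df1["key"] >= lo) & (df1["key"] <= hi)]
    df2 = df2[df2["key"].isin(df1["key"])]
    return pd.merge_ordered(df1, df2, on="key", how="inner")


df1, df2 = make_frames()
for name, fn in [("plain", merge_plain), ("prefiltered", merge_prefiltered)]:
    out, secs, peak = measure(lambda: fn(df1, df2))
    print(f"{name}: rows={len(out)} time={secs:.3f}s peak={peak / 1e6:.1f} MB")
```

Raising `n` toward the sizes in the reproducible example should make any gap between the two variants easier to see, at the cost of longer runs.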


Labels: Needs Triage, Performance
