Conversation

**@allrob23** (Contributor) commented Mar 4, 2025

**@allrob23** (Contributor, Author) commented Mar 4, 2025

pre-commit.ci autofix

@mroeschke mroeschke added Performance Memory or execution speed performance IO CSV read_csv, to_csv labels Mar 4, 2025
**@mroeschke** (Member):

Do you have an example benchmark where this PR improves the performance of read_csv?

**@allrob23** (Contributor, Author) commented Mar 4, 2025

This optimization was flagged by a tool I’m developing, which performs code inspection to identify potential performance improvements. However, it doesn’t measure execution times, so I haven’t benchmarked the actual impact yet.

From a theoretical perspective, the change makes sense since the previous implementation performed lookups in a list (O(n)) while the new approach uses a set (O(1)).
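The difference can be illustrated with a small standalone timing sketch. This is not the actual pandas code path, just an illustration of list vs. set membership cost with hypothetical column names:

```python
import timeit

# Hypothetical column names, purely for illustration.
columns = [f"col_{i}" for i in range(10_000)]
usecols_list = columns[::2]      # membership test is O(n) per lookup
usecols_set = set(usecols_list)  # membership test is O(1) on average

list_time = timeit.timeit(lambda: [c for c in columns if c in usecols_list], number=1)
set_time = timeit.timeit(lambda: [c for c in columns if c in usecols_set], number=1)

# Both approaches select exactly the same columns.
assert [c for c in columns if c in usecols_list] == [c for c in columns if c in usecols_set]
print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

The filtered result is identical either way; only the lookup cost changes, which is why the gap widens as the number of columns grows.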

Would you be able to help me create a proper benchmark for this case?

**github-actions** bot (Contributor) commented Apr 4, 2025

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 4, 2025
**@allrob23** (Contributor, Author) commented Apr 4, 2025

I'm busy with other things right now, but I'll try to create a benchmark as soon as possible.

**@allrob23** (Contributor, Author) commented Apr 26, 2025

Hello @mroeschke, sorry for the delay. I managed to run the benchmark and would like your opinion and review.

Benchmark Explanation

This benchmark script measures the performance of pandas.read_csv with usecols filtering after a code modification that optimizes column selection using a set lookup.

Since the change affects only specific code paths when usecols is provided, the benchmark was designed to:

  1. Vary the number of columns (10 to 50,000) to simulate different dataset sizes.
  2. Select different percentages of columns (1%, 10%, 50%, 90%) to trigger and measure the filtering logic.
  3. Run multiple iterations (5 by default, easily adjustable) to collect stable mean and median read times.
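The design above can be sketched as a minimal self-contained script. This is an illustrative reconstruction, not the exact gist code; the helper name and column-name scheme are assumptions:

```python
import io
import statistics
import time

import pandas as pd


def bench_read_csv(n_cols: int, select_pct: float, iterations: int = 5):
    """Time read_csv with usecols selecting select_pct of n_cols columns.

    Returns (mean, median) read times in seconds.
    """
    header = ",".join(f"c{i}" for i in range(n_cols))
    row = ",".join("1" for _ in range(n_cols))
    csv_text = header + "\n" + "\n".join(row for _ in range(100))
    # Select the first select_pct of columns to trigger the filtering logic.
    usecols = [f"c{i}" for i in range(int(n_cols * select_pct))]

    times = []
    for _ in range(iterations):
        buf = io.StringIO(csv_text)  # fresh buffer each iteration
        start = time.perf_counter()
        pd.read_csv(buf, usecols=usecols)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.median(times)


mean_t, median_t = bench_read_csv(1000, 0.5)
print(f"mean: {mean_t:.4f}s, median: {median_t:.4f}s")
```

The full script in the gist additionally sweeps the column counts and percentages listed above and prints a results table.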

Benchmark results

These are the results on my branch:

| Number of Columns | Select Percentage | Mean Time (s) | Median Time (s) |
|---|---|---|---|
| 10 | 1% | 0.0008 | 0.0008 |
| 10 | 10% | 0.0008 | 0.0008 |
| 10 | 50% | 0.0013 | 0.0013 |
| 10 | 90% | 0.0018 | 0.0018 |
| 100 | 1% | 0.0009 | 0.0009 |
| 100 | 10% | 0.0019 | 0.0019 |
| 100 | 50% | 0.0062 | 0.0063 |
| 100 | 90% | 0.0106 | 0.0106 |
| 1000 | 1% | 0.0025 | 0.0025 |
| 1000 | 10% | 0.0155 | 0.0118 |
| 1000 | 50% | 0.0533 | 0.0527 |
| 1000 | 90% | 0.0958 | 0.0935 |
| 10000 | 1% | 0.0176 | 0.0173 |
| 10000 | 10% | 0.1173 | 0.1120 |
| 10000 | 50% | 0.5866 | 0.5692 |
| 10000 | 90% | 1.0830 | 1.0885 |
| 50000 | 1% | 0.0913 | 0.0907 |
| 50000 | 10% | 0.6784 | 0.7101 |
| 50000 | 50% | 3.3062 | 3.2970 |
| 50000 | 90% | 5.9495 | 5.8902 |

These are the results on the main branch:

| Number of Columns | Select Percentage | Mean Time (s) | Median Time (s) |
|---|---|---|---|
| 10 | 1% | 0.0009 | 0.0009 |
| 10 | 10% | 0.0008 | 0.0008 |
| 10 | 50% | 0.0013 | 0.0013 |
| 10 | 90% | 0.0017 | 0.0017 |
| 100 | 1% | 0.0009 | 0.0009 |
| 100 | 10% | 0.0020 | 0.0020 |
| 100 | 50% | 0.0065 | 0.0064 |
| 100 | 90% | 0.0110 | 0.0110 |
| 1000 | 1% | 0.0024 | 0.0024 |
| 1000 | 10% | 0.0160 | 0.0118 |
| 1000 | 50% | 0.0548 | 0.0547 |
| 1000 | 90% | 0.1046 | 0.1014 |
| 10000 | 1% | 0.0181 | 0.0174 |
| 10000 | 10% | 0.1259 | 0.1203 |
| 10000 | 50% | 0.8275 | 0.8285 |
| 10000 | 90% | 1.8634 | 1.8587 |
| 50000 | 1% | 0.0924 | 0.0917 |
| 50000 | 10% | 0.9305 | 0.9910 |
| 50000 | 50% | 9.2642 | 9.2397 |
| 50000 | 90% | 24.1806 | 24.3157 |

The results show a performance improvement with the modification, especially as the number of columns and the selected percentage increase.

In small datasets (10–100 columns), performance is similar between the branches; as expected, the overhead of column filtering is minimal there.

However, as the number of columns grows (1,000, 10,000, 50,000), the modification consistently outperforms the main branch, especially for larger selection percentages (50%, 90%).

For example:

  • With 10,000 columns and 90% selected, the mean time dropped from 1.86s (main) to 1.08s (new).
  • With 50,000 columns and 90% selected, the mean time dropped from 24.18s (main) to 5.94s (new).

How I ran the benchmark

1. Synchronize branches

```shell
git remote add allrob23-fork https://github.com/allrob23/pandas.git
git fetch allrob23-fork
```

2. Create a `bench.py` script

Get the code from this gist: https://gist.github.com/allrob23/dea86dba17bbfaa988e2b5a08b27db79

3. Run the code

```shell
python3 bench.py
```

4. Get the results printed at the end

5. Change to this PR branch

```shell
git checkout perf-columns-set
```

6. Run the code again

```shell
python3 bench.py
```

7. Get the results printed at the end

I ran this using Docker and compiled pandas with meson:

```shell
# clone the repo
git clone https://github.com/pandas-dev/pandas/
cd pandas

# build the dev image and start a container
docker build -t pandas-dev .
docker run -it --rm -v ${PWD}:/home/pandas pandas-dev

# compile with meson (editable install)
python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true
```
@mroeschke mroeschke added this to the 3.0 milestone Apr 26, 2025
@mroeschke mroeschke removed the Stale label Apr 26, 2025
@mroeschke mroeschke merged commit 44c5613 into pandas-dev:main Apr 26, 2025
42 checks passed
**@mroeschke** (Member):

Thanks @allrob23
