Conversation

**@allrob23** (Contributor) commented Mar 4, 2025

**@allrob23** (Contributor, Author) commented Mar 4, 2025

pre-commit.ci autofix

@mroeschke mroeschke added Performance Memory or execution speed performance IO CSV read_csv, to_csv labels Mar 4, 2025
**@mroeschke** (Member):

Do you have an example benchmark where this PR improves the performance of read_csv?

**@allrob23** (Contributor, Author) commented Mar 4, 2025

This optimization was flagged by a tool I’m developing, which performs code inspection to identify potential performance improvements. However, it doesn’t measure execution times, so I haven’t benchmarked the actual impact yet.

From a theoretical perspective, the change makes sense since the previous implementation performed lookups in a list (O(n)) while the new approach uses a set (O(1)).
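The difference can be illustrated with a small standalone timing sketch. This is not the actual pandas code path, just an illustration of list vs. set membership cost with hypothetical column names:

```python
import timeit

# Hypothetical column names, purely for illustration.
columns = [f"col_{i}" for i in range(10_000)]
usecols_list = columns[::2]      # membership test is O(n) per lookup
usecols_set = set(usecols_list)  # membership test is O(1) on average

list_time = timeit.timeit(lambda: [c for c in columns if c in usecols_list], number=1)
set_time = timeit.timeit(lambda: [c for c in columns if c in usecols_set], number=1)

# Both approaches select exactly the same columns.
assert [c for c in columns if c in usecols_list] == [c for c in columns if c in usecols_set]
print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

The filtered result is identical either way; only the lookup cost changes, which is why the gap widens as the number of columns grows.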

Would you be able to help me create a proper benchmark for this case?

**github-actions** bot (Contributor) commented Apr 4, 2025

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 4, 2025
**@allrob23** (Contributor, Author) commented Apr 4, 2025

I'm busy with other things right now, but I'll try to create a benchmark as soon as possible.

**@allrob23** (Contributor, Author) commented Apr 26, 2025

Hello @mroeschke, sorry for the delay. I managed to run the benchmark and would like your opinion and review.

Benchmark Explanation

This benchmark script measures the performance of pandas.read_csv with usecols filtering after a code modification that optimizes column selection using a set lookup.

Since the change affects only specific code paths when usecols is provided, the benchmark was designed to:

  1. Vary the number of columns (10 to 50,000) to simulate different dataset sizes.
  2. Select different percentages of columns (1%, 10%, 50%, 90%) to trigger and measure the filtering logic.
  3. Run multiple iterations (5 by default, easily adjustable) to collect stable mean and median read times.
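The design above can be sketched as a minimal self-contained script. This is an illustrative reconstruction, not the exact gist code; the helper name and column-name scheme are assumptions:

```python
import io
import statistics
import time

import pandas as pd


def bench_read_csv(n_cols: int, select_pct: float, iterations: int = 5):
    """Time read_csv with usecols selecting select_pct of n_cols columns.

    Returns (mean, median) read times in seconds.
    """
    header = ",".join(f"c{i}" for i in range(n_cols))
    row = ",".join("1" for _ in range(n_cols))
    csv_text = header + "\n" + "\n".join(row for _ in range(100))
    # Select the first select_pct of columns to trigger the filtering logic.
    usecols = [f"c{i}" for i in range(int(n_cols * select_pct))]

    times = []
    for _ in range(iterations):
        buf = io.StringIO(csv_text)  # fresh buffer each iteration
        start = time.perf_counter()
        pd.read_csv(buf, usecols=usecols)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.median(times)


mean_t, median_t = bench_read_csv(1000, 0.5)
print(f"mean: {mean_t:.4f}s, median: {median_t:.4f}s")
```

The full script in the gist additionally sweeps the column counts and percentages listed above and prints a results table.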

Benchmark results

These are the results on my branch:

| Number of Columns | Select Percentage | Mean Time (s) | Median Time (s) |
|---|---|---|---|
| 10 | 1% | 0.0008 | 0.0008 |
| 10 | 10% | 0.0008 | 0.0008 |
| 10 | 50% | 0.0013 | 0.0013 |
| 10 | 90% | 0.0018 | 0.0018 |
| 100 | 1% | 0.0009 | 0.0009 |
| 100 | 10% | 0.0019 | 0.0019 |
| 100 | 50% | 0.0062 | 0.0063 |
| 100 | 90% | 0.0106 | 0.0106 |
| 1000 | 1% | 0.0025 | 0.0025 |
| 1000 | 10% | 0.0155 | 0.0118 |
| 1000 | 50% | 0.0533 | 0.0527 |
| 1000 | 90% | 0.0958 | 0.0935 |
| 10000 | 1% | 0.0176 | 0.0173 |
| 10000 | 10% | 0.1173 | 0.1120 |
| 10000 | 50% | 0.5866 | 0.5692 |
| 10000 | 90% | 1.0830 | 1.0885 |
| 50000 | 1% | 0.0913 | 0.0907 |
| 50000 | 10% | 0.6784 | 0.7101 |
| 50000 | 50% | 3.3062 | 3.2970 |
| 50000 | 90% | 5.9495 | 5.8902 |

These are the results on the main branch:

| Number of Columns | Select Percentage | Mean Time (s) | Median Time (s) |
|---|---|---|---|
| 10 | 1% | 0.0009 | 0.0009 |
| 10 | 10% | 0.0008 | 0.0008 |
| 10 | 50% | 0.0013 | 0.0013 |
| 10 | 90% | 0.0017 | 0.0017 |
| 100 | 1% | 0.0009 | 0.0009 |
| 100 | 10% | 0.0020 | 0.0020 |
| 100 | 50% | 0.0065 | 0.0064 |
| 100 | 90% | 0.0110 | 0.0110 |
| 1000 | 1% | 0.0024 | 0.0024 |
| 1000 | 10% | 0.0160 | 0.0118 |
| 1000 | 50% | 0.0548 | 0.0547 |
| 1000 | 90% | 0.1046 | 0.1014 |
| 10000 | 1% | 0.0181 | 0.0174 |
| 10000 | 10% | 0.1259 | 0.1203 |
| 10000 | 50% | 0.8275 | 0.8285 |
| 10000 | 90% | 1.8634 | 1.8587 |
| 50000 | 1% | 0.0924 | 0.0917 |
| 50000 | 10% | 0.9305 | 0.9910 |
| 50000 | 50% | 9.2642 | 9.2397 |
| 50000 | 90% | 24.1806 | 24.3157 |

The results show a performance improvement with the modification, especially as the number of columns and the selected percentage increase.

In small datasets (10–100 columns), performance is similar between the branches; as expected, the overhead of column filtering is minimal there.

However, as the number of columns grows (1,000, 10,000, 50,000), the modification consistently outperforms the main branch, especially for larger selection percentages (50%, 90%).

For example:

  • With 10,000 columns and 90% selected, the mean time dropped from 1.86s (main) to 1.08s (new).
  • With 50,000 columns and 90% selected, the mean time dropped from 24.18s (main) to 5.94s (new).

How I ran the benchmark

1. Synchronize branches

```shell
git remote add allrob23-fork https://github.com/allrob23/pandas.git
git fetch allrob23-fork
```

2. Create a `bench.py` script

Get the code from this gist: https://gist.github.com/allrob23/dea86dba17bbfaa988e2b5a08b27db79

3. Run the code

```shell
python3 bench.py
```

4. Get the results printed at the end

5. Change to this PR branch

```shell
git checkout perf-columns-set
```

6. Run the code again

```shell
python3 bench.py
```

7. Get the results printed at the end

I ran this using Docker and compiled pandas with meson:

```shell
# clone the repo
git clone https://github.com/pandas-dev/pandas/
cd pandas

# build the dev image and start a container
docker build -t pandas-dev .
docker run -it --rm -v ${PWD}:/home/pandas pandas-dev

# compile with meson (editable install)
python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true
```
@mroeschke mroeschke added this to the 3.0 milestone Apr 26, 2025
@mroeschke mroeschke removed the Stale label Apr 26, 2025
@mroeschke mroeschke merged commit 44c5613 into pandas-dev:main Apr 26, 2025
42 checks passed
**@mroeschke** (Member):

Thanks @allrob23
