@KevsterAmp
Contributor

@KevsterAmp KevsterAmp commented Aug 26, 2024

I added an alternative ndarray with the same length as the output of _get_values_for_csv. It is passed to write_csv_rows, which (from crude testing) uses the length of data_index:

Py_ssize_t i, j = 0, k = len(data_index), N = 100, ncols = len(cols)

Tests

Using the same code as the referenced issue:

import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv
import time

NUM_ROWS = 10000000
NUM_COLS = 20

# Example Multi-Index DataFrame
df = pd.DataFrame(
    {
        f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS)
        for col_idx in range(NUM_COLS)
    }
)
df = df.set_index(["col_0", "col_1"], drop=False)

# Timing Operation A
start_time = time.time()
df.to_csv("file_A.csv", index=False)
end_time = time.time()
print(f"Operation A time: {end_time - start_time} seconds")

# Timing Operation B
start_time = time.time()
df_reset = df.reset_index(drop=True)
df_reset.to_csv("file_B.csv", index=False)
end_time = time.time()
print(f"Operation B time: {end_time - start_time} seconds")

Output before performance improvement

Operation A time: 869.2354643344879 seconds
Operation B time: 42.1906418800354 seconds

Output after performance improvement

Operation A time: 51.408071756362915 seconds
Operation B time: 45.78637385368347 seconds

Operation B resets the index before writing and serves as the baseline for comparison. The change improves the performance of Operation A.
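A scaled-down variant of the benchmark can double as a sanity check that both operations write the same CSV. This is only a sketch and not part of the PR: the row and column counts and the helper name `time_to_csv` are hypothetical, and it writes to in-memory buffers instead of files.

```python
import io
import time

import pandas as pd

NUM_ROWS = 100_000  # far smaller than the 10,000,000 rows used above
NUM_COLS = 5

df = pd.DataFrame(
    {
        f"col_{i}": range(i * NUM_ROWS, (i + 1) * NUM_ROWS)
        for i in range(NUM_COLS)
    }
).set_index(["col_0", "col_1"], drop=False)


def time_to_csv(frame):
    """Hypothetical helper: return (elapsed seconds, CSV text) with index=False."""
    buf = io.StringIO()
    start = time.perf_counter()
    frame.to_csv(buf, index=False)
    return time.perf_counter() - start, buf.getvalue()


time_a, csv_a = time_to_csv(df)                         # Operation A: MultiIndex kept
time_b, csv_b = time_to_csv(df.reset_index(drop=True))  # Operation B: index reset first

print(f"Operation A: {time_a:.3f} s, Operation B: {time_b:.3f} s")
assert csv_a == csv_b  # both paths should write identical contents
```

Timings at this size will vary by machine and pandas version, so they are not directly comparable to the figures reported above.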

@KevsterAmp KevsterAmp changed the title from "add alternative ix when self.nlevel is 0" to "PERF: Performance Improvement on DataFrame.to_csv() when index=False" Aug 26, 2024
ix = (
self.data_index[slicer]._get_values_for_csv(**self._number_format)
if self.nlevels != 0
else np.full(end_i - start_i, None)
Member

Can you use np.empty instead?
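For context, a minimal sketch of how the two NumPy calls differ. The size `n` is a stand-in for `end_i - start_i`, and the conversation does not show which dtype (if any) the merged code passes to `np.empty`, so the variants below are illustrative assumptions only.

```python
import numpy as np

n = 5  # stand-in for end_i - start_i

# np.full with None infers object dtype and writes None into every slot.
filled = np.full(n, None)

# np.empty with dtype=object also yields None entries
# (NumPy initializes object arrays to None).
placeholder = np.empty(n, dtype=object)

# np.empty with the default dtype (float64) skips initialization,
# so its contents are arbitrary and should never be read.
uninitialized = np.empty(n)

print(filled.dtype, placeholder.dtype, uninitialized.dtype)
```

Either form gives an array of the required length. The appeal of `np.empty` is that it avoids an explicit fill when the values themselves are never written out, as with `index=False`.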

@mroeschke
Member

For your benchmark, could you show before and after timings?

@mroeschke mroeschke added the Performance (memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Aug 26, 2024
@KevsterAmp
Contributor Author

Output before performance improvement

Operation A time: 869.2354643344879 seconds
Operation B time: 42.1906418800354 seconds

Output (after performance improvement)

Operation A time: 51.408071756362915 seconds
Operation B time: 45.78637385368347 seconds

Operation B resets the index before writing and serves as the baseline for comparison. The change improves the performance of Operation A.

I added the output times to the description as well.

@mroeschke mroeschke added this to the 3.0 milestone Aug 27, 2024
@mroeschke mroeschke merged commit bd81fef into pandas-dev:main Aug 27, 2024
@mroeschke
Member

Thanks @KevsterAmp
