Conversation

@abeltavares
Contributor

@abeltavares abeltavares commented Apr 13, 2024

@abeltavares abeltavares force-pushed the BUG/Parquet-size-grows-exponential-for-categorical-data branch from 8a5fe98 to c74b597 Compare April 13, 2024 00:24
@abeltavares
Contributor Author

This PR is ready for review. Thanks.

@mroeschke
Member

/preview

@github-actions
Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/58245/

categories, not just those present in the data. This behavior
is expected and consistent with pandas' handling of categorical data.
* To manage file size and ensure a more predictable roundtrip process,
consider using `remove_unused_categories` on the DataFrame before saving.
Member

Suggested change
- consider using `remove_unused_categories` on the DataFrame before saving.
+ consider using :meth:`Categorical.remove_unused_categories` on the DataFrame before saving.
Comment on lines 2886 to 2887
is expected and consistent with pandas' handling of categorical data.
* To manage file size and ensure a more predictable roundtrip process,
Member

Suggested change
- is expected and consistent with pandas' handling of categorical data.
- * To manage file size and ensure a more predictable roundtrip process,
+ is expected and consistent with pandas' handling of categorical data.
+ To manage file size and ensure a more predictable roundtrip process,
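
For readers of the note under review, here is a minimal sketch of the behavior it describes. It is an illustration under stated assumptions, not code from this PR: the column name `col`, the 10,000 placeholder categories, and the `parquet_size` helper are all hypothetical. Note that `remove_unused_categories` is applied per categorical column (here through the `.cat` accessor), and that the size comparison only runs if a Parquet engine such as pyarrow is installed.

```python
import os
import tempfile

import pandas as pd


def parquet_size(df: pd.DataFrame) -> int:
    """Hypothetical helper: write df to a temporary Parquet file, return its size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.parquet")
        df.to_parquet(path)  # needs a Parquet engine (pyarrow or fastparquet)
        return os.path.getsize(path)


# Made-up data: 10,000 declared categories, only three of which appear in the column.
all_categories = [f"cat_{i}" for i in range(10_000)]
df = pd.DataFrame(
    {"col": pd.Categorical(["cat_0", "cat_1", "cat_2"], categories=all_categories)}
)

# Drop the categories that no row refers to before saving.
trimmed = df.copy()
trimmed["col"] = trimmed["col"].cat.remove_unused_categories()

print("declared categories before:", len(df["col"].cat.categories))
print("declared categories after: ", len(trimmed["col"].cat.categories))

try:
    print("bytes on disk, untrimmed:", parquet_size(df))
    print("bytes on disk, trimmed:  ", parquet_size(trimmed))
except ImportError:
    # pandas raises ImportError when no Parquet engine is available.
    print("No Parquet engine installed; skipping the file size comparison.")
```

The values in the column are unchanged by the trim; only the list of declared categories shrinks, which is what the docstring note suggests doing to keep the written file size predictable.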
@mroeschke mroeschke added the IO CSV (read_csv, to_csv), IO Parquet (parquet, feather), and Docs labels and removed the IO CSV (read_csv, to_csv) label Apr 15, 2024
@abeltavares abeltavares force-pushed the BUG/Parquet-size-grows-exponential-for-categorical-data branch from c74b597 to d5020e1 Compare April 18, 2024 19:45
@abeltavares abeltavares requested a review from mroeschke April 18, 2024 20:35
@mroeschke mroeschke added this to the 3.0 milestone Apr 19, 2024
@mroeschke mroeschke merged commit 2fbabb1 into pandas-dev:main Apr 19, 2024
@mroeschke
Member

Thanks @abeltavares

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…ev#58245) Co-authored-by: Abel Tavares <abel.tavares@ctw.bmwgroup.com>
Labels

Docs, IO Parquet (parquet, feather)

2 participants