-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
PERF: Slow performance of to_dict (#46470) #46487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Slow performance of to_dict (#46470) #46487
Conversation
397bc47 to 4f0f872 Compare | @rhshadrach there are lots of failures on the builds that seem unrelated to my changes, am I missing something? |
9aa150f to 70ed030 Compare
phofl left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please avoid doing the refactoring and the changes in parallel? This makes it really hard to review
| @phofl yeah very enough, but as I was writing tests i realised that a few of the other |
+1. @RogerThomas can you leave as one function for now and refactor in a follow up (or vice-versa)? |
| @rhshadrach @phofl, sure thing. Just to be sure before I start, do you want me to; a) Keep the code that I added that handles correct dtype coercing for some orient types and just move the code back into the original method |
| @RogerThomas - I think it's best to have separate PRs for performance and correctness. Best to get correctness first, then improve performance (in general). |
| @rhshadrach sorry not sure i follow, the ticket that this is supposed to close is purely a performance issue. Are you saying i should create a pr with the fix for the other orient types and then continue with this pr to address the slow performance of to_dict records? |
| @RogerThomas - yes, that's correct. If they are independent, then any order is okay. If they are not independent, then best practice is to get correct behavior and then work on performance. |
| Ok thanks @rhshadrach, do i need to create a separate github issue for the pr to fix the other orient types? |
| @RogerThomas - I prefer to, but in general, no. E.g. you can specify the PR # in the whatsnew when doing a bugfix instead of an issue #. |
| Ok thanks @rhshadrach, I'll try and get to both in the next few days |
70ed030 to 96ac6fa Compare | @rhshadrach @phofl i've removed the helper function, i've updated
How to interpret this table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice perf gain! Some requests/questions below. I only make comments on the first case, similar remarks apply to other ones as well. Can you also run asvs for this. From within asv_bench:
asv continuous -f 1.1 upstream/main HEAD -b ^frame_methods.ToDict
pandas/core/frame.py Outdated
| for t in self.itertuples(index=False, name=None) | ||
| ] | ||
| else: | ||
| data = [list(t) for t in self.itertuples(index=False, name=None)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share code between any of these cases? e.g. make a helper function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback done
| { | ||
| "a": [1, "hello", 3], | ||
| "b": [1.1, "world", 3.3], | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this hits all of the new added code?
| @jreback would you have time to discuss the above? |
| @jreback @rhshadrach any chance we could pick this up, the project I'm working on could really do with the optimisations |
| @RogerThomas - certainly; can you resolve conflict and will look. |
…andas-devgh-46470-slow-performance-of-to-dict
| @rhshadrach done |
rhshadrach left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RogerThomas - from what I can tell the only outstanding issues are (a) potentially moving the entire to_dict implementation to pandas.io and (b) tests. For (a), I think this is a good idea and would make for a good followup if you'd like to tackle it @RogerThomas, otherwise I plan to.
For the tests, I'm seeing ~350 tests that hit the to_dict of either DataFrame or Series. Some of the more explicit tests of to_dict:
test_to_dict_timestamp test_to_dict_index_not_unique_with_index_orient test_to_dict_invalid_orient test_to_dict_short_orient_raises test_to_dict test_to_dict_errors test_to_dict_not_unique_warning test_to_dict_box_scalars test_to_dict_tz test_to_dict_index_dtypes test_to_dict_numeric_names test_to_dict_wide test_to_dict_orient_dtype test_to_dict_scalar_constructor_orient_dtype test_to_dict_mixed_numeric_frame test_to_dict_orient_tight test_to_dict_returns_native_types test_to_dict_index_false_error test_to_dict_index_false Looking through these, it appears to me the addition of testing object dtype gives good coverage of the code here.
I plan to run the to_dict ASVs on here this evening as a last step in approving, but otherwise looks good to me.
cc @jreback
| ASVs look great! |
| Thanks @rhshadrach!! Just to be clear then, is the last blocker moving the entire to_dict method to pandas.io? |
| moving to pandas.in is a follow up are there sufficient tests here? |
As far as I can tell, yes. I commented on this in #46487 (review). Are you looking for something more specific? |
| @jreback @rhshadrach the tests I added here, I believe, give us decent coverage across a range of dtypes |
| @jreback - friendly ping. |
| @rhshadrach anything I can do to help speed this up, I'd really like to get it in and am afraid it's going to go stale and fall by the wayside. |
| @jreback - friendly ping. |
jreback left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine
need to move the note
also prob good idea to factor this code outside of frame.py in a follow up
@rhshadrach pls merge when good by you
doc/source/whatsnew/v1.5.0.rst Outdated
| - Performance improvement when setting values in a pyarrow backed string array (:issue:`46400`) | ||
| - Performance improvement in :func:`factorize` (:issue:`46109`) | ||
| - Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`) | ||
| - Performance improvement in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when using any non-object dtypes (:issue:`46470`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to move the note to 2.0
…andas-devgh-46470-slow-performance-of-to-dict
| Thanks @jreback I've moved the whatsnew doc to 2.0.0, let me know if there's anything else |
rhshadrach left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
| Thanks @RogerThomas - great work! I've opened #49845 as a followup, would you have any interest tackling this? |
| Thanks @rhshadrach, for sure, I'll do that |
Co-authored-by: Roger Thomas <roger.thomas@oliverwyman.com> Co-authored-by: RogerThomas <roger.thomas@gmail.com>
Improves performance of
to_dictmethod for dataframes and seriesFor
orient=indexandorient=listperformance has decreased but this is because, prior to this PR, we were not coercing types to native python types, we now are. (I assume we want this, but maybe not?)to_dict("records")#46470 (Replace xxxx with the Github issue number)doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.