BUG: Fix duplicates in intersection of multiindexes #36927
Changes from 25 commits
```diff
@@ -2822,7 +2822,7 @@ def intersection(self, other, sort=False):
         self._assert_can_do_setop(other)
         other = ensure_index(other)

-        if self.equals(other):
+        if self.equals(other) and not self.has_duplicates:
             return self._get_reconciled_name_object(other)

         if not is_dtype_equal(self.dtype, other.dtype):
```
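This hunk removes the fast path for an index that equals `other` but contains duplicates, so such an intersection falls through to the deduplicating logic. A minimal sketch of the intended behaviour (assuming a pandas build that includes this fix):

```python
import pandas as pd

# Two equal indexes that both contain a duplicate "a".
left = pd.Index(["a", "b", "a"])
right = pd.Index(["a", "b", "a"])

# Before the fix, the equals() fast path returned the duplicates
# unchanged; with the fix, each label appears once in the result.
result = left.intersection(right)
print(len(result))  # 2
```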
```diff
@@ -2847,6 +2847,7 @@ def _intersection(self, other, sort=False):
         except TypeError:
             pass
         else:
+            result = algos.unique1d(result)
             return result

         try:
```
```diff
@@ -2858,7 +2859,7 @@ def _intersection(self, other, sort=False):
             indexer = algos.unique1d(Index(rvals).get_indexer_non_unique(lvals)[0])
             indexer = indexer[indexer != -1]

-            result = other.take(indexer)._values
+            result = other.take(indexer).unique()._values

         if sort is None:
             result = algos.safe_sort(result)
```

Review thread on `result = other.take(indexer).unique()._values`:

> **Contributor:** can you add an assert that L2867 is always unique (e.g. we are guaranteed unique here at L2862, and I think `safe_sort` guarantees this), but let's be explicit (and add a comment).
>
> **Contributor:** you should check if this `is_unique` first. it will be computed but is likely faster than always doing this (and copies yet again).
>
> **Member (Author):** We may have a numpy array here, so we cannot do `result.is_unique`. Is there a better way than using `algos.unique` on `result` and comparing shapes?
>
> **Contributor:** sure you can do (measure perf)
>
> **Member (Author):** Sorry, did not get that. Did it now, is in #38154
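The shape-comparison idea the author mentions above can be sketched with plain numpy (an illustration, not necessarily what the PR ultimately did — see #38154). A bare ndarray has no `is_unique` property, but deduplicating and comparing lengths answers the same question:

```python
import numpy as np

# np.unique stands in for pandas' internal algos.unique1d here;
# note np.unique also sorts, unlike pandas' order-preserving helper.
result = np.array([3, 1, 3, 2])
deduped = np.unique(result)

# Same length after deduplication means there were no duplicates.
already_unique = deduped.shape[0] == result.shape[0]
print(already_unique)  # False: the value 3 appears twice
```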
```diff
@@ -3601,6 +3601,8 @@ def intersection(self, other, sort=False):
         other, result_names = self._convert_can_do_setop(other)

         if self.equals(other):
+            if self.has_duplicates:
+                return self.unique().rename(result_names)
             return self.rename(result_names)

         if not is_object_dtype(other.dtype):
```

Review thread on the added `has_duplicates` branch:

> **Member:** @phofl it looks like we do something slightly different in many of these. The RangeIndex one is equivalent to the Index/IntervalIndex one because it never has duplicates (can add that check to make the code match exactly). Can the others be made to have identical logic?
>
> **Member (Author):** Is there still an open case where I can be of help?
>
> **Member:** I think we're all set for now, will take another look after all the current Index.intersection PRs go through. thanks
>
> **Member (Author):** Great
>
> **Member:** It looks like several of them are still doing either something which I think is wrong if it has duplicates, or something which isn't handling the has-duplicates case like the others. can these all be identical? Also if you really get on a roll, …
>
> **Member (Author):** Will look through them and try to standardize them as much as possible
```diff
@@ -3619,10 +3621,12 @@ def intersection(self, other, sort=False):
         uniq_tuples = None  # flag whether _inner_indexer was successful
         if self.is_monotonic and other.is_monotonic:
             try:
-                uniq_tuples = self._inner_indexer(lvals, rvals)[0]
-                sort = False  # uniq_tuples is already sorted
+                inner_tuples = self._inner_indexer(lvals, rvals)[0]
+                sort = False  # inner_tuples is already sorted
             except TypeError:
                 pass
+            else:
+                uniq_tuples = algos.unique(inner_tuples)

         if uniq_tuples is None:
             other_uniq = set(rvals)
```
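Together, the two MultiIndex hunks deduplicate both the `equals` fast path and the monotonic `_inner_indexer` path. A small sketch of the intended result (assuming a pandas build that includes this fix):

```python
import pandas as pd

# A MultiIndex with a duplicated tuple ("a", 1).
mi = pd.MultiIndex.from_tuples([("a", 1), ("a", 1), ("b", 2)])

# Intersecting an index with itself used to echo the duplicates;
# with this change each tuple appears once in the result.
result = mi.intersection(mi)
print(len(result))  # 2
```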
```diff
@@ -311,7 +311,9 @@ def should_reindex_frame_op(
     # TODO: any other cases we should handle here?
     cols = left.columns.intersection(right.columns)

-    if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)):
+    if len(cols) and not (
+        cols.equals(left.columns.unique()) and cols.equals(right.columns.unique())
+    ):
```
```python
def test_column_dups_operations(self):
```

Review thread on `test_column_dups_operations`:

> **Member (Author):** This one is crashing in line 255 because of memory issues. The condition is never fulfilled, so it blows up.
>
> **Contributor:** the condition in master is never fulfilled, or the condition in the PR? btw, usually let the person who asked the question hit the "resolve conversation" button.
>
> **Member (Author):** Sorry, did not know that. Will do that in the future. The input df has columns [A, A]. When the intersection is unique, then cols=[A], while left.columns=[A, A] and right.columns=[A, A]. Without the change adding the unique, `cols.equals(left.columns)` and `cols.equals(right.columns)` will always be False. So we always return True, which results in duplicating the columns on every pass through, hence the memory explosion.
>
> **Contributor:** is this still needed? we now guarantee that intersection returns unique columns, right? so this should no longer be the case.
>
> **Member (Author):** Yeah, this is the problem. If the intersection is unique, `cols.equals(left.columns)` won't be True. This leads to a recursion in the test mentioned, which blows up the memory, because we cannot exit.
>
> **Contributor:** ok, pls add some comments to this effect then.
>
> **Contributor:** i would prob calculate `left_uniques` and `right_uniques` and make them variables, as a bit simpler to read.
>
> **Member (Author):** Thanks, changed it and added a comment.
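The author's explanation can be reproduced in a few lines; comparing against `.unique()` is what lets the reindex check terminate. The variable names below are illustrative, not the actual locals of `should_reindex_frame_op`:

```python
import pandas as pd

# A frame with duplicate columns [A, A] on both sides.
left_cols = pd.Index(["A", "A"])
right_cols = pd.Index(["A", "A"])

# intersection() now returns unique labels: ["A"].
cols = left_cols.intersection(right_cols)

# Old check: never True for duplicated columns, so reindexing repeats
# forever, duplicating columns on every pass.
old_exit = cols.equals(left_cols) and cols.equals(right_cols)

# Fixed check: compares against the deduplicated columns and exits.
new_exit = cols.equals(left_cols.unique()) and cols.equals(right_cols.unique())

print(old_exit, new_exit)  # False True
```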