Skip to content

Conversation

@juleek
Copy link
Contributor

@juleek juleek commented Feb 15, 2023

Description of the bug

The crux of the issue was in BaseGrouper::groups:

to_groupby = zip(*(ping.grouping_vector for ping in self.groupings)) 

We are going to iterate over to_groupby, which is a zip of ping.grouping_vector.

groupings are populated in get_grouper, below is an excerpt of relevant code that we executed in case of grouping by date:

 # Taking this branch because grouping_vector is `TimeGrouper` elif isinstance(grouping_vector, Grouper): newgrouper, newobj = grouping_vector._get_grouper(self.obj, validate=False) # newgrouper is BinGrouper after the line above if isinstance(newgrouper, ops.BinGrouper): grouping_vector = newgrouper

after this, Grouping::grouping_vector is BinGrouper, and this is what zip() will try to iterate over.

Before the fix, when we were iterating over BinGrouper, its base class __iter__ was used, which was using self.indices, which was:

defaultdict(<class 'list'>, {Timestamp('2000-01-01 00:00:00'): [0, 1, 2], Timestamp('2000-01-02 00:00:00'): [3, 4, 5]}) 

When you iterate over a dictionary in Python, you iterate over the keys, and more importantly we have only two of them here. Second Grouping has array in its grouping vector consisting of 6 elements:

['a' 'b' 'a' 'b' 'a' 'b'] 

so, the zip() function just truncated it.

Fix

By adding __iter__ in BinGrouper, the zip() function now iterate over the elements from self.groupings[0].grouping_vector, which is, in our case:

DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02'], dtype='datetime64[ns]', name='b', freq=None) 
@juleek
Copy link
Contributor Author

juleek commented Feb 16, 2023

@jbrockmendel @rhshadrach Can you please have a look and/or advise? Thanks.

return result

def __iter__(self) -> Iterator[Hashable]:
return iter(self.groupings[0].grouping_vector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the parent class method wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot to add description. Please, see the description of the PR.

Copy link
Member

@rhshadrach rhshadrach Mar 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to refactor the indices here to align better with the base class rather than override iter. But I'm not certain if that's a good idea and I'm okay with this fix. cc @jbrockmendel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - friendly ping.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. i had in mind something more like @rhshadrach's suggestion, that needing this here might indicate that the self.indices (which the parent class method iterates over) might be wrong

@juleek
Copy link
Contributor Author

juleek commented Feb 23, 2023

@jbrockmendel I fixed all your comments. Can you please review the PR? Thanks.

@juleek
Copy link
Contributor Author

juleek commented Feb 25, 2023

@jbrockmendel I am wondering whether you had time to have a look the PR? If not, should I add someone else to the PR to review it?

Also, am I missing something or the tests a quite flaky?
For example, right now NumPy is failing with this:

Fatal Python error: Segmentation fault 
@jbrockmendel
Copy link
Member

Please be patient, there are a lot of PRs to review.

The segfault is unrelated and is affecting all PRs at the moment. You can ignore it.

@juleek
Copy link
Contributor Author

juleek commented Mar 9, 2023

@jbrockmendel have you had time to have a look?

@juleek
Copy link
Contributor Author

juleek commented Mar 18, 2023

@jbrockmendel Any news, please?

@juleek juleek force-pushed the work_in_progress branch from 6ad38ad to 66bf4c8 Compare March 20, 2023 17:52
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; I've opened #52468 to track a possible followup.

@rhshadrach rhshadrach added this to the 2.1 milestone Apr 8, 2023
@rhshadrach rhshadrach merged commit ebe484a into pandas-dev:main Apr 8, 2023
@rhshadrach
Copy link
Member

Thanks @juleek!

@taytzehao taytzehao mentioned this pull request Aug 14, 2023
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

4 participants