Skip to content

Conversation

@sinhrks
Copy link
Member

@sinhrks sinhrks commented Jun 28, 2014

Related to #7453 2nd issue:

  • GroupBy.size created by TimeGrouper raises AttributeError
@jreback jreback added this to the 0.14.1 milestone Jun 28, 2014
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you not using value_counts? this should be very similar (if not call irectly), BaseGrouper.size

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 reasons.

  • value_counts cannot fulfill intermediate timestamps which is not included in the group key. For example, if the TimeGrouper categorize [2011-01-01, 2011-01-02, 2011-01-04] with daily frequency, the resulted group keys must be [2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04]
  • BinGrouper doesn't retain original Index to be directly passed to value_counts. To use value_counts, it must be created using similar logic.
indices = self.indices labels = [] for k, v in compat.iteritems(indices): labels.extend([k] * len(v)) value_counts(labels) 
@jreback
Copy link
Contributor

jreback commented Jun 28, 2014

ok can u add a bench for size in that case?
for both a regular groupby and TimeGrouper?

@sinhrks
Copy link
Member Author

sinhrks commented Jun 29, 2014

OK, added benchmarks.

@jreback
Copy link
Contributor

jreback commented Jun 29, 2014

can I post benchmarks for those 2? thxs

@sinhrks
Copy link
Member Author

sinhrks commented Jun 30, 2014

Here it is. Cannot measure base performance of groupby_dt_timegrouper_size because of the bug, but it looks not slow comparing to groupby_dt_size

------------------------------------------------------------------------------- Test name | head[ms] | base[ms] | ratio | ------------------------------------------------------------------------------- groupby_dt_size | 42.2956 | 55.4504 | 0.7628 | groupby_dt_timegrouper_size | 34.2657 | ------- | ------ | 
@jreback
Copy link
Contributor

jreback commented Jun 30, 2014

run then with -Hto test current perf (with your fix) with comparable samples (or do in ipython).

@sinhrks
Copy link
Member Author

sinhrks commented Jun 30, 2014

Yep, current perf of groupby_dt_timegrouper_size is attached above.

jreback added a commit that referenced this pull request Jun 30, 2014
BUG: GroupBy.size created by TimeGrouper raises AttributeError
@jreback jreback merged commit d21f44b into pandas-dev:master Jun 30, 2014
@sinhrks sinhrks deleted the bingrouper branch July 5, 2014 04:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants