Skip to content

Conversation

@sinhrks
Copy link
Member

@sinhrks sinhrks commented Apr 9, 2016

Also fixed an issue which categorical value_counts resets name.

@sinhrks sinhrks added Sparse Sparse Data Type Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Apr 9, 2016
@sinhrks sinhrks added this to the 0.18.1 milestone Apr 9, 2016
@sinhrks sinhrks added the Bug label Apr 9, 2016
@sinhrks sinhrks force-pushed the sparse_valuecounts branch from 79164d6 to 56a415a Compare April 9, 2016 05:50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually with only a small rewriting I think you could do:

if com.is_extension_dtype(values) or com.is_datetimelike(values): result = values.values_counts(dropna) result.name = name else: ..... 

if we define DateTimeIndex.value_counts that properly handles localization before/after (instead of in here)
will be much cleaner.

Further it makes sense to define .value_counts() in tseries/base.py (and move the inference from core/base.py/value_counts.py there. To handle the dispatching of period/DatetimeIndex.

a bit of refactoring but moves functionaility where it should be.

@sinhrks sinhrks force-pushed the sparse_valuecounts branch 2 times, most recently from 81fac4e to 91c4c2f Compare April 9, 2016 18:00
@sinhrks
Copy link
Member Author

sinhrks commented Apr 9, 2016

Found current master doesn't work for datetimetz. Let me check.

values = [pd.Timestamp('2011-01-01 09:00', tz='US/Eastern'), pd.Timestamp('2011-01-01 10:00', tz='US/Eastern'), pd.Timestamp('2011-01-01 11:00', tz='US/Eastern'), pd.Timestamp('2011-01-01 09:00', tz='US/Eastern'), pd.Timestamp('2011-01-01 09:00', tz='US/Eastern'), pd.Timestamp('2011-01-01 11:00', tz='US/Eastern')] s = pd.Series(values, name='xxx') # NG, index must have tz s.value_counts() # 2011-01-01 14:00:00 3 # 2011-01-01 16:00:00 2 # 2011-01-01 15:00:00 1 # Name: xxx, dtype: int64 
@sinhrks sinhrks force-pushed the sparse_valuecounts branch 2 times, most recently from 89ab6ea to add609f Compare April 24, 2016 01:56
@sinhrks sinhrks added Timezones Timezone data dtype Categorical Categorical Data Type labels Apr 24, 2016
@sinhrks
Copy link
Member Author

sinhrks commented Apr 24, 2016

Additionally fixing the other Categorical issue.

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).value_counts(normalize=True) # UnboundLocalError: local variable 'counts' referenced before assignment 
@sinhrks sinhrks force-pushed the sparse_valuecounts branch 2 times, most recently from c14cbf5 to f1d62ee Compare April 24, 2016 03:32
@jreback
Copy link
Contributor

jreback commented Apr 25, 2016

@sinhrks pls rebase when you can

@sinhrks sinhrks force-pushed the sparse_valuecounts branch from f1d62ee to 456a4a9 Compare April 28, 2016 12:38
@codecov-io
Copy link

codecov-io commented Apr 28, 2016

Current coverage is 84.04%

Merging #12835 into master will increase coverage by +0.07%

@@ master #12835 diff @@ ========================================== Files 136 136 Lines 49908 49931 +23 Methods 0 0 Messages 0 0 Branches 0 0 ========================================== + Hits 41876 41962 +86  + Misses 8032 7969 -63  Partials 0 0 
  1. File ...ndas/core/groupby.py (not in diff) was modified. more
    • Misses -63
    • Partials 0
    • Hits +63

Powered by Codecov. Last updated by 38a7531

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't happen here as is_extension_type will do this

@sinhrks sinhrks force-pushed the sparse_valuecounts branch from 456a4a9 to 57f6667 Compare April 28, 2016 13:22
Copy link
Member Author

@sinhrks sinhrks Apr 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed @slow as groupby.value_coutns has separate logic depending on numpy version.

@sinhrks sinhrks force-pushed the sparse_valuecounts branch from 57f6667 to caeaed9 Compare April 28, 2016 21:19
@sinhrks sinhrks force-pushed the sparse_valuecounts branch from caeaed9 to 2392e7c Compare April 28, 2016 22:15
result = result.sort_values(ascending=ascending)
values = values.view(np.int64)
keys, counts = htable.value_count_scalar64(values, dropna)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, though I still think we should handle in tseries/base, e.g. define DatetimeIndexOpsMixIN and just defer to it like we are doing for sparse/categorical.

Copy link
Member Author

@sinhrks sinhrks Apr 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sparse/categorical value_counts only supports dropna kwds. It may better to be an internal method.

If we do the same thing using DatetimeIndexOpsMixin, the impl should be:

  • DatetimeIndexOpsMixin 's public value_counts calls algorithms.py (as it needs to handle kwds like normalize)
  • algorithms.py convert datetime-likes to datetime-like Index to handle Series
  • Calls DatetimeIndexOpsMixin 's internal _value_counts

Suppose to be little complex, or do you have anything in your mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah ok, it can ge tricky. ok, we'll keep in mind for later. ideally to put specific logic. merging.

@jreback jreback closed this in 8439d28 Apr 29, 2016
@sinhrks sinhrks deleted the sparse_valuecounts branch April 29, 2016 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type Sparse Sparse Data Type Timezones Timezone data dtype

3 participants