-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
Series.value_counts: Preserve original ordering #24302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Hello @tomspur! Thanks for updating the PR.
Comment last updated on December 23, 2018 at 20:34 Hours UTC |
| tm.assert_series_equal(Series(vc.index), s) | ||
| | ||
| | ||
| def test_original_ordering_value_counts2(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have tests for value_counts, don’t add a new file just change the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'll move them to that location as well.
| raise ValueError("This Series is a view of some other array, to " | ||
| "sort in-place you must create a copy") | ||
| | ||
| def _try_kind_sort(arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why you changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't get it working to separate the sorting and ascending vs non-ascending ordering further below in the code and changed it to pass it on to nargsort that does both at once
pandas/core/algorithms.py Outdated
| # Use same index as the original values | ||
| if result.index.isna().sum() > 0: | ||
| fill_value = result[result.index.isna()].values[0] | ||
| result = result.reindex(unique(values), fill_value=fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? this is way beyond scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that _value_counts_arraylike replaces all generic nan values (e.g. None) with np.NaN and reindexing has a mismatch between the original None and the new np.NaN in the index. The fill_value replaces the new missing value with the original value again.
Alternatively, it is also possible to use a generic values = values.fillna(np.NaN) to replace the Nones with np.NaNs if there are nans in the index. I'll push a follow up commit about this in a bit. Would that be better than the above?
Codecov Report
@@ Coverage Diff @@ ## master #24302 +/- ## =========================================== - Coverage 92.28% 43% -49.28% =========================================== Files 162 162 Lines 51827 51829 +2 =========================================== - Hits 47827 22289 -25538 - Misses 4000 29540 +25540
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@ ## master #24302 +/- ## ========================================== + Coverage 92.3% 92.3% +<.01% ========================================== Files 163 163 Lines 51947 51953 +6 ========================================== + Hits 47949 47955 +6 Misses 3998 3998
Continue to review full report at Codecov.
|
pandas/core/algorithms.py Outdated
| result = Series(counts, index=keys, name=name) | ||
| | ||
| # Use same index as the original values | ||
| if result.index.isna().sum() > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this section. Are there cases where values.fillna(np.nan) does anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What basically happens a few lines further is a reindex of a value_counts:
vals = [1, None, 2, 2] vals.value_counts(dropna=False).reindex(pd.Series(pd.unique([1, None, 2, 2]))) The value_counts transforms the None into NaN in the index, so that the count will be NaN as well (and this value count is removed from the returning reindexed Series).
The fillna is used to transform the None from the reindex as well, so that this is executed instead, so that the resulting Series still contains the value count:
pd.Series(vals).value_counts(dropna=False).reindex(pd.Series(pd.unique(vals)).fillna(np.NaN) I hope this makes sense, because I am quite confused as well from times. If you have another solution to fix this edge case, please let me know...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomspur I am not sure what you are doing here at all. see the OP for the solution, which is a simple re-index. this PR is way out of scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
The re-index seems to work for sort=False and non-categorical indices and I could add that in this PR.
I wanted to further fix this for sort=True, when several values have the same count, e.g. this from the tests that were added in this PR:
In [2]: Series(list('bacaef')) In [3]: s.value_counts(sort=True) Out[3]: a 2 c 1 e 1 b 1 f 1 dtype: int64 And I wanted to have them in the order abcef as well in that case.
I'll have a look then at sort=False for now as you suggested and leave sort=True for another PR
jreback left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments
a119a45 to 0dc7a21 Compare Ensure that value_counts returns the same ordering of the indices than the input object when sorting the values no matter if it is ascending or descending. This fixes pandas-dev#12679.
0dc7a21 to e966aa7 Compare doc/source/whatsnew/v0.24.0.rst Outdated
| | ||
| - Bug where C variables were declared with external linkage causing import errors if certain other C libraries were imported before Pandas. (:issue:`24113`) | ||
| - Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`) | ||
| - :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to api breaking changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and pushed
pandas/core/algorithms.py Outdated
| result = result.sort_values(ascending=ascending) | ||
| elif bins is None: | ||
| uniq = unique(values) | ||
| if not isinstance(result.index, CategoricalIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check needed? (or maybe it just needs to be for an ordered categorical)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue was the test TestCategoricalSeriesAnalytics.test_value_counts:
cats = Categorical(list('abcccb'), categories=list('cabd')) s = Series(cats, name='xxx') res = s.value_counts(sort=False) which returns 0 for the dcategory as well, which is not in the unique(values). Is there another possibility to get access to that initial categorical from the index?
pandas/tests/test_algos.py Outdated
| if not compat.is_platform_32bit(): | ||
| tm.assert_series_equal(result, expected) | ||
| | ||
| def test_value_counts_nonsorted_single_occurance(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paramterize on sort
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and pushed
pandas/tests/test_algos.py Outdated
| vc = s.value_counts(sort=False, ascending=True) | ||
| tm.assert_series_equal(Series(vc.index), s) | ||
| | ||
| @pytest.mark.xfail(reason="sort=True does not guarantee the same order") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this xfail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not get it working for the sort=True case and left the tests only for possible future fixing... Would you prefer deleting them and adding them when sort=True works as well?
pandas/tests/test_algos.py Outdated
| vc = s.value_counts(sort=True, ascending=True) | ||
| tm.assert_series_equal(Series(vc.index), s) | ||
| | ||
| def test_value_counts_nonsorted_double_occurance(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parametrize these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The double_occurance tests have different expected results and I wouldn't parametrize it due to that. Or would you do this as well in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I c, ok then, parameterize over ascending though
pandas/core/algorithms.py Outdated
| if sort: | ||
| result = result.sort_values(ascending=ascending) | ||
| elif bins is None: | ||
| uniq = unique(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so do
uniq = uniques(values) for the non-EA case (above)
and do uniq = Series(values)._values.unique() for the EA case. though this means computing it twice. maybe have to work on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it above, although it is not computed twice now. Did you mean it like this?
| if sort: | ||
| result = result.sort_values(ascending=ascending) | ||
| elif bins is None: | ||
| if not isinstance(result.index, ABCCategoricalIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should not be necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It unfortunately is for cases, where the categories contain more elements than the values: #24302 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not necessary otherwise the impl is incorrect
categoricals uniq already handles this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem so as unique does not contain d in this example:
In [31]: s = pd.Series(pd.Categorical(list('baabc'), categories=list('abcd'))) In [32]: s.value_counts() Out[32]: b 2 a 2 c 1 d 0 dtype: int64 In [33]: pd.unique(s) Out[33]: [b, a, c] Categories (3, object): [b, a, c] There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L2267 also mentions that unused categories are not returned.
It would be consistent to not return it in the value_counts in the above example as well, but would change the current behaviour... What do you think?
pandas/tests/test_algos.py Outdated
| vc = s.value_counts(sort=True, ascending=True) | ||
| tm.assert_series_equal(Series(vc.index), s) | ||
| | ||
| def test_value_counts_nonsorted_double_occurance(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I c, ok then, parameterize over ascending though
1dded41 to 355f872 Compare
WillAyd left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge master?
| - :func:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`) | ||
| - The column order of the resultant :class:`DataFrame` from :meth:`MultiIndex.to_frame` is now guaranteed to match the :attr:`MultiIndex.names` order. (:issue:`22420`) | ||
| - :func:`pd.offsets.generate_range` argument ``time_rule`` has been removed; use ``offset`` instead (:issue:`24157`) | ||
| - :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to 0.25
| Closing as stale. Ping if you'd like to continue |
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.
git diff upstream/master -u -- "*.py" | flake8 --diff