Skip to content

Conversation

@babky
Copy link
Contributor

@babky babky commented May 22, 2018

The nonzero operation returned the nonzero locations of the underlying index. However we need to get the nonzero locations in the real array. For this operation to be faster an inverse index structure would be beneficial or it could be implemented using binary search.

sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0])

returned 0, 3, 7. The index is shifted by two because of the two first NaNs and that's why the 0, 3, 7 are returned. The correct result would be 2, 5, 9 and is found in the method.

For the above sample the code works. However for other implementations of SparseIndex it could be broken.

The nonzero operation returned the nonzero locations of the underlying index. However we need to get the nonzero locations in the real array. For this operation to be faster an inverse index structure would be beneficial or it could be implemented using binary search. ```python sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]) ``` returned `0, 3, 7`. The index is shifted by two because of the two first `NaN`s and that's why the `0, 3, 7` are returned. The correct result would be `2, 5, 9` and is found in the method.
@pep8speaks
Copy link

pep8speaks commented May 22, 2018

Hello @babky! Thanks for updating the PR.

Comment last updated on November 08, 2018 at 09:20 Hours UTC
@mroeschke mroeschke added the Sparse Sparse Data Type label May 23, 2018
@jbrockmendel
Copy link
Member

@babky int32 isn’t imported, needs to be changed to np.int32.

@jbrockmendel
Copy link
Member

@babky small fixup needed to get this working

@babky
Copy link
Contributor Author

babky commented Aug 20, 2018

hi, i will fix them in near future, was on a vacation...

@babky babky closed this Aug 20, 2018
@babky babky reopened this Aug 20, 2018
@jreback
Copy link
Contributor

jreback commented Aug 20, 2018

needs tests, whatsnew entry, and should be a vectorized soln.

@jorisvandenbossche
Copy link
Member

cc @TomAugspurger pinging you here since you are working on sparse, to make sure this PR would not conflict with your work

@TomAugspurger
Copy link
Contributor

At a glance, this approach won’t work with my implementation since calls super to get ndarray.nonzero

@babky
Copy link
Contributor Author

babky commented Nov 8, 2018

The issue looks gone in the current master. I've added some unit tests to cover the issue. From my really fast overlook of the tests I think that the tests could be added.

@codecov
Copy link

codecov bot commented Nov 8, 2018

Codecov Report

Merging #21175 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@ Coverage Diff @@ ## master #21175 +/- ## ========================================== + Coverage 92.25% 92.25% +<.01%  ========================================== Files 161 161 Lines 51390 51390 ========================================== + Hits 47411 47412 +1  + Misses 3979 3978 -1
Flag Coverage Δ
#multiple 90.65% <ø> (ø) ⬆️
#single 42.31% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/sparse.py 91.94% <0%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db2066b...2c912f1. Read the comment docs.

@babky
Copy link
Contributor Author

babky commented Nov 9, 2018

@jbrockmendel @TomAugspurger: I believe the issue is gone in master branch and I just added the tests which should prevent a regression.

s.iloc[indexer]


class TestSparseArray(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to pandas/tests/arrays/sparse/test_array.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both things completed 👍, sorry for it.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a whatsnew in 0.24.0.txt? I don't see one saying that we fixed nonzero.

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 11, 2018
@TomAugspurger
Copy link
Contributor

Moved it one more time. We reorganized things.

- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`)
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`)
- Bug in ``SparseArray.nonzero`` and `SparseDataFrame.dropna` returning shifted/invalid results (:issue:`21172`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add :func: to reference the api of these

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

tm.assert_sp_array_equal(res, exp)

def test_nonzero(self):
sa = pd.SparseArray([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number

2, 0, 0, 0,
3, 0, 0
])
tm.assert_numpy_array_equal(np.array([2, 5, 9], dtype=np.int32),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use

result = expected = tm.assert_numpy_array_equal(result, expected) 
assert type(res[column]) is SparseSeries

def test_dropna(self):
tm.assert_sp_frame_equal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the result= and expected= format

add the issue number as a comment

pd.SparseDataFrame({"F2": [0, 1]}),
pd.SparseDataFrame(
{"F1": [float('nan'), float('nan')], "F2": [0, 1]}
).dropna(axis=1, inplace=False, how='all')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test for inplace=True/False, how='all', 'any'

via parametrization

@babky
Copy link
Contributor Author

babky commented Nov 16, 2018

@jreback the comments should be resolved now

df_info.txt Outdated
@@ -0,0 +1,8 @@
<class 'pandas.core.frame.DataFrame'>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committed by accident?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :-( 👍

- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`)
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`)
- Bug in :funct:``SparseArray.nonzero`` and :func:``SparseDataFrame.dropna`` returning shifted/invalid results (:issue:`21172`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function -> func.

Just single backticks around SparseArray.nonzero and SparseDataFrame.dropna

def test_dropna(self):
# Tests regression #21172.
expected = pd.SparseDataFrame({"F2": [0, 1]})
for inplace, how in product((True, False), ('all', 'any')):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be parametrized over two hings: inplace and how: https://docs.pytest.org/en/latest/parametrize.html#pytest-mark-parametrize-parametrizing-test-functions

@pytest.mark.parametrize("inplcae", [True, False]) @pytest.mark.parametrize("how", ["all", "any"]) def test_dropna(self, how, inplace): ....
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch, did not know that, thank you 🌮

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you remove the df_info file and ping on green.

@jreback jreback merged commit 3abff0d into pandas-dev:master Nov 17, 2018
@jreback
Copy link
Contributor

jreback commented Nov 17, 2018

thanks @babky

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Sparse Sparse Data Type

7 participants