Skip to content

Conversation

@scottgigante
Copy link
Contributor

@scottgigante scottgigante commented May 13, 2020

@scottgigante scottgigante marked this pull request as ready for review May 13, 2020 15:56
@scottgigante
Copy link
Contributor Author

I don't have authority to request reviews, so @TomAugspurger @mroeschke @jorisvandenbossche

@jreback jreback added the Sparse Sparse Data Type label May 14, 2020
@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from a4744ea to 3c96b60 Compare May 14, 2020 15:15
@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label May 14, 2020
@jreback jreback added this to the 1.1 milestone May 14, 2020
@jorisvandenbossche
Copy link
Member

I am not sure this fix is entirely correct. While it certainly fixed the original buggy case, like:

In [7]: df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) In [8]: df1.loc[df1['B'] != 2] Out[8]: A B 0 0 1 2 0 3 

(before, this gave NaNs in the first column).

But it also seems to introduce a new bug, eg when doing a reindex:

In [9]: df1.reindex([0,1,3]) Out[9]: A B 0 0 1.0 1 0 2.0 3 0 NaN 

This should give a NaN and not 0 at row 3, I think?

@scottgigante
Copy link
Contributor Author

So here's the problem as far as I understand: in https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1219 , blk.fill_value is used to mean the value which is used when you take an index that doesn't exist. In sparse, dtype.fill_value is used to mean the value which is used for filling sparse entries. I don't see an easy way of disentangling these.

@jorisvandenbossche
Copy link
Member

The naming is potentially a bit confusing here (fill_value in take is generally pointing to what is called "na_value" elsewhere).
But so the question is: for SparseArray, should we be using the "SparseDtype.fill_value" when doing a reindex, or rather the "SparseDtype.na_value". It's not directly clear to me which of the two is correct / desired.

Using "na_value" feels more correct (that's what reindexing does for all other array types), but using "fill_value" can avoid densifying the sparse array in a reindexing operation.

Looking at SparseArray.take itself:

In [15]: a = pd.arrays.SparseArray([1.0, 0.0, np.nan, 0.0], fill_value=0.0) In [16]: a Out[16]: [1.0, 0.0, nan, 0.0] Fill: 0.0 IntIndex Indices: array([0, 2], dtype=int32) In [18]: a.take([0, 1, -1, -1], allow_fill=True) Out[18]: [1.0, 0.0, nan, nan] Fill: 0.0 IntIndex Indices: array([0, 2, 3], dtype=int32) 

Here it is clearly "na_value" that is used by default.

@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from 3c96b60 to 0e8c204 Compare May 17, 2020 21:39
@scottgigante
Copy link
Contributor Author

I'm fundamentally unsure how to make this happen, if this is the desired functionality:

>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) >>> df1.reindex([0,1,3]) A B 0 0 1.0 1 0 2.0 3 NaN NaN 
@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch 2 times, most recently from e8d7648 to c161ae9 Compare May 18, 2020 01:24
@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from c161ae9 to 5188e03 Compare May 20, 2020 14:21
@jreback
Copy link
Contributor

jreback commented May 25, 2020

this is fine, can merge on green.

@jreback jreback merged commit a94b13a into pandas-dev:master Jun 1, 2020
@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

thanks @scottgigante

@jorisvandenbossche
Copy link
Member

@jreback as shown by the discussion above (and @scottgigante's "I'm fundamentally unsure how to make this happen"), this PR was clearly not ready to be merged.

@scottgigante sorry for the slow follow-up. It's however a complicated topic with not directly a clear easy solution (as you experienced yourself), so it requires some time to delve in, which I didn't find yet

@jorisvandenbossche
Copy link
Member

@jreback can you react here? Otherwise I am planning to revert this change. It's not clear to me that this is the right fix, as it possibly causes another regression

@jreback
Copy link
Contributor

jreback commented Jun 3, 2020

@jorisvandenbossche I am unclear what you think is the problem. This closed a few issues w/o causing others. If you want to revert ok by me, but the soln looked fine.

@scottgigante
Copy link
Contributor Author

scottgigante commented Jun 3, 2020

@jreback @jorisvandenbossche it's not clear to me that the (unintended) consequence of this PR is a regression but it's worth discussing: what should be the outcome of this action?

df1 = pd.DataFrame({"A": pd.arrays.SparseArray([1, 2, 0]), 'B': [1,2,3]}) df1.reindex([0,1,3]) 

Current behaviour:

>>> df1.reindex([0,1,3]) A B 0 1.0 1.0 1 2.0 2.0 3 NaN NaN >>> df1.reindex([0, 1, 3])['A'].sparse.density 1.0 

New behaviour:

>>> df1.reindex([0,1,3]) A B 0 1.0 1.0 1 2.0 2.0 3 0.0 NaN >>> df1.reindex([0, 1, 3])['A'].sparse.density 0.6666666666666666 

I don't know that filling a sparse array with non-sparse values when reindexing by a nonexistent index is the right thing to do, but if that is desirable then this is a regression (and I don't know how to fix it).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 4, 2020

Sparse aside: in principle, reindexing always introduces missing values when new labels are present in the indexer.
(as the non sparse column B does in the example above)

So the question for sparse is: do we follow that rule, or do we deviate by using the "fill value" instead of a "missing value" (na_value).

And as I showed in #34158 (comment), SparseArray.take uses na_value and not fill_value. If we think this behaviour is correct, then reindex should also do that IMO

@TomAugspurger
Copy link
Contributor

Reindex should always introduce missing values. This behavior on master looks incorrect:

In [6]: pd.Series([0, 1, 2], dtype=pd.SparseDtype(int)).reindex([0, 1, 3]) Out[6]: 0 0.0 1 1.0 3 NaN dtype: Sparse[float64, 0] In [7]: pd.DataFrame({"A": pd.SparseArray([0, 1, 2], dtype=pd.SparseDtype(int))}).reindex([0, 1, 3]) /Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: The pandas.SparseArray class is deprecated and will be removed from pandas in a future version. Use pandas.arrays.SparseArray instead. #!/Users/taugspurger/Envs/pandas-dev/bin/python Out[7]: A 0 0 1 1 3 0 In [8]: _.dtypes Out[8]: A Sparse[int64, 0] dtype: object

Series is correct, but DataFrame is filling with fill_value rather than the na value.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jul 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Sparse Sparse Data Type

4 participants