Allows for merging of SparseDataFrames, and fixes `__array__` interface #19488
---

```diff
@@ -238,6 +238,7 @@ Other Enhancements
 
 - ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
 - :class:`IntervalIndex` and its associated constructor methods (``from_arrays``, ``from_breaks``, ``from_tuples``) have gained a ``dtype`` parameter (:issue:`19262`)
+- :func:`pandas.merge` now supports merging of :class:`SparseDataFrame` (:issue:`13665`)
 
 .. _whatsnew_0230.api_breaking:
 
```

```diff
@@ -555,7 +556,7 @@ Sparse
 
 - Bug in which creating a ``SparseDataFrame`` from a dense ``Series`` or an unsupported type raised an uncontrolled exception (:issue:`19374`)
 - Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
--
+- Bug in :class:`SparseSeries.__array__` returning only non-fill values (:issue:`13665`)
 
 Reshaping
 ^^^^^^^^^
```

```diff
@@ -591,3 +592,4 @@ Other
 ^^^^^
 
 - Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
+- Improved ``algorithms.take_1d`` handling of ``SparseArray`` (:issue:`19506`)
```
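The headline entry in one runnable sketch: a minimal example, assuming a pre-1.0 pandas where `SparseDataFrame` and `DataFrame.to_sparse` still exist (the frame contents here are invented):

```python
import pandas as pd

# Two frames sharing the join key 'A', converted to SparseDataFrames.
evens = pd.DataFrame({'A': range(0, 20, 2), 'B': range(10)}).to_sparse()
threes = pd.DataFrame({'A': range(0, 30, 3), 'C': range(10)}).to_sparse()

# With this PR applied, merging sparse frames should behave like the
# dense case instead of failing (:issue:`13665`).
result = threes.merge(evens, how='left', on='A')
print(result.head())
```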
---

```diff
@@ -1315,6 +1315,11 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
     undefined if allow_fill == False and -1 is present in indexer.
     """
 
+    if is_sparse(arr):
+        return take_nd(arr.get_values(), indexer, axis=axis, out=out,
+                       fill_value=fill_value, mask_info=mask_info,
+                       allow_fill=allow_fill)
+
     # dispatch to internal type takes
     if is_categorical(arr):
         return arr.take_nd(indexer, fill_value=fill_value,
```

> **Contributor (Author):** Something that is a bit of a hotspot, IMHO, is that inside of algos it iterates through the memoryview and could cause some weird security / segfault issues. This fills that hole, but we really should fix it at a lower level.
>
> **Contributor:** Hmm, this seems sub-optimal. We shouldn't have to densify just to slice, right? Can you take a look at … And since you're touching performance-related code, could you run the sparse-related ASVs?
>
> **Contributor (Author):** Thank you for pointing this out; exactly what I was looking for. Will run the ASVs.
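The new branch simply densifies and recurses. A rough sketch of what that dispatch amounts to, assuming a pandas of this era where `SparseArray.get_values()` returns the dense ndarray:

```python
import numpy as np
import pandas as pd

arr = pd.SparseArray([1.0, np.nan, 3.0])

# What the new take_nd branch effectively does: densify first, then let
# the plain ndarray take path handle the indexer.
dense = arr.get_values()
print(dense.take([0, 2]))   # -> [1. 3.]
```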
---

```diff
@@ -3089,6 +3089,7 @@ def make_block(values, placement, klass=None, ndim=None, dtype=None,
         # GH#19265 pyarrow is passing this
         warnings.warn("fastpath argument is deprecated, will be removed "
                       "in a future release.", DeprecationWarning)
+
     if klass is None:
         dtype = dtype or values.dtype
         klass = get_block_type(values, dtype)
```

```diff
@@ -5304,6 +5305,22 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
         elif is_uniform_join_units(join_units):
             b = join_units[0].block.concat_same_type(
                 [ju.block for ju in join_units], placement=placement)
+        elif is_sparse_join_units(join_units):
+            values = concatenate_join_units(join_units, concat_axis, copy=copy)
+
+            if len(values.shape) == 2:
+                values = values[0]
+            else:
+                assert len(values.shape) == 1
+
+            block = join_units[0].block
+
+            if block:
+                fill_value = block.fill_value
+            else:
+                fill_value = np.nan
+            array = SparseArray(values, fill_value=fill_value)
+            b = make_block(array, klass=SparseBlock, placement=placement)
         else:
             b = make_block(
                 concatenate_join_units(join_units, concat_axis, copy=copy),
```

> **Contributor:** Do you ever go down the initial …
>
> **Contributor:** This is a mess.
>
> **Contributor (Author):** Yes, yes it is. I'll work on cleaning this up tomorrow, since there are too many branches in this code.
```diff
@@ -5313,6 +5330,18 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
     return BlockManager(blocks, axes)
 
 
+def is_sparse_join_units(join_units):
+    """
+    Check if all of the join units are sparse. This leads to building
+    SparseArray over dense array representations so that we can merge
+    SparseSeries / SparseDataFrame.
+
+    This is very similar to how pandas.concat works for concatenating two
+    SparseDataFrame / SparseSeries objects.
+    """
+    return all(type(ju.block) is SparseBlock for ju in join_units)
+
+
 def is_uniform_join_units(join_units):
     """
     Check if the join units consist of blocks of uniform type that can
```

> **Contributor:** A docstring would be nice, at least noting that this is true only if all blocks are sparse.
>
> **Contributor (Author):** This is very similar to (and maybe could be DRYed up with) `pd.concat([sparse, sparse])`.
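As the author notes, the new branch mirrors what `pd.concat` already does for sparse inputs. A quick check of that existing behavior, assuming a pre-1.0 pandas (exact output may vary by version):

```python
import pandas as pd

sdf1 = pd.DataFrame({'A': [0, 0, 1]}).to_sparse(fill_value=0)
sdf2 = pd.DataFrame({'A': [2, 0, 0]}).to_sparse(fill_value=0)

# Concatenating sparse frames keeps the result sparse; the new
# is_sparse_join_units branch aims to give block concatenation during
# merge the same property.
out = pd.concat([sdf1, sdf2])
print(type(out).__name__)   # expected: SparseDataFrame
```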
```diff
@@ -5686,7 +5715,10 @@ def is_na(self):
     def get_reindexed_values(self, empty_dtype, upcasted_na):
         if upcasted_na is None:
             # No upcasting is necessary
-            fill_value = self.block.fill_value
+
+            # You would think that you want self.block.fill_value here
+            # But in reality that will fill with a bunch of wrong values
+            fill_value = np.nan
             values = self.block.get_values()
         else:
             fill_value = upcasted_na
```

> **Contributor (Author):** This was surprising to me. If you don't pass in `np.nan`, what ends up happening is that if you merge two sparse frames it will concat with a bunch of …
>
> **Contributor:** huh?
>
> **Contributor (Author):** Yeah, this confused me a bunch.
>
> ```python
> evens = pd.DataFrame({'A': range(0, 20, 2), 'B': range(10)})
> threes = pd.DataFrame({'A': range(0, 30, 3), 'B': range(10)})
> threes.merge(evens, how="left", on="A")
> ```
>
> This yields what you'd expect for a dense dataframe, but if you sparsify it with a `fill_value` other than `np.nan` / `None` you get weird results:
>
> ```python
> evens = pd.DataFrame({'A': range(0, 20, 2), 'B': range(10)}).to_sparse(fill_value=8675309)
> threes = pd.DataFrame({'A': range(0, 30, 3), 'B': range(10)}).to_sparse(fill_value=90210)
> threes.merge(evens, how="left", on="A")
> ```
>
> Is that expected behavior?
---

```diff
@@ -38,6 +38,8 @@
     concatenate_block_managers)
 from pandas.util._decorators import Appender, Substitution
 
+from pandas.core.sparse.array import SparseArray
+
 from pandas.core.sorting import is_int64_overflow_possible
 import pandas.core.algorithms as algos
 import pandas.core.sorting as sorting
```

```diff
@@ -665,7 +667,6 @@ def _maybe_restore_index_levels(self, result):
         result.set_index(names_to_restore, inplace=True)
 
     def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
-
         left_has_missing = None
         right_has_missing = None
 
```

```diff
@@ -731,7 +732,11 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
                 if mask.all():
                     key_col = rvals
                 else:
-                    key_col = Index(lvals).where(~mask, rvals)
+                    # Might need to be IntIndex not Index
+                    if isinstance(lvals, SparseArray):
+                        key_col = Index(lvals.get_values()).where(~mask, rvals)
+                    else:
+                        key_col = Index(lvals).where(~mask, rvals)
 
                 if result._is_label_reference(name):
                     result[name] = key_col
```

> **Contributor:** Don't do this, use …
>
> **Contributor (Author):** I'm not sure if this has memory or performance issues, but it is the best solution I could come up with. The other option would be to look at using `lvals.sp_index` and implementing a `where` on it that works. One thing I have noticed is that `IntIndex` doesn't act quite like `Index`, which makes doing these masks tricky in sparse land.
>
> **Contributor:** It'd be nice to avoid …
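A toy version of the masking pattern in the new branch, showing why the sparse left keys are densified before `where` is applied. The arrays here are invented; in the real code `mask` marks rows with no match on the left side:

```python
import numpy as np
import pandas as pd

lvals = pd.SparseArray([0.0, np.nan, 2.0, np.nan])  # left join keys (sparse)
rvals = np.array([10.0, 11.0, 12.0, 13.0])          # right join keys (dense)
mask = np.isnan(lvals.get_values())                 # rows missing on the left

# Densify the sparse keys, then keep them where present and fall back
# to the right-hand keys elsewhere, as in the new isinstance branch.
key_col = pd.Index(lvals.get_values()).where(~mask, rvals)
print(list(key_col))   # [0.0, 11.0, 2.0, 13.0]
```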
---

```diff
@@ -38,7 +38,6 @@
 from pandas.util._decorators import Appender
 from pandas.core.indexes.base import _index_shared_docs
 
-
 _sparray_doc_kwargs = dict(klass='SparseArray')
 
 
```

```diff
@@ -259,6 +258,7 @@ def __array_wrap__(self, out_arr, context=None):
             ufunc, args, domain = context
             # to apply ufunc only to fill_value (to avoid recursive call)
             args = [getattr(a, 'fill_value', a) for a in args]
+
             with np.errstate(all='ignore'):
                 fill_value = ufunc(self.fill_value, *args[1:])
         else:
```

```diff
@@ -292,9 +292,9 @@ def __setstate__(self, state):
         self._fill_value = fill_value
 
     def __len__(self):
-        try:
+        if hasattr(self, 'sp_index'):
             return self.sp_index.length
-        except:
+        else:
             return 0
 
     def __unicode__(self):
```

> **Contributor (Author):** Using a bare try / except to catch something here felt really icky to me... Feel free to push back on that change.
>
> **Contributor:** How was this failing before?
>
> **Contributor (Author):** I have no clue! 😄 I couldn't figure out why the try / except was there in the first place, to be honest.
>
> **Contributor:** huh?
>
> **Contributor (Author):** This isn't essential to this PR; I just don't understand why this code was here in the first place. Meaning I can take it out to bring down the code changes.
---

```diff
@@ -73,6 +73,10 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None,
             if columns is None:
                 raise Exception("cannot pass a series w/o a name or columns")
             data = {columns[0]: data}
+        elif isinstance(data, BlockManager):
+            fill_value_size = len(set(b.fill_value for b in data.blocks))
+            if default_fill_value is None and fill_value_size == 1:
+                default_fill_value = data.blocks[0].fill_value
 
         if default_fill_value is None:
             default_fill_value = np.nan
```

> **Contributor (Author):** OK, I need feedback on this change. Basically what this does is: if every SparseSeries's `fill_value` is the same, it sets the `default_fill_value` to that. This seems really intuitive to me, but I trust your judgement. It also makes testing a hell of a lot easier, which is why I did it. I would argue pretty heavily that as a user, if I added sparse series that all had `fill_value=0`, then I would expect the sparse data frame to have `default_fill_value = 0` as well.
>
> **Contributor:** Yes, this is the only reason to have a standalone …
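A sketch of the behavior the author is arguing for, assuming a pre-1.0 pandas. Note the inference in the diff fires on the internal `BlockManager` path (e.g. during merge or concat), so whether this exact constructor call reaches it is an assumption:

```python
import pandas as pd

a = pd.Series([0, 0, 1, 2]).to_sparse(fill_value=0)
b = pd.Series([0, 3, 0, 4]).to_sparse(fill_value=0)

sdf = pd.SparseDataFrame({'a': a, 'b': b})
# Under the proposed inference, a frame built from series that all share
# fill_value=0 would report default_fill_value == 0 rather than NaN.
print(sdf.default_fill_value)
```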
---

```diff
@@ -175,7 +175,7 @@ def values(self):
 
     def __array__(self, result=None):
         """ the array interface, return my values """
-        return self.block.values
+        return self.block.values.values
 
     def get_values(self):
         """ same as values """
```

> **Contributor (Author):** This will return the dense version of SparseSeries when calling …
>
> **Contributor:** We need some more ASVs for sparse, I think.
```diff
@@ -271,6 +271,7 @@ def __array_wrap__(self, result, context=None):
 
         See SparseArray.__array_wrap__ for detail.
         """
+
         if isinstance(context, tuple) and len(context) == 3:
             ufunc, args, domain = context
             args = [getattr(a, 'fill_value', a) for a in args]
```

```diff
@@ -279,8 +280,18 @@ def __array_wrap__(self, result, context=None):
         else:
             fill_value = self.fill_value
 
+        # GH 14167
+        # Since we are returning a dense representation of
+        # SparseSeries, sparse_index might not align when calling
+        # a ufunc on the array. There doesn't seem to be a better way
+        # to do this, unfortunately.
+        if len(result) != self.sp_index.npoints:
+            sparse_index = None
+        else:
+            sparse_index = self.sp_index
+
         return self._constructor(result, index=self.index,
-                                 sparse_index=self.sp_index,
+                                 sparse_index=sparse_index,
                                  fill_value=fill_value,
                                  copy=False).__finalize__(self)
```

> **Contributor (Author):** This wasn't intuitive to me, but the `__array*__` functions defined in numpy don't work quite as expected in certain circumstances. What would happen is that the dense version of the array would have 3 npoints when really it should only have 2. Zeroing out the sparse_index fixes the problem, because the constructor then relies on the dense index instead, going around the problem.
>
> **Contributor:** These should simply operate on the sp_values, not on the densified values (or maybe it's possible that some classes of ufuncs could operate on the dense values, but not sure).
>
> **Contributor (Author):** So the expected behavior is …
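To see the guard in action, a hedged ufunc round trip, assuming a pre-1.0 pandas (whether the npoints mismatch actually occurs depends on the ufunc and fill_value):

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 0.0, 0.0, 2.0]).to_sparse(fill_value=0.0)

# np.abs densifies via __array__, so the result can have length 4 while
# s.sp_index.npoints is 2; __array_wrap__ then discards the stale
# sparse_index and lets the constructor rebuild it from the dense data.
print(np.abs(s))
```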
```diff
@@ -402,8 +413,8 @@ def abs(self):
         -------
         abs: type of caller
         """
-        return self._constructor(np.abs(self.values),
-                                 index=self.index).__finalize__(self)
+
+        return np.abs(self)
 
     def get(self, label, default=None):
         """
```

```diff
@@ -544,7 +555,7 @@ def to_dense(self, sparse_only=False):
             index = self.index.take(int_index.indices)
             return Series(self.sp_values, index=index, name=self.name)
         else:
-            return Series(self.values.to_dense(), index=self.index,
+            return Series(self.get_values(), index=self.index,
                           name=self.name)
 
     @property
```
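Both simplified methods lean on the dense path; a minimal usage sketch, assuming a pre-1.0 pandas:

```python
import pandas as pd

s = pd.Series([1.0, 0.0, 2.0]).to_sparse(fill_value=0.0)

print(s.to_dense())   # dense Series, now built from get_values()
print(s.abs())        # abs now routes through np.abs(self)
```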
| | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify: merging sparse to sparse? Sparse and dense?