SparseArray is an ExtensionArray #22325
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -373,6 +373,34 @@ is the case with :attr:`Period.end_time`, for example | |
| | ||
| p.end_time | ||
| | ||
| .. _whatsnew_0240.api_breaking.sparse_values: | ||
| | ||
| ``SparseArray`` is now an ``ExtensionArray`` | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| | ||
| ``SparseArray`` now implements the ``ExtensionArray`` interface (:issue:`21978`, :issue:`19056`, :issue:`22835`). | ||
| To conform to this interface, and for consistency with the rest of pandas, some API breaking | ||
| changes were made: | ||
| | ||
| - ``SparseArray`` is no longer a subclass of :class:`numpy.ndarray`. To convert a SparseArray to a NumPy array, use :meth:`numpy.asarray`. | ||
| - ``SparseArray.dtype`` and ``SparseSeries.dtype`` are now instances of :class:`SparseDtype`, rather than ``np.dtype``. Access the underlying dtype with ``SparseDtype.subtype``. | ||
| - :meth:`numpy.asarray(sparse_array)` now returns a dense array with all the values, not just the non-fill-value values (:issue:`14167`) | ||
| - ``SparseArray.take`` now matches the API of :meth:`pandas.api.extensions.ExtensionArray.take` (:issue:`19506`). | ||
| * The default value of ``allow_fill`` has changed from ``False`` to ``True``. | ||
| * The ``out`` and ``mode`` parameters are no longer accepted (previously, this raised if they were specified). | ||
| * Passing a scalar for ``indices`` is no longer allowed. | ||
| - The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``. | ||
| - ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer support combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray. | ||
| - Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed. | ||
Member: I can't remember if I asked before, but do we actually want this? I don't think the above makes much sense, so not sure this is good to allow. For me it seems logical to restrict the fill_value to the same dtype as the data.
Contributor (author): The somewhat strange thing is that on master we do allow that in the SparseArray constructor:
    In [13]: s = pd.SparseArray([1, 2, 0], fill_value=np.nan)
    In [14]: s
    Out[14]:
    [1, 2, 0]
    Fill: nan
    IntIndex
    Indices: array([0, 1, 2], dtype=int32)
I don't have strong opinions here, other than that people shouldn't be setting ...
Contributor: I agree the fill type should match the dtype, but since missing value support is allowed here it is prob ok.
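Not part of the PR diff: a minimal sketch illustrating the breaking changes listed above, assuming a pandas build (0.24.0+) that includes this change. Values in the comments are expected output, not captured results.

```python
# Minimal sketch of the new SparseArray behaviour described in the list
# above (assumes pandas >= 0.24).
import numpy as np
import pandas as pd

arr = pd.SparseArray([1.0, np.nan, 2.0])

# No longer an ndarray subclass; use np.asarray to get a dense copy with
# *all* values, not just the non-fill values.
print(isinstance(arr, np.ndarray))  # False
print(np.asarray(arr))              # [ 1. nan  2.]

# The dtype is now a SparseDtype wrapping the underlying numpy dtype.
print(arr.dtype)          # Sparse[float64, nan]
print(arr.dtype.subtype)  # float64

# take follows the ExtensionArray API: with allow_fill=True, -1 means
# "missing" and is replaced by the fill value rather than indexing from
# the end of the array.
print(arr.take([0, -1], allow_fill=True))  # [1.0, nan]
```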
| | ||
| | ||
| Some new warnings are issued for operations that require or are likely to materialize a large dense array: | ||
| | ||
| - A :class:`errors.PerformanceWarning` is issued when using fillna with a ``method``, as a dense array is constructed to create the filled array. Filling with a ``value`` is the efficient way to fill a sparse array. | ||
| - A :class:`errors.PerformanceWarning` is now issued when concatenating sparse Series with differing fill values. The fill value from the first sparse array continues to be used. | ||
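A hedged sketch (not from the PR) of the first warning described above; the exact warning message is not quoted from the implementation, and the behaviour assumes pandas >= 0.24.

```python
# Filling a sparse array with a method densifies it internally and issues a
# PerformanceWarning; filling with a scalar value stays sparse and is the
# efficient path.
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

arr = pd.SparseArray([1.0, np.nan, np.nan, 2.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    arr.fillna(method="ffill")  # constructs a dense array internally
print(any(issubclass(w.category, PerformanceWarning) for w in caught))  # True

arr.fillna(0.0)  # value-based fill: no warning, result stays sparse
```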
| | ||
| In addition to these API breaking changes, many :ref:`performance improvements and bug fixes have been made <whatsnew_0240.bug_fixes.sparse>`. | ||
| | ||
| .. _whatsnew_0240.api.datetimelike.normalize: | ||
| | ||
| Tick DateOffset Normalize Restrictions | ||
| | @@ -621,6 +649,7 @@ Other API Changes | |
| - :class:`pandas.io.formats.style.Styler` supports a ``number-format`` property when using :meth:`~pandas.io.formats.style.Styler.to_excel` (:issue:`22015`) | ||
| - :meth:`DataFrame.corr` and :meth:`Series.corr` now raise a ``ValueError`` along with a helpful error message instead of a ``KeyError`` when supplied with an invalid method (:issue:`22298`) | ||
| - :meth:`shift` will now always return a copy, instead of the previous behaviour of returning self when shifting by 0 (:issue:`22397`) | ||
| - Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`) | ||
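An illustrative sketch (not from the PR) of the single-row slicing entry above; the categorical columns here are only an assumed setup to show the idea, assuming pandas >= 0.24.

```python
# A DataFrame whose columns all share one extension dtype no longer coerces
# a single-row slice to object dtype.
import pandas as pd

df = pd.DataFrame({
    "a": pd.Categorical(["x", "y"], categories=["x", "y"]),
    "b": pd.Categorical(["y", "x"], categories=["x", "y"]),
})

row = df.iloc[0]
print(row.dtype)  # category (previously coerced to object)
```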
| | ||
| .. _whatsnew_0240.deprecations: | ||
| | ||
| | @@ -860,13 +889,6 @@ Groupby/Resample/Rolling | |
| - :func:`RollingGroupby.agg` and :func:`ExpandingGroupby.agg` now support multiple aggregation functions as parameters (:issue:`15072`) | ||
| - Bug in :meth:`DataFrame.resample` and :meth:`Series.resample` when resampling by a weekly offset (``'W'``) across a DST transition (:issue:`9119`, :issue:`21459`) | ||
| | ||
| Sparse | ||
| ^^^^^^ | ||
| | ||
| - | ||
| - | ||
| - | ||
| | ||
| Reshaping | ||
| ^^^^^^^^^ | ||
| | ||
| | @@ -884,6 +906,20 @@ Reshaping | |
| - Bug in :func:`merge` when merging ``datetime64[ns, tz]`` data that contained a DST transition (:issue:`18885`) | ||
| - Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`) | ||
| | ||
| .. _whatsnew_0240.bug_fixes.sparse: | ||
| | ||
| Sparse | ||
| ^^^^^^ | ||
| | ||
| - Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`) | ||
Contributor: Really, we have support for this? Again, I agree this is a nice feature, but we are decreasing support generally for sparse, so not anxious to advertise this.
| - Bug in :meth:`Series.to_sparse` where a Series already holding sparse data was not constructed properly (:issue:`22389`) | ||
| - Providing a ``sparse_index`` to the ``SparseArray`` constructor no longer defaults the na value to ``np.nan`` for all dtypes. The correct ``na_value`` for ``data.dtype`` is now used. | ||
| - Bug in ``SparseArray.nbytes`` under-reporting its memory usage by not including the size of its sparse index. | ||
| - Improved performance of :meth:`Series.shift` for non-NA ``fill_value``, as values are no longer converted to a dense array. | ||
| - A ``SparseDtype`` with a boolean subtype is now considered boolean by :meth:`api.types.is_bool_dtype`. | ||
| - Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`) | ||
| - Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`) | ||
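Not part of the PR: a short sketch of two of the entries above, assuming pandas 0.24, where ``SparseSeries`` still exists.

```python
# Boolean SparseDtype now registers as boolean, and unary inversion works
# on a boolean SparseSeries.
import pandas as pd
from pandas.api.types import is_bool_dtype

print(is_bool_dtype(pd.SparseDtype(bool)))  # True

s = pd.SparseSeries([True, False, True])
print(~s)  # elementwise inversion now works (and is faster) on sparse bools
```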
| | ||
| Build Changes | ||
| ^^^^^^^^^^^^^ | ||
| | ||
| | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -93,11 +93,13 @@ def _get_series_result_type(result, objs=None): | |
| def _get_frame_result_type(result, objs): | ||
| """ | ||
| return appropriate class of DataFrame-like concat | ||
| if all blocks are SparseBlock, return SparseDataFrame | ||
| if all blocks are sparse, return SparseDataFrame | ||
| otherwise, return 1st obj | ||
| """ | ||
| | ||
| if result.blocks and all(b.is_sparse for b in result.blocks): | ||
| if (result.blocks and ( | ||
| all(is_sparse(b) for b in result.blocks) or | ||
Contributor: Related to my comment above. Can is_sparse not simply check whether it's an EA and whether it has a Sparse dtype? Then you simply need to pass the ...
Contributor (author): I'll give that a shot.
Contributor: Can you add a comment here, it's not obvious what you are doing.
Contributor: How can obj be a SparseFrame here? Is this tested?
Contributor (author): I think a comment of mine may have been lost. This is hit in several places (e.g. ...). What part can I clarify here?
| all(isinstance(obj, ABCSparseDataFrame) for obj in objs))): | ||
| from pandas.core.sparse.api import SparseDataFrame | ||
| return SparseDataFrame | ||
| else: | ||
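For context (not part of the diff), a rough usage sketch of what the branch above decides, assuming pandas 0.24, where ``SparseDataFrame`` still exists.

```python
# When every input to concat is sparse (or every obj is a SparseDataFrame),
# the concatenated result is a SparseDataFrame; otherwise the class of the
# first object is used.
import pandas as pd

sdf1 = pd.SparseDataFrame({"a": [0, 1, 0]})
sdf2 = pd.SparseDataFrame({"a": [0, 0, 2]})

result = pd.concat([sdf1, sdf2])
print(type(result).__name__)  # SparseDataFrame
```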
| | @@ -554,61 +556,23 @@ def _concat_sparse(to_concat, axis=0, typs=None): | |
| a single array, preserving the combined dtypes | ||
| """ | ||
| | ||
| from pandas.core.sparse.array import SparseArray, _make_index | ||
| from pandas.core.sparse.array import SparseArray | ||
| | ||
| def convert_sparse(x, axis): | ||
| # coerce to native type | ||
| if isinstance(x, SparseArray): | ||
| x = x.get_values() | ||
| else: | ||
| x = np.asarray(x) | ||
| x = x.ravel() | ||
| if axis > 0: | ||
| x = np.atleast_2d(x) | ||
| return x | ||
| fill_values = [x.fill_value for x in to_concat | ||
| if isinstance(x, SparseArray)] | ||
| | ||
| if typs is None: | ||
| typs = get_dtype_kinds(to_concat) | ||
| if len(set(fill_values)) > 1: | ||
| raise ValueError("Cannot concatenate SparseArrays with different " | ||
| "fill values") | ||
| | ||
| if len(typs) == 1: | ||
| # concat input as it is if all inputs are sparse | ||
| # and have the same fill_value | ||
| fill_values = {c.fill_value for c in to_concat} | ||
| if len(fill_values) == 1: | ||
| sp_values = [c.sp_values for c in to_concat] | ||
| indexes = [c.sp_index.to_int_index() for c in to_concat] | ||
| | ||
| indices = [] | ||
| loc = 0 | ||
| for idx in indexes: | ||
| indices.append(idx.indices + loc) | ||
| loc += idx.length | ||
| sp_values = np.concatenate(sp_values) | ||
| indices = np.concatenate(indices) | ||
| sp_index = _make_index(loc, indices, kind=to_concat[0].sp_index) | ||
| | ||
| return SparseArray(sp_values, sparse_index=sp_index, | ||
| fill_value=to_concat[0].fill_value) | ||
| | ||
| # input may be sparse / dense mixed and may have different fill_value | ||
| # input must contain sparse at least 1 | ||
| sparses = [c for c in to_concat if is_sparse(c)] | ||
| fill_values = [c.fill_value for c in sparses] | ||
| sp_indexes = [c.sp_index for c in sparses] | ||
| | ||
| # densify and regular concat | ||
| to_concat = [convert_sparse(x, axis) for x in to_concat] | ||
| result = np.concatenate(to_concat, axis=axis) | ||
| | ||
| if not len(typs - {'sparse', 'f', 'i'}): | ||
| # sparsify if inputs are sparse and dense numerics | ||
| # first sparse input's fill_value and SparseIndex is used | ||
| result = SparseArray(result.ravel(), fill_value=fill_values[0], | ||
| kind=sp_indexes[0]) | ||
| else: | ||
| # coerce to object if needed | ||
| result = result.astype('object') | ||
| return result | ||
| fill_value = list(fill_values)[0] | ||
| | ||
| # TODO: Fix join unit generation so we aren't passed this. | ||
| to_concat = [x if isinstance(x, SparseArray) | ||
| else SparseArray(x.squeeze(), fill_value=fill_value) | ||
| for x in to_concat] | ||
| | ||
| return SparseArray._concat_same_type(to_concat) | ||
| | ||
| | ||
| def _concat_rangeindex_same_dtype(indexes): | ||
| | ||