PERF: optimize DataFrame.sparse.from_spmatrix performance #32825
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -227,15 +227,24 @@ def from_spmatrix(cls, data, index=None, columns=None): | |
| 1 0.0 1.0 0.0 | ||
| 2 0.0 0.0 1.0 | ||
| """ | ||
| - from pandas import DataFrame | ||
| + from pandas import DataFrame, SparseDtype | ||
| + from . import IntIndex, SparseArray | ||
| | ||
| data = data.tocsc() | ||
| Contributor Author: Could be | ||
| index, columns = cls._prep_index(data, index, columns) | ||
| Member: I think the problem is that this | ||
| Contributor Author: OK, I see. Thanks for confirming that passing duplicate column names in an Index object is expected to work in | ||
| Member: That seems to be it, as doing You can add in here a | ||
| - sparrays = [SparseArray.from_spmatrix(data[:, i]) for i in range(data.shape[1])] | ||
| - data = dict(enumerate(sparrays)) | ||
| - result = DataFrame(data, index=index) | ||
| - result.columns = columns | ||
| - return result | ||
| + n_rows, n_columns = data.shape | ||
| + data.sort_indices() | ||
| Contributor Author: It might be already done in | ||
| Member: Maybe add a comment about that inline? | ||
| + indices = data.indices | ||
| + indptr = data.indptr | ||
| + data = data.data | ||
| ||
| + dtype = SparseDtype(data.dtype, 0) | ||
| + arrays = [] | ||
| + for i in range(n_columns): | ||
| + sl = slice(indptr[i], indptr[i + 1]) | ||
| + idx = IntIndex(n_rows, indices[sl], check_integrity=False) | ||
| + arr = SparseArray._simple_new(data[sl], idx, dtype) | ||
| + arrays.append(arr) | ||
| Contributor Author: FWIW, also tried with a generator here to avoid pre-allocating all the arrays, but it doesn't really matter. Most of the remaining run time is in | ||
| + return DataFrame._from_arrays(arrays, columns=columns, index=index) | ||
| ||
| | ||
| def to_dense(self): | ||
| """ | ||
| | ||
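The new loop in the diff relies on the CSC (compressed sparse column) layout: the stored values of column `i` sit at positions `indptr[i]:indptr[i + 1]` of `data`, with their row numbers at the same positions of `indices`. The following is a minimal pure-Python sketch of that slicing idea only, not the pandas implementation. The helper names `csc_columns` and `densify` and the sample matrix are hypothetical, chosen for illustration, and the row indices within each column are assumed to be already sorted (which is what `sort_indices()` guarantees in the diff).

```python
# Hand-written CSC representation of the 3x3 matrix
#   [[1, 0, 0],
#    [0, 0, 2],
#    [3, 0, 4]]
# The variable names mirror the diff; the values are made up for illustration.
data = [1, 3, 2, 4]      # non-zero values, stored column by column
indices = [0, 2, 1, 2]   # row index of each stored value
indptr = [0, 2, 2, 4]    # column i occupies positions indptr[i]:indptr[i + 1]
n_rows, n_columns = 3, 3


def csc_columns(data, indices, indptr, n_columns):
    """Return one (row_indices, values) pair per column using only slicing."""
    columns = []
    for i in range(n_columns):
        sl = slice(indptr[i], indptr[i + 1])
        columns.append((indices[sl], data[sl]))
    return columns


def densify(row_indices, values, n_rows, fill_value=0):
    """Expand one sparse column into a dense list (for checking only)."""
    dense = [fill_value] * n_rows
    for row, value in zip(row_indices, values):
        dense[row] = value
    return dense


columns = csc_columns(data, indices, indptr, n_columns)
dense_columns = [densify(rows, vals, n_rows) for rows, vals in columns]
```

Each column is obtained with two slices and no per-column matrix indexing, which is the property the rewritten `from_spmatrix` exploits. An empty column (column 1 here) simply yields empty slices because `indptr[1] == indptr[2]`.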
given what we said in the other PR, you can remove this random state again (it's covered by the global setup function which will be run for every benchmark)