@jorisvandenbossche jorisvandenbossche commented Mar 19, 2020

Apart from this being more idiomatic, it also avoids creating a SparseArray through the normal machinery (including validation of the input etc) for the empty list.

With this PR:

    In [1]: data = np.array([1, 2, 3], dtype=float)

    In [2]: index = pd.core.arrays.sparse.IntIndex(5, np.array([0, 2, 4]))

    In [3]: dtype = pd.SparseDtype("float64", 0)

    In [4]: pd.arrays.SparseArray._simple_new(data, index, dtype)
    Out[4]:
    [1.0, 0, 2.0, 0, 3.0]
    Fill: 0
    IntIndex
    Indices: array([0, 2, 4], dtype=int32)

    In [5]: %timeit pd.arrays.SparseArray._simple_new(data, index, dtype)
    381 ns ± 4.83 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

while on the released version the same call takes around 50 µs (roughly 100x slower).
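As a rough illustration of why skipping validation matters, the sketch below builds the same array two ways and times each. It compares `_simple_new` with the public `SparseArray` constructor on whatever pandas version is installed, which is not the same comparison as the before/after numbers quoted above. The dense input array and the loop count are assumptions chosen for the example, and `_simple_new` is a private method that may change.

```python
import timeit

import numpy as np
import pandas as pd

# Same building blocks as in the session above.
data = np.array([1, 2, 3], dtype=float)
index = pd.core.arrays.sparse.IntIndex(5, np.array([0, 2, 4], dtype=np.int32))
dtype = pd.SparseDtype("float64", 0)

# Private fast path: wraps the pre-built pieces without validating them.
fast = pd.arrays.SparseArray._simple_new(data, index, dtype)

# Public constructor: derives the sparse layout from dense values.
# (The dense array is an assumption made for this comparison.)
dense = np.array([1.0, 0.0, 2.0, 0.0, 3.0])
public = pd.arrays.SparseArray(dense, fill_value=0)

# Both routes should describe the same logical array.
same_values = np.array_equal(np.asarray(fast), np.asarray(public))

n = 2_000  # arbitrary repeat count for a quick measurement
t_fast = timeit.timeit(
    lambda: pd.arrays.SparseArray._simple_new(data, index, dtype), number=n
) / n
t_public = timeit.timeit(
    lambda: pd.arrays.SparseArray(dense, fill_value=0), number=n
) / n

print(f"same values: {same_values}")
print(f"_simple_new: {t_fast * 1e6:.2f} µs per call")
print(f"constructor: {t_public * 1e6:.2f} µs per call")
```

Absolute timings depend on the machine and the pandas version, so only the relative gap is meaningful here.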

Noticed while investigating #32196

@jorisvandenbossche jorisvandenbossche added the Performance (memory or execution speed performance) and Sparse (sparse data type) labels Mar 19, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Mar 19, 2020

rth commented Mar 19, 2020

Thanks @jorisvandenbossche! Quick benchmark results below from running pd.DataFrame.sparse.from_spmatrix on a random sparse CSR array with the given n_samples and n_features and density=0.01:

    label                  master (s)   PR (s)
    n_samples n_features
    100       100000          14.5374  10.1247
    10000     10000            1.5599   1.1037
    100000    100              0.0180   0.0134

so overall this makes that method around 30% faster.
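A benchmark along these lines could be sketched as follows. The helper name `time_from_spmatrix`, the seed, and the small demonstration shape are assumptions for illustration and are not taken from the script that produced the table. The second half only redoes the arithmetic on the timings reported above.

```python
import time

import pandas as pd
import scipy.sparse as sp


def time_from_spmatrix(n_samples, n_features, density=0.01, seed=0):
    """Return (seconds, DataFrame) for a single from_spmatrix call."""
    mat = sp.random(
        n_samples, n_features, density=density, format="csr", random_state=seed
    )
    start = time.perf_counter()
    df = pd.DataFrame.sparse.from_spmatrix(mat)
    return time.perf_counter() - start, df


# Small hypothetical shape so the sketch finishes quickly.
elapsed, df = time_from_spmatrix(1_000, 100)
print(f"from_spmatrix on {df.shape}: {elapsed:.4f} s")

# Timings reported in the table above: (n_samples, n_features) -> (master, PR).
reported = {
    (100, 100_000): (14.5374, 10.1247),
    (10_000, 10_000): (1.5599, 1.1037),
    (100_000, 100): (0.0180, 0.0134),
}

# Fraction of the master runtime saved by the PR for each shape.
reductions = {
    shape: (master - pr) / master for shape, (master, pr) in reported.items()
}
for shape, saved in reductions.items():
    print(shape, f"{saved:.1%} less time")
```

For the reported figures the PR takes roughly 30%, 29%, and 26% less time than master, which is where the "around 30%" summary comes from.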

@simonjayhawkins simonjayhawkins left a comment

Thanks @jorisvandenbossche lgtm

@simonjayhawkins simonjayhawkins merged commit 34f3360 into pandas-dev:master Mar 19, 2020
@jorisvandenbossche jorisvandenbossche deleted the sparse-simple-new branch March 19, 2020 11:43
jorisvandenbossche commented

@rth thanks for the timings! Yes, this was indeed a large part of why from_spmatrix was originally slow; the snippet in the issue accounts for most of the rest.

jreback commented Mar 19, 2020

do we have asvs for this? also add a whatsnew note
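For context, an asv benchmark for this code path might look like the sketch below. The class name `SparseSimpleNew` and the method `time_simple_new` are hypothetical and do not refer to an existing benchmark in the pandas suite. The sketch only relies on the asv convention of calling `setup` before each `time_*` method.

```python
import numpy as np
import pandas as pd


class SparseSimpleNew:
    """Hypothetical asv-style benchmark; all names here are illustrative."""

    def setup(self):
        # Pre-build the inputs so only _simple_new itself is timed.
        self.data = np.array([1, 2, 3], dtype=float)
        self.index = pd.core.arrays.sparse.IntIndex(
            5, np.array([0, 2, 4], dtype=np.int32)
        )
        self.dtype = pd.SparseDtype("float64", 0)

    def time_simple_new(self):
        pd.arrays.SparseArray._simple_new(self.data, self.index, self.dtype)


# Outside asv, the benchmark can be smoke-tested by hand.
bench = SparseSimpleNew()
bench.setup()
bench.time_simple_new()
```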

rth commented Mar 19, 2020

Note added in #32825; that should work for both PRs, I think.

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 25, 2020