
BUG: Serializing sparse dataframe 15x slower than converting to dense version + serializing the dense version #41023

@melkonyan

Description


Code to reproduce:

    import timeit

    import numpy as np
    import pandas as pd
    from scipy import sparse as sc

    np.random.seed(42)

    # Build a 1000x1000 integer matrix and zero out every value above 3,
    # so that roughly 60% of the entries are zero.
    vals = np.random.randint(0, 10, size=(1000, 1000))
    keep = vals > 3
    vals[keep] = 0

    sparse_mtx = sc.coo_matrix(vals)
    sparse_pd = pd.DataFrame.sparse.from_spmatrix(sparse_mtx)

    # Time writing the sparse frame directly vs. densifying it first.
    num_tries = 30
    t1 = timeit.timeit(lambda: sparse_pd.to_csv('sparse_pd.csv'), number=num_tries)
    t2 = timeit.timeit(lambda: sparse_pd.sparse.to_dense().to_csv('sparse_pd.csv'), number=num_tries)
    overhead = t1 / t2
    print(t1, t2, overhead)

Output:

56.591012510471046 3.7841985523700714 14.954556883657089 
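Until the slowdown is addressed, the timing above suggests densifying before writing. Here is a minimal sketch of that workaround, assuming every column is sparse and the dense frame fits in memory. The helper name `sparse_frame_to_csv` is hypothetical and not part of pandas, and the test frame is built with `astype` instead of scipy only to keep the example dependency-free.

```python
import io

import numpy as np
import pandas as pd


def sparse_frame_to_csv(df, path_or_buf=None, **to_csv_kwargs):
    """Hypothetical helper: densify an all-sparse frame, then call to_csv.

    Assumes every column has a sparse dtype; the DataFrame.sparse accessor
    raises AttributeError otherwise.
    """
    dense = df.sparse.to_dense()
    return dense.to_csv(path_or_buf, **to_csv_kwargs)


# Small frame built along the same lines as the reproduction above.
rng = np.random.default_rng(42)
vals = rng.integers(0, 10, size=(50, 50), dtype=np.int64)
vals[vals > 3] = 0
sparse_df = pd.DataFrame(vals).astype(pd.SparseDtype("int64", 0))

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
sparse_frame_to_csv(sparse_df, buf)
csv_text = buf.getvalue()
```

The trade-off is memory: the dense copy exists for the duration of the write, which may not be acceptable for very large or very sparse frames.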

Versions:

  • python == 3.9.2
  • pandas == 1.2.4

Metadata


    Labels

      • IO CSV (read_csv, to_csv)
      • Performance (memory or execution speed performance)
      • Sparse (sparse data type)
