Using pandas to efficiently read in a large CSV file without crashing

Question

I am trying to read a CSV file called ratings.csv from http://grouplens.org/datasets/movielens/20m/. The file is 533.4 MB on my computer.

This is what I am running in a Jupyter notebook:

import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')

The problem is that the kernel dies and asks me to restart, and this keeps happening. No error is shown. Can you suggest an alternative way of solving this? It is as if my computer is not capable of running it.

This works, but it keeps overwriting the result:

chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)
ratings.head()

Only the last chunk is kept; the others are discarded.

cs95 · Accepted Answer · 2019-07-16 01:22:34Z

16

You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object you can then pass to pd.concat to concatenate your chunks.

chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
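The loop in the question loses data because the loop variable `ratings` is rebound to each new chunk, and `DataFrame.append` returned a new frame instead of modifying the existing one (it was removed in pandas 2.0). A minimal sketch of the explicit equivalent of `pd.concat(tfr)` follows: collect the chunks in a separate list and concatenate once. The function name `read_csv_in_chunks` is hypothetical, and the in-memory sample only assumes that ratings.csv has the columns `userId`, `movieId`, `rating`, `timestamp`.

```python
import io

import pandas as pd


def read_csv_in_chunks(source, chunksize=100_000):
    """Read a CSV piece by piece and return one combined DataFrame."""
    chunks = []  # separate list, so the loop variable never overwrites the result
    for chunk in pd.read_csv(source, chunksize=chunksize):
        chunks.append(chunk)  # list.append mutates the list in place
    # Concatenate once at the end instead of growing a DataFrame inside the loop.
    return pd.concat(chunks, ignore_index=True)


# Tiny in-memory stand-in for ratings.csv (column names are an assumption).
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,2,3.5,1112486027\n"
    "1,29,3.5,1112484676\n"
    "2,3,4.0,974820889\n"
    "2,62,5.0,974820598\n"
    "3,1,4.0,944919407\n"
)
ratings = read_csv_in_chunks(sample, chunksize=2)
print(len(ratings))
```

With the real file, pass `'./movielens/ratings.csv'` as `source`. The combined frame still has to fit in memory, so this only helps with the parsing step, not with the final size.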

If you just want to process each chunk individually, use:

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True):
    do_something_with_chunk(chunk)
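As one possible illustration of what `do_something_with_chunk` could be, here is a sketch that computes the mean rating per movie by keeping only a running sum and count, so the full file is never held in memory. The function name `mean_rating_per_movie` is hypothetical, and the column names `movieId` and `rating` are assumptions about ratings.csv.

```python
import io

import pandas as pd


def mean_rating_per_movie(source, chunksize=20_000):
    """Compute the mean rating per movieId one chunk at a time."""
    totals = None  # running sum and count of ratings per movieId
    reader = pd.read_csv(source, chunksize=chunksize, usecols=["movieId", "rating"])
    for chunk in reader:
        part = chunk.groupby("movieId")["rating"].agg(["sum", "count"])
        # Align on movieId; a movie missing from one side contributes 0.
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals["sum"] / totals["count"]


# Tiny in-memory stand-in for ratings.csv (column names are an assumption).
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,1112486027\n"
    "2,1,2.0,1112484676\n"
    "3,2,5.0,974820889\n"
    "4,1,3.0,974820598\n"
    "5,2,3.0,944919407\n"
)
means = mean_rating_per_movie(sample, chunksize=2)
print(means)
```

Only the small per-movie summary grows between iterations, which is why this pattern tends to stay within memory even when the concatenated frame would not.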

edited Jul 16, 2019 at 1:22

answered Aug 24, 2017 at 20:25

cs95


6 Comments

Developer Over a year ago

I have tried this. It is not crashing, but the kernel ran for more than 40 minutes without finishing, so I cancelled it. How long should I expect reading 20M records to take?

cs95 Over a year ago

@Developer Increased chunksize and set iterator=True. Try it again.

Developer Over a year ago

Can you please assist with those edits? It is fast, but I have failed to append the data each time a chunk is read. @cOLDsLEEP

Developer Over a year ago

There is still an issue: now it only takes the first chunk and the other chunks are not recorded. There are 20M rows, but that method only keeps 20K, just the first chunk. @cOLDsLEEP

cs95 Over a year ago

@Developer I would refer you to this: stackoverflow.com/questions/33642951/…


Yury Wallet · Accepted Answer · 2018-05-31 12:52:33Z

Try it like this: 1) load with dask, then 2) convert to pandas.

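import time
import pandas as pd
import dask.dataframe as dd

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv')  # lazy: nothing is loaded yet
df_train = df_train.compute()  # materialize as a regular pandas DataFrame
print("load train:", time.perf_counter() - t)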
