
I wrote a small, simple script to read and process a huge CSV file (~150 GB). On each loop iteration it reads 5e6 rows, converts them to a Pandas DataFrame, does something with it, and then reads the next 5e6 rows.

Although it does the job, each iteration takes longer to find the next chunk of rows to read, since it has to skip an ever larger number of rows. I have read many answers about using chunks (as a reader iterator), but once a chunk has been read I would then need to concatenate the chunks to create a DataFrame (with all sorts of issues regarding truncated rows and so on), so I would prefer not to go down that road.
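
For reference, a minimal sketch of the chunk-iterator idea, with a placeholder file name and a tiny chunk size; every chunk the reader yields is a complete DataFrame, so there are no truncated rows to repair:

import pandas as pd

# Minimal sketch: "small.csv" and the chunk size are placeholders, and for a
# ~150 GB file this whole-file concat would not fit in memory; it only
# illustrates that pandas never splits a row across chunks, so concatenating
# chunks needs no repair of truncated rows.
reader = pd.read_csv("small.csv", sep=",", header=None, chunksize=1000)
full_df = pd.concat(reader, ignore_index=True)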

Is it possible to use some kind of cursor so that read_csv remembers where it stopped and resumes reading from there?

The main part of the code looks like this:

while condition is True:
    df = pd.read_csv(inputfile, sep=',', header=None, skiprows=sr, nrows=5e6)
    # do something with df
    sr = sr + 5e6
    # if something goes wrong the condition turns False

1 Answer


Using your approach, Pandas will have to start reading this huge CSV file from the very beginning again and again just in order to skip rows...

I think you do want to use the chunksize parameter:

reader = pd.read_csv(inputfile, sep=',', header=None, chunksize=5*10**6)

for df in reader:
    # do something with df
    if (something goes wrong):
        break
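
A filled-in version of that skeleton might look like the sketch below; the file name, the per-chunk work, and the failure handling are assumptions for illustration, not part of the original answer:

import pandas as pd

def process(df):
    # Placeholder for the real "do something with df" from the question.
    return df.sum(numeric_only=True)

# "inputfile.csv" is a placeholder path.
reader = pd.read_csv("inputfile.csv", sep=",", header=None, chunksize=5 * 10**6)

results = []
for df in reader:
    try:
        results.append(process(df))
    except Exception as exc:  # stand-in for "something goes wrong"
        print(f"Stopping early: {exc}")
        break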

1 Comment

Yep! I did try with the chunksize parameter, but for some reason it wasn't working when converting it into a DataFrame. Thanks a lot!
