I was trying to process a huge CSV file (more than 20 GB), but the process was killed while reading the whole file into memory. To avoid this issue, I am trying to read the second column line by line.

For example, the 2nd column contains data like

  1. xxx, computer is good
  2. xxx, build algorithm

    import collections

    wordcount = collections.Counter()
    with open('desc.csv', 'rb') as infile:
        for line in infile:
            wordcount.update(line.split())

My code works, but it counts words from all the columns. How can I read only the second column, without using the CSV reader?

  • You could use iteration (for loops/yield) instead of loading a lot of data into memory. I don't know how much control you have over the individual parts, so I can't give an example. Commented Oct 13, 2016 at 16:26
  • @DennisKuypers, Thanks. What do you mean by "how much control"? Commented Oct 13, 2016 at 18:47
  • What I mean is: can you change the code, or are you just feeding the result of one library into the next? Maybe you can use for something in descs: to iterate over results one by one. You probably have to omit the .tolist(). Again, I don't know the libraries, so I cannot tell you the proper way. Commented Oct 13, 2016 at 23:37
  • @DennisKuypers I modified the question, is that clear to you? Commented Oct 13, 2016 at 23:59

2 Answers

As far as I know, calling csv.reader(infile) opens and reads the whole file...which is where your problem lies.

You can just read line-by-line and parse manually:

    X = []
    with open('desc.csv', 'r') as infile:
        for line in infile:
            # Split on comma first
            cols = [x.strip() for x in line.split(',')]
            # Grab 2nd "column"
            col2 = cols[1]
            # Split on spaces
            words = [x.strip() for x in col2.split(' ')]
            for word in words:
                if word not in X:
                    X.append(word)

    for w in X:
        print w

That will keep only a small chunk of the file in memory at any given time (one line). However, you may still run into trouble if the variable X grows large enough that the program errors out due to memory limits; it depends on how many unique words are in your "vocabulary" list.
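As a side note (not part of the original answer), a set would make the membership check constant-time, whereas the list-based `if word not in X` rescans the whole list for every word. A minimal sketch, using made-up sample rows in place of the real desc.csv:

```python
# Sketch: a set gives O(1) membership checks, while the list-based
# "if word not in X" rescans the whole list for every word.
# Sample rows stand in for lines of the (hypothetical) desc.csv:
lines = [
    "1, computer is good",
    "2, build algorithm",
    "3, computer is fast",
]

vocab = set()
for line in lines:
    cols = [x.strip() for x in line.split(',')]
    vocab.update(cols[1].split())   # words from the second column

print(sorted(vocab))  # ['algorithm', 'build', 'computer', 'fast', 'good', 'is']
```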


3 Comments

Thanks, but your example only works for the first column, right? If it was the 3rd column, x.strip() would not be right, correct?
Sorry, maybe I misunderstood what you meant by columns. I assumed "computer is good" was on its own line, followed by "build algorithm" on the next line. Where I used x.strip()/split() could be csv.reader() just as easily if that works for your input files.
Thanks, it is working. So it is not good to use the CSV reader when the file is huge, right?
It looks like the code in your question reads the 20G file, splits each line into space-separated tokens, then builds a counter that keeps a count of every unique token. I'd say that is where your memory is going.

From the manual, csv.reader returns an iterator:

a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

so it is fine to iterate through a huge file using csv.reader.

    import collections
    import csv

    wordcount = collections.Counter()
    with open('desc.csv', 'rb') as infile:
        for row in csv.reader(infile):
            # count words in strings from second column
            wordcount.update(row[1].split())
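Another reason to prefer csv.reader over a manual split: it keeps quoted fields intact even when they contain commas. A small sketch (the sample line is made up for illustration):

```python
import csv
import io

# A quoted field containing a comma: plain split(',') breaks it apart,
# while csv.reader keeps the field whole.
line = '1,"build algorithm, fast"\n'

naive = line.strip().split(',')
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['1', '"build algorithm', ' fast"']
print(parsed)  # ['1', 'build algorithm, fast']
```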
