I was trying to process a huge CSV file (more than 20 GB), but the process was killed while reading the whole file into memory. To avoid this issue, I am trying to read the second column line by line.

For example, the 2nd column contains data like

  1. xxx, computer is good
  2. xxx, build algorithm

    import collections

    wordcount = collections.Counter()
    with open('desc.csv', 'rb') as infile:
        for line in infile:
            wordcount.update(line.split())

My code works, but it counts words from all the columns. How can I read only the second column, without using the CSV reader?

  • You could use iteration (for loops/yield) instead of loading a lot of data into memory. I don't know how much control you have over the individual parts, so I can't give an example. Commented Oct 13, 2016 at 16:26
  • @DennisKuypers, Thanks. What do you mean by "how much control"? Commented Oct 13, 2016 at 18:47
  • What I mean is: can you change the code, or are you just feeding the result of one library into the next? Maybe you can use for something in descs: to iterate over results one by one. You probably have to omit the .tolist(). Again, I don't know the libraries, so I cannot tell you the proper way. Commented Oct 13, 2016 at 23:37
  • @DennisKuypers I modified the question, is that clear to you? Commented Oct 13, 2016 at 23:59

2 Answers

As far as I know, calling csv.reader(infile) opens and reads the whole file...which is where your problem lies.

You can just read line-by-line and parse manually:

    X = []
    with open('desc.csv', 'r') as infile:
        for line in infile:
            # Split on comma first
            cols = [x.strip() for x in line.split(',')]
            # Grab 2nd "column"
            col2 = cols[1]
            # Split on spaces
            words = [x.strip() for x in col2.split(' ')]
            for word in words:
                if word not in X:
                    X.append(word)

    for w in X:
        print w

That will keep only a small chunk of the file in memory at any given time (one line). However, you may still run into trouble if the variable X grows large enough that the program errors out due to memory limits; it depends on how many unique words are in your "vocabulary" list.
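As a side note (not part of the original answer), a set would make the membership check constant-time, whereas the list-based `if word not in X` rescans the whole list for every word. A minimal sketch, using made-up sample rows in place of the real desc.csv:

```python
# Sketch: a set gives O(1) membership checks, while the list-based
# "if word not in X" rescans the whole list for every word.
# Sample rows stand in for lines of the (hypothetical) desc.csv:
lines = [
    "1, computer is good",
    "2, build algorithm",
    "3, computer is fast",
]

vocab = set()
for line in lines:
    cols = [x.strip() for x in line.split(',')]
    vocab.update(cols[1].split())   # words from the second column

print(sorted(vocab))  # ['algorithm', 'build', 'computer', 'fast', 'good', 'is']
```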


3 Comments

Thanks, but your example only works for the first column, right? If it was the 3rd column, x.strip() would not be right, correct?
Sorry, maybe I misunderstood what you meant by columns. I assumed "computer is good" was on its own line, followed by "build algorithm" on the next line. Where I used x.strip()/split() could be csv.reader() just as easily if that works for your input files.
Thanks, it is working. So it is not good to use the CSV reader when the file is huge, right?
It looks like the code in your question reads the 20G file, splits each line into space-separated tokens, then builds a counter that keeps a count of every unique token. I'd say that is where your memory is going.

From the manual, csv.reader returns an iterator:

a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

so it is fine to iterate through a huge file using csv.reader.

    import collections
    import csv

    wordcount = collections.Counter()
    with open('desc.csv', 'rb') as infile:
        for row in csv.reader(infile):
            # count words in strings from second column
            wordcount.update(row[1].split())
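Another reason to prefer csv.reader over a manual split: it keeps quoted fields intact even when they contain commas. A small sketch (the sample line is made up for illustration):

```python
import csv
import io

# A quoted field containing a comma: plain split(',') breaks it apart,
# while csv.reader keeps the field whole.
line = '1,"build algorithm, fast"\n'

naive = line.strip().split(',')
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['1', '"build algorithm', ' fast"']
print(parsed)  # ['1', 'build algorithm, fast']
```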
