
I am reading a file that is 10 MB in size and contains some IDs. I read them into a list in Ruby. I am concerned that this might cause memory issues in the future, when the number of IDs in the file increases. Is there an effective way of reading a large file in batches?

Thank you

4 Answers

With Lazy Enumerators and each_slice, you can get the best of both worlds. You don't need to worry about cutting lines in the middle, and you can iterate over multiple lines in a batch. batch_size can be chosen freely.

header_lines = 1
batch_size = 2000

File.open("big_file") do |file|
  file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
    # do something with batch of lines
  end
end

It could be used to import a huge CSV file into a database:

require 'csv'

batch_size = 2000

File.open("big_data.csv") do |file|
  headers = file.first
  file.lazy.each_slice(batch_size) do |lines|
    csv_rows = CSV.parse(lines.join, headers: headers)
    # do something with 2000 csv rows, e.g. bulk insert them into a database
  end
end
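Applied to the question's own use case, here is a minimal sketch (the file name ids.txt and the one-ID-per-line format are assumptions) that processes IDs in batches instead of reading the whole file at once:

batch_size = 2000

File.open("ids.txt") do |file|
  file.lazy.each_slice(batch_size) do |lines|
    ids = lines.map(&:chomp)
    # process this batch of IDs, e.g. look them up in a database
  end
end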

6 Comments

I vaguely remembered this; it's the answer I was looking for. The right way to read a file!
How does it work? What is the purpose of using lazy?
I did some benchmarks with a huge amount of data; memory usage is the same as without the .lazy.each_slice chain.
@Ilya thanks for the benchmark. I'll investigate after my holidays.
I can't understand the magic either. Does this read the whole file into memory and then iterate through its lines?

There's no universal way.

1) You can read the file in chunks:

File.open('filename', 'r') do |f|
  while (chunk = f.read(2048))
    # ...
  end
end

Disadvantage: you can miss a substring if it falls across a chunk boundary, i.e. you look for "SOME_TEXT", but "SOME_" is the last 5 bytes of the first 2048-byte chunk and "TEXT" is the first 4 bytes of the second chunk.
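A hedged sketch of one workaround: carry the tail of each chunk over into the next read, so a pattern can no longer be split across a boundary. The pattern and chunk size are taken from the example above; note that a match sitting entirely inside the carried tail may be seen twice, so track offsets if exact counts matter.

pattern = "SOME_TEXT"
overlap = pattern.bytesize - 1
carry = ""

File.open('filename', 'r') do |f|
  while (chunk = f.read(2048))
    data = carry + chunk
    # search `data` here; a match spanning the old chunk boundary is now intact
    puts "found" if data.include?(pattern)
    # keep the last (pattern length - 1) bytes for the next iteration
    carry = data[-overlap, overlap] || data
  end
end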

2) You can read the file line by line:

File.open('filename', 'r') do |f|
  while (line = f.gets)
    # ...
  end
end

Disadvantage: this way it'd be 2x–5x slower than the first method.
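For what it's worth, the idiomatic Ruby spelling of option 2 is File.foreach, which likewise holds only the current line in memory:

File.foreach('filename') do |line|
  # process one line at a time; the whole file is never loaded
end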

Building upon @Eric Duminil's answer: the CSV class also supports lazy enumeration directly.

require 'csv'

batch_size = 2000

csv = CSV.open("big_data.csv", headers: true)
csv.lazy.each_slice(batch_size) do |csv_rows|
  # do something with 2000 csv rows, e.g. bulk insert them into a database
end

The benefit of this approach is that we get already-parsed CSV::Row objects.
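As a hedged illustration of the bulk insert mentioned here, assuming a Rails 6+ app with a hypothetical Record model (insert_all writes one batch per statement):

require 'csv'

batch_size = 2000

CSV.open("big_data.csv", headers: true) do |csv|
  csv.lazy.each_slice(batch_size) do |csv_rows|
    # CSV::Row#to_h yields a header => value hash per row;
    # Record is a placeholder for your actual model
    Record.insert_all(csv_rows.map(&:to_h))
  end
end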

1 Comment

Definitely the best answer. As a follow-up: since this gives you an array of CSV::Row objects, it is the easier method if you'd like to manipulate the data somehow. In my case I'm doing csv_rows.map(&:to_h), and since map consumes memory I'm keeping batch_size at 250.

If you're worried this much about speed/memory efficiency, have you considered shelling out and using grep, awk, sed, etc.? If I knew a bit more about the structure of the input file and what you're trying to extract, I could potentially construct a command for you.

4 Comments

Sorry, the question is specifically about Ruby. It wouldn't make sense to use shell commands for a canonical Ruby question.
The question says the author wants to "read IDs into a list in Ruby". Nowhere does it say that the reading needs to happen in Ruby – only the storing in the list. Also, shelling out isn't a special thing – it doesn't require extra libraries or whatever, it's just a feature of the language. So I don't quite follow your line of argument.
I guess you're trying to help, but your answer isn't really useful, and should have been a comment. The question is specifically tagged Ruby & Ruby-on-Rails, and I posted a bounty for a Ruby answer. Ruby has excellent text processing capabilities, Rails runs on systems which don't have grep/awk/sed, and Ruby code can be much more readable than awk. Care needs to be taken not to use too much memory, and that's what the question is about.
Another option that comes to mind is to use the split command offered by Linux: You could use split -l 1000 to split the input file into separate equally sized files and then process them one-by-one with Ruby, thus keeping most of the logic in Ruby while having the file size (and consequently also memory usage) relatively constant even as the overall number of lines grows.
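A minimal sketch of that split-based approach, assuming split -l 1000 big_file chunk_ has already produced chunk files in the current directory (the chunk_ prefix and the cleanup step are assumptions):

Dir.glob("chunk_*").sort.each do |path|
  File.foreach(path) do |line|
    # extract the ID from each line of this 1000-line chunk
  end
  File.delete(path) # optional cleanup once a chunk is processed
end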
