
I am reading a file that is 10 MB in size and contains some IDs. I read them into a list in Ruby. I am concerned that this might cause memory issues in the future, when the number of IDs in the file increases. Is there an effective way of reading a large file in batches?

Thank you

4 Answers

With Lazy Enumerators and each_slice, you can get the best of both worlds. You don't need to worry about cutting lines in the middle, and you can iterate over multiple lines in a batch. batch_size can be chosen freely.

header_lines = 1
batch_size = 2000

File.open("big_file") do |file|
  file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
    # do something with batch of lines
  end
end

It could be used to import a huge CSV file into a database:

require 'csv'

batch_size = 2000

File.open("big_data.csv") do |file|
  headers = file.first
  file.lazy.each_slice(batch_size) do |lines|
    csv_rows = CSV.parse(lines.join, headers: headers)
    # do something with 2000 csv rows, e.g. bulk insert them into a database
  end
end
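Applied to the question's own use case, here is a minimal sketch (the file name ids.txt and the one-ID-per-line format are assumptions) that processes IDs in batches instead of reading the whole file at once:

batch_size = 2000

File.open("ids.txt") do |file|
  file.lazy.each_slice(batch_size) do |lines|
    ids = lines.map(&:chomp)
    # process this batch of IDs, e.g. look them up in a database
  end
end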

6 Comments

I vaguely remembered this; it's the answer I was looking for. The right way to read a file!
How does it work? What is the purpose of using lazy?
I did some benchmarks with a huge amount of data; memory usage is the same as without the .lazy.each_slice chain.
@Ilya thanks for the benchmark. I'll investigate after my holidays.
I can't understand the magic either. Does this read the whole file into memory and then iterate through its lines?

There's no universal way.

1) You can read the file in chunks:

File.open('filename', 'r') do |f|
  while (chunk = f.read(2048))
    # ...
  end
end

Disadvantage: you can miss a substring if it falls across a chunk boundary, i.e. you look for "SOME_TEXT", but "SOME_" is the last 5 bytes of the first 2048-byte chunk and "TEXT" is the first 4 bytes of the second chunk.
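A hedged sketch of one workaround: carry the tail of each chunk over into the next read, so a pattern can no longer be split across a boundary. The pattern and chunk size are taken from the example above; note that a match sitting entirely inside the carried tail may be seen twice, so track offsets if exact counts matter.

pattern = "SOME_TEXT"
overlap = pattern.bytesize - 1
carry = ""

File.open('filename', 'r') do |f|
  while (chunk = f.read(2048))
    data = carry + chunk
    # search `data` here; a match spanning the old chunk boundary is now intact
    puts "found" if data.include?(pattern)
    # keep the last (pattern length - 1) bytes for the next iteration
    carry = data[-overlap, overlap] || data
  end
end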

2) You can read the file line by line:

File.open('filename', 'r') do |f|
  while (line = f.gets)
    # ...
  end
end

Disadvantage: this way it'd be 2x–5x slower than the first method.
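For what it's worth, the idiomatic Ruby spelling of option 2 is File.foreach, which likewise holds only the current line in memory:

File.foreach('filename') do |line|
  # process one line at a time; the whole file is never loaded
end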

Building upon @Eric Duminil's answer: the CSV class also supports lazy enumeration directly.

require 'csv'

batch_size = 2000

csv = CSV.open("big_data.csv", headers: true)
csv.lazy.each_slice(batch_size) do |csv_rows|
  # do something with 2000 csv rows, e.g. bulk insert them into a database
end

The benefit of this approach is that we get already-parsed CSV::Row objects.
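As a hedged illustration of the bulk insert mentioned here, assuming a Rails 6+ app with a hypothetical Record model (insert_all writes one batch per statement):

require 'csv'

batch_size = 2000

CSV.open("big_data.csv", headers: true) do |csv|
  csv.lazy.each_slice(batch_size) do |csv_rows|
    # CSV::Row#to_h yields a header => value hash per row;
    # Record is a placeholder for your actual model
    Record.insert_all(csv_rows.map(&:to_h))
  end
end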

1 Comment

Definitely the best answer. As a follow-up: since this gives you an array of CSV::Row objects, it is the easier method if you'd like to manipulate the data somehow. In my case I'm doing csv_rows.map(&:to_h), and since map consumes memory I'm keeping batch_size at 250.

If you're worried this much about speed/memory efficiency, have you considered shelling out and using grep, awk, sed, etc.? If I knew a bit more about the structure of the input file and what you're trying to extract, I could potentially construct a command for you.

4 Comments

Sorry, the question is specifically about Ruby. It wouldn't make sense to use shell commands for a canonical Ruby question.
The question says the author wants to "read IDs into a list in Ruby". Nowhere does it say that the reading needs to happen in Ruby – only the storing in the list. Also, shelling out isn't a special thing – it doesn't require extra libraries or whatever, it's just a feature of the language. So I don't quite follow your line of argument.
I guess you're trying to help, but your answer isn't really useful, and should have been a comment. The question is specifically tagged Ruby & Ruby-on-Rails, and I posted a bounty for a Ruby answer. Ruby has excellent text processing capabilities, Rails runs on systems which don't have grep/awk/sed, and Ruby code can be much more readable than awk. Care needs to be taken not to use too much memory, and that's what the question is about.
Another option that comes to mind is to use the split command offered by Linux: You could use split -l 1000 to split the input file into separate equally sized files and then process them one-by-one with Ruby, thus keeping most of the logic in Ruby while having the file size (and consequently also memory usage) relatively constant even as the overall number of lines grows.
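A minimal sketch of that split-based approach, assuming split -l 1000 big_file chunk_ has already produced chunk files in the current directory (the chunk_ prefix and the cleanup step are assumptions):

Dir.glob("chunk_*").sort.each do |path|
  File.foreach(path) do |line|
    # extract the ID from each line of this 1000-line chunk
  end
  File.delete(path) # optional cleanup once a chunk is processed
end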
