
I need to write a script that parses a large input (30 GB), extracts all numbers from the text on stdin, and outputs them in descending order.

Example usage: cat text_file_30gb.txt | script

This is what I currently use for the parsing:

    numbers = []
    $stdin.each_line do |line|
      numbers += line.scan(/\d+/).map(&:to_i)
    end
    numbers.uniq!.sort!.reverse!

But when I piped a 60 MB text file into the script, it took about 50 minutes to parse.

Is there a way to speed up the script?

UPD. Profiling result:

     %self      total      self     wait     child     calls  name
     95.42   5080.882  4848.293    0.000   232.588         1  IO#each_line
      3.33    169.246   169.246    0.000     0.000    378419  String#scan
      0.26     15.148    13.443    0.000     1.705    746927  <Class::Time>#now
      0.18      9.310     9.310    0.000     0.000    378422  Array#uniq!
      0.15     14.446     7.435    0.000     7.011    378423  Array#map
      0.14      7.011     7.011    0.000     0.000   8327249  String#to_i
      0.10      5.179     5.179    0.000     0.000    378228  Array#sort!
      0.03      1.508     1.508    0.000     0.000    339416  String#%
      0.03      1.454     1.454    0.000     0.000    509124  Symbol#to_s
      0.02      0.993     0.993    0.000     0.000     48488  IO#write
      0.02      1.593     0.945    0.000     0.649    742077  Numeric#quo
      0.01      0.649     0.649    0.000     0.000    742077  Fixnum#fdiv
      0.01      0.619     0.619    0.000     0.000    509124  String#intern
      0.01      0.459     0.459    0.000     0.000    315172  Fixnum#to_s
      0.01      0.453     0.453    0.000     0.000    746927  Fixnum#+
      0.01      0.383     0.383    0.000     0.000     72732  Array#reject
      0.01     16.100     0.307    0.000    15.793     96976  *Enumerable#inject
      0.00     15.793     0.207    0.000    15.585    150322  *Array#each
     ...
  • Well, you probably can't hold all 30 GB in memory all at once, so you'll need to sort them on disk. Also, it might be faster if you didn't use ruby for this (instead use C or something). Commented Oct 14, 2017 at 20:53
  • @Adrian I absolutely agree with you, but I need to do it in Ruby for an exam :( Commented Oct 14, 2017 at 21:00
  • Did you try something like IO.foreach('text_file_30gb.txt').lazy.grep(/\d+/)? Also refer to this, it might help: blog.honeybadger.io/… Commented Oct 14, 2017 at 21:04
  • If numbers.uniq! is far smaller than numbers (i.e., lots of dups), you might make numbers a Set rather than an array (a minimal sketch follows these comments). That would reduce memory requirements, but I doubt it would speed up the calculations. What is your rough estimate of the counts of numbers (not digits) and of unique numbers in the file? Commented Oct 15, 2017 at 6:11
  • 1
    I think numbers += ... is taking way too much, as it allocates new array per += call. However, use << instead. That will add value to the existing array instance. I am adding a quick example based on your current one, which takes up 11 to 12 minutes just with few tweaks. Commented Oct 16, 2017 at 3:23

1 Answer


Thanks for the excellent problem.

I couldn't dig into it for very long, but here is a quick fix that brings the 50-minute mark down to about 11 minutes, at least 4.5 times faster.

    require 'ruby-prof'

    # Wrap a block with ruby-prof and print a flat profile to stdout
    def profile(&block)
      RubyProf::FlatPrinter.new(RubyProf.profile(&block)).print($stdout)
    end

    numbers = []

    profile do
      $stdin.each_line do |line|
        # Push each match onto the existing array instead of building a
        # temporary array and concatenating it with +=
        line.scan(/\d+/) { |digit| numbers << digit.to_i }
      end
      # uniq! returns nil when nothing was removed, so don't chain off it
      numbers.uniq!
      numbers.sort!
      numbers.reverse!
    end

The reason is pretty simple. As you can see, += on an array allocates a whole new array instead of pushing new values onto the existing one. The quick fix is to use << instead; that change alone cut most of the lag.
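To make the difference concrete, here is a tiny snippet you can run on its own (the object ids will vary between runs):

    a = [1, 2]
    before = a.object_id

    a += [3]                     # really a = a + [3]: builds and copies a brand-new array
    puts a.object_id == before   # => false, a now points at a different object

    b = [1, 2]
    before = b.object_id

    b << 3                       # appends in place, no new array allocation
    puts b.object_id == before   # => true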

Still, you will hit significant problems once you move to a much larger file, since everything is held in memory. I don't have anything else off the top of my head.
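For the full 30 GB case, one direction raised in the question comments is sorting on disk. Below is a rough, untested sketch of that idea; the chunk size, temp-file handling, and merge loop are assumptions chosen purely for illustration:

    require 'set'
    require 'tempfile'

    CHUNK_LIMIT = 5_000_000  # numbers held in memory per chunk; tune to available RAM

    chunk_files = []
    seen = Set.new

    # Write the current chunk to a temp file, sorted descending, then clear it
    flush_chunk = lambda do
      next if seen.empty?
      file = Tempfile.new('numbers_chunk')
      seen.sort.reverse_each { |n| file.puts(n) }
      file.flush
      file.rewind
      chunk_files << file
      seen.clear
    end

    $stdin.each_line do |line|
      line.scan(/\d+/) { |d| seen << d.to_i }
      flush_chunk.call if seen.size >= CHUNK_LIMIT
    end
    flush_chunk.call

    # k-way merge of the descending-sorted chunks, skipping duplicates
    heads = chunk_files.map { |f| [f.gets, f] }
                       .reject { |l, _| l.nil? }
                       .map { |l, f| [l.to_i, f] }
    last = nil
    until heads.empty?
      heads.sort_by! { |n, _| -n }    # largest current value first (fine for a sketch)
      n, f = heads.first
      puts n unless n == last
      last = n
      next_line = f.gets
      if next_line
        heads[0] = [next_line.to_i, f]
      else
        heads.shift
      end
    end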


1 Comment

This is awesome! Thank you so much.
