I need to write a script that parses large input data (30GB). It should extract all numbers from the text on stdin and print them in descending order.

Example usage:

    cat text_file_30gb.txt | script
Currently I parse it like this:
    numbers = []
    $stdin.each_line do |line|
      numbers += line.scan(/\d+/).map(&:to_i)
    end
    numbers.uniq!.sort!.reverse!

But when I passed text from a 60MB file to the script, it took 50 minutes to parse.
Is there a way to speed up the script?
UPD. Profiling result:
    %self      total       self      wait     child     calls  name
    95.42   5080.882   4848.293     0.000   232.588         1  IO#each_line
     3.33    169.246    169.246     0.000     0.000    378419  String#scan
     0.26     15.148     13.443     0.000     1.705    746927  <Class::Time>#now
     0.18      9.310      9.310     0.000     0.000    378422  Array#uniq!
     0.15     14.446      7.435     0.000     7.011    378423  Array#map
     0.14      7.011      7.011     0.000     0.000   8327249  String#to_i
     0.10      5.179      5.179     0.000     0.000    378228  Array#sort!
     0.03      1.508      1.508     0.000     0.000    339416  String#%
     0.03      1.454      1.454     0.000     0.000    509124  Symbol#to_s
     0.02      0.993      0.993     0.000     0.000     48488  IO#write
     0.02      1.593      0.945     0.000     0.649    742077  Numeric#quo
     0.01      0.649      0.649     0.000     0.000    742077  Fixnum#fdiv
     0.01      0.619      0.619     0.000     0.000    509124  String#intern
     0.01      0.459      0.459     0.000     0.000    315172  Fixnum#to_s
     0.01      0.453      0.453     0.000     0.000    746927  Fixnum#+
     0.01      0.383      0.383     0.000     0.000     72732  Array#reject
     0.01     16.100      0.307     0.000    15.793     96976  *Enumerable#inject
     0.00     15.793      0.207     0.000    15.585    150322  *Array#each
    ...
Comments:

- Try IO.foreach('text_file_30gb.txt').lazy.grep(/\d+/). Also refer to this, it might help: blog.honeybadger.io/…
- If numbers.uniq! is far smaller than numbers (i.e., lots of dups), you might make numbers a set rather than an array. That would reduce memory requirements, but I doubt that it would speed up the calculations. What is your rough estimate of the counts of numbers (not digits) and of unique numbers in the file?
- numbers += ... is taking way too much time, as it allocates a new array per += call. Use << instead; that will add the value to the existing array instance. I am adding a quick example based on your current one, which takes 11 to 12 minutes with just a few tweaks.
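Putting the last two suggestions together, here is a minimal sketch of what those tweaks might look like: a Set instead of an array so duplicates are dropped on insert, << and the block form of String#scan instead of += so no per-line temporary arrays are allocated, and a single descending sort at the end. Reading from $stdin and printing one value per line with puts are assumptions carried over from the question, and the timings quoted in the comment are not verified here.

    require 'set'

    # Sketch only: dedup on insert with a Set, append in place with <<,
    # and sort once at the very end instead of uniq!.sort!.reverse!.
    numbers = Set.new

    $stdin.each_line do |line|
      # Block form of scan yields each match without building a temporary array.
      line.scan(/\d+/) { |match| numbers << match.to_i }
    end

    # One descending sort over the unique values, printed one per line.
    puts numbers.sort.reverse

Reading the file directly with IO.foreach('text_file_30gb.txt'), as in the first comment, would process it the same way line by line without loading the whole 30GB into memory.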