The challenge is to filter a large file quickly.
- Input: Each line has three space-separated positive integers A, B, T.
- Output: All input lines A, B, T that satisfy either of the following criteria.
  - There exists another input line C, D, U where D = A and 0 <= T - U < 100.
  - There exists another input line C, D, U where B = C and 0 <= U - T < 100.
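To make the two criteria concrete, here is a small brute-force reference filter. It is my own sketch, not any entrant's code; `filter_lines` is a hypothetical name, and the O(n^2) scan is only meant for checking correctness on tiny inputs, never the 50-million-line file.

```python
def filter_lines(lines):
    """Keep each line (A, B, T) for which some *other* line (C, D, U)
    satisfies D == A and 0 <= T - U < 100, or B == C and 0 <= U - T < 100."""
    out = []
    for i, (a, b, t) in enumerate(lines):
        for j, (c, d, u) in enumerate(lines):
            if i == j:
                continue  # "another input line" excludes the line itself
            if (d == a and 0 <= t - u < 100) or (b == c and 0 <= u - t < 100):
                out.append((a, b, t))
                break
    return out
```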
To make a test file, use the following Python script, which will also be used for testing. It will make a 1.3GB file. You can of course reduce nolines for testing.
```python
import random

nolines = 50000000  # 50 million
for i in xrange(nolines):
    print random.randint(0, nolines - 1), random.randint(0, nolines - 1), random.randint(0, nolines - 1)
```

Rules

The fastest code when tested on an input file I make using the above script on my computer wins. The deadline is one week from the time of the first correct entry.
My Machine

The timings will be run on my machine. This is a standard Ubuntu install with 8GB of RAM and an AMD FX-8350 eight-core processor. This also means I need to be able to run your code.
Some relevant timing information
Correctness issues
There are small off-by-one errors I have highlighted in the comments below the various answers.
Timings updated to run the following before each test.
```
sync && sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'

time wc test.file
real    0m26.835s
user    0m18.363s
sys     0m0.495s

time sort -n largefile.file > /dev/null
real    1m32.344s
user    2m9.530s
sys     0m6.543s
```

Status of entries
I run the following line before each test.
```
sync && sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
```

- Perl. (Waiting for a bug fix.)
- Scala. 1 minute 37 seconds by @James_pic. (Using scala -J-Xmx6g Filterer largefile.file output.txt)
- Java. 1 minute 23 seconds by @Geobits. (Using java -Xmx6g Filter_26643)
- C. 2 minutes 21 seconds by @ScottLeadley.
- C. 28 seconds by @James_pic.
- Python+pandas. Maybe there is a simple "groupby" solution?
- C. 28 seconds by @KeithRandall.
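On the pandas idea above: a hedged sketch of how the self-join could look, assuming columns A, B, T and using merge rather than groupby. `pandas_filter` is a hypothetical name, and a real entry would have to worry about memory, since merging 50 million rows against themselves can explode.

```python
import pandas as pd

def pandas_filter(df):
    """Hypothetical sketch: df has integer columns A, B, T, one row per line."""
    left = df.reset_index()  # keep original line numbers in an "index" column
    right = df.reset_index().rename(
        columns={"A": "C", "B": "D", "T": "U", "index": "rindex"})
    # Criterion 1: another line with D == A and 0 <= T - U < 100.
    m1 = left.merge(right, left_on="A", right_on="D")
    m1 = m1[(m1["index"] != m1["rindex"])
            & (m1["T"] - m1["U"] >= 0) & (m1["T"] - m1["U"] < 100)]
    # Criterion 2: another line with B == C and 0 <= U - T < 100.
    m2 = left.merge(right, left_on="B", right_on="C")
    m2 = m2[(m2["index"] != m2["rindex"])
            & (m2["U"] - m2["T"] >= 0) & (m2["U"] - m2["T"] < 100)]
    keep = sorted(set(m1["index"]).union(m2["index"]))
    return df.loc[keep]
```

The `index != rindex` guard enforces the "another input line" wording, so a line cannot satisfy a criterion by matching itself.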
The winners are Keith Randall and James_pic.
I couldn't tell their running times apart and both are almost as fast as wc!