The challenge is to filter a large file quickly.
- Input: Each line has three space-separated positive integers A, B, T.
- Output: All input lines A, B, T that satisfy either of the following criteria.
  - There exists another input line C, D, U where D = A and 0 <= T - U < 100.
  - There exists another input line C, D, U where B = C and 0 <= U - T < 100.
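To make the two criteria concrete, here is a small brute-force reference filter. It is my own sketch, not any entrant's code; `filter_lines` is a hypothetical name, and the O(n^2) scan is only meant for checking correctness on tiny inputs, never the 50-million-line file.

```python
def filter_lines(lines):
    """Keep each line (A, B, T) for which some *other* line (C, D, U)
    satisfies D == A and 0 <= T - U < 100, or B == C and 0 <= U - T < 100."""
    out = []
    for i, (a, b, t) in enumerate(lines):
        for j, (c, d, u) in enumerate(lines):
            if i == j:
                continue  # "another input line" excludes the line itself
            if (d == a and 0 <= t - u < 100) or (b == c and 0 <= u - t < 100):
                out.append((a, b, t))
                break
    return out
```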
To make a test file, use the following Python script, which will also be used for testing. It will make a 1.3GB file. You can of course reduce nolines for testing.
```python
import random

nolines = 50000000  # 50 million
for i in xrange(nolines):
    print random.randint(0, nolines - 1), random.randint(0, nolines - 1), random.randint(0, nolines - 1)
```

Rules

The fastest code when tested on an input file I make using the above script on my computer wins. The deadline is one week from the time of the first correct entry.
My Machine

The timings will be run on my machine. This is a standard Ubuntu install with 8GB of RAM and an AMD FX-8350 eight-core processor. This also means I need to be able to run your code.
Some relevant timing information
Correctness issues
There are small off-by-one errors I have highlighted in the comments below the various answers.
Timings updated to run the following before each test.
```
sync && sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'

time wc test.file
real    0m26.835s
user    0m18.363s
sys     0m0.495s

time sort -n largefile.file > /dev/null
real    1m32.344s
user    2m9.530s
sys     0m6.543s
```

Status of entries
I run the following line before each test.
```
sync && sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
```

- Perl. (Waiting for a bug fix.)
- Scala. 1 minute 37 seconds by @James_pic. (Using scala -J-Xmx6g Filterer largefile.file output.txt)
- Java. 1 minute 23 seconds by @Geobits. (Using java -Xmx6g Filter_26643)
- C. 2 minutes 21 seconds by @ScottLeadley.
- C. 28 seconds by @James_pic.
- Python+pandas. Maybe there is a simple "groupby" solution?
- C. 28 seconds by @KeithRandall.
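On the pandas idea above: a hedged sketch of how the self-join could look, assuming columns A, B, T and using merge rather than groupby. `pandas_filter` is a hypothetical name, and a real entry would have to worry about memory, since merging 50 million rows against themselves can explode.

```python
import pandas as pd

def pandas_filter(df):
    """Hypothetical sketch: df has integer columns A, B, T, one row per line."""
    left = df.reset_index()  # keep original line numbers in an "index" column
    right = df.reset_index().rename(
        columns={"A": "C", "B": "D", "T": "U", "index": "rindex"})
    # Criterion 1: another line with D == A and 0 <= T - U < 100.
    m1 = left.merge(right, left_on="A", right_on="D")
    m1 = m1[(m1["index"] != m1["rindex"])
            & (m1["T"] - m1["U"] >= 0) & (m1["T"] - m1["U"] < 100)]
    # Criterion 2: another line with B == C and 0 <= U - T < 100.
    m2 = left.merge(right, left_on="B", right_on="C")
    m2 = m2[(m2["index"] != m2["rindex"])
            & (m2["U"] - m2["T"] >= 0) & (m2["U"] - m2["T"] < 100)]
    keep = sorted(set(m1["index"]).union(m2["index"]))
    return df.loc[keep]
```

The `index != rindex` guard enforces the "another input line" wording, so a line cannot satisfy a criterion by matching itself.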
The winners are Keith Randall and James_pic.
I couldn't tell their running times apart and both are almost as fast as wc!