edited tags
The Fifth Marshal
deleted 123 characters in body
user9206

The challenge is to filter a large file quickly.

  • Input: Each line has three space-separated positive integers.
  • Output: All input lines A, B, T that satisfy either of the following criteria.
  1. There exists another input line C, D, U where D = A and 0 <= T - U < 100.
  2. There exists another input line C, D, U where B = C and 0 <= U - T < 100.
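The two criteria can be checked on small inputs with a short reference filter. The following is only a sketch for validating answers, not one of the timed entries: the function name `filter_lines`, the `WINDOW` constant, and the choice to group lines by field and binary-search the sorted third fields are all assumptions made for illustration. "Another input line" is read as a line at a different position in the file, so a line never matches itself.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

WINDOW = 100  # both criteria use a half-open range of width 100


def filter_lines(rows):
    """Return the rows (A, B, T) that satisfy criterion 1 or 2, in input order."""
    by_first = defaultdict(list)   # first field  -> third fields of those lines
    by_second = defaultdict(list)  # second field -> third fields of those lines
    for a, b, t in rows:
        by_first[a].append(t)
        by_second[b].append(t)
    for times in by_first.values():
        times.sort()
    for times in by_second.values():
        times.sort()

    def count_between(times, lo, hi):
        # Number of values v in the sorted sequence with lo <= v <= hi.
        return bisect_right(times, hi) - bisect_left(times, lo)

    kept = []
    for a, b, t in rows:
        # When A == B the line sits in both of its own lookup groups,
        # so one hit must be discounted to exclude the line itself.
        own = 1 if a == b else 0
        # Criterion 1: another line with D == A and 0 <= T - U < 100.
        hits1 = count_between(by_second.get(a, ()), t - WINDOW + 1, t) - own
        # Criterion 2: another line with C == B and 0 <= U - T < 100.
        hits2 = count_between(by_first.get(b, ()), t, t + WINDOW - 1) - own
        if hits1 > 0 or hits2 > 0:
            kept.append((a, b, t))
    return kept
```

For example, given the rows `(1, 2, 100)`, `(2, 3, 150)` and `(3, 1, 400)`, the first two lines match each other (the first line's B equals the second line's first field, and the third fields differ by 50), while the third line has no partner within the range.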

To make a test file, use the following Python script, which will also be used for testing. It makes a 1.3GB file. You can of course reduce nolines for testing.

    import random
    nolines = 50000000 # 50 million
    for i in xrange(nolines):
        print random.randint(0,nolines-1), random.randint(0,nolines-1), random.randint(0,nolines-1)
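The script above uses Python 2 syntax (`xrange` and the `print` statement). For anyone who only has Python 3, a rough equivalent might look like the sketch below. The function name `make_test_lines` and the optional `seed` parameter are assumptions added here for reproducible small tests; they are not part of the original script, and a seeded run will not reproduce the file used for the official timings.

```python
import random


def make_test_lines(nolines, seed=None):
    """Yield `nolines` lines of three space-separated random integers.

    Each integer is drawn uniformly from 0 to nolines - 1 inclusive,
    matching the ranges used by the original script.
    """
    rng = random.Random(seed)  # seed is a hypothetical extra for repeatable tests
    for _ in range(nolines):
        yield "%d %d %d" % (
            rng.randint(0, nolines - 1),
            rng.randint(0, nolines - 1),
            rng.randint(0, nolines - 1),
        )
```

Writing the lines to a file is then a matter of iterating over the generator, for example `for line in make_test_lines(1000): print(line)` with the output redirected to a file.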

Rules. The fastest code when tested on an input file I make using the above script on my computer wins. The deadline is one week from the time of the first correct entry.

My Machine. The timings will be run on my machine, a standard Ubuntu install with 8GB of RAM on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.

Some relevant timing information

Correctness issues

There are small off-by-one errors, which I have highlighted in the comments below the various answers.

Timings updated to run the following before each test.

    sync && sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'

    time wc test.file

    real    0m26.835s
    user    0m18.363s
    sys     0m0.495s

    time sort -n largefile.file > /dev/null

    real    1m32.344s
    user    2m9.530s
    sys     0m6.543s

Status of entries

I run the following line before each test.

sync && sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches' 
  • Perl (Waiting for bug fix.)
  • Scala. 1 minute 37 seconds by @James_pic. (Using scala -J-Xmx6g Filterer largefile.file output.txt)
  • Java. 1 minute 23 seconds by @Geobits. (Using java -Xmx6g Filter_26643)
  • C. 2 minutes 21 seconds by @ScottLeadley.
  • C. 28 seconds by @James_pic.
  • Python+pandas. Maybe there is a simple "groupby" solution?
  • C. 28 seconds by @KeithRandall.

The winners are Keith Randall and James_pic.

I couldn't tell their running times apart and both are almost as fast as wc!

Notice removed Draw attention by user9206
Bounty Ended with James_pic's answer chosen by CommunityBot
added 135 characters in body
user9206

added 40 characters in body
user9206
deleted 58 characters in body
user9206
added 183 characters in body
user9206
added 349 characters in body
user9206
added 34 characters in body
user9206
edited body
user9206
added 46 characters in body
user9206
added 69 characters in body
user9206
added 58 characters in body
user9206
deleted 25 characters in body
user9206
added 77 characters in body
user9206
Tweeted twitter.com/#!/StackCodeGolf/status/464154540862738433
Notice added Draw attention by user9206
Bounty Started worth 100 reputation by CommunityBot
added 30 characters in body
user9206
deleted 2 characters in body
user9206
added 19 characters in body
user9206
user9206