Timeline for Filter a large file quickly

Current License: CC BY-SA 3.0

15 events
May 14, 2014 at 12:38 comment added NPSF3000 @Lembik Ignore those warnings; they are fine. I understand it needs to run on your system (and it should, thanks to Mono), but I am just too busy to debug it myself (flying to Sydney in 10 hours for a busy few days).
May 14, 2014 at 8:43 comment added user9206 The warnings are: NPSF3000.cs(134,24): warning CS0219: The variable `tasks' is assigned but its value is never used, and NPSF3000.cs(61,30): warning CS0219: The variable `solved' is assigned but its value is never used. It is a shame I can't run your code yet. The question does state that my system is Ubuntu and that I need to be able to run the code to test it.
May 13, 2014 at 22:32 comment added NPSF3000 @Lembik, I don't run Linux and have little time ATM to debug :( What are the warnings? Note: I don't create any threads; I simply queue work for the threadpool to handle for me.
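(For context, queuing work onto the .NET thread pool rather than spawning threads is typically done with Task.Run. The sketch below is a generic illustration using made-up work items, not the answer's actual code.)

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ThreadPoolSketch
    {
        static void Main()
        {
            // Queue chunks of work onto the thread pool instead of creating threads.
            var tasks = new List<Task>();
            for (int i = 0; i < 4; i++)
            {
                int chunk = i; // capture the loop variable for the closure
                tasks.Add(Task.Run(() => Console.WriteLine("Processing chunk " + chunk)));
            }
            // Block until all queued work has completed.
            Task.WaitAll(tasks.ToArray());
        }
    }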
May 13, 2014 at 18:33 comment added user9206 It looks like you are creating a huge number of threads. Do you mean to do that?
May 13, 2014 at 18:14 comment added user9206 Can you give me instructions for running it on Linux? I tried mcs NPSF3000.cs, which gives a couple of warnings, and then mono NPSF3000.exe ~/largefile.file, which gets as far as Started: Parallel Sort A but then seems to do nothing.
May 13, 2014 at 12:41 history edited NPSF3000 CC BY-SA 3.0
new version of code, with args
May 12, 2014 at 23:25 comment added NPSF3000 Comparer fixed, results produced. Computation accounts for only about 5 seconds of my thirty (input 12, sorts 5 each), so I'm thinking about my next line of attack. IO is processing at ~100 MB/s, so speedups might be limited.
May 12, 2014 at 23:13 history edited NPSF3000 CC BY-SA 3.0
added 174 characters in body
May 12, 2014 at 14:47 comment added NPSF3000 @James_pic thanks for the suggestions, I'll chase them up if I have time. I just cut 40-odd seconds out of my IO, so I can focus again on the sort and computation.
May 12, 2014 at 14:43 history edited NPSF3000 CC BY-SA 3.0
deleted 774 characters in body
May 12, 2014 at 14:38 comment added James_pic Another fun option, since you know that the values of B will be uniformly distributed in a specific range, would be to swap binary search for interpolation search, which reduces search time from O(log(n)) to O(log(log(n))).
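(For illustration only: an interpolation search over a sorted long[] with roughly uniform values might look like the sketch below. This is a generic version of the technique James_pic mentions, not code from the answer.)

    using System;

    static class InterpolationSearchSketch
    {
        // Interpolation search: like binary search, but estimates the probe index
        // from where the target should fall between sorted[low] and sorted[high],
        // assuming roughly uniform values. Returns the index of target, or -1.
        static int InterpolationSearch(long[] sorted, long target)
        {
            int low = 0, high = sorted.Length - 1;
            while (low <= high && target >= sorted[low] && target <= sorted[high])
            {
                if (sorted[high] == sorted[low])
                    return sorted[low] == target ? low : -1;

                // Estimate the probe position from the target's relative value.
                double fraction = (double)(target - sorted[low]) / (sorted[high] - sorted[low]);
                int pos = low + (int)(fraction * (high - low));

                if (sorted[pos] == target) return pos;
                if (sorted[pos] < target) low = pos + 1; else high = pos - 1;
            }
            return -1;
        }
    }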
May 12, 2014 at 14:35 comment added James_pic More generally, if you're sorting by A and B, there's a faster algorithm than iterating on A and binary searching on B, which is O(n log(n)) (and is effectively a poor man's hash table). You can instead merge-join the two lists, which is O(n).
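(A merge-join walks both sorted inputs once, which is where the O(n) comes from. The sketch below is a minimal, generic version joining two sorted long[] arrays on equal values; real code would carry whole records and handle duplicate keys with an inner loop.)

    using System;
    using System.Collections.Generic;

    static class MergeJoinSketch
    {
        // Merge-join two arrays that are each sorted on the join key.
        // Both inputs are scanned exactly once, so the join is O(n).
        static IEnumerable<long> MergeJoin(long[] sortedA, long[] sortedB)
        {
            int i = 0, j = 0;
            while (i < sortedA.Length && j < sortedB.Length)
            {
                if (sortedA[i] < sortedB[j]) i++;
                else if (sortedA[i] > sortedB[j]) j++;
                else
                {
                    yield return sortedA[i]; // matching key found in both lists
                    i++; j++;
                }
            }
        }
    }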
May 12, 2014 at 14:32 comment added James_pic You should get around 200 results, more or less irrespective of the size of your input data. I suspect your problem is in the way you're using binary search in lines 98-102: you seem to be assuming that x.A will come from sortedA and x.B from sortedB, whereas in fact both will come from sortedB, so this Comparer will produce nonsense results.
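(The fix being hinted at amounts to a comparer that looks only at the B field of both arguments. The sketch below is hypothetical: Row, A, and B are stand-ins for whatever element type the answer actually uses.)

    using System;
    using System.Collections.Generic;

    class Row { public long A; public long B; }

    // Compare rows by B only; A must not take part, because the binary search
    // runs over sortedB and both arguments are treated as B-sorted rows.
    class ByBComparer : IComparer<Row>
    {
        public int Compare(Row x, Row y)
        {
            return x.B.CompareTo(y.B);
        }
    }

    // Usage: int index = Array.BinarySearch(sortedB, probeRow, new ByBComparer());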
May 12, 2014 at 14:27 history edited NPSF3000 CC BY-SA 3.0
added 1556 characters in body
May 12, 2014 at 13:58 history answered NPSF3000 CC BY-SA 3.0