
I want to read all lines of a 1 GB file into a Stream<String> as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I do some computations (map()/filter()).

At first I thought this was already done in parallel, but it seems I was wrong: reading the file as it is takes about 50 seconds on my dual-CPU laptop. However, if I split the file using bash commands and then process the parts in parallel, it only takes about 30 seconds.

I tried the following combinations:

  1. single file, no parallel lines() stream: ~50 seconds
  2. single file, Files.lines(..).parallel().[...]: ~50 seconds
  3. two files, no parallel lines() stream: ~30 seconds
  4. two files, Files.lines(..).parallel().[...]: ~30 seconds

I ran these four variants multiple times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
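For reference, a stripped-down version of what I'm timing looks roughly like this; the map/filter steps here are only placeholders, not my real computation:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LinesBenchmark {
        public static void main(String[] args) throws Exception {
            Path path = Paths.get("file");              // the 1 GB input file
            try (Stream<String> lines = Files.lines(path)) {
                Object[] result = lines
                        .parallel()                     // present in variants 2 and 4, absent in 1 and 3
                        .map(String::trim)              // placeholder map step
                        .filter(s -> !s.isEmpty())      // placeholder filter step
                        .toArray();                     // triggers the evaluation
                System.out.println(result.length);
            }
        }
    }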

The conclusion is that using lines().parallel() makes no difference. Since reading two files in parallel is faster, splitting the file does give a performance gain; however, it seems a single file is always read serially.

Edit:
I want to point out that I use an SSD, so there is practically no seeking time. The file has 1658652 (relatively short) lines in total. Splitting the file in bash takes about 1.5 seconds:

    time split -l 829326 file   # 829326 = 1658652 / 2
    split -l 829326 file  0,14s user 1,41s system 16% cpu 9,560 total

So my question is: is there any class or function in the Java 8 JDK that can parallelize reading all the lines of a single file without having to split it first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.
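Conceptually, what I have in mind is something like the following hypothetical sketch: split on a byte offset rather than a line number (since finding line (totalLines/2)+1 would itself require a scan), and let two readers process their own half concurrently. The class and method names are made up, and the byte bookkeeping assumes a single-byte encoding with \n line endings:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.RandomAccessFile;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.function.Consumer;

    public class SplitRead {

        public static void main(String[] args) throws Exception {
            Path path = Paths.get("file");
            long size = Files.size(path);

            // Find the first line break at or after the middle of the file,
            // so that the second reader starts on a complete line.
            long splitAt;
            try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
                raf.seek(size / 2);
                int b;
                while ((b = raf.read()) != -1 && b != '\n') {
                    // scan forward to the next newline
                }
                splitAt = raf.getFilePointer();
            }

            Consumer<String> work = line -> { /* map/filter work would go here */ };
            ExecutorService pool = Executors.newFixedThreadPool(2);
            Future<?> firstHalf  = pool.submit(() -> readRegion(path, 0, splitAt, work));
            Future<?> secondHalf = pool.submit(() -> readRegion(path, splitAt, size, work));
            firstHalf.get();
            secondHalf.get();
            pool.shutdown();
        }

        // Reads the lines starting in [from, to) and hands them to the consumer.
        static Void readRegion(Path path, long from, long to, Consumer<String> consumer)
                throws IOException {
            try (InputStream in = Files.newInputStream(path)) {
                long skipped = 0;
                while (skipped < from) {
                    skipped += in.skip(from - skipped);
                }
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                long position = from;
                String line;
                while (position < to && (line = reader.readLine()) != null) {
                    consumer.accept(line);
                    position += line.length() + 1;   // +1 for '\n'; assumes single-byte characters
                }
            }
            return null;
        }
    }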

5 Comments
  • What do you mean split the file using bash? Also, you have to read the file to find the lines. It can't just know where the line terminators are. Commented Sep 7, 2014 at 15:07
  • @SotiriosDelimanolis You can jump to an arbitrary point in the file, look for the first newline, and split that way without processing all the data in the entire segment first. Commented Sep 7, 2014 at 15:22
  • In bash, you can use split -l #lines; combine that with wc -l to split it into two halves. Michael Petch understood my idea :) Commented Sep 7, 2014 at 15:23
  • ... Probably the real difficulty is that the file is huge and can't fit in cache, so it has to be read from disk. But the disk only has one seek head (typically), so you can either pay the price of seeking between the halves (very expensive) or read sequentially from one end. The price likely gets worse as N goes up because of multiple demands on the head, so as N goes up you pay seek-time costs that will likely hurt your parallelism. My guess is you simply have to try this for various N; there are too many buffers/overheads to easily guess where the sweet spot is. Expect tail-off for small N. Commented Sep 7, 2014 at 15:28
  • @IraBaxter Although I agree, it is interesting that the split operation can be done from bash more efficiently. I assumed user3001 also included the time to split with the time to process. Commented Sep 7, 2014 at 15:31

1 Answer


You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).

If you map the file into memory via a file channel, you should be able to process the data from there in parallel with great speed, though chances are you won't need to, as you'll already see a huge speed increase.
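For illustration only, a minimal sketch of the memory-mapped variant, assuming the file is UTF-8 and small enough (under 2 GB) to fit into a single MappedByteBuffer; the path and the filter step are placeholders:

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Arrays;

    public class MappedRead {
        public static void main(String[] args) throws Exception {
            Path path = Paths.get("file");
            try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
                // Map the whole file into memory instead of streaming it line by line.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte[] bytes = new byte[buffer.remaining()];
                buffer.get(bytes);                              // pull the mapped pages into the heap
                String content = new String(bytes, StandardCharsets.UTF_8);

                // From here the data is in memory and can be processed in parallel.
                long count = Arrays.stream(content.split("\n"))
                        .parallel()
                        .filter(s -> !s.isEmpty())              // placeholder for the real computation
                        .count();
                System.out.println(count);
            }
        }
    }

Whether this actually helps depends on whether the per-line work or the I/O dominates, as noted above.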


3 Comments

I changed the method I used for testing, and it seems I was wrong: I had accidentally limited the result size to 200 before doing the final operation (forEach(System.out::println)) on the resulting stream. Actually, Files.lines(Path) reads the given file as a sequential stream. Moreover, most of the time was spent reading the file rather than on the computations, plus I ran into some GC problems. I timed Files.lines(Path).toArray() against a method using a MappedByteBuffer (read-only, 0, file.size()). The results were actually quite the same, with the MappedByteBuffer a bit slower here and there.
@user3001, what was your final resolution here?
Actually I found out that it largely depends on what you do with the data in the stream. In the end I used a sequential stream to avoid problems with garbage collection.
