
I need to read and write a huge number of strings (each line is 90 characters long) from/to a zipped text file.
There is also a time-consuming task to prepare the input/output, but it can be neglected: the I/O time is much, much bigger (profiled).

This is the code I am using:

    GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(file));
    out.write((stringData + NewLineConstant).getBytes());

    GZIPInputStream in = new GZIPInputStream(new FileInputStream(file));
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(in), 8192);
    String data = bufferedReader.readLine();

The problem is that it takes too much time to complete.

This is also done over multiple files that are used to sort the data (merge sort).

Is there something I can do to dramatically improve the performance (without a hardware change)?

  • 1
    If I read this code correctly, you're writing to a file and reading the same file back in? Or am I wrong? Commented Feb 23, 2012 at 11:09
  • 1
    Ah! You sort in between. Could you possibly share a bigger bit of the code? Might 'expose' opportunity for speedup. Commented Feb 23, 2012 at 11:10
  • 1
    How much faster do you need it to be? Commented Feb 23, 2012 at 11:13
  • 1
    Why don't you use a BufferedWriter, call newLine() and remove that NewLineConstant? You can even reuse a char[90] buffer for calling write. Commented Feb 23, 2012 at 11:42
  • 1
    Which line is the most time consuming (based on your profiling)? Commented Feb 23, 2012 at 12:13
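The BufferedWriter suggestion from the comments above could look like the sketch below. The class name, method names, and the 64&nbsp;KB buffer size are illustrative choices, not from the original post; the point is that newLine() replaces the string concatenation with NewLineConstant, and a try-with-resources block guarantees the gzip stream is flushed and closed.

```java
import java.io.*;
import java.util.zip.*;

public class GzipLineWriter {

    // Write lines through a BufferedWriter layered over the gzip stream.
    // newLine() appends the platform line separator, so no newline constant
    // needs to be concatenated onto each string.
    public static void writeLines(File file, String[] lines) throws IOException {
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(
                        new GZIPOutputStream(new FileOutputStream(file))),
                64 * 1024)) {
            for (String line : lines) {
                w.write(line);
                w.newLine();
            }
        } // try-with-resources closes (and finishes) the gzip stream here
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".gz");
        writeLines(f, new String[] { "alpha", "beta" });
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(f))))) {
            System.out.println(r.readLine()); // prints "alpha"
        }
        f.delete();
    }
}
```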

1 Answer


Do you have any information about the distribution of the first one or two characters in those lines?

If so, you could read this big file one time and create one or two dozen buckets (files) based only on the first one or two characters of those lines. After that, you could sort those buckets in memory (each file would be smaller than 1 GB) if the distribution is uniform.

In detail it would look like this:

  • open the big file (10GB)
  • open dozens of bucket files to write (1 for each type of line: aa, ab, ...)
  • read the lines of the big file, and write to the bucket files
  • close the big file
  • close the bucket files
  • sort the bucket files in memory (first aa, then ab, ...); this could be parallelized; then append them
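The steps above could be sketched like this. The class name, the bucket-file naming scheme, and the choice of a one-character prefix are illustrative assumptions (uncompressed files are used here for brevity; the gzip wrapping from the question would slot in around the file streams). Since lines that start with a smaller character sort before lines that start with a larger one, sorting each bucket and appending them in key order yields a globally sorted output.

```java
import java.io.*;
import java.util.*;

public class BucketSort {

    // Partition the input lines into per-prefix bucket files, sort each
    // bucket in memory, then append the sorted buckets in prefix order.
    public static void externalSort(File input, File output, File tmpDir)
            throws IOException {
        // TreeMap keeps bucket keys (first characters) in sorted order.
        Map<Character, PrintWriter> buckets = new TreeMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = r.readLine()) != null) {
                char key = line.isEmpty() ? '\0' : line.charAt(0);
                PrintWriter w = buckets.get(key);
                if (w == null) {
                    w = new PrintWriter(new BufferedWriter(new FileWriter(
                            new File(tmpDir, "bucket-" + (int) key))));
                    buckets.put(key, w);
                }
                w.println(line);
            }
        }
        for (PrintWriter w : buckets.values()) {
            w.close();
        }

        try (PrintWriter out = new PrintWriter(
                new BufferedWriter(new FileWriter(output)))) {
            for (Character key : buckets.keySet()) { // ascending key order
                File bucket = new File(tmpDir, "bucket-" + (int) key);
                List<String> lines = new ArrayList<>();
                try (BufferedReader r = new BufferedReader(new FileReader(bucket))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        lines.add(line);
                    }
                }
                Collections.sort(lines); // each bucket is assumed to fit in memory
                for (String l : lines) {
                    out.println(l);
                }
                bucket.delete();
            }
        }
    }
}
```

The per-bucket sort (and the per-bucket reads) could be handed to a thread pool, since the buckets are independent.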

In general, you should increase the read buffers (from 8 KB to a few megabytes) and the write buffers (from 8 KB to 256-512 KB).
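A sketch of those larger buffers applied to the gzip streams from the question. The class name, helper methods, and the concrete sizes (256 KB write, 4 MB read) are illustrative; the key point is that both GZIPOutputStream and GZIPInputStream accept an explicit internal buffer size as their second constructor argument, in addition to the BufferedReader/BufferedWriter buffer on top.

```java
import java.io.*;
import java.util.zip.*;

public class BigBufferGzip {

    static final int WRITE_BUF = 256 * 1024;      // 256 KB for writing
    static final int READ_BUF = 4 * 1024 * 1024;  // 4 MB for reading

    // Enlarged buffer on the gzip stream itself and on the writer wrapper.
    public static void write(File file, String text) throws IOException {
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(file), WRITE_BUF)),
                WRITE_BUF)) {
            out.write(text);
        }
    }

    // Enlarged buffer on the gzip stream and on the BufferedReader,
    // replacing the 8192 used in the question.
    public static String readFirstLine(File file) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(file), WRITE_BUF)),
                READ_BUF)) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("big", ".gz");
        write(f, "hello gzip\n");
        System.out.println(readFirstLine(f)); // prints "hello gzip"
        f.delete();
    }
}
```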
