
I need to read and write a huge number of strings (each line is 90 characters long) from/to a zipped text file.
There is also a time-consuming task to prepare the input/output, but it can be neglected: the I/O time is much, much bigger (profiled).

This is the code I am using:

    GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(file));
    out.write((stringData + NewLineConstant).getBytes());

    GZIPInputStream in = new GZIPInputStream(new FileInputStream(file));
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(in), 8192);
    String data = bufferedReader.readLine();

The problem is that it takes too much time to complete.

This is also done over multiple files that are used to sort the data (merge sort).

Is there something I can do to dramatically improve the performance (without a hardware change)?

  • 1
    If I read this code correctly, you're writing to a file and reading the same file back in? Or am I wrong? Commented Feb 23, 2012 at 11:09
  • 1
    Ah! You sort in between. Could you possibly share a bigger bit of the code? Might 'expose' opportunity for speedup. Commented Feb 23, 2012 at 11:10
  • 1
    How much faster do you need it to be? Commented Feb 23, 2012 at 11:13
  • 1
    Why don't you use a BufferedWriter, call newLine() and remove that NewLineConstant? You can even reuse a char[90] buffer for calling write. Commented Feb 23, 2012 at 11:42
  • 1
    Which line is the most time consuming (based on your profiling)? Commented Feb 23, 2012 at 12:13
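The BufferedWriter suggestion from the comments above could look like the sketch below. The class name, method names, and the 64&nbsp;KB buffer size are illustrative choices, not from the original post; the point is that newLine() replaces the string concatenation with NewLineConstant, and a try-with-resources block guarantees the gzip stream is flushed and closed.

```java
import java.io.*;
import java.util.zip.*;

public class GzipLineWriter {

    // Write lines through a BufferedWriter layered over the gzip stream.
    // newLine() appends the platform line separator, so no newline constant
    // needs to be concatenated onto each string.
    public static void writeLines(File file, String[] lines) throws IOException {
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(
                        new GZIPOutputStream(new FileOutputStream(file))),
                64 * 1024)) {
            for (String line : lines) {
                w.write(line);
                w.newLine();
            }
        } // try-with-resources closes (and finishes) the gzip stream here
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".gz");
        writeLines(f, new String[] { "alpha", "beta" });
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(f))))) {
            System.out.println(r.readLine()); // prints "alpha"
        }
        f.delete();
    }
}
```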

1 Answer


Do you have any information about the distribution of the first one or two characters in those lines?

If so, you could read this big file one time and create one or two dozen buckets (files) based only on the first one or two characters of those lines. After that, you could sort those buckets in memory (each file would be smaller than 1 GB) if the distribution is uniform.

In detail it would look like this:

  • open the big file (10GB)
  • open dozens of bucket files to write (1 for each type of line: aa, ab, ...)
  • read the lines of the big file, and write to the bucket files
  • close the big file
  • close the bucket files
  • sort the bucket files in memory (first aa, then ab, ...); this could be parallelized; then append them
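The steps above could be sketched like this. The class name, the bucket-file naming scheme, and the choice of a one-character prefix are illustrative assumptions (uncompressed files are used here for brevity; the gzip wrapping from the question would slot in around the file streams). Since lines that start with a smaller character sort before lines that start with a larger one, sorting each bucket and appending them in key order yields a globally sorted output.

```java
import java.io.*;
import java.util.*;

public class BucketSort {

    // Partition the input lines into per-prefix bucket files, sort each
    // bucket in memory, then append the sorted buckets in prefix order.
    public static void externalSort(File input, File output, File tmpDir)
            throws IOException {
        // TreeMap keeps bucket keys (first characters) in sorted order.
        Map<Character, PrintWriter> buckets = new TreeMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = r.readLine()) != null) {
                char key = line.isEmpty() ? '\0' : line.charAt(0);
                PrintWriter w = buckets.get(key);
                if (w == null) {
                    w = new PrintWriter(new BufferedWriter(new FileWriter(
                            new File(tmpDir, "bucket-" + (int) key))));
                    buckets.put(key, w);
                }
                w.println(line);
            }
        }
        for (PrintWriter w : buckets.values()) {
            w.close();
        }

        try (PrintWriter out = new PrintWriter(
                new BufferedWriter(new FileWriter(output)))) {
            for (Character key : buckets.keySet()) { // ascending key order
                File bucket = new File(tmpDir, "bucket-" + (int) key);
                List<String> lines = new ArrayList<>();
                try (BufferedReader r = new BufferedReader(new FileReader(bucket))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        lines.add(line);
                    }
                }
                Collections.sort(lines); // each bucket is assumed to fit in memory
                for (String l : lines) {
                    out.println(l);
                }
                bucket.delete();
            }
        }
    }
}
```

The per-bucket sort (and the per-bucket reads) could be handed to a thread pool, since the buckets are independent.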

In general, you should increase the read buffers (from 8 KB to a few megabytes) and the write buffers (from 8 KB to 256-512 KB).
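A sketch of those larger buffers applied to the gzip streams from the question. The class name, helper methods, and the concrete sizes (256 KB write, 4 MB read) are illustrative; the key point is that both GZIPOutputStream and GZIPInputStream accept an explicit internal buffer size as their second constructor argument, in addition to the BufferedReader/BufferedWriter buffer on top.

```java
import java.io.*;
import java.util.zip.*;

public class BigBufferGzip {

    static final int WRITE_BUF = 256 * 1024;      // 256 KB for writing
    static final int READ_BUF = 4 * 1024 * 1024;  // 4 MB for reading

    // Enlarged buffer on the gzip stream itself and on the writer wrapper.
    public static void write(File file, String text) throws IOException {
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(file), WRITE_BUF)),
                WRITE_BUF)) {
            out.write(text);
        }
    }

    // Enlarged buffer on the gzip stream and on the BufferedReader,
    // replacing the 8192 used in the question.
    public static String readFirstLine(File file) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(file), WRITE_BUF)),
                READ_BUF)) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("big", ".gz");
        write(f, "hello gzip\n");
        System.out.println(readFirstLine(f)); // prints "hello gzip"
        f.delete();
    }
}
```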
