6

I am working on a data-format transformation job. There is a large input file, around 10GB. My current solution reads the file line by line, transforms the format of each line, and writes the result to an output file. I found that the transform step is the bottleneck, so I am trying to do it concurrently.

Each line is a complete, independent unit; it has nothing to do with the other lines. Some lines may be discarded because a specific value in the line does not meet the requirements.

Now I have two plans:

  1. One thread reads the input file line by line and puts each line on a queue; several worker threads take lines from the queue, transform the format, and put the results on an output queue; finally, an output thread reads lines from the output queue and writes them to the output file.

  2. Several threads concurrently read data from different parts of the input file, process the lines, and write the output to a single file through an output queue or a file lock.

Would you please give me some advice? I really appreciate it.

Thanks in advance!

  • 1
    Solution 1 makes more sense - Using several threads to read/write a file won't speed up the process. Commented Dec 19, 2012 at 14:28
  • 1
    (1) Do you have to have a single output file, or is it OK to have several files each containing a part of the output? (2) Does the order in which the data appears in the output file(s) matter? Commented Dec 19, 2012 at 14:28
  • Perhaps you should post your existing code. It would be good to get some advice on whether your current algorithm is optimised, before you try a complex multi-thread design. Commented Dec 19, 2012 at 14:28
  • 1
    What kind of processing are we talking about? Can the lines be written out-of-order? Commented Dec 19, 2012 at 14:40
  • @NPE the order in the output file does not matter. I tried to write all the data into one output file because the downstream processing function's interface does not support multiple files as input, and it is a third-party tool, so I cannot change its interface. Commented Dec 19, 2012 at 16:51

2 Answers

3

I would go for the first option: reading data from a file in small pieces from several positions is normally slower than reading it sequentially in one pass (depending on file caches, buffering, read-ahead, etc.).

You also might need to think about how to create the output file (collecting the lines from the different worker threads, possibly in the correct order if that is required).
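That first option can be sketched as a reader → worker pool → writer pipeline built on two bounded `BlockingQueue`s. A minimal sketch, assuming the transform is an uppercase conversion and the discard rule drops empty lines (both are placeholders for the real logic); a sentinel "poison pill" value tells each stage when to shut down:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PipelineSketch {
    private static final String POISON = "\u0000EOF"; // sentinel marking end of input

    public static List<String> run(List<String> input, int workers) throws InterruptedException {
        BlockingQueue<String> inQ = new LinkedBlockingQueue<>(1024);
        BlockingQueue<String> outQ = new LinkedBlockingQueue<>(1024);
        List<String> results = new ArrayList<>();

        // Writer thread: drains the output queue until every worker has signaled completion.
        Thread writer = new Thread(() -> {
            int done = 0;
            try {
                while (done < workers) {
                    String line = outQ.take();
                    if (line.equals(POISON)) { done++; continue; }
                    results.add(line); // in the real job: write to the output file
                }
            } catch (InterruptedException ignored) { }
        });
        writer.start();

        // Worker threads: transform each line; discard lines that fail the filter.
        List<Thread> pool = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String line = inQ.take();
                        if (line.equals(POISON)) { outQ.put(POISON); return; }
                        if (line.isEmpty()) continue;    // discard rule (assumed)
                        outQ.put(line.toUpperCase());    // transform (assumed)
                    }
                } catch (InterruptedException ignored) { }
            });
            t.start();
            pool.add(t);
        }

        // Reader: in the real job this loops over a BufferedReader on the 10GB file.
        for (String line : input) inQ.put(line);
        for (int i = 0; i < workers; i++) inQ.put(POISON); // one pill per worker

        for (Thread t : pool) t.join();
        writer.join();
        return results;
    }
}
```

The bounded queues give you backpressure: if the workers fall behind, the reader blocks instead of loading the whole 10GB file into memory.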


2 Comments

"You also might need to think about a way to create the output file" - I don't quite get your point. The order does not matter. Would you give me some explanation? Thanks!
If the order does not matter, it is quite trivial: create the file, have each thread write its results to it as they are produced, then close the file. If the order mattered, some kind of sequence ID per line would be needed.
1

Solution 1 makes sense.

This would also map nicely and simply to Java's Executor framework. Your main thread reads lines and submits each line to an Executor or ExecutorService.

It gets more complicated if you must keep order intact, though.
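A minimal sketch of that Executor mapping, again assuming a hypothetical uppercase transform and an empty-line discard rule. An `ExecutorCompletionService` hands results back as they finish, which is fine here since order does not matter; a discarded line is represented by a `null` result:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorSketch {
    public static List<String> transformAll(List<String> lines, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);

        // Main thread "reads" and submits one task per line.
        for (String line : lines) {
            // transform + filter are assumptions; null marks a discarded line
            cs.submit(() -> line.isEmpty() ? null : line.toUpperCase());
        }

        // Collect results as they complete, skipping discarded lines.
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String r = cs.take().get();
            if (r != null) out.add(r);
        }
        pool.shutdown();
        return out;
    }
}
```

One caveat for a 10GB file: submitting one task per line up front would queue everything in memory, so in practice you would submit in bounded batches or back the executor with a bounded work queue.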

1 Comment

Luckily, order does not matter :-)
