I am working on a data format transformation job. The input is a large file (around 10 GB). My current solution reads the file line by line, transforms each line, and writes the result to an output file. I found that the transform step is the bottleneck, so I am trying to do it concurrently.
Each line is a complete unit, independent of all other lines. Some lines may be discarded because a specific value in the line does not meet the requirements.
I have two plans:
One thread reads the input file line by line and puts each line into a queue. Several worker threads take lines from that queue, transform them, and put the results into an output queue. Finally, a single output thread reads from the output queue and writes to the output file.
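A minimal sketch of this first plan, assuming Python and its standard `queue` and `threading` modules. The `transform` function is a hypothetical placeholder (it upper-cases a line and drops lines starting with `#`), not the real conversion, and `run_pipeline`, `n_workers` and `max_queue` are names made up for illustration. Note that in CPython, threads only speed up the transform if it releases the GIL; for pure-Python CPU-bound work the same structure would need processes instead.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker used on both queues


def transform(line):
    """Hypothetical transform: upper-case the line, discard lines starting with '#'."""
    if line.startswith("#"):
        return None  # None means "discard this line"
    return line.upper()


def reader(lines, in_q, n_workers):
    # Single reader thread: feed every input line to the workers.
    for line in lines:
        in_q.put(line)
    for _ in range(n_workers):
        in_q.put(SENTINEL)  # one stop marker per worker


def worker(in_q, out_q):
    # Worker thread: transform lines until the stop marker arrives.
    while True:
        line = in_q.get()
        if line is SENTINEL:
            out_q.put(SENTINEL)  # tell the writer this worker is done
            return
        result = transform(line)
        if result is not None:
            out_q.put(result)


def writer(out_q, sink, n_workers):
    # Single writer thread: stop once every worker has reported completion.
    finished = 0
    while finished < n_workers:
        item = out_q.get()
        if item is SENTINEL:
            finished += 1
        else:
            sink.write(item)


def run_pipeline(lines, sink, n_workers=4, max_queue=1000):
    # Bounded queues keep memory use flat even for a very large input.
    in_q = queue.Queue(maxsize=max_queue)
    out_q = queue.Queue(maxsize=max_queue)
    threads = [
        threading.Thread(target=reader, args=(lines, in_q, n_workers)),
        threading.Thread(target=writer, args=(out_q, sink, n_workers)),
    ]
    threads += [
        threading.Thread(target=worker, args=(in_q, out_q))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With real files this could be called as `run_pipeline(open("in.txt"), open("out.txt", "w"))` (file names are placeholders). In this sketch the output order is not guaranteed to match the input order, because workers finish at different times.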
Several threads concurrently read from different parts of the input file, transform their lines, and write to the output file, either through an output queue or under a file lock.
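The tricky part of this second plan is splitting the file so that no thread starts in the middle of a line. A rough sketch of one way to do that in Python, assuming newline-terminated lines and a file opened in binary mode; `chunk_ranges` and `read_chunk` are hypothetical helper names, not part of any library:

```python
import os


def chunk_ranges(path, n_chunks):
    """Split a file into at most n_chunks byte ranges that start on line boundaries."""
    size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            target = size * i // n_chunks
            if target <= boundaries[-1]:
                continue
            f.seek(target)
            f.readline()  # skip forward to the start of the next line
            pos = f.tell()
            if boundaries[-1] < pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    # Pair consecutive boundaries into (start, end) ranges.
    return list(zip(boundaries[:-1], boundaries[1:]))


def read_chunk(path, start, end):
    """Yield every complete line whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            yield f.readline()
```

Each `(start, end)` pair could then be handed to its own thread or process, which iterates over `read_chunk(path, start, end)` and transforms the lines it gets. If each worker writes to its own temporary output that is concatenated afterwards, the original line order is kept without a shared lock; that is a design assumption, not something the plan above requires.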
Could you please give me some advice? I would really appreciate it.
Thanks in advance!