Having multiple threads read 1 MB chunks from one file is straightforward: determine the file length and the number of chunks beforehand, queue up the chunks, and let a given maximum number of threads process them in parallel. Each thread opens its own handle on the file (so seeks do not interfere with each other), seeks to its chunk's start position, and reads the chunk. The same thread can then also do the processing.
Only the writing of results has to happen sequentially (I assume the output file position of a compressed chunk cannot easily be determined before processing of the previous chunk has completed): when a thread finishes its chunk, it waits until the previous chunk has been written to disk, then writes its own compressed chunk and releases the memory it used.
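The scheme above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the use of `zlib` as the "processing" step, the chunk size, and the worker count are all assumptions; substitute your own compression routine and tune the numbers for your workload. A condition variable enforces the in-order writes.

```python
import os
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_file_parallel(src, dst, chunk_size=1024 * 1024, max_workers=4):
    """Read and compress chunks of src in parallel, but write them to dst
    strictly in order. zlib is a stand-in for the real processing step."""
    n_chunks = max(1, -(-os.path.getsize(src) // chunk_size))  # ceiling division
    turn = threading.Condition()
    state = {"next": 0}  # index of the chunk whose turn it is to be written

    with open(dst, "wb") as out:
        def worker(index):
            # Each thread opens its own handle, seeks, and reads one chunk.
            with open(src, "rb") as f:
                f.seek(index * chunk_size)
                data = f.read(chunk_size)
            compressed = zlib.compress(data)  # the "processing" step
            # Sequential write: wait until all earlier chunks are on disk.
            with turn:
                while state["next"] != index:
                    turn.wait()
                out.write(compressed)
                state["next"] += 1
                turn.notify_all()  # wake whichever thread holds the next chunk

        # Chunks are submitted in index order, so the thread pool's FIFO
        # queue guarantees the next-to-write chunk is always running or done.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for i in range(n_chunks):
                pool.submit(worker, i)
```

Note that the output here is a concatenation of independent zlib streams, one per chunk; a real tool would add framing so the decompressor knows the chunk boundaries.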
However, you will have to test how many parallel threads give you maximum performance. This depends entirely on your system: the kind of storage device, the speed of the I/O channel, the CPUs and number of cores, the memory speed, the file system involved, and more.
If you are unlucky, the optimal number of threads may turn out to be just one, and more threads may actually slow the process down. It may of course be higher (see this older Dr. Dobb's article, where reading a single file with multiple threads was optimal at 2-4 threads, depending on the hardware used). But the only reliable way to find out is to run some tests and measure yourself.
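A simple way to run such a measurement is to time the whole job once per candidate thread count and keep the best of a few repeats. This is only a sketch: `run_once` is a placeholder for whatever compress-with-N-threads routine you are testing, and the candidate counts are arbitrary.

```python
import time

def find_best_worker_count(run_once, candidates=(1, 2, 4, 8), repeats=3):
    """Time run_once(n_workers) for each candidate thread count and return
    the fastest count along with all timings. run_once is your own
    compress-the-file-with-n-threads routine (a placeholder here)."""
    timings = {}
    for n in candidates:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_once(n)  # e.g. compress the whole file using n threads
            samples.append(time.perf_counter() - start)
        timings[n] = min(samples)  # best-of-N reduces scheduling noise
    best = min(timings, key=timings.get)
    return best, timings
```

Run it on a file of realistic size: caches, prefetching, and drive behavior differ enough between a toy file and your real data that small benchmarks can mislead.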