It seems that you are loading the contents of all the files into memory before writing them to the single target file. This could explain why the process becomes slower over time.
A way to optimize the process is to separate the reading part from the writing part, and do them in parallel. This is known as the producer-consumer pattern. It can be implemented with the Parallel class, with threads, or with tasks, but here I will demonstrate an implementation based on the powerful TPL Dataflow library, which is particularly suited for jobs like this.
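To make the pattern itself concrete first, here is a bare-bones sketch built with a BlockingCollection and a task. The MergeFilesSimple name is just for illustration, and the sketch deliberately omits cancellation, progress reporting and error handling (if the writing task fails, the producer can block indefinitely):

static void MergeFilesSimple(IEnumerable<string> sourceFilePaths, string targetFilePath)
{
    using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
    {
        // Consumer: a single task that appends each buffered content to the target file
        var writerTask = Task.Run(() =>
        {
            using (var writer = new StreamWriter(targetFilePath))
            {
                foreach (var text in queue.GetConsumingEnumerable())
                    writer.Write(text);
            }
        });

        // Producer: reads each small file and feeds the queue (Add blocks while the queue is full)
        foreach (var filePath in sourceFilePaths)
            queue.Add(File.ReadAllText(filePath));

        queue.CompleteAdding(); // Signal the consumer that no more items are coming
        writerTask.Wait(); // Wait until everything has been written
    }
}

The Dataflow implementation below covers the same ground, but also reads two files concurrently, propagates failures gracefully, supports cancellation, and reports progress: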
private static async Task MergeFiles(IEnumerable<string> sourceFilePaths,
    string targetFilePath, CancellationToken cancellationToken = default,
    IProgress<int> progress = null)
{
    var readerBlock = new TransformBlock<string, string>(filePath =>
    {
        return File.ReadAllText(filePath); // Read the small file
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 2, // Reading is parallelizable
        BoundedCapacity = 100, // No more than 100 file-paths buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    StreamWriter streamWriter = null;
    int filesProcessed = 0;
    var writerBlock = new ActionBlock<string>(text =>
    {
        streamWriter.Write(text); // Append to the target file
        filesProcessed++;
        if (filesProcessed % 10 == 0) progress?.Report(filesProcessed);
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 1, // We can't parallelize the writer
        BoundedCapacity = 100, // No more than 100 file-contents buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    readerBlock.LinkTo(writerBlock,
        new DataflowLinkOptions() { PropagateCompletion = true });

    // This is a tricky part. We use BoundedCapacity, so we must manually propagate
    // a possible failure of the writer to the reader, otherwise a deadlock may occur.
    PropagateFailure(writerBlock, readerBlock);

    // Open the output stream
    using (streamWriter = new StreamWriter(targetFilePath))
    {
        // Feed the reader with the file paths
        foreach (var filePath in sourceFilePaths)
        {
            var accepted = await readerBlock.SendAsync(filePath,
                cancellationToken); // Cancel at any time
            if (!accepted) break; // This will happen if the reader fails
        }
        readerBlock.Complete();
        await writerBlock.Completion;
    }

    async void PropagateFailure(IDataflowBlock block1, IDataflowBlock block2)
    {
        try { await block1.Completion.ConfigureAwait(false); }
        catch (Exception ex)
        {
            if (block1.Completion.IsCanceled) return; // On cancellation do nothing
            block2.Fault(ex);
        }
    }
}
Usage example:
var cts = new CancellationTokenSource();
var progress = new Progress<int>(value =>
{
    // Safe to update the UI
    Console.WriteLine($"Files processed: {value:#,0}");
});
var sourceFilePaths = Directory.EnumerateFiles(@"C:\SourceFolder", "*.log",
    SearchOption.AllDirectories); // Include subdirectories
await MergeFiles(sourceFilePaths, @"C:\AllLogs.log", cts.Token, progress);
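The CancellationTokenSource above makes it possible to stop the operation midway. As an illustration (the timeout value is arbitrary), cancellation could be requested like this, after which the awaited MergeFiles call completes with an OperationCanceledException:

cts.CancelAfter(TimeSpan.FromMinutes(5)); // Or call cts.Cancel() from an event handler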
The BoundedCapacity option is what keeps the memory usage under control. Without it, the reader could buffer file contents faster than the writer consumes them, and the memory usage would grow unbounded.
If the disk drive is an SSD, you can try reading with a MaxDegreeOfParallelism larger than 2.
For best performance, you could consider writing to a different disk drive than the one containing the source files.
The TPL Dataflow library is available as a NuGet package (System.Threading.Tasks.Dataflow) for .NET Framework, and is built-in for .NET Core.
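For reference, these are the namespaces used by the code in this answer; the Install-Package line applies only to .NET Framework projects:

// .NET Framework only: install the package first, e.g. from the Package Manager Console:
//   Install-Package System.Threading.Tasks.Dataflow
using System;
using System.Collections.Concurrent; // BlockingCollection (used in the simple sketch)
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // TransformBlock, ActionBlock, etc.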