422

I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the .NET C# scope of things.

This is what I'm trying so far:

    var filestream = new System.IO.FileStream(textFilePath,
                                              System.IO.FileMode.Open,
                                              System.IO.FileAccess.Read,
                                              System.IO.FileShare.ReadWrite);
    var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);
    while ((lineOfText = file.ReadLine()) != null)
    {
        // Do something with the lineOfText
    }
  • By fastest do you mean from a performance or a development perspective? Commented Nov 7, 2011 at 13:26
  • This is going to lock the file for the duration of the method. You could use File.ReadAllLines into an array and then process the array. Commented Nov 7, 2011 at 13:27
  • BTW, enclose filestream = new FileStream in a using() statement to avoid possible annoying issues with a locked file handle. Commented Nov 7, 2011 at 13:28
  • Regarding enclosing the FileStream in a using() statement, see the recommended method on Stack Overflow: StackOverflow using statement filestream streamreader Commented Aug 31, 2013 at 17:58
  • I think ReadToEnd() is faster. Commented Sep 17, 2015 at 20:55

10 Answers

447

To find the fastest way to read a file line by line you will have to do some benchmarking. I have done some small tests on my computer but you cannot expect that my results apply to your environment.

Using StreamReader.ReadLine

This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.

    const Int32 BufferSize = 128;
    using (var fileStream = File.OpenRead(fileName))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
    {
        String line;
        while ((line = streamReader.ReadLine()) != null)
        {
            // Process line
        }
    }

The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.
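For illustration, here is a sketch of how the sequential-scan hint could be passed (the 4,096 buffer size is just one of the candidate values mentioned above, not a recommendation):

    // Sketch only: open the file with a hint that it will be read front to back.
    // 4096 is an example buffer size; benchmark to find what works for you.
    using (var fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, FileOptions.SequentialScan))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, 4096))
    {
        String line;
        while ((line = streamReader.ReadLine()) != null)
        {
            // Process line
        }
    }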

Using File.ReadLines

This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.

    var lines = File.ReadLines(fileName);
    foreach (var line in lines)
    {
        // Process line
    }

Using File.ReadAllLines

This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines, so the memory requirements are higher. However, it returns a String[] and not an IEnumerable<String>, allowing you to access the lines randomly.

    var lines = File.ReadAllLines(fileName);
    for (var i = 0; i < lines.Length; i += 1)
    {
        var line = lines[i];
        // Process line
    }

Using String.Split

This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines increasing the memory required compared to your solution.

    using (var streamReader = File.OpenText(fileName))
    {
        var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            // Process line
        }
    }

My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code but you should increase the buffer size.


6 Comments

Thanks for this - your inclusion of the buffer size parameter on the StreamReader's constructor was really helpful. I'm streaming from Amazon's S3 API, and using a matching buffer size speeds things up considerably in conjunction with ReadLine().
I don't understand. In theory, the vast majority of the time spent reading the file would be the seek time on disk and the overhead of manipulating streams, like what you'd do with File.ReadLines. File.ReadAllLines, on the other hand, is supposed to read everything in the file into memory in one go. How could it be worse in performance?
I can't say about speed performance, but one thing is certain: it is far worse on memory consumption. If you have to handle very large files (GB, for instance), this is very critical. Even more so if it means it has to swap memory. On the speed side, you could add that ReadAllLines needs to read ALL lines BEFORE returning the result, delaying processing. In some scenarios, the IMPRESSION of speed is more important than raw speed.
If you read the stream as byte arrays it will read the file 20%~80% faster (from the tests I did). What you need is to get the byte array and convert it to a string. That's how I did it: for reading, use stream.Read(). You can make a loop to read in chunks. After appending the whole content into a byte array (use System.Buffer.BlockCopy) you'll need to convert the bytes into a string: Encoding.Default.GetString(byteContent, 0, byteContent.Length - 1).Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
One issue with File.ReadLinesAsync is that it does not have an option to pass a callback after the lines are read, which makes it behave more or less like a synchronous call. If StreamReader can read line by line rather than loading the whole file into memory, that should be a viable option for large files.
221

If you're using .NET 4, simply use File.ReadLines which does it all for you. I suspect it's much the same as yours, except it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).
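For reference, a minimal sketch of what that looks like (using the textFilePath variable from the question):

    // File.ReadLines streams the file lazily, one line at a time.
    foreach (var line in File.ReadLines(textFilePath))
    {
        // Do something with the line
    }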

3 Comments

Another benefit of ReadLines() is that it's lazy, so it works well with LINQ.
Does File.ReadLines close the file after each iteration when used within a foreach loop?
@RBT: No - it closes the file when the iterator is disposed. (It's actually somewhat broken in some other ways, but that's a different matter, and only relevant if you try to use it multiple times.)
47

While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.

If you just want to read the lines in a file without doing much with them, then according to these benchmarks, the fastest way to read a file is the age-old method of:

    using (StreamReader sr = File.OpenText(fileName))
    {
        string s = String.Empty;
        while ((s = sr.ReadLine()) != null)
        {
            // do minimal amount of work here
        }
    }

However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read):

    string[] AllLines = new string[MAX]; // only allocate memory here
    using (StreamReader sr = File.OpenText(fileName))
    {
        int x = 0;
        while (!sr.EndOfStream)
        {
            AllLines[x] = sr.ReadLine();
            x += 1;
        }
    } // Finished. Close the file

    // Now process each line in the file in parallel
    Parallel.For(0, AllLines.Length, x =>
    {
        DoYourStuff(AllLines[x]); // do your work here
    });

Comments

19

Use the following code:

foreach (string line in File.ReadAllLines(fileName)) 

This made a HUGE difference in reading performance.

It comes at the cost of memory consumption, but it's totally worth it!

1 Comment

I would prefer File.ReadLines over File.ReadAllLines.
8

If the file size is not big, it is faster to read the entire file and split it afterwards:

    var lines = sr.ReadToEnd().Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);

4 Comments

@jgauffin I don't know the implementation behind File.ReadAllLines(), but I think it has a limited buffer while the ReadToEnd buffer should be larger, so the number of accesses to the file is decreased this way; and doing String.Split when the file size is not big is faster than accessing the file multiple times.
I doubt that File.ReadAllLines has a fixed buffer size, since the file size is known.
@jgauffin: In .NET 4.0 File.ReadAllLines creates a list and adds to this list in a loop using StreamReader.ReadLine (with potential reallocation of the underlying array). This method uses a default buffer size of 1024. The StreamReader.ReadToEnd avoids the line parsing part and the buffer size can be set in the constructor if desired.
It would be helpful to define "BIG" in regards to file size.
7

There's a good topic about this in Stack Overflow question Is 'yield return' slower than "old school" return?.

It says:

ReadAllLines loads all of the lines into memory and returns a string[]. All well and good if the file is small. If the file is larger than will fit in memory, you'll run out of memory.

ReadLines, on the other hand, uses yield return to return one line at a time. With it, you can read any size file. It doesn't load the whole file into memory.

Say you wanted to find the first line that contains the word "foo", and then exit. Using ReadAllLines, you'd have to read the entire file into memory, even if "foo" occurs on the first line. With ReadLines, you only read one line. Which one would be faster?
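To make that scenario concrete, here is a hedged sketch of the early-exit search (requires using System.Linq; "foo" is just the example word from the text):

    // ReadLines is lazy, so FirstOrDefault stops reading as soon as a match is found;
    // ReadAllLines would have pulled the entire file into a string[] first.
    var firstMatch = File.ReadLines(fileName).FirstOrDefault(line => line.Contains("foo"));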

Comments

2

If you have enough memory, I've found some performance gains by reading the entire file into a memory stream, and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvements.
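A rough sketch of that approach, assuming the whole file fits comfortably in memory:

    // Read the entire file into memory first, then read lines from the in-memory copy.
    using (var memoryStream = new MemoryStream(File.ReadAllBytes(fileName)))
    using (var reader = new StreamReader(memoryStream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Process line
        }
    }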

1 Comment

File.ReadAllLines seems to be a better choice then.
2

You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually finding each new line in the read buffer would probably be faster, as in the sketch below.
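A minimal sketch of that idea, assuming ASCII/UTF-8 input, \n line endings (with an optional \r), and lines that never exceed the buffer size; ProcessLines and onLine are hypothetical names used only for illustration:

    // Read large chunks into a byte buffer and locate line feeds manually.
    // Sketch only, not a tuned implementation.
    static void ProcessLines(string path, Action<byte[], int, int> onLine)
    {
        const byte LF = (byte)'\n';
        var buffer = new byte[1 << 20];   // 1 MiB chunk; lines longer than this are not handled
        int carried = 0;                  // bytes of an incomplete line carried over from the last chunk

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 1 << 16, FileOptions.SequentialScan))
        {
            int read;
            while ((read = stream.Read(buffer, carried, buffer.Length - carried)) > 0)
            {
                int valid = carried + read;
                int start = 0;
                int lf;
                while ((lf = Array.IndexOf(buffer, LF, start, valid - start)) >= 0)
                {
                    int length = lf - start;
                    if (length > 0 && buffer[lf - 1] == (byte)'\r')
                        length--;                      // trim the CR of a CRLF ending
                    onLine(buffer, start, length);
                    start = lf + 1;
                }
                carried = valid - start;
                Array.Copy(buffer, start, buffer, 0, carried);  // keep the unfinished tail
            }
            if (carried > 0)
                onLine(buffer, 0, carried);            // last line without a trailing line feed
        }
    }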

Comments

2

The StreamReader.ReadLine() implementation that ships with .NET has been surprisingly decent in recent years, and with current .NET versions it is likely the fastest thing you will ever need in most practical scenarios. It is not like the early days when you had to bring a home-brewed line reader to coding challenges just to avoid timing out while merely reading the inputs.

In fact, if you do stuff with the lines you read - take substrings, call string.Split() or int.Parse() and so on - then the time taken by StreamReader.ReadLine() often becomes negligible compared to the rest.

Having said that, StreamReader.ReadLine() does have an obvious Achilles heel and that is the requirement to allocate a string for each newly-read line. You can make things significantly faster by avoiding this allocation and processing the line in byte form directly in the buffer where the input routine has placed the raw bytes of a chunk of your input file.

It used to be that the fastest way of scanning lines was to place a sentinel line feed after the valid portion of the data buffer and do something along the lines of:

    for (var b = m_buffer; b[h] != ASC_LF; )
        ++h;

No need to check the current offset against an end offset or something like that, because if there is nothing to find in the valid portion of the buffer then the scan will simply run into the sentinel.

However, .NET, like Java, insists on doing a length check anyway (native code listing taken from the 'IL+Native' display of the amazingly amazing LINQPad):

    L0060 inc esi
    L0062 cmp esi, ecx
    L0064 jae short L00db
    L0066 mov edx, esi
    L0068 cmp byte ptr [rax+rdx+0x10], 0xa
    L006d jne short L0060

With modern processors this makes no odds because of branch prediction and superscalar execution, but 20 years ago things were a lot different.

The length check can be eliminated by invoking the dark side of the Force like so:

    unsafe
    {
        fixed (byte *buffer_h = &m_buffer[h])
        {
            var p = buffer_h;
            while (*p != ASC_LF)
                ++p;
            h += (int) (p - buffer_h);
        }
    }

This gives the following machine code for the loop proper:

    L006f inc rcx
    L0072 cmp byte ptr [rcx], 0xa
    L0075 jne short L006f

Lean and mean, but it doesn't make any difference because modern processors have been bred to wade through bloated junk with the same efficiency as if they were eating finely crafted hand-smithed code.

But we can go faster still. For a couple of years now, Array.IndexOf() has been slightly but consistently faster than the hand-rolled loop even for the .NET Framework, but on newer .NET versions it makes the line reading twice as fast because it can use vectorisation under the hood.

So nowadays the scan for line feeds looks somewhat like this (no sentinels needed anymore, yay):

h = Array.IndexOf(m_buffer, ASC_LF, h, unread_bytes_left); 

I've used a little line reader class based on these principles for over ten years now, especially for parsing huge logs (up to several gigabytes per file).

Here's a timing comparison for a 512 MiB HTTP log file that is my current reference for performance measurements. It resides on an SSD and it contains full HTTP messages as captured on the wire; I'm using it as reference because it represents the toughest production use case for my parsing code, but it has been cut down somewhat so as not to try my patience too much.

    3526760 lines   536870912 bytes   106.7 ms   33041 lines/ms   via ZU.LineReader.BufferNextLine()
    3526760 lines   536870912 bytes   111.9 ms   31531 lines/ms   via ZU.LineReader.BufferNextLine()
    3526760 lines   536870912 bytes   111.9 ms   31524 lines/ms   via ZU.LineReader.BufferNextLine()
    3526760 lines           - bytes   411.5 ms    8570 lines/ms   via StreamReader.ReadLine()
    3526760 lines           - bytes   380.6 ms    9264 lines/ms   via StreamReader.ReadLine()
    3526760 lines           - bytes   388.1 ms    9086 lines/ms   via StreamReader.ReadLine()

Note: there are no byte counts for StreamReader.ReadLine() because it delivers text, not bytes.

The line reader used to be 10 times as fast as StreamReader.ReadLine(), now it is not even 4 times. Convert the line bytes to a string and suddenly the whole shebang is hardly any faster than StreamReader.ReadLine(). That should give you an idea how fast the stock StreamReader.ReadLine() actually is!

To be significantly faster than code based on StreamReader.ReadLine() you have to process data in the byte domain, not as text, and you have to avoid or minimise allocations (e.g. ValueTask instead of Task for async code). For coding challenges, consider hand-rolled number-to-text or text-to-number conversions that process multiple digits at a time, and - in the case of number-to-text - placing converted digits directly in the output buffer at the proper position instead of converting into a separate buffer and then copying stuff around.

Part of my test/monitoring code is a little class for allocation-free async parsing of HTTP messages (HTTP/1.0 and HTTP/1.1), which is based on exactly the same principles as the line reader. When fed our typical API traffic from a MemoryStream instead of a network stream, it buffers lines to the tune of 140000 per millisecond per core and parses HTTP messages at a rate of 3000 per millisecond per core on my laptop¹.

So, you can get a lot faster than StreamReader.ReadLine() if you need to, but it is not exactly easy. For an idea of how involved it gets to push the boundaries, have a look at the Kestrel source code. You can study that amazing thing for months and years and still discover new tricks. Kudos.

¹) in this bench the data comes from the processor's L2 and L3 caches, so it is not representative for real-world usage; I'm using it for finding and eliminating unnecessary slow-downs in the code

Comments

-2

When you need to efficiently read and process a HUGE text file, ReadLines() and ReadAllLines() are likely to throw an OutOfMemoryException; that was my case. On the other hand, reading each line separately would take ages. The solution was to read the file in blocks, as shown below.

The class:

    // can return empty lines sometimes
    class LinePortionTextReader
    {
        private const int BUFFER_SIZE = 100000000; // 100M characters

        StreamReader sr = null;
        string remainder = "";

        public LinePortionTextReader(string filePath)
        {
            if (File.Exists(filePath))
            {
                sr = new StreamReader(filePath);
                remainder = "";
            }
        }

        ~LinePortionTextReader()
        {
            if (null != sr)
            {
                sr.Close();
            }
        }

        public string[] ReadBlock()
        {
            if (null == sr)
            {
                return new string[] { };
            }
            char[] buffer = new char[BUFFER_SIZE];
            int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
            if (charactersRead < 1)
            {
                return new string[] { };
            }
            bool lastPart = (charactersRead < BUFFER_SIZE);
            if (lastPart)
            {
                char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
                buffer = buffer2;
            }
            string s = new string(buffer);
            string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
            sresult[0] = remainder + sresult[0];
            if (!lastPart)
            {
                remainder = sresult[sresult.Length - 1];
                sresult[sresult.Length - 1] = "";
            }
            return sresult;
        }

        public bool EOS
        {
            get
            {
                return (null == sr) ? true : sr.EndOfStream;
            }
        }
    }

Example of use:

    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 3)
            {
                Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
                return;
            }
            if (!File.Exists(args[0]))
            {
                Console.WriteLine("source file not found");
                return;
            }
            if (!File.Exists(args[1]))
            {
                Console.WriteLine("reference file not found");
                return;
            }

            TextWriter tw = new StreamWriter(args[2], false);
            string[] refLines = File.ReadAllLines(args[1]);
            LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
            int blockCounter = 0;
            while (!lptr.EOS)
            {
                string[] srcLines = lptr.ReadBlock();
                for (int i = 0; i < srcLines.Length; i += 1)
                {
                    string theLine = srcLines[i];
                    if (!string.IsNullOrEmpty(theLine)) // can return empty lines sometimes
                    {
                        for (int j = 0; j < refLines.Length; j += 1)
                        {
                            if (theLine.Contains(refLines[j]))
                            {
                                tw.WriteLine(theLine);
                                break;
                            }
                        }
                    }
                }
                blockCounter += 1;
                Console.WriteLine(String.Format("100 Mb blocks processed: {0}", blockCounter));
            }
            tw.Close();
        }
    }

I believe the string splitting and array handling can be significantly improved, yet the goal here was to minimize the number of disk reads.

Comments
