
So I have a large file with ~2 million lines. Reading the file is a bottleneck in my code. Any suggestions or expert opinion on how to read the file faster would be welcome. The order in which lines are read is unimportant. All lines are pipe-separated ('|') fixed-length records.

What have I tried? I started parallel StreamReaders and made sure the resource was locked properly, but this approach failed: I now had multiple threads fighting to get hold of the single StreamReader and wasting more time on locking, which slowed the code down further.

One intuitive approach is to split the file and then read the pieces, but I would like to leave the file intact and still somehow read it faster.

  • Are you sure the bottleneck is the file reader and not the disk IO? Commented Jul 11, 2014 at 1:04
  • Have you measured the performance? My opinion is that the StreamReader reads with little overhead. Maybe you are reading small chunks of data. Try to read large blocks and perform the line splitting in memory. Commented Jul 11, 2014 at 1:05
  • 2
    Agree with ^^^^, the multi threading would come into play processing the what is read, however you should have no problem reading the file with multiple readers (as long as you open it read only and shared)... Commented Jul 11, 2014 at 1:07
  • 1
    I don't think that parallelize file IO will help much. Most costly operation with disk IO is the disk head moving among tracks.... Commented Jul 11, 2014 at 1:07
  • @Mysticial: updated. It is 'file reading', not 'file reader'. EZI: I get your point... so will I have to settle for this performance? Commented Jul 11, 2014 at 1:09
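
Following up on the comments above, here is a minimal sketch of the pattern they describe: keep the read itself sequential (friendly to the disk head) and parallelize only the parsing of what was read. The file name "records.txt" and the field-counting body are placeholders, not details from the question:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class ParallelParse
{
    static void Main()
    {
        // File.ReadLines enumerates the file lazily, so a single sequential
        // read feeds the parallel workers below.
        long fieldTotal = 0;

        Parallel.ForEach(
            File.ReadLines("records.txt"),
            () => 0L,                                // per-thread running total
            (line, loopState, subtotal) =>
            {
                // Splitting the pipe-separated record is the parallel part;
                // counting fields stands in for whatever real work follows.
                string[] fields = line.Split('|');
                return subtotal + fields.Length;
            },
            subtotal => Interlocked.Add(ref fieldTotal, subtotal));

        Console.WriteLine($"Total fields parsed: {fieldTotal}");
    }
}
```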

2 Answers


I would try maximizing the buffer size. The default size is 1024, and increasing it should improve performance. I would suggest benchmarking a few different buffer sizes.

StreamReader(Stream, Encoding, Boolean, Int32) Initializes a new instance of the StreamReader class for the specified stream, with the specified character encoding, byte order mark detection option, and buffer size.
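
As a rough illustration, here is that constructor in use with a larger buffer. The file name and the 64 KB size are assumptions to benchmark against, not known optima:

```csharp
using System;
using System.IO;
using System.Text;

class BigBufferRead
{
    static void Main()
    {
        // 64 KB is just one candidate size; only a benchmark on the
        // real file can pick the best one.
        const int bufferSize = 1 << 16;

        // FileOptions.SequentialScan hints the OS to prefetch ahead,
        // which suits a single front-to-back pass over the file.
        using (var stream = new FileStream(
                   "records.txt", FileMode.Open, FileAccess.Read,
                   FileShare.Read, bufferSize, FileOptions.SequentialScan))
        using (var reader = new StreamReader(
                   stream, Encoding.UTF8, true, bufferSize))
        {
            long lineCount = 0;
            while (reader.ReadLine() != null)
                lineCount++;                 // replace with real parsing
            Console.WriteLine(lineCount);
        }
    }
}
```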


1 Comment

I suggest you then do a benchmark comparison between all your options: StreamReader.ReadLine, File.ReadLines, File.ReadAllLines, and String.Split. Your environment might yield different results, but for me the StreamReader was the fastest.
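
A bare-bones way to run such a comparison might look like the following; the file name is a placeholder, and a String.Split timing would be layered on top of whichever reader wins:

```csharp
using System;
using System.Diagnostics;
using System.IO;

class ReadBenchmark
{
    const string FilePath = "records.txt"; // placeholder file name

    static void Main()
    {
        Time("StreamReader.ReadLine", () =>
        {
            using (var reader = new StreamReader(FilePath))
                while (reader.ReadLine() != null) { }
        });

        Time("File.ReadLines", () =>
        {
            foreach (var line in File.ReadLines(FilePath)) { }
        });

        Time("File.ReadAllLines", () =>
        {
            // Loads the entire file into memory in one call.
            var lines = File.ReadAllLines(FilePath);
        });
    }

    static void Time(string label, Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
    }
}
```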

I understand now that my problem is not software-related; it is a 'mechanical' problem. Unless changes can be made to the hardware, there is no way to improve the reading performance. Why is that? There is only one head to read from the disk, so even if I try to read the file from both ends, for example, that same head now has to move even more to serve the two threads. Hence it is wiser to let the reader read sequentially, and that is the maximum performance achievable.

Thank you all for the explanations; they helped me understand this concept. It may be a very basic and straightforward point for most people here on Stack Overflow, but from this question I really learned something about file reading and hardware performance, and finally understood what I was taught in college.

