
I have to read a huge text file (>200,000 words) and process each word. One approach reads the entire file into a string and then attaches a string stream to it so each word can be processed easily. The other reads each word directly from the file using >> and processes it. Comparing the two approaches gives me no advantage in execution time. Isn't it faster to operate on a string in memory than on a file, which needs a system call every time I want a word? Please suggest some performance-enhancing methods.
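For reference, a minimal sketch of the two approaches being compared, assuming a whitespace-delimited file; the file name and process() are hypothetical placeholders:

#include <fstream>
#include <iterator>
#include <sstream>
#include <string>

void process(const std::string& word);   // hypothetical per-word handler

// Approach 1: slurp the file into a string, then parse from memory.
void viaStringStream(const char* fname) {
    std::ifstream in(fname);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    std::istringstream ss(contents);
    std::string word;
    while (ss >> word)
        process(word);
}

// Approach 2: extract each word straight from the file stream.
void viaFileStream(const char* fname) {
    std::ifstream in(fname);
    std::string word;
    while (in >> word)
        process(word);
}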

  • The stream does some buffering internally. Fiddle with the buffer size to find the optimal fit; see the sketch after these comments. Commented Nov 4, 2010 at 17:52
  • See @Martin York's answer here for a 'stringstream from file' that minimizes copying - stackoverflow.com/questions/132358/… Commented Nov 4, 2010 at 17:58
  • @Martin: but he should probably also read the comments and (if I might be so bold) my answer at: stackoverflow.com/q/3303527/179910. To make a long story short, the method I've given below seems to be faster pretty consistently. I think most of the problem is that seeking to the end of the file takes more time than the internal copying it avoids. Commented Nov 4, 2010 at 18:18
  • Post some code. Are you sure it is the IO that is the bottleneck here? Have you profiled it? Commented Nov 4, 2010 at 18:21
  • Why is this tagged c? Anyway ... try opening the file in "binary" mode (fopen("hugefile", "rb") in C) so the runtime library does NOT do end-of-line, and possibly other, translations. Commented Nov 4, 2010 at 18:36
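A sketch of the buffer-size suggestion from the comments above. Note that std::filebuf honors pubsetbuf only if it is called before the file is opened, the effect and the optimal size are implementation-specific, and the file name here is a placeholder, so measure before committing to this:

#include <fstream>
#include <string>

int main() {
    static char buf[1 << 20];                 // 1 MiB buffer; tune and measure
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf, sizeof buf);   // must precede open() to take effect
    in.open("hugefile.txt");

    std::string word;
    while (in >> word) {
        // process(word);
    }
}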

4 Answers


For performance and minimal copying, this is hard to beat (as long as you have enough memory!):

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <sstream>

void mapped(const char* fname) {
    using namespace boost::interprocess;

    // Create a file mapping
    file_mapping m_file(fname, read_only);

    // Map the whole file with read permissions
    mapped_region region(m_file, read_only);

    // Get the address and size of the mapped region
    void* addr = region.get_address();
    std::size_t size = region.get_size();

    // Now you have the underlying data...
    char* data = static_cast<char*>(addr);

    std::stringstream localStream;
    localStream.rdbuf()->pubsetbuf(data, size);

    // now you can do your stuff with the stream
    // alternatively, parse data/size in place
}
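One caveat worth knowing: whether std::stringbuf honors pubsetbuf with a caller-supplied buffer is implementation-defined, so on some standard libraries this still copies. The data/size pair can always be scanned in place as a fallback.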



If you're going to put the data into a stringstream anyway, it's probably a bit faster and easier to copy directly from the input stream to the string stream:

std::ifstream infile("yourfile.txt");
std::stringstream buffer;
buffer << infile.rdbuf();

The ifstream uses a buffer internally, however, so while this is probably faster than reading into a string and then creating a stringstream from it, it may not be any faster than working directly from the input stream.
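Put together, the whole pattern looks roughly like this, with process() as a hypothetical stand-in for whatever is done per word:

#include <fstream>
#include <sstream>
#include <string>

int main() {
    std::ifstream infile("yourfile.txt");
    std::stringstream buffer;
    buffer << infile.rdbuf();   // one-shot copy of the whole file into memory

    std::string word;
    while (buffer >> word) {
        // process(word);
    }
}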

Comments

  • If there is other I/O going on in parallel with this processing, then a one-shot load of the data will be preferable, as the disk head does not get the same opportunity to move between buffer refreshes.
  • I think this combined with Martin's answer to the previous question will perform the best. I gave you +1 for best local info.
  • @ruslik - where is there mention of async I/O? How is a memory-mapped file different from loading the entire file contents using the STL, apart from the complexity and non-portability of the former?
  • @Steve: memory mapping can be considerably different if you don't have (close to) enough physical memory to hold the entire file. In such a case, loading the entire file basically ends up copying the data into memory, then copying it back out to the paging file, then reading it back in from the paging file when used. Memory mapping avoids all but the last of those. On Windows it typically helps only a little, but on Linux it can be extremely effective.
  • @Jerry - right, but 200K words is the problem as stated. 1-2 MB tops?

There is caching involved, so it does not necessarily make a system call each time you extract a word. That said, you may get marginally better parse-time performance from parsing a single contiguous buffer. On the other hand, you are serializing the workload (read the entire file, then parse it), which can potentially be parallelized (read and parse at the same time), as sketched below.
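A rough illustration of the overlap idea, assuming C++11 threads. The file name, chunk size, and word handling are placeholders, and words split across chunk boundaries would need extra care in a real implementation:

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

int main() {
    std::queue<std::string> chunks;   // chunks waiting to be parsed
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Reader thread: pull fixed-size chunks off disk while the main
    // thread parses the earlier ones.
    std::thread reader([&] {
        std::ifstream in("hugefile.txt", std::ios::binary);
        const std::size_t chunkSize = 1 << 20;   // 1 MiB per read
        std::string chunk(chunkSize, '\0');
        while (in.read(&chunk[0], chunkSize) || in.gcount() > 0) {
            chunk.resize(static_cast<std::size_t>(in.gcount()));
            {
                std::lock_guard<std::mutex> lock(m);
                chunks.push(chunk);
            }
            cv.notify_one();
            chunk.resize(chunkSize);
        }
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();
    });

    // Parser (main thread): pop chunks and scan them for words.
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !chunks.empty() || done; });
        if (chunks.empty() && done)
            break;
        std::string chunk = std::move(chunks.front());
        chunks.pop();
        lock.unlock();
        // ... parse the words in `chunk` here ...
    }
    reader.join();
}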



The string will get reallocated and copied an awful lot of times to accommodate 200,000 words. That's probably what is taking the time.

You should use a rope if you want to create a huge string by appending.
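For instance, libstdc++ ships SGI's rope as a non-standard extension; a minimal sketch (GCC-specific, so hedge accordingly):

#include <ext/rope>    // GNU extension; not part of standard C++
#include <iostream>

int main() {
    __gnu_cxx::crope r;                  // rope of char
    for (int i = 0; i < 200000; ++i)
        r.append("word ");               // no large reallocations on append
    std::cout << r.size() << '\n';
}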

