I have to read in a huge text file (>200,000 words) and process each word. My first approach reads the entire file into a string and then attaches a string stream to it, so that each word is easy to process. My other approach inputs each word directly from the file using >> and processes it. Comparing the two approaches shows no advantage for either in terms of execution time. Isn't it faster to operate on a string in memory than on a file, which needs a system call every time I need a word? Please suggest some performance-enhancing methods.
4 Answers
For performance and minimal copying, this is hard to beat (as long as you have enough memory!):
    #include <boost/interprocess/file_mapping.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstddef>
    #include <sstream>

    void mapped(const char* fname)
    {
        using namespace boost::interprocess;

        // Create a file mapping
        file_mapping m_file(fname, read_only);

        // Map the whole file with read permissions
        mapped_region region(m_file, read_only);

        // Get the address of the mapped region
        void* addr = region.get_address();
        std::size_t size = region.get_size();

        // Now you have the underlying data...
        char* data = static_cast<char*>(addr);

        // Note: the effect of pubsetbuf() on a stringstream is
        // implementation-defined, so check that your library honors it
        std::stringstream localStream;
        localStream.rdbuf()->pubsetbuf(data, size);

        // now you can do your stuff with the stream
        // alternatively
    }
If you're going to put the data into a stringstream anyway, it's probably a bit faster and easier to copy directly from the input stream to the string stream:
    std::ifstream infile("yourfile.txt");
    std::stringstream buffer;
    buffer << infile.rdbuf();

The ifstream will use a buffer, however, so while that's probably faster than reading into a string, then creating a stringstream, it may not be any faster than working directly from the input stream.
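To compare the two approaches on equal terms, it may help to write the word loop once against std::istream, so the same code runs on either source. The following is only a sketch: count_words and count_words_buffered are hypothetical helper names, and the counter stands in for whatever per-word processing is actually needed.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts whitespace-separated words from any
// input stream (an ifstream or a stringstream both work).
std::size_t count_words(std::istream& in) {
    std::string word;
    std::size_t n = 0;
    while (in >> word) {
        ++n;  // replace with the real per-word processing
    }
    return n;
}

// Hypothetical helper: copies the whole source into a stringstream
// first (the `buffer << infile.rdbuf()` idiom), then runs the same loop.
std::size_t count_words_buffered(std::istream& in) {
    std::stringstream buffer;
    buffer << in.rdbuf();
    return count_words(buffer);
}
```

Passing a std::ifstream to each function and timing the two calls should show whether the extra copy pays off on a given platform.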
There is caching involved, so it does not necessarily do a system call each time you extract. Having said that, you may get marginally better performance at parse time by parsing a single contiguous buffer. On the other hand, you are serializing the workload (read entire file, then parse), which can potentially be parallelized (read and parse in parallel).
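As an illustration of parsing a single contiguous buffer, the sketch below walks a block of text that is already in memory and hands out each word as a std::string_view, so no per-word copy is made. It assumes C++17 and whitespace-separated words; for_each_word is a hypothetical name, not something from the answers above.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper: calls `fn` once per whitespace-separated word
// in `text`. Each word is a view into `text`, not a copy.
template <typename Fn>
void for_each_word(std::string_view text, Fn fn) {
    std::size_t i = 0;
    const std::size_t n = text.size();
    while (i < n) {
        // Skip whitespace before the next word.
        while (i < n && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
        const std::size_t start = i;
        // Advance to the end of the word.
        while (i < n && !std::isspace(static_cast<unsigned char>(text[i]))) ++i;
        if (i > start) fn(text.substr(start, i - start));
    }
}
```

The views stay valid only as long as the underlying buffer does, so a word that must outlive the buffer has to be copied into a std::string.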
C? Anyway, try opening the file in "binary" mode (fopen("hugefile", "rb") in C) so that the runtime library does NOT perform end-of-line, and possibly other, translations.
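For C++ streams, the counterpart of "rb" is the std::ios::binary open flag. A minimal sketch, assuming the whole file fits in memory; read_file_binary is a hypothetical helper name:

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads a whole file into a string with text-mode
// translation disabled. Returns an empty string if the file cannot be opened.
std::string read_file_binary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

On platforms that do not translate line endings in text mode, the flag changes nothing, so any gain is platform-dependent.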