
I have a primary file with millions of lines. While reading each line from that file, I need to find the corresponding line in another file that has far fewer lines (only several thousand) in order to make some decision. Currently I read the second file into a vector at the start, and then for each line of the primary file I iterate over the vector looking for the line. The problem is that the running time is quite long. Is there an efficient way to perform this task and keep the running time reasonable?
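For reference, a minimal sketch of the approach described above; the file names are hypothetical:

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main() {
    // Load the small secondary file (several thousand lines) into a vector.
    std::vector<std::string> small;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        small.push_back(line);

    // For every line of the huge primary file, scan the whole vector.
    // This linear scan per line is the O(n*m) cost that dominates the run time.
    std::ifstream prim("primary.txt");
    for (std::string line; std::getline(prim, line); ) {
        if (std::find(small.begin(), small.end(), line) != small.end()) {
            // ... make some decision ...
        }
    }
}
```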

  • Is there some unique property you can ascribe to each line? Some hash? Some function of the data you can use to reduce the search space? Then load the smaller dataset into memory for processing? Think algorithmically, and think economically, because at the moment you have an O(n*m) search over raw data, and this is not efficient. Commented Oct 14, 2013 at 10:42
  • The smaller dataset is in memory as a string vector. Each line in the dataset has four columns. The values in each column can be redundant, but the combination is unique. Commented Oct 14, 2013 at 11:28
  • You'd probably win a ton just by comparing string lengths. If they're unequal, the strings are unequal, and if they're equal you know exactly how many characters you should compare. Commented Oct 14, 2013 at 16:04
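One way to act on the length observation in the last comment is to bucket the smaller file's lines by length, so each probe only compares against candidates of the same size. A rough sketch of that idea (the names `byLength` and `contains` are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <map>
#include <string>
#include <vector>

// Lines from the secondary file, grouped by their length.
std::map<std::size_t, std::vector<std::string>> byLength;

void load(std::istream& sec) {
    for (std::string line; std::getline(sec, line); )
        byLength[line.size()].push_back(line);
}

bool contains(const std::string& line) {
    auto it = byLength.find(line.size());
    if (it == byLength.end())
        return false;                      // no secondary line has this length
    const std::vector<std::string>& bucket = it->second;
    return std::find(bucket.begin(), bucket.end(), line) != bucket.end();
}
```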

3 Answers


You should read the second file into a std::map<std::string,int>. The map key would be the line itself, and the value the number of times that line was encountered in the second file.

This way, the time to check whether a given line from the first file can be found in the second is constant, and the overall run time should be limited only by the speed of the disk drive reading the huge primary file.
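A minimal sketch of this answer's idea; the file names are hypothetical:

```cpp
#include <fstream>
#include <map>
#include <string>

int main() {
    // Count how many times each line occurs in the secondary file.
    std::map<std::string, int> counts;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        ++counts[line];

    // Each lookup while streaming the primary file is now a single
    // tree search instead of a full scan of a vector.
    std::ifstream prim("primary.txt");
    for (std::string line; std::getline(prim, line); ) {
        auto it = counts.find(line);
        if (it != counts.end()) {
            // line occurs it->second times in the secondary file
        }
    }
}
```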


6 Comments

Be careful: maps have logarithmic complexity for search, insert, etc. Still, that is much better than the linear complexity of scanning a vector.
The question stated that the second file has only several thousand lines - nothing to worry about.
@mvp: It may very well be "something to worry about"; you know nothing of the target platform. It could be a washing machine.
@mvp, I don't understand what you mean by using the number of times a line was encountered in the second file as the value. I was thinking of std::map<int, std::string>, where the key is the line number and the value is the line itself.
@LightnessRacesinOrbit: what kind of washing machine reads MILLIONS of lines from an input text file?

You can try replacing the second (smaller) vector with a std::set.
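A minimal sketch, with hypothetical file names:

```cpp
#include <fstream>
#include <set>
#include <string>

int main() {
    // Read the secondary file into a set instead of a vector.
    std::set<std::string> lines;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        lines.insert(line);

    // Membership tests are now O(log m) rather than O(m).
    std::ifstream prim("primary.txt");
    for (std::string line; std::getline(prim, line); ) {
        if (lines.count(line)) {
            // ... make some decision ...
        }
    }
}
```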

1 Comment

An unordered_set would be an even better choice.
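Assuming C++11 is available, that would be a one-line change to the sketch above, trading O(log m) tree lookups for average O(1) hash lookups:

```cpp
#include <unordered_set>

std::unordered_set<std::string> lines;   // drop-in replacement for std::set
```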

You have an inner loop that compares the current line of the primary file against lines of the secondary file. If you take some stack samples, you will probably find the program in that inner loop most of the time.

You might consider this technique: preprocess your secondary file into a special-purpose lookup procedure, then compile and link that procedure into your main program. The preprocessing cost is the time to read the secondary file, a second or two to write out the special-purpose procedure, and then the time to compile and link the whole thing.

After that, your main program should be I/O-bound on reading the primary file, since the inner loop will be much faster.
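As a rough illustration of the generate-and-compile idea: the generator below reads the secondary file and emits a translation unit containing a sorted string table plus a binary-search lookup function that the main program links against. The file names and the `in_secondary` function are hypothetical, and the sketch assumes the lines contain no characters that would need escaping inside a C++ string literal:

```cpp
// generate_lookup.cpp -- run once per secondary file; emits lookup.cpp,
// which is then compiled and linked with the main program.
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lines;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        lines.push_back(line);

    // Sort and deduplicate so the generated table supports binary search.
    std::sort(lines.begin(), lines.end());
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());

    // Emit the lookup translation unit (assumes a non-empty secondary file).
    std::ofstream out("lookup.cpp");
    out << "#include <algorithm>\n#include <iterator>\n#include <string>\n\n";
    out << "static const std::string kLines[] = {\n";
    for (const std::string& s : lines)
        out << "    \"" << s << "\",\n";
    out << "};\n\n";
    out << "bool in_secondary(const std::string& s) {\n";
    out << "    return std::binary_search(std::begin(kLines), std::end(kLines), s);\n";
    out << "}\n";
}
```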

