
I have a primary file with millions of lines. While reading each line from that file, I need to find the corresponding line in another file that has far fewer lines (only several thousand) in order to make some decision. Currently I read the second file into a vector at the start, and then for each line of the primary file I iterate over the vector looking for the line. The problem is that the running time is quite long. Is there an efficient way to perform this task and keep the running time reasonable?
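For reference, a minimal sketch of the approach described above; the file names are hypothetical:

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main() {
    // Load the small secondary file (several thousand lines) into a vector.
    std::vector<std::string> small;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        small.push_back(line);

    // For every line of the huge primary file, scan the whole vector.
    // This linear scan per line is the O(n*m) cost that dominates the run time.
    std::ifstream prim("primary.txt");
    for (std::string line; std::getline(prim, line); ) {
        if (std::find(small.begin(), small.end(), line) != small.end()) {
            // ... make some decision ...
        }
    }
}
```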

  • Is there some unique property you can ascribe to each line? Some hash? Some function of the data you can use to reduce the search space? Then load the smaller dataset into memory for processing? Think algorithmically, and think economically, because at the moment you have an O(n*m) search over raw data, and this is not efficient. Commented Oct 14, 2013 at 10:42
  • The smaller dataset is in memory as a string vector. Each line in the dataset has four columns. The values in each column can be redundant, but the combination is unique. Commented Oct 14, 2013 at 11:28
  • You'd probably win a ton just by comparing string lengths. If they're unequal, the strings are unequal, and if they're equal you know exactly how many characters you should compare. Commented Oct 14, 2013 at 16:04
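One way to act on the length observation in the last comment is to bucket the smaller file's lines by length, so each probe only compares against candidates of the same size. A rough sketch of that idea (the names `byLength` and `contains` are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <map>
#include <string>
#include <vector>

// Lines from the secondary file, grouped by their length.
std::map<std::size_t, std::vector<std::string>> byLength;

void load(std::istream& sec) {
    for (std::string line; std::getline(sec, line); )
        byLength[line.size()].push_back(line);
}

bool contains(const std::string& line) {
    auto it = byLength.find(line.size());
    if (it == byLength.end())
        return false;                      // no secondary line has this length
    const std::vector<std::string>& bucket = it->second;
    return std::find(bucket.begin(), bucket.end(), line) != bucket.end();
}
```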

3 Answers


You should read the second file into a std::map<std::string,int>. The map key would be the line itself, and the value the number of times that line was encountered in the second file.

This way, the time to check whether a given line from the first file can be found in the second is constant, and the overall run time should be limited only by the speed of the disk drive reading the huge primary file.
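A minimal sketch of this answer's idea; the file names are hypothetical:

```cpp
#include <fstream>
#include <map>
#include <string>

int main() {
    // Count how many times each line occurs in the secondary file.
    std::map<std::string, int> counts;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        ++counts[line];

    // Each lookup while streaming the primary file is now a single
    // tree search instead of a full scan of a vector.
    std::ifstream prim("primary.txt");
    for (std::string line; std::getline(prim, line); ) {
        auto it = counts.find(line);
        if (it != counts.end()) {
            // line occurs it->second times in the secondary file
        }
    }
}
```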


6 Comments

Be careful: maps have logarithmic complexity for search, insert, etc. Still, that is much better than the linear complexity of scanning a vector.
The question stated that the second file has only several thousand lines - nothing to worry about.
@mvp: It may very well be "something to worry about"; you know nothing of the target platform. It could be a washing machine.
@mvp, I don't understand what you mean by using the number of times a line was encountered in the second file as the value. I was thinking of std::map<int, std::string>, where the key is the line number and the value is the line itself.
@LightnessRacesinOrbit: what kind of washing machine reads MILLIONS of lines from an input text file?

You can try replacing the second (smaller) vector with a std::set.
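A minimal sketch, with hypothetical file names:

```cpp
#include <fstream>
#include <set>
#include <string>

int main() {
    // Read the secondary file into a set instead of a vector.
    std::set<std::string> lines;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        lines.insert(line);

    // Membership tests are now O(log m) rather than O(m).
    std::ifstream prim("primary.txt");
    for (std::string line; std::getline(prim, line); ) {
        if (lines.count(line)) {
            // ... make some decision ...
        }
    }
}
```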

1 Comment

An unordered_set would be an even better choice.
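Assuming C++11 is available, that would be a one-line change to the sketch above, trading O(log m) tree lookups for average O(1) hash lookups:

```cpp
#include <unordered_set>

std::unordered_set<std::string> lines;   // drop-in replacement for std::set
```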

You have an inner loop that compares the current line of the primary file against lines of the secondary file. If you take some stack samples, you will probably find the program in that inner loop most of the time.

You might consider this technique: preprocess your secondary file into a special-purpose lookup procedure, then compile and link that procedure into your main program. The preprocessing cost is the time to read the secondary file, a second or two to write out the special-purpose procedure, and then the time to compile and link the whole thing.

After that, your main program should be I/O-bound on reading the primary file, since the inner loop will be much faster.
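As a rough illustration of the generate-and-compile idea: the generator below reads the secondary file and emits a translation unit containing a sorted string table plus a binary-search lookup function that the main program links against. The file names and the `in_secondary` function are hypothetical, and the sketch assumes the lines contain no characters that would need escaping inside a C++ string literal:

```cpp
// generate_lookup.cpp -- run once per secondary file; emits lookup.cpp,
// which is then compiled and linked with the main program.
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lines;
    std::ifstream sec("secondary.txt");
    for (std::string line; std::getline(sec, line); )
        lines.push_back(line);

    // Sort and deduplicate so the generated table supports binary search.
    std::sort(lines.begin(), lines.end());
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());

    // Emit the lookup translation unit (assumes a non-empty secondary file).
    std::ofstream out("lookup.cpp");
    out << "#include <algorithm>\n#include <iterator>\n#include <string>\n\n";
    out << "static const std::string kLines[] = {\n";
    for (const std::string& s : lines)
        out << "    \"" << s << "\",\n";
    out << "};\n\n";
    out << "bool in_secondary(const std::string& s) {\n";
    out << "    return std::binary_search(std::begin(kLines), std::end(kLines), s);\n";
    out << "}\n";
}
```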

