I am currently using `fstream` in C++ to load a 7.1 GB data file. The .txt file looks like this:

```
item1 2.87 4.64 ...
item2 5.89 9.24 ...
...   ...  ...  ...
```

It has 300000 rows and 201 columns (one column for the item name and 200 for its weights), and each cell holds a double. What I do now is this:
```cpp
ifstream click_log(R"(1.txt)", ifstream::in);
string line;
unordered_map<string, vector<double>> dict;

while (getline(click_log, line)) {
    istringstream record(line);
    string key;
    vector<double> weights;
    double weight;
    record >> key;
    while (record >> weight) {
        weights.push_back(weight);
    }
    dict[key] = weights;
}
```

However, it takes my computer (AMD 3700X, 8 cores) about 30 minutes to load the file completely. Is it slow because of its O(m*n) complexity, or simply because converting strings to doubles is slow? What is the most efficient way to load data from a .txt file?
Use `std::array` instead, since the size is well known (same thing for the dictionary). It would remove a lot of deallocation/reallocation/copies due to `std::vector` machinery. Then you should write directly into the target container instead of using temporary vectors (and doing a lot of extra copies); see the first sketch after these answers. But I have a question: does your file contain "+-----+" too, or did you add it for readability? It is very confusing because your code does not seem to handle that.

First, calling `weights.reserve(200)` before reading in the weights will be a significant improvement. This will have the additional benefit of the vector not being oversized. Secondly, using move semantics to add it to the map will help prevent an extra copy. Finally, reserving enough buckets on your `unordered_map` will help avoid re-hashing, which is also expensive (see the second sketch below). Consider using `std::map` as an alternative if you're not sure.

Parsing `double` values from strings is a non-trivial cost. If you know your data always conforms to a specific format (e.g. always decimal notation), consider rolling your own simplified `double` parser, if profiling shows it to be significant after making these other optimizations. And if you can get away with storing `float` instead of `double` (assuming you don't need high precision), this will almost halve your memory footprint.
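A minimal sketch of the first suggestion, assuming every row has exactly 200 weights (as stated in the question) and reusing the file name `1.txt`; the weights are written straight into a fixed-size `std::array` stored in the map, so no temporary vector is created:

```cpp
#include <array>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

int main() {
    std::ifstream click_log("1.txt");
    std::unordered_map<std::string, std::array<double, 200>> dict;
    dict.reserve(300000);  // the row count is known up front, so size the map once

    std::string line;
    while (std::getline(click_log, line)) {
        std::istringstream record(line);
        std::string key;
        record >> key;

        // operator[] default-constructs the array in place; fill it directly
        auto& weights = dict[std::move(key)];
        for (double& w : weights) {
            record >> w;
        }
    }
}
```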
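If you prefer to keep `std::vector`, here is a sketch of the second answer's suggestions (reserving the vector, moving into the map, and pre-sizing the hash table); the row and column counts are taken from the question:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::ifstream click_log("1.txt");
    std::unordered_map<std::string, std::vector<double>> dict;
    dict.reserve(300000);  // enough buckets up front avoids re-hashing

    std::string line;
    while (std::getline(click_log, line)) {
        std::istringstream record(line);
        std::string key;
        record >> key;

        std::vector<double> weights;
        weights.reserve(200);  // one allocation instead of repeated growth
        double weight;
        while (record >> weight) {
            weights.push_back(weight);
        }
        // move both the key and the vector into the map, avoiding extra copies
        dict.emplace(std::move(key), std::move(weights));
    }
}
```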
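For the parsing cost, one option that avoids writing a parser by hand is `std::from_chars` from `<charconv>`, which is locale-independent and typically much faster than stream extraction; note that its floating-point overloads need a fairly recent standard library (e.g. MSVC, or GCC 11 and later). A sketch, assuming space-separated values as in the question:

```cpp
#include <charconv>   // std::from_chars
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::ifstream click_log("1.txt");
    std::unordered_map<std::string, std::vector<double>> dict;
    dict.reserve(300000);

    std::string line;
    while (std::getline(click_log, line)) {
        const char* p = line.data();
        const char* end = p + line.size();

        // the key is everything up to the first space
        const char* key_end = p;
        while (key_end != end && *key_end != ' ') ++key_end;
        std::string key(p, key_end);
        p = key_end;

        std::vector<double> weights;
        weights.reserve(200);
        while (p != end) {
            while (p != end && *p == ' ') ++p;  // skip separators
            if (p == end) break;
            double value;
            auto [next, ec] = std::from_chars(p, end, value);
            if (ec != std::errc()) break;       // stop on malformed input
            weights.push_back(value);
            p = next;
        }
        dict.emplace(std::move(key), std::move(weights));
    }
}
```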