
I am currently using fstream in C++ to load a 7.1 GB data file. The .txt file looks like this:

    item1 2.87 4.64 ...
    item2 5.89 9.24 ...
    ...   ...  ...  ...

It has 300000 rows and 201 columns (1 column for the item name and 200 for its weights), and each cell holds a double. What I do now looks like this:

    ifstream click_log(R"(1.txt)", ifstream::in);
    string line;
    unordered_map<string, vector<double>> dict;
    while (getline(click_log, line)) {
        istringstream record(line);
        string key;
        vector<double> weights;
        double weight;
        record >> key;
        while (record >> weight) {
            weights.push_back(weight);
        }
        dict[key] = weights;
    }

However, it takes my computer (AMD 3700X, 8 cores) about 30 minutes to load the file completely. Is it slow because of its O(m*n) complexity, or simply because converting strings to doubles is slow? What is the most efficient way to load data from a .txt file?

  • Most C++ toolchains also include a profiling tool; try to find out where most of the time is actually spent. I suspect that a dictionary isn't the best data structure to use in this case. How about an old-fashioned linear linked list (instead of an array with a pre-allocated size)? Also, how long does it take if you just read all lines from the file and discard them? Commented Jul 21, 2020 at 6:30
  • First thing, you should pre-allocate your vectors or use std::array, since the size is well known (and the same goes for the dictionary). That would remove a lot of deallocations/reallocations/copies caused by std::vector's machinery. Then write directly into the target container instead of going through temporary vectors, which costs a lot of extra copies. One question, though: does your file really contain "+-----+" too, or did you add it for readability? It is confusing, because your code does not seem to handle it. Commented Jul 21, 2020 at 7:19
  • Without a doubt, calling weights.reserve(200) before reading in the weights will be a significant improvement, with the added benefit that the vector won't be oversized. Secondly, using move semantics to add it to the map avoids an extra copy. Finally, reserving enough buckets in your unordered_map avoids re-hashing, which is also expensive (a sketch of these three points follows the comments). Consider std::map as an alternative if you're not sure. Commented Jul 21, 2020 at 7:56
  • Note also that parsing double values from strings is a non-trivial cost in itself. If you know your data always conforms to a specific format (e.g. always decimal notation), consider rolling your own simplified double parser, if profiling shows parsing is still significant after the other optimizations. And if you can get away with storing float instead of double (assuming you don't need the precision), you almost halve your memory footprint. Commented Jul 21, 2020 at 7:59
  • Another consideration: you are doing this entirely single-threaded, waiting on a blocking I/O operation before processing each line. You should be able to divide the work among all CPU threads (std::thread::hardware_concurrency() reports how many there are, which is not always the same as the number of cores), and modern storage devices can serve multiple reads in parallel, so this is worth looking into imo. Commented Jul 21, 2020 at 8:09
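
To make these suggestions concrete, here is a minimal sketch that applies reserve() to the vectors and the map and moves each row into place, keeping the question's file name (1.txt) and row/column counts; it is a starting point for profiling, not a measured fix:

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
        std::ifstream click_log("1.txt"); // file name from the question
        std::unordered_map<std::string, std::vector<double>> dict;
        dict.reserve(300000); // row count from the question; avoids re-hashing

        std::string line;
        while (std::getline(click_log, line)) {
            std::istringstream record(line);
            std::string key;
            record >> key;

            std::vector<double> weights;
            weights.reserve(200); // 200 weights per row, so one allocation suffices
            double weight;
            while (record >> weight) {
                weights.push_back(weight);
            }
            dict[key] = std::move(weights); // move, don't copy, the 200 doubles
        }
    }

Moving leaves weights empty, which is fine here because a fresh vector is constructed on the next iteration anyway.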

1 Answer


You should not recreate your variables at each loop iteration. Create them once and for all, then you can reassign them when needed.

If you want to use std::vector instead of std::array<double, 200>, then you should reserve(200) all of your vectors in order to avoid a lot of reallocations/copies/deallocations due to std::vector's machinery.

You can do the same for your std::unordered_map.

Finally, write your data directly into the target container; you don't need that many temporaries (this removes the huge overhead caused by all the unnecessary copies).

I have rewritten your code taking these guidelines into account; I'd bet it improves your performance:

    #include <array>
    #include <cstddef>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::ifstream ifs("..\\tests\\data\\some_data.txt"); // Replace with your file
        if (!ifs)
            return -1;

        std::unordered_map<std::string, std::array<double, 200>> dict;
        dict.reserve(300000);

        std::string line;
        std::string key;
        double weight;
        std::size_t i;

        while (std::getline(ifs, line)) {
            std::istringstream record(line);
            i = 0;
            record >> key;
            std::array<double, 200>& row = dict[key]; // look the key up once per line, not once per weight
            while (record >> weight) {
                row.at(i++) = weight;
            }
        }

        ifs.close(); // The whole file is loaded
        return 0;
    }

Of course, I don't claim this to be the most efficient way to do it; I'm sure there are further improvements I haven't thought of at this very moment.

Anyway, keep in mind that you will probably still be bottlenecked by hard-drive access and I/O operations in general.
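
Following up on the comment about parsing cost: one further option is to drop istringstream entirely and parse each line with std::from_chars from <charconv> (C++17; the floating-point overloads need a recent standard library, e.g. GCC 11+ or recent MSVC). The sketch below assumes the question's layout of one key followed by 200 space-separated doubles per line; it is untested against the actual file, so treat it as a direction rather than a drop-in replacement:

    #include <array>
    #include <charconv>
    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <system_error>
    #include <unordered_map>

    int main() {
        std::ifstream ifs("1.txt"); // file name from the question
        if (!ifs)
            return -1;

        std::unordered_map<std::string, std::array<double, 200>> dict;
        dict.reserve(300000);

        std::string line;
        while (std::getline(ifs, line)) {
            const char* p = line.data();
            const char* const end = line.data() + line.size();

            // The key is everything up to the first space.
            const char* key_end = p;
            while (key_end != end && *key_end != ' ')
                ++key_end;
            std::array<double, 200>& row = dict[std::string(p, key_end)];
            p = key_end;

            // std::from_chars avoids the locale machinery that makes
            // istringstream's operator>> comparatively slow.
            for (std::size_t i = 0; i < row.size(); ++i) {
                while (p != end && *p == ' ')
                    ++p; // skip separators
                auto [next, ec] = std::from_chars(p, end, row[i]);
                if (ec != std::errc{})
                    break; // short or malformed row: stop parsing it
                p = next;
            }
        }
    }

Going further, reading the whole file into one buffer (or memory-mapping it) and walking it with from_chars would cut the per-line overhead too, at the cost of holding several gigabytes of text in memory at once.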

