I need to find the union of many (about 1M) 64-bit integer vectors, each has about 10K integers. There are duplicates in each array and also duplicates among different arrays.
A simple algorithm is to create a hash table (std::unordered_set) and insert all those integers into it, then dump out integers of the hash table. The problem is, after many tries/tests, I cannot find a faster algorithm than this simple one.
The vectors are input in batches, and the system memory isn't large enough to hold all of them.
Also output some non-existing integers is acceptable, but missing a required integer isn't. About 10% false positives would be acceptable.
std::vector<int64_t>s with 10000 entries each, then you may not have memory to hold the union of those 1000000 vectors either since each 10000 vector may be unique - or how do you guarantee that most of the 1000000 vectors are duplicates?