
I need to find the union of many (about 1M) 64-bit integer vectors, each with about 10K integers. There are duplicates within each vector and also duplicates across different vectors.

A simple algorithm is to create a hash table (std::unordered_set), insert all those integers into it, then dump out the contents of the hash table. The problem is that, after many tries/tests, I cannot find a faster algorithm than this simple one.
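For reference, a minimal sketch of this baseline (the function name and in-memory batch representation are illustrative; the real input arrives in batches from elsewhere):

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Baseline: insert every element of every batch into one hash set,
// then dump the set's contents as the union.
std::vector<uint64_t> union_of(const std::vector<std::vector<uint64_t>>& batches) {
    std::unordered_set<uint64_t> seen;
    for (const auto& batch : batches)
        seen.insert(batch.begin(), batch.end());
    return {seen.begin(), seen.end()};
}
```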

The vectors are input in batches, and the system memory isn't large enough to hold all of them.

Outputting some non-existent integers is acceptable, but missing a required integer isn't. About 10% false positives would be acceptable.

  • Have you considered parallelizing? With a custom reduction operator, OpenMP should handle that with minimal effort. And of course there are faster hash tables than the standard library's. Commented Jul 3 at 7:13
  • IMO the best way to ask this question is to provide a unit test which should pass. That way there is no ambiguity in understanding the problem. Providing a performance measurement test is also recommended - you have written "The problem is, after many tries/tests, I cannot find a faster algorithm than this simple one" - so where are those performance measurement tests, and where is your implementation of this "simple algorithm"? Commented Jul 3 at 8:38
  • @user416983 every time you provide new information in response to comments, please edit your question to update it, so other users do not have to scan all the comments. Commented Jul 3 at 8:45
  • If you don't have memory to hold 1000000 std::vector<int64_t>s with 10000 entries each, then you may not have memory to hold the union of those 1000000 vectors either, since each 10000-entry vector may be unique - or how do you guarantee that most of the values across the 1000000 vectors are duplicates? Commented Jul 3 at 8:58
  • Voting to close because OP's original question omitted important information that hasn't been edited into the revised question, and the discussion has grown so long that it's now become hard to follow. OP would do well to provide an MWE characteristic of their problem and as much information about the data ranges, volumes, and characteristics as possible. Can false positives be tolerated? How many unique values are there? Are the values constrained to a particular range? What other methods has OP tried or considered - why were they not sufficient? Commented Jul 3 at 17:20

1 Answer


Your hash table algorithm is efficient, but your problem seems to be that it doesn't fit in system memory. You are producing a set of up to about 10 billion 64-bit ints - 80GB for the raw values alone, and around 200GB of RAM once hash-table overhead is included.

This is a problem for Map-Reduce:

  1. Create about 1024 empty buckets;

  2. For each input integer, compute a hash value in the range 0-1023, and put the integer in the bucket corresponding to its hash value;

  3. For each bucket, use your hash table algorithm to find all the unique integers in that bucket;

  4. Combine the outputs from each bucket into one. There will be no duplicates, since each distinct integer hashes into exactly one bucket.

The bucket inputs and outputs can just be files on disk. In step (2) you append to each file sequentially, so you don't need their contents in RAM. In step (3) you need to load a bucket into RAM, but only one bucket at a time. In step (4) you read from the buckets sequentially, so again you don't need them all in RAM.
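The scatter and dedupe steps can be sketched as follows. This is a minimal in-memory illustration - in practice each bucket would be an append-only file on disk, and the bucket count, function names, and hash choice are all assumptions, not part of the answer:

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

constexpr size_t kBuckets = 1024;

// Step 2: route each integer to one bucket by its hash value.
// (In the real version, push_back would be an append to the bucket's file.)
void scatter(const std::vector<uint64_t>& batch,
             std::vector<std::vector<uint64_t>>& buckets) {
    for (uint64_t x : batch)
        buckets[std::hash<uint64_t>{}(x) % kBuckets].push_back(x);
}

// Steps 3-4: dedupe each bucket independently with a hash set, then
// concatenate. No cross-bucket dedup is needed, since each distinct
// integer lands in exactly one bucket.
std::vector<uint64_t> reduce(const std::vector<std::vector<uint64_t>>& buckets) {
    std::vector<uint64_t> out;
    for (const auto& bucket : buckets) {
        std::unordered_set<uint64_t> unique(bucket.begin(), bucket.end());
        out.insert(out.end(), unique.begin(), unique.end());
    }
    return out;
}
```

Only one bucket's hash set is alive at a time, so peak RAM is roughly the total set size divided by the number of buckets.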
