7

To improve some lookups, I am considering the use of Bloom Filters. But in my use-case, the most probable outcome is that the element do exists in the target set.

Bloom filters can have false positive, but no false negatives. That would make me check the real (big and slow) storage most of the time because of the uncertainties.

Is there another algorithm/data structure with the same properties for space and computation speed (and parallelism of query) that can assure no false positive and a low probability of false negative?

(The max size of the set will be around 300k items, the items will be strings of, at most, 512 characters, and I will have hundreds of sets like that.)

2
  • Sounds to me like you just need an in-memory read-through cache, ideally with some mechanism to prioritise entries which are frequently queried. Commented Jan 11, 2014 at 15:40
  • A cache would not work for me. No item will be check twice. I have a big csv with one key that I am assured have no duplicates. I just need to check which are already in the system. Unfortunately to go to the db to check the existence for every key is a slow process that I am trying to improve. Being keys in the file unique, I don't see how a LRU o MRU cache could improve something. Maybe I could sort the input file.. and some cache technique using along with tries to reduce memory consumption could help.. but I prefer a "lazy programmer" approach jejejeje. Commented Jan 11, 2014 at 16:25

1 Answer 1

5

The nice thing about bloom filters is that their space requirement can be scaled arbitrarily, with the cost of more false positives as the size is decreased.

If you want no false positives whatsoever, you can't take probabilistic shortcuts and will not be able to reduce space requirements significantly (which isn't that much of an issue as storing each of your strings sequentially would only use up a few hundred MB per set).

There two important representations of string sets:

  • hash tables, and
  • tries.

These data structures allow parallel access, and each access is O(1) with respect to the number of items in the set.

Tries have the advantage that common prefixes of strings are shared, thus potentially reducing space requirements below that of sequential storage. A read-only trie can also be compressed further.

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.