1

Suppose I have an Array: ['a', 'b', 'c']. I want to record whether I have seen a particular array before.

I can put the array in a Set, but that is wasteful if I don't need to store the contents of the array, only that I have seen it before.

In Python, I could hash a tuple (i.e. hash(('a', 'b', 'c'))) and store the result in a set to achieve this. What is the way to do this in Ruby?

5
  • @tadman (response to deleted comment) maybe I should have included a more detailed Python example. I am envisioning something like (Python) hash(('a', 'b', 'c')) in {hash(('a', 'b', 'c'))}. It sounds like Object#hash is what I'm looking for though - if that's what Set or Hash use for arrays as keys, then it's good enough for this. Commented Mar 22, 2021 at 19:00
  • 3
    First: Hashing isn't compression. You can't save O(space). Second: Storing pointers is cheap. Just use Set. Commented Mar 22, 2021 at 20:50
  • @ChristopherOezbek this may be reasonably accurate when the arrays contain very simple objects. But if they refer to objects with a large footprint, and so prevent them from being garbage collected, then it is not ideal. Commented Mar 22, 2021 at 21:11
  • 1
    If probabilistic accuracy is enough, then go ahead. Just don't use it anywhere where data might be lost. Commented Mar 22, 2021 at 21:16
  • Also, you can increase the probabilistic accuracy by using a bloom filter. Just like comparing hash values, a bloom filter can tell you "definitely not seen" or "may have seen", with higher accuracy, and modest increase in memory consumption, over a plain hash value comparison. Commented Mar 23, 2021 at 3:52

1 Answer 1

4

Ruby has #hash on most objects, including Array, but these values are not unique and will eventually collide.

For any serious use I'd strongly suggest using something like SHA2-256 or stronger as these are cryptographic hashes designed to minimize collisions.

For example:

require 'digest/sha2' array = %w[ a b c ] array.hash # => 3218529217224510043 Digest::SHA2.hexdigest(array.inspect) # => "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad" 

Where that value is going to be relatively unique. SHA2-256 collisions are really infrequent due to the sheer size of that hash, 256 bits vs. the 64 bit #hash value. That's not 4x stronger, it's 6.2 octodecillion times stronger. That number may as well be a "zillion" given how it has 57 zeroes in it.

Sign up to request clarification or add additional context in comments.

10 Comments

@biqqles Hash sets and hash maps do not assume that two elements/keys are equal just because their hash codes are. They still use == on the elements/keys themselves to make sure that they're actually equal. So the hash function doesn't need to be that strong because a hash collision isn't the end of the world. If you don't keep the values around and only rely on the hash code, that's a whole different ball game because you'll get wrong results when there's a hash collision.
@biqqles I don't know specifically how Hash and Set do it in Ruby (and that may differ from implementation to implementation and version to version anyway), but the commonly used approaches to collision handling are explained on the wiki page about hash tables for example.
@biqqles According to the source they use open addressing in current versions of MRI at least.
@sepp2k: Nitpick: Hash, Set, uniq, uniq! and other "hashtable-like" or "set-like" applications use eql?, not ==. In fact, that is more or less the only use of eql? in Ruby.
"that value is going to be relatively unique" – you have to choose the value you are hashing carefully. In the shown example, %w[a b c], %w[ab c], %w[a bc] and %w[abc] would all have the same SHA2 hash due to the way join('') works. Better use to_s / to_json / Marshal.dump.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.