
I want a hash function that takes a long number (64 bits) and produces a 10-bit result. What is the best hash function for this purpose? The inputs are basically addresses of variables (addresses are 64 bits, or 8 bytes, on Linux), so the hash function should be optimized for that.

  • What information about the distribution of 64-bit values in your universe can you give us? Commented Jul 6, 2012 at 9:27
  • There is no "best" hash function for all cases. You have to study the distribution and characteristics of your input numbers. Commented Jul 6, 2012 at 9:27
  • Input is addresses of variables on Linux. Commented Jul 6, 2012 at 9:28
  • @MetallicPriest: in that case, you can drop the lower 4+ bits (assuming everything is aligned), and at the moment the address space is limited to 47 bits, so you only need to hash 43 bits (if you are willing to not be very future-safe) Commented Jul 6, 2012 at 9:32
  • Though of course it does not hurt to hash all the bits, including the ones that have low entropy (bits 0-2) or are always the same (the topmost 16 bits, which in user space are, at least for now, always zero). That way, 3-4 years from now, you avoid falling for the stupid mistake of reusing this function, having forgotten why and under what premise you designed it initially, and using it for something that doesn't meet those assumptions. This saves public humiliation when e.g. a week's worth of time is wasted because a hash table performs poorly and nobody can explain why. Commented Jul 6, 2012 at 16:17

3 Answers


I would say something like this:

#include <stdint.h>

uint32_t hash(uint64_t x)
{
    x >>= 3;                                      /* drop the 3 alignment bits */
    return (x ^ (x >> 10) ^ (x >> 20)) & 0x3FF;   /* fold three 10-bit blocks into 10 bits */
}

The least significant 3 bits are not very useful, as most variables are 4-byte or 8-byte aligned, so we remove them. Then we take the next 30 bits and mix them together (XOR) in blocks of 10 bits each.

Naturally, you could also fold in (x>>30) ^ (x>>40) ^ (x>>50), but I'm not sure whether that makes any difference in practice.
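If you do want every address bit to have a chance to contribute, a sketch of that extended variant (assuming the same alignment shift and 10-bit folding as above) could look like this:

#include <stdint.h>

/* Extended variant: fold all remaining 10-bit blocks in as well. Whether the
   extra XORs help depends on the actual distribution of addresses. */
uint32_t hash_full(uint64_t x)
{
    x >>= 3;   /* drop the 3 alignment bits */
    return (uint32_t)((x ^ (x >> 10) ^ (x >> 20) ^ (x >> 30) ^ (x >> 40) ^ (x >> 50)) & 0x3FF);
}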


1 Comment

Since you use xor-shift for mixing, I would recommend using one of the 275 known triplets with a 2^64-1 period in their 64x64 matrix, as described by Marsaglia, for example (7,11,10) or (21,17,48). Since this mixes bits in a pseudorandom way with no known oddities, it is valid to xor together all the words just before doing the &0x3ff. That way, every input bit should have a chance to influence all output bits. Maybe not as perfectly 50:50 distributed as in a cryptographic hash, but as good as you can get. Other than that, still an excellent idea, +1
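A minimal sketch of what this comment suggests, with the caveat that the particular triplet (21,17,48) and the left/right/left shift order are assumptions on my part:

#include <stdint.h>

/* One xorshift round with a full-period Marsaglia triplet (here (21,17,48);
   the triplet choice and shift directions are assumptions), then XOR all
   10-bit blocks together before masking down to 10 bits. */
uint32_t hash_xorshift10(uint64_t x)
{
    x ^= x << 21;
    x ^= x >> 17;
    x ^= x << 48;
    return (uint32_t)((x ^ (x >> 10) ^ (x >> 20) ^ (x >> 30)
                         ^ (x >> 40) ^ (x >> 50) ^ (x >> 60)) & 0x3FF);
}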

Best for most distributions is taking the value mod a prime; 1021 is the largest 10-bit prime. There's no need to strip the low bits.

#include <stdint.h>

static inline int hashaddress(void *v)
{
    return (uintptr_t)v % 1021;
}

If you think performance might be a concern, have a few alternates on hand and race them in your actual program. Microbenchmarks are a waste of time; a difference of a few cycles is almost certain to be swamped by cache effects, and size matters.
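One low-ceremony way to race them (the function-pointer setup and the names here are just an illustration) is to put the candidates behind a common signature and switch between them in the real program while measuring end-to-end behaviour:

#include <stdint.h>

/* Illustrative harness: candidate hashes behind one signature so the real
   program can swap them while you measure actual performance. */
typedef unsigned (*addr_hash_fn)(void *v);

static unsigned hash_mod_prime(void *v)
{
    return (uintptr_t)v % 1021;
}

static unsigned hash_shift_fold(void *v)
{
    uintptr_t x = (uintptr_t)v >> 3;
    return (x ^ (x >> 10) ^ (x >> 20)) & 0x3FF;
}

/* Pick one candidate at startup (build flag, environment variable, ...) and
   use it wherever the table is indexed. */
static addr_hash_fn table_hash = hash_mod_prime;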



I wrote a toy program to see some real addresses on the stack, data area, and heap. Basically I declared 4 globals, 4 locals and did 2 mallocs. I dropped the last two bits when printing the addresses. Here is an output from one of the runs:

 20125e8 20125e6 20125e7 20125e4 3fef2131 3fef2130 3fef212f 3fef212c 25e4802 25e4806 

What this tells me:

  1. The LSB in this output (the 3rd bit of the address) frequently toggles between 'on' and 'off', so I wouldn't drop it when calculating the hash. Dropping 2 LSBs seems enough.
  2. We also see that there is more entropy in the lower 8-10 bits. We must use that when calculating the hash.
  3. We know that on a 64 bit machine, virtual addresses are never more than 48 bits wide.

What I would do next:

/* Drop two LSBs. */
a >>= 2;

/* Get rid of the MSBs -- keep 46 bits. */
a &= 0x3fffffffffff;

/* Get the 14 MSBs and fold them in to get a 32-bit integer.
   The MSBs are mostly 0s anyway, so we don't lose much entropy. */
msbs = (a >> 32) << 18;
a ^= msbs;

Now we pass this through a decent 'half avalanche' hash function, instead of rolling our own. 'Half avalanche' means each bit of the input gets a chance to affect bits at the same position and higher:

uint32_t half_avalanche(uint32_t a)
{
    a = (a + 0x479ab41d) + (a << 8);
    a = (a ^ 0xe4aa10ce) ^ (a >> 5);
    a = (a + 0x9942f0a6) - (a << 14);
    a = (a ^ 0x5aedd67d) ^ (a >> 3);
    a = (a + 0x17bea992) + (a << 7);
    return a;
}

For a 10-bit hash, use the 10 MSBs of the uint32_t returned. The hash function continues to work fine if you pick the N MSBs for an N-bit hash, effectively doubling the bucket count with each additional bit.
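Tying the steps together, a hypothetical wrapper (the name bucket_index and the nbits parameter are mine, just for illustration) might look like:

#include <stdint.h>

/* Hypothetical wrapper: fold the address down to 32 bits as described above,
   run half_avalanche(), and keep the top nbits bits (1 <= nbits <= 32). */
static unsigned bucket_index(void *p, unsigned nbits)
{
    uint64_t a = (uint64_t)(uintptr_t)p;
    a >>= 2;                     /* drop two LSBs */
    a &= 0x3fffffffffffULL;      /* keep 46 bits */
    a ^= (a >> 32) << 18;        /* fold the 14 MSBs back in */
    return half_avalanche((uint32_t)a) >> (32 - nbits);
}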

I was a little bored, so I wrote a toy benchmark for this. Nothing fancy: it allocates a bunch of memory on the heap and tries out the hash I described above. The source can be had from here. An example result:

1024 buckets, 256 values generated, 29 collisions
1024 buckets, 512 values generated, 103 collisions
1024 buckets, 1024 values generated, 370 collisions

Next: I tried out the two hashes from the other answers here. Both have similar performance, so it looks like you can just pick the fastest one ;)

