
I am creating a system that stores large numpy arrays in pyarrow.plasma. I want to give each array a unique, deterministic plasma.ObjectID. Sadly, np.array is not hashable. My current (broken) approach is:

```python
import numpy as np
from pyarrow import plasma

def int_to_bytes(x: int) -> bytes:
    # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3
    return x.to_bytes((x.bit_length() + 7) // 8, "big")

def get_object_id(arr):
    arr_id = int(arr.sum() / (arr.shape[0]))
    oid: bytes = int_to_bytes(arr_id).zfill(20)  # fill from left with zeroes, must be of length 20
    return plasma.ObjectID(oid)
```

But this can easily fail, for example:

```python
arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3, 2, 2)
assert get_object_id(a1) != get_object_id(a2), 'Hash collision'
# another good test case
assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4, 3))
assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))
```

It also involves summing the array, which could be very slow for large arrays. Feel free to assume that the dtype of arr will be np.uint or np.int.

I now think that it's impossible to never have a hash collision (an ID is only 20 bytes, so there are just 2^160 possible IDs and far more possible inputs), so I am just looking for something that is either a) cheaper to compute, or b) less likely to fail in practice

or, ideally, both!

1 Answer


The hashlib module has routines for computing message digests (such as SHA-2 and BLAKE2) from byte strings. You can convert an ndarray into a bytes string with ndarray.tobytes; however, your examples will still fail because those arrays have the same bytes but different shapes. So you should just hash the shape as well.

```python
import hashlib

def hasharr(arr):
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    return h.digest()
```

Example:

```python
>>> hasharr(a1)
b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
>>> hasharr(a2)
b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"
```

I'm not an expert on blake2b, so you'd have to do your own research to figure out how likely a collision would be, but with a 160-bit digest an accidental collision is astronomically unlikely for any realistic number of arrays.
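Putting it together as a self-contained sketch that passes the question's test cases (the final plasma.ObjectID wrapping is shown only as a comment, since pyarrow.plasma has been deprecated and removed in recent pyarrow releases):

```python
import hashlib
import numpy as np

def hasharr(arr):
    # Hash the raw bytes, then mix in each dimension of the shape
    # so arrays with identical data but different shapes differ.
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    return h.digest()

# The question's failing test cases now pass:
arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3, 2, 2)
assert hasharr(a1) != hasharr(a2), 'Hash collision'
assert hasharr(np.ones(12)) != hasharr(np.ones(12).reshape(4, 3))
assert hasharr(np.ones(12)) != hasharr(np.zeros(12))

# The 20-byte digest is exactly the length plasma.ObjectID expects:
# oid = plasma.ObjectID(hasharr(a1))
```

Note that reading every byte of the array is unavoidable if you want the ID to depend deterministically on the full contents; blake2b is fast enough that this is usually cheaper than the original `arr.sum()` anyway.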

I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert each buffer (there will be multiple, and some may be None) to a byte string with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here because pyarrow arrays are always one-dimensional.


3 Comments

This is really close to working! The only issue with it is that if a1.shape has an entry larger than 255, bytes(a1.shape) raises ValueError: bytes must be in range(0, 256)
Thanks for pointing that out. I've updated the algorithm. It allocates 4 bytes per dimension, which should be good for most cases I think. You could always make it dynamic to use as few bytes as necessary as a bonus exercise.
Cool! I ended up just hashing the shape of the array and arr[0].to_bytes() to make the hash faster to compute, but this was massively helpful!
