Revisions to Deterministic method to hash np.array -> int

added 47 characters in body

edited Feb 26, 2021 at 17:49

The hashlib module has routines for computing hashes of byte strings (typically used for checksums and data-integrity checks). You can convert an ndarray into a byte string with ndarray.tobytes(); however, your examples will still collide because those arrays have the same bytes but different shapes. So you could just hash the shape as well.

    import hashlib

    def hasharr(arr):
        # Hash the raw data first, then mix in each dimension of the shape.
        hash = hashlib.blake2b(arr.tobytes(), digest_size=20)
        for dim in arr.shape:
            hash.update(dim.to_bytes(4, byteorder='big'))
        return hash.digest()

Example:

    >>> hasharr(a1)
    b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
    >>> hasharr(a2)
    b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"
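Since the question asks for an int, one way to finish the job is to convert the digest with int.from_bytes. The sketch below is a self-contained version of the same idea; a1 and a2 are hypothetical stand-ins for the question's arrays (same bytes, different shapes), and hasharr_int is a name made up for this example.

```python
import hashlib

import numpy as np


def hasharr(arr):
    # Hash the raw data, then each dimension of the shape as 4 big-endian bytes.
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    return h.digest()


def hasharr_int(arr):
    # Hypothetical helper: interpret the 20-byte digest as a non-negative int.
    return int.from_bytes(hasharr(arr), byteorder='big')


# Stand-in arrays: identical bytes, different shapes.
a1 = np.arange(6, dtype=np.int64).reshape(2, 3)
a2 = np.arange(6, dtype=np.int64).reshape(3, 2)

print(hasharr_int(a1) == hasharr_int(a1.copy()))  # same content, same hash
print(hasharr_int(a1) == hasharr_int(a2))         # shapes differ, hashes differ
```

Note that the dtype is not part of the hash here, so two arrays with the same raw bytes and shape but different dtypes would hash the same. Mixing in str(arr.dtype) would be one way to cover that, if it matters for your data.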

I'm not an expert on blake2b so you'd have to do your own research to figure out how likely a collision would be.
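As a rough guide, and assuming BLAKE2b behaves like an ideal random function (an assumption, not something verified here), the birthday bound estimates the chance of any collision among n distinct inputs as about n^2 / 2^(bits + 1), where bits is the digest size in bits (160 for digest_size=20). A quick sketch, with collision_probability as a made-up helper name:

```python
def collision_probability(n_items, digest_size=20):
    # Birthday-bound approximation: p ~= n^2 / 2^(bits + 1).
    # Only meaningful while the result is much smaller than 1.
    bits = 8 * digest_size
    return n_items ** 2 / 2 ** (bits + 1)


# Example: a billion distinct arrays with a 20-byte digest.
print(collision_probability(10 ** 9))
```

Under that assumption, accidental collisions are negligible for any realistic number of arrays; a smaller digest_size raises the estimate quickly.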

I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert them (there will be several, and some may be None) to byte strings with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here because pyarrow arrays are always one-dimensional.
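The buffer approach could look roughly like the sketch below. hash_buffers is a hypothetical helper, and the marker bytes and length prefix are additions of this sketch (so that buffer boundaries and missing buffers affect the result), not something the description above prescribes. It accepts any sequence of bytes-like objects or None; for a pyarrow array you would pass arr.buffers().

```python
import hashlib


def hash_buffers(buffers, digest_size=20):
    h = hashlib.blake2b(digest_size=digest_size)
    for buf in buffers:
        if buf is None:
            # Missing buffer (e.g. no validity bitmap): hash a marker byte.
            h.update(b'\x00')
            continue
        # pyarrow buffers expose to_pybytes(); fall back to bytes() otherwise.
        data = buf.to_pybytes() if hasattr(buf, 'to_pybytes') else bytes(buf)
        # Marker byte plus an 8-byte length prefix keeps buffer boundaries distinct.
        h.update(b'\x01' + len(data).to_bytes(8, byteorder='big'))
        h.update(data)
    return h.digest()


# Plain bytes stand in for pyarrow buffers so the example runs without pyarrow.
# With pyarrow installed: hash_buffers(pa.array([1, 2, 3]).buffers())
print(hash_buffers([None, b'\x01\x02\x03']).hex())
```

One caveat: a sliced pyarrow array shares its parent's buffers, so this sketch is only meant for unsliced arrays.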

Refined the hasharr function to improve performance

edited Feb 25, 2021 at 20:44

Pace

The hashlib module has routines for computing hashes of byte strings (typically used for checksums and data-integrity checks). You can convert an ndarray into a byte string with ndarray.tobytes(); however, your examples will still collide because those arrays have the same bytes but different shapes. So you could just hash the shape as well.

    import hashlib

    def hasharr(arr):
        # Note: bytes(arr.shape) only works while every dimension is below 256.
        hash = hashlib.blake2b(bytes(arr.shape), digest_size=20)
        hash.update(arr.tobytes())
        return hash.digest()

Example:

    >>> hasharr(a1)
    b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
    >>> hasharr(a2)
    b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"

I'm not an expert on blake2b so you'd have to do your own research to figure out how likely a collision would be.

I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert them (there will be several, and some may be None) to byte strings with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here because pyarrow arrays are always one-dimensional.

answered Feb 25, 2021 at 20:36

Pace

The hashlib module has routines for computing hashes of byte strings (typically used for checksums and data-integrity checks). You can convert an ndarray into a byte string with ndarray.tobytes(); however, your examples will still collide because those arrays have the same bytes but different shapes. So you could just hash the shape as well.

    >>> def hasharr(arr):
    ...     return hashlib.blake2b(bytes(arr.shape) + arr.tobytes(), digest_size=20).digest()
    ...
    >>> hasharr(a1)
    b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
    >>> hasharr(a2)
    b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"

I'm not an expert on blake2b so you'd have to do your own research to figure out how likely a collision would be.

I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert them (there will be several, and some may be None) to byte strings with buf.to_pybytes(). Just concatenate all the buffers and hash that. There is no need to worry about the shape here because pyarrow arrays are always one-dimensional.
