Repeatably hashing an arbitrary Python tuple

Question

I'm writing a specialised unit testing tool that needs to save the results of tests to be compared against in the future. Thus I need to be able to consistently map parameters that were passed to each test to the test result from running the test function with those parameters for each version. I was hoping there was a way to just hash the tuple and use that hash to name the files where I store the test results.

My first impulse was just to call hash() on the tuple of parameters, but of course that won't work since hash is randomized between interpreter instances now.

I'm having a hard time coming up with a way that works for whatever arbitrary elements might be in the tuple (I guess restricting it to a mix of ints, floats, strings, and lists/tuples of those three would be okay). Any ideas?

I've thought of using the repr of the tuple or pickling it, but repr isn't guaranteed to produce byte-for-byte same output for same input, and I don't think pickling is either (is it?)

I've seen this already, but the answers all rely on the same assumption, which no longer holds, and they don't really translate to this problem anyway. A lot of the discussion was about making the hash independent of the order items come up in, and I do want the hash to depend on order.

I would pickle the tuple of the parameters and the results into the same file whose name is the hashed tuple of parameters. That way, you should not need to worry about the randomization, because the original tuple is in the file.
– DYZ, Commented Jan 30, 2018 at 2:45
@DYZ Right, I'm doing that too, but I need the tuple to be hashed repeatably to be able to find the file in the first place.
– Schilcote, Commented Jan 30, 2018 at 2:47
@ShadowRanger That... would work, I suppose, but it's horribly inelegant and does technically mean someone can DOS my CI server with a specially crafted merge request.
– Schilcote, Commented Jan 30, 2018 at 2:49
@Schilcote: So if this is a public facing server, that's a bad idea; the question made it sound like this was just for repeatable (assumed local) unit tests.
– ShadowRanger, Commented Jan 30, 2018 at 2:50

Tom Tang · Accepted Answer · 2018-01-30 03:55:21Z

5

I'm not sure I fully understand your question, but I'll give it a try.

Before hashing, serialize the parameters to a JSON string, then compute the hash over that string.

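import hashlib
import json

params = (1, 3, 2)
# hashlib needs bytes on Python 3, so encode the JSON string first
hashlib.sha224(json.dumps(params).encode("utf-8")).hexdigest()
# '5f0f7a621e6f420002d54ee28b0c169b8112ef72d8a6b60e6a25171c'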

If your params is a dictionary, use sort_keys=True to ensure your keys are sorted.

params = {'b': 123, 'c': 345}
# sort_keys=True fixes the key order; encode to bytes before hashing
hashlib.sha224(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
# '2e75966ce3f1185cbfb4eccc49d5552c08cfb7502a8765fe1dce9303'
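To tie this back to naming result files, here is a minimal sketch of a helper that turns a parameter tuple into a stable file name. The name `params_digest`, the choice of SHA-256, and the `.json` suffix are assumptions for illustration, not anything from the question. Note that JSON has no tuple type, so a tuple and a list with the same contents produce the same digest.

```python
import hashlib
import json


def params_digest(params):
    """Return a hex digest that does not depend on hash randomization.

    `params_digest` is a hypothetical helper name, not a library function.
    """
    # Pin the separators so the serialized bytes do not depend on
    # json.dumps' whitespace defaults; sort_keys covers nested dicts.
    text = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


params = (1, 2.5, "abc", [4, 5], (6, 7))
digest = params_digest(params)
filename = digest + ".json"  # hypothetical naming scheme for the results file
```

The digest depends on element order, which is what the question asks for, but it cannot distinguish `(1, (2, 3))` from `(1, [2, 3])`.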

edited Jan 30, 2018 at 3:55

answered Jan 30, 2018 at 2:49

Tom Tang


12 Comments

Schilcote Over a year ago

Is the JSON result guaranteed to be the same every time?

user2864740 Over a year ago

@Schilcote For lists or tuples (and any stable primitives in such) it should be.

Tom Tang Over a year ago

If you're serializing a tuple / list, then yes.

DYZ Over a year ago

And for dictionaries, you can set sort_keys to True.

ShadowRanger Over a year ago

@DYZ: A caution: sort_keys only works on Python 3 if the keys are homogeneous types (or otherwise have defined comparisons for all pairs of heterogeneous types, e.g. a mix of int and float is fine, but int and str is not). On Python 2, the fallback comparison allows it to (usually) work (though not necessarily repeatably, since the same-type fallback comparison is based on memory address, which isn't repeatable), but on Python 3 you'll just get a TypeError.
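A small sketch of the caveat above, assuming Python 3; the dictionaries are made-up examples. Keys that can be compared with each other sort fine, while a mix of int and str keys raises TypeError.

```python
import json

# int and float keys can be compared with each other, so sorting succeeds.
ok = json.dumps({2.5: "b", 1: "a"}, sort_keys=True)

# int and str keys have no defined ordering on Python 3.
try:
    json.dumps({1: "a", "b": 2}, sort_keys=True)
    mixed_keys_failed = False
except TypeError:
    mixed_keys_failed = True
```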


ShadowRanger · Accepted Answer · 2018-01-30 02:49:14Z

2

One approach for simple tests would be to disable the hash randomization entirely by setting PYTHONHASHSEED=0 in the environment that launches your script, e.g., in bash, doing:

export PYTHONHASHSEED=0
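One way to check the effect from Python itself is to launch fresh interpreters with the variable set and compare the results. This is only a sketch: `hash_in_fresh_interpreter` is a hypothetical helper, and the values should only be expected to match for the same Python version and build, since the hash algorithm is not guaranteed to stay the same across versions.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(seed):
    """Return hash('abc') as computed by a new interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('abc'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate interpreter processes with the same fixed seed.
first = hash_in_fresh_interpreter(0)
second = hash_in_fresh_interpreter(0)
```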

answered Jan 30, 2018 at 2:49

ShadowRanger


4 Comments

ShadowRanger Over a year ago

Note: This is only for the case where the tests are local; doing it on a public facing web service would expose you to denial of service attacks (which is what hash randomization was designed to protect you from).

mkoistinen Over a year ago

Disabling the hash randomization won't help the problem with hash(-1) == hash(-2), as all integers hash to themselves, except -1, which hashes to -2 (at least as recently as Python 3.8.2).
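The collision is easy to reproduce. A quick check, assuming CPython (other implementations may behave differently):

```python
# CPython uses -1 internally as an error code for hash functions,
# so hash(-1) is remapped to -2 and collides with hash(-2).
ints_collide = hash(-1) == hash(-2)

# A tuple's hash is built from its elements' hashes,
# so the collision carries over to tuples.
tuples_collide = hash((-1,)) == hash((-2,))
```

Both comparisons hold regardless of PYTHONHASHSEED, because integer hashing is not randomized.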

ShadowRanger Over a year ago

@mkoistinen: Sure? But that's a problem with hashing in general, and kind of irrelevant to this answer. Hash randomization is intended to remove the ability to craft colliding hashes; disabling it allows you to find colliding strings just as easily as you found colliding ints. hash of an int is going to potentially collide whether or not you disable it.

mkoistinen Over a year ago

My response wasn't a criticism of your helpful post, nor did I downvote it, but rather my comment is a warning to others who attempt to use hash() similarly to the original question.
