2

I'm writing a specialised unit testing tool that needs to save the results of tests to be compared against in the future. Thus I need to be able to consistently map parameters that were passed to each test to the test result from running the test function with those parameters for each version. I was hoping there was a way to just hash the tuple and use that hash to name the files where I store the test results.

My first impulse was just to call hash() on the tuple of parameters, but of course that won't work since hash is randomized between interpreter instances now.

I'm having a hard time coming up with a way that works for whatever arbitrary elements might be in the tuple (I guess restricting it to a mix of ints, floats, strings, and lists/tuples of those three would be okay). Any ideas?

I've thought of using the repr of the tuple or pickling it, but repr isn't guaranteed to produce byte-for-byte the same output for the same input, and I don't think pickling is either (is it?).

I've seen this already, but the answers are all based on that same assumption that doesn't hold anymore, and they don't really translate to this problem anyway. A lot of the discussion was about making the hash independent of the order in which items come up, and I do want the hash to depend on order.

  • I would pickle the tuple of the parameters and the results into the same file whose name is the hashed tuple of parameters. That way, you should not need to worry about the randomization, because the original tuple is in the file. Commented Jan 30, 2018 at 2:45
  • @DYZ Right, I'm doing that too, but I need the tuple to be hashed repeatably to be able to find the file in the first place. Commented Jan 30, 2018 at 2:47
  • Is disabling the hash randomization acceptable? Commented Jan 30, 2018 at 2:47
  • @ShadowRanger That... would work, I suppose, but it's horribly inelegant and does technically mean someone can DOS my CI server with a specially crafted merge request. Commented Jan 30, 2018 at 2:49
  • @Schilcote: So if this is a public facing server, that's a bad idea; the question made it sound like this was just for repeatable (assumed local) unit tests. Commented Jan 30, 2018 at 2:50

2 Answers

5

Not sure if I understand your question fully, but will just give it a try.

Before you hash, serialize the parameters to a JSON string, then compute the hash of that JSON string.

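import hashlib
import json
params = (1, 3, 2)
# On Python 3, hashlib needs bytes, so encode the JSON string first
hashlib.sha224(json.dumps(params).encode("utf-8")).hexdigest()
# '5f0f7a621e6f420002d54ee28b0c169b8112ef72d8a6b60e6a25171c'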

If params is a dictionary, use sort_keys=True to ensure the keys are always serialized in the same order.

params = {'b': 123, 'c': 345}
# sort_keys=True makes the key order, and therefore the digest, repeatable
hashlib.sha224(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
# '2e75966ce3f1185cbfb4eccc49d5552c08cfb7502a8765fe1dce9303'
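For the file-naming use case in the question, the JSON approach could be wrapped in a small helper. This is only a sketch: the names params_digest and result_filename, the choice of SHA-256, the compact separators, and the ".json" suffix are all assumptions for illustration, not part of the answer above. Note that JSON has no tuple type, so a tuple and a list with the same contents produce the same digest.

```python
import hashlib
import json


def params_digest(params):
    """Return a hex digest that does not depend on hash randomization.

    Hypothetical helper: the name and the SHA-256 choice are assumptions.
    """
    # Fixed separators and sorted keys pin down one textual form per input.
    # Tuples are serialized as JSON arrays, so (1, 2) and [1, 2] are equal here.
    text = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def result_filename(params):
    """Build a file name for stored test results (the suffix is an assumption)."""
    return params_digest(params) + ".json"


print(result_filename((1, 3.5, "abc", [4, 5])))
```

Because the digest is computed from the serialized text, element order matters: (1, 3, 2) and (1, 2, 3) map to different file names, which is what the question asks for.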

12 Comments

Is the JSON result guaranteed to be the same every time?
@Schilcote For lists or tuples (and any stable primitives in such) it should be.
If you're serializing a tuple / list, then yes.
And for dictionaries, you can set sort_keys to True.
@DYZ: A caution: sort_keys only works on Python 3 if the keys are homogeneous types (or otherwise have defined comparisons for all pairs of heterogeneous types, e.g. a mix of int and float is fine, but int and str is not). On Python 2, the fallback comparison allows it to (usually) work (though not necessarily repeatably, since the same-type fallback comparison is based on memory address, which isn't repeatable), but on Python 3 you'll just get a TypeError.
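The sort_keys caveat in the last comment can be reproduced with a short sketch on Python 3. The variable names are illustrative, and the string-key normalization at the end is one possible workaround assumed here, not something proposed in the thread; it would merge keys such as 1 and "1", so it only suits data where that cannot happen.

```python
import json

# A dict whose keys mix int and str, which Python 3 cannot order.
mixed = {1: "a", "b": 2}

try:
    json.dumps(mixed, sort_keys=True)
    outcome = "ok"
except TypeError:
    # Sorting the keys compares 1 with "b", which raises on Python 3.
    outcome = "TypeError"

# Assumed workaround: convert every key to str before serializing.
normalized = {str(key): value for key, value in mixed.items()}
text = json.dumps(normalized, sort_keys=True)

print(outcome)
print(text)
```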
2

One approach for simple tests would be to disable the hash randomization entirely by setting PYTHONHASHSEED=0 in the environment that launches your script, e.g., in bash, doing:

export PYTHONHASHSEED=0 
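A quick way to check the effect is to start fresh interpreters with the variable set and compare what hash() returns. The sketch below assumes Python 3.7 or later (for capture_output); the helper name hash_in_fresh_interpreter and the sample tuple are made up for illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(seed):
    """Run hash() on a fixed tuple in a new interpreter with the given seed."""
    # Copy the current environment and override only PYTHONHASHSEED.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    completed = subprocess.run(
        [sys.executable, "-c", "print(hash(('a', 1, 2.5)))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


first = hash_in_fresh_interpreter("0")
second = hash_in_fresh_interpreter("0")
print(first == second)
```

With the seed fixed, both child interpreters report the same value. The value can still differ between Python versions or platforms, so it is safer not to treat it as a long-term identifier.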

4 Comments

Note: This is only for the case where the tests are local; doing it on a public facing web service would expose you to denial of service attacks (which is what hash randomization was designed to protect you from).
Disabling the hash randomization won't help with the problem that hash(-1) == hash(-2): small integers hash to themselves, except -1, which hashes to -2 (at least as recently as Python 3.8.2).
@mkoistinen: Sure? But that's a problem with hashing in general, and kind of irrelevant to this answer. Hash randomization is intended to remove the ability to craft colliding hashes; disabling it allows you to find colliding strings just as easily as you found colliding ints. hash of an int is going to potentially collide whether or not you disable it.
My response wasn't a criticism of your helpful post, nor did I downvote it, but rather my comment is a warning to others who attempt to use hash() similarly to the original question.
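The collision mentioned in these comments is easy to see, and it is a CPython implementation detail rather than a language guarantee. As a contrast, the sketch below also computes a digest over the JSON text of the parameters, which does not share this collision. The helper name digest and the use of SHA-256 are assumptions for illustration.

```python
import hashlib
import json

# CPython detail: -1 is used as an error code inside the C hash functions,
# so hash(-1) is remapped to -2 and collides with hash(-2).
builtin_collides = hash(-1) == hash(-2)


def digest(params):
    """Hash the JSON text of params instead of relying on the built-in hash()."""
    text = json.dumps(params)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The serialized forms "[-1]" and "[-2]" differ, so their digests differ too.
digest_collides = digest((-1,)) == digest((-2,))

print(builtin_collides, digest_collides)
```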
