
With Python 2.7, the following code computes the MD5 hexdigest of the content of a file.

(EDIT: well, not really, as the answers have shown; I just thought so.)

    import hashlib

    def md5sum(filename):
        f = open(filename, mode='rb')
        d = hashlib.md5()
        for buf in f.read(128):
            d.update(buf)
        return d.hexdigest()

Now if I run that code using Python 3, it raises a TypeError exception:

        d.update(buf)
    TypeError: object supporting the buffer API required

I figured out that I could make that code run with both Python 2 and Python 3 by changing it to:

    def md5sum(filename):
        f = open(filename, mode='r')
        d = hashlib.md5()
        for buf in f.read(128):
            d.update(buf.encode())
        return d.hexdigest()

Now I still wonder why the original code stopped working. It seems that when a file is opened with the binary mode modifier, iterating over the data returns integers instead of strings encoded as bytes (I say that because type(buf) returns int). Is this behavior documented somewhere?
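For illustration (this snippet is an editorial addition, not part of the original question), iterating over a bytes object in Python 3 yields integers, whereas iterating over a Python 2 str yields one-character strings, which is why type(buf) is int here:

    # Editorial illustration: iterating over bytes yields ints in Python 3.
    data = b'abcd'
    print(list(data))       # [97, 98, 99, 100]
    print(type(data[0]))    # <class 'int'>
    # In Python 2, iterating over the str 'abcd' would yield 'a', 'b', 'c', 'd'.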


3 Answers


I think you wanted the for-loop to make successive calls to f.read(128). That can be done using iter() and functools.partial():

    import hashlib
    from functools import partial

    def md5sum(filename):
        with open(filename, mode='rb') as f:
            d = hashlib.md5()
            for buf in iter(partial(f.read, 128), b''):
                d.update(buf)
        return d.hexdigest()

    print(md5sum('utils.py'))
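The two-argument form iter(callable, sentinel) keeps calling the callable until it returns the sentinel, so the same idea also works with a plain lambda instead of functools.partial. This variant is an editorial paraphrase of the answer above, not part of it:

    import hashlib

    def md5sum(filename):
        d = hashlib.md5()
        with open(filename, mode='rb') as f:
            # iter(callable, sentinel): call f.read(128) repeatedly until it returns b''
            for buf in iter(lambda: f.read(128), b''):
                d.update(buf)
        return d.hexdigest()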

8 Comments

Yes, that's exactly what I was trying to do. I finally achieved that with a less elegant solution than yours using a generator.
This leaks the file handle on some Python implementations. You should at least call close.
I've added a with statement to close the file properly.
@phihag: is there really a Python implementation where the automatic close actually leaks file handles? I thought it merely delayed the release of these file handles until garbage collection?
@RaymondHettinger: if you don't like it, just revert the change. I considered it too minor a change to discuss. Though I strongly disagree with your reasoning: public code should follow best practices, especially if it is aimed at beginners. If best practices are too hard to follow for such a common task (though I don't think that is the case), then the language should change.
    for buf in f.read(128):
        d.update(buf)

... updates the hash sequentially with each of the first 128 byte values of the file. Since iterating over a bytes object produces int objects, you get the following calls, which cause the error you encountered in Python 3.

    d.update(97)
    d.update(98)
    d.update(99)
    d.update(100)

which is not what you want.
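To see why the Python 2 version ran without raising but still hashed only the first 128 bytes, note that feeding update() single bytes one at a time produces the same digest as feeding the same bytes all at once (this snippet is an editorial addition, not part of the answer):

    import hashlib

    data = b'abcd'
    d = hashlib.md5()
    for b in data:
        d.update(bytes([b]))   # one byte at a time, as the Python 2 loop effectively did
    assert d.hexdigest() == hashlib.md5(data).hexdigest()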

Instead, you want:

    import hashlib

    def md5sum(filename):
        with open(filename, mode='rb') as f:
            d = hashlib.md5()
            while True:
                buf = f.read(4096)  # 128 is smaller than the typical filesystem block
                if not buf:
                    break
                d.update(buf)
        return d.hexdigest()
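As a quick sanity check (an editorial addition, reusing the 'utils.py' example file from the answer above), the chunked digest should match hashing the whole file in a single read:

    import hashlib

    with open('utils.py', 'rb') as f:               # 'utils.py' is just an example file
        one_shot = hashlib.md5(f.read()).hexdigest()

    assert md5sum('utils.py') == one_shot
    print(one_shot)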

7 Comments

This will eat the whole RAM if you open a huge file. That's why we buffer.
@fastreload Already added that ;). Since the original solution didn't even work for files with >128 bytes, I don't think memory is an issue, but I added a buffered read anyway.
Well done then, yet the OP claimed that he could use his code in Python 2.x and it stopped working on 3.x. And I remember I used a 1-byte buffer for calculating the md5 of a 3 GB ISO file for benchmarking, and it did not fail. My bet is that Python 2.7 has a failsafe mechanism so that, whatever the user input is, the minimum buffer size does not go below a certain level. What do you say?
@fastreload The code didn't crash in Python 2 since iterating over a str produced str. The result was still wrong for files larger than 128 bytes. Sure, you can adjust the buffer size as you want (unless you have a fast SSD, the CPU will get bored anyway, and good OSs preload the next bytes of the file). Python 2.7 definitely has no such failsafe mechanism; that would violate the contract of read. The OP simply did not compare the results of the script with the canonical md5sum's, or the results of the script on two files with identical first 128 bytes.
Yes, my original code is indeed broken (but not yet in the wild). I just didn't test it on large files with the same beginning. I should have guessed there was a real problem, as it was running way too fast.

I finally changed my code to the version below (which I find easy to understand) after asking the question. But I will probably change it to the version suggested by Raymond Hettinger using functools.partial.

    import hashlib

    def chunks(filename, chunksize):
        f = open(filename, mode='rb')
        buf = "Let's go"
        while len(buf):
            buf = f.read(chunksize)
            yield buf

    def md5sum(filename):
        d = hashlib.md5()
        for buf in chunks(filename, 128):
            d.update(buf)
        return d.hexdigest()

3 Comments

This will now work even if the file length is not a multiple of chunksize; read will in fact return a shorter buffer in the last read. The termination is given by an empty buffer, which is why there is the "not buf" condition in the example code above (which works).
@Mapio: there is indeed a kind of bug in my code, but not at all where you say. The file length is irrelevant. The code above works provided there is no partial read returning incomplete buffers. If a partial read occurs, it will stop too soon (while still taking the partial buffer into account). A partial read may occur in some cases, say if the program receives a handled interrupt signal while reading, and then continues reading after returning from the interruption.
Well, in the above comment, when speaking of the "code above" I was referring to the old version. The current one is now working (even if it's not the best possible solution).
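For completeness, here is a sketch of a generator variant (an editorial addition, not from the thread) that drops the dummy initial buffer and terminates only when read() returns an empty bytes object, assuming that is the desired behavior:

    import hashlib

    def chunks(filename, chunksize):
        # Sketch: stop only when read() returns b'', and close the file via 'with'.
        with open(filename, mode='rb') as f:
            while True:
                buf = f.read(chunksize)
                if not buf:
                    return
                yield buf

    def md5sum(filename):
        d = hashlib.md5()
        for buf in chunks(filename, 4096):
            d.update(buf)
        return d.hexdigest()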
