
With Python 2.7, the following code computes the MD5 hexdigest of the content of a file.

(EDIT: well, not really, as the answers have shown; I just thought so.)

    import hashlib

    def md5sum(filename):
        f = open(filename, mode='rb')
        d = hashlib.md5()
        for buf in f.read(128):
            d.update(buf)
        return d.hexdigest()

Now if I run that code using Python 3, it raises a TypeError exception:

        d.update(buf)
    TypeError: object supporting the buffer API required

I figured out that I could make that code run with both Python 2 and Python 3 by changing it to:

    def md5sum(filename):
        f = open(filename, mode='r')
        d = hashlib.md5()
        for buf in f.read(128):
            d.update(buf.encode())
        return d.hexdigest()

Now I still wonder why the original code stopped working. It seems that when a file is opened with the binary mode modifier, iterating over the data returns integers instead of strings encoded as bytes (I say that because type(buf) returns int). Is this behavior documented somewhere?
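For illustration (this snippet is an editorial addition, not part of the original question), iterating over a bytes object in Python 3 yields integers, whereas iterating over a Python 2 str yields one-character strings, which is why type(buf) is int here:

    # Editorial illustration: iterating over bytes yields ints in Python 3.
    data = b'abcd'
    print(list(data))       # [97, 98, 99, 100]
    print(type(data[0]))    # <class 'int'>
    # In Python 2, iterating over the str 'abcd' would yield 'a', 'b', 'c', 'd'.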


3 Answers


I think you wanted the for-loop to make successive calls to f.read(128). That can be done using iter() and functools.partial():

    import hashlib
    from functools import partial

    def md5sum(filename):
        with open(filename, mode='rb') as f:
            d = hashlib.md5()
            for buf in iter(partial(f.read, 128), b''):
                d.update(buf)
        return d.hexdigest()

    print(md5sum('utils.py'))
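The two-argument form iter(callable, sentinel) keeps calling the callable until it returns the sentinel, so the same idea also works with a plain lambda instead of functools.partial. This variant is an editorial paraphrase of the answer above, not part of it:

    import hashlib

    def md5sum(filename):
        d = hashlib.md5()
        with open(filename, mode='rb') as f:
            # iter(callable, sentinel): call f.read(128) repeatedly until it returns b''
            for buf in iter(lambda: f.read(128), b''):
                d.update(buf)
        return d.hexdigest()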

8 Comments

Yes, that's exactly what I was trying to do. I finally achieved that with a less elegant solution than yours using a generator.
This leaks the file handle on some Python implementations. You should at least call close.
I've added a with statement to close the file properly.
@phihag: is there really a Python implementation where the automatic close actually leaks file handles? I thought it merely delayed the release of these file handles until garbage collection?
@RaymondHettinger: if you don't like it, just revert the change. I considered it too minor a change to discuss. Though I strongly disagree with your reasoning: public code should follow best practices, especially if it is aimed at beginners. If best practices are too hard to follow for such a common task (though I don't think that is the case), then the language should change.
    for buf in f.read(128):
        d.update(buf)

... updates the hash sequentially with each of the first 128 byte values of the file. Since iterating over a bytes object produces int objects, you get the following calls, which cause the error you encountered in Python 3.

    d.update(97)
    d.update(98)
    d.update(99)
    d.update(100)

which is not what you want.
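To see why the Python 2 version ran without raising but still hashed only the first 128 bytes, note that feeding update() single bytes one at a time produces the same digest as feeding the same bytes all at once (this snippet is an editorial addition, not part of the answer):

    import hashlib

    data = b'abcd'
    d = hashlib.md5()
    for b in data:
        d.update(bytes([b]))   # one byte at a time, as the Python 2 loop effectively did
    assert d.hexdigest() == hashlib.md5(data).hexdigest()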

Instead, you want:

    import hashlib

    def md5sum(filename):
        with open(filename, mode='rb') as f:
            d = hashlib.md5()
            while True:
                buf = f.read(4096)  # 128 is smaller than the typical filesystem block
                if not buf:
                    break
                d.update(buf)
        return d.hexdigest()
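As a quick sanity check (an editorial addition, reusing the 'utils.py' example file from the answer above), the chunked digest should match hashing the whole file in a single read:

    import hashlib

    with open('utils.py', 'rb') as f:               # 'utils.py' is just an example file
        one_shot = hashlib.md5(f.read()).hexdigest()

    assert md5sum('utils.py') == one_shot
    print(one_shot)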

7 Comments

This will eat the whole RAM if you open a huge file. That's why we buffer.
@fastreload Already added that ;). Since the original solution didn't even work for files with >128 bytes, I don't think memory is an issue, but I added a buffered read anyway.
Well done then, yet the OP claimed that he could use his code in Python 2.x and it stopped working on 3.x. And I remember I used a 1-byte buffer for calculating the md5 of a 3 GB ISO file for benchmarking, and it did not fail. My bet is that Python 2.7 has a failsafe mechanism so that, whatever the user input is, the minimum buffer size does not go below a certain level. What do you say?
@fastreload The code didn't crash in Python 2 since iterating over a str produced str. The result was still wrong for files larger than 128 bytes. Sure, you can adjust the buffer size as you want (unless you have a fast SSD, the CPU will get bored anyway, and good OSs preload the next bytes of the file). Python 2.7 definitely has no such failsafe mechanism; that would violate the contract of read. The OP simply did not compare the results of the script with the canonical md5sum's, or the results of the script on two files with identical first 128 bytes.
Yes, my original code is indeed broken (but not yet in the wild). I just didn't test it on large files with the same beginning. I should have guessed there was a real problem, as it was running way too fast.

I finally changed my code to the version below (which I find easy to understand) after asking the question. But I will probably change it to the version suggested by Raymond Hettinger using functools.partial.

    import hashlib

    def chunks(filename, chunksize):
        f = open(filename, mode='rb')
        buf = "Let's go"
        while len(buf):
            buf = f.read(chunksize)
            yield buf

    def md5sum(filename):
        d = hashlib.md5()
        for buf in chunks(filename, 128):
            d.update(buf)
        return d.hexdigest()

3 Comments

This will now work even if the file length is not a multiple of chunksize; read will in fact return a shorter buffer in the last read. The termination is given by an empty buffer, which is why there is the "not buf" condition in the example code above (which works).
@Mapio: there is indeed a kind of bug in my code, but not at all where you say. The file length is irrelevant. The code above works provided there is no partial read returning incomplete buffers. If a partial read occurs, it will stop too soon (while still taking the partial buffer into account). A partial read may occur in some cases, say if the program receives a handled interrupt signal while reading, and then continues reading after returning from the interruption.
Well, in the above comment, when speaking of the "code above" I was referring to the old version. The current one is now working (even if it's not the best possible solution).
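For completeness, here is a sketch of a generator variant (an editorial addition, not from the thread) that drops the dummy initial buffer and terminates only when read() returns an empty bytes object, assuming that is the desired behavior:

    import hashlib

    def chunks(filename, chunksize):
        # Sketch: stop only when read() returns b'', and close the file via 'with'.
        with open(filename, mode='rb') as f:
            while True:
                buf = f.read(chunksize)
                if not buf:
                    return
                yield buf

    def md5sum(filename):
        d = hashlib.md5()
        for buf in chunks(filename, 4096):
            d.update(buf)
        return d.hexdigest()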
