Get the MD5 hash of big files in Python

Question

I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine if I opened a file and put its content in the hashlib.md5() function.

The problem is with very big files, whose size could exceed the available RAM.

How can I get the MD5 hash of a file without loading the whole file into memory?

I would rephrase: "How to get the MD5 hash of a file without loading the whole file into memory?"
– XTL, Commented Feb 24, 2012 at 12:29
Starting with 3.11, hashlib gained the file_digest function, which appears to take the hassle of writing chunking boilerplate off your hands: docs.python.org/3.11/library/hashlib.html#hashlib.file_digest
– pseyfert, Commented Nov 8, 2022 at 14:10

Peter Mortensen · Accepted Answer · 2022-10-28 11:25:24Z

You need to read the file in chunks of suitable size:

    import hashlib

    def md5_for_file(f, block_size=2**20):
        md5 = hashlib.md5()
        while True:
            data = f.read(block_size)
            if not data:  # an empty result means end of file
                break
            md5.update(data)
        return md5.digest()

Note: Make sure you open the file in binary mode by passing 'rb' to open(); otherwise you will get the wrong result.
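A minimal usage sketch of the note above, assuming the chunked reader from this answer. The temporary file and its contents are made up purely so the example can run anywhere.

```python
import hashlib
import os
import tempfile


def md5_for_file(f, block_size=2**20):
    """Hash an already-open binary file object in fixed-size chunks."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()


# Hypothetical sample file, created only for this illustration.
sample_bytes = b"line one\r\nline two\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample_bytes)
    sample_path = tmp.name

try:
    # 'rb' hands the raw bytes to MD5. Text mode would decode the data
    # (Python 3: update() then raises TypeError) or translate line endings.
    with open(sample_path, "rb") as f:
        chunked = md5_for_file(f, block_size=4)
finally:
    os.remove(sample_path)

print(chunked.hex())
```

The deliberately tiny block_size of 4 forces several reads, which shows that the result does not depend on how the file is split.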

To do the whole lot in one function, use something like:

    import hashlib
    import os

    def generate_file_md5(rootdir, filename, blocksize=2**20):
        m = hashlib.md5()
        with open(os.path.join(rootdir, filename), "rb") as f:
            while True:
                buf = f.read(blocksize)
                if not buf:  # an empty result means end of file
                    break
                m.update(buf)
        return m.hexdigest()

The update above was based on the comments provided by Frerich Raabe. I tested it and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the jacksum tool.

jacksum -a md5 <filename>

What's important to notice is that the file which is passed to this function must be opened in binary mode, i.e. by passing rb to the open function.
This is a simple addition, but using hexdigest instead of digest will produce a hexadecimal hash that "looks" like most examples of hashes.
Erik, no, why would it be? The goal is to feed all bytes to MD5, until the end of the file. Getting a partial block does not mean all the bytes should not be fed to the checksum.
@user2084795 open always opens a fresh file handle with the position set to the start of the file, (unless you open a file for append).

Yuval Adam · Accepted Answer · 2019-11-22 11:34:10Z

198

Break the file into 8192-byte chunks (or some other multiple of the MD5 block size) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes its input in 64-byte (512-bit) blocks and produces a 128-bit digest (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
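hashlib reports a hash's internal block size and digest size as attributes, so the numbers can be checked directly. The sketch below prints them and also checks that the chunk size does not change the resulting digest. The payload and the helper name md5_in_chunks are made up for illustration.

```python
import hashlib
import io

md5 = hashlib.md5()
print(md5.block_size)   # internal block size, in bytes
print(md5.digest_size)  # digest length, in bytes

payload = bytes(range(256)) * 1000  # made-up 256,000-byte test input


def md5_in_chunks(stream, chunk_size):
    """Feed a binary stream to MD5 chunk_size bytes at a time."""
    h = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


one_shot = hashlib.md5(payload).hexdigest()
results = {
    size: md5_in_chunks(io.BytesIO(payload), size)
    for size in (100, 128, 8192)
}
# Even a chunk size that is not a multiple of the block size gives the same hash.
print(all(digest == one_shot for digest in results.values()))
```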

In Python 3.8+ you can do

    import hashlib

    with open("your_filename.txt", "rb") as f:
        file_hash = hashlib.md5()
        while chunk := f.read(8192):
            file_hash.update(chunk)

    print(file_hash.digest())
    print(file_hash.hexdigest())  # to get a printable str instead of bytes

edited Nov 22, 2019 at 11:34

user3064538

answered Jul 15, 2009 at 12:55

Yuval Adam

6 Comments

jmanning2k Over a year ago

You can just as effectively use a block size of any multiple of 128 (say 8192, 32768, etc.) and that will be much faster than reading 128 bytes at a time.

JustRegisterMe Over a year ago

Thanks jmanning2k for this important note. A test on a 184 MB file takes (0m9.230s, 0m2.547s, 0m2.429s) using (128, 8192, 32768); I will use 8192, as the higher value has no noticeable effect.

user3064538 Over a year ago

If you can, you should use hashlib.blake2b instead of md5. Unlike MD5, BLAKE2 is secure, and it's even faster.

vy32 Over a year ago

@Boris, you can't actually say that BLAKE2 is secure. All you can say is that it hasn't been broken yet.

user3064538 Over a year ago

@vy32 you can't say it's definitely going to be broken either. We'll see in 100 years, but it's at least better than MD5 which is definitely insecure.
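For readers who want to try the BLAKE2 suggestion from the comments, the same chunked pattern should carry over unchanged, since hashlib.blake2b (available since Python 3.6) exposes the same update()/hexdigest() interface. This is only a sketch: the helper name and the in-memory stream are assumptions for illustration.

```python
import hashlib
import io


def blake2b_hexdigest(stream, chunk_size=8192):
    """Hash a binary stream with BLAKE2b, chunk_size bytes at a time."""
    h = hashlib.blake2b()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()


# io.BytesIO stands in for a file opened with open(path, "rb").
digest = blake2b_hexdigest(io.BytesIO(b"example data"))
print(len(digest))  # number of hex characters in the digest
```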

Peter Mortensen · Accepted Answer · 2022-10-28 11:48:33Z

Python 3.7 and below

    import hashlib

    def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
        h = hash_factory()
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''):
                h.update(chunk)
        return h.digest()

Python 3.8 and above

    import hashlib

    def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
        h = hash_factory()
        with open(filename, 'rb') as f:
            while chunk := f.read(chunk_num_blocks*h.block_size):
                h.update(chunk)
        return h.digest()

Original post

If you want a more Pythonic (no while True) way of reading the file, check this code:

    import hashlib

    def checksum_md5(filename):
        md5 = hashlib.md5()
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(8192), b''):
                md5.update(chunk)
        return md5.digest()

Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').

Better still, use something like 128*md5.block_size instead of 8192.
mrkj: I think it's more important to pick your read block size based on your disk and then to ensure that it's a multiple of md5.block_size.
@ThorSummoner: Not really, but from my work finding optimum block sizes for flash memory, I'd suggest just picking a number like 32k or something easily divisible by 4, 8, or 16k. For example, if your block size is 8k, reading 32k will be 4 reads at the correct block size. If it's 16k, then 2. But in each case, we're good because we happen to be reading an integer multiple of the block size.
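Since the best read size depends on the disk and filesystem, measuring is probably more reliable than guessing. The following is a rough timing sketch, not a benchmark: the 8 MiB random test file and the list of candidate sizes are assumptions, and results will vary with caching and hardware.

```python
import hashlib
import os
import tempfile
import time


def md5_file(path, chunk_size):
    """Return the MD5 hex digest of a file, read chunk_size bytes at a time."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Made-up 8 MiB test file; use a real large file for meaningful numbers.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

try:
    timings = {}
    digests = set()
    for chunk_size in (128, 8192, 32768, 2**20):
        start = time.perf_counter()
        digests.add(md5_file(path, chunk_size))
        timings[chunk_size] = time.perf_counter() - start
finally:
    os.remove(path)

for chunk_size, seconds in timings.items():
    print(f"{chunk_size:>8} bytes: {seconds:.4f} s")
```

All chunk sizes must produce the same digest; only the elapsed time should differ.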

Peter Mortensen · Accepted Answer · 2022-10-28 11:26:54Z

Here's my version of Piotr Czapla's method:

    import hashlib

    def md5sum(filename):
        md5 = hashlib.md5()
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
                md5.update(chunk)
        return md5.hexdigest()

Peter Mortensen · Accepted Answer · 2022-10-28 11:27:36Z

31

Drawing on several comments and answers to this question, here is my solution:

    import hashlib

    def md5_for_file(path, block_size=256*128, hr=False):
        '''
        Block size directly depends on the block size of your filesystem
        to avoid performance issues.
        Here I have blocks of 4096 octets (default NTFS).
        '''
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(block_size), b''):
                md5.update(chunk)
        if hr:
            return md5.hexdigest()
        return md5.digest()

This is Pythonic
This is a function
It avoids implicit values: always prefer explicit ones.
It allows (very important) performance optimizations

edited Oct 28, 2022 at 11:27

Peter Mortensen

answered Jul 22, 2013 at 8:14

Bastien Semene

5 Comments

Hawkwing Over a year ago

One suggestion: make your md5 object an optional parameter of the function to allow alternate hashing functions, such as sha256 to easily replace MD5. I'll propose this as an edit, as well.

Hawkwing Over a year ago

Also: digest() is not human-readable. hexdigest() gives a more understandable, commonly recognizable output and makes the hash easier to exchange.

Bastien Semene Over a year ago

Other hash formats are out of the scope of the question, but the suggestion is relevant for a more generic function. I added a "human readable" option according to your 2nd suggestion.

EnemyBagJones Over a year ago

Can you elaborate on how 'hr' is functioning here?

Bastien Semene Over a year ago

@EnemyBagJones 'hr' stands for human readable. It returns a string of 32 hexadecimal digits: docs.python.org/2/library/md5.html#md5.md5.hexdigest
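To make the difference between the two return values concrete, here is a small sketch comparing digest() and hexdigest() on a short made-up input (bytes.hex() requires Python 3.5+):

```python
import hashlib

h = hashlib.md5(b"hello")
raw = h.digest()      # bytes object: 16 raw bytes for MD5
text = h.hexdigest()  # str: the same value as 32 hexadecimal characters

print(len(raw), len(text))
print(text)
print(raw.hex() == text)  # the hex form is just an encoding of the raw bytes
```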

Peter Mortensen · Accepted Answer · 2022-10-28 11:45:02Z

A Python 2/3 portable solution

To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll sum byte values.

To be portable across Python 2.7 and Python 3, you ought to use the io package, like this:

    import hashlib
    import io

    def md5sum(src):
        md5 = hashlib.md5()
        with io.open(src, mode="rb") as fd:
            content = fd.read()
            md5.update(content)
        return md5

If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:

    def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
        md5 = hashlib.md5()
        with io.open(src, mode="rb") as fd:
            for chunk in iter(lambda: fd.read(length), b''):
                md5.update(chunk)
        return md5

The trick here is to use the iter() function with a sentinel (the empty byte string).

The iterator created in this case will call o [the lambda function] with no arguments for each call to its next() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.
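The two-argument form of iter() is easy to try in isolation. In this sketch an in-memory io.BytesIO stands in for a real file, and the 10-byte content is made up:

```python
import io

stream = io.BytesIO(b"abcdefghij")  # made-up 10-byte in-memory "file"

# iter(callable, sentinel) keeps calling the callable until it returns
# a value equal to the sentinel, here the empty byte string at EOF.
chunks = list(iter(lambda: stream.read(4), b""))
print(chunks)  # [b'abcd', b'efgh', b'ij']
```

The last chunk is shorter than the requested size, and the empty read that follows ends the iteration.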

If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the number of bytes processed so far:

    def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
        calculated = 0
        md5 = hashlib.md5()
        with io.open(src, mode="rb") as fd:
            for chunk in iter(lambda: fd.read(length), b''):
                md5.update(chunk)
                calculated += len(chunk)
                callback(calculated)
        return md5
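A possible way to use the callback variant is sketched below. The function is repeated so the example is self-contained. The report callback, the temporary file, and the 4096-byte read length are assumptions for illustration.

```python
import hashlib
import io
import os
import tempfile


def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    """Chunked MD5 that reports the running byte count after each chunk."""
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b""):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5


# Made-up 10,000-byte file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10000)
    path = tmp.name

total = os.path.getsize(path)
progress = []


def report(done):
    """Hypothetical callback: record and print progress as a percentage."""
    progress.append(done)
    print("{:.0f}%".format(100.0 * done / total))


try:
    result = md5sum(path, report, length=4096)
finally:
    os.remove(path)

print(result.hexdigest())
```

Note that md5sum returns the hash object itself, so the caller chooses between digest() and hexdigest().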

Peter Mortensen · Accepted Answer · 2022-10-28 11:35:33Z

A remix of Bastien Semene's code that takes the Hawkwing comment about generic hashing function into consideration...

    import hashlib

    def hash_for_file(path, algorithm='md5', block_size=256*128, human_readable=True):
        """
        Block size directly depends on the block size of your filesystem
        to avoid performance issues.
        Here I have blocks of 4096 octets (default NTFS).

        Linux Ext4 block size:
            sudo tune2fs -l /dev/sda5 | grep -i 'block size'
            > Block size:               4096

        Input:
            path: a path
            algorithm: an algorithm name in hashlib.algorithms_available
                (hashlib.algorithms on Python 2.7), e.g.
                ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
            block_size: a multiple of 128 corresponding to the block size of your filesystem
            human_readable: switch between digest() or hexdigest() output, default hexdigest()
        Output:
            hash
        """
        if algorithm not in hashlib.algorithms_available:
            raise NameError('The algorithm "{algorithm}" you specified is '
                            'not a member of "hashlib.algorithms_available"'.format(algorithm=algorithm))

        # According to the hashlib documentation, using new() is slower
        # than calling a named constructor, e.g. hashlib.md5()
        hash_algo = hashlib.new(algorithm)

        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(block_size), b''):
                hash_algo.update(chunk)
        if human_readable:
            file_hash = hash_algo.hexdigest()
        else:
            file_hash = hash_algo.digest()
        return file_hash

Peter Mortensen · Accepted Answer · 2022-10-28 11:21:47Z

You can't get its md5 without reading the full content. But you can use the update function to read the file's content block by block.

m.update(a); m.update(b) is equivalent to m.update(a+b).
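This equivalence is what makes chunked hashing work, and it can be checked with a short sketch (the two byte strings are made up):

```python
import hashlib

a = b"first part, "
b = b"second part"

# Feed the data in two pieces...
incremental = hashlib.md5()
incremental.update(a)
incremental.update(b)

# ...and in one piece.
combined = hashlib.md5()
combined.update(a + b)

print(incremental.hexdigest() == combined.hexdigest())
```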

Peter Mortensen · Accepted Answer · 2022-10-28 11:47:10Z

I think the following code is more Pythonic:

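    from hashlib import md5

    def get_md5(fname):
        m = md5()
        with open(fname, 'rb') as fp:
            # Iterating a binary file yields "lines" split at b'\n', so the
            # chunk sizes depend on the data (a file without any newline
            # byte is read in one piece).
            for chunk in fp:
                m.update(chunk)
        return m.hexdigest()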

Peter Mortensen · Accepted Answer · 2022-10-28 11:36:57Z

Implementation of Yuval Adam's answer for Django:

    import hashlib
    from django.db import models

    class MyModel(models.Model):
        file = models.FileField()  # Any field based on django.core.files.File

        def get_hash(self):
            hash = hashlib.md5()
            for chunk in self.file.chunks(chunk_size=8192):
                hash.update(chunk)
            return hash.hexdigest()

Peter Mortensen · Accepted Answer · 2022-10-28 11:47:59Z

0

I don't like loops. Based on Nathan Feger's answer:

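    import functools
    import hashlib

    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        # reduce() is used only to drain the iterator; update() returns None
        functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
    md5.hexdigest()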

edited Oct 28, 2022 at 11:47

Peter Mortensen

answered May 8, 2019 at 10:48

Sebastian Wagner

2 Comments

Naltharial Over a year ago

What possible reason is there to replace a simple and clear loop with a functools.reduce aberration containing multiple lambdas? I'm not sure if there's any convention on programming this hasn't broken.

Sebastian Wagner Over a year ago

My main problem was that hashlib's API doesn't really play well with the rest of Python. For example, take shutil.copyfileobj, which only narrowly fails to work. My next idea was fold (a.k.a. reduce), which folds an iterable into a single object, such as a hash. hashlib doesn't provide operators, which makes this a bit cumbersome. Nevertheless, we're folding an iterable here.
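One conceivable way to bridge the gap mentioned in this comment is a tiny adapter that gives a hash object a write() method, so that shutil.copyfileobj can drive the chunked reads. This is only a sketch: the HashWriter class is hypothetical and not part of hashlib or shutil, and the in-memory stream stands in for a file opened in binary mode.

```python
import hashlib
import io
import shutil


class HashWriter:
    """Hypothetical adapter: exposes write() so a hash can act as a copy target."""

    def __init__(self, hash_obj):
        self.hash_obj = hash_obj

    def write(self, data):
        self.hash_obj.update(data)
        return len(data)


content = b"some file content" * 1000
source = io.BytesIO(content)  # stand-in for open(filename, "rb")

md5 = hashlib.md5()
# copyfileobj reads the source in 8192-byte pieces and passes each to write().
shutil.copyfileobj(source, HashWriter(md5), 8192)
print(md5.hexdigest())
```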

simon · Accepted Answer · 2024-04-11 08:13:59Z

As mentioned in @pseyfert's comment, hashlib.file_digest() can be used in Python 3.11 and above. While not explicitly documented, internally the function uses a chunking approach similar to the one in the accepted answer, as can be seen from its source code (lines 230–236).

The function also provides a keyword-only argument _bufsize with a default value of 2^18 = 262,144 bytes that controls the buffer size for chunking; however, given its leading underscore and missing documentation, it should probably rather be considered an implementation detail.

In any case, the following code equivalently reproduces the accepted answer in Python 3.11+ (apart from the different chunk size):

    import hashlib

    with open("your_filename.txt", "rb") as f:
        file_hash = hashlib.file_digest(f, "md5")  # or `hashlib.md5` as 2nd arg

    print(file_hash.digest())
    print(file_hash.hexdigest())  # to get a printable str instead of bytes
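If the same code has to run on interpreters older than 3.11 as well, one option is to check for the function and fall back to manual chunking. This is a sketch under that assumption: the helper name file_md5_hex and its chunk_size default are made up, and an in-memory stream stands in for a file opened in binary mode.

```python
import hashlib
import io


def file_md5_hex(fileobj, chunk_size=2**18):
    """Use hashlib.file_digest when available, else chunk manually."""
    if hasattr(hashlib, "file_digest"):  # Python 3.11+
        return hashlib.file_digest(fileobj, "md5").hexdigest()
    h = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


data = b"example payload" * 10000
print(file_md5_hex(io.BytesIO(data)))
```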

Peter Mortensen · Accepted Answer · 2022-10-28 11:30:55Z

I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs in MySQL, so I experimented with various file sizes and the straightforward Python approach, viz:

FileHash = hashlib.md5(FileData).hexdigest()

I couldn’t detect any noticeable performance difference with a range of file sizes 2 KB to 20 MB and therefore no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem was nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.

This is not answering the question and 20 MB is hardly considered a very big file that may not fit into RAM as discussed here.

WhiZTiM · Accepted Answer · 2016-08-27 20:49:26Z

-5

    import hashlib

    opened = open('/home/parrot/pass.txt', 'r')
    opened = opened.readlines()
    # Hashes each stripped line separately, not the file as a whole
    for i in opened:
        strip1 = i.strip('\n')
        hash_object = hashlib.md5(strip1.encode())
        hash2 = hash_object.hexdigest()
        print(hash2)

edited Aug 27, 2016 at 20:49

WhiZTiM

answered Jul 17, 2016 at 21:37

mhmad msarwe

2 Comments

Farside Over a year ago

please, format the code in the answer, and read this section before giving answers: stackoverflow.com/help/how-to-answer

Steve Barnes Over a year ago

This will not work correctly as it is reading the file in text mode line by line then messing with it and printing the md5 of each stripped, encoded, line!
