
I'm trying to calculate the SHA-1 value of a file.

I've written this script:

    import hashlib

    def hashfile(filepath):
        sha1 = hashlib.sha1()
        f = open(filepath, 'rb')
        try:
            sha1.update(f.read())
        finally:
            f.close()
        return sha1.hexdigest()

For a specific file I get this hash value:
8c3e109ff260f7b11087974ef7bcdbdc69a0a3b9
But when I calculate the value with git hash-object, I get this value: d339346ca154f6ed9e92205c3c5c38112e761eb7

How come they differ? Am I doing something wrong, or can I just ignore the difference?

4 Comments

  • You can't really ignore the difference if you plan to use the hashes together. Commented Dec 8, 2009 at 21:16
  • Forgot to mention, just used git as a reference, not going to use them together. Commented Dec 8, 2009 at 21:24
  • 1
    If the file could be quite large, you can process it a block at a time so you don't need the whole thing in RAM at once: stackoverflow.com/questions/7829499/… Commented Aug 1, 2013 at 18:24
  • Possible duplicate of How does git compute file hashes? Minimal examples ;-) Commented May 17, 2016 at 10:14

2 Answers


git calculates hashes like this:

sha1("blob " + filesize + "\0" + data) 

Reference
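To illustrate, here is a small Python sketch (the function name is mine, not git's) that reproduces what git hash-object does for a blob:

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    # git prepends the header "blob <size in bytes>\0" before hashing
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# A file containing "hello\n" hashes to the same value git reports:
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

For a file on disk, pass open(filepath, 'rb').read() as data; the header is why the plain SHA-1 of the file contents never matches git's value.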


3 Comments

No prob, the referenced link is quite different; I just happened to find it by luck.
It should be mentioned that git does this to avoid length extension attacks.
A full implementation of this suggestion can be found in the question Assigning Git SHA1's without Git.

For reference, here's a more concise version:

    import hashlib

    def sha1OfFile(filepath):
        with open(filepath, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

On second thought: although I've never seen it happen, there's potential for f.read() to return less than the full file, or, for a many-gigabyte file, for f.read() to run out of memory. A first fix is:

    import hashlib

    def sha1OfFile(filepath):
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            for line in f:
                sha.update(line)
        return sha.hexdigest()

However, there's no guarantee that '\n' appears in the file at all, so the fact that the for loop will give us blocks of the file that end in '\n' could give us the same problem we had originally. Sadly, I don't see any similarly Pythonic way to iterate over blocks of the file as large as possible, which, I think, means we are stuck with a while True: ... break loop and with a magic number for the block size:

    import hashlib

    def sha1OfFile(filepath):
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            while True:
                block = f.read(2**20)  # Magic number: one-megabyte blocks.
                if not block:
                    break
                sha.update(block)
        return sha.hexdigest()

Of course, who's to say we can store one-megabyte strings. We probably can, but what if we are on a tiny embedded computer?

I wish I could think of a cleaner way that is guaranteed to not run out of memory on enormous files and that doesn't have magic numbers and that performs as well as the original simple Pythonic solution.

1 Comment

On second thought, this may have issues if f.read() can't return the whole file (e.g., in the case of multi-gigabyte files) and so it should probably iterate over chunks.
