
I'm using the following code to compute the checksum of a file, which works fine. But when I generate a hash for a large file, say 2 GB, it is quite slow. How can I improve the performance of this code?

    fs = new FileStream(txtFile.Text, FileMode.Open);
    formatted = string.Empty;

    using (SHA1Managed sha1 = new SHA1Managed())
    {
        byte[] hash = sha1.ComputeHash(fs);

        foreach (byte b in hash)
        {
            formatted += b.ToString("X2");
        }
    }

    fs.Close();

Update:

System:

OS: Win 7 64-bit, CPU: i5 750, RAM: 4 GB, HDD: 7200 rpm

Tests:

Test1 = 59.895 seconds

Test2 = 59.94 seconds

  • +1 just for trying to improve the performance of the heaviest bit, and not caring that formatted is built in a relatively inefficient way :) Commented Oct 1, 2010 at 9:23
  • :) Should probably change that to a StringBuilder? Commented Oct 1, 2010 at 9:51
  • Ah now, you're talking yourself out of that +1! What can be worthwhile, though, if you produce such hex strings often enough, is to have a method that does this (a good case for an extension method). With it then potentially used somewhere where the performance makes more of a real difference, it'd be more worthwhile moving to the StringBuilder (created at the appropriate capacity) or fixed-size char array approaches. Commented Oct 1, 2010 at 10:33
  • In the middle of the 60 seconds, hit the pause button. Is it in ComputeHash or in string += ? Do that a few times. That will tell which part to worry about. If it's in ComputeHash, is it in getting a buffer from the file, or is it in munching through a buffer? If mostly the former, it's I/O bound. If mostly the latter, it's CPU bound. Commented Oct 1, 2010 at 11:27
  • Maybe you can optimize in the direction that sometimes you don't need to hash at all. E.g. if you are interested in "are these 2 files different?", you can first compare file sizes. But that goes beyond the strict question you asked; we'd need more context for that. Commented Oct 1, 2010 at 13:39

6 Answers

3

The first question is what you need this checksum for. If you don't need the cryptographic properties, then a non-cryptographic hash, or a less cryptographically secure hash (MD5 being "broken" doesn't prevent it from being a good hash, nor from being strong enough for some uses), is likely to be more performant. You could also make your own hash by reading only a subset of the data (I'd advise working in 4096-byte chunks of the underlying file, as that matches the buffer size used by SHA1Managed and allows faster chunk reads than sampling, say, every X bytes for some value of X); a sketch of this follows below.
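A rough sketch of that sampling idea (SampledHash and its stride parameter are illustrative names, not an established format; the result is only comparable against hashes computed the same way):

    using System.IO;
    using System.Security.Cryptography;

    // Hypothetical sampling checksum: hash one 4096-byte chunk out of
    // every 'stride' chunks instead of the whole file.
    static byte[] SampledHash(string path, int stride = 64)
    {
        const int chunk = 4096;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var md5 = MD5.Create())
        {
            var buffer = new byte[chunk];
            long pos = 0;
            while (pos < fs.Length)
            {
                fs.Position = pos;
                int read = fs.Read(buffer, 0, chunk);
                if (read <= 0) break;
                md5.TransformBlock(buffer, 0, read, null, 0);
                pos += (long)chunk * stride; // skip ahead to the next sample
            }
            md5.TransformFinalBlock(buffer, 0, 0); // finalize the digest
            return md5.Hash;
        }
    }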

Edit: An upvote reminding me of this answer has also reminded me that I have since written SpookilySharp, which provides high-performance 32-, 64- and 128-bit hashes that are not cryptographic, but good for providing checksums against errors, storage, etc. (This in turn has reminded me that I should update it to support .NET Core.)

Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.

I would experiment with different buffer sizes, as increasing the size of the FileStream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (4096 is the default, incidentally), as SHA1Managed will ask for 4096-byte chunks at a time; that way FileStream will never return less than the amount asked for (allowed but sometimes suboptimal) and never do more than one copy at a time.
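A minimal sketch of that (the 64 KB figure is just an example multiple of 4096 to tune by measurement, and the StringBuilder replaces the string concatenation from the question):

    using System.IO;
    using System.Security.Cryptography;
    using System.Text;

    static string Sha1Hex(string path)
    {
        const int bufferSize = 16 * 4096; // a whole multiple of 4096
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize))
        using (var sha1 = new SHA1Managed())
        {
            byte[] hash = sha1.ComputeHash(fs);
            var sb = new StringBuilder(hash.Length * 2); // pre-sized for the hex digits
            foreach (byte b in hash)
                sb.Append(b.ToString("X2"));
            return sb.ToString();
        }
    }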


2 Comments

+1 for the first sequence. Sometimes we are solving the wrong problem altogether.
Thanks. Decided to go with MD5, as I was only checking the integrity of the files after transmission and didn't require the extra security of SHA-1. Just out of curiosity: I found Intel's new implementation of SHA-1 using SSE3 instructions (software.intel.com/en-us/articles/…) - just wondering if and how this can be used in managed code?
1

Well, is it IO-bound or CPU-bound? If it's CPU-bound, there's not a lot we can do about that.

It's possible that opening the FileStream with different parameters would allow the file system to do more buffering or assume that you're going to read the file sequentially - but I doubt that will help very much. (It's certainly not going to do a lot if it's CPU-bound.)
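For instance, the FileOptions.SequentialScan hint (one guess at such a parameter; measure rather than assume it helps):

    using System.IO;
    using System.Security.Cryptography;

    // SequentialScan tells the OS the file will be read once, front to
    // back, which can make its read-ahead more aggressive.
    using (var fs = new FileStream(txtFile.Text, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 65536, FileOptions.SequentialScan))
    using (var sha1 = new SHA1Managed())
    {
        byte[] hash = sha1.ComputeHash(fs);
    }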

How slow is "quite slow" anyway? Compared with, say, copying the file?

If you have a lot of memory (e.g. 4GB or more) how long does it take to hash the file a second time, when it may be in the file system cache?
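Something like this would show both numbers (a plain Stopwatch sketch; the path is a placeholder):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Security.Cryptography;

    class HashTiming
    {
        static void Main()
        {
            string path = @"C:\big.bin"; // placeholder path
            for (int run = 1; run <= 2; run++)
            {
                var sw = Stopwatch.StartNew();
                using (var fs = File.OpenRead(path))
                using (var sha1 = new SHA1Managed())
                {
                    sha1.ComputeHash(fs);
                }
                sw.Stop();
                // The second run will read from the file-system cache if
                // the machine has enough free RAM to hold the file.
                Console.WriteLine("Run {0}: {1:F3} s", run, sw.Elapsed.TotalSeconds);
            }
        }
    }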

4 Comments

I've run some speed tests. Check my update. Also CPU usage only peaks at about 30%.
@Bruce: 30% in total? Out of how many cores? If it's a multi-core CPU but a single-threaded hashing algorithm, it could still be CPU-bound. Look at Task Manager's performance tab to see whether one CPU is pegged for the entire time :)
No, all 4 cores average at about 5 - 6%. 2 cores doing a bit of work but nowhere near pegged. Definitely IO-bound right?
@Bruce: Sounds that way, yes.
1

First of all, have you measured "quite slow"? From this site, SHA-1 has about half the speed of MD5 with about 100 MB/s (depending on the CPU), so 2 GB would take about 20 seconds to hash. Also, note that if you're using a slow HDD, this might be your real bottleneck as 30-70 MB/s aren't unusual.

To speed things up, you might not hash the whole file, but only the first X KB, or representative parts of it (the parts that will most likely differ). If your files aren't too similar, this shouldn't cause collisions.
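A rough sketch of the prefix variant (HashPrefix is an illustrative helper name; the 1 MB length is arbitrary, and MD5 is used since the cryptographic strength of SHA-1 isn't needed here):

    using System.IO;
    using System.Security.Cryptography;

    static byte[] HashPrefix(string path, int prefixBytes = 1024 * 1024)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var md5 = MD5.Create())
        {
            // Read up to prefixBytes from the start of the file; shorter
            // files simply contribute fewer bytes.
            var buffer = new byte[prefixBytes];
            int read = fs.Read(buffer, 0, buffer.Length);
            return md5.ComputeHash(buffer, 0, read);
        }
    }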


1

First: SHA-1 file hashing should be I/O-bound on non-ancient CPUs - and an i5 certainly doesn't qualify as ancient. Of course it depends on the implementation of SHA-1, but I doubt SHA1Managed is über-slow.

Next, 60 seconds for 2 GB of data is ~34 MB/s - that's slow for hard-disk reads; even a 2.5" laptop disk can read faster than that. Assuming the hard drive is internal (no USB2/whatever or network bottleneck) and there isn't a lot of other disk I/O activity, I'd be surprised to see less than 60 MB/s read from a modern drive.

My guess would be that ComputeHash() uses a small buffer internally. Try reading/hashing manually, so you can specify a larger buffer (64 KB or even larger) to increase throughput - a sketch follows below. You could also move to async processing so disk reads and hash computation can overlap.
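A sketch of that manual loop using TransformBlock (the 1 MB buffer is an illustrative size; the async double-buffered variant is omitted for brevity):

    using System.IO;
    using System.Security.Cryptography;

    static byte[] Sha1Chunked(string path)
    {
        const int bufferSize = 1 << 20; // 1 MB reads
        using (var sha1 = new SHA1Managed())
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize))
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed each chunk to the hash as it arrives instead of
                // letting ComputeHash pull small buffers internally.
                sha1.TransformBlock(buffer, 0, read, null, 0);
            }
            sha1.TransformFinalBlock(buffer, 0, 0); // finish the hash
            return sha1.Hash;
        }
    }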


0

Neither is SHA1Managed the best choice for large input strings, nor is Byte.ToString("X2") the fastest way to convert the byte array to a string.

I just finished an article with detailed benchmarks on that topic. It compares SHA1Managed, SHA1CryptoServiceProvider and SHA1Cng, and also considers SHA1.Create(), on input strings of different lengths.

In the second part, it shows 5 different methods of converting the byte array to a string, of which Byte.ToString("X2") is the worst.
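One commonly used alternative, for comparison (whether it beats the other methods is for the linked benchmarks to say):

    using System;

    // BitConverter yields "AB-CD-EF"; stripping the dashes is still
    // typically much cheaper than per-byte ToString("X2") concatenation.
    static string ToHex(byte[] hash)
    {
        return BitConverter.ToString(hash).Replace("-", "");
    }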

My largest input was only 10,000 characters, so you may want to run my benchmarks on your 2 GB file. It would be quite interesting to see if and how that changes the numbers.

http://wintermute79.wordpress.com/2014/10/10/c-sha-1-benchmark/

However, for file integrity checks you are better off using MD5 as you already wrote.


-3

You can use this logic to get the SHA-1 value. I was using it in Java.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class Sha1Calculate {

        public static void main(String[] args) {
            File file = new File("D:\\Android Links.txt");

            try (FileInputStream input = new FileInputStream(file)) {
                MessageDigest digest = MessageDigest.getInstance("SHA-1");
                byte[] buffer = new byte[65536];
                int read;

                // Feed the file to the digest in chunks instead of buffering
                // the whole file in memory first (important for 2 GB inputs).
                while ((read = input.read(buffer)) > 0) {
                    digest.update(buffer, 0, read);
                }

                byte[] bytes = digest.digest();
                StringBuilder sb = new StringBuilder();
                for (byte b : bytes) {
                    sb.append(String.format("%02X", b));
                }
                System.out.println("Digest (in hex format): " + sb.toString());
            } catch (IOException | NoSuchAlgorithmException e) {
                e.printStackTrace();
            }
        }
    }

