How to find the position of Central Directory in a Zip file?

Question

I am trying to find the position of the first Central Directory file header in a Zip file.

I'm reading these: http://en.wikipedia.org/wiki/Zip_(file_format) http://www.pkware.com/documents/casestudies/APPNOTE.TXT

As I see it, I can only scan through the Zip data, identify by the header what kind of section I am at, and then do that until I hit the Central Directory header. I would obviously read the File Headers before that and use the "compressed size" to skip the actual data, and not for-loop through every byte in the file...

If I do it like that, then I practically already know all the files and folders inside the Zip file in which case I don't see much use for the Central Directory anymore.

To my understanding the purpose of Central Directory is to list file metadata, and the position of the actual data in the Zip file so you wouldn't need to scan the whole file?
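To make that concrete, here is a minimal sketch of walking the Central Directory once its offset and entry count are known (both come from the End of Central Directory record). The function name `list_central_directory` is made up for illustration, the whole archive is assumed to be in memory, ZIP64 is ignored, and names are decoded as CP437 without looking at the UTF-8 flag.

```python
import struct

CD_SIG = b"PK\x01\x02"  # central directory file header signature (0x02014b50)

def list_central_directory(data: bytes, cd_offset: int, entry_count: int):
    """Return (name, compressed_size, local_header_offset) for each entry.

    Hypothetical helper: assumes no ZIP64 and a single-disk archive.
    """
    entries = []
    pos = cd_offset
    for _ in range(entry_count):
        if data[pos:pos + 4] != CD_SIG:
            raise ValueError(f"no central directory header at offset {pos}")
        # Bytes 20..33 of the fixed 46-byte header: compressed size,
        # uncompressed size, name length, extra length, comment length.
        comp_size, _uncomp, name_len, extra_len, comment_len = struct.unpack_from(
            "<IIHHH", data, pos + 20
        )
        # Bytes 42..45: offset of the matching local file header.
        (local_offset,) = struct.unpack_from("<I", data, pos + 42)
        name = data[pos + 46:pos + 46 + name_len].decode("cp437", errors="replace")
        entries.append((name, comp_size, local_offset))
        # Skip the variable-length tail to reach the next header.
        pos += 46 + name_len + extra_len + comment_len
    return entries
```

Each tuple carries the local header offset, so a reader can seek straight to one file's data without touching the others.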

After reading about End Of Central Directory record, Wikipedia says:

This ordering allows a zip file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.

How would I find End of Central Directory record easily? We need to remember that it can have an arbitrary sized comment there, so I may not know how many bytes from the end of the data stream it is located at. Do I just scan it?

P.S. I'm writing a Zip file reader.

Can't you start scanning backwards from the end (ZIP directory is located at the end of file)?
– Eugene Mayevski 'Callback, Commented Dec 21, 2011 at 18:13
Yes I can, but is this really the way you are supposed to do it? Scanning backwards to find the End of Central Directory is a possibility, but the comment length is a 16-bit field, so there can be about 65k of comment bytes to read and scan through, and if the comment contains the magic number the scan will fail.
– Tower, Commented Dec 21, 2011 at 18:51
I ended up doing it that way. The 64k limit and the fact that no one is likely to put such bytes in a comment do not mean that it's okay to do it this way.
– Tower, Commented Dec 22, 2011 at 18:22
Fun fact: Windows Explorer will not open zip files if they contain the end of directory signature in the zip file comment. WinRAR and 7z do not have this problem.
– namey, Commented Mar 21, 2015 at 20:05

Derek E · Accepted Answer · 2013-01-09 15:53:25Z

Start at the end and scan towards the beginning, looking for the end of central directory signature (0x06054b50) and counting the number of bytes you have scanned. When you find a candidate, read the 2-byte comment length (L) at byte offset 20 of the record. Check that L + 22 equals the number of bytes from the start of the candidate signature to the end of the file (the fixed part of the record is 22 bytes). Then check that the start of the central directory (pointed to by the 4-byte field at byte offset 16) has an appropriate signature.
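A rough sketch of that backward scan, assuming the whole file is in memory, no ZIP64 records, and no data prepended to the archive (so the stored offset is an absolute file offset). `find_eocd` and the constant names are mine, not from any library.

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # 0x06054b50 as stored on disk (little-endian)
CD_SIG = b"PK\x01\x02"    # central directory file header, 0x02014b50
EOCD_MIN = 22             # fixed part of the record, without the comment

def find_eocd(data: bytes) -> int:
    """Return the offset of the End of Central Directory record, or -1."""
    end = len(data)
    # The comment is at most 65535 bytes, so the record cannot start earlier.
    lowest = max(0, end - EOCD_MIN - 0xFFFF)
    for pos in range(end - EOCD_MIN, lowest - 1, -1):
        if data[pos:pos + 4] != EOCD_SIG:
            continue
        # Offsets 10..21: total entries, directory size, directory offset,
        # comment length.
        entries, cd_size, cd_offset, comment_len = struct.unpack_from(
            "<HIIH", data, pos + 10
        )
        # The comment must run exactly to the end of the file.
        if pos + EOCD_MIN + comment_len != end:
            continue
        # An empty archive has no central directory header to look at.
        if entries == 0 and cd_size == 0:
            return pos
        if data[cd_offset:cd_offset + 4] == CD_SIG:
            return pos
    return -1
```

A signature that appears inside the comment is rejected by the length check, because the two bytes at its offset 20 will almost never add up to the real end of file.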

If you assume the bytes are essentially random whenever a signature check lands on a wrong guess (e.g. a candidate inside a data segment), the probability of all the signature bits matching is low. You could refine this by working out the chance of landing in a data segment and the chance of hitting a legitimate header (as a function of the number of such headers), but it already sounds like a low likelihood to me. You can increase your confidence further by also checking the signature of the first file record listed, but be sure to handle the boundary case of an empty zip file.
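The extra confidence check could look something like this. It is only a sketch: `first_records_look_valid` is a hypothetical name, the candidate record offset `eocd_pos` is assumed to have been found already, and ZIP64 and prepended data are not handled.

```python
import struct

CD_SIG = b"PK\x01\x02"     # central directory file header
LOCAL_SIG = b"PK\x03\x04"  # local file header

def first_records_look_valid(data: bytes, eocd_pos: int) -> bool:
    """Follow a candidate EOCD record to the first file and check signatures."""
    if eocd_pos < 0 or eocd_pos + 22 > len(data):
        return False
    total_entries, _cd_size, cd_offset = struct.unpack_from(
        "<HII", data, eocd_pos + 10
    )
    if total_entries == 0:
        # Empty archive: there is no header to follow, so nothing to check.
        return True
    if cd_offset + 46 > len(data) or data[cd_offset:cd_offset + 4] != CD_SIG:
        return False
    # Offset 42 of the first central directory header points at its
    # local file header.
    (local_offset,) = struct.unpack_from("<I", data, cd_offset + 42)
    return data[local_offset:local_offset + 4] == LOCAL_SIG
```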

It should also be mentioned that it's best to start at the endOfFile - 22 position, since the real end of central directory signature cannot occur after this position. For archives with empty comments, this will find the signature on the first iteration.
I checked at endOfFile - 22; if that fails, I try endOfFile - 64k - 22 and loop until endOfFile - 22, applying this heuristic check any time I see the signature. Code here for the curious: github.com/paulsapps/msgi/blob/…
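For illustration, that two-step order might be sketched as below on a seekable file object, so only the tail of the file is read. This is not the linked code. `locate_eocd` is a made-up name, and only the comment-length check is applied here; the central directory signature check is left out for brevity.

```python
import io
import struct

EOCD_SIG = b"PK\x05\x06"
EOCD_MIN = 22         # record size when the comment is empty
MAX_COMMENT = 0xFFFF  # largest value of the 16-bit comment length

def locate_eocd(f) -> int:
    """Return the absolute offset of the EOCD record in f, or -1."""
    size = f.seek(0, io.SEEK_END)
    if size < EOCD_MIN:
        return -1
    # Fast path: no comment, so the record is the last 22 bytes.
    f.seek(size - EOCD_MIN)
    tail = f.read(EOCD_MIN)
    if tail[:4] == EOCD_SIG and tail[20:22] == b"\x00\x00":
        return size - EOCD_MIN
    # Slow path: read the largest window that can hold record plus comment
    # and walk forward through every signature match.
    start = max(0, size - EOCD_MIN - MAX_COMMENT)
    f.seek(start)
    window = f.read(size - start)
    idx = window.find(EOCD_SIG)
    while idx != -1 and idx + EOCD_MIN <= len(window):
        (comment_len,) = struct.unpack_from("<H", window, idx + 20)
        if idx + EOCD_MIN + comment_len == len(window):
            return start + idx
        idx = window.find(EOCD_SIG, idx + 1)
    return -1
```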

Tower · Accepted Answer · 2011-12-22 18:23:37Z

I ended up looping through the bytes starting from the end. The loop stops when it finds a matching byte sequence, when the index drops below zero, or when it has already gone through 64k bytes.
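Roughly, that loop could look like the sketch below. The function name and the `limit` parameter are invented for the example, and there is no validation of the match, so a signature inside the comment would be returned as if it were the real record.

```python
EOCD_SIG = b"PK\x05\x06"  # end of central directory signature, 0x06054b50

def scan_back_for_signature(data: bytes, limit: int = 64 * 1024) -> int:
    """Return the offset of the last EOCD signature near the end, or -1."""
    index = len(data) - len(EOCD_SIG)
    scanned = 0
    # Stop on a match, when the index drops below zero,
    # or after `limit` positions have been tried.
    while index >= 0 and scanned < limit:
        if data[index:index + 4] == EOCD_SIG:
            return index
        index -= 1
        scanned += 1
    return -1
```

Strictly speaking, a record with a maximum-length comment starts 65535 + 22 bytes before the end of the file, so passing `limit=0xFFFF + 22` covers that case as well.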

Andi Giga Over a year ago

Did you find a solution? What does the Central Directory look like? I have a base64-encoded file.

user2624417 · Accepted Answer · 2014-01-11 17:57:00Z

Just cross your fingers and hope that there isn't an entry with the CRC, timestamp or datestamp as 06054B50, or any other sequence of four bytes that happen to be 06054B50.

Ben Reser Over a year ago

I really don't think this added anything terribly constructive to this question. Would have been better added as just a comment.
