I am running Python 3.6.4 on Windows 10 with Fall Creators update. I am attempting to decompress a Wikimedia data dump file, specifically https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-meta-current.xml.bz2.

This file decompresses without problems using 7z on the command line but fails on the first block of data with zero length output from the Python decompressor. The code follows:

    import bz2

    def decompression(qin,    # Iterable supplying input bytes data
                      qout):  # Pipe to next process - needs bytes data
        decomp = bz2.BZ2Decompressor()     # Create a decompressor
        for chunk in qin:                  # Loop obtaining data from source iterable
            lc = len(chunk)                # = 16384
            dc = decomp.decompress(chunk)  # Do the decompression
            ldc = len(dc)                  # = 0
            qout.put(dc)                   # Pass the decompressed chunk to the next process

I have verified that the bz2 header is valid and since the file decompresses without problems using command line utilities, the problem seems to be related to the Python implementation of BZ2. The following values from the decompressor seem OK and match what you would expect given the documentation.

    eof = False
    unused_data = b''
    needs_input = True

Any suggestions on how to troubleshoot this problem?

  • Are you using the "rb" mode when reading the file? On Windows machines, binary files get corrupted by end-of-line conversions, unless some special effort is made to prevent that from happening. Commented Mar 30, 2018 at 19:31
  • @MarkAdler Yes, I am. What should I be looking out for? Commented Mar 31, 2018 at 1:30
  • Otherwise, there is no problem with your code. I tried it with the linked .bz2 file and it worked fine. You might want to check your iterable. Commented Mar 31, 2018 at 2:54
  • @MarkAdler I have tried it both getting data directly from the Internet and reading from a local file. In both cases it fails on the first call to decompress. It does not matter whether the chunk of data passed is large (64K) or small (265 bytes); it fails either way. I tried upgrading Python to 3.6.5 and got the same result. Commented Mar 31, 2018 at 3:47

1 Answer

Beats me. I can't find anything wrong with your function. The following works on the linked .bz2 file with no issue, where the output exactly matches the result of a command-line decompression of that .bz2 file:

    import sys
    import bz2

    def decompression(qin,    # Iterable supplying input bytes data
                      qout):  # Pipe to next process - needs bytes data
        decomp = bz2.BZ2Decompressor()     # Create a decompressor
        for chunk in qin:                  # Loop obtaining data from source iterable
            lc = len(chunk)                # = 16384
            dc = decomp.decompress(chunk)  # Do the decompression
            # qout.put(dc)                 # Pass the decompressed chunk to the next process
            qout.write(dc)

    with open('enwiktionary-latest-pages-meta-current.xml.bz2', 'rb') as f:
        it = iter(lambda: f.read(16384), b'')
        decompression(it, sys.stdout.buffer)

I only made one trivial change to your function in order to write the result to stdout. I am using Python 3.6.4. I also tried it with Python 2.7.10 (removing the .buffer), and it again worked flawlessly.

Are you actually just letting your function run? What do you mean by "fails on the first block"? The first few calls (seven in this case) will in fact return no decompressed data, because you have not yet provided a complete block for it to work on. But there are no errors reported.
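To see this behavior without downloading the dump, here is a minimal sketch that compresses some random bytes in memory and feeds them back in 16384-byte chunks. The helper name `output_sizes` and the 100000-byte random payload are assumptions made for illustration only; they are not part of the question's code.

```python
import bz2
import os

def output_sizes(compressed, chunk_size):
    """Feed `compressed` to a BZ2Decompressor in fixed-size chunks and
    record how many bytes each decompress() call returns."""
    decomp = bz2.BZ2Decompressor()
    sizes = []
    for start in range(0, len(compressed), chunk_size):
        out = decomp.decompress(compressed[start:start + chunk_size])
        sizes.append(len(out))
    return sizes, decomp.eof

# Random bytes barely compress, so the single bzip2 block spans several chunks.
payload = os.urandom(100000)
compressed = bz2.compress(payload)

sizes, eof = output_sizes(compressed, 16384)
print(sizes)  # the early entries are 0; output appears once the block is complete
print(eof)
```

A zero-length return value is therefore not an error: it only means the decompressor is still waiting for the rest of the current block, which is consistent with `needs_input` being `True` and `eof` being `False`.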

Note: to do this right for .bz2 files that contain concatenated bzip2 streams, you would need to loop on eof true, creating a new decompressor object and feeding in the unused_data from the previous decompressor object, followed by more data read from the compressed file. The linked file isn't one of those.
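The loop described in the note could look roughly like the following sketch. The generator name `decompress_multistream` is hypothetical, and the sketch assumes the input contains nothing but bzip2 streams; trailing non-bzip2 bytes would make the fresh decompressor raise an error.

```python
import bz2

def decompress_multistream(chunks):
    """Yield decompressed data from an iterable of bytes chunks that may
    contain several concatenated bzip2 streams."""
    decomp = bz2.BZ2Decompressor()
    for chunk in chunks:
        data = chunk
        while data:
            out = decomp.decompress(data)
            if out:
                yield out
            if decomp.eof:
                # This stream has ended: keep the leftover bytes and
                # hand them to a fresh decompressor for the next stream.
                data = decomp.unused_data
                decomp = bz2.BZ2Decompressor()
            else:
                # All of `data` was consumed; fetch the next chunk.
                data = b''

# Two streams concatenated, fed in deliberately tiny chunks.
blob = bz2.compress(b'hello ') + bz2.compress(b'world')
pieces = [blob[i:i + 7] for i in range(0, len(blob), 7)]
result = b''.join(decompress_multistream(pieces))
print(result)
```

For whole files, `bz2.open()` handles concatenated streams itself, so a manual loop like this is mainly useful when the data arrives as chunks from a pipe or a network connection.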

5 Comments

I mean that the output from the first call to decompress is a zero length byte string which does not make any sense to me unless (I just thought of this) it represents the 4 byte bz2 header.
I tried letting it run but every call to decompress resulted in a zero length byte string.
You were right. I simply was not expecting the behavior that I saw as I didn't know the buffer size needed by the decompressor. I took the opportunity to bump my buffer size to 128K and it ran fine.
It also works fine with a 16K buffer. And not every call returns zero output. Only the first seven calls return zero output. The eighth call returns a bunch.
What is this lc for?
