I am running Python 3.6.4 on Windows 10 with the Fall Creators Update. I am attempting to decompress a Wikimedia data dump file, specifically https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-meta-current.xml.bz2.
This file decompresses without problems using 7z on the command line, but with the Python decompressor it fails on the first block of data: the output has zero length. The code follows:
    import bz2

    def decompression(qin,    # Iterable supplying input bytes data
                      qout):  # Pipe to next process - needs bytes data
        decomp = bz2.BZ2Decompressor()     # Create a decompressor
        for chunk in qin:                  # Loop obtaining data from source iterable
            lc = len(chunk)                # = 16384
            dc = decomp.decompress(chunk)  # Do the decompression
            ldc = len(dc)                  # = 0
            qout.put(dc)                   # Pass the decompressed chunk to the next process

I have verified that the bz2 header is valid, and since the file decompresses without problems using command-line utilities, the problem seems to be related to the Python implementation of BZ2. The following values from the decompressor seem OK and match what you would expect given the documentation.
    eof = False
    unused_data = b''
    needs_input = True

Any suggestions on how to troubleshoot this problem?
"rb"mode when reading the file? On Windows machines, binary files get corrupted by end-of-line conversions, unless some special effort is made to prevent that from happening.