Zlib compress in python

Question

Why is the compressed string larger than the original? Doesn't zlib need to compress?

Example:

    import zlib
    import sys

    str1 = "abcdefghijklmnopqrstuvwxyz"
    print "size1: ", sys.getsizeof(str1)
    print "size2: ", sys.getsizeof(zlib.compress(str1))

The output:

    size1: 47
    size2: 55

Possible duplicate of Python 3x- Compression Makes File Bigger :(
– Aran-Fey, Commented Mar 22, 2018 at 15:35
Compression is not magic. You can't just throw data away to make the file smaller. There are tradeoffs.
– Aran-Fey, Commented Mar 22, 2018 at 15:35
Yes, but no matter what the string is, the compressed string appears larger. I am using Python 2.7, if that makes any difference.
– 0Interest, Commented Mar 22, 2018 at 15:36
Think of compression as just an encoding that happens to be smaller than the original. The encoding includes the encoded data as well as some metadata needed to reconstruct the original (good compression schemes are adaptive; they are not independent of the original data). If the metadata is larger than the difference in size between the original and the encoded data, there is no net compression. That metadata has a fixed size (or at least, a fixed minimum size), so you won't see any benefit until you start compressing larger pieces of data.
– chepner, Commented Mar 22, 2018 at 15:46
Just doubling the size of your input with str1 *= 2 already gives you a net gain (although it's an extreme corner case, as such a string is incredibly compressible); my system shows that you go from an increase of 8 bytes to a decrease of 5 bytes.
– chepner, Commented Mar 22, 2018 at 15:49
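The fixed overhead described in the comments is easier to see when the payload is measured with len() rather than sys.getsizeof(), since getsizeof() also counts Python's own per-object bookkeeping. The following is only a minimal sketch, assuming Python 3 (where zlib.compress takes bytes, unlike the Python 2.7 code in the question); the helper name compressed_sizes is made up for illustration.

```python
import zlib


def compressed_sizes(data):
    """Return (original_length, compressed_length) in bytes."""
    return len(data), len(zlib.compress(data))


alphabet = b"abcdefghijklmnopqrstuvwxyz"

# Short input with no repetition: the zlib header, checksum and block
# framing outweigh any savings, so the output is longer than the input.
print(compressed_sizes(alphabet))

# Repeating the same input many times gives the compressor earlier data
# to refer back to, and the fixed overhead becomes negligible.
print(compressed_sizes(alphabet * 100))
```

A zlib stream wraps the deflate data in a 2-byte header and a 4-byte Adler-32 checksum, so at least 6 bytes are spent before any actual data is encoded.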

Mark Adler · Accepted Answer · 2018-03-22 23:57:28Z

Grant's answer is fine, but something here needs to be emphasized.

Doesn't zlib need to compress?

No! It does not, and cannot, always compress. Any operation that losslessly compresses and decompresses an input must expand some (actually most) inputs, while compressing only some. This is a simple and obvious consequence of counting.
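The counting argument can be checked with a little arithmetic. The sketch below (Python 3 assumed; the function names are illustrative only) counts bit strings: there are 2**n inputs of exactly n bits but only 2**n - 1 strings that are strictly shorter, so no lossless scheme can shorten every n-bit input.

```python
def strings_of_length(n):
    """Number of distinct bit strings of exactly n bits."""
    return 2 ** n


def strings_shorter_than(n):
    """Number of distinct bit strings with fewer than n bits (empty string included)."""
    return sum(2 ** k for k in range(n))


n = 16
inputs = strings_of_length(n)
shorter_outputs = strings_shorter_than(n)
# There is one fewer short output than there are inputs, so at least one
# input cannot shrink.
print(inputs, shorter_outputs)

# Outputs that save at least 8 bits have at most n - 8 bits. Each output
# decodes to exactly one input, which bounds the fraction of inputs that
# can shrink by a byte or more.
fraction = strings_shorter_than(n - 8 + 1) / inputs
print(fraction)
```

The second figure is why "most" is the right word: fewer than 1 in 128 inputs of a given length can be shortened by a full byte, whatever the algorithm.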

The only thing that is guaranteed by a lossless compressor is that what you get out from decompression is what you put in to compression.

Any useful compression scheme is rigged to take advantage of the specific redundancies expected in the particular kind of data being compressed. Language data (e.g. English, C code, data files, even machine code), which is a sequence of symbols with a specific frequency distribution and oft-repeated strings, is compressed using models that expect and look for those redundancies. Such schemes depend on gathering information about the data in at least the first tens of kilobytes before the compression starts being really effective.

Your example is far too short to have the statistics needed, and has no repetition of any kind, so it will be expanded by any general-purpose compressor.

Would you recommend compressing data read from a picture? I need to send multiple pictures in order to share screens between sockets.
Sure, but use a compressor that is designed for images, such as PNG or FLIF. Compression only works on the specific redundancies found in particular types of data.

Mark Adler · Accepted Answer · 2018-03-22 23:49:34Z

You're going to have a hard time compressing a string like that. It's rather short and contains 26 unique characters. Compressors work by assigning short codes to common words, characters, etc., so with all unique characters you'll get poor performance.

You'll also get poor performance if the data is random.

Here's an example with a string of the same length which compresses.

    >>> str2 = 'a'*26
    >>> str2
    'aaaaaaaaaaaaaaaaaaaaaaaaaa'
    >>> sys.getsizeof(str2)
    63
    >>> sys.getsizeof(zlib.compress(str2))
    48
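For anyone on Python 3, a rough equivalent of the comparison above might look like the sketch below. It is an adaptation rather than the original session: it uses bytes instead of str, measures with len() so that Python's object overhead is left out, and adds a seeded pseudo-random sample (an assumption made here for repeatability) to cover the random-data case as well.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Three inputs of the same length (26 bytes) with very different redundancy.
samples = {
    "unique": b"abcdefghijklmnopqrstuvwxyz",
    "repeated": b"a" * 26,
    "random": bytes(rng.getrandbits(8) for _ in range(26)),
}

for name, data in samples.items():
    packed = zlib.compress(data)
    # Lossless: decompression must return exactly the original bytes.
    assert zlib.decompress(packed) == data
    print(name, len(data), "->", len(packed))
```

Only the repeated input should come out shorter than 26 bytes; the other two have nothing for the compressor to exploit and end up carrying the format overhead on top of the data.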

Oh ok, now I see, I tried spamming the keyboard with random characters and the compressed string does get smaller in size.
Try it with something like the Pride and Prejudice text file. It's a common text used to test compressors.
Yup, it really compressed, 1 chapter is 11353 and after compression, it's 4902. Thanks!
