
Timeline for Compressing EBCDIC file vs UTF8

Current License: CC BY-SA 4.0

18 events
when toggle format what by license comment
Nov 22, 2019 at 19:45 comment added rodripf Sadly the data I am testing right now is confidential, so I cannot share it. I'll try to generate some random data with the same characteristics and share it!
Nov 21, 2019 at 21:12 comment added Christophe @RobertHarvey Indeed! Here I can agree with you. ZIP probably chooses a different algorithm for each file, in view of the very different statistical distribution of the two files. Different algorithms would explain the huge difference. And I agree, more data is needed to say for sure.
Nov 21, 2019 at 21:06 answer added Christophe timeline score: 3
Nov 21, 2019 at 20:30 review Close votes (completed Dec 8, 2019 at 3:05)
Nov 21, 2019 at 20:24 answer added gnasher729 timeline score: 0
Nov 21, 2019 at 20:14 comment added Robert Harvey Then you would have to evaluate the algorithm in use against the data being compressed to see what is happening under the hood. Not exactly a trivial exercise.
Nov 21, 2019 at 20:13 comment added Robert Harvey @Christophe: Fundamentally, compression merely reduces redundancies in the data. Practically, this particular question doesn't contain enough information to be answerable. Zip uses Shrink, Reduce (levels 1-4), Implode, Deflate, Deflate64, bzip2, LZMA (EFS), WavPack, and PPMd algorithms to compress data; at a minimum, we would need to know which algorithm is in use, and whether Zip chose a different algorithm for each compression exercise.
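As the comment above points out, Zip is a container for several methods, and knowing which one was chosen matters. One way to check, sketched here in Python with an in-memory archive (the file name and sample data are illustrative): the stdlib `zipfile` module records each member's method in `ZipInfo.compress_type`.

```python
# Sketch: build a small archive, then inspect which compression method
# Zip recorded for each member. File name and contents are made up.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.txt", "hello " * 1000)

with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        # compress_type is the numeric method id from the Zip spec:
        # 0 = stored, 8 = Deflate, 12 = bzip2, 14 = LZMA
        print(info.filename, info.compress_type,
              info.file_size, info.compress_size)
```

Running the same inspection over both archives would show whether the EBCDIC and UTF-8 files were compressed with the same method, which is the first thing to rule out before comparing ratios.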
Nov 21, 2019 at 20:08 comment added Christophe @RobertHarvey 99.7% of the characters are English in both files. The question is why the EBCDIC compresses 60% more than the UTF-8, when only 0.03% of the UTF-8 characters are multibyte (and effectively use the 8th bit).
Nov 21, 2019 at 20:05 comment added Robert Harvey @Christophe: It doesn't. The other 99.7 percent of 7-bit English characters explains the difference.
Nov 21, 2019 at 20:02 comment added Christophe @RobertHarvey and this is exactly what makes this question very interesting. How could 0.3% of non-English characters explain a 60% difference? The English characters are encoded on 7 bits in both cases.
Nov 21, 2019 at 19:50 comment added Robert Harvey @Christophe: A few non-English characters won't materially affect the compression characteristics.
Nov 21, 2019 at 19:48 comment added Christophe @RobertHarvey Well, according to OP's data, with at least 1 MB of multibyte characters in the file, there are certainly a couple of non-English characters. OP is from Uruguay, where a lot of ñ, Ñ, ú and other non-ASCII characters are used. By the way, could you explain why you think that a non-English EBCDIC file is improbable?
Nov 21, 2019 at 19:36 comment added Robert Harvey @Christophe: Which would only apply to EBCDIC files that are not in English, probably an unlikely scenario.
Nov 21, 2019 at 19:34 comment added Christophe @RobertHarvey For people using languages other than English, EBCDIC uses the full 8-bit range, as far as I know, depending on the code page ...
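The point in the comment above can be illustrated concretely: in international EBCDIC code pages, accented Latin letters are single bytes, whereas UTF-8 needs two bytes for each. A small Python sketch using cp500 (International EBCDIC, chosen here only because Python's stdlib ships that codec; the OP's file may use a different code page):

```python
# Sketch: the same accented characters as single-byte EBCDIC (cp500)
# vs two-byte UTF-8. cp500 is a stand-in for whatever code page the
# actual file uses.
for ch in "ñÑú":
    b_ebcdic = ch.encode("cp500")
    b_utf8 = ch.encode("utf-8")
    print(ch, b_ebcdic.hex(), b_utf8.hex())
```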
Nov 21, 2019 at 18:41 comment added πάντα ῥεῖ @Robert I stole that information to improve my answer. I hope you are OK with that.
Nov 21, 2019 at 18:38 comment added Robert Harvey Practically, EBCDIC only uses 7 bits out of 8 in a byte. That alone could explain the compression difference.
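The intuition above can be sanity-checked by compressing the same English text in both encodings with Deflate (zlib). A minimal Python sketch, assuming a made-up all-ASCII sample; real files with different content will behave differently:

```python
# Sketch: Deflate the same English text encoded as EBCDIC (cp037,
# US/Canada code page) vs UTF-8 and compare the compressed sizes.
# The sample text is illustrative, not the OP's data.
import zlib

text = "the quick brown fox jumps over the lazy dog. " * 2000

for label, data in (("EBCDIC", text.encode("cp037")),
                    ("UTF-8", text.encode("utf-8"))):
    comp = zlib.compress(data, 9)
    print(f"{label}: {len(data)} -> {len(comp)} bytes "
          f"({len(comp) / len(data):.1%})")
```

For pure ASCII text the two byte streams carry the same redundancy, so a large ratio gap on real files would point at the data or the chosen algorithm rather than the encoding itself.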
Nov 21, 2019 at 18:35 review First posts (completed Nov 21, 2019 at 19:34)
Nov 21, 2019 at 18:32 history asked rodripf CC BY-SA 4.0