
Timeline for Compressing EBCDIC file vs UTF8

Current License: CC BY-SA 4.0

18 events
when toggle format what by license comment
Nov 22, 2019 at 19:45 comment added rodripf Sadly the data I am testing right now is confidential, so I cannot share it. I'll try to generate some random data with the same characteristics and share it!
Nov 21, 2019 at 21:12 comment added Christophe @RobertHarvey Indeed! Here I can agree with you. ZIP probably chooses a different algorithm for each file, in view of the very different statistical distribution of the two files. Different algorithms would explain the huge difference. And I agree, more data is needed to say for sure.
Nov 21, 2019 at 21:06 answer added Christophe timeline score: 3
Nov 21, 2019 at 20:30 review Close votes (completed Dec 8, 2019 at 3:05)
Nov 21, 2019 at 20:24 answer added gnasher729 timeline score: 0
Nov 21, 2019 at 20:14 comment added Robert Harvey Then you would have to evaluate the algorithm in use against the data being compressed to see what is happening under the hood. Not exactly a trivial exercise.
Nov 21, 2019 at 20:13 comment added Robert Harvey @Christophe: Fundamentally, compression merely reduces redundancies in the data. Practically, this particular question doesn't contain enough information to be answerable. Zip uses Shrink, Reduce (levels 1-4), Implode, Deflate, Deflate64, bzip2, LZMA (EFS), WavPack, and PPMd algorithms to compress data; at a minimum, we would need to know which algorithm is in use, and whether Zip chose a different algorithm for each compression exercise.
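As the comment above points out, Zip is a container for several methods, and knowing which one was chosen matters. One way to check, sketched here in Python with an in-memory archive (the file name and sample data are illustrative): the stdlib `zipfile` module records each member's method in `ZipInfo.compress_type`.

```python
# Sketch: build a small archive, then inspect which compression method
# Zip recorded for each member. File name and contents are made up.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.txt", "hello " * 1000)

with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        # compress_type is the numeric method id from the Zip spec:
        # 0 = stored, 8 = Deflate, 12 = bzip2, 14 = LZMA
        print(info.filename, info.compress_type,
              info.file_size, info.compress_size)
```

Running the same inspection over both archives would show whether the EBCDIC and UTF-8 files were compressed with the same method, which is the first thing to rule out before comparing ratios.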
Nov 21, 2019 at 20:08 comment added Christophe @RobertHarvey 99.7% of the characters are English in both files. The question is why the EBCDIC compresses 60% more than the UTF-8, when only 0.03% of the UTF-8 characters are multibyte (and effectively use the 8th bit).
Nov 21, 2019 at 20:05 comment added Robert Harvey @Christophe: It doesn't. The other 99.7 percent of 7-bit English characters explains the difference.
Nov 21, 2019 at 20:02 comment added Christophe @RobertHarvey and this is exactly what makes this question very interesting. How could 0.3% of non-English characters explain a 60% difference? The English characters are encoded on 7 bits in both cases.
Nov 21, 2019 at 19:50 comment added Robert Harvey @Christophe: A few non-English characters won't materially affect the compression characteristics.
Nov 21, 2019 at 19:48 comment added Christophe @RobertHarvey Well, according to OP's data, with at least 1 MB of multibyte characters in the file, there are certainly a couple of non-English characters. OP is from Uruguay, where a lot of ñ, Ñ, ú and other non-ASCII characters are used. By the way, could you explain why you think that a non-English EBCDIC file is improbable?
Nov 21, 2019 at 19:36 comment added Robert Harvey @Christophe: Which would only apply to EBCDIC files that are not in English, probably an unlikely scenario.
Nov 21, 2019 at 19:34 comment added Christophe @RobertHarvey For people using languages other than English, EBCDIC uses the full 8-bit range, as far as I know, depending on the code page ...
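The point in the comment above can be illustrated concretely: in international EBCDIC code pages, accented Latin letters are single bytes, whereas UTF-8 needs two bytes for each. A small Python sketch using cp500 (International EBCDIC, chosen here only because Python's stdlib ships that codec; the OP's file may use a different code page):

```python
# Sketch: the same accented characters as single-byte EBCDIC (cp500)
# vs two-byte UTF-8. cp500 is a stand-in for whatever code page the
# actual file uses.
for ch in "ñÑú":
    b_ebcdic = ch.encode("cp500")
    b_utf8 = ch.encode("utf-8")
    print(ch, b_ebcdic.hex(), b_utf8.hex())
```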
Nov 21, 2019 at 18:41 comment added πάντα ῥεῖ @Robert I stole that information to improve my answer. I hope you are OK with that.
Nov 21, 2019 at 18:38 comment added Robert Harvey Practically, EBCDIC only uses 7 bits out of 8 in a byte. That alone could explain the compression difference.
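The intuition above can be sanity-checked by compressing the same English text in both encodings with Deflate (zlib). A minimal Python sketch, assuming a made-up all-ASCII sample; real files with different content will behave differently:

```python
# Sketch: Deflate the same English text encoded as EBCDIC (cp037,
# US/Canada code page) vs UTF-8 and compare the compressed sizes.
# The sample text is illustrative, not the OP's data.
import zlib

text = "the quick brown fox jumps over the lazy dog. " * 2000

for label, data in (("EBCDIC", text.encode("cp037")),
                    ("UTF-8", text.encode("utf-8"))):
    comp = zlib.compress(data, 9)
    print(f"{label}: {len(data)} -> {len(comp)} bytes "
          f"({len(comp) / len(data):.1%})")
```

For pure ASCII text the two byte streams carry the same redundancy, so a large ratio gap on real files would point at the data or the chosen algorithm rather than the encoding itself.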
Nov 21, 2019 at 18:35 review First posts (completed Nov 21, 2019 at 19:34)
Nov 21, 2019 at 18:32 history asked rodripf CC BY-SA 4.0