Timeline for Compressing EBCDIC file vs UTF8
Current License: CC BY-SA 4.0
18 events
| when | what | action | by | comment | license |
|---|---|---|---|---|---|
| Nov 22, 2019 at 19:45 | comment | added | rodripf | Sadly the data I am testing right now is confidential, so I cannot share it. I'll try to generate some random data with the same characteristics and share it! | |
| Nov 21, 2019 at 21:12 | comment | added | Christophe | @RobertHarvey Indeed! Here I can agree with you. ZIP probably chooses a different algorithm for each file, in view of their very different statistical distributions. Different algorithms would explain the huge difference. And I agree, more data is needed to say for sure. | |
| Nov 21, 2019 at 21:06 | answer | added | Christophe | timeline score: 3 | |
| Nov 21, 2019 at 20:30 | review | Close votes | |||
| Dec 8, 2019 at 3:05 | |||||
| Nov 21, 2019 at 20:24 | answer | added | gnasher729 | timeline score: 0 | |
| Nov 21, 2019 at 20:14 | comment | added | Robert Harvey | Then you would have to evaluate the algorithm in use against the data being compressed to see what is happening under the hood. Not exactly a trivial exercise. | |
| Nov 21, 2019 at 20:13 | comment | added | Robert Harvey | @Christophe: Fundamentally, compression merely reduces redundancies in the data. Practically, this particular question doesn't contain enough information to be answerable. Zip uses Shrink, Reduce (levels 1-4), Implode, Deflate, Deflate64, bzip2, LZMA (EFS), WavPack, and PPMd algorithms to compress data; at a minimum, we would need to know which algorithm is in use, and whether Zip chose a different algorithm for each compression exercise. | |
| Nov 21, 2019 at 20:08 | comment | added | Christophe | @RobertHarvey 99.7% of the characters are English in both files. The question is why the EBCDIC compresses 60% more than the UTF-8, when only 0,03% of the UTF-8 characters are multibyte (and effectively use the 8th bit). | |
| Nov 21, 2019 at 20:05 | comment | added | Robert Harvey | @Christophe: It doesn't. The other 99.7 percent of 7-bit English characters explains the difference. | |
| Nov 21, 2019 at 20:02 | comment | added | Christophe | @RobertHarvey and this is exactly what makes this question very interesting. How could 0,3% of non-English chars explain a 60% difference? The English chars are encoded on 7 bits in both cases. | |
| Nov 21, 2019 at 19:50 | comment | added | Robert Harvey | @Christophe: A few non-English characters won't materially affect the compression characteristics. | |
| Nov 21, 2019 at 19:48 | comment | added | Christophe | @RobertHarvey Well, according to OP's data, with at least 1 MB of multibyte characters in the file, there are for sure a couple of non-English characters. OP is from Uruguay, where a lot of ñ, Ñ, ú and other non-ASCII chars are used. By the way, could you explain why you think that a non-English EBCDIC file is improbable? | |
| Nov 21, 2019 at 19:36 | comment | added | Robert Harvey | @Christophe: Which would only apply to EBCDIC files that are not in English, probably an unlikely scenario. | |
| Nov 21, 2019 at 19:34 | comment | added | Christophe | @RobertHarvey For people using languages other than English, EBCDIC uses the full 8-bit range, as far as I know, depending on the code pages ... | |
| Nov 21, 2019 at 18:41 | comment | added | πάντα ῥεῖ | @Robert I stole that information to improve my answer. I hope you are OK with that. | |
| Nov 21, 2019 at 18:38 | comment | added | Robert Harvey | Practically, EBCDIC only uses 7 bits out of 8 in a byte. That alone could explain the compression difference. | |
| Nov 21, 2019 at 18:35 | review | First posts | |||
| Nov 21, 2019 at 19:34 | |||||
| Nov 21, 2019 at 18:32 | history | asked | rodripf | | CC BY-SA 4.0 |
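
Robert Harvey's comment of Nov 21, 2019 at 20:13 notes that the question can't be settled without knowing which compression method Zip actually selected for each archive. That method is recorded inside the archive itself, so it can be checked directly. Below is a minimal sketch using Python's standard `zipfile` module; the archive names `ebcdic.zip` and `utf8.zip` are hypothetical placeholders, since the original files were never shared.

```python
import zipfile

# Names for the method codes that the zipfile module exposes constants for;
# any other code from the ZIP spec is reported by its raw number.
METHOD_NAMES = {
    zipfile.ZIP_STORED: "stored (no compression)",
    zipfile.ZIP_DEFLATED: "deflate",
    zipfile.ZIP_BZIP2: "bzip2",
    zipfile.ZIP_LZMA: "lzma",
}

def report_methods(archive_path):
    """Print the compression method and ratio recorded for each member of a ZIP archive."""
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            method = METHOD_NAMES.get(info.compress_type, f"method #{info.compress_type}")
            ratio = info.compress_size / info.file_size if info.file_size else 0.0
            print(f"{info.filename}: {method}, "
                  f"{info.file_size} -> {info.compress_size} bytes ({ratio:.1%})")

# Hypothetical archive names standing in for the confidential test files.
report_methods("ebcdic.zip")
report_methods("utf8.zip")
```

If both archives report the same method (typically deflate), the 60% gap has to come from the data itself rather than from Zip switching algorithms, which is the scenario Christophe and Robert Harvey were weighing in the comments.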
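The comments between 18:38 and 20:14 on Nov 21, 2019 debate how much of each byte the two encodings really use and how different their statistical distributions are. Both properties can be measured rather than argued about. The sketch below, again in Python with placeholder file names, reports the fraction of bytes with the 8th bit set, the number of distinct byte values, and the order-0 byte entropy as a rough indicator of how skewed each distribution is.

```python
import math
from collections import Counter

def byte_stats(path):
    """Report how a file's byte values are distributed:
    high-bit usage, distinct values, and order-0 entropy."""
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        print(f"{path}: empty file")
        return
    total = len(data)
    counts = Counter(data)
    high_bit = sum(n for value, n in counts.items() if value >= 0x80)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    print(f"{path}: {total} bytes, "
          f"{high_bit / total:.2%} with the 8th bit set, "
          f"{len(counts)} distinct byte values, "
          f"order-0 entropy {entropy:.2f} bits/byte")

# Placeholder file names; the real test data was confidential.
byte_stats("sample.ebcdic")
byte_stats("sample.utf8")
```

Comparing these numbers for the EBCDIC and UTF-8 versions shows whether the difference lies in 8th-bit usage, as first suggested, or in the overall byte distribution that the compressor exploits.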