
I have a file

СМП бваг™вга† 

The first three letters are proper Cyrillic and the rest is mojibake.

"Mojibake is the garbled or gibberish text that is the result of text being decoded using an unintended character encoding." — Wikipedia

Originally, it was

СМП структура 

but then it got garbled somehow, most probably because the file was zipped on Windows XP and then unzipped on a Mac by an unskilled user.

I tried to fix it using convmv and iconv, like this:

convmv -r -f cp1251 -t utf-8 DIR 
ls | iconv -f cp1251 -t cp850 | iconv -f cp866 

but I haven't succeeded yet. Could someone help with this?

update 1

Hexdump of СМП бваг™вга†:

0000000 d0 a1 d0 9c d0 9f 20 d0 b1 d0 b2 d0 b0 d0 b3 e2
         С **  М **  П **      б **  в **  а **  г **  ™
0000020 84 a2 d0 b2 d0 b3 d0 b0 e2 80 a0 0a
        ** **  в **  г **  а **  † ** ** \n
0000034

Hexdump of СМП структура:

0000000 d0 a1 d0 9c d0 9f 20 d1 81 d1 82 d1 80 d1 83 d0
         С **  М **  П **      с **  т **  р **  у **  к
0000020 ba d1 82 d1 83 d1 80 d0 b0 0a
        **  т **  у **  р **  а ** \n
0000032
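
As a sanity check on the two dumps, the bytes can be decoded back into text. The Python sketch below copies the byte values from the dumps; the constant names and the `decode_dump` helper are made up for illustration. Both dumps decode cleanly as UTF-8, which suggests the wrong letters are really stored in the name rather than being a display problem.

```python
# Byte values copied from the two hexdumps above.
GARBLED_HEX = (
    "d0 a1 d0 9c d0 9f 20 d0 b1 d0 b2 d0 b0 d0 b3 e2 "
    "84 a2 d0 b2 d0 b3 d0 b0 e2 80 a0 0a"
)
ORIGINAL_HEX = (
    "d0 a1 d0 9c d0 9f 20 d1 81 d1 82 d1 80 d1 83 d0 "
    "ba d1 82 d1 83 d1 80 d0 b0 0a"
)


def decode_dump(hex_bytes: str) -> str:
    """Turn a space-separated hex string into text, assuming UTF-8."""
    # bytes.fromhex() skips the spaces between byte pairs.
    return bytes.fromhex(hex_bytes).decode("utf-8")


if __name__ == "__main__":
    # repr() keeps the trailing newline visible.
    print(repr(decode_dump(GARBLED_HEX)))
    print(repr(decode_dump(ORIGINAL_HEX)))
```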
  • I assume the user's skill doesn't matter here. Commented Mar 25 at 10:36

1 Answer

Those look like file names that were initially encoded in CP866 but were incorrectly converted to UTF-8 assuming they were encoded in MAC-CYRILLIC instead.

$ echo СМП структура |
    iconv -t cp866 | # cp866 encoded text
    iconv -f mac-cyrillic -t utf-8 # incorrectly converted to UTF-8
СМП бваг™вга†
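
When a known-good name and its garbled form are both available, one way to arrive at a guess like this is to brute-force candidate encoding pairs. The Python sketch below is only an illustration: the function name is hypothetical, and the candidate list is an assumption (cp1251, cp850 and cp866 appear in the question's attempts; the others are added as further common guesses).

```python
from itertools import permutations

# Assumed candidate list; extend it with whatever encodings seem plausible.
CANDIDATES = [
    "cp866", "cp1251", "koi8_r", "iso8859_5",
    "mac_cyrillic", "cp850", "cp437",
]


def find_codec_pairs(original, garbled, codecs=CANDIDATES):
    """Return (written_as, misread_as) pairs that turn original into garbled."""
    matches = []
    for written_as, misread_as in permutations(codecs, 2):
        try:
            candidate = original.encode(written_as).decode(misread_as)
        except UnicodeError:
            # The text does not fit this encoding, or the bytes are undefined.
            continue
        if candidate == garbled:
            matches.append((written_as, misread_as))
    return matches


if __name__ == "__main__":
    print(find_codec_pairs("СМП структура", "СМП бваг™вга†"))
```

With the two names from the question, the only match among these candidates is written as cp866 and misread as mac_cyrillic.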

To fix, revert the process:

$ echo СМП бваг™вга† |
    iconv -f utf-8 -t mac-cyrillic |
    iconv -f cp866 -t utf-8
СМП структура

On systems other than macOS,

convmv --notest -r -f utf-8 --qto -t mac-cyrillic DIR &&
  convmv --notest -r --qfrom -f cp866 -t utf-8 DIR

Should do it.

But macOS can't have file names with arbitrary byte values: its API seems to expect file names that are valid UTF-8 encoded text, and its rename() rejects, with an EILSEQ error (Illegal byte sequence), any attempt to rename a file to something that is not UTF-8 encoded, which is exactly what the first convmv does.
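
To see why the intermediate name is a problem, here is a small Python sketch (the names `INTERMEDIATE` and `is_valid_utf8` are made up for illustration). It builds the bytes that the first conversion step would try to use as a file name and checks whether they form valid UTF-8.

```python
GARBLED = "СМП бваг™вга†"

# Bytes the first step (UTF-8 to Mac Cyrillic) would try to use as the name.
INTERMEDIATE = GARBLED.encode("mac_cyrillic")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(INTERMEDIATE.hex(" "))
    print(is_valid_utf8(INTERMEDIATE))
    print(INTERMEDIATE.decode("cp866"))
```

The intermediate bytes are the CP866 form of the original name. They are not valid UTF-8, so a file system that insists on UTF-8 names cannot hold them even temporarily.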

So, there, we'd need to do those two steps in one and go straight from UTF-8 to UTF-8.

Could be something like:

find DIR -depth -print0 |
  rename -0 -d -n -e '
    use Unicode::Normalize qw(NFC);
    use Encode qw(:all);
    my $check = DIE_ON_ERR | LEAVE_SRC;
    my $new = eval {
      encode("UTF-8",
        decode("cp866",
          encode("mac-cyrillic",
            NFC(decode("UTF-8", $_, $check)),
            $check),
          $check))
    };
    if ($new) {$_ = $new} else {warn $@}'

(This assumes the rename from the perl File::Rename module. NFC is there to work around yet another macOS idiosyncrasy: it mangles some characters in file names by converting them to their decomposed form, something convmv also works around internally and automatically.)

Remove the -n (dry run) option once you're happy with the output.
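
Where the perl rename is not available, the same one-step idea could be sketched in Python. This is an illustration under assumptions, not a tested replacement for the command above: the function names are hypothetical, and it blindly applies the Mac Cyrillic to CP866 reinterpretation to every name that can be encoded in Mac Cyrillic. Names that are already correct Cyrillic would therefore be mangled too, so the dry-run output needs reviewing first.

```python
import os
import unicodedata


def fix_name(name: str) -> str:
    """Undo 'CP866 bytes decoded as Mac Cyrillic' for one file name."""
    # macOS may hand back decomposed names, so recompose them first (NFC).
    composed = unicodedata.normalize("NFC", name)
    # Strict codecs raise UnicodeError if the name does not fit the pattern.
    return composed.encode("mac_cyrillic").decode("cp866")


def plan_renames(top: str):
    """Yield (old_path, new_path) pairs, deepest entries first."""
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        for name in dirnames + filenames:
            try:
                fixed = fix_name(name)
            except UnicodeError:
                continue  # not representable in Mac Cyrillic: leave it alone
            if fixed != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, fixed)


def apply_renames(top: str, dry_run: bool = True) -> None:
    """Print the planned renames; perform them only when dry_run is False."""
    for old, new in plan_renames(top):
        print(f"{old} -> {new}")
        if not dry_run:
            os.rename(old, new)
```

Calling `apply_renames("DIR")` only prints the plan, and `apply_renames("DIR", dry_run=False)` performs it. Walking bottom-up plays the role of `find -depth`: directory contents are renamed before the directory itself.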

  • The first and second commands work fine, but the third one doesn't work with a real file. The output first complains "Error: Illegal byte sequence" and then says "Skipping, already UTF-8". Commented Mar 14 at 18:36
  • "And what are those real files that convmv complains about?" — I have uploaded the file here: github.com/johnmapeson/test/blob/main/СМП%20бваг™вга†.docx. And just in case, here it is in the archive: github.com/johnmapeson/test/blob/main/file.zip Commented Mar 15 at 17:42
  • Ah, looking at the code of convmv, that "Error: Illegal byte sequence" is reported upon failure of the rename() system call. It seems macOS can't create files with arbitrary byte values in their names, so that approach won't work: we need to go straight from UTF-8 to UTF-8 and can't do it in two steps like that. Commented Mar 15 at 18:29
  • @jsx97 "What are the different versions of the rename command? How do I use the Perl version?" suggests that on macOS, you can install it via homebrew (though I'd expect you could do so via cpan as well). Commented Mar 15 at 19:26
  • @jsx97 rename -0 expects a NUL-delimited list of paths on stdin. Even if you used a for loop to generate that list, you'd need something like find or globbing to generate the list of things for the loop to loop over, and I can't really see how a loop would help. If you prefer globs, you can use print -rNC1 -- **/*(NDod) in zsh to generate that list. Commented Apr 16 at 14:18
