How to convert unknown-8bit file to utf8

Question

I have a .srt file that displays as gibberish when I open it in gedit on Ubuntu, so I want to convert it to UTF-8 to be able to read it.

When I try to figure out the encoding, I get:

    $ file -i x.srt
    x.srt: text/plain; charset=unknown-8bit

In another attempt I found:

    $ find . -type f -print | xargs file
    ./x.srt: Non-ISO extended-ASCII text, with CRLF line terminators

Also I tried enca:

    $ enca x.srt
    enca: Cannot determine (or understand) your language preferences.
    Please use `-L language', or `-L none' if your language is not supported
    (only a few multibyte encodings can be recognized then).
    Run `enca --list languages' to get a list of supported languages.

and

    $ enca -L Persian x.srt
    enca: Cannot determine (or understand) your language preferences.
    Please use `-L language', or `-L none' if your language is not supported
    (only a few multibyte encodings can be recognized then).
    Run `enca --list languages' to get a list of supported languages.

So I am wondering how to find out the encoding and eventually convert the file to a usable format.

You could look at the source code of SubRip to determine the existing file format: sourceforge.net/projects/subrip
– William Deans, Commented Feb 12, 2015 at 19:55
Are there too many languages in 'enca --list languages'? Maybe write a bash script that writes the result for every language into one file, and inspect it visually afterwards; something like "for lang in $(enca --list languages); do enca -L $lang - ; done > tmp.txt"
– Asain Kujovic, Commented Feb 12, 2015 at 20:51
@OmerMerdan enca --list gives a list of 12 Slavic languages along with Chinese and others.
– supermario, Commented Feb 12, 2015 at 21:00
What language are the subtitles supposed to be in? Can you post a sample (output of head -n 20 x.srt | od -tx1)?
– Gilles 'SO- stop being evil', Commented Feb 13, 2015 at 22:45
For me, just trying to guess the correct encoding worked, e.g. iconv -f iso-8859-1 -t utf-8 < file.txt > out.txt
– BladeMight, Commented Jan 30, 2019 at 12:18

tripleee · Accepted Answer · 2019-09-15 16:26:49Z

There is no reliable way to convert from an unknown encoding to a known one.

In your case, if you know the original text is in Farsi / Persian, maybe you can identify a number of possible encodings, and iterate over those until you see the output you expect.

Based on quick googling, there is no standard, stable converter for the legacy Iran System encoding, and the only remaining popular alternative is Windows codepage 1256. I have included MacArabic here mainly for illustrative purposes (though maybe it would even be a feasible alternative for Farsi, too?)

    for encoding in cp1256 macarabic; do
        if iconv -f "$encoding" -t utf-8 inputfile >outputfile."$encoding"; then
            echo "$encoding: possible"
        else
            echo "$encoding: skipped"
            rm outputfile."$encoding"
        fi
    done

(My version of iconv doesn't actually support MacArabic, but maybe you will have more luck; or you can try a different conversion tool.)

Examine the resulting output files; see if one of them seems to make sense.

If you know what the output should look like, you can also look up individual mappings for bytes in the file. If the first byte is 0x94 and you know it should display as ﭖ you have basically established that the encoding is Iran System. Maybe look up a few more bytes to verify this conclusion. The Wikipedia page for this encoding has a table of all the characters. Obviously, this is painstaking, slow, and error prone, especially if there are many candidate encodings to choose from.

For some encodings, you can find a list e.g. at https://tripleee.github.io/8bit/ -- for others, maybe you just have to look at the corresponding Wikipedia coding tables.
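
To decide which bytes are worth looking up in those tables, it can help to list the non-ASCII bytes that occur most often in the file. The following is only a sketch: the function name high_bytes and its optional count argument are made up for illustration, and it relies on nothing beyond od, tr, grep, sort, uniq and head.

```shell
# Hypothetical helper: show the most frequent bytes >= 0x80 in a file.
# $1 = file to inspect, $2 = how many distinct bytes to show (default 10).
high_bytes() {
    od -An -v -tx1 "$1" |               # hex dump, one byte per column, no offsets
        tr -s ' ' '\n' |                # put every byte on its own line
        grep -E '^[89a-f][0-9a-f]$' |   # keep only bytes with the high bit set
        sort | uniq -c | sort -rn |     # count occurrences, most frequent first
        head -n "${2:-10}"
}

# Usage (x.srt is the file from the question):
#   high_bytes x.srt 15
```

Each output line is a count followed by a hex byte value. The most frequent values are the ones to compare against the candidate code page tables, since they are likely to be common letters of the language.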

Community · Accepted Answer · 2017-04-13 12:22:52Z

A file in an unknown 8-bit code page is reported as “unknown-8bit” for a reason: detecting the encoding is not an easy problem without some idea of the language. It is not impossible, but to work well such a heuristic detector would need a large vocabulary for all the most-used languages, a large list of code pages, and some knowledge of grammar. Update: I have never tried enca; possibly it is a wonder-decoder built along these lines. But if the file is, say, mostly ASCII source code with only one or two words made of high-bit-set octets, then it is virtually impossible to guess the language and encoding even with such a miraculous heuristic algorithm. That is why the original HTTP/1.1 specification strongly insisted on declaring the charset in the HTTP Content-Type: header for any text/* media type.

So, the solution, by points:

1. Investigate, learn, or guess which language the file is supposed to contain. Human intelligence is crucial here. At the very least, compile a list of a few plausible hypotheses.
2. Compile a list of encodings used by the language(s).
3. Try these encodings one at a time with head file | iconv -f try, where file is the input file and try is the candidate encoding (the LANG environment variable is assumed to match the terminal in use, since iconv converts to the locale's encoding by default), and check whether the result is readable, until you succeed.

This solution, of course, assumes that the text is encoded properly but in an unknown code page. Cases where the text was garbled by human mistake or due to a software glitch can’t be solved this way.
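
The trial-and-error step can be sketched as a small shell function. This is an illustration under assumptions, not a finished tool: the name preview_encodings is invented here, the target is fixed to UTF-8 rather than taken from LANG, and the encoding names you pass must be ones your iconv actually supports.

```shell
# Hypothetical helper: preview the first lines of a file under several
# candidate source encodings, so a human can judge which one is readable.
# $1 = file, remaining arguments = candidate encodings to try.
preview_encodings() {
    local file=$1
    shift
    local enc
    for enc in "$@"; do
        printf '== %s ==\n' "$enc"
        # The pipeline's status is iconv's, so a rejected encoding or an
        # invalid byte sequence falls through to the failure message.
        head -n 5 "$file" | iconv -f "$enc" -t UTF-8 2>/dev/null ||
            printf '(conversion failed)\n'
    done
}

# Usage, with the Persian candidates named in this answer:
#   preview_encodings x.srt WINDOWS-1256 ISO-8859-6
```

A conversion that succeeds is not proof of the right encoding: most 8-bit code pages assign a character to nearly every byte, so the output still has to be read by someone who knows the language.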

Points 2 and 3 can be automated, and such tools do exist, but they are language-specific (i.e. a heuristic decoder for Russian won’t work for Japanese and vice versa) or, at least, require you to specify the input language (as enca does).

As for the Persian language, possible encodings include Windows-1256 (see this thread), ISO 8859-6, and the now obsolete Iran System encoding. Be glad that you don’t have to deal with a list of at least seven code pages, as you would for Russian (KOI7, KOI8, CP866, Windows-1251, ISO 8859-5, MacCyrillic, MIK).

Asain Kujovic · Accepted Answer · 2015-02-12 21:48:03Z

Maybe visually inspect all ~1000 encodings that iconv knows, by listing the first 20 lines of each conversion, merged into a single all.txt file.

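    #!/usr/bin/env bash
    # Try every encoding iconv lists and collect the first 20 lines of each result.
    line=$(printf "=%.0s" {1..50})
    for FMT in $(iconv -l); do
        # printf interprets \n; bash's plain echo would print it literally
        printf '%s\nFormat %s:\n%s\n' "$line" "$FMT" "$line"
        iconv -f "$FMT" -t UTF8 < inputFile.srt | head -n20
    done > all.txt
    # gedit all.txt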

... and find out which format is the right one (if you can recognize Persian).

This is overkill -- iconv -l brings up a large number of effectively duplicate aliases, as well as obviously unlikely candidates such as various CJKV encodings.
– tripleee, Commented Oct 6, 2015 at 7:36
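
One way to act on that objection is to reduce the iconv -l output to a shortlist before looping over it. The sketch below is an assumption-laden illustration: the function name shortlist_encodings is invented, and the clean-up step assumes the list uses spaces or commas between names and an optional trailing "//", which is what common iconv implementations print but is not guaranteed everywhere.

```shell
# Hypothetical helper: read an encoding list on stdin (e.g. from `iconv -l`)
# and print only the names matching a case-insensitive extended regex.
# $1 = pattern describing plausible encodings.
shortlist_encodings() {
    tr ' ,' '\n\n' |        # one name per line
        sed 's|//.*$||' |   # drop the trailing // some implementations print
        grep -v '^$' |      # remove empty lines left by the splitting
        grep -iE "$1" |     # keep only plausible candidates
        sort -u             # collapse exact duplicates
}

# Usage, with a pattern built from the Persian candidates named in the answers:
#   iconv -l | shortlist_encodings '1256|8859-6|arab'
```

Aliases of the same encoding still appear as separate names, so the shortlist may contain several entries that produce identical output.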
