Convert an ASCII file with octal escapes for UTF-8 codes to UTF-8

Question

I have a text file which is an ASCII file itself, but contains octal escape sequences representing codes in utf-8:
```
\350\207\252\345\212\250\346\216 
```
Is there some program or command that can convert such ASCII file to a text file actually encoded in utf-8?
By the way, this site is "Online ASCII(Unicode Escaped) to Unicode(UTF-8) converter tool", and this site is "Online Unicode(UTF-8) to ASCII(Unicode Escaped) converter tool". Do they make the conversion in my question? If not, what kinds of conversion do they make?

@Peter.O: I don't know how to put this, but in "\xxx", the "\" and "x" are characters encoded in ASCII. Download this as a text file: paste.ubuntu.com/11885151
– Tim, Commented Jul 16, 2015 at 0:45

Gilles 'SO- stop being evil' · Accepted Answer · 2015-07-16 01:31:29Z

If you have these escape sequences in a shell variable, in dash, mksh or bash:

printf %b "$string_with_backslash_escapes"

This isn't strictly POSIX: the %b specifier is POSIX, but POSIX requires octal escapes in its argument to have a 0 after the backslash (\0350 rather than \350). It also interprets other backslash escapes: \n as a newline, \t as a tab, etc.

Here's a perl one-liner that converts octal escape sequences only.

perl -pe 's[\\(?:([0-7]{1,3})|(.))] [defined($1) ? chr(oct($1)) : $2]eg'
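For comparison, here is a rough Python 3 sketch of the same substitution. It is not part of the original answer: the name `decode_octal_escapes` is made up for illustration, and truncating values above octal 377 to eight bits is an assumption of this sketch. It works on bytes so the result can be written out unchanged as UTF-8.

```python
import re

# A backslash followed by 1-3 octal digits, or by any other single byte.
_ESCAPE = re.compile(rb"\\(?:([0-7]{1,3})|(.))", re.DOTALL)


def decode_octal_escapes(data: bytes) -> bytes:
    """Turn each \\ooo into the byte it names; \\X (X not octal) becomes X."""

    def _replace(match):
        digits = match.group(1)
        if digits is not None:
            # Assumption of this sketch: values above 0o377 are truncated to 8 bits.
            return bytes([int(digits, 8) & 0xFF])
        return match.group(2)

    return _ESCAPE.sub(_replace, data)


sample = rb"\350\207\252\345\212\250"
decoded = decode_octal_escapes(sample)
# "\u81ea\u52a8" is the two-character string from the question, written with escapes.
print(decoded == "\u81ea\u52a8".encode("utf-8"))  # True
```

To convert a whole file, one could read it in binary mode, pass the contents through `decode_octal_escapes`, and write the result back out in binary mode.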

http://www.rapidmonkey.com/unicodeconverter/reverse.jsp interprets octal values as Latin-1 characters; I don't know why Unicode and UTF-8 are mentioned on the page. I have no idea what http://www.rapidmonkey.com/unicodeconverter/advanced.jsp does.

Community · Accepted Answer · 2021-10-07 07:34:52Z

Using just Bash:

3.1.2.4 ANSI-C Quoting

Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:

…

\nnn the eight-bit character whose value is the octal value nnn (one to three digits)

Demonstration in a UTF-8 terminal:

$ echo $'\350\207\252\345\212\250\346\216'
自动?

The last character displays as a question mark because the sequence is malformed: only two of the three required bytes are present.

The website you linked to performs RFC 5137 encoding/decoding.

If you enter \u81ea\u52a8 in the "ASCII (Unicode Escaped)" text area, you'll get 自动 as output, because 自 is Unicode Character U+81EA (whose UTF-8 representation is e8 87 aa in hex, or 350 207 252 in octal) and 动 is Unicode character U+52A8 (whose UTF-8 representation is e5 8a a8 in hex, or 345 212 250 in octal).
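The correspondence between code points, hex bytes and octal escapes described above can be checked with a few lines of Python 3. This is only an illustrative sketch; the helper name `describe` is hypothetical.

```python
def describe(ch: str):
    """Return (code point, UTF-8 bytes in hex, UTF-8 bytes as octal escapes)."""
    utf8 = ch.encode("utf-8")
    code_point = f"U+{ord(ch):04X}"
    hex_bytes = " ".join(f"{b:02x}" for b in utf8)
    octal_escapes = "".join(f"\\{b:03o}" for b in utf8)
    return code_point, hex_bytes, octal_escapes


# The two characters from the question, written as \u escapes.
for ch in "\u81ea\u52a8":
    print(*describe(ch))
# U+81EA e8 87 aa \350\207\252
# U+52A8 e5 8a a8 \345\212\250
```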

Does the POSIX shell definition currently (2018) define $'string'? Or is it currently a non-POSIX extension? If so, please provide a pointer.
– Juan, Commented Mar 23, 2018 at 20:13
@Juan I quoted the Bash manual. I don't believe that it is part of the POSIX standard.
– 200_success, Commented Mar 23, 2018 at 20:16
Thanks. I've seen other non-bash shells that support that syntax (mksh, FreeBSD 9.x+ /bin/sh), so I thought it might be a POSIX feature (or that there was a movement to make it one).
– Juan, Commented Mar 24, 2018 at 16:46

Wilmer · Accepted Answer · 2015-07-15 23:19:10Z

Python in the interactive shell can do at least some of this. But the sequence above appears to be corrupted:

wilmer@ruby:~$ python
Python 2.7.10 (default, Jul 1 2015, 10:54:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode("\350\207\252\345\212\250\346\216", "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: unexpected end of data
>>> unicode("\350\207\252\345\212\250", "utf-8")
u'\u81ea\u52a8'
>>> print unicode("\350\207\252\345\212\250", "utf-8")
自动
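The session above is Python 2. A rough Python 3 counterpart, offered only as a sketch, uses a bytes literal (which accepts the same octal escapes) and `bytes.decode`. The variable names are illustrative. Passing `errors="replace"` is one way to keep the valid prefix and mark the incomplete tail with U+FFFD instead of raising.

```python
truncated = b"\350\207\252\345\212\250\346\216"  # bytes literals accept octal escapes

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start and exc.end delimit the offending bytes (end is exclusive).
    print(exc.start, exc.end, exc.reason)  # 6 8 unexpected end of data

# The first six bytes are two complete three-byte characters.
complete = truncated[:6].decode("utf-8")

# Lenient decoding replaces the incomplete trailing sequence with U+FFFD.
lenient = truncated.decode("utf-8", errors="replace")

print(complete == "\u81ea\u52a8")        # True
print(lenient == "\u81ea\u52a8\ufffd")  # True
```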


Thanks. Do you know what kinds of conversion the two online tools in the second part of my question perform? Some examples of their conversions would make it clear.
– Tim, Commented Jul 15, 2015 at 23:20
My guess is that that page converts from hexadecimal Unicode escape sequences, whereas your bit above is UTF-8 (so two \xxx bits per actual character).
– Wilmer, Commented Jul 15, 2015 at 23:25
(1) "the sequence above appears to be corrupted". Do two "\xxx"s in my ASCII file represent a character? If yes, do six "\xxx"s represent three characters? If yes, why does the last line of code print only two characters? (2) What is the difference between "hexadecimal Unicode escape sequences" and "UTF-8 (so two \xxx bits per actual character)"?
– Tim, Commented Jul 15, 2015 at 23:52
@Tim UTF-8 uses a variable number of bytes per character. Look it up. In the sequence you gave, the first three bytes represent one character, the next three represent one character, and the last two are a valid prefix of several three-byte sequences that would represent one character.
– Gilles 'SO- stop being evil', Commented Jul 16, 2015 at 1:35


slm · Accepted Answer · 2016-09-10 13:26:10Z

The simplest way is ascii2uni -a K, for example:

cat escaped.txt | ascii2uni -a K > unescaped.txt

answered Sep 10, 2016 at 12:45 by VolCh; edited Sep 10, 2016 at 13:26 by slm
