13
  1. I have a text file which is an ASCII file itself, but contains octal escape sequences representing codes in utf-8:

    \350\207\252\345\212\250\346\216 

    Is there some program or command that can convert such ASCII file to a text file actually encoded in utf-8?

  2. By the way, this site is "Online ASCII(Unicode Escaped) to Unicode(UTF-8) converter tool", and this site is "Online Unicode(UTF-8) to ASCII(Unicode Escaped) converter tool". Do they make the conversion in my question? If not, what kinds of conversion do they make?
1
  • @Peter.O: I don't know how to put this, but in "\xxx", "\" and "x" are characters encoded in ASCII. Download this as a text file: paste.ubuntu.com/11885151 Commented Jul 16, 2015 at 0:45

4 Answers

12

If you have these escape sequences in a shell variable, in dash, mksh or bash:

printf %b "$string_with_backslash_escapes" 

This usage isn't POSIX: the %b specifier itself is POSIX, but POSIX requires a 0 after each backslash in an octal escape (\0350 rather than \350). %b also interprets other backslash escapes: \n as a newline, \t as a tab, etc.

Here's a perl one-liner that converts octal escape sequences only.

perl -pe 's[\\(?:([0-7]{1,3})|(.))] [defined($1) ? chr(oct($1)) : $2]eg' 
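For comparison, here is a minimal Python 3 sketch of the same idea as the Perl one-liner. It replaces each backslash-octal escape with the corresponding byte, turns any other backslash pair into the literal character, and then decodes the result as UTF-8. The function name `unescape_octal` is made up for illustration, and the sketch assumes the input is pure ASCII and that the escaped bytes form valid UTF-8.

```python
import re

# A backslash followed by 1-3 octal digits, or by any other single
# non-newline byte (mirrors the pattern in the Perl one-liner).
_ESCAPE = re.compile(rb"\\(?:([0-7]{1,3})|(.))")


def unescape_octal(text: str) -> str:
    """Decode backslash-octal escapes in ASCII text into UTF-8 text."""

    def replace(match):
        if match.group(1) is not None:
            # Octal escape: emit the single byte with that value.
            # Assumption: values above octal 377 are masked to one byte.
            return bytes([int(match.group(1), 8) & 0xFF])
        # Any other escaped character stands for itself (e.g. \\ -> \).
        return match.group(2)

    raw = _ESCAPE.sub(replace, text.encode("ascii"))
    return raw.decode("utf-8")


print(unescape_octal(r"\350\207\252\345\212\250"))  # 自动
```

Working on bytes rather than on `str` matters here: each escape is one byte of a multi-byte UTF-8 sequence, so decoding can only happen after all the escapes have been replaced.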

http://www.rapidmonkey.com/unicodeconverter/reverse.jsp interprets octal values as Latin-1 characters; I don't know why Unicode and UTF-8 are mentioned on the page. I have no idea what http://www.rapidmonkey.com/unicodeconverter/advanced.jsp does.

5

Using just Bash:

3.1.2.4 ANSI-C Quoting

Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:

\nnn the eight-bit character whose value is the octal value nnn (one to three digits)

Demonstration in a UTF-8 terminal:

$ echo $'\350\207\252\345\212\250\346\216'
自动?

The last character displays as a question mark because the sequence is malformed: only two of the three required bytes are present.
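To see why the last sequence counts as incomplete, here is a rough Python 3 sketch that reads the length of each UTF-8 sequence off its lead byte and splits the eight bytes accordingly. The helper names `expected_length` and `split_sequences` are hypothetical, and the sketch deliberately skips full validation (it does not check continuation bytes or overlong forms).

```python
def expected_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence starting with lead_byte has."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: single-byte (ASCII)
    if 0xC0 <= lead_byte < 0xE0:
        return 2  # 110xxxxx
    if 0xE0 <= lead_byte < 0xF0:
        return 3  # 1110xxxx
    if 0xF0 <= lead_byte < 0xF8:
        return 4  # 11110xxx
    raise ValueError("not a UTF-8 lead byte: %#x" % lead_byte)


def split_sequences(data: bytes) -> list:
    """Cut data into chunks, one per lead byte; the last may be short."""
    chunks = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


DATA = b"\350\207\252\345\212\250\346\216"  # the bytes from the question

for chunk in split_sequences(DATA):
    status = "complete" if len(chunk) == expected_length(chunk[0]) else "truncated"
    print(chunk.hex(), status)
```

The lead byte `\346` (0xe6) announces a three-byte sequence, but only one continuation byte follows it, so the final chunk comes out short.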


The website you linked to performs RFC 5137 encoding/decoding.

If you enter \u81ea\u52a8 in the "ASCII (Unicode Escaped)" text area, you'll get 自动 as output, because 自 is Unicode character U+81EA (whose UTF-8 representation is e8 87 aa in hex, or 350 207 252 in octal) and 动 is Unicode character U+52A8 (whose UTF-8 representation is e5 8a a8 in hex, or 345 212 250 in octal).
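The relationship between the two notations can be checked with a short Python 3 sketch: \u escapes name code points, while the octal escapes in the question name the individual bytes of the UTF-8 encoding. The function name `describe` is only illustrative.

```python
def describe(text: str) -> list:
    """For each character, return (code point, UTF-8 bytes in hex, in octal)."""
    rows = []
    for ch in text:
        utf8 = ch.encode("utf-8")
        rows.append((
            "U+%04X" % ord(ch),
            " ".join("%02x" % b for b in utf8),
            " ".join("%03o" % b for b in utf8),
        ))
    return rows


for row in describe("\u81ea\u52a8"):
    print(*row)
# U+81EA e8 87 aa 350 207 252
# U+52A8 e5 8a a8 345 212 250
```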

3
  • Does the posix shell definition currently (2018) define $'string'? Or is this currently a non-posix extension? If so, please provide a pointer. Commented Mar 23, 2018 at 20:13
  • @Juan I quoted the Bash manual. I don't believe that it is part of the POSIX standard. Commented Mar 23, 2018 at 20:16
  • Thanks. I've seen other non-bash shells that support that syntax (mksh, freebsd 9.x+ /bin/sh). So I thought it might be a posix feature (or a movement to make it so). Commented Mar 24, 2018 at 16:46
2

Python in the interactive shell can do at least some of this. But the sequence above appears to be corrupted:

 wilmer@ruby:~$ python
 Python 2.7.10 (default, Jul 1 2015, 10:54:53)
 [GCC 4.9.2] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> unicode("\350\207\252\345\212\250\346\216", "utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: unexpected end of data
 >>> unicode("\350\207\252\345\212\250", "utf-8")
 u'\u81ea\u52a8'
 >>> print unicode("\350\207\252\345\212\250", "utf-8")
 自动
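The session above is Python 2. In Python 3 the same experiment would look roughly like the sketch below, where a bytes literal holds the octal escapes and `bytes.decode` does the conversion. The helper name `try_strict` is made up for this example.

```python
complete = b"\350\207\252\345\212\250"   # two full three-byte sequences
truncated = complete + b"\346\216"       # plus the incomplete tail


def try_strict(data: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(try_strict(complete))              # 自动
print(try_strict(truncated))             # None: strict decoding rejects the tail

# errors="replace" keeps the good characters and substitutes U+FFFD
# (the replacement character) for the bytes that cannot be decoded.
lenient = truncated.decode("utf-8", errors="replace")
print(ascii(lenient))
```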
4
  • thanks. do you know that what kinds of conversion the two online tools in my second part make? Some examples of their conversions will be clear. Commented Jul 15, 2015 at 23:20
  • My guess is that that page converts from hexadecimal Unicode escape sequences, whereas your bit above is UTF-8 (so two \xxx bits per actual character). Commented Jul 15, 2015 at 23:25
  • (1) "the sequence above appears to be corrupted". Does two "\xxx"'s in my ASCII file represent a character? If yes, does six "\xxx"'s represent three characters? If yes, why the last line of code only print out two characters? (2) what is the difference between "hexadecimal Unicode escape sequences" and "UTF-8 (so two \xxx bits per actual character)"? Commented Jul 15, 2015 at 23:52
  • @Tim UTF-8 uses a variable number of bytes per character. Look it up. In the sequence you gave, the first three bytes represent one character, the next three represent one character, and the last two are a valid prefix of several three-byte sequences that would represent one character. Commented Jul 16, 2015 at 1:35
1

The simplest way is ascii2uni -a K (from the uni2ascii package), for example:

cat escaped.txt | ascii2uni -a K > unescaped.txt 
