How to recognize which ASCII character in hex is this?

Question

We have a text file that we want to clean of "bad" characters. If we open it with vim (with ":set number"):

57000044 zo¥<9a>¥ge¥o¥graph¥i¥cal¥ly
39999999 pariá¹<83>Å<9b>a

The "<9a>", "<83>" and "<9b>" placeholders are marked blue in vim, and these two lines look like this outside vim:

$ sed '57000044,57000044!d' toclean.txt
zo���ge�o�graph�i�cal�ly
$ sed '57000044,57000044!d' toclean.txt | cat -vte -
zoM-%M-^ZM-%geM-%oM-%graphM-%iM-%calM-%ly$
$

and

$ sed '39999999,39999999!d' toclean.txt
pariṃśa
$ sed '39999999,39999999!d' toclean.txt | cat -vte -
pariM-aM-9M-^CM-EM-^[a$
$

Question: How do we find out the hex ASCII code for the mentioned "<9a>", "<83>" and "<9b>"? Or for "¹" or "¥"...

The hex code is needed to remove these characters from the file to make it cleaner. For example, this code removes hex ASCII "x09", i.e. the "Horizontal Tab":

sed -i 's/[\x09]//g' toclean.txt

We tried using "9A" or "A5" in hex, but it didn't help:

$ sed '57000044,57000044!d' toclean.txt | sed 's/[\x9A]//g; s/[\xA5]//g'
zo���ge�o�graph�i�cal�ly
zo���ge�o�graph�i�cal�ly
$

Janis · Accepted Answer · 2015-04-03 06:49:42Z

The codes hex:<9a> and hex:<83> are not ASCII codes (ASCII codes only go from <00> to <7F>). You also cannot "find out" what characters (from any character set larger than ASCII) are associated with those codes, since that depends on the underlying character set ("code page") encoding. So you have to ask whoever created the data which character encoding they used. (Typical encodings that you often find are ISO 8859-1, ISO 8859-15, UTF-8 and UCS-2. You can also inspect the code tables you find on the net to see which characters at those indices make the most sense in the context of your data.)

Once you know the code values you want to remove, you can (for example) use the tr command with option -d (arguments in octal).
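As a minimal sketch of that suggestion: assuming the bytes to drop turn out to be 0x9A and 0xA5 (octal 232 and 245, which is what the M-^Z and M-% in the question's cat -v output stand for), it could look like the following. The helper name strip_bytes is made up for illustration.

```shell
# Hypothetical helper: delete the bytes 0x9A (octal 232) and 0xA5 (octal 245)
# from standard input. LC_ALL=C makes tr treat its input as single bytes.
strip_bytes() {
    LC_ALL=C tr -d '\232\245'
}

# Demo on a sample line built with printf octal escapes.
printf 'zo\245\232\245ge\245o\n' | strip_bytes   # prints: zogeo
```

On the real file this would be used as strip_bytes <toclean.txt >clean.txt, since tr only reads standard input.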

Gilles 'SO- stop being evil' · Accepted Answer · 2015-04-04 01:12:18Z

ASCII is a 7-bit character set. Characters with values of 128 and above are non-ASCII characters.

If you use Unicode, note that a character may be represented by multiple bytes (there are only 256 different byte values but more than 100000 Unicode characters). The de facto standard representation of Unicode is UTF-8, which uses a variable number of bytes per character; ASCII characters are represented by a single byte, others by 2 to 4 bytes.
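To make the variable length concrete, here is a small sketch that counts the bytes of a few UTF-8 encoded characters. The helper name byte_count is hypothetical, and the characters are written as octal escapes so the example does not depend on the terminal encoding. The three bytes of U+1E43 are the same E1 B9 83 that show up as M-aM-9M-^C in the question's cat -v output.

```shell
# Hypothetical helper: count the bytes produced by a printf format string.
byte_count() {
    printf "$1" | LC_ALL=C wc -c | tr -d ' '
}

byte_count 'a'                  # ASCII letter a: prints 1
byte_count '\302\245'           # U+00A5 (yen sign): prints 2
byte_count '\341\271\203'       # U+1E43 (m with dot below): prints 3
byte_count '\360\237\230\200'   # U+1F600 (an emoji): prints 4
```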

Vim displays some characters with blue placeholders such as <9a> because these are bytes that are not part of a valid character representation in the character set specified by the current locale.
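One way to test for such invalid sequences outside Vim, sketched here under the assumption that the file is meant to be UTF-8 and that an iconv command is installed, is to convert from UTF-8 to UTF-8 and look at the exit status. The helper name check_utf8 is made up for illustration.

```shell
# Hypothetical helper: report whether standard input is valid UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
check_utf8() {
    if iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo valid
    else
        echo invalid
    fi
}

# Bytes of line 39999999 from the question: a well-formed UTF-8 sequence.
printf 'pari\341\271\203\305\233a\n' | check_utf8
# Start of line 57000044: 0xA5 0x9A 0xA5 cannot begin a UTF-8 character.
printf 'zo\245\232\245ge\n' | check_utf8
```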

If you want to work on bytes, set the locale setting LC_CTYPE to C.

LC_CTYPE=C vim toclean.txt

If you want to work on UTF-8, run Vim on a Unicode terminal.

You can display the bytes in the file with a command such as od (POSIX) or hexdump (BSD, often found on Linux).

od -t x1 toclean.txt
hexdump -C toclean.txt
hd toclean.txt
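For a file this large it is probably more practical to dump a single line than the whole file. A rough sketch, with a hypothetical helper name hex_bytes and sample bytes copied from the question's line 39999999:

```shell
# Hypothetical helper: print the bytes of standard input as hex pairs on one line.
hex_bytes() {
    # Unquoted on purpose: word splitting collapses od's variable spacing.
    set -- $(od -An -v -t x1)
    echo "$*"
}

printf 'pari\341\271\203\305\233a\n' | hex_bytes
# prints: 70 61 72 69 e1 b9 83 c5 9b 61 0a
```

Against the real file, something like sed -n '39999999p' toclean.txt | od -An -t x1 should show the same bytes for that line.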

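If you've determined that you want to remove certain byte values, you can use tr. Note that tr reads standard input rather than a file argument, and that it portably takes byte values as octal escapes (\203 is hex 83, \245 is hex a5).

LC_CTYPE=C tr -d '\203\245' <toclean.txt >clean.txt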

If you've determined that you want to remove certain UTF-8 characters, use tr in a locale with the UTF-8 encoding, e.g.

LC_CTYPE=en_US.utf8 tr -d '¥' <toclean.txt >clean.txt
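Some tr implementations work byte by byte even in a UTF-8 locale (the GNU coreutils manual says tr fully supports only single-byte characters), in which case deleting '¥' would delete the bytes 0xC2 and 0xA5 individually and could damage other characters. A possible workaround, sketched under that assumption, is to remove the exact byte sequence with sed in the C locale. The helper name strip_yen is hypothetical.

```shell
# Hypothetical helper: remove every occurrence of the UTF-8 encoding of U+00A5
# (bytes 0xC2 0xA5) from standard input, treating the stream as raw bytes.
strip_yen() {
    yen=$(printf '\302\245')
    LC_ALL=C sed "s/$yen//g"
}

# The trailing \303\245 is U+00E5; it shares the byte 0xA5 and must survive.
printf 'zo\302\245ge\302\245o \303\245\n' | strip_yen
```

Because the pattern is the full two-byte sequence, characters that merely contain 0xA5 as one of their bytes are left alone.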

Morgan · Accepted Answer · 2015-04-04 22:34:18Z

The simplest solution I was able to find for removing "non-ascii" characters from a text file was from this thread.

$ tr -cd '\000-\777' < dirtyfile > cleanfile

The '\000-\777' defines the ascii set in octal. "-c" is the complement of the given set, aka "non-ascii", and "-d" deletes characters.

You mean \177 (decimal 127). \777 is a lot more than the maximum possible byte value. – alexis, Commented Jun 8, 2015 at 2:20
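With the correction from the comment applied, the filter might be sketched as follows. The helper name ascii_only is made up, and the sample bytes are those of line 39999999 from the question.

```shell
# Hypothetical helper: keep only bytes 0x00-0x7F (octal 000-177) and delete
# everything else. LC_ALL=C makes tr work on single bytes.
ascii_only() {
    LC_ALL=C tr -cd '\000-\177'
}

printf 'pari\341\271\203\305\233a\n' | ascii_only   # prints: paria
```

Keep in mind that this also silently drops legitimate non-ASCII letters, as the example shows: the ṃ and ś are simply gone.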

mikeserv · Accepted Answer · 2015-04-04 05:43:54Z

You can just use luit. Its purpose is to clean terminal text to suit the system's encoding and to act as a transparent filter between applications which improperly handle Unicode and terminals, or the other way around.

You almost definitely already have it installed - it ships standard with X because xterm calls on it automatically if it detects encoding issues on its host.

Its man page describes this example for interaction with Emacs:

luit is also useful with applications that hardwire an encoding that is different from the one normally used on the system or want to use legacy escape sequences for multilingual output. In particular, versions of Emacs that do not speak UTF-8 well can use luit for multilingual output:
```
$ luit -encoding 'ISO 8859-1' emacs -nw 
```

And then, in Emacs,

 M-x set-terminal-coding-system RET iso-2022-8bit-ss2 RET

Besides its direct terminal applications, though, it also supports...

-c Function as a simple converter from standard input to standard output.

And so might be used like...

luit -c <infile >outfile
