
Curiosity here. I cat-ed a bitmap and expected it to display one ASCII character for each byte in the file. The first few characters were as expected (BM6), but I noticed further down that it also displayed non-ASCII characters in the terminal like "ڮ", "Ѿ", "ӷ", etc.

Why is this? What is cat doing here?

(The bitmap I used has bitsPerPixel=8, so it can't be a representation of a multi-byte pixel, right?)

  • cat's doing nothing except sending characters to the terminal, which is doing what terminals do: display characters. Commented Jan 2, 2020 at 2:03
  • 2
    You probably have a UTF-8 capable terminal emulator. This means it will display many non asciii characters assuming that it has the font information to do so. So if the first byte in the file is 0x42 it will display "B", and if later there is 0xd1 followed by 0xbe then you will get "Ѿ" Commented Jan 2, 2020 at 2:10
  • I see. I'm used to thinking of encoding as a property of the file, but I guess that doesn't make sense with a bitmap. So now I'm trying to get my Mac terminal to cat the bitmap with just an ASCII representation. (Proving tricky!) Commented Jan 2, 2020 at 3:06
  • To those interested -- I found that cat file.bmp | hexdump -C is a nice way of displaying the bytes in the BMP as hex plus ASCII Commented Jan 2, 2020 at 3:39
  • 1
    hexdump -C file.bmp will do the same thing with no pipe and just one program running. Also od -c file.bmp. Commented Jan 2, 2020 at 4:00
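
A quick way to reproduce what that comment describes, without needing a bitmap at all, is to emit those exact bytes yourself (the octal escapes below are just 0x42, 0xd1 and 0xbe written in the form the printf utility accepts):

    # Emit the bytes 0x42, 0xd1, 0xbe and a newline.
    # A UTF-8 terminal shows "BѾ": 0x42 is ASCII "B", and the pair
    # 0xd1 0xbe decodes to the single code point U+047E ("Ѿ").
    printf '\102\321\276\n'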

1 Answer

It writes them to standard output.

What happens next is up to whatever standard output is. If it is a terminal device, then the behaviour is determined by the terminal, and has nothing to do with cat.
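
A simple way to convince yourself of that (file.bmp and copy.bmp are just placeholder names here) is to point standard output somewhere other than a terminal and compare:

    # cat copies its input to standard output byte for byte; when stdout is
    # a regular file rather than a terminal, nothing is "displayed" at all.
    cat file.bmp > copy.bmp
    cmp file.bmp copy.bmp && echo "byte-for-byte identical"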

When it comes to the behaviour of the terminal, several things are important:

  • ASCII is a 7-bit character encoding. An 8-bit byte can represent twice as many values as the ASCII character set defines. What the other 128 values mean has been the subject of a lot of back and forth in the 1970s, 1980s, and 1990s. We had single-byte encodings, double-byte encodings, code pages, ISO 8859, and the parts of ASCII itself that were intended to be variant (or that changed in later editions of the standard). And that is not even to get into the complexities of ISO 2022/ECMA-35 and switchable character sets.

    There's a myth that there's a "plain ASCII" world out there. This has not really been true for almost half a century. You will almost never be in a situation nowadays where you are looking at solely actual ASCII.

  • Nowadays, a terminal emulator is, more likely than not, using UTF-8, a variable-length encoding of Unicode in which each code point is encoded as one to four bytes (illustrated in the sketch after this list). It was starting to become commonplace 15 years ago.
  • Even were you working with just ASCII, some characters are printing characters, which a terminal renders with a printable glyph, and others are control characters, which have various non-printing effects (including, in the case of ␀, no effect at all); this is also shown in the sketch below. Unicode is more complex, but the basic idea that only some code points are actually displayable is still roughly true.
  • Terminals decode their received byte streams into characters according to what character encoding they are currently set to use. This varies quite significantly from terminal to terminal. Terminal emulators usually have some menu option where this decoding can be changed on the fly by the user, at whim.
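
As a small illustration of those last two points (the byte values are the same Ѿ pair from the comments above; wc -c counts bytes, wc -m counts characters in the current locale):

    # Two bytes in the stream, but one character on a UTF-8 terminal.
    printf '\321\276' | wc -c    # -> 2 (bytes)
    printf '\321\276' | wc -m    # -> 1 (character, in a UTF-8 locale)

    # A control character has an effect rather than a glyph:
    printf 'A\tB\n'              # the 0x09 (tab) byte moves the cursor; nothing is drawn for it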

On the gripping hand, looking at a bitmap by printing it to a terminal with cat is madness. Learn the joy of hexdump, or od. No cat should be involved at all.
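
For example (file.bmp stands in for whatever bitmap you were inspecting; both commands were already mentioned in the comments):

    # Canonical hex + ASCII dump, 16 bytes per line; no cat, no pipe.
    hexdump -C file.bmp | head

    # Or od, which shows printable bytes as characters and everything else as escapes.
    od -c file.bmp | head

Either way you see every byte of the file without asking the terminal to interpret any of them.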
