
I have a file I created (in vim), for testing purposes (testing UTF-8 output in an SSH client). Odd things, however, are happening to this file.

I wondered what bytes were in the file, so I used hexdump:

username@computername:~$ hexdump -x intl.txt
0000000    9ecf    000a
0000003

OK, so there are four bytes in there. How the 00 and the 0a got in there, I'm not clear, but whatever. Here's where it gets weird, though:

username@computername:~$ ls -al intl.txt
-rw-rw-r-- 1 username username 3 Mar 26 15:14 intl.txt

Wait, it's three bytes? What's going on here?

As if that wasn't odd enough, hexdump -C gives very different output:

username@computername:~$ hexdump -C intl.txt
00000000  cf 9e 0a                                          |...|
00000003

Vim is also a bit confused about the file. When I start it up, it gives this in the status line:

"intl.txt" 1L, 3C 

Up top, however, I get this (using set list):

Ϟ$
~
~
~
~

So, it thinks there are 3 characters, but only prints one. I could understand if it printed the koppa and a blank line under it...

  • hexdump isn't a tool to measure file sizes. Look at the man page; you'll see "zero-filled" in most of the output format descriptions. Commented Mar 26, 2014 at 21:43
  • Also it looks like hexdump -x outputs the 2-byte pairs little-endian. Commented Mar 26, 2014 at 21:43
  • so, if I want the actual bytes in the file (not padded and not rearranged), what can I use instead of hexdump? Commented Mar 26, 2014 at 21:44
  • Were you not happy with hexdump -C? Commented Mar 26, 2014 at 21:45
  • That file contains 3 bytes in one line containing one 2-byte UTF-8 encoded character. See od -vtx1 to see the hex values. Commented Mar 26, 2014 at 21:55

3 Answers


As others have pointed out, this is because hexdump -x treats the file as a sequence of 2-byte words. On a little-endian system (which almost all desktops are), the two bytes of each word are swapped before being displayed, so the byte values are printed in pairs with their order reversed. Since you have an odd number of bytes, hexdump pads with a zero byte to complete the final word; that zero is then swapped with the 0a. This is documented behaviour for hexdump, so it is not lying to you!
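You can reproduce the padding and swapping from scratch; a sketch, assuming a little-endian machine (the printf bytes are the ones from your dump):

```shell
# Recreate the 3-byte file: the UTF-8 koppa (0xCF 0x9E) plus the trailing newline
printf '\xcf\x9e\x0a' > intl.txt

wc -c intl.txt        # 3 bytes on disk, matching ls
hexdump -x intl.txt   # 9ecf 000a: padded to a full 2-byte word, then byte-swapped
od -An -tx1 intl.txt  # cf 9e 0a: one byte per column, in file order
```

od -vtx1, mentioned in the comments, gives the same byte-at-a-time view on systems where hexdump is not available.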

hexdump -C is the better command for formatted output, since it shows the bytes in the order they appear in the file. Also, the 0a is a newline and was probably added quietly by whatever created the file (vim does this by default). E.g., echo will always add a newline if you don't tell it not to. In bash:

echo -e '\xcf\x9e' | hexdump -C 

will give the same result, but suppressing the newline with -n will give what you expected:

echo -ne '\xcf\x9e' | hexdump -C 

To stop vim from adding the newline:

:set noeol
:set binary
  • Interesting that vim adds the newline, but doesn't show it. I never knew that it did that. Clearly, it was my misunderstanding of the purpose of hexdump that was leading me astray. od -vtx1 (and indeed hexdump -C) are apparently what I was looking for. Thanks! Commented Mar 27, 2014 at 14:24
  • @Mark A lot of compilers don't like code that doesn't have a newline at the end of the file, which is why many editors add one. Commented Mar 27, 2014 at 14:42
  • Geany adds a \n even to an empty file, but gedit and vim add the newline character only to files that contain at least one character. Commented Dec 20, 2018 at 0:08

hexdump -x displays the values as if they were 2-byte integers. On a little-endian machine this will display each pair of bytes in swapped order, treating them as two-byte quantities with the high-order (second) byte first, followed by the low-order (first) byte.

As you've seen, using hexdump -C displays the actual bytes. The actual contents of your file are the two bytes 0xCF 0x9E, followed by the newline character 0x0A. Vim and ls are correctly telling you that there are 3 bytes (2 characters). The first two bytes comprise one Unicode character using the UTF-8 encoding.
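Those first two bytes can be decoded by hand: 0xCF matches the 2-byte UTF-8 lead pattern 110xxxxx and 0x9E the continuation pattern 10xxxxxx, so the code point is assembled from 5 + 6 payload bits. A quick check in shell arithmetic:

```shell
# 0xCF = 110 01111  (lead byte: 5 payload bits)
# 0x9E = 10 011110  (continuation byte: 6 payload bits)
printf 'U+%04X\n' $(( ((0xCF & 0x1F) << 6) | (0x9E & 0x3F) ))   # U+03DE, GREEK LETTER KOPPA
```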

More interesting information is in the comments above.


If you are having trouble understanding endianness, here's another illustration.

#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>

int main (void)
{
    uint16_t x = 1;
    write(1, &x, 2);
    x = 2;
    write(1, &x, 2);
    return 0;
}

This C code writes out two 16-bit values, 1 and 2. When we think about values, we think of them as big-endian, so the padding here (to make these 16-bit values) would mean a zero byte followed by a byte worth 1 (or 2). However, because the system is little-endian and treats these as two discrete 16-bit (2-byte) units, the four bytes that literally get written out are 1, 0, 2, 0.

If you compile that (gcc whatever.c) and redirect to a file (./a.out > dword), hexdump -C will show you the physical order of the bytes:

> hexdump -C dword
00000000  01 00 02 00                                       |....|
00000004

But in this case, hexdump -x will provide a more correct interpretation in terms of meaning, because it swaps the bytes to show the correct two 16-bit values:

> hexdump -x dword
0000000    0001    0002
0000004
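The same four bytes can also be produced without a compiler; a sketch using printf and od (the -tx2 reading assumes a little-endian machine):

```shell
printf '\x01\x00\x02\x00' > dword

od -An -tx1 dword   # 01 00 02 00: physical byte order in the file
od -An -tx2 dword   # 0001 0002: 16-bit words, swapped back to their values
```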

If those four bytes are instead interpreted as a (little endian) 32-bit integer:

> hexdump -e '"%d\n"' dword
131073

Because it is translating the following 32 bits of binary into a decimal value:

00000001 00000000 00000010 00000000 

As a big-endian value, that would be 2^9 (512) + 2^24 (16777216). This is what I mean by us "thinking" in big-endianness. If we write out a binary number we use big-endian bit order (one byte 00000010 == 2), and so when the number is longer than one byte, we use big-endian byte order (two bytes 0000000000000010 == 2).

But since the system is little endian,1 if we wanted to write those bytes out as a binary number padded to 32 places (with the same spaces every 8 digits for readability), we'd have:

00000000 00000010 00000000 00000001 

In decimal, 2^17 (131072) + 2^0 (1). And indeed, if you replace the body of the program with:

int main (void)
{
    uint32_t x = 131073;
    write(1, &x, 4);
    return 0;
}

Compile it and write to a file, and you will get exactly the same output from hexdump as before, because the file contains exactly the same thing.
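Both readings check out with shell arithmetic, using the hex constants from the dumps above:

```shell
echo $(( 0x00020001 ))   # 131073 = 2^17 + 2^0 (little-endian reading of 01 00 02 00)
echo $(( 0x01000200 ))   # 16777728 = 2^24 + 2^9 (big-endian reading of the same bytes)
```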

1. Note that when we talk about endianness, it virtually always refers to byte order. Since the smallest addressable unit is effectively the byte, its bit ordering is inconsequential.

  • I understand endianness (I'm a programmer by trade), it was my misunderstanding of hexdump (I'm not a Unix person by trade) that was causing my confusion. Thanks for the excellent explanation, though. Commented Mar 27, 2014 at 14:22
