A common way to do things with a couple of files is—and don't hit me for that:

for f in $(ls); do … 

Now, to be safe against files with spaces or other strange characters, a naive way would be to do:

find . -type f -print0 | while IFS= read -r -d '' file; … 

Here, the -d '' is short for setting the ASCII NUL as in -d $'\0'.
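For reference, a complete version of that loop might look like the sketch below. It assumes bash and a find that supports -print0 (GNU and BSD find do). The scratch directory and file names are made up for illustration. Process substitution is used instead of a pipe so that the counter survives the loop.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory with awkward file names (illustration only).
tmpdir=$(mktemp -d)
touch "$tmpdir/plain" "$tmpdir/with space" "$tmpdir/with"$'\n'"newline"

count=0
# -print0 ends each path with a NUL byte; read -d '' splits on that byte.
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal.
while IFS= read -r -d '' file; do
    count=$((count + 1))
    printf 'found: %q\n' "$file"
done < <(find "$tmpdir" -type f -print0)

echo "files seen: $count"
rm -rf "$tmpdir"
```

With a plain pipe (find … | while …) the loop body would run in a subshell, so count would still be 0 after the loop.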

But why is that so? Why are '' and $'\0' the same? Is that due to the C roots of Bash with an empty string always being null-terminated?

  • Referring to the "naïve" way, is there a better way of doing this? Commented Jan 12, 2013 at 15:44
  • By the way, if you want to iterate safely over a set of files, use for f in * instead of parsing ls. Commented Jan 12, 2013 at 16:01
  • @htor I know for i in $(ls) is terribly stupid—I'm almost ashamed I used it as a bad example here. Commented Jan 12, 2013 at 16:04
  • @ChandraRavoori Yes, for example by using find … -exec instead of looping around files, which works for most cases where you'd use such a for loop instead. Here, find takes care of everything for you. Commented Jan 12, 2013 at 16:06
  • @slhck, thanks. What about situations involving multi-step operations on each file where a loop may be preferable for readability reasons? Is there a better loop option than the "naïve way" above? Commented Jan 12, 2013 at 16:13

2 Answers

The man page of bash reads:

 -d delim The first character of delim is used to terminate the input line, rather than newline. 

Because bash stores strings as null-terminated C strings, the first byte of an empty string is the terminating null byte. Makes sense to me. :)

The source reads:

 static unsigned char delim;
 [...]
 case 'd':
     delim = *list_optarg;
     break;

For an empty string delim is simply the null byte.
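The "first character only" rule from the man page is easy to check from the shell. This is a small sketch, assuming bash. The input string and the variable name first are made up for illustration.

```shell
#!/usr/bin/env bash
# Only the first character of the -d argument is used: 'xyz' acts like 'x'.
# The input 'abxcd' is read up to (not including) the first 'x'.
IFS= read -r -d 'xyz' first <<< 'abxcd'
echo "$first"   # prints: ab
```

The y and z in the delimiter argument are never looked at, just as the code above only dereferences the first byte of list_optarg.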

  • When you say "strings are usually null terminated", is that not the case somewhere in a POSIX environment? From the days when I was learning C for school, of course it makes sense to assume so; I was just checking. Commented Jan 12, 2013 at 8:45
  • But one could regard any string as containing arbitrarily many empty strings, e.g. if you concatenate '' and "X" you get "X". So you could argue that the first substring bash encounters is the empty string. For example, if you use the empty string in JavaScript's split(), it will split between each character. I suspect "for historical reasons" may be the best explanation we can get. Commented Jan 12, 2013 at 8:48
  • Well, not quite because "concatenating" a C-style '\0' with 'X\0' should give you 'X\0', if done right. This doesn't have much to do with high-level functions in languages such as JavaScript @don Commented Jan 12, 2013 at 9:06
  • Thanks, michas, for adding the source. delim = *list_optarg; makes it clear why it's that way. Commented Jan 12, 2013 at 9:08
  • @slhck: Sorry, I didn't make myself clear. You asked "why are '' and $'\0' the same?", and michas gave the proximate explanation of "that's what the code does". I outlined an alternative way of handling the empty string that I saw as equally reasonable and suggested that choosing one or the other was simply a matter of convention or happenstance. Commented Jan 12, 2013 at 12:16


There are two deficiencies in bash that compensate each other.

When you write $'\0', that is internally treated identically to the empty string. For example:

$ a=$'\0'; echo ${#a}
0

That's because internally bash stores all strings as C strings, which are null-terminated — a null byte marks the end of the string. Bash silently truncates the string to the first null byte (which is not part of the string!).

$ a=$'foo\0bar'; echo "$a"; echo ${#a}
foo
3

When you pass a string as an argument to the -d option of the read builtin, bash only looks at the first byte of the string. But it doesn't actually check that the string is not empty. Internally, an empty string is represented as a 1-element byte array that contains just a null byte. So instead of reading the first byte of the string, bash reads this null byte.

Then, internally, the machinery behind the read builtin works well with null bytes; it keeps reading byte by byte until it finds the delimiter.
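To see the two effects cancel out, one can read the same NUL-delimited input once with -d '' and once with -d $'\0' and compare. This is a minimal sketch, assuming bash. The helper data and the array names are made up for illustration.

```shell
#!/usr/bin/env bash
# Two NUL-terminated records; the first contains an embedded newline.
data() { printf 'one\ntwo\0three\0'; }

empty_delim=()
while IFS= read -r -d '' rec; do
    empty_delim+=("$rec")
done < <(data)

nul_delim=()
# In bash, $'\0' is already the empty string by the time read sees it.
while IFS= read -r -d $'\0' rec; do
    nul_delim+=("$rec")
done < <(data)

echo "${#empty_delim[@]} ${#nul_delim[@]}"   # prints: 2 2
[[ "${empty_delim[*]}" == "${nul_delim[*]}" ]] && echo same
```

Both loops produce the same two records, and the newline inside the first record is kept because it is no longer the delimiter.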

Other shells behave differently. For example, ash and ksh ignore null bytes when they read input. With ksh, read -d "" reads until a newline. Shells are designed to cope well with text, not with binary data. Zsh is an exception: it uses a string representation that copes with arbitrary bytes, including null bytes; in zsh, $'\0' is a string of length 1 (but read -d '', oddly, behaves like read -d $'\0').

  • The behavior of read changed in bash 4.3 so that it now skips null bytes. For example read x< <(printf a\\0a) sets x to aa instead of a. Commented Jun 9, 2014 at 2:45
