Ksh loses data after piping 16K bytes

Question

I recently found that ksh may lose some data after printing more than 16K bytes to stdout if its output is blocked for a couple of seconds.

This test.sh script prints out 257*64 (16448) bytes:

#!/usr/bin/ksh
i=0
while [[ i -lt 257 ]]
do
    x=$(file /tmp)
    echo "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDE"
    i=$((i+1))
done | while read datafile
do
    echo $datafile
done

I performed the following test:

0 $ ./test.sh | wc -c
16448
0 $ ./test.sh | (sleep 3; wc -c)
16384
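As a quick sanity check on the two counts above (a sketch, assuming each echo emits 63 characters plus one newline), the delayed run is short by exactly one 64-byte line, and 16384 is 16 × 1024, which matches the 16K in the title:

```shell
# Arithmetic on the byte counts reported above; nothing here talks to ksh.
line_len=64        # 63 characters plus the newline appended by echo (assumed)
lines=257          # loop iterations in test.sh
expected=$((lines * line_len))
observed=16384     # count reported for: ./test.sh | (sleep 3; wc -c)
missing=$((expected - observed))

echo "expected: $expected bytes"
echo "observed: $observed bytes ($((observed / 1024)) KiB)"
echo "missing:  $missing bytes, i.e. $((missing / line_len)) line(s)"
```

The numbers are consistent with a single 64-byte write going missing rather than an arbitrary chunk of data.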

The line x=$(file /tmp) seems to affect this behaviour although it does not pipe anything to the second loop.

If I use bash, it works as expected.

It looks like a bug in ksh to me. I am using Solaris 10 (SunOS 5.10). Is there a solution or workaround for this? What is the root cause of this issue? I guess it might be related to the pipe buffer size.

Thanks, Peter

EDIT:

Running the test with truss, I can see an error when writing the last 64 bytes:

ioctl(0, I_PEEK, 0x08046B40) = 0
    Received signal #18, SIGCLD, in write() [caught]
      siginfo: SIGCLD CLD_EXITED pid=6561 status=0x0000
write(1, " 0 1 2 3 4 5 6 7 8 9 A B".., 64) Err#4 EINTR
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
setcontext(0x08046670)
read(0, 0x0809064C, 1) = 0
ioctl(0, TCGETA, 0x08046B18) Err#22 EINVAL

Running the same script with dtksh produces the trace below. As Stéphane indicated, the failed write is reattempted.

ioctl(0, I_PEEK, 0x08046694) = 1
read(0, " 0 1 2 3 4 5 6 7 8 9 A B".., 64) = 64
    Received signal #18, SIGCLD, in write() [caught]
      siginfo: SIGCLD CLD_EXITED pid=28276 status=0x0000
write(1, " 0 1 2 3 4 5 6 7 8 9 A B".., 64) Err#4 EINTR
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
waitid(P_ALL, 0, 0x08046500, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) = 0
waitid(P_ALL, 0, 0x08046500, WEXITED|WTRAPPED|WSTOPPED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x08046510, 0x08046580) = 0
setcontext(0x08046430)
write(1, 0x080F0FD8, 64) (sleeping...)
write(1, " 0 1 2 3 4 5 6 7 8 9 A B".., 64) = 64
ioctl(0, I_PEEK, 0x08046694) = 0

Try running the script with truss to see what's going on. Small writes are meant to be atomic on pipes, but ksh is known sometimes to use sockets instead of pipes for shell pipes. Also, ksh is known to do wild optimisations that sometimes affect behaviour. truss will tell you, for instance, whether ksh does one write(2) per echo, or whether some write(2) calls fail or are partial. You could also use lsof to check whether ksh uses pipes or sockets. FWIW, I can't reproduce it with ksh93u+ on Linux. What version of ksh are you using?
– Stéphane Chazelas, Commented Dec 18, 2012 at 10:40
Yes, I also tried on Linux using PD KSH v5.2.14 and it works fine. I am not sure what ksh version is shipped with SunOS 5.10; this is what I get: grep -i ver /usr/bin/ksh @(#)Version M-11/16/88i
– Peter Miklos, Commented Dec 18, 2012 at 12:31
running with truss I can see something abnormal, though I don't really know what it means: ioctl(0, I_PEEK, 0x08046B40) = 0 Received signal #18, SIGCLD, in write() [caught] siginfo: SIGCLD CLD_EXITED pid=6561 status=0x0000 write(1, " 0 1 2 3 4 5 6 7 8 9 A B".., 64) Err#4 EINTR lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] setcontext(0x08046670) read(0, 0x0809064C, 1) = 0 ioctl(0, TCGETA, 0x08046B18) Err#22 EINVAL — Peter Miklos
– Peter Miklos, Commented Dec 18, 2012 at 12:39
It would make more sense to add that information to the question. Also, it would be helpful to see whether it's the first echo or the second that gets interrupted (for example, change the second to "echo x$datafile"). It looks like a ksh bug indeed. I suspect ksh doesn't wait for the "file" command to terminate (it just waits until EOF when reading its output), and the SIGCLDs are handled later on. That's fine in the general case, but when writing to a full pipe, the write for that echo blocks and so is likely to be interrupted by the SIGCLD. Doing (echo ...), that is, using a subshell, might work around the problem.
– Stéphane Chazelas, Commented Dec 18, 2012 at 13:04

Stéphane Chazelas · Accepted Answer · 2012-12-19 12:45:26Z

That indeed looks like a bug in ksh.

What I suspect is that in

x=$(file /tmp)

ksh spawns a new process to run the file command and reads its output through a pipe, but doesn't wait for that process to terminate (as all modern shells, including modern versions of ksh, would). Instead, the command substitution returns as soon as EOF is reached when reading from that pipe.

That behaviour could be confirmed by running:

ksh -c 'x=$(exec sh -c "echo foo;exec >&-; sleep 10"); echo "$x"'

Then check whether ksh returns immediately after having output foo, or only after 10 seconds.

If that's the case, then that means the file command will terminate and cause a SIGCLD to be sent to its parent (the shell), after the x=... command has returned.

The shell is meant to handle those SIGCLD signals to enquire about the death of its children. If the shell has a child running in the background, it should be ready for its death to happen at any moment. That SIGCLD signal, just like any non-ignored signal, would cause a blocking system call to be interrupted. The shell should be ready for that to happen, either by blocking the signal while doing a system call that may be interrupted, or by reattempting the interrupted system call after having handled the signal.

In this case, it looks like neither of those is happening. Most of the time, the write system call performed by ksh when running the builtin echo returns immediately, so there's no chance for it to be interrupted. But once the pipe that stdout points to is full, the write system call ends up blocking, and that's when it gets interrupted by the SIGCLD. ksh doesn't reattempt it, which is the bug.

We can see the same interruption even on Linux if we run

strace -e write ksh -c 'i=0; while [ "$i" -lt 2000 ]; do
  : &
  echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  i=$(($i+1)); done' | (sleep 3; wc)

Then we see:

write(1, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 61) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
write(1, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 61...

The situation is the same: the terminating : command causes the blocking write system call to be interrupted. This time, however, the write is reattempted.
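Reading strace output is one way to confirm this. A rougher pass/fail check is to count the bytes that reach the slow reader and compare them with the expected total of 61 bytes per line. The following is a minimal sketch of that idea; the function name count_surviving_bytes and its two parameters are made up for illustration, and it assumes a POSIX-style shell:

```shell
# Count the bytes that survive a slow reader.
#   $1 = number of 61-byte lines to write (2000 in the strace example)
#   $2 = how long the reader sleeps before it starts reading, in seconds
count_surviving_bytes() {
    lines=$1
    delay=$2
    i=0
    while [ "$i" -lt "$lines" ]; do
        : &    # short-lived child; its exit raises SIGCHLD in the writer
        echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        i=$((i+1))
    done | (sleep "$delay"; wc -c)
}

lines=2000
expected=$((lines * 61))    # 60 x characters plus a newline per line
got=$(count_surviving_bytes "$lines" 3)
got=$((got))                # normalise any padding some wc versions add

if [ "$got" -eq "$expected" ]; then
    echo "no data lost: $got bytes"
else
    echo "lost $((expected - got)) of $expected bytes"
fi
```

Run it under the shell being checked. A shell that reattempts the interrupted write should report no loss. Whether a given run of an affected shell actually loses data depends on a SIGCHLD arriving while a write is blocked, so a single clean run does not prove the shell is unaffected.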

A workaround could be to avoid command substitutions before a call to the builtin echo, or to make sure the write is done by a different process from the one getting the SIGCLD, for instance by running the echo command in a subshell:

(echo "012...")

EDIT: A closer look at the truss output reveals that it's the trace from the second loop, which is meant to run in a separate process from the one running the other loop, so it shouldn't be getting the SIGCLD from the death of the file commands. It could get one SIGCLD though, from the termination of the subshell running the first loop.

Also if, as your test result suggests, ksh does wait for the processes spawned for command substitution, then the SIGCLD signals received can't be explained by asynchronous termination of the file command.

What looks more likely is that the outer pipe gets full, but not the pipe between the two while loops. The SIGCLD is then received during the blocking echo in the second loop, and comes from the termination of the first loop. So a more efficient workaround would be to run the second loop in a subshell, rather than every echo command in it.

while ...; done | (while ...;done)
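Applied to the test script from the question, that second workaround might look like the sketch below. The wrapper function run_workaround and its count argument are not part of the original script; they are added here only so the loop can be run with different sizes. The test uses the portable [ "$i" -lt ... ] form instead of [[ i -lt ... ]].

```shell
# test.sh with the consumer loop wrapped in a subshell (hypothetical wrapper).
# Run it under the shell being tested, for example:
#   ksh workaround.sh | (sleep 3; wc -c)
run_workaround() {
    count=${1:-257}    # number of 64-byte lines; 257 as in the question
    i=0
    while [ "$i" -lt "$count" ]
    do
        x=$(file /tmp)
        echo "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDE"
        i=$((i+1))
    done | (
        # The parentheses force this loop into its own process, so it is
        # not the process that receives SIGCLD when the first loop exits.
        while read datafile
        do
            echo "$datafile"
        done
    )
}

run_workaround 257
```

With 257 lines of 64 bytes, a run that loses nothing should deliver 16448 bytes to wc -c.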

The test command above seems to work and returns 'foo' after 10s. However, using dtksh instead of ksh, I can see that the write is reattempted after the SIGCLD. As you stated, ksh does not do this, and so it fails. I will append a trace log fragment of dtksh to the question for reference.
– Peter Miklos, Commented Dec 19, 2012 at 9:24
(a $ was missing which I've now added). If it outputs "foo" after 10s, then that means my interpretation is wrong, that ksh does wait for the process spawned by the command substitution and therefore shouldn't receive the SIGCLD in the middle of the next "write" system call. The full output of truss -f would tell us what's really going on. — Stéphane Chazelas
– Stéphane Chazelas, Commented Dec 19, 2012 at 11:26
Yes, running the second while in a subshell also works. The full output of truss -f is 2.3M and unfortunately I don't see a way to add attachments here. I have pasted the last 5000 lines here: pastebin.com/YCp12e80
– Peter Miklos, Commented Dec 19, 2012 at 14:17
