
I'm using Ubuntu 14.04 and I'm experiencing this behavior I can't seem to understand:

  1. Run the yes command (in the default shell: Bash)
  2. Press Ctrl+Z to stop yes
  3. Run jobs. Output:
    [1]+ Stopped yes
  4. Run kill -9 %1 to kill yes. Output:
    [1]+ Stopped yes
  5. Run jobs. Output:
    [1]+ Stopped yes

This is Ubuntu with kernel 3.16.0-30-generic, running in a Parallels virtual machine.

Why didn't my kill -9 command terminate the yes command? I thought SIGKILL couldn't be caught or ignored. And how can I terminate the yes command?

  • That's interesting. SIGKILL should work and it does on my Linux Mint 17. For any other signal, you'd normally need to send it SIGCONT afterwards to make sure the signal gets received by the stopped target. Commented Jun 12, 2015 at 18:34
  • Does bash really print "Stopped" for a process that is suspended? Commented Jun 12, 2015 at 18:45
  • Kernel version (uname -a) please Commented Jun 12, 2015 at 18:56
  • Linux ubuntu 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux. I'm running Ubuntu in Parallels Desktop. Commented Jun 12, 2015 at 18:59
  • @black Most shells say "Stopped". tcsh says "Suspended" and zsh says "suspended". A cosmetic difference. Of somewhat more importance is the fact that bash prints an identical message for STOP and TSTP, whereas other shells annotate the STOP message with (signal) so you can tell the difference. Commented Jun 12, 2015 at 20:21

4 Answers


Signals are blocked for suspended processes. In a terminal:

$ yes
...
y
y
^Z
[1]+  Stopped                 yes

In a second terminal:

$ killall yes 

In the first terminal:

$ jobs
[1]+  Stopped                 yes
$ fg
yes
Terminated

However SIGKILL can't be blocked. Doing the same thing with killall -9 yes from the second terminal immediately gives this in the yes terminal:

[1]+ Killed yes 

Consequently, if kill -9 %1 doesn't terminate the process right away, then either bash isn't actually sending the signal until you fg the process, or you have uncovered a bug in the kernel.
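
To reproduce this from a single terminal, here is a rough sketch that is not part of the original answer (the PID 4242 is made up, the output is approximate, /proc is Linux-specific, and bash's own job-status lines between prompts are omitted). It stops a background sleep, shows that a SIGTERM sent afterwards only sits in the kernel's pending-signal mask, and that SIGCONT lets it take effect:

$ sleep 9999 &
[1] 4242
$ kill -TSTP %1       # stop it, as Ctrl+Z would do for a foreground job
$ kill -TERM %1       # TERM is recorded but not acted on while the process is stopped
$ grep -E '^(State|ShdPnd)' /proc/"$(jobs -p %1)"/status
State:  T (stopped)
ShdPnd: 0000000000004000
$ kill -CONT %1       # resuming delivers the pending TERM, which terminates the process
$ ps -p 4242          # prints only the header line: the process is gone
  PID TTY          TIME CMD

The ShdPnd value 0000000000004000 has the bit for signal 15 (SIGTERM) set, i.e. the signal is queued but undelivered. Doing the same thing with kill -KILL instead of kill -TERM removes the process immediately, stopped or not, which is what the killall -9 example above shows.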

  • Some background details: When issuing Ctrl+Z in your terminal, bash sends a SIGTSTP (which is the blockable version of SIGSTOP) to the active process. This puts the process in a frozen state where the kernel won't schedule it. That also inhibits signal processing (except for the SIGCONT signal which unfreezes the process) and therefore prevents the process from being killed right away. Commented Jun 12, 2015 at 19:24
  • SIGKILL, unlike other signals, is not blocked for suspended processes. Sending the KILL signal to a suspended process kills it (asynchronously, but in practice basically immediately). Commented Jun 12, 2015 at 22:30
  • @Gilles That's what I was trying to illustrate above: SIGTERM is blocked, but SIGKILL isn't. Anyway, according to a comment from OP, the problem seems to be that jobs doesn't detect that the process has died, not the process not being killed by kill -9 %1. Commented Jun 12, 2015 at 22:32
  • But I can reproduce s1m0n's behavior on my system (Debian, amd64, bash 4.3.30). Commented Jun 12, 2015 at 22:34
  • While SIGKILL cannot be blocked, there is no guarantee that it will be delivered within any meaningful time. If a process is suspended pending blocking I/O, for example, SIGKILL will not arrive until the process wakes. This could potentially be never, if no I/O occurs. Commented Jun 13, 2015 at 1:22

Don't panic.

There's nothing funky going on. There's no kernel bug here. This is perfectly normal behaviour from the Bourne Again shell and a multitasking operating system.

The thing to remember is that a process kills itself, even in response to SIGKILL. What's happening here is that the Bourne Again shell is getting around to things before the process that it just told to kill itself gets around to killing itself.

Consider what happens from the point where yes has been stopped with SIGTSTP and you've just executed the kill command with the Bourne Again shell:

  1. The shell sends SIGKILL to the yes process.
  2. In parallel:
    1. The yes process is scheduled to run and immediately kills itself.
    2. The Bourne Again shell continues, issuing another prompt.

The reason that you are seeing one thing and other people are seeing another is a simple race between two ready to run processes, the winner of which is entirely down to things that vary both from machine to machine and over time. System load makes a difference, as does the fact that your CPU is virtual.

In the interesting case, the detail of step #2 is this:

  1. The Bourne Again shell continues.
  2. As part of the internals of the built-in kill command it marks the entry in its job table as needing a notification message printed at the next available point.
  3. It finishes the kill command, and just before printing the prompt again checks to see whether it should print notification messages about any jobs.
  4. The yes process hasn't had the chance to kill itself yet, so as far as the shell is concerned the job is still in the stopped state. So the shell prints a "Stopped" job status line for that job, and resets its notification pending flag.
  5. The yes process gets scheduled and kills itself.
  6. The kernel informs the shell, which is busy running its command line editor, that the process has killed itself. The shell notes the change in status and flags the job as notification pending again.
  7. Simply pressing enter to cycle through the prompt printing again gives the shell the chance to print the new job status.

The important points are:

  • Processes kill themselves. SIGKILL isn't magical. Processes check for pending signals when returning to application mode from kernel mode, which happens at the ends of page faults, (non-nested) interrupts, and system calls. The only special thing is that the kernel doesn't allow the action in response to SIGKILL to be anything other than immediate and unconditional suicide, with no return to application mode. Importantly, processes need to both be making kernel-to-application-mode transitions and be scheduled to run in order to respond to signals.
  • A virtual CPU is just a thread on a host operating system. There's no guarantee that the host has scheduled the virtual CPU to run. Host operating systems are not magical, either.
  • Notification messages aren't printed when the job state changes happen (unless you use set -o notify; see the sketch after this list). They are printed when the shell next reaches a point in its execution cycle where it checks whether any notifications are pending.
  • The notification pending flag is being set twice, once by kill and once by the SIGCHLD signal handler. This means that one can see two messages if the shell is running ahead of the yes process being rescheduled to kill itself; one a "Stopped" message and one a "Killed" message.
  • Obviously, the /bin/kill program doesn't have any access to the shell's internal jobs table; so you won't see such behaviour with /bin/kill. The notification pending flag is only set the once, by the SIGCHLD handler.
  • For the same reason, you won't see this behaviour if you kill the yes process from another shell.
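
If you want to see the notification timing change for yourself, the notify option mentioned in the list above does exactly that: set -o notify (equivalent to set -b) makes bash report job status changes as soon as they are collected, instead of holding them for the next prompt. A small, approximate illustration; the PID is made up:

$ set -o notify
$ yes > /dev/null &
[1] 4242
$ kill -9 %1
[1]+  Killed                  yes > /dev/null

Without notify, that Killed line would only show up the next time the shell was about to print a prompt and checked its pending notifications.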
  • That's an interesting theory, but the OP gets to type jobs and the shell still sees the process as alive. That would be one unusually long scheduling race condition. :) Commented Jun 12, 2015 at 21:15
  • First of all, thanks for your elaborate answer! It certainly makes sense and clears up quite a few things. But as stated above, I can run multiple jobs commands after the kill which all still indicate the process is just stopped. You however inspired me to keep experimenting and I discovered this: the message [1]+ Terminated yes is printed as soon as I run another external command (not a shell builtin like echo or jobs). So I can run jobs as much as I like and it keeps printing [1]+ Stopped yes. But as soon as I run ls for example, Bash prints [1]+ Terminated yes Commented Jun 12, 2015 at 21:29
  • lcd047 didn't read your comment on the question, which was important and should have been edited into the start of the question, properly. It's easy to overload a host operating system such that guests appear to schedule very strangely, from within. Just like this, and more besides. (I once managed to cause quite odd scheduling with a runaway Bing Desktop consuming most of the host CPU time.) Commented Jun 12, 2015 at 22:18
  • @Gilles The problem seems to be that jobs doesn't notice that the process has actually died... Not sure what to make of the status being updated by running another command though. Commented Jun 12, 2015 at 22:41
  • Even Gilles didn't see the comment. This is why you should put this sort of important stuff in the question, not bury it in a comment. Gilles, the answer clearly talks about delays in delivering a signal, not delays in sending it. You've mixed them up. Also, read the questioner's comment (and indeed the bullet point that it is given here) and see the very important wrong fundamental assumption that you are making. Virtual processors do not necessarily run in lockstep, and are not magically capable of always running at full speed. Commented Jun 13, 2015 at 0:28

What you're observing is a bug in this version of bash.

kill -9 %1 does kill the job immediately. You can observe that with ps. You can trace the bash process to see when the kill system call is called, and trace the subprocess to see when it receives and processes the signals. More interestingly, you can go and see what's happening to the process:

bash-4.3$ sleep 9999
^Z
[1]+  Stopped                 sleep 9999
bash-4.3$ kill -9 %1
[1]+  Stopped                 sleep 9999
bash-4.3$ jobs
[1]+  Stopped                 sleep 9999
bash-4.3$ jobs -l
[1]+  3083 Stopped                 sleep 9999
bash-4.3$

In another terminal:

% ps 3083
  PID TTY      STAT   TIME COMMAND
 3083 pts/4    Z      0:00 [sleep] <defunct>

The subprocess is a zombie. It's dead: all that's left of it is an entry in the process table (but no memory, code, open files, etc.). The entry is left around until its parent takes notice and retrieves its exit status by calling the wait system call or one of its siblings.

An interactive shell is supposed to check for dead children and reap them before printing a prompt (unless configured otherwise). This version of bash fails to do it in some circumstances:

bash-4.3$ jobs -l
[1]+  3083 Stopped                 sleep 9999
bash-4.3$ true
bash-4.3$ /bin/true
[1]+  Killed                  sleep 9999

You might expect bash to report “Killed” as soon as it's printing the prompt after the kill command, but that isn't guaranteed, because there's a race condition. Signals are delivered asynchronously: the kill system call returns as soon as the kernel has figured out which process(es) to deliver the signal to, without waiting for it to be actually delivered. It's possible, and it does happen in practice, that bash has time to check on the status of its subprocess, find that it's still not dead (wait4 doesn't report any child death), and print that the process is still stopped. What is wrong is that before the next prompt, the signal has been delivered (ps reports that the process is dead), yet bash still hasn't called wait4 (we can see that not only because it still reports the job as “Stopped”, but because the zombie is still present in the process table). In fact, bash only reaps the zombie the next time it needs to call wait4, when it's run some other external command.

The bug is intermittent and I couldn't reproduce it while bash is traced (presumably because it's a race condition where bash needs to react fast). If the signal is delivered before bash checks, everything happens as expected.
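
If you want to watch the sequence of events the answer describes, one option (with the caveat from the previous paragraph that tracing may perturb the race enough to hide the bug, and not part of the original answer) is to attach strace to the interactive bash from another terminal and filter for the two system calls involved; the PID 3042 below is made up:

$ sudo strace -f -tt -e trace=kill,wait4 -p 3042

Then repeat the sleep 9999, Ctrl+Z, kill -9 %1 steps in the traced shell. The trace shows the kill(2) call returning immediately, while the wait4(2) call that finally collects the dead child and reports it as killed by signal 9 only appears later, around the next external command.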


Something funky may be happening on your system; on mine, your recipe works nicely both with and without the -9:

> yes
...
^Z
[1]+  Stopped                 yes
> jobs
[1]+  Stopped                 yes
> kill %1
[1]+  Killed                  yes
> jobs
>

Get the PID with jobs -p and try to kill it as root.
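
If you want to rule out the shell builtin altogether (see the comments below), you can pass the PID printed by jobs -p to the external kill binary. A rough sketch; the PID 4242 is made up, and the binary may live at /bin/kill or /usr/bin/kill depending on the distribution:

$ jobs -p %1
4242
$ /usr/bin/kill -9 "$(jobs -p %1)"

env kill -9 <pid>, suggested in a comment below, has the same effect, since env can only execute programs found in $PATH, never shell builtins.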

  • May I ask what distribution/kernel/bash version you're using? Maybe your bash's internal kill command goes the extra mile and checks if the job is frozen. (You might want to try finding out the PID of the job and killing it using env kill <pid>; that way you'll be using the actual kill command and not the bash builtin.) Commented Jun 12, 2015 at 19:51
  • bash-4.2-75.3.1.x86_64 on openSUSE 13.2. The kill cmd is not the internal one: which kill prints /usr/bin/kill Commented Jun 13, 2015 at 1:55
  • which is not a bash builtin, so which <anything> will always give you the path to the actual command. But try comparing kill --help vs. /usr/bin/kill --help. Commented Jun 13, 2015 at 19:43
  • Ah, right. Indeed, it's the builtin kill. Commented Jun 14, 2015 at 2:42
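
A quick way to check whether a name resolves to a builtin, an external program, or both (relevant to the exchange above) is bash's type builtin; which only searches $PATH and never knows about builtins. The output below is approximate and the path varies by system:

$ type -a kill
kill is a shell builtin
kill is /usr/bin/kill
$ which kill
/usr/bin/kill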
