61

Imagine I have a process that starts several child processes. The parent needs to know when a child exits.

I can use waitpid, but then if/when the parent needs to exit I have no way of telling the thread that is blocked in waitpid to exit gracefully and join it. It's nice to have things clean up themselves, but it may not be that big of a deal.

I can use waitpid with WNOHANG, and then sleep for some arbitrary time to prevent a busy wait. However then I can only know if a child has exited every so often. In my case it may not be super critical that I know when a child exits right away, but I'd like to know ASAP...

I can use a signal handler for SIGCHLD, and in the signal handler do whatever I was going to do when a child exits, or send a message to a different thread to do some action. But using a signal handler obfuscates the flow of the code a little bit.

What I'd really like to do is use waitpid with some timeout, say 5 sec. Since exiting the process isn't a time-critical operation, I can lazily signal the thread to exit, while still having it blocked in waitpid the rest of the time, always ready to react. Is there such a call in Linux? Of the alternatives, which one is best?


EDIT:

Another method, based on the replies, would be to block SIGCHLD in all threads with pthread_sigmask(). Then, in one thread, keep calling sigtimedwait() while looking for SIGCHLD. This means that I can time out on that call and check whether the thread should exit, and if not, remain blocked waiting for the signal. Once a SIGCHLD is delivered to this thread, we can react to it immediately, inline in the wait thread, without using a signal handler.

1

12 Answers

45

Don't mix alarm() with wait(). You can lose error information that way.

Use the self-pipe trick. This turns any signal into a select()able event:

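    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int selfpipe[2];

    /* SIGCHLD handler: write one byte to wake up select(), preserving errno. */
    void selfpipe_sigh(int n)
    {
        int save_errno = errno;
        (void)write(selfpipe[1], "", 1);
        errno = save_errno;
    }

    void selfpipe_setup(void)
    {
        static struct sigaction act;
        if (pipe(selfpipe) == -1) { abort(); }

        /* Both ends are non-blocking so neither the handler nor the reader can hang. */
        fcntl(selfpipe[0], F_SETFL, fcntl(selfpipe[0], F_GETFL) | O_NONBLOCK);
        fcntl(selfpipe[1], F_SETFL, fcntl(selfpipe[1], F_GETFL) | O_NONBLOCK);
        memset(&act, 0, sizeof(act));
        act.sa_handler = selfpipe_sigh;
        sigaction(SIGCHLD, &act, NULL);
    }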

Then, your waitpid-like function looks like this:

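    #include <sys/select.h>
    #include <sys/wait.h>

    int selfpipe_waitpid(void)
    {
        static char dummy[4096];
        fd_set rfds;
        struct timeval tv;
        int died = 0, st;

        tv.tv_sec = 5;
        tv.tv_usec = 0;
        FD_ZERO(&rfds);
        FD_SET(selfpipe[0], &rfds);
        if (select(selfpipe[0] + 1, &rfds, NULL, NULL, &tv) > 0) {
            /* Drain the pipe, then reap every child that has exited. */
            while (read(selfpipe[0], dummy, sizeof(dummy)) > 0);
            /* waitpid() returns 0 while children exist but none has exited,
               so loop only while it returns a reaped PID. */
            while (waitpid(-1, &st, WNOHANG) > 0) died++;
        }
        return died;
    }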

You can see in selfpipe_waitpid() how you can control the timeout and even mix with other select()-based IO.


14 Comments

Seems like an interesting concept. Question: why make the pipe non-blocking? And why do you need two loops after the select? Shouldn't there always be data when the select succeeds?
If two children die, you won't necessarily get two SIGCHLD notifications. You make the pipe non-blocking in case too many SIGCHLDs come in (roughly PIPE_BUF).
The loops also help to protect against too many SIGCHLDs. While ideally there would always be data after select completes, the read() loop would block once the pipe has been drained unless the descriptor is marked non-blocking for read.
But I am wondering: why include the no-op act.sa_flags |= 0;?
Wouldn't signalfd(2) be the right way to make "turn any signal into a select()able event"?
36

Fork an intermediate child, which forks the real child and a timeout process and waits for all (both) of its children. When one exits, it'll kill the other one and exit.

    pid_t intermediate_pid = fork();
    if (intermediate_pid == 0) {
        pid_t worker_pid = fork();
        if (worker_pid == 0) {
            do_work();
            _exit(0);
        }

        pid_t timeout_pid = fork();
        if (timeout_pid == 0) {
            sleep(timeout_time);
            _exit(0);
        }

        pid_t exited_pid = wait(NULL);
        if (exited_pid == worker_pid) {
            kill(timeout_pid, SIGKILL);
        } else {
            kill(worker_pid, SIGKILL); // Or something less violent if you prefer
        }
        wait(NULL); // Collect the other process
        _exit(0); // Or some more informative status
    }
    waitpid(intermediate_pid, 0, 0);

Surprisingly simple :)

You can even leave out the intermediate child if you're sure no other module in the program is spawning child processes of its own.

6 Comments

I'm not sure if I'm going to use it (I have exactly the same problem as OP), but damn, this is a very cool trick! Kudos (just wonder why it isn't upvoted more)
Is there any way to make the do_work function a system() call? I want the niceties of what the shell has to offer (globbing, piping), but calling system() there causes it to keep running if the worker fork gets killed.
This may seem to provide no guarantee that the timeout is respected, since the latter is measured since the start of the timeout_pid() process. However, the delay between calling timeout_pid = fork() and the actual start of the timeout_pid process is arbitrary.
Isn't all CPU scheduling strictly speaking arbitrary on most (non-realtime) OSes? That is, you can have an arbitrarily long scheduling delay immediately after starting your worker no matter which timeout mechanism you use. That said, this solution is indeed likely to be somewhat less accurate than many others.
There is an exotic condition with this approach when using signal(SIGCHLD, SIG_IGN); that can cause problems. See here for details.
22

This is an interesting question. I found sigtimedwait can do it.

EDIT 2016/08/29: Thanks to Mark Edington for the suggestion. I've tested your example on Ubuntu 16.04, and it works as expected.

Note: this only works for child processes. It's a pity that there seems to be no equivalent of Windows' WaitForSingleObject(unrelated_process_handle, timeout) in Linux/Unix for getting notified of an unrelated process's termination within a timeout.

OK, Mark Edington's sample code is here:

    /* The program creates a child process and waits for it to finish. If a timeout
     * elapses the child is killed. Waiting is done using sigtimedwait(). Race
     * condition is avoided by blocking the SIGCHLD signal before fork().
     */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <errno.h>

    static pid_t fork_child (void)
    {
        pid_t p = fork ();

        if (p == -1) {
            perror ("fork");
            exit (1);
        }

        if (p == 0) {
            puts ("child: sleeping...");
            sleep (10);
            puts ("child: exiting");
            exit (0);
        }

        return p;
    }

    int main (int argc, char *argv[])
    {
        sigset_t mask;
        sigset_t orig_mask;
        struct timespec timeout;
        pid_t pid;

        sigemptyset (&mask);
        sigaddset (&mask, SIGCHLD);

        if (sigprocmask(SIG_BLOCK, &mask, &orig_mask) < 0) {
            perror ("sigprocmask");
            return 1;
        }

        pid = fork_child ();

        timeout.tv_sec = 5;
        timeout.tv_nsec = 0;

        do {
            if (sigtimedwait(&mask, NULL, &timeout) < 0) {
                if (errno == EINTR) {
                    /* Interrupted by a signal other than SIGCHLD. */
                    continue;
                }
                else if (errno == EAGAIN) {
                    printf ("Timeout, killing child\n");
                    kill (pid, SIGKILL);
                }
                else {
                    perror ("sigtimedwait");
                    return 1;
                }
            }

            break;
        } while (1);

        if (waitpid(pid, NULL, 0) < 0) {
            perror ("waitpid");
            return 1;
        }

        return 0;
    }

6 Comments

How can no one else appreciate this answer? :)
This would be a better answer with some more detail and example code. I used this method. Here is a blog post with a good example: linuxprogrammingblog.com/code-examples/…
@Mark Edington Thank you for your kindness. I've updated the answer to mention your example.
This answer looks like it would only work if there is just a single child process. If there are multiple children involved, then surely the wires would get crossed and you wouldn't know whether the child you were waiting for finished.
@DavidRoundy Thanks for sharing. I have not tested multiple children. Will any child's exit cause sigtimedwait to return? If so, you can then use waitpid to find out which child exited.
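On the multiple-children question: SIGCHLD is a standard signal, so several exits may be coalesced into a single pending SIGCHLD. A common pattern, sketched here with hypothetical helper names (`reap_exited_children()` and the demo helper `spawn_exiting_child()` are not from the answer), is to treat the signal only as a wake-up and then reap every exited child with a non-blocking waitpid() loop.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Demo helper (hypothetical): fork a child that exits at once with `code`. */
static pid_t spawn_exiting_child(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);
    return pid;
}

/* Reap every child that has already exited, without blocking.
 * Stores up to `max` PIDs in `pids` and returns how many were reaped.
 * Returns 0 if no child has exited yet or there are no children. */
static int reap_exited_children(pid_t *pids, int max)
{
    assert(pids != NULL || max == 0);

    int n = 0;
    int status;
    pid_t pid;

    /* waitpid() yields a PID per exited child, 0 if the remaining children
     * are still running, and -1 once there are no children left. */
    while (n < max && (pid = waitpid(-1, &status, WNOHANG)) > 0)
        pids[n++] = pid;
    return n;
}
```

Calling `reap_exited_children()` after each successful sigtimedwait() tells the caller exactly which children finished, regardless of how many SIGCHLD notifications were merged.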
19

If your program runs only on contemporary Linux kernels (5.3 or later), the preferred way is to use pidfd_open (https://lwn.net/Articles/789023/ https://man7.org/linux/man-pages/man2/pidfd_open.2.html).

This system call returns a file descriptor representing a process, and then you can select, poll or epoll it, the same way you wait on other types of file descriptors.

For example,

    int fd = pidfd_open(pid, 0);
    struct pollfd pfd = {fd, POLLIN, 0};
    poll(&pfd, 1, 1000) == 1; /* true if the process exited within 1000 ms */
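A slightly fuller sketch of the same idea, under a few assumptions: the function name `waitpid_timeout()` is hypothetical, the target is a child of the caller (so it can be reaped with waitpid()), and the system call is invoked through `syscall()` because older glibc versions do not ship a `pidfd_open()` wrapper.

```c
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait up to timeout_ms for child `pid` to exit.
 * Returns 1 if it exited (exit status stored in *status), 0 on timeout,
 * -1 on error (including interruption by a signal). */
static int waitpid_timeout(pid_t pid, int timeout_ms, int *status)
{
    /* Raw system call; requires Linux 5.3 or later. */
    int fd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (fd == -1)
        return -1;

    /* The pidfd becomes readable once the process has terminated. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    close(fd);

    if (ready <= 0)
        return ready;               /* 0 = timeout, -1 = error */

    /* The child is now a zombie, so this call does not block. */
    return waitpid(pid, status, WNOHANG) == pid ? 1 : -1;
}
```

On timeout the caller decides what to do next, for example sending SIGTERM to the child and waiting again.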

1 Comment

This is where stack overflow falls down unfortunately... Software improves with time, but answer scores don't age out.
4

The function can be interrupted with a signal, so you could set a timer before calling waitpid() and it will exit with an EINTR when the timer signal is raised. Edit: It should be as simple as calling alarm(5) before calling waitpid().

6 Comments

What determines which thread handles a signal? How will I be sure that this is the thread that handles it? Is it that alarm was called in some thread, so that thread handles the signal?
The man page for signal seems to say that the result is unspecified, which means that it may not be handled by the right thread and lead to incorrect results.
It is probably a good idea to have just one thread which receives signals, ensuring that all other threads mask the signal with sigprocmask or similar
note to anyone reading the above comment: use pthread_sigmask not sigprocmask
Don't actually do this. You can lose children if waitpid() reaps the child but SIGALRM fires before the kernel returns. Many unixes have bugs here as well, and don't EINTR correctly even in the ideal case.
3

I thought that select would return EINTR when SIGCHLD is signaled by one of the children. I believe this should work:

    while (1) {
        int retval = select(0, NULL, NULL, NULL, &tv);
        if (retval == -1 && errno == EINTR) {
            /* some signal arrived (a SIGCHLD handler must be installed,
               otherwise select() is not interrupted) */
            pid_t pid = waitpid(-1, &st, WNOHANG);
            if (pid > 0) {
                /* some child exited */
            }
        } else if (retval == 0) {
            /* timeout */
            break;
        } else {
            /* error */
        }
    }

Note: you can use pselect to override current sigmask and avoid interrupts from unneeded signals.

2 Comments

This is pretty good, only you need to mask the signal while not in the select() call and that's complicated (you will have a race condition between the unmask + select calls).
@AlexisWilke, true. signalfd is a better alternative on Linux. Using glib may simplify the solution suggested by @geocar. The approach in this answer can also be enhanced with sigtimedwait, though you still need to mask signals to ensure that none are missed. With signalfd, you should create it before spawning children, I guess.
3

Instead of calling waitpid() directly, you could call sigtimedwait() with SIGCHLD (which is sent to the parent process after a child exits) and wait for it to be delivered to the current thread. As the function name suggests, a timeout parameter is supported.

Please check the following code snippet for details:

    static bool waitpid_with_timeout(pid_t pid, int timeout_ms, int* status) {
        sigset_t child_mask, old_mask;
        sigemptyset(&child_mask);
        sigaddset(&child_mask, SIGCHLD);

        if (sigprocmask(SIG_BLOCK, &child_mask, &old_mask) == -1) {
            printf("*** sigprocmask failed: %s\n", strerror(errno));
            return false;
        }

        timespec ts;
        ts.tv_sec = MSEC_TO_SEC(timeout_ms);
        ts.tv_nsec = (timeout_ms % 1000) * 1000000;
        int ret = TEMP_FAILURE_RETRY(sigtimedwait(&child_mask, NULL, &ts));
        int saved_errno = errno;

        // Set the signals back the way they were.
        if (sigprocmask(SIG_SETMASK, &old_mask, NULL) == -1) {
            printf("*** sigprocmask failed: %s\n", strerror(errno));
            if (ret == 0) {
                return false;
            }
        }
        if (ret == -1) {
            errno = saved_errno;
            if (errno == EAGAIN) {
                errno = ETIMEDOUT;
            } else {
                printf("*** sigtimedwait failed: %s\n", strerror(errno));
            }
            return false;
        }

        pid_t child_pid = waitpid(pid, status, WNOHANG);
        if (child_pid != pid) {
            if (child_pid != -1) {
                printf("*** Waiting for pid %d, got pid %d instead\n", pid, child_pid);
            } else {
                printf("*** waitpid failed: %s\n", strerror(errno));
            }
            return false;
        }
        return true;
    }

Refer: https://android.googlesource.com/platform/frameworks/native/+/master/cmds/dumpstate/DumpstateUtil.cpp#46

1 Comment

what if you want to do this for arbitrary pid that isn't a child?
2

If you're going to use signals anyways (as per Steve's suggestion), you can just send the signal manually when you want to exit. This will cause waitpid to return EINTR and the thread can then exit. No need for a periodic alarm/restart.

2

Due to circumstances I absolutely needed this to run in the main thread and it was not very simple to use the self-pipe trick or eventfd because my epoll loop was running in another thread. So I came up with this by scrounging together other stack overflow handlers. Note that in general it's much safer to do this in other ways but this is simple. If anyone cares to comment about how it's really really bad then I'm all ears.

NOTE: It is absolutely necessary to block signal handling in every thread except the one you want to run this in. I do this by default, as I believe it is messy to handle signals in random threads.

    /* Forward declaration: the handler is defined below but installed here. */
    static void ctlAlarmSignalHandler(int s);

    static void ctlWaitPidTimeout(pid_t child, useconds_t usec, int *timedOut)
    {
        int rc = -1;

        static pthread_mutex_t alarmMutex = PTHREAD_MUTEX_INITIALIZER;

        TRACE("ctlWaitPidTimeout: waiting on %lu\n", (unsigned long) child);

        /**
         * paranoid, in case this was called twice in a row by different
         * threads, which could quickly turn very messy.
         */
        pthread_mutex_lock(&alarmMutex);

        /* set the alarm handler */
        struct sigaction alarmSigaction;
        struct sigaction oldSigaction;

        sigemptyset(&alarmSigaction.sa_mask);
        alarmSigaction.sa_flags = 0;
        alarmSigaction.sa_handler = ctlAlarmSignalHandler;
        sigaction(SIGALRM, &alarmSigaction, &oldSigaction);

        /* set alarm, because no alarm is fired when the first argument is 0, 1 is used instead */
        ualarm((usec == 0) ? 1 : usec, 0);

        /* wait for the child we just killed */
        rc = waitpid(child, NULL, 0);

        /* if errno == EINTR, the alarm went off, set timedOut to true */
        *timedOut = (rc == -1 && errno == EINTR);

        /* in case we did not time out, unset the current alarm so it doesn't bother us later */
        ualarm(0, 0);

        /* restore old signal action */
        sigaction(SIGALRM, &oldSigaction, NULL);

        pthread_mutex_unlock(&alarmMutex);

        TRACE("ctlWaitPidTimeout: timeout wait done, rc = %d, error = '%s'\n", rc, (rc == -1) ? strerror(errno) : "none");
    }

    static void ctlAlarmSignalHandler(int s)
    {
        TRACE("ctlAlarmSignalHandler: alarm occured, %d\n", s);
    }

EDIT: I've since transitioned to using a solution that integrates well with my existing epoll()-based eventloop, using timerfd. I don't really lose any platform-independence since I was using epoll anyway, and I gain extra sleep because I know the unholy combination of multi-threading and UNIX signals won't hurt my program again.

1 Comment

Could you spare some time to post your final epoll solution?
1

I can use a signal handler for SIGCHLD, and in the signal handler do whatever I was going to do when a child exits, or send a message to a different thread to do some action. But using a signal handler obfuscates the flow of the code a little bit.

In order to avoid race conditions you should avoid doing anything more complex than changing a volatile flag in a signal handler.

I think the best option in your case is to send a signal to the parent. waitpid() will then set errno to EINTR and return. At this point you check for waitpid return value and errno, notice you have been sent a signal and take appropriate action.

1 Comment

Well, you can do the self-pipe trick, and have the waitpid-thread really be blocking on a select to a pipe instead. Then, when it gets SIGCHLD, have it write a byte to the pipe, which wakes itself up.
0

If a third party library is acceptable then the libkqueue project emulates kqueue (the *BSD eventing system) and provides basic process monitoring with EVFILT_PROC + NOTE_EXIT.

The main advantages of using kqueue or libkqueue are that it's cross-platform and doesn't have the complexity of signal handling. If your program uses async I/O, you may also find it a lower-friction interface than something like epoll and the various *fd functions (signalfd, eventfd, pidfd, etc.).

    #include <errno.h>      /* for errno */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>     /* for strerror */
    #include <unistd.h>     /* for close */
    #include <sys/event.h>  /* kqueue header */
    #include <sys/types.h>  /* for pid_t */

    /* Link with -lkqueue */

    int waitpid_timeout(pid_t pid, struct timespec *timeout)
    {
        struct kevent changelist, eventlist;
        int kq, ret;

        /* Populate a changelist entry (an event we want to be notified of) */
        EV_SET(&changelist, pid, EVFILT_PROC, EV_ADD, NOTE_EXIT, 0, NULL);

        kq = kqueue();
        if (kq == -1) {
            printf("Error %s\n", strerror(errno));
            return -1;
        }

        /* Call kevent with a timeout */
        ret = kevent(kq, &changelist, 1, &eventlist, 1, timeout);

        /* Kevent returns 0 on timeout, the number of events that occurred, or -1 on error */
        switch (ret) {
        case -1:
            printf("Error %s\n", strerror(errno));
            break;

        case 0:
            printf("Timeout\n");
            break;

        case 1:
            printf("PID %u exited, status %u\n", (unsigned int)eventlist.ident, (unsigned int)eventlist.data);
            break;
        }

        close(kq);

        return ret;
    }

Behind the scenes on Linux libkqueue uses either pidfd on Linux kernels >= 5.3 or a waiter thread that listens for SIGCHLD and notifies one or more kqueue instances when a process exits. The second approach is not efficient (it scans PIDs that interest has been registered for using waitid), but that doesn't matter unless you're waiting on large numbers of PIDs.

EVFILT_PROC support has been included in kqueue since its inception, and in libkqueue since v2.5.0.

-1

Here's a C++ solution using a watchdog thread.

    int pid = fork();
    . . .
    // In the parent process
    ProcessWatchdog wd(pid, 20s); // start a watchdog thread to kill the process after the timeout
    int rc;
    waitpid(pid, &rc, 0); // guaranteed to not hang forever

And here's an example implementation of ProcessWatchdog that first tries SIGTERM, then SIGKILL.

    #pragma once

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <signal.h>
    #include <thread>

    using namespace std::literals;

    class ProcessWatchdog {
    public:
        ProcessWatchdog(int pid, std::chrono::milliseconds timeout)
            : pid_(pid)
            , timeout_(timeout)
            , t_([this] { run(); })
        {
        }

        ProcessWatchdog(ProcessWatchdog const&) = delete;
        ProcessWatchdog& operator=(ProcessWatchdog const&) = delete;

        void done()
        {
            std::unique_lock lock{ mtx_ };
            done_ = true;
            done_cond_.notify_all();
        }

        ~ProcessWatchdog()
        {
            done();
            t_.join();
        }

    private:
        void run()
        {
            std::unique_lock lock{ mtx_ };
            if (done_cond_.wait_for(lock, timeout_, [this] { return done_; })) {
                return;
            }
            kill(pid_, SIGTERM);
            if (done_cond_.wait_for(lock, timeout_, [this] { return done_; })) {
                return;
            }
            kill(pid_, SIGKILL);
        }

        int pid_;
        std::mutex mtx_;
        std::condition_variable done_cond_;
        const std::chrono::milliseconds timeout_;
        bool done_{};
        std::thread t_;
    };
