I have a simple watchdog mechanism built as follows:
- A kernel module which holds a list of thread counters in a /proc file. The module increments each counter every second. When a counter reaches its max value, it means that the corresponding thread is somehow stuck.
- A user-space program that spawns several threads, each monitored by the kernel module. Each thread sends a counter reset command every 10 seconds by writing a specific string to the /proc file:

    time_t timer = time(NULL);
    int n = 0;
    while (true) {
        /* Send a reset command once 10 seconds have elapsed. */
        if (time(NULL) >= timer + 10) {
            timer = time(NULL);
            char szData[64];
            memset(szData, 0, 64);
            sprintf(szData, "s|%d|%s", tid, function_name);
            if ((n = strlen(szData)) > 0) {
                int t = 0;
                int fp = 0;
                /* Retry the open up to 5 times while access is denied. */
                while ((fp = open("/proc/counters", O_WRONLY | O_EXCL)) == -1 && errno == EACCES) {
                    if (++t > 4) {
                        break;
                    }
                    usleep(260000);
                }
                if (fp == -1) {
                    return 1;
                }
                write(fp, szData, n);
                close(fp);
            }
        }
    }

Under these conditions, each counter should reach a maximum value of 10. My problem is that, sometimes, some of them reach a maximum value of 20, then 30, and so on. On the kernel module side, I see that the reset command indeed does not arrive in time and, on the next round, the module receives two commands within one second, as if the first had been delayed. Example:
Content of the /proc counters file:

    *856 7 20/600 thread_function_name

Debug prints by the kernel module:

    Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 114.086453] register 856
    Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 124.138523] register 856
    Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 134.190508] register 856
    Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242274] register 856
    Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242277] register 856
    Aug 26 11:17:08 XWEB-PRO kern.info kernel: [ 154.294433] register 856
    Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 164.346516] register 856
    Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 174.398552] register 856
    Aug 26 11:17:38 XWEB-PRO kern.info kernel: [ 184.468022] register 856
    Aug 26 11:17:48 XWEB-PRO kern.info kernel: [ 194.522241] register 856

As you can see, I missed the command at 11:16:48, but I have 3 at 11:16:58. Then I missed the one at 11:17:18, but I have 2 at 11:17:28. I have already tried fflush, fsync and sync, with no luck. Can anybody point me in the right direction? Thank you.
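
For anyone cross-checking the log above, here is a minimal sketch that extracts the bracketed value from each line and computes the gap between consecutive entries. It assumes the bracketed number is the kernel's own printk timestamp in seconds since boot, while the leading date is added by the logging daemon. The helper names `kernel_stamp` and `stamp_gaps` are hypothetical, and the parsing is based only on the line format shown here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the bracketed timestamp, e.g. "[ 114.086453]",
 * from one log line. Returns 0 on success, -1 if no such field is found. */
static int kernel_stamp(const char *line, double *out)
{
    const char *p = strchr(line, '[');
    if (p == NULL || sscanf(p, "[ %lf]", out) != 1)
        return -1;
    return 0;
}

/* Hypothetical helper: store stamp(i+1) - stamp(i) in gaps[i].
 * gaps must have room for count - 1 entries.
 * Returns the number of gaps written, or -1 if a line cannot be parsed. */
static int stamp_gaps(const char *const *lines, int count, double *gaps)
{
    double prev = 0.0;
    double cur = 0.0;

    for (int i = 0; i < count; i++) {
        if (kernel_stamp(lines[i], &cur) != 0)
            return -1;
        if (i > 0)
            gaps[i - 1] = cur - prev;
        prev = cur;
    }
    return count > 0 ? count - 1 : 0;
}
```

Feeding the ten log lines to `stamp_gaps` and printing each gap with `printf("%.6f\n", gaps[i])` gives the spacing as seen by the kernel clock, which can then be compared with the wall-clock times at the start of each line.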
- ... n? (type and value)
- ... O_EXCL? As I understand it, the file is provided by your module, and therefore always exists. Moreover, you are not using O_CREAT, which you almost always want together with O_EXCL. The behavior of your open() call with the specified flags is not defined.
- ... O_EXCL, it is conceivable that the open() may sometimes fail. Robust code must recognize that and handle it appropriately.
- ... /proc file? (e.g.) /proc/counters or /proc/pid/watchdog? How do you derive a per-thread counter if you don't add a field to task_struct?
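
The points raised in the comments about O_EXCL and about handling a failed open() could be sketched roughly as follows. This is only an illustration under stated assumptions: `send_reset` is a hypothetical helper, the path is passed in as a parameter, and the command format `s|tid|name` is taken from the question. It opens the file with O_WRONLY alone, bounds the formatted string, and reports every failure to the caller instead of ignoring the result of write().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write one reset command for thread `tid` to `path`.
 * Returns 0 on success, -1 on failure with errno describing the cause. */
static int send_reset(const char *path, int tid, const char *function_name)
{
    char buf[64];
    int len = snprintf(buf, sizeof buf, "s|%d|%s", tid, function_name);
    if (len < 0 || (size_t)len >= sizeof buf) {
        errno = EOVERFLOW;          /* command does not fit in the buffer */
        return -1;
    }

    /* The file already exists, so O_EXCL (meaningful with O_CREAT) is omitted. */
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    /* Send the whole command in one write(); retry only if interrupted. */
    ssize_t written;
    do {
        written = write(fd, buf, (size_t)len);
    } while (written == -1 && errno == EINTR);

    int saved_errno = errno;
    int rc = 0;
    if (written != (ssize_t)len) {
        rc = -1;
        if (written != -1)
            saved_errno = EIO;      /* short write: treat as an I/O error */
    }
    if (close(fd) == -1 && rc == 0) {
        rc = -1;
        saved_errno = errno;
    }
    if (rc != 0)
        errno = saved_errno;
    return rc;
}
```

A thread would call something like `send_reset("/proc/counters", tid, function_name)` once per period and log `strerror(errno)` when it returns -1, so that a missed reset is visible on the user-space side as well.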