
I have a simple watchdog mechanism built as follows:

  • A kernel module that exposes a list of per-thread counters through a /proc file. The module increments each counter every second. When a counter reaches its maximum value, the corresponding thread is presumed to be stuck.
  • A user-space program that spawns several threads, which the kernel module monitors. Each thread sends a counter reset command every 10 seconds by writing a specific string to the /proc file:
    time_t timer = time(NULL);
    int n = 0;

    while (true) {
        /* Send a reset command once every 10 seconds. */
        if (time(NULL) >= timer + 10) {
            timer = time(NULL);

            /* Build the command: "s|<thread id>|<function name>". */
            char szData[64];
            memset(szData, 0, 64);
            sprintf(szData, "s|%d|%s", tid, function_name);

            if ((n = strlen(szData)) > 0) {
                int t = 0;
                int fp = 0;

                /* Retry up to 5 times while open() fails with EACCES. */
                while ((fp = open("/proc/counters", O_WRONLY | O_EXCL)) == -1 && errno == EACCES) {
                    if (++t > 4) {
                        break;
                    }
                    usleep(260000);
                }
                if (fp == -1) {
                    return 1;
                }

                write(fp, szData, n);
                close(fp);
            }
        }
    }

Under these conditions, no counter should ever exceed 10. My problem is that, sometimes, some of them reach 20, then 30, and so on. On the kernel module side, I can see that the reset command does not arrive in time, and on the next round the module receives two commands within one second, as if the first had been delayed. Example:

Content of the /proc counters file

 *856 7 20/600 thread_function_name 

Debug prints by the kernel module

Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 114.086453] register 856
Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 124.138523] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 134.190508] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242274] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242277] register 856
Aug 26 11:17:08 XWEB-PRO kern.info kernel: [ 154.294433] register 856
Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 164.346516] register 856
Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 174.398552] register 856
Aug 26 11:17:38 XWEB-PRO kern.info kernel: [ 184.468022] register 856
Aug 26 11:17:48 XWEB-PRO kern.info kernel: [ 194.522241] register 856

As you can see, I missed the command at 11:16:48, but I have 3 at 11:16:58. Then I missed the one at 11:17:18, but I have 2 at 11:17:28. I have already tried fflush, fsync and sync, with no luck. Can anybody point me in the right direction? Thank you.

  • What is n? (type and value) Commented Aug 26, 2024 at 10:08
  • Sorry, I edited the question. It's the length of the reset command. Commented Aug 26, 2024 at 10:26
  • Why are you opening the file with O_EXCL? As I understand it, the file is provided by your module, and therefore always exists. Moreover, you are not using O_CREAT, which you almost always want together with O_EXCL. The behavior of your open() call with the specified flags is not defined. Commented Aug 26, 2024 at 13:02
  • And why aren't you checking whether opening the file in fact succeeds? Even once you clear up the problem with O_EXCL, it is conceivable that the open() may sometimes fail. Robust code must recognize that and handle it appropriately. Commented Aug 26, 2024 at 13:06
  • We need to see your kernel module code. What is the /proc file? (e.g.) /proc/counters or /proc/pid/watchdog? How do you derive a per-thread counter if you don't add a field to task_struct? Commented Aug 26, 2024 at 17:38

1 Answer


I finally spotted the problem. It looks like it was a race condition. There are 10 threads that write to the /proc file at the same time, so some write operations were probably missed by the kernel module (indeed, the resets always arrived at multiples of 10 seconds). I wrapped the open/write/close sequence in user space in a mutex, and the problem seems to be gone.
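
For reference, here is a minimal sketch of what that serialization might look like. It is an illustration under assumptions, not the exact code from my program: the helper name send_reset, its path parameter and the lock name counters_lock are made up for the example. It also drops O_EXCL, as suggested in the comments on the question, and checks the result of write().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One process-wide lock shared by every monitored thread (hypothetical name). */
static pthread_mutex_t counters_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical helper: writes "s|<tid>|<function_name>" to `path`
 * (for example "/proc/counters") while holding the lock, so that only
 * one thread at a time runs the open/write/close sequence.
 * Returns 0 on success, -1 on failure.
 */
int send_reset(const char *path, int tid, const char *function_name)
{
    char cmd[64];
    int rc = -1;

    assert(path != NULL && function_name != NULL);

    int n = snprintf(cmd, sizeof cmd, "s|%d|%s", tid, function_name);
    if (n <= 0 || (size_t)n >= sizeof cmd)
        return -1;                      /* formatting failed or truncated */

    pthread_mutex_lock(&counters_lock);

    /* Same retry policy as in the question: up to 5 attempts on EACCES.
     * Note that the lock is held while sleeping between attempts. */
    int fd = -1;
    for (int attempt = 0; attempt < 5; attempt++) {
        fd = open(path, O_WRONLY);
        if (fd != -1 || errno != EACCES)
            break;
        usleep(260000);
    }

    if (fd != -1) {
        /* Treat a short write as a failure: the command is one unit. */
        ssize_t written = write(fd, cmd, (size_t)n);
        if (written == (ssize_t)n)
            rc = 0;
        if (close(fd) == -1)
            rc = -1;
    }

    pthread_mutex_unlock(&counters_lock);
    return rc;
}
```

Each thread would then call something like send_reset("/proc/counters", tid, function_name) every 10 seconds. Keep in mind that a user-space mutex only serializes threads inside one process. If several processes write to the file, or if the write handler in the module is not protected by its own lock, the race may still be there on the kernel side.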
