C - Linux proc file syncing / write deferred

Question

I have a simple watchdog mechanism built as follows:

A kernel module that keeps a list of per-thread counters, exposed through a /proc file. The module increments each counter every second. When a counter reaches its maximum value, the corresponding thread is assumed to be stuck.
A user-space program that spawns several threads, which are monitored by the kernel module. Each thread sends a counter-reset command every 10 seconds by writing a specific string to the /proc file:

time_t timer = time(NULL);
int n = 0;

while (true) {
    if (time(NULL) >= timer + 10) {
        timer = time(NULL);

        /* Build the reset command: "s|<tid>|<function name>" */
        char szData[64];
        memset(szData, 0, sizeof szData);
        snprintf(szData, sizeof szData, "s|%d|%s", tid, function_name);

        if ((n = strlen(szData)) > 0) {
            int t = 0;
            int fp = 0;

            /* Retry the open up to 5 times while access is denied */
            while ((fp = open("/proc/counters", O_WRONLY | O_EXCL)) == -1 && errno == EACCES) {
                if (++t > 4) {
                    break;
                }
                usleep(260000);
            }
            if (fp == -1) {
                return 1;
            }

            write(fp, szData, n);
            close(fp);
        }
    }
}

Under these conditions, each counter should reach a maximum value of 10. My problem is that, sometimes, some of them reach 20, then 30, and so on. On the kernel module side, I can see that the reset command does not arrive in time and that, on the next round, the module receives two commands within one second, as if the first had been delayed. Example:

Content of the /proc counters file

 *856 7 20/600 thread_function_name

Debug prints by the kernel module

Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 114.086453] register 856
Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 124.138523] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 134.190508] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242274] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242277] register 856
Aug 26 11:17:08 XWEB-PRO kern.info kernel: [ 154.294433] register 856
Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 164.346516] register 856
Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 174.398552] register 856
Aug 26 11:17:38 XWEB-PRO kern.info kernel: [ 184.468022] register 856
Aug 26 11:17:48 XWEB-PRO kern.info kernel: [ 194.522241] register 856

As you can see, I missed the command at 11:16:48, but I have 3 at 11:16:58. Then I missed the one at 11:17:18, but I have 2 at 11:17:28. I have already tried fflush, fsync and sync, with no luck. Can anybody point me in the right direction? Thank you.

Sorry, I edited the question. It's the length of the reset command.
– bui3, Commented Aug 26, 2024 at 10:26
Why are you opening the file with O_EXCL? As I understand it, the file is provided by your module, and therefore always exists. Moreover, you are not using O_CREAT, which you almost always want together with O_EXCL. The behavior of your open() call with the specified flags is not defined.
– John Bollinger, Commented Aug 26, 2024 at 13:02
And why aren't you checking whether opening the file in fact succeeds? Even once you clear up the problem with O_EXCL, it is conceivable that the open() may sometimes fail. Robust code must recognize that and handle it appropriately.
– John Bollinger, Commented Aug 26, 2024 at 13:06
We need to see your kernel module code. What is the /proc file? (e.g.) /proc/counters or /proc/pid/watchdog? How do you derive a per-thread counter if you don't add a field to task_struct?
– Craig Estey, Commented Aug 26, 2024 at 17:38

bui3 · Accepted Answer · 2024-08-29 08:38:29Z

I finally spotted the problem: it looks like a race condition. Ten threads write to the /proc file at the same time, so some write operations were probably missed by the kernel module (indeed, the resets always happened at multiples of 10 seconds). I wrapped the open/write/close sequence in user space in a mutex, and the problem seems to be gone.
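
A minimal sketch of that fix, assuming a single process-wide pthread mutex shared by all monitored threads. The names send_reset and reset_lock are hypothetical and not taken from the original program; the path is passed in as a parameter, and O_EXCL is dropped as suggested in the comments on the question.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical lock: one instance shared by every monitored thread. */
static pthread_mutex_t reset_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical helper: send one "s|<tid>|<name>" reset command to `path`
 * (for example "/proc/counters"). Returns 0 on success, -1 on failure.
 */
int send_reset(const char *path, int tid, const char *name)
{
    char msg[64];
    int rc = -1;

    assert(path != NULL && name != NULL);

    int len = snprintf(msg, sizeof msg, "s|%d|%s", tid, name);
    if (len < 0 || (size_t)len >= sizeof msg)
        return -1;                      /* formatting error or truncated command */

    /* Only one thread at a time runs the open/write/close sequence. */
    pthread_mutex_lock(&reset_lock);

    int fd = open(path, O_WRONLY);      /* no O_EXCL: the file already exists */
    if (fd != -1) {
        ssize_t written;
        do {                            /* retry if interrupted by a signal */
            written = write(fd, msg, (size_t)len);
        } while (written == -1 && errno == EINTR);

        if (written == (ssize_t)len)
            rc = 0;                     /* whole command accepted */
        if (close(fd) == -1)
            rc = -1;
    }

    pthread_mutex_unlock(&reset_lock);
    return rc;
}
```

Each thread would then call something like send_reset("/proc/counters", tid, function_name) every 10 seconds and check the return value. Note that this only serializes writers inside one process; if the kernel module's write handler is not itself protected against concurrent calls, writes from other processes could still race.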
