Am I making invalid assumptions with regard to my kernel module's shared memory?

Question

I have written a "device driver" (see source code here: https://bitbucket.org/wothke/websid/src/master/raspi/websid_module/ ) that runs fine most of the time (see https://www.youtube.com/watch?v=bE6nSTT_038 ) but which still seems to crash the device at random every now and then.

The "device driver" starts a kthread that performs a simple but timing-critical playback loop in which it controls a connected audio chip via multiple GPIO pins. This kthread is run (using kthread_bind) on an "isolated" CPU core which should be largely exempt from regular kernel use (see details on kernel configuration below). The kthread is given high priority via sched_set_fifo. The kthread makes no subroutine calls and does not require any memory that has not already been allocated in the kernel. (The thread also temporarily disables anything that might interfere with its timing, using get_cpu, local_irq_save & local_bh_disable. However, these do not seem to be the root cause of the sporadic crashes, since the crashes could be reproduced even when that disabling was not used.)

I have compiled a regular "Raspberry Pi OS" "Desktop" kernel, but I specifically activated NO_HZ_FULL (i.e. "Full dynticks system (tickless)"). I am also isolating core #3 via cmdline.txt with: isolcpus=3 rcu_nocbs=3 rcu_nocb_poll=3 nohz_full=3 (which seems to keep most IRQs off CPU core #3, as intended, so that my above kthread should be pretty much alone on that core).

The usual suspect might be the "shared kernel memory" buffer that is used for all communication between the above "playback" kthread and the producer of the data, which lives in "userland". I already took all the precautions that I could think of to avoid potential race conditions, but maybe there is some kind of CPU cache effect, or something else that I am overlooking. The "shared buffer" contains 4 page-aligned areas that are set up and used in a way that should ensure safe communication/synchronization.

1) The 1st page contains only one 32-bit flag that is accessed as a u32 or uint32_t (this should be naturally atomic). The kthread only updates this flag when it is 0, and it only sets it to something non-0. The userland code only resets this flag to 0, and only if it had some non-0 value, thereby acknowledging that it received the non-0 value set by the kthread.
2) The 2nd page contains a flag similar to 1) but for the opposite direction, i.e. here it is the kthread that will receive something non-0 from "userland".
3) The 3rd (+following) page(s) contain the 1st buffer that is used for a simple double-buffering scheme. This buffer is exclusively written to by the "userland" producer and exclusively read by the kthread. The "ping/pong" protocol implemented via the 2 flags is meant to ensure that the buffer is *never* used concurrently: the kthread initiates a sequence by signalling that one of the buffers can be filled, and later the "userland" signals back after it has completed filling the respective buffer, i.e. the kthread only starts reading from a buffer after it has seen the signal from the producer that it is now safe to do so (before the "userland" producer gives that signal, it uses msync(start_page, len, MS_INVALIDATE) to report which parts of the shared memory area it has updated).
4) The n-th (+following) page(s) contain the 2nd buffer (everything said in 3) applies here as well).
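
The handshake described in the list above can be sketched in plain userspace C11 as follows. This is only an illustration of the ordering that such a protocol needs, not the code from the linked repository: the struct layout, the function names and BUF_WORDS are hypothetical, and the C11 release/acquire operations stand in for what kernel code would express with smp_store_release() and smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_WORDS 256  /* hypothetical buffer size, in 32-bit words */

/* Hypothetical layout: two flags plus two buffers (page alignment omitted). */
struct shared_area {
    _Atomic uint32_t fill_request; /* flag 1: consumer (kthread) -> producer */
    _Atomic uint32_t fill_done;    /* flag 2: producer (userland) -> consumer */
    uint32_t buf[2][BUF_WORDS];
};

static void shared_area_init(struct shared_area *s)
{
    atomic_init(&s->fill_request, 0);
    atomic_init(&s->fill_done, 0);
}

/* Consumer: ask for buffer idx (0 or 1) to be filled.
 * Only writes the flag when it is 0; returns 1 if the request was posted. */
static int consumer_request_fill(struct shared_area *s, uint32_t idx)
{
    assert(idx < 2);
    if (atomic_load_explicit(&s->fill_request, memory_order_acquire) != 0)
        return 0; /* previous request not yet acknowledged */
    atomic_store_explicit(&s->fill_request, idx + 1, memory_order_release);
    return 1;
}

/* Producer: if a request is pending, fill that buffer, acknowledge the
 * request, then publish completion. Returns 1 if a buffer was filled. */
static int producer_service(struct shared_area *s, const uint32_t *src, size_t n)
{
    uint32_t req = atomic_load_explicit(&s->fill_request, memory_order_acquire);
    if (req == 0 || req > 2)
        return 0; /* nothing requested (or garbage in the flag) */
    if (atomic_load_explicit(&s->fill_done, memory_order_acquire) != 0)
        return 0; /* consumer has not picked up the previous completion */
    if (n > BUF_WORDS)
        n = BUF_WORDS;
    memcpy(s->buf[req - 1], src, n * sizeof(uint32_t));
    /* Acknowledge: only ever reset a non-0 flag back to 0. */
    atomic_store_explicit(&s->fill_request, 0, memory_order_release);
    /* Release store: the buffer contents written above become visible
     * to any reader that observes this flag with an acquire load. */
    atomic_store_explicit(&s->fill_done, req, memory_order_release);
    return 1;
}

/* Consumer: returns the freshly filled buffer, or NULL if none is ready. */
static const uint32_t *consumer_take(struct shared_area *s)
{
    uint32_t done = atomic_load_explicit(&s->fill_done, memory_order_acquire);
    if (done == 0 || done > 2)
        return NULL;
    atomic_store_explicit(&s->fill_done, 0, memory_order_release);
    return s->buf[done - 1];
}
```

In this sketch the consumer calls consumer_request_fill(), the producer polls producer_service(), and the consumer then polls consumer_take() before reading. Without the release/acquire pairing on fill_done, a reader on a weakly ordered CPU such as the Cortex-A72 could in principle see the flag change before it sees the new buffer contents.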

But even if something went wrong in the above, that might then block the kthread or the respective userland process, but I don't see why it should crash the whole system.

The most plausible explanation for me would be if the "shared buffer" got randomly relocated (thus leading to random memory corruption), but I would think that this should not happen to a buffer allocated via:

_raw_buffer = kmalloc(AREA_SIZE + 2*PAGE_SIZE, GFP_KERNEL & ~__GFP_RECLAIM & ~__GFP_MOVABLE);

Or if there was some kernel function that specifically went into a blocking wait for something from core #3 (which might never happen due to my kthread starving everything else on that CPU). However, I'd be surprised if such a problem only struck sporadically instead of crashing the machine all the time.

Any ideas?

You might want to read the help file on memory barriers. Without using them, it is possible to see the flag set that indicates the shared buffer is valid, but to not yet see the data in that buffer. Also it is important to look at exactly what the crash says.
– psusi, Commented Apr 24, 2021 at 16:19
@psusi where does the crash say anything? The device becomes unresponsive and I have to power it down and back up again. Also, seeing some old version of the buffer should not hurt at all: there would be a minor glitch in the audio output since old data would be replayed a 2nd time, but other than that nothing bad should happen, especially no system crash.
– wothke, Commented Apr 24, 2021 at 19:15
Out of curiosity, is core #3 a real physical core or a hyper-threaded (i.e., faking it) core? In my system, my CPU fakes it, but I had to disable that because it caused random hard system crashes (screen would go blank, and BIOS would initiate as if it was just powered on) anywhere from within minutes of booting up to hours after, usually with no clue in the logs what happened other than a truncated log file.
– C. M., Commented Apr 25, 2021 at 5:18
The Raspberry Pi 4 uses a quad-core Cortex-A72 and as far as I know it has no hyper-threading.
– wothke, Commented Apr 25, 2021 at 10:05
"Crash" usually means a kernel panic, which tells you what went wrong. If your device just stops working, you'll have to debug your driver and see what it is and isn't doing and compare that to what it should be doing.
– psusi, Commented Apr 30, 2021 at 15:05

wothke · Accepted Answer · 2021-04-25 15:56:12Z

After having failed to improve the situation by adding "memory barriers" at every reasonable point in my code, I finally found a workaround that works. The problem does not seem to be linked to the shared memory at all. Instead, it seems to be triggered by the scheduler: adding calls to "schedule()" in my long-running kthread does seem to avoid the system freezes.

Unfortunately this workaround is not a viable solution for me and I've created a separate thread to further explore the direction that this is taking: Is there a way to use a long running kthread without calling schedule()?
