I have written a "device driver" (see source code here: https://bitbucket.org/wothke/websid/src/master/raspi/websid_module/ ) that runs fine most of the time (see https://www.youtube.com/watch?v=bE6nSTT_038 ) but which still seems able to crash the device at random every now and then.
The "device driver" starts a kthread that performs a simple but timing-critical playback loop in which it controls a connected audio chip via multiple GPIO pins. This kthread is run (using kthread_bind) on an "isolated" CPU core which should be largely exempted from regular kernel use (see details on kernel configuration below). The kthread is given high priority via sched_set_fifo. The kthread makes no subroutine calls and does not require any memory that has not already been allocated in the kernel. (The thread also temporarily disables anything that might interfere with its timing, using get_cpu, local_irq_save & local_bh_disable. However, these do not seem to be the root cause of the sporadic crashes, since the crashes could be reproduced even when that disabling was not used.)
I have compiled a regular "Raspberry Pi OS" "Desktop" kernel but I specifically activated NO_HZ_FULL (i.e. "Full dynticks system (tickless)"). I am also isolating core #3 via cmdline.txt with: isolcpus=3 rcu_nocbs=3 rcu_nocb_poll=3 nohz_full=3 (which seems to keep most IRQs off CPU core #3, as intended, so that my above kthread should be pretty much alone on that core).
The usual suspect might be the "shared kernel memory" buffer that is used for all communication between the above "playback" kthread and the producer of the data, which lives in "userland". I already took all the precautions that I could think of to avoid potential race conditions, but maybe there is some kind of CPU cache effect, or something else that I am overlooking. The "shared buffer" contains 4 page-aligned areas that are set up and used in a way that should ensure safe communication/synchronization:
1) The 1st page only contains one 32-bit flag that is accessed as a u32 or uint32_t (this access should be naturally atomic). The kthread only updates this flag when it is 0, and it only sets it to something non-0. The userland code only resets this flag to 0, and only if it had some non-0 value, thereby acknowledging that it received the non-0 value set by the kthread.
2) The 2nd page contains a flag similar to the one in 1) but for the opposite direction, i.e. here it is the kthread that will receive something non-0 from "userland".
3) The 3rd (+following) page(s) then contain the 1st buffer that is used for a simple double buffering scheme. This buffer is exclusively written to by the "userland" producer and exclusively read by the kthread. The "ping/pong" protocol implemented via the 2 flags is meant to ensure that the buffer is *never* used concurrently: the kthread initiates a sequence by signalling that one of the buffers can be filled, and later the "userland" signals back after it has completed filling the respective buffer, i.e. the kthread only starts reading from a buffer after it has seen the signal from the producer that it is now safe to do so. (Before the "userland" producer gives that signal, it uses msync(start_page, len, MS_INVALIDATE) to report which parts of the shared memory area it has updated.)
4) The n-th (+following) page(s) then contain the 2nd buffer (everything said in 3) applies here as well).
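To make the intended handshake concrete, here is a minimal userspace sketch of the two-flag protocol described above. It is not the module's actual code: the struct layout, the names (shared_area, fill_request, fill_done, BUF_WORDS) and the use of C11 acquire/release atomics are all assumptions for illustration. The real kthread side would use kernel primitives instead, and whether the real code orders its buffer accesses against the flag accesses in this way is exactly the kind of thing I am unsure about.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUF_WORDS 8u /* hypothetical buffer size, for illustration only */

/* Hypothetical layout; the real module puts each flag on its own page. */
struct shared_area {
    _Atomic uint32_t fill_request; /* 1st flag: consumer (kthread) -> producer */
    _Atomic uint32_t fill_done;    /* 2nd flag: producer -> consumer (kthread) */
    uint32_t buf[2][BUF_WORDS];    /* the two halves of the double buffer */
};

/* Consumer side: ask the producer to fill buffer idx (0 or 1).
 * Only writes the flag while it is 0; returns false if the previous
 * request has not been acknowledged yet. */
static bool consumer_request_fill(struct shared_area *a, uint32_t idx)
{
    assert(idx < 2);
    if (atomic_load_explicit(&a->fill_request, memory_order_acquire) != 0)
        return false;
    atomic_store_explicit(&a->fill_request, idx + 1, memory_order_release);
    return true;
}

/* Producer side: if a request is pending and the previous completion signal
 * has been consumed, acknowledge the request, fill the buffer and signal
 * completion. Returns true if a buffer was filled. */
static bool producer_service(struct shared_area *a, uint32_t seed)
{
    uint32_t req = atomic_load_explicit(&a->fill_request, memory_order_acquire);
    if (req == 0)
        return false;
    if (atomic_load_explicit(&a->fill_done, memory_order_acquire) != 0)
        return false;

    /* Acknowledge: the producer only ever resets a non-0 value to 0. */
    atomic_store_explicit(&a->fill_request, 0, memory_order_release);

    for (uint32_t i = 0; i < BUF_WORDS; i++)
        a->buf[req - 1][i] = seed + i;

    /* Release store: the buffer writes above are ordered before the flag. */
    atomic_store_explicit(&a->fill_done, req, memory_order_release);
    return true;
}

/* Consumer side: if the producer has signalled completion, read the buffer
 * (summing stands in for playback) and clear the flag. Returns -1 if idle. */
static int64_t consumer_take(struct shared_area *a)
{
    uint32_t done = atomic_load_explicit(&a->fill_done, memory_order_acquire);
    if (done == 0)
        return -1;

    int64_t sum = 0;
    for (uint32_t i = 0; i < BUF_WORDS; i++)
        sum += a->buf[done - 1][i];

    atomic_store_explicit(&a->fill_done, 0, memory_order_release);
    return sum;
}
```

In this sketch each flag has exactly one side that sets it and one side that clears it, and the release store on the flag paired with the acquire load on the other side is what keeps the buffer contents from being observed before they are completely written.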
But even if something went wrong in the above, that might block the kthread or the respective userland process, but I don't see why it should crash the whole system.
The most plausible explanation for me would be if the "shared buffer" got randomly relocated (thus leading to random memory corruption), but I would think that this should not happen to a buffer allocated via:
_raw_buffer = kmalloc(AREA_SIZE + 2*PAGE_SIZE, GFP_KERNEL & ~__GFP_RECLAIM & ~__GFP_MOVABLE);

Another explanation would be some kernel function that goes into a blocking wait for something from core #3 (which might never happen because my kthread starves everything else on that CPU). However, I'd be surprised if such a problem only struck sporadically instead of crashing the machine all the time.
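For reference, the way the over-allocated kmalloc region is carved into page-aligned areas can be sketched as plain address arithmetic. This is an assumption-laden illustration, not the module's code: SKETCH_PAGE_SIZE, area_layout and layout_from_raw are hypothetical names, a 4096-byte page is assumed, and the buffer size is a made-up parameter. In the kernel itself the rounding would normally be done with the PAGE_ALIGN() macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size; the kernel uses PAGE_SIZE */

/* Round an address up to the next page boundary (no-op if already aligned). */
static uintptr_t page_align_up(uintptr_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~((uintptr_t)SKETCH_PAGE_SIZE - 1);
}

/* Hypothetical description of the 4 page-aligned areas. */
struct area_layout {
    uintptr_t flag_to_user;   /* 1st page: flag set by the kthread */
    uintptr_t flag_to_kernel; /* 2nd page: flag set by userland */
    uintptr_t buf0;           /* 1st buffer of the double buffering scheme */
    uintptr_t buf1;           /* 2nd buffer */
};

/* Derive the area addresses from the raw (possibly unaligned) allocation.
 * buf_bytes is the size of one buffer and must be a multiple of the page size. */
static struct area_layout layout_from_raw(uintptr_t raw, size_t buf_bytes)
{
    struct area_layout l;

    assert(buf_bytes > 0 && buf_bytes % SKETCH_PAGE_SIZE == 0);
    l.flag_to_user = page_align_up(raw);
    l.flag_to_kernel = l.flag_to_user + SKETCH_PAGE_SIZE;
    l.buf0 = l.flag_to_kernel + SKETCH_PAGE_SIZE;
    l.buf1 = l.buf0 + buf_bytes;
    return l;
}
```

Rounding up can skip almost a full page at the start of the allocation, so the allocation has to be larger than the sum of the areas for the last buffer to still fit; the extra 2*PAGE_SIZE in the kmalloc call is there to provide that slack.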
Any ideas?