I have a faulty Ryzen 5900X desktop CPU. Previously, I somewhat tamed its faulty cores via the isolcpus=2,3,14,15 kernel parameter in GRUB2 (see https://blog.cbugk.com/post/ryzen-5850x/).
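For reference, that amounted to roughly the following (a sketch of the usual Debian/GRUB2 procedure; the quiet flag is just a placeholder for whatever else is on the line):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=2,3,14,15"

    # regenerate grub.cfg, then reboot for it to take effect
    update-grub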
However, I have set up a Ceph cluster on Proxmox 8.2, and it had crippling performance of around 2 MB/s. After redoing the cluster I got about 20 MB/s while cloning a template VM. I suspected my second-hand enterprise SSDs, but even fresh ones showed the problem (with or without an NVMe DB cache).
But when I checked my faulty cores (2,3,14,15), they were being used. The moment I shut down the computer with the 5900X, transfer speed jumps to around 100 MB/s on the remaining two nodes. Networking is 10G between each node and iperf had previously shown ~6 Gbit/s throughput, so ~~it cannot be the bottleneck.~~ It was the damn cabling.
Some duckduckgo-ing later, I found out that isolcpus= isolates user-space tasks but does not keep kernel threads off those cores.
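For context, a quick check, assuming a reasonably recent kernel exposes these sysfs files: the isolated set shows up under /sys, but the placement of kernel-side work is governed separately:

    # CPUs removed from the general scheduler by isolcpus= (user space only)
    cat /sys/devices/system/cpu/isolated
    # -> 2-3,14-15

    # unbound kworkers follow this mask instead of isolcpus
    cat /sys/devices/virtual/workqueue/cpumask

In principle that workqueue mask is writable, so unbound kworkers could be steered off the faulty threads, but the bound per-CPU threads (migration/N, ksoftirqd/N) stay put regardless.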
`watch -n1 -- "ps -axo psr,pcpu,uid,user,pid,tid,args --sort=psr | grep -e '^ 2 ' -e '^ 3 ' -e '^ 14 ' -e '^ 15'"` (source) gives:
      2  0.0     0 root        27      27 [cpuhp/2]
      2  0.0     0 root        28      28 [idle_inject/2]
      2  0.3     0 root        29      29 [migration/2]
      2  0.0     0 root        30      30 [ksoftirqd/2]
      2  0.0     0 root        31      31 [kworker/2:0-events]
      2  0.0     0 root       192     192 [irq/26-AMD-Vi]
      2  0.0     0 root       202     202 [kworker/2:1-events]
      3  0.0     0 root        33      33 [cpuhp/3]
      3  0.0     0 root        34      34 [idle_inject/3]
      3  0.3     0 root        35      35 [migration/3]
      3  0.0     0 root        36      36 [ksoftirqd/3]
      3  0.0     0 root        37      37 [kworker/3:0-events]
      3  0.0     0 root       203     203 [kworker/3:1-events]
     14  0.0     0 root        99      99 [cpuhp/14]
     14  0.0     0 root       100     100 [idle_inject/14]
     14  0.3     0 root       101     101 [migration/14]
     14  0.0     0 root       102     102 [ksoftirqd/14]
     14  0.0     0 root       103     103 [kworker/14:0-events]
     14  0.0     0 root       210     210 [kworker/14:1-events]
     15  0.0     0 root       105     105 [cpuhp/15]
     15  0.0     0 root       106     106 [idle_inject/15]
     15  0.3     0 root       107     107 [migration/15]
     15  0.0     0 root       108     108 [ksoftirqd/15]
     15  0.0     0 root       109     109 [kworker/15:0-events]
     15  0.0     0 root       211     211 [kworker/15:1-events]

Since Ceph uses a kernel driver, I need a way to isolate cores from the whole system, not just user space. Running everything from PID 1 onwards under taskset is okay (see the sketch below). I cannot use cset due to cgroups v2. numactl is also okay.
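To illustrate what I mean by running PID 1 onwards under taskset, a rough sketch (the CPU list is just the complement of the faulty threads on a 24-thread part; re-pinning like this only affects tasks that already exist, and the per-CPU kernel threads will simply refuse the new mask):

    # run as root; everything except the faulty threads 2,3,14,15
    GOOD_CPUS="0-1,4-13,16-23"

    # re-pin PID 1 and all currently running tasks (and their threads)
    for pid in $(ps -eo pid --no-headers); do
        taskset -a -c -p "$GOOD_CPUS" "$pid" 2>/dev/null
    done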
With isolcpus I do not have apparent system stability issues; without it I would face secure-connection errors in Firefox and OS installs would fail. But even that is not enough when using Ceph. I now conclude that it could have corrupted data unnoticed if this were not my homelab machine.
Can anyone suggest a way to effectively ban these faulty threads as soon as the system allows, permanently? (I had better use the phrase CPU affinity in this post.)
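For the record, one CPU-affinity lever I am aware of but have not tried (assuming systemd as init, as on Proxmox; the drop-in file name is made up): the manager-wide affinity, which PID 1 and every service it spawns inherit.

    # drop-in so the setting survives package upgrades
    mkdir -p /etc/systemd/system.conf.d
    cat > /etc/systemd/system.conf.d/no-faulty-cores.conf <<'EOF'
    [Manager]
    CPUAffinity=0-1 4-13 16-23
    EOF

    # PID 1 re-reads its config; services pick it up on their next restart
    systemctl daemon-reexec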
I was wrong. I redid the Cat6 cables at just the right length, and since I had already routed them clear of the power cables, I can state that interference should be much lower than before. The same error was there when I disabled half the cores in the BIOS, including the faulty ones. I get near-instant VM clones on the Ceph pool now, thanks to the NVMe DB cache I suppose.
Also, the kernel threads on those cores are the per-CPU ones used by the scheduler itself; their PIDs and the set of threads on those cores stay constant in the watch output above, even during a VM clone on the Ceph pool. So if no tasks are actually being scheduled onto them, isolcpus might be working as intended.
I found these tangentially relevant readings interesting: migration (reddit), nohz (lwn.net).
What about /sys/devices/system/cpu/cpu#/online? Note that in a hyperthreaded setup each thread shows up as a separate CPU, and you can see the pairing by looking at cpu#/topology/thread_siblings_list.
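A minimal sketch of using that knob at boot (the script path and hooking it in via an @reboot cron entry or a systemd oneshot unit are assumptions, not a tested setup):

    #!/bin/sh
    # hypothetical /usr/local/sbin/offline-faulty-cpus.sh, run as root
    for cpu in 2 3 14 15; do
        # informational: which SMT sibling shares this physical core, e.g. "2,14"
        cat "/sys/devices/system/cpu/cpu$cpu/topology/thread_siblings_list"
        # take the thread offline; the kernel parks its per-CPU threads
        echo 0 > "/sys/devices/system/cpu/cpu$cpu/online"
    done

Unlike isolcpus, an offlined CPU gets nothing scheduled on it at all, kernel threads included.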