I have a faulty Ryzen 5900X desktop CPU. Previously, I somewhat tamed its faulty cores via the isolcpus=2,3,14,15 kernel parameter in GRUB2 (see https://blog.cbugk.com/post/ryzen-5850x/).
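For reference, that amounted to roughly the following (a sketch of the usual Debian/GRUB2 procedure; the quiet flag is just a placeholder for whatever else is on the line):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=2,3,14,15"

    # regenerate grub.cfg, then reboot for it to take effect
    update-grub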
However, I have set up a Ceph cluster on Proxmox 8.2, and it had crippling performance of around 2 MB/s. After redoing the cluster I got about 20 MB/s while cloning a template VM. I suspected my second-hand enterprise SSDs, but even fresh ones showed the problem (with or without an NVMe DB cache).
But when I checked my faulty cores (2,3,14,15), they were being used. The moment I shut down the computer with the 5900X, transfer speed jumps to around 100 MB/s on the remaining two nodes. Networking is 10G between each node and iperf had previously shown ~6 Gbit/s throughput, so ~~it cannot be the bottleneck.~~ It was the damn cabling.
Some duckduckgo-ing later, I found out that isolcpus= isolates user-space tasks but does not keep kernel threads off those cores.
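For context, a quick check, assuming a reasonably recent kernel exposes these sysfs files: the isolated set shows up under /sys, but the placement of kernel-side work is governed separately:

    # CPUs removed from the general scheduler by isolcpus= (user space only)
    cat /sys/devices/system/cpu/isolated
    # -> 2-3,14-15

    # unbound kworkers follow this mask instead of isolcpus
    cat /sys/devices/virtual/workqueue/cpumask

In principle that workqueue mask is writable, so unbound kworkers could be steered off the faulty threads, but the bound per-CPU threads (migration/N, ksoftirqd/N) stay put regardless.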
`watch -n1 -- "ps -axo psr,pcpu,uid,user,pid,tid,args --sort=psr | grep -e '^ 2 ' -e '^ 3 ' -e '^ 14 ' -e '^ 15'"` (source) gives:
      2  0.0     0 root        27      27 [cpuhp/2]
      2  0.0     0 root        28      28 [idle_inject/2]
      2  0.3     0 root        29      29 [migration/2]
      2  0.0     0 root        30      30 [ksoftirqd/2]
      2  0.0     0 root        31      31 [kworker/2:0-events]
      2  0.0     0 root       192     192 [irq/26-AMD-Vi]
      2  0.0     0 root       202     202 [kworker/2:1-events]
      3  0.0     0 root        33      33 [cpuhp/3]
      3  0.0     0 root        34      34 [idle_inject/3]
      3  0.3     0 root        35      35 [migration/3]
      3  0.0     0 root        36      36 [ksoftirqd/3]
      3  0.0     0 root        37      37 [kworker/3:0-events]
      3  0.0     0 root       203     203 [kworker/3:1-events]
     14  0.0     0 root        99      99 [cpuhp/14]
     14  0.0     0 root       100     100 [idle_inject/14]
     14  0.3     0 root       101     101 [migration/14]
     14  0.0     0 root       102     102 [ksoftirqd/14]
     14  0.0     0 root       103     103 [kworker/14:0-events]
     14  0.0     0 root       210     210 [kworker/14:1-events]
     15  0.0     0 root       105     105 [cpuhp/15]
     15  0.0     0 root       106     106 [idle_inject/15]
     15  0.3     0 root       107     107 [migration/15]
     15  0.0     0 root       108     108 [ksoftirqd/15]
     15  0.0     0 root       109     109 [kworker/15:0-events]
     15  0.0     0 root       211     211 [kworker/15:1-events]

Since Ceph uses a kernel driver, I need a way to isolate cores from the whole system, not just user space. Running everything from PID 1 onwards under taskset is okay (see the sketch below). I cannot use cset due to cgroups v2. numactl is also okay.
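To illustrate what I mean by running PID 1 onwards under taskset, a rough sketch (the CPU list is just the complement of the faulty threads on a 24-thread part; re-pinning like this only affects tasks that already exist, and the per-CPU kernel threads will simply refuse the new mask):

    # run as root; everything except the faulty threads 2,3,14,15
    GOOD_CPUS="0-1,4-13,16-23"

    # re-pin PID 1 and all currently running tasks (and their threads)
    for pid in $(ps -eo pid --no-headers); do
        taskset -a -c -p "$GOOD_CPUS" "$pid" 2>/dev/null
    done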
With isolcpus I do not have apparent system stability issues; without it I would face secure-connection errors in Firefox and OS installs would fail. But even that is not enough when using Ceph. I now conclude that it could have corrupted data unnoticed if this were not my homelab machine.
Can anyone suggest a way to effectively ban these faulty threads as soon as the system allows, permanently? (I had better use the phrase CPU affinity in this post.)
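For the record, one CPU-affinity lever I am aware of but have not tried (assuming systemd as init, as on Proxmox; the drop-in file name is made up): the manager-wide affinity, which PID 1 and every service it spawns inherit.

    # drop-in so the setting survives package upgrades
    mkdir -p /etc/systemd/system.conf.d
    cat > /etc/systemd/system.conf.d/no-faulty-cores.conf <<'EOF'
    [Manager]
    CPUAffinity=0-1 4-13 16-23
    EOF

    # PID 1 re-reads its config; services pick it up on their next restart
    systemctl daemon-reexec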
I was wrong. I redid the Cat6 cables at just the right length, and since I had already routed them clear of the power cables, I can state that interference should be much lower than before. The same error was there when I disabled half the cores in the BIOS, including the faulty ones. I get near-instant VM clones on the Ceph pool now, thanks to the NVMe DB cache I suppose.
Also, the kernel threads on those cores are the per-CPU ones used by the scheduler itself; their PIDs and the set of threads on those cores stay constant in the watch output above, even during a VM clone on the Ceph pool. So if no tasks are actually being scheduled onto them, isolcpus might be working as intended.
I found these tangentially relevant readings interesting: migration (reddit), nohz (lwn.net).
What about /sys/devices/system/cpu/cpu#/online? Note that in a hyperthreaded setup each thread shows up as a separate CPU, and you can see the pairing by looking at cpu#/topology/thread_siblings_list.
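A minimal sketch of using that knob at boot (the script path and hooking it in via an @reboot cron entry or a systemd oneshot unit are assumptions, not a tested setup):

    #!/bin/sh
    # hypothetical /usr/local/sbin/offline-faulty-cpus.sh, run as root
    for cpu in 2 3 14 15; do
        # informational: which SMT sibling shares this physical core, e.g. "2,14"
        cat "/sys/devices/system/cpu/cpu$cpu/topology/thread_siblings_list"
        # take the thread offline; the kernel parks its per-CPU threads
        echo 0 > "/sys/devices/system/cpu/cpu$cpu/online"
    done

Unlike isolcpus, an offlined CPU gets nothing scheduled on it at all, kernel threads included.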