
I have a faulty Ryzen 5900X desktop CPU. Previously, I somewhat tamed its faulty cores via the isolcpus=2,3,14,15 kernel parameter in GRUB2 (see https://blog.cbugk.com/post/ryzen-5850x/).
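
For reference, this is roughly how I set it, assuming the stock Debian/Proxmox GRUB layout (the exact contents of GRUB_CMDLINE_LINUX_DEFAULT will differ per machine):

```
# /etc/default/grub — keep the faulty SMT threads away from the userspace scheduler
GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=2,3,14,15"
```

followed by update-grub and a reboot.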

However, on Proxmox 8.2 I set up a Ceph cluster, and it had crippling performance of around 2 MB/s. After rebuilding the cluster I got 20 MB/s while cloning a template VM. I suspected my second-hand enterprise SSDs, but even fresh ones behaved the same (with or without an NVMe DB cache).
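
For anyone measuring the same thing: a quick way to get a raw pool throughput number, independent of VM cloning, is Ceph's built-in benchmark (the pool name below is a hypothetical placeholder):

```
# write to the pool for 10 seconds, keep the test objects for read benchmarks
rados bench -p my-ceph-pool 10 write --no-cleanup
# remove the benchmark objects afterwards
rados -p my-ceph-pool cleanup
```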

But when I checked my faulty cores (2,3,14,15), they were being used. The moment I shut down the computer with the 5900X, transfer speed jumps to around 100 MB/s on the remaining two nodes. Networking is 10G between each node, and iperf had previously shown 6G throughput, ~~so it cannot be the bottleneck.~~ It was the damn cabling.

Some duckduckgo-ing later, I found out that isolcpus= keeps user-space tasks off the listed cores but does not restrict kernel threads.
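
The kernel does acknowledge the parameter either way; assuming a standard sysfs layout, the isolated set can be confirmed with:

```
# prints the isolated CPUs in cpulist form, e.g. 2-3,14-15
cat /sys/devices/system/cpu/isolated
# confirms isolcpus= made it onto the kernel command line
cat /proc/cmdline
```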

```
watch -n1 -- "ps -axo psr,pcpu,uid,user,pid,tid,args --sort=psr | grep -e '^ 2 ' -e '^ 3 ' -e '^ 14 ' -e '^ 15'"
```

(source) gives:

```
  2 0.0 0 root  27  27 [cpuhp/2]
  2 0.0 0 root  28  28 [idle_inject/2]
  2 0.3 0 root  29  29 [migration/2]
  2 0.0 0 root  30  30 [ksoftirqd/2]
  2 0.0 0 root  31  31 [kworker/2:0-events]
  2 0.0 0 root 192 192 [irq/26-AMD-Vi]
  2 0.0 0 root 202 202 [kworker/2:1-events]
  3 0.0 0 root  33  33 [cpuhp/3]
  3 0.0 0 root  34  34 [idle_inject/3]
  3 0.3 0 root  35  35 [migration/3]
  3 0.0 0 root  36  36 [ksoftirqd/3]
  3 0.0 0 root  37  37 [kworker/3:0-events]
  3 0.0 0 root 203 203 [kworker/3:1-events]
 14 0.0 0 root  99  99 [cpuhp/14]
 14 0.0 0 root 100 100 [idle_inject/14]
 14 0.3 0 root 101 101 [migration/14]
 14 0.0 0 root 102 102 [ksoftirqd/14]
 14 0.0 0 root 103 103 [kworker/14:0-events]
 14 0.0 0 root 210 210 [kworker/14:1-events]
 15 0.0 0 root 105 105 [cpuhp/15]
 15 0.0 0 root 106 106 [idle_inject/15]
 15 0.3 0 root 107 107 [migration/15]
 15 0.0 0 root 108 108 [ksoftirqd/15]
 15 0.0 0 root 109 109 [kworker/15:0-events]
 15 0.0 0 root 211 211 [kworker/15:1-events]
```

Since Ceph uses a kernel driver, I need a way to isolate the cores from the whole system. Running PID 1 onwards in a taskset is okay (a sketch of that idea is below). I cannot use cset due to cgroups v2. numactl is also okay.
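
A minimal sketch of the "PID 1 onwards" idea, assuming systemd is PID 1 and a reasonably recent version: its CPUAffinity= setting is inherited by everything the service manager spawns, though it still does nothing about kernel threads. The CPU list is simply the complement of the faulty threads on a 24-thread 5900X:

```
# /etc/systemd/system.conf — affects PID 1 and all processes forked from it
CPUAffinity=0-1 4-13 16-23
```

The same mask can also be applied to an already-running PID 1:

```
taskset -cp 0-1,4-13,16-23 1   # already-running children keep their old affinity mask
```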

With isolcpus I have no apparent system stability issues; without it I would face secure-connection errors in Firefox and OS installs would fail. But even that is not enough when using Ceph. And I now conclude that it could have corrupted data unnoticed if this were not my homelab machine.

Can anyone suggest a way to effectively ban these faulty threads, permanently, as soon as the system allows it? (I had better use the phrase CPU affinity in this post.)


Edit: I was wrong. I redid the Cat6 cables to just the right length, and since I had already routed them clear of power cables, I can state that interference should be considerably lower than before. The same error was present even when I disabled half the cores in the BIOS, including the faulty ones. I get near-instant VM clones on the Ceph pool now, thanks to the NVMe DB cache I suppose.

Also, the kernel threads remaining on those cores are the per-CPU housekeeping threads (cpuhp, idle_inject, migration, ksoftirqd, kworker); their PIDs and the set of threads on those cores stay constant in the watch output above, even during a VM clone on the Ceph pool. So, as long as no other tasks get scheduled there, it might be working as intended.

I found these tangentially relevant readings interesting: migration (reddit), nohz (lwn.net).

Comments:
  • I've never tested for kernel threads, but could you offline the cores by writing a zero to /sys/devices/system/cpu/cpu#/online? Note that in a hyperthreaded setup each thread shows as a separate CPU, and you can see the siblings by looking at cpu#/topology/thread_siblings_list. (A sketch of this suggestion follows the comments.) Commented May 14, 2024 at 2:02
  • I tried your suggestion but it did not help; it got even worse (for another reason). The processes might have been dead-locked, as they did not vanish. Upon inspection, I found out it was the Ethernet cables. They worked right after being plugged in but could not withstand a reboot, dropping to mere hundreds of Kbps at best. I have edited the question; this was a non-issue, as no fresh PID was ever added to the isolated threads. I can't say for sure, but it seems to work as I intended. @StephenHarris thank you for your time, and sorry to bother you with my lack of effort to verify. Well, better late than never. Commented May 14, 2024 at 23:00
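
A minimal sketch of the first comment's suggestion, assuming the standard sysfs CPU hotplug interface (cpu0 usually cannot be offlined on x86, but 2,3,14,15 can; root required):

```
# inspect the SMT sibling pairing first
cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list

# then take the faulty threads offline at runtime
for cpu in 2 3 14 15; do
    echo 0 > /sys/devices/system/cpu/cpu${cpu}/online
done
```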
