
I've been working on this for a few days and I'm scratching my head. The kernel docs for cgroups, pressure stall information (PSI), and the scheduler haven't shed any light on this so far, so I'm hoping you can help.

I have 3 related processes which run in their own cgroup. The cgroup is a root partition and the only partition which may use CPUs 3-10. One application runs on CPU 11 and all other processes run on CPUs 0-2.
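For context, this kind of partition is set up with roughly the following cgroup v2 steps (a sketch only; the PID variable is a placeholder and the exact paths may differ on your system):

$ cd /sys/fs/cgroup
$ echo "+cpu +cpuset" > cgroup.subtree_control   # enable the controllers for child groups
$ mkdir mycgroup
$ echo 3-10 > mycgroup/cpuset.cpus
$ echo root > mycgroup/cpuset.cpus.partition     # make it an exclusive "root" partition owning CPUs 3-10
$ echo $APP_PID > mycgroup/cgroup.procs          # repeat for each of the three processes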

$ cat cpuset.cpus.effective
3-10

Aside from the cpuset, the other cgroup properties are currently at their defaults. The processes are just using the normal SCHED_OTHER scheduling class right now.

$ for PROC in $(cat cgroup.procs); do chrt -p $PROC; done
pid 16049's current scheduling policy: SCHED_OTHER
pid 16049's current scheduling priority: 0
pid 16058's current scheduling policy: SCHED_OTHER
pid 16058's current scheduling priority: 0
pid 16059's current scheduling policy: SCHED_OTHER
pid 16059's current scheduling priority: 0
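The rest of the cpu controller knobs should confirm the defaults; they can be dumped from the cgroup directory like this (a sketch; the expected values in the comment are taken from the cgroup v2 documentation):

$ grep . cpu.weight cpu.max cpu.uclamp.min cpu.uclamp.max
# at defaults these should read:
# cpu.weight:100   cpu.max:max 100000   cpu.uclamp.min:0.00   cpu.uclamp.max:max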

The cgroup is never getting throttled:

$ cat cpu.stat
usage_usec 1724410414
user_usec 737614077
system_usec 986796337
nr_periods 0
nr_throttled 0
throttled_usec 0

Yet somehow cpu.pressure reports full ~1% of the time:

$ cat cpu.pressure
some avg10=6.85 avg60=5.87 avg300=3.98 total=45578161
full avg10=1.00 avg60=0.73 avg300=0.27 total=9479354

If I change cpu.uclamp.min to max:

  • full drops to 0%
  • some drops to ~3%
  • the CPU usage of the main application in the group drops from ~156% to ~100%
  • average CPU utilization drops (see the graph below; the drop coincides with setting cpu.uclamp.min to max)

[Graph: CPU utilization drops after cpu.uclamp.min is set to max]

$ cat cpu.pressure
some avg10=6.85 avg60=5.87 avg300=3.98 total=45578161
full avg10=1.00 avg60=0.73 avg300=0.27 total=9479354
$ echo max > cpu.uclamp.min
# wait a little bit
$ cat cpu.pressure
some avg10=3.00 avg60=3.18 avg300=4.09 total=61530804
full avg10=0.00 avg60=0.06 avg300=0.32 total=12734850

No CPU core is ever fully loaded in either case, and the cgroup isn't being throttled, so I'm really confused how CPU pressure can ever report a nonzero value for full - wouldn't that mean there are CPU cores sitting idle while threads are waiting to be scheduled?

# with cpu.uclamp.min = 0
CPU [22%@1971,16%@1971,17%@1971,4%@1971,39%@729,38%@729,40%@729,36%@729,36%@729,37%@729,37%@729,10%@729]
# with cpu.uclamp.min = max
CPU [26%@729,14%@1971,21%@729,2%@1971,29%@806,26%@1971,26%@1971,27%@1971,20%@729,25%@729,25%@729,7%@727]

I am assuming that setting the minimum uclamp value causes the scheduler to prioritize this cgroup's threads, but given that the group has exclusive access to the cores it is running on, and no core in the system is fully utilized, I'm struggling to understand the exact mechanism at play here.
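One thing worth checking is whether the clamp is acting through CPU frequency selection rather than through task placement: the per-core figures above show several of the cpuset cores running at a higher clock once the clamp is applied. A sketch for watching that while toggling the setting, assuming the standard sysfs cpufreq interface is exposed:

$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq   # re-run after: echo max > cpu.uclamp.min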

There's no memory or I/O pressure system-wide with either cpu.uclamp.min setting. The system has 47 GiB of free memory, ~5 GiB in use, and ~1.5 GiB used as cache.
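For completeness, those checks look roughly like this (system-wide PSI files plus free):

$ grep . /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io
$ free -h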

  1. How can I have CPU pressure when none of my CPUs are full and there's no memory or I/O pressure?
  2. Why does changing the cpu.uclamp.min value result in lower CPU usage when the system has excess CPU, memory, and I/O resources available?
    • especially since no other process is allowed to use those cores either way
  3. Are there other scheduler classes or settings I can use to tune the performance so that my processes aren't waiting for CPU while my cores sit idle?
  4. Any other ways I can debug what the bottleneck is?

Edit:

Additional details: I'm running a 5.15.148 kernel on aarch64 using a (customized) Poky-based Yocto image.

Edit 2:

I think the reason I have a non-zero full percentage inside my cgroup is because I have a large number of tasks which all wake up at approximately the same time, and compete for CPU resources in a very small time window. So even though I never max out any core, there are small time slices where tasks are waiting. I am going to see if I can temporally distribute the load a little more evenly to alleviate this.
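One way to see those short waits directly is perf's scheduler tooling, which records the delay between a task waking up and actually getting a CPU. A sketch, assuming perf is built into the image (not always the case on Yocto):

$ perf sched record -- sleep 5      # trace scheduler events for 5 seconds
$ perf sched latency --sort max     # per-task max/avg time spent waiting for a CPU
$ perf sched timehist               # per-wakeup view with "sch delay" and "wait time" columns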

I also saw some improvements using the real-time SCHED_FIFO scheduling class over SCHED_OTHER or SCHED_RR. The avg300 CPU pressure within the group dropped from 4.67 with SCHED_OTHER to 1.59 with SCHED_FIFO, and from 4.98 to 3.21 for the system overall.

SCHED_OTHER:

+ cat /sys/fs/cgroup/mycgroup/cpu.pressure
some avg10=6.68 avg60=6.62 avg300=4.67 total=549098067
full avg10=1.00 avg60=1.00 avg300=0.66 total=130085026
+ cat /proc/pressure/cpu
some avg10=5.99 avg60=5.94 avg300=4.89 total=576635062
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

SCHED_RR:

+ cat /sys/fs/cgroup/mycgroup/cpu.pressure
some avg10=1.17 avg60=1.15 avg300=2.38 total=554738350
full avg10=0.00 avg60=0.00 avg300=0.21 total=132311886
+ cat /proc/pressure/cpu
some avg10=3.00 avg60=3.01 avg300=3.66 total=587227268
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

SCHED_FIFO:

+ cat /sys/fs/cgroup/mycgroup/cpu.pressure
some avg10=1.29 avg60=1.18 avg300=1.59 total=560419808
full avg10=0.00 avg60=0.00 avg300=0.05 total=134547726
+ cat /proc/pressure/cpu
some avg10=3.00 avg60=3.00 avg300=3.21 total=597882370
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
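For anyone reproducing this, the class switch can be applied per PID with chrt; the priority value below is an arbitrary choice for the sketch:

# SCHED_FIFO at priority 50; use -r for SCHED_RR, or -o -p 0 to revert to SCHED_OTHER
$ for PROC in $(cat /sys/fs/cgroup/mycgroup/cgroup.procs); do chrt -f -p 50 $PROC; done

Note that real-time tasks are still subject to the global RT throttling limit (kernel.sched_rt_runtime_us, 950000 µs per 1 s period by default), which is independent of the cgroup's cpu.max.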
