After digging through additional documentation, I found a solution.
"Delegation" is the tool that cgroups and systemd provide for this type of use case. With delegation a service is allowed to manage its own cgroup sub-hierarchy. This includes the ability to create additional "helper" cgroups. Below is an example of the situation described in the original question followed by a solution that uses delegation.
Error example
Let's say we have this foo service:
    # foo.service
    [Unit]
    Description=Foo without delegation

    [Service]
    User=foo-user
    ExecStart=/bin/bash /usr/bin/foo.sh

    [Install]
    WantedBy=multi-user.target

With /usr/bin/foo.sh defined as:

    #!/usr/bin/bash
    #
    # /usr/bin/foo.sh
    set -euo pipefail

    echo "Starting foo service"

    while true; do
        echo "Calling tool in helper cgroup"
        cgexec --sticky -g memory:helper bash /usr/bin/memory-intensive-helper.sh
    done
Note that we'd like to execute memory-intensive-helper.sh in a "helper" cgroup we've naively created using cgcreate:
    sudo cgcreate -a foo-user:foo-user -t foo-user:foo-user -g memory:helper
    sudo bash -c 'echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/helper/memory.max'
If we start the foo.service:
sudo systemctl restart foo.service
And watch logs:
journalctl -f -u foo.service
We get the familiar "cgroup change of group failed" error:
    systemd[1]: Started foo.service - Foo without delegation.
    bash[15202]: Starting foo service
    bash[15202]: Calling tool in helper cgroup
    bash[15203]: cgroup change of group failed
    systemd[1]: foo.service: Main process exited, code=exited, status=87/n/a
    systemd[1]: foo.service: Failed with result 'exit-code'.
The reason for the error is explained by cgroup v2's "containment" behavior (see "Delegation and Containment"). Notice that our cgroup hierarchy looks like this:
    /sys/fs/cgroup
    |
    +-- helper
    |
    +-- system.slice
        |
        +-- foo.service
Systemd runs the foo.sh process in the cgroup system.slice/foo.service. foo.sh then tries to run memory-intensive-helper.sh in the helper cgroup. For that to work, the memory-intensive-helper.sh process has to be moved from system.slice/foo.service to helper. For this move, cgroup v2 requires foo-user to have write access to the cgroup.procs file of the cgroup that is the common ancestor of system.slice/foo.service and helper. In this case that is the root cgroup. By default only root has write access to the root cgroup, so the move fails. (Granting non-root users access to the root cgroup is discouraged.)
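To make the common-ancestor rule concrete, here is a minimal sketch that computes the nearest common ancestor of two cgroup paths (given relative to the cgroup2 mount point). The function name `common_cgroup_ancestor` is made up for illustration; it only does string comparison and does not touch the cgroup filesystem.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the nearest common ancestor of two cgroup paths.
# Paths are relative to the cgroup2 mount point (e.g. /sys/fs/cgroup).
common_cgroup_ancestor() {
    local -a a b
    local i common=""
    # Split both paths into their components.
    IFS=/ read -r -a a <<< "${1#/}"
    IFS=/ read -r -a b <<< "${2#/}"
    # Keep components for as long as both paths agree.
    for ((i = 0; i < ${#a[@]} && i < ${#b[@]}; i++)); do
        [[ "${a[i]}" == "${b[i]}" ]] || break
        common+="/${a[i]}"
    done
    # An empty result means the only shared ancestor is the root cgroup.
    printf '%s\n' "${common:-/}"
}

# The failing layout: the only shared ancestor is the root cgroup.
#   common_cgroup_ancestor system.slice/foo.service helper
#   prints: /
```

Assuming the usual mount point, you could then test whether the current user may write to that ancestor with something like `[ -w "/sys/fs/cgroup$(common_cgroup_ancestor system.slice/foo.service helper)/cgroup.procs" ]`, which is false for foo-user here.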
A solution using delegation
In the error example, foo.sh's cgexec command failed because foo-user did not have sufficient access to the cgroup hierarchy. With the Delegate and DelegateSubgroup options we can create a sub-hierarchy that the service, running as foo-user, can control completely by itself:

    # foo.service
    [Unit]
    Description=Foo with delegation

    [Service]
    User=foo-user
    Delegate=yes
    DelegateSubgroup=main
    ExecStart=/bin/bash /usr/bin/foo.sh

    [Install]
    WantedBy=multi-user.target
When we start the foo service with those additional options, the resulting cgroup hierarchy will look like this:
    /sys/fs/cgroup
    |
    +-- system.slice
        |
        +-- foo.service   <-- owned by foo-user
            |
            +-- main      <-- owned by foo-user
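If you want to confirm that the delegation took effect, one option is to check who owns the delegated directory and its control files. The sketch below is only an illustration: `delegation_owned_by` is a hypothetical helper name, it assumes GNU `stat`, and it takes a numeric UID so that it does not depend on user-name lookups.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if the cgroup directory and the files a
# delegatee needs to write (cgroup.procs, cgroup.subtree_control) are owned
# by the given numeric UID.
delegation_owned_by() {
    local uid=$1 cgroup_dir=$2 path
    for path in "$cgroup_dir" \
                "$cgroup_dir/cgroup.procs" \
                "$cgroup_dir/cgroup.subtree_control"; do
        # stat -c '%u' prints the owner's numeric UID (GNU coreutils).
        if [[ "$(stat -c '%u' "$path" 2>/dev/null)" != "$uid" ]]; then
            echo "not owned by uid ${uid}: ${path}" >&2
            return 1
        fi
    done
}
```

With the service running, a call such as `delegation_owned_by "$(id -u foo-user)" /sys/fs/cgroup/system.slice/foo.service` should succeed if the hierarchy matches the diagram above.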
We can then have foo.sh create any cgroups that it may need itself:
    #!/usr/bin/bash
    #
    # /usr/bin/foo.sh
    set -euo pipefail

    echo "Starting foo service shell"

    # Must match the cgroup systemd created for this unit (see the hierarchy above).
    delegated_subtree_root="/sys/fs/cgroup/system.slice/foo.service"

    # Enable memory controllers in child cgroups. According to the docs [1]:
    #
    # > Resources are distributed top-down and a cgroup can further distribute a
    # > resource only if the resource has been distributed to it from the parent.
    # > This means that all non-root “cgroup.subtree_control” files can only contain
    # > controllers which are enabled in the parent’s “cgroup.subtree_control” file.
    #
    # [1]: https://docs.kernel.org/admin-guide/cgroup-v2.html#top-down-constraint
    echo "+memory" > "${delegated_subtree_root}/cgroup.subtree_control"

    # Create and configure the "helper" cgroup
    helper_cgroup_path="${delegated_subtree_root}/helper"
    mkdir "$helper_cgroup_path"
    echo $((512 * 1024 * 1024)) > "${helper_cgroup_path}/memory.max"

    while true; do
        echo "Calling tool in helper cgroup"
        cgexec --sticky -g memory:system.slice/foo.service/helper memory-intensive-helper.sh
        sleep 2
    done
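The script above relies on cgexec from libcgroup. As a possible alternative, cgroup v2 also lets a process move itself by writing its own PID to the target cgroup's cgroup.procs file. The following is a minimal sketch of that approach; `run_in_cgroup` is a hypothetical helper name, not part of libcgroup or systemd, and I have not verified that it behaves identically to `cgexec --sticky`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command inside the cgroup at the given directory.
# Usage: run_in_cgroup <cgroup-dir> <command> [args...]
run_in_cgroup() {
    local cgroup_dir=$1
    shift
    (
        # $BASHPID is the PID of this subshell; $$ would be the parent script,
        # which should stay in its own cgroup.
        echo "$BASHPID" > "${cgroup_dir}/cgroup.procs" || exit 1
        # exec keeps the same PID, so the command runs in the target cgroup.
        exec "$@"
    )
}
```

In foo.sh, the cgexec line could then be replaced by something like `run_in_cgroup "/sys/fs/cgroup/system.slice/foo.service/helper" bash /usr/bin/memory-intensive-helper.sh`. The subshell is what keeps the main loop itself out of the helper cgroup.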
With foo.sh now creating its own helper cgroup, here is the final cgroup hierarchy:
    /sys/fs/cgroup
    |
    +-- system.slice
        |
        +-- foo.service   <-- owned by foo-user
            |
            +-- main      <-- owned by foo-user
            |
            +-- helper    <-- owned by foo-user
With foo-user in control of the entire /sys/fs/cgroup/system.slice/foo.service sub-tree, it's now free to move processes among the leaf cgroups (i.e., main, helper).
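To check that a process really ended up in the expected leaf cgroup, you can read its /proc/<pid>/cgroup file. On cgroup v2 the unified hierarchy appears as a line starting with `0::`. Here is a small sketch for extracting that path; the function name `cgroup_v2_path` is made up for this example.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read the contents of a /proc/<pid>/cgroup file on stdin
# and print the cgroup v2 path, i.e. the part after the "0::" prefix.
# Lines belonging to cgroup v1 hierarchies (hybrid setups) are ignored.
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Example: show the cgroup of the current shell.
#   cgroup_v2_path < /proc/self/cgroup
```

For the helper process started by foo.sh, the expected result would be /system.slice/foo.service/helper, while foo.sh itself should report /system.slice/foo.service/main.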
Helpful documentation
The above solution elides some details. To fill in the gaps, I recommend the following key pieces of documentation: