
I have a systemd-nspawn container in which I am trying to change the kernel parameter for msgmnb. When I try to change the kernel parameter by directly writing to the /proc filesystem or using sysctl inside the systemd-nspawn container, I get an error that the /proc file system is read only.

The Arch Wiki has this relevant passage:

systemd-nspawn limits access to various kernel interfaces in the container to read-only, such as /sys, /proc/sys or /sys/fs/selinux. Network interfaces and the system clock may not be changed from within the container. Device nodes may not be created. The host system cannot be rebooted and kernel modules may not be loaded from within the container. 

I thought the container would inherit some properties of /proc from the host, including the kernel parameter value for msgmnb, but this does not appear to be the case as the host and container have different values for msgmnb.

The kernel parameter value in the container:

    cat /proc/sys/kernel/msgmnb
    16384

Writing to the proc filesystem inside the container:

    $ bash -c 'echo 2621440 > /proc/sys/kernel/msgmnb'
    bash: /proc/sys/kernel/msgmnb: Read-only file system

For completeness, I also tried sysctl in the container:

    # sysctl -w kernel.msgmnb=2621440
    sysctl: setting key "kernel.msgmnb": Read-only file system

I thought this value would be inherited from the host system. I set the value on the host, rebooted and re-created my container. The container (even new ones) maintains the value of 16384.

    # On the host
    $ cat /proc/sys/kernel/msgmnb
    2621440

I've also tried running the container unprivileged with the -U flag when booting it with systemd-nspawn, but I get the same results.

I've also tried editing /etc/sysctl.conf in the container tree to include this line before booting the container:

kernel.msgmnb=2621440 

I also looked into https://man7.org/linux/man-pages/man7/capabilities.7.html and noticed CAP_SYS_RESOURCE which has a line that reads:

CAP_SYS_RESOURCE ... raise msg_qbytes limit for a System V message queue above the limit in /proc/sys/kernel/msgmnb (see msgop(2) and msgctl(2)); 

I started the container with sudo systemd-nspawn --capability=CAP_SYS_RESOURCE -D /path/to/container. Inside the container, calling msgctl with IPC_SET and a msqid_ds->msg_qbytes value higher than the one in /proc/sys/kernel/msgmnb returns an error code. Shouldn't passing CAP_SYS_RESOURCE make this work?

Nothing I've tried here has changed the value for msgmnb in the container. I can't seem to find documentation on how to achieve my goal.

I'd appreciate any help - thank you!

EDIT: Trying to determine if the process calling msgctl has the capability. Here is what I found:

    $ cat /proc/6211/status | grep -i Cap
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 00000000fdecafff
    CapAmb: 0000000000000000
    $ capsh --decode=00000000fdecafff
    0x00000000fdecafff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_raw,cap_ipc_owner,cap_sys_chroot,cap_sys_ptrace,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap

1 Answer

    $ cat /proc/6211/status | grep -i Cap
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 00000000fdecafff
    CapAmb: 0000000000000000

CapInh is the set of inheritable capabilities, which is not useful for the current program, but could be passed on to any programs this process would exec() if the right conditions apply. It's all zeroes, so there are no capabilities in there anyway.

CapEff is the most important one: it is the set of effective capabilities, or the privileged things this process/thread is allowed to do right now. Unfortunately, it is all zeroes here.

CapPrm limits the capabilities this particular process/thread is permitted to get for itself or its child processes if it asks for them. And that is also all zeroes. So as long as this process executes the current program, it will never be able to gain any capabilities at all.

CapBnd is the bounding set that limits the capabilities the descendants of this program could receive, if they were to get them from somewhere else. If this process were to exec() a setuid-root program, this is the set of capabilities that would become effective for it all at once. Or if, for example, this process executed a program that had a setcap 'cap_sys_resource=eip' <filename> done on it, this CapBnd value would allow the CAP_SYS_RESOURCE capability to become effective for the executed program and its child processes.
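If capsh is not at hand, the four masks can also be checked bit by bit. The following is only a sketch: the helper names cap_mask and cap_bit_set are made up for illustration, and it relies on CAP_SYS_RESOURCE being capability number 24 in <linux/capability.h>.

```shell
#!/bin/sh
CAP_SYS_RESOURCE=24   # capability number from <linux/capability.h>

# cap_mask FIELD [STATUS_FILE]: print the hex mask for FIELD (e.g. CapEff).
# Hypothetical helper; reads /proc/self/status unless another file is given.
cap_mask() {
    awk -v field="$1:" '$1 == field { print $2 }' "${2:-/proc/self/status}"
}

# cap_bit_set HEX_MASK BIT: print "yes" if BIT is set in HEX_MASK, else "no".
# Hypothetical helper; HEX_MASK is given without a 0x prefix, as in /proc.
cap_bit_set() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then echo yes; else echo no; fi
}

# Report CAP_SYS_RESOURCE for the shell running this script.
for field in CapPrm CapEff CapBnd; do
    mask=$(cap_mask "$field")
    [ -n "$mask" ] || continue   # field not found, e.g. no /proc on this system
    echo "$field: CAP_SYS_RESOURCE $(cap_bit_set "$mask" "$CAP_SYS_RESOURCE")"
done
```

Fed the masks from the status output quoted at the top of this answer, cap_bit_set reports "no" for CapPrm and CapEff (both all zeroes) and "yes" for CapBnd (00000000fdecafff).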

So your process currently does not have the CAP_SYS_RESOURCE capability and cannot get it without exec()ing another program.
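What an exec() does to these sets follows the transformation rules listed in capabilities(7). Here is a rough sketch of those rules as shell arithmetic. The function name exec_caps is hypothetical, and the model is simplified: it treats a file as privileged only when it carries file capabilities and ignores the setuid/setgid case.

```shell
#!/bin/sh
# exec_caps P_INH P_BND P_AMB F_INH F_PRM F_EFF
# Hypothetical helper: prints the new "permitted effective" masks (hex) after
# an exec(), using the rules from capabilities(7):
#   P'(ambient)   = (file is privileged) ? 0 : P(ambient)
#   P'(permitted) = (P(inh) & F(inh)) | (F(prm) & P(bnd)) | P'(ambient)
#   P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
# Simplification: "privileged" here means only "has file capabilities".
exec_caps() {
    p_inh=$(($1)) p_bnd=$(($2)) p_amb=$(($3))
    f_inh=$(($4)) f_prm=$(($5)) f_eff=$6
    if [ $((f_inh | f_prm)) -ne 0 ]; then new_amb=0; else new_amb=$p_amb; fi
    new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | new_amb ))
    if [ "$f_eff" -eq 1 ]; then new_eff=$new_prm; else new_eff=$new_amb; fi
    printf '%x %x\n' "$new_prm" "$new_eff"
}

CAP=$((1 << 24))   # CAP_SYS_RESOURCE is capability number 24

# Ordinary program, no file capabilities, nothing ambient: stays at zero.
exec_caps 0 0xfdecafff 0 0 0 0                # prints: 0 0

# Program with setcap 'cap_sys_resource=eip': allowed through by CapBnd.
exec_caps 0 0xfdecafff 0 "$CAP" "$CAP" 1      # prints: 1000000 1000000

# Ordinary program, but CAP_SYS_RESOURCE is in the ambient (and inheritable) set.
exec_caps "$CAP" 0xfdecafff "$CAP" 0 0 0      # prints: 1000000 1000000
```

The last case shows why an ambient capability helps: it reaches the effective set of a plain program without needing setcap on the binary.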

To make CAP_SYS_RESOURCE immediately effective for your containerized process, you would need to add the option --ambient-capability=CAP_SYS_RESOURCE to your systemd-nspawn command line.
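A hedged sketch of how that could look, keeping the --capability option and the /path/to/container placeholder from the question. Two assumptions are built in: that the first line of systemd-nspawn --version reads "systemd <N> (...)", and that the option needs systemd 248 or later, as a comment on this answer points out. The function name supports_ambient_capability is made up for illustration, and the script only prints the command, since running it needs root and a real container tree.

```shell
#!/bin/sh
# supports_ambient_capability VERSION: succeed if this systemd major version
# has --ambient-capability (248 or later, per the comment on this answer).
supports_ambient_capability() {
    [ "${1:-0}" -ge 248 ]
}

# Assumption: the first line of `systemd-nspawn --version` reads "systemd <N> (...)".
if command -v systemd-nspawn >/dev/null 2>&1; then
    version=$(systemd-nspawn --version | awk 'NR == 1 { print $2 }')
    version=${version%%[!0-9]*}   # keep only the leading digits
else
    version=0                     # tool not installed on this machine
fi

if supports_ambient_capability "$version"; then
    # Printed rather than executed: running it needs root and a container tree.
    echo "sudo systemd-nspawn --capability=CAP_SYS_RESOURCE --ambient-capability=CAP_SYS_RESOURCE -D /path/to/container"
else
    echo "systemd $version lacks --ambient-capability (needs 248 or later)"
fi
```

Once a container is started this way, grep Cap /proc/self/status inside it should show whether CapEff is still all zeroes.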

  • AmbientCapability was added with systemd version 248 and I am running version 245 (Ubuntu Focal Fossa). Thank you for this answer, yesterday, I did not even know about capabilities in Linux - very helpful! Commented Jun 6, 2024 at 17:23
