(Originally posted this on Stack Overflow but was asked to move it here.) I have a Python daemon running as a systemd service; sometimes several instances of it run for different processes (p0, p1, etc.) that need the same function. I've noticed that when a service fails, it sometimes takes a significant amount of time, 2 to 3 minutes, to come back up, whereas I have typically seen services come back up in seconds.
To test that systemd would restart the services, I had a script run kill -9 on the processes of 3 of these services and then print the output of systemctl status myapplication@p1 every 30 seconds until the service was running again. What I noticed is that when one or more of the services takes longer than expected to come back up, there is a sleep 30 at the end of the CGroup listing in its status output, and the PID of that sleep changes between each call of the status command.
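The polling part of that test can be sketched in Python roughly as follows. This is not the script I actually ran, just a minimal illustration: the function names, the `max_polls` cap, and the use of `systemctl is-active` instead of the full `systemctl status` output are all assumptions made for the example.

```python
import subprocess
import time


def is_active(unit):
    """Return True if `systemctl is-active` reports the unit as active."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() == "active"


def wait_until_active(unit, interval=30.0, max_polls=20,
                      check=is_active, sleep=time.sleep):
    """Poll the unit until it is active.

    Returns the number of polls it took, or None if the unit was still
    not active after max_polls checks. `check` and `sleep` are injectable
    (hypothetical parameters) so the loop can be exercised without systemd.
    """
    for poll in range(1, max_polls + 1):
        if check(unit):
            return poll
        if poll < max_polls:
            sleep(interval)
    return None
```

Calling `wait_until_active("myapplication@p1")` right after the kill gives a rough recovery time of at most `polls * interval` seconds. Note that the sleep in this loop runs in the test script's own process, not inside the service's CGroup.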
    * myapplication.service - my application service for p1.
         Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled)
         Active: activating (start) since Fri 2022-03-25 14:48:27 UTC; 26s ago
      Cntrl PID: 1335258 (myapplication)
          Tasks: 5 (limit: 9505)
         Memory: 3.3M
            CPU: 1.442s
         CGroup: /system.slice/system-myapplication.slice/[email protected]
                 |-1335258 /bin/bash /path/to/application/myapplication start p1
                 |-1336794 sudo -u admin -i python /path/to/application/myapplication.py start p1
                 |-1336795 logger
                 |-1336802 -bash --login -c python \/path\/to\/application/myapplication\.py start p1
                 `-1336830 sleep 30

    * myapplication.service - my application service for p1.
         Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled)
         Active: activating (start) since Fri 2022-03-25 14:48:27 UTC; 56s ago
      Cntrl PID: 1335258 (myapplication)
          Tasks: 5 (limit: 9505)
         Memory: 3.3M
            CPU: 1.444s
         CGroup: /system.slice/system-myapplication.slice/[email protected]
                 |-1335258 /bin/bash /path/to/application/myapplication start p1
                 |-1336794 sudo -u admin -i python /path/to/application/myapplication.py start
                 |-1336795 logger
                 |-1336802 -bash --login -c python \/path\/to\/application/myapplication\.py start p1
                 `-1336919 sleep 30

    * myapplication.service - my application service for p1.
         Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled)
         Active: activating (start) since Fri 2022-03-25 14:48:27 UTC; 1min 26s ago
      Cntrl PID: 1335258 (myapplication)
          Tasks: 5 (limit: 9505)
         Memory: 3.3M
            CPU: 1.447s
         CGroup: /system.slice/system-myapplication.slice/[email protected]
                 |-1335258 /bin/bash /path/to/application/myapplication start p1
                 |-1336794 sudo -u admin -i python /path/to/application/myapplication.py start p1
                 |-1336795 logger
                 |-1336802 -bash --login -c python \/path\/to\/application/myapplication\.py start p1
                 `-1336998 sleep 30

    * myapplication.service - my application service for p1.
         Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled)
         Active: active (running) since Fri 2022-03-25 14:49:58 UTC; 25s ago
        Process: 1337069 ExecStart=/path/to/application/myapplication start (code=exited, status=0/SUCCESS)
       Main PID: 1337603 (python)
          Tasks: 6 (limit: 9505)
         Memory: 7.2M
            CPU: 1.541s
         CGroup: /system.slice/system-myapplication.slice/[email protected]
                 |-1337603 python /path/to/application/myapplication.py start p1
                 |-1337659 /bin/sh -c sudo timeout 15 sudo tcpdump -n -i lo udp port 1234 2> /tmp/capture.txt > /dev/null
                 |-1337660 sudo timeout 15 sudo tcpdump -n -i lo udp port 1234
                 |-1337662 timeout 15 sudo tcpdump -n -i lo udp port 1234
                 |-1337663 sudo tcpdump -n -i lo udp port 6343
                 `-1337664 tcpdump -n -i lo udp port 6343

My unit file doesn't have any restart delay as part of it:
    [Unit]
    Description=my application service for %i.

    [Service]
    Type=forking
    ExecStart=/path/to/application/myapplication start
    ExecStop=/path/to/application/myapplication stop
    ExecReload=/path/to/application/myapplication restart
    Restart=always

    [Install]
    WantedBy=multi-user.target

Why does this delay occur, and why only intermittently (sometimes it happens, sometimes it does not)? Why is the PID of the sleep changing? My initial thought was that it is a new sleep after each restart attempt, but judging by the status output there was no restart: the Active: activating (start) ... time is unchanged until the service is restarted.
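To make the comparison between successive status calls less error-prone than eyeballing it, something like the following could diff the CGroup listings. It is only a sketch: the function names are made up for this example, and the regular expression assumes the tree is drawn with `|-` and `` `- `` as in the output above (or with the `├─` and `└─` characters systemctl uses on a UTF-8 terminal).

```python
import re

# Matches CGroup tree entries such as "|-1335258 /bin/bash ..." or
# "`-1336830 sleep 30", plus the UTF-8 variants "├─" and "└─".
ENTRY = re.compile(r"(?:[|`]-|[├└]─)(\d+) (.+)")


def cgroup_processes(status_text):
    """Map PID -> command line for every process listed in the CGroup tree."""
    procs = {}
    for line in status_text.splitlines():
        match = ENTRY.search(line)
        if match:
            procs[int(match.group(1))] = match.group(2).strip()
    return procs


def changed_processes(before, after):
    """Return (gone, new): processes only in `before` and only in `after`."""
    old, cur = cgroup_processes(before), cgroup_processes(after)
    gone = {pid: cmd for pid, cmd in old.items() if pid not in cur}
    new = {pid: cmd for pid, cmd in cur.items() if pid not in old}
    return gone, new
```

Fed the first two status outputs above, this reports PID 1336830 (sleep 30) as gone and PID 1336919 (sleep 30) as new, while the control PID 1335258 and the other processes stay the same, which matches the observation that only the sleep is being replaced.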
I've done a bit of experimenting now, and I think it may be due to me running the kill command as a one-liner:

    kill -9 <p1-PID> <p2-PID> <p3-PID>

When I separated it out to

    kill -9 <p1-PID>
    kill -9 <p2-PID>
    kill -9 <p3-PID>

I no longer run into the sleep 30 for the services. This still needs more testing to be sure, but it's looking that way just now.
But I'm still curious as to why the sleep 30 appeared?
See StartLimitBurst= (default 5) and StartLimitIntervalSec= (default 10s) in man systemd.unit, though I don't think this is the cause here. By default, a unit is limited to 5 starts within 10 seconds.
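To illustrate what that limit means in practice, here is a simplified fixed-window model in Python. It is not systemd's implementation, only a sketch of the idea with the default values plugged in; the function name and the exact window semantics are assumptions made for the example.

```python
def allowed_starts(timestamps, interval=10.0, burst=5):
    """Simplified model of a start rate limit.

    Takes start attempt times in seconds (ascending) and returns one bool
    per attempt: True if the attempt fits within `burst` starts per
    `interval`-second window, False if it would be refused.
    """
    window_start = None
    count = 0
    results = []
    for now in timestamps:
        if window_start is None or now - window_start > interval:
            # A new window begins with this attempt.
            window_start, count = now, 1
            results.append(True)
        elif count < burst:
            count += 1
            results.append(True)
        else:
            # Too many starts inside the current window.
            results.append(False)
    return results
```

Under this model, six starts one second apart (`[0, 1, 2, 3, 4, 5]`) would have the sixth refused, while starts three seconds apart would all go through. In the status output from the question the Active: timestamp never changes while the unit is activating, so there is only a single start attempt and the limit would not come into play.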