Systemd delaying restart of a failed service, appears to add a 30 sleep

Question

(Originally posted this on stackoverflow but was asked to move it here.) I have a python daemon running as a service, sometimes several of them are running for different processes (p0, p1, etc.) that need the same function. What I've noticed when a service fails, is that sometimes it takes a a significant amount of time for the service to come back up 2 to 3 minutes. When typically I have seen services come back up in seconds.

I had a script run kill -9 on the processes to test that systemd would restart the services and print out systemctl status myapplication@p1 every 30 seconds until the service was running again, for 3 of these services running. What I noticed when one or more of the services takes longer than expected to come back up at the end of their status messages there is this sleep 30 at the end of the CGroup and the PID for it changes between each call of the status command.

 * myapplication.service - my application service for p1. Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled) Active: activating (start) since Fri 2022-03-25 14:48:27 UTC; 26s ago Cntrl PID: 1335258 (myapplication) Tasks: 5 (limit: 9505) Memory: 3.3M CPU: 1.442s CGroup: /system.slice/system-myapplication.slice/[email protected] |-1335258 /bin/bash /path/to/application/myapplication start p1 |-1336794 sudo -u admin -i python /path/to/application/myapplication.py start p1 |-1336795 logger |-1336802 -bash --login -c python \/path\/to\/application/myapplication\.py start p1 `-1336830 sleep 30 * myapplication.service - my application service for p1. Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled) Active: activating (start) since Fri 2022-03-25 14:48:27 UTC; 56s ago Cntrl PID: 1335258 (myapplication) Tasks: 5 (limit: 9505) Memory: 3.3M CPU: 1.444s CGroup: /system.slice/system-myapplication.slice/[email protected] |-1335258 /bin/bash /path/to/application/myapplication start p1 |-1336794 sudo -u admin -i python /path/to/application/myapplication.py start |-1336795 logger |-1336802 -bash --login -c python \/path\/to\/application/myapplication\.py start p1 `-1336919 sleep 30 * myapplication.service - my application service for p1. Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled) Active: activating (start) since Fri 2022-03-25 14:48:27 UTC; 1min 26s ago Cntrl PID: 1335258 (myapplication) Tasks: 5 (limit: 9505) Memory: 3.3M CPU: 1.447s CGroup: /system.slice/system-myapplication.slice/[email protected] |-1335258 /bin/bash /path/to/application/myapplication start p1 |-1336794 sudo -u admin -i python /path/to/application/myapplication.py start p1 |-1336795 logger |-1336802 -bash --login -c python \/path\/to\/application/myapplication\.py start p1 `-1336998 sleep 30 * myapplication.service - my application service for p1. Loaded: loaded (/lib/systemd/system/[email protected]; enabled; vendor preset: enabled) Active: active (running) since Fri 2022-03-25 14:49:58 UTC; 25s ago Process: 1337069 ExecStart=/path/to/application/myapplication start (code=exited, status=0/SUCCESS) Main PID: 1337603 (python) Tasks: 6 (limit: 9505) Memory: 7.2M CPU: 1.541s CGroup: /system.slice/system-myapplication.slice/[email protected] |-1337603 python /path/to/application/myapplication.py start p1 |-1337659 /bin/sh -c sudo timeout 15 sudo tcpdump -n -i lo udp port 1234 2> /tmp/capture.txt > /dev/null |-1337660 sudo timeout 15 sudo tcpdump -n -i lo udp port 1234 |-1337662 timeout 15 sudo tcpdump -n -i lo udp port 1234 |-1337663 sudo tcpdump -n -i lo udp port 6343 `-1337664 tcpdump -n -i lo udp port 6343

My unit file doesn't have any restart delay as part of it

[Unit] Description=my application service for %i. [Service] Type=forking ExecStart=/path/to/application/myapplication start ExecStop=/path/to/application/myapplication stop ExecReload=/path/to/application/myapplication restart Restart=always [Install] WantedBy=multi-user.target

Why does this delay occur, as it isn't regularly, sometimes it happens, sometimes it does not? Why is the PID for the sleep changing? Initial thought was that it is a new sleep after a restart attempt but looking at the status output there was no restart, the Active: activating (start) ... time is unchanged until the service is restarted.

I've done a bit of playing about now and I think it may be due to me running the kill command as a one liner kill -9 <p1-PID> <p2-PID> <p3-PID> when I separated it out to

kill -9 <p1-PID> kill -9 <p2-PID> kill -9 <p3-PID>

I'm not running into the 30 sleep for the services. Still needs more testing to be sure, but it's looking that way just now.

But I'm still curious as to why the sleep 30 appeared?

See also StartLimitBurst (=5) and StartLimitIntervalSec (=10) in man systemd.unit, though I don't think this is the cause here. You are limited by default to 5 restarts in 10 seconds. — meuh
– meuh, Commented Dec 8, 2022 at 17:46

Stewart · Accepted Answer · 2022-12-16 15:58:58Z

Your service contains Type=forking. That means your unit expects your ExecStart= command to fork() and then exit. The service will remain in activating (start) as long as ExecStart= is still running. When ExecStart= does end, systemd will choose one of the spawned processes (MainPID) and enter the active (running) state.

Your ExecStart= program appears to be a bash script. This script probably looks something like this:

#!/bin/bash sudo -u admin -i python /path/to/application/myapplication.py start p1 | logger & while /bin/true; do sleep 30 done

There are a few things wrong with that script, but the part that should be interesting to you is the sleep 30 at the end. Your main process is running, but systemd won't enter the active (running) state until that sleep 30 finishes. You see a new PID each time because a new sleep 30 is constantly being called.

A well behaved script would create your application and then quit immediately. It is possible that your script is waiting for feedback that the application has started properly before exiting, and that would be fine, but it clearly isn't getting the feedback it is waiting for, so you would need to dig deeper to find out why not.

However, I think it's more likely that your script is waiting for your application to exit before it quits. Something like this:

python myapplication.py & PID=$! while ps | grep -q $PID; do sleep 30 done

The reason I think that is because it sounds like you are killing python, not the ExecStart= script. This means the unit will remain in activating (start) until that sleep 30 finishes. Then it will check if the PID of your python application is still present (it isn't) and only then will your script stop.

Another part of that script that bothers me is the use of sudo. sudo is meant to authenticate an interactive user. It is not intended to authenticate scripts. Instead, a more elegant solution is to use User=admin in your service.

I see two solutions. Option 2 is by-far my favorite:

Edit /path/to/application/myapplication and remove that while loop.
Use a simple service and skip that script completely. It looks like it was written for another init system (it explicitly starts its own logger which is not needed these days), then modified to block when called in a shell. systemd does all of that for you, so I bet that script isn't required at all.

[Unit] Description=my application service for %i. [Service] Type=simple ExecStart=/usr/bin/python3 /path/to/application/myapplication.py %i User=admin Restart=always [Install] WantedBy=multi-user.target

If this doesn't help, please post the contents of /path/to/application/myapplication

Stack Exchange Network

Systemd delaying restart of a failed service, appears to add a 30 sleep

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Systemd delaying restart of a failed service, appears to add a 30 sleep

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions