I have had two gateways go down overnight during update installation. The servers run Devolutions Gateway 2025.2.3-1 on Ubuntu 24.04. We have unattended upgrades enabled with needrestart, and needrestart deemed it necessary to restart the devolutions-gateway service. The restart failed because the listening port was still in use. This morning I could start the service again immediately without having to kill any lingering process. It appears the gateway reported a stopped state too soon, so the start command was given before it had actually finished shutting down?
Sep 24 06:04:09 devolutionsgw02 systemd[1]: Stopping devolutions-gateway.service...
Sep 24 06:04:09 devolutionsgw02 devolutions-gateway[911]: 2025-09-24T04:04:09.008971Z INFO devolutions_gateway::service: Stopping gateway service
Sep 24 06:04:19 devolutionsgw02 devolutions-gateway[911]: 2025-09-24T04:04:19.010468Z WARN devolutions_gateway::service: Termination of certain tasks is experiencing significant delays
Sep 24 06:04:29 devolutionsgw02 devolutions-gateway[911]: 2025-09-24T04:04:29.011451Z WARN devolutions_gateway::service: Termination of certain tasks is experiencing significant delays
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[911]: 2025-09-24T04:04:39.012837Z WARN devolutions_gateway::service: Terminate forcefully the lingering tasks
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[911]: 2025-09-24T04:04:39.013276Z INFO devolutions_gateway: devolutions-gateway service stopping
Sep 24 06:04:39 devolutionsgw02 systemd[1]: devolutions-gateway.service: Deactivated successfully.
Sep 24 06:04:39 devolutionsgw02 systemd[1]: Stopped devolutions-gateway.service.
Sep 24 06:04:39 devolutionsgw02 systemd[1]: devolutions-gateway.service: Consumed 58.644s CPU time, 79.4M memory peak, 0B memory swap peak.
Sep 24 06:04:39 devolutionsgw02 systemd[1]: Started devolutions-gateway.service.
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[48534]: 2025-09-24T04:04:39.046607Z INFO devolutions_gateway::service: version="2025.2.3"
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[48534]: 2025-09-24T04:04:39.048196Z INFO devolutions_gateway::service: XMF native library loaded and installed path=/usr/lib/devolutions-gateway/libxmf.so
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[48534]: 2025-09-24T04:04:39.048672Z INFO devolutions_gateway::service: Reading JRL file from disk (path: /etc/devolutions-gateway/jrl.json)
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[48534]: 2025-09-24T04:04:39.049560Z INFO devolutions_gateway::listener: Initiating listener... url=tcp://0.0.0.0:8181
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[48534]: 2025-09-24T04:04:39.049611Z INFO devolutions_gateway::listener: Listener started successfully kind=Tcp addr=0.0.0.0:8181
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[48534]: 2025-09-24T04:04:39.049622Z INFO devolutions_gateway::listener: Initiating listener... url=https://0.0.0.0:7171/
Sep 24 06:04:39 devolutionsgw02 devolutions-gateway[48534]: 2025-09-24T04:04:39.050019Z ERROR devolutions_gateway: Failed to start error="failed to bind listener: failed to initialize https://0.0.0.0:7171/: failed to bind TCP socke>
Sep 24 06:04:39 devolutionsgw02 systemd[1]: devolutions-gateway.service: Deactivated successfully.
Hi JelmerJ,
Overnight we saw two Devolutions Gateway instances briefly fail to come back up during unattended upgrades on Ubuntu 24.04 (Gateway 2025.2.3-1).
The upgrades triggered a service restart while the previous Gateway process had not yet fully released the HTTPS listener on port 7171. In that short window the new process could bind 8181, but 7171 was still held by the previous instance at the kernel level, so the start failed with a “port already in use” condition. By the time we checked in the morning the socket had been released and the service started normally.
This is expected behavior in a narrow timing window when a restart happens immediately after shutdown. During a graceful stop, some tasks can linger for a few seconds; systemd may report the unit as stopped before the socket is actually free, so the new process can fail to bind 7171 while the prior process still holds it.
There is no indication of data loss or configuration issues. The impact is limited to a short service unavailability during the restart.
If this timing race recurs frequently, we can consider small OS-level adjustments to the restart timing to eliminate the bind conflict, but no change to the Gateway configuration is required at this time.
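As a sketch of one such OS-level adjustment (the drop-in path and delay value are illustrative, not a tested recommendation), a systemd override could give the old instance time to release its sockets before the new one starts:

```ini
# /etc/systemd/system/devolutions-gateway.service.d/restart-delay.conf
# Illustrative only: wait briefly before starting so a lingering socket
# from the previous instance has time to be released.
[Service]
ExecStartPre=/bin/sleep 5
```

After adding a drop-in like this, `systemctl daemon-reload` is required for it to take effect.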
Regards,
Michel Audi
I would not say the impact is limited to a short service unavailability. The service ended in a stopped state and did not recover automatically. I had to log in and manually start the service.
Hello
The log suggests that the Gateway did not shut itself down in a timely fashion, and was forced to terminate its runtime in an ungraceful way. It seems likely this caused one of the listeners to linger at the system level, such that when the service started back up it could not bind the port again.
It's definitely an issue, but it seems to be something like a race condition: maybe it will never happen again, or maybe it will happen every time. Normally we'd expect systemd to restart the failed service, but since needrestart triggered the process death via systemd, that won't happen:
When the death of the process is a result of systemd operation (e.g. service stop or restart), the service will not be restarted.
So, what we need to do on our side is:
- figure out why some tasks linger during shutdown, forcing an ungraceful termination; and
- make the Gateway tolerate a failed listener bind on startup instead of giving up immediately.
I'm discussing both of these issues with my colleagues on the Gateway team.
In the meantime, to prevent this happening again, I'd recommend excluding the Gateway service from the services that needrestart is allowed to restart. I know there's an exclusion list at /etc/needrestart/needrestart.conf which you should be able to modify appropriately. I've not done that before, and I can't find any canonical documentation; but I know you need to look for the $nrconf{override_rc} setting.
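For illustration, based on the commented examples shipped in the default /etc/needrestart/needrestart.conf, an entry along these lines should exclude the service (the exact regex is an assumption based on the unit name; please verify against the examples in your own config file):

```perl
# /etc/needrestart/needrestart.conf
# Map a regex over unit names to 0 to tell needrestart
# not to restart matching services automatically.
$nrconf{override_rc}{qr(^devolutions-gateway\.service)} = 0;
```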
Please let me know if anything is unclear or if you have further questions.
Kind regards,
Richard Markievicz
Hello,
I was not able to troubleshoot the "lingering tasks" issue, but we have shipped many bug fixes over time, so it's possible the problem is already solved in a newer version of the Devolutions Gateway. However, if you are able to reproduce the issue, please let us know which version you reproduce it on and, if known, how to reproduce it.
On the other hand, limited retry logic for listener binding problems has been implemented and will be part of the next release (2025.3.2).
The expectation is that even if the service experiences a lingering-task problem, or any other race condition when restarting, the retry should solve the main problem of the service not coming back up successfully.
Let us know if you have any further questions, or more information about the lingering tasks issue.
Kind regards,
Benoit Cortier
Thanks for the response; the suggestion to add the service to needrestart's ignore list is a sound one. I'll do this for one of our gateways to see if the problem ever recurs on the other.
Shortly after starting this thread I updated my gateways to the latest version; I had missed the new releases. I had enabled notifications for the announcements section of the Gateway forum, but the latest releases were not announced there, so I've now set up release notifications on GitHub instead.
Thanks for the continuous improvement on the product!
Hi,
Keeping you updated: the latest release of the Devolutions Gateway implements retry logic so that it will not terminate itself on transient errors unless the failure persists for too long.
We are still investigating the lingering-tasks issue, but at least you will no longer get a surprising shutdown like before.
Best regards,
Benoit Cortier