ADR 0004: VPS System Diagnostic & Traefik Routing Fixes

Status

Accepted (Established February 2026)

Context

Following a scheduled system reboot (or Docker cleanup) of the Strato VPS, critical services in the cfs-infra stack failed even though their containers were reported as up and healthy:

  1. Cockpit Daemon: Traefik returned a 502 Bad Gateway error for Cockpit. The systemd socket unit (cockpit.socket), which listens on the host network, failed to start because the Docker bridge interface (docker0 at 172.17.0.1) did not yet exist when systemd tried to bind the address early in the boot sequence.
  2. Open-WebUI: Traefik could not obtain a Let's Encrypt certificate for openwebui-ls.cfscfs.com, and the UI returned 404s and proxy errors. The cause was an outdated router rule in docker-compose.yml that still referenced ai-ls.cfscfs.com.

Decision

To make the stack recover automatically after host reboots, scheduled or not, and after Docker network rebuilds, we implemented two permanent infrastructure fixes:

  1. Systemd Socket Race-Condition Fix (Cockpit):

    • Implemented a systemd drop-in override for cockpit.socket (/etc/systemd/system/cockpit.socket.d/listen.conf).
    • Added FreeBind=yes to its [Socket] section; systemd then sets the IP_FREEBIND socket option on the listening socket.
    • Why? With FreeBind, systemd can bind port 9090 to the requested address (172.17.0.1) even when no interface carries that address yet, so cockpit.socket starts whether or not docker0 exists at that point in boot. This removes the race condition.
  2. Traefik DNS Label Alignment (Open-WebUI):

    • Updated the Traefik router rule in the open-webui container labels to match openwebui-ls.cfscfs.com (the A record that actually points to the VPS).
    • Added extra_hosts: ["host.docker.internal:host-gateway"] to the main Traefik container.
    • Why? Traefik requests certificates for the hostnames named in its router rules, and Let's Encrypt only issues a certificate if the ACME challenge for that hostname reaches this VPS, so the rule must match the published DNS record exactly. The host-gateway entry lets Traefik reach daemons that run directly on the host (such as Cockpit) from inside the Docker network.
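
The Cockpit drop-in from item 1 can be sketched as follows. This is an illustration, not a copy of the deployed file: the path, address, port, and FreeBind setting come from this ADR, while the empty ListenStream= line, which clears the listener inherited from the packaged unit before a new one is set, is an assumption about how the override is written.

```ini
# /etc/systemd/system/cockpit.socket.d/listen.conf
[Socket]
# Assumption: reset the ListenStream list inherited from the packaged unit.
ListenStream=
# Listen only on the Docker bridge address, port 9090.
ListenStream=172.17.0.1:9090
# Allow the bind before 172.17.0.1 is assigned to any interface (IP_FREEBIND).
FreeBind=yes
```

A drop-in like this takes effect after systemctl daemon-reload and a restart of cockpit.socket.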

Consequences

  • Positive:
    • The VPS architecture can survive hard reboots and complete Docker-prune cycles without dropping routing links.
    • Cockpit recovers automatically, with no manual systemctl restart cockpit.socket needed.
    • Standardized reverse-proxy routing from Traefik to daemons running natively on the host, using the host.docker.internal gateway name.
  • Negative:
    • FreeBind can mask genuine bind failures: the bind succeeds even if 172.17.0.1 never appears, so nothing is logged to syslog when the Docker network fails to come up, and diagnosing that case requires inspecting the interface and Docker directly.
    • Traefik labels in docker-compose.yml must be kept in sync with the DNS provider's records by hand.
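
For reference, the settings that have to stay in step with DNS can be sketched as the Compose fragment below. Only the hostname and the extra_hosts entry are taken from this ADR; the service names, the router name open-webui, the traefik.enable label, and the certificate resolver name letsencrypt are hypothetical placeholders and may differ from the real docker-compose.yml.

```yaml
services:
  traefik:
    # Maps host.docker.internal to the host's gateway address so Traefik
    # can reach daemons running directly on the host, such as Cockpit.
    extra_hosts:
      - "host.docker.internal:host-gateway"

  open-webui:
    labels:
      # Assumption: containers are exposed to Traefik explicitly.
      - "traefik.enable=true"
      # Must match the A record that points to the VPS.
      - "traefik.http.routers.open-webui.rule=Host(`openwebui-ls.cfscfs.com`)"
      # Hypothetical resolver name; use the one defined in Traefik's static configuration.
      - "traefik.http.routers.open-webui.tls.certresolver=letsencrypt"
```

If the DNS record is renamed, the Host rule is the line that has to change with it.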

Released under proprietary license.