Diagnose a stalled agent
Triage path when an agent stops reporting or its queue depth grows.
A stalled agent surfaces in two places: the cloud marks it as stopped on the Agents page (no heartbeat in the expected window), and the local dashboard's queue depth grows beyond a few pending entries. The triage below covers both.
Step 1: confirm the failure mode
Open the agent's dashboard at http://<host>:10101. Four
patterns are common:
| Pattern | Probable cause |
|---|---|
| Process up, queue growing, last heartbeat recent but failing | Cloud reachability problem from this host. |
| Process up, queue depth zero, heartbeat succeeding, but cloud says stopped | Clock skew or stale state in the console (refresh). |
| Process up, no probes have run | The cloud has not assigned any metrics to this agent yet. |
| Dashboard unreachable | The agent process is down, typically from a container restart loop or a host crash. |
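To tell the last row apart from the first three without opening a browser, a minimal sketch along these lines may help. The function name `dashboard_status` is hypothetical. The sketch assumes that any HTTP response on port 10101 means the agent process is up, and that `curl` is installed on the machine running the check.

```shell
# Print "up" if anything answers HTTP on the dashboard port, "down" otherwise.
# Assumption: any HTTP response, whatever its status code, means the process is alive.
dashboard_status() {
  host="$1"
  port="${2:-10101}"   # dashboard port from this page
  if curl -s -o /dev/null --max-time 5 "http://${host}:${port}/"; then
    echo up
  else
    echo down
  fi
}

# Usage (replace the placeholder with the agent's host):
#   dashboard_status <host>
```

A result of `down` points to Step 2. A result of `up` means the process is running, so one of the first three rows applies.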
Step 2: agent down
If the dashboard is unreachable:
# Docker
docker logs --tail 200 observer-agent
# Kubernetes
kubectl -n observer logs deploy/observer-agent --tail=200
# Linux systemd
journalctl -u observer-agent -n 200 --no-pager
Look for a panic, an unhandled rejection, or a configuration error
on startup. Common causes: missing AGENT_KEY, malformed
CLOUD_SERVER_URL, port 10101 already in use.
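Scanning 200 log lines by eye is error-prone, so a small filter such as the one below can narrow the output. This is a sketch: the function name `startup_errors` is hypothetical, and the match patterns are guesses at how these failures might be worded, not the agent's documented log format.

```shell
# Read log text on stdin and keep only lines that hint at a startup failure.
# The patterns are assumptions; adjust them to the messages your agent emits.
startup_errors() {
  grep -E -i 'panic|unhandled.*rejection|AGENT_KEY|CLOUD_SERVER_URL|address already in use|EADDRINUSE'
}

# Usage with the Docker command above (2>&1 also captures the container's stderr):
#   docker logs --tail 200 observer-agent 2>&1 | startup_errors
```

The filter exits non-zero when nothing matches. In that case, read the unfiltered log instead.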
Step 3: cloud unreachable
If the dashboard's Cloud panel shows a recent
last_heartbeat_error, the issue is between the agent and the
cloud. Verify in this order:
- DNS: getent hosts <cloud-host> from inside the container.
- TCP: curl -v https://<cloud-host> from inside the container.
- TLS: certificate trust. Custom internal CAs need the container's trust store updated.
- Auth: a recently rotated AGENT_KEY requires updating the agent's environment.
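The first three checks can be chained so that the output names the first layer that fails. The sketch below assumes it runs inside the agent container and that both `getent` and `curl` are present there. The function name `check_cloud` is hypothetical. The mapping relies on curl's documented exit codes (7 connect failure, 28 timeout, 35 TLS handshake failure, 60 untrusted certificate).

```shell
# Check DNS, then TCP and TLS, and report the first layer that fails.
# Pass the cloud hostname as the first argument.
check_cloud() {
  host="$1"
  if ! getent hosts "$host" >/dev/null; then
    echo "DNS: cannot resolve $host"
    return 1
  fi
  rc=0
  curl -sS -o /dev/null --max-time 10 "https://${host}/" || rc=$?
  case "$rc" in
    0)     echo "DNS, TCP and TLS OK; check AGENT_KEY next" ;;
    7|28)  echo "TCP: cannot connect to $host (curl exit $rc)"; return 1 ;;
    35|60) echo "TLS: handshake or certificate trust failed (curl exit $rc)"; return 1 ;;
    *)     echo "curl failed with exit $rc"; return 1 ;;
  esac
}

# Usage:
#   check_cloud <cloud-host>
```

The auth check is left out because it depends on how the cloud rejects a stale key, which this page does not describe.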
The drain controller automatically retries with exponential
backoff. The queue continues to accept pushes up to its capacity
(MAX_ROWS, default 10000). Once the cloud is reachable again,
the queue drains.
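How long an outage the queue can absorb depends on how fast entries arrive. As a rough estimate, the remaining capacity divided by the push rate gives the time until the cap is reached. The sketch below assumes a steady push rate in rows per minute, a figure this page does not state and that you would need to measure yourself. The function name `minutes_until_full` is hypothetical.

```shell
# Estimate whole minutes (rounded down) until the queue reaches its cap.
# Arguments: current depth, rows pushed per minute (assumed steady), cap.
minutes_until_full() {
  depth="$1"
  rate="$2"
  max="${3:-10000}"   # MAX_ROWS default from this page
  if [ "$rate" -le 0 ]; then
    echo "rate must be positive" >&2
    return 1
  fi
  echo $(( ($max - $depth) / $rate ))
}

# Usage: an empty queue filling at 50 rows per minute
#   minutes_until_full 0 50    # prints 200
```

If the estimate is shorter than a typical outage in your environment, expect to reach the saturation case covered in Step 4.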
Step 4: queue saturation
If the queue depth has hit MAX_ROWS, the oldest entries are
dropped to admit new ones. The dashboard's queue panel shows the
depth at the cap. After cloud reachability returns, the queue
drains and depth returns to near zero.
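The drop-oldest behavior means that after a long outage, the data that survives is the newest MAX_ROWS entries, and the gap sits at the start of the outage. The sketch below illustrates that policy with `tail`. It is an illustration only, not the agent's implementation, and the function name `bounded_queue` is hypothetical.

```shell
# Illustrate a drop-oldest cap: of all lines read on stdin (oldest first),
# keep only the newest $1 lines.
bounded_queue() {
  cap="$1"
  tail -n "$cap"
}

# Usage: five entries pushed into a queue capped at three
#   printf '%s\n' 1 2 3 4 5 | bounded_queue 3    # prints 3, 4, 5 on separate lines
```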
The cloud's agent.offline webhook fires when the heartbeat
window is exceeded. Subscribe to this event when on-call needs an
explicit alert.