
Diagnose a stalled agent

Triage path when an agent stops reporting or its queue depth grows.

A stalled agent surfaces in two places: the cloud marks it as stopped on the Agents page (no heartbeat in the expected window), and the local dashboard's queue depth grows beyond a few pending entries. The triage below covers both.

Step 1: confirm the failure mode

Open the agent's dashboard at http://<host>:10101. Four patterns are common:

Pattern and probable cause:

  - Process up, queue growing, recent heartbeat attempts failing: cloud reachability problem from this host.
  - Process up, queue depth zero, heartbeat succeeding, but the cloud says stopped: clock skew or stale state in the console (refresh the page).
  - Process up, but no probes have run: the cloud has not assigned any metrics to this agent yet.
  - Dashboard unreachable: the agent process is down (container restart loop or host crash).
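
To separate the last pattern from the first three, a quick reachability check from the operator's machine is enough (`agent-host` is a placeholder for the agent's hostname):

```shell
# Prints an HTTP status code if the agent is serving its dashboard on 10101;
# prints 000 (and curl exits nonzero) if the process is down or unreachable.
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 http://agent-host:10101/
```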

Step 2: agent down

If the dashboard is unreachable:

# Docker
docker logs --tail 200 observer-agent

# Kubernetes
kubectl -n observer logs deploy/observer-agent --tail=200

# Linux systemd
journalctl -u observer-agent -n 200 --no-pager

Look for a panic, an unhandled rejection, or a configuration error on startup. Common causes: missing AGENT_KEY, malformed CLOUD_SERVER_URL, port 10101 already in use.
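
Two of those causes can be checked without reading the logs. This sketch assumes the Docker deployment and the container name used above; adjust for your setup:

```shell
# Is another process already bound to port 10101 on the host?
ss -ltn 'sport = :10101'

# Are the required variables present in the container's environment?
docker exec observer-agent env | grep -E '^(AGENT_KEY|CLOUD_SERVER_URL)='
```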

Step 3: cloud unreachable

If the dashboard's Cloud panel shows a recent last_heartbeat_error, the issue is between the agent and the cloud. Verify in this order:

  1. DNS: getent hosts <cloud-host> from inside the container.
  2. TCP: curl -v https://<cloud-host> from inside the container.
  3. TLS: certificate trust. Custom internal CAs need the container's trust store updated.
  4. Auth: a recently rotated AGENT_KEY requires updating the agent's environment.
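
The first three checks can be run as a single probe from inside the container (`cloud.example.com` is a placeholder for the host in your CLOUD_SERVER_URL):

```shell
CLOUD_HOST=cloud.example.com  # placeholder: substitute your cloud host

# 1. DNS: does the name resolve at all?
getent hosts "$CLOUD_HOST" || echo "DNS lookup failed for $CLOUD_HOST"

# 2 + 3. TCP and TLS in one step: -v prints the connection attempt and the
# certificate chain, so trust failures are visible in the output.
curl -v --max-time 10 "https://$CLOUD_HOST/" -o /dev/null
```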

The drain controller automatically retries with exponential backoff. The queue continues to accept pushes up to its capacity (MAX_ROWS, default 10000). Once the cloud is reachable again, the queue drains.
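
The retry schedule is internal to the agent, but the shape is standard exponential backoff; this sketch doubles the wait after each failed attempt up to a cap (the 1 s start and 60 s cap are illustrative values, not the agent's actual ones):

```shell
CLOUD_HOST=cloud.example.com  # placeholder: substitute your cloud host
delay=1
until curl -sf --max-time 5 "https://$CLOUD_HOST/" -o /dev/null; do
  echo "cloud unreachable, retrying in ${delay}s"
  sleep "$delay"
  # Double the delay after each failure, capped at 60 seconds.
  delay=$(( delay * 2 > 60 ? 60 : delay * 2 ))
done
```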

Step 4: queue saturation

If the queue depth has hit MAX_ROWS, the oldest entries are dropped to admit new ones. The dashboard's queue panel shows the depth at the cap. After cloud reachability returns, the queue drains and depth returns to near zero.

The cloud's agent.offline webhook fires when the heartbeat window is exceeded. Subscribe to this event when on-call needs an explicit alert.
