Observer

Thresholds and dwell

How a metric's status is decided, and how dwell gating prevents flapping.

A metric's status flips through three layers in order:

  1. Threshold evaluation: the strict-operator rule applied to each pushed value (see threshold operators).
  2. Dwell gating: a status only flips after holding the new state for the configured dwell period.
  3. Shadow mode (optional): a metric can be marked as shadowed for a window so it does not affect status pages or fire webhook events while operators tune it.

This page covers steps 2 and 3.

Why dwell exists

A naive implementation would publish every status change the agent reports. In practice, metrics flap. A network blip pushes one bad sample, the next sample is fine, the on-call gets paged twice per minute.

Dwell gating requires the new status to hold for a minimum duration before propagating. Configure two values per metric:

  • Dwell to breach: how long the metric must report the new status before flipping into a worse band (healthy → degraded or healthy → unhealthy).
  • Dwell to recover: how long the metric must report the new status before flipping into a better band (unhealthy → healthy).

The defaults shipped with the create form are conservative: 60 seconds to breach, 300 seconds to recover. Asymmetric values ("quick to flag, slow to recover") are appropriate for systems where premature recovery announcements have a higher cost than a delayed unhealthy alert.
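The gating described above can be sketched as a small state machine. This is an illustrative sketch, not the shipped implementation; `DwellGate`, its field names, and the severity ordering are assumptions:

```python
from dataclasses import dataclass

# Assumed severity ordering for deciding "worse band" vs "better band".
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

@dataclass
class DwellGate:
    """Illustrative dwell gate: a status flip propagates only after the
    new status has held for the configured dwell period."""
    dwell_to_breach: float = 60.0    # seconds before flipping to a worse band
    dwell_to_recover: float = 300.0  # seconds before flipping to a better band
    published: str = "healthy"       # last status that actually propagated
    candidate: str = "healthy"       # status the raw samples currently report
    candidate_since: float = 0.0     # when the current candidate first appeared

    def observe(self, raw_status: str, now: float) -> str:
        # A different raw status restarts the dwell clock.
        if raw_status != self.candidate:
            self.candidate = raw_status
            self.candidate_since = now
        if self.candidate != self.published:
            worse = SEVERITY[self.candidate] > SEVERITY[self.published]
            dwell = self.dwell_to_breach if worse else self.dwell_to_recover
            if now - self.candidate_since >= dwell:
                self.published = self.candidate
        return self.published
```

With the defaults above, a single bad sample followed by a good one never propagates: the unhealthy candidate is replaced before the 60-second breach dwell elapses, which is exactly the flap the gate exists to absorb.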

Status sources

Three statuses come from the strict-operator rule:

  • healthy
  • degraded
  • unhealthy

Two statuses come from collection-layer outcomes, not from values:

  • no_data: the agent attempted a probe but produced no value. The reason code is recorded alongside (ECONNREFUSED, ETIMEDOUT, no_data_for_query, etc.).
  • unknown: no recent push has arrived for the metric within the expected interval.

no_data and unknown do not burn SLO budget by default; they are operational signals that surface in the agent dashboard and as metric.no_data webhook events.
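One way to picture the five statuses and where each comes from. The numeric thresholds below are placeholders standing in for the metric's strict-operator rule, and `classify` with its `(value, reason)` input shape is an illustrative assumption:

```python
def classify(push, degraded_above=200.0, unhealthy_above=500.0):
    """Map one collection outcome to a status (illustrative sketch).

    `push` is None when nothing arrived within the expected interval;
    otherwise it is a (value, reason) pair. The two thresholds stand in
    for the metric's strict-operator rule."""
    if push is None:
        return ("unknown", None)        # no recent push at all
    value, reason = push
    if value is None:
        return ("no_data", reason)      # probe ran but produced no value
    # Value-derived statuses, via a strict (>) comparison:
    if value > unhealthy_above:
        return ("unhealthy", None)
    if value > degraded_above:
        return ("degraded", None)
    return ("healthy", None)
```

The split matters operationally: the first two branches describe the collection layer (and by default do not burn SLO budget), while the last three describe the measured value.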

Stale data tolerance

Dwell gating handles the small flaps. A separate read-time rule handles a much larger gap: what happens when no sample arrives at all, because the agent has crashed, the network between the agent and Observer Cloud is partitioned, or Observer Cloud itself is degraded.

A metric is stale when its last push timestamp is older than three times its push interval, capped at 15 minutes:

threshold = min(3 × push_interval, 15 minutes)
stale     = (now − last_push_timestamp) > threshold

The 3× multiplier gives the agent one full retry-and-backoff window before a missing push is considered a problem. The 15-minute hard cap stops a slow push cadence from masking a multi-hour outage on a high-importance metric.

Staleness is computed at read time. The database always carries whatever the agent last pushed. The status-page and embed renderers make the call independently each time they load.
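The read-time rule can be expressed directly. This is a sketch of the formula above; `is_stale` and its parameter names are illustrative, not a quoted API:

```python
STALE_CAP_SECONDS = 15 * 60  # the 15-minute hard cap

def is_stale(now, last_push, push_interval_minutes):
    """Read-time staleness check: 3x the push interval, capped at 15 minutes."""
    threshold = min(3 * push_interval_minutes * 60, STALE_CAP_SECONDS)
    return (now - last_push) > threshold
```

A 1-minute cadence goes stale after 180 seconds of silence; a 10-minute cadence hits the cap and goes stale after 900 seconds rather than waiting the full 1,800.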

A stale metric is excluded from the service rollup and is not counted toward SLO burn. The metric.status_changed and metric.no_data webhooks do not fire on staleness transitions. What does fire is agent.lag_high and agent.offline: operator-facing events that point at the actual cause.

When every metric on a service is stale, the service renders as monitoring_delayed with a "Last known: Operational" caption alongside. See Observer availability for the full trust contract.
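A minimal sketch of that rollup decision. The monitoring_delayed branch follows the text; the worst-status-wins rule for fresh metrics is an assumption, as is the `(status, is_stale)` input shape:

```python
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def service_rollup(metric_states):
    """metric_states: list of (status, is_stale) pairs (illustrative shape)."""
    fresh = [status for status, is_stale in metric_states if not is_stale]
    if not fresh:
        # Every metric is stale: do not guess, report that monitoring is delayed.
        return "monitoring_delayed"
    # Assumed rule: the worst fresh status wins the rollup.
    return max(fresh, key=SEVERITY.__getitem__)
```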

Shadow mode

A metric can be marked shadowed until a future timestamp. While shadowed:

  • The metric still pushes status to the cloud.
  • Status pages exclude the shadowed metric from the rolled-up page status.
  • Webhook events for the shadowed metric are suppressed.
  • The metric's history is still recorded for later inspection.

Use shadow mode when introducing a new metric, tuning its threshold, or rolling out a new probe runtime. Once the metric behaves as expected, clear the shadow timestamp and it joins the public status surface.
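Because shadowed metrics keep pushing and recording history, shadowing behaves like a read-time filter on the public surface. A sketch under that assumption; the `shadowed_until` field name and dict shape are illustrative:

```python
import time

def visible_metrics(metrics, now=None):
    """Drop still-shadowed metrics from the public status surface.
    A metric whose shadow timestamp has passed rejoins automatically."""
    now = time.time() if now is None else now
    return [m for m in metrics
            if m.get("shadowed_until") is None or now >= m["shadowed_until"]]
```

Only read paths (rollups, webhook dispatch) consult this filter; the write path stores every push regardless, which is what makes the recorded history available for later inspection.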
