Thresholds and dwell
How a metric's status is decided, and how dwell gating prevents flapping.
A metric's status is decided by three layers, applied in order:
- Threshold evaluation: the strict-operator rule applied to each pushed value (see threshold operators).
- Dwell gating: a status only flips after holding the new state for the configured dwell period.
- Shadow mode (optional): a metric can be marked as shadowed for a window so it does not affect status pages or fire webhook events while operators tune it.
This page covers the second and third layers.
Why dwell exists
A naive implementation would publish every status change the agent reports. In practice, metrics flap: a network blip pushes one bad sample, the next sample is fine, and the on-call engineer gets paged twice a minute.
Dwell gating requires the new status to hold for a minimum duration before propagating. Configure two values per metric:
- Dwell to breach: how long the metric must report the new status before flipping into a worse band (healthy → degraded or healthy → unhealthy).
- Dwell to recover: how long the metric must report the new status before flipping into a better band (unhealthy → healthy).
The defaults shipped with the create form are conservative: 60 seconds to breach, 300 seconds to recover. Asymmetric values ("quick to flag, slow to recover") are appropriate for systems where premature recovery announcements have a higher cost than a delayed unhealthy alert.
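Concretely, dwell gating can be thought of as a small state machine kept per metric. The sketch below is illustrative only, assuming hypothetical names (DwellState, applyDwell, the config fields); it is not Observer's actual implementation.

```typescript
type Status = "healthy" | "degraded" | "unhealthy";

// Rank the bands so a breach (worse) can be told apart from a recovery (better).
const rank: Record<Status, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

interface DwellConfig {
  dwellToBreachSec: number;   // e.g. 60, the create-form default
  dwellToRecoverSec: number;  // e.g. 300, the create-form default
}

interface DwellState {
  published: Status;          // the status currently propagated to pages and webhooks
  candidate: Status | null;   // a pending new status that has not been held long enough
  candidateSince: number;     // epoch seconds when the candidate first appeared
}

// Feed each threshold-evaluated sample through the gate; returns the published status.
function applyDwell(state: DwellState, sample: Status, nowSec: number, cfg: DwellConfig): Status {
  if (sample === state.published) {
    state.candidate = null;             // back in line with the published status: reset the clock
    return state.published;
  }
  if (state.candidate !== sample) {
    state.candidate = sample;           // a new candidate status starts its own dwell clock
    state.candidateSince = nowSec;
  }
  const required = rank[sample] > rank[state.published]
    ? cfg.dwellToBreachSec              // flipping into a worse band
    : cfg.dwellToRecoverSec;            // flipping into a better band
  if (nowSec - state.candidateSince >= required) {
    state.published = sample;           // held long enough: publish the flip
    state.candidate = null;
  }
  return state.published;
}
```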
Status sources
Three statuses come from the strict-operator rule:
- healthy
- degraded
- unhealthy
Two statuses come from collection-layer outcomes, not from values:
- no_data: the agent attempted a probe but produced no value. The reason code is recorded alongside (ECONNREFUSED, ETIMEDOUT, no_data_for_query, etc.).
- unknown: no recent push has arrived for the metric within the expected interval.
no_data and unknown do not burn SLO budget by default; they are operational signals that surface in the agent dashboard and as metric.no_data webhook events.
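Taken together, the five statuses and the reason code that accompanies no_data could be summarised as follows. This is an illustrative sketch; the type names are not part of Observer's schema.

```typescript
// Statuses produced by threshold evaluation on pushed values.
type ValueStatus = "healthy" | "degraded" | "unhealthy";

// Statuses produced by collection-layer outcomes, not by values.
type CollectionStatus = "no_data" | "unknown";

type MetricStatus = ValueStatus | CollectionStatus;

// Hypothetical shape of a no_data signal: the probe ran but returned nothing,
// and the reason code is recorded alongside.
interface NoDataSignal {
  status: "no_data";
  reason: "ECONNREFUSED" | "ETIMEDOUT" | "no_data_for_query" | string;
}
```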
Stale data tolerance
Dwell gating handles the small flaps. A separate read-time rule handles a much larger gap: what happens when no sample arrives at all, because the agent has crashed, the network between the agent and Observer Cloud is partitioned, or Observer Cloud itself is degraded.
A metric is stale when its last push timestamp is older than three times its push interval, capped at 15 minutes:
threshold = min(3 × push_interval_minutes × 60 s, 900 s)      (900 s = 15 minutes)
stale = (now - last_push_timestamp) > threshold
The 3× multiplier gives the agent one full retry-and-backoff window before a missing push is considered a problem. The 15-minute hard cap stops a slow push cadence from masking a multi-hour outage on a high-importance metric.
Staleness is computed at read time. The database always carries whatever the agent last pushed. The status-page and embed renderers make the call independently each time they load.
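The read-time check is small enough to show in full. The sketch below mirrors the formula above; the function and constant names are illustrative, not Observer's API.

```typescript
const STALE_CAP_SEC = 15 * 60;   // the 15-minute hard cap

function isStale(lastPushEpochSec: number, pushIntervalMinutes: number, nowEpochSec: number): boolean {
  // 3× the push interval gives one full retry-and-backoff window,
  // but never wait longer than the 15-minute cap.
  const thresholdSec = Math.min(3 * pushIntervalMinutes * 60, STALE_CAP_SEC);
  return nowEpochSec - lastPushEpochSec > thresholdSec;
}

// Example: a metric pushed every 10 minutes hits the cap, so it is stale
// 15 minutes after its last push rather than 30.
isStale(0, 10, 16 * 60);  // true
```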
A stale metric is excluded from the service rollup. It is not counted toward SLO burn. The metric.status_changed and metric.no_data webhooks are not fired on staleness transitions. What does fire: agent.lag_high and agent.offline, which are operator-facing events that point at the actual cause.

When every metric on a service is stale, the service renders as monitoring_delayed with a "Last known: Operational" caption alongside. See Observer availability for the full trust contract.
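A rough sketch of that read-time rollup behaviour, with hypothetical names (MetricView, rollUpService) rather than Observer's real renderer, might look like this:

```typescript
type Status = "healthy" | "degraded" | "unhealthy" | "no_data" | "unknown";

interface MetricView { status: Status; stale: boolean; }

// Stale metrics are dropped before rolling up; if every metric is stale,
// the service renders monitoring_delayed with the last known status as a caption.
function rollUpService(metrics: MetricView[], lastKnownCaption: string):
    { state: "rolled_up"; metrics: MetricView[] } | { state: "monitoring_delayed"; caption: string } {
  const fresh = metrics.filter(m => !m.stale);
  if (fresh.length === 0) {
    return { state: "monitoring_delayed", caption: `Last known: ${lastKnownCaption}` };
  }
  return { state: "rolled_up", metrics: fresh };
}
```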
Shadow mode
A metric can be marked shadowed until a future timestamp. While shadowed:
- The metric still pushes status to the cloud.
- Status pages do not include the shadowed metric in the rolled-up page status.
- Webhook events for the shadowed metric are suppressed.
- The metric's history is still recorded for later inspection.
Use shadow mode when introducing a new metric, tuning its threshold, or rolling out a new probe runtime. Once the metric behaves as expected, clear the shadow timestamp and it joins the public status surface.
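As a rough sketch of how the shadow gate might be consulted, under the assumption of a shadowedUntil timestamp field and hypothetical function names (neither is Observer's documented API):

```typescript
interface Metric {
  shadowedUntil?: string;   // ISO timestamp; absent or in the past means not shadowed
}

function isShadowed(metric: Metric, now: Date = new Date()): boolean {
  return !!metric.shadowedUntil && new Date(metric.shadowedUntil) > now;
}

// History is always recorded; page rollup and webhook delivery consult the flag first.
function maybeEmitWebhook(metric: Metric, emit: () => void): void {
  if (!isShadowed(metric)) emit();   // suppressed while shadowed
}
```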