Observer

Threshold operators

How healthy / degraded / unhealthy is decided from a metric value.

On every push, the agent derives the metric's status by applying a strict rule to the value it reported.

Rule

Each metric carries two operator-and-value pairs:

  • healthy_operation and healthy_value
  • unhealthy_operation and unhealthy_value

Operators are: over, under, equal.

The agent computes status as:

  1. If the value matches the healthy condition, status is healthy.
  2. Else, if the value matches the unhealthy condition, status is unhealthy.
  3. Otherwise, status is degraded.
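The three steps above can be sketched in Python. This is a minimal illustration, not the agent's actual implementation; the field names come from this page, but representing a metric as a plain dict is an assumption for the sketch.

```python
def matches(value: float, op: str, threshold: float) -> bool:
    """Strict comparisons: a value equal to the threshold
    never matches 'over' or 'under'."""
    if op == "over":
        return value > threshold
    if op == "under":
        return value < threshold
    if op == "equal":
        return value == threshold
    raise ValueError(f"unknown operator: {op}")

def compute_status(value: float, metric: dict) -> str:
    # 1. The healthy condition wins if it matches.
    if matches(value, metric["healthy_operation"], metric["healthy_value"]):
        return "healthy"
    # 2. Otherwise the unhealthy condition is checked.
    if matches(value, metric["unhealthy_operation"], metric["unhealthy_value"]):
        return "unhealthy"
    # 3. Neither matched: the value sits in the degraded band.
    return "degraded"

# The 5xx error ratio metric from the examples on this page:
ratio = {"healthy_operation": "under", "healthy_value": 0.005,
         "unhealthy_operation": "over", "unhealthy_value": 0.02}
compute_status(0.001, ratio)  # healthy
compute_status(0.005, ratio)  # degraded: 'under' is strict
compute_status(0.030, ratio)  # unhealthy
```

Note that order matters: the healthy condition is always tested first, so a misconfigured metric whose healthy and unhealthy bands overlap resolves to healthy.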

Strict comparisons

Operators are strict everywhere. Under over or under, a value exactly equal to the threshold does not match.

  Operator   Match condition
  over       value > threshold (not >=)
  under      value < threshold (not <=)
  equal      value == threshold

Examples

5xx error ratio

  • healthy_operation: under, healthy_value: 0.005
  • unhealthy_operation: over, unhealthy_value: 0.02

Reading: healthy under 0.5%; unhealthy over 2%; anything else is degraded. A value of exactly 0.005 is degraded (not healthy) because under is strict.

TLS certificate expiry

  • healthy_operation: over, healthy_value: 30
  • unhealthy_operation: under, unhealthy_value: 7

Reading: healthy when more than 30 days remain; unhealthy when fewer than 7 days remain; degraded in between (7 to 30 days).

Queue depth

  • healthy_operation: under, healthy_value: 100
  • unhealthy_operation: over, unhealthy_value: 1000

Reading: healthy under 100 messages; unhealthy over 1000; degraded in between.

No-data and unknown

no_data and unknown are not produced by the operator rule; they arise from the collection and delivery path:

  • no_data: the agent attempted a probe but could not produce a value (timeout, connection refused, query returned empty). The cloud records the reason code alongside the status.
  • unknown: no recent push has arrived for the metric within the expected interval.

Stale data

A metric is stale when its last push timestamp is older than three times its push interval, capped at 15 minutes. Stale and no_data look identical in the database but mean different things:

  • no_data: the agent ran the probe and the probe failed to return a value. This is a real signal about the customer's service. It counts in the SLO and surfaces in status rollups.
  • stale: the agent has not pushed anything recently. The cause is on the monitoring side (cloud outage, agent crash, network partition between agent and cloud), not the customer's side. Stale metrics are excluded from the live status rollup, do not burn SLO budget, and do not fire metric.status_changed or metric.no_data webhooks.
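The staleness rule itself is a simple timestamp comparison. A minimal sketch, assuming the timestamps and interval are available as Python datetime/timedelta values (the function and field names here are illustrative, not the product's API):

```python
from datetime import datetime, timedelta, timezone

# Threshold is three push intervals, capped at 15 minutes (per this page).
STALE_CAP = timedelta(minutes=15)

def is_stale(last_push: datetime, push_interval: timedelta,
             now: datetime) -> bool:
    threshold = min(3 * push_interval, STALE_CAP)
    return now - last_push > threshold

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# 60 s push interval -> 3 min staleness threshold:
is_stale(now - timedelta(minutes=2), timedelta(seconds=60), now)   # False
is_stale(now - timedelta(minutes=4), timedelta(seconds=60), now)   # True
# 10 min interval -> threshold capped at 15 min, not 30:
is_stale(now - timedelta(minutes=20), timedelta(minutes=10), now)  # True
```

The cap matters for slow metrics: without it, a metric pushed every 10 minutes would take half an hour to be flagged as stale.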

When every metric on a service is stale, the service rolls up to monitoring_delayed rather than unhealthy. See Observer availability for the contract that protects customer status pages from Observer's own outages.
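The rollup behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the product's rollup code: it considers only the three operator-derived statuses, takes "worst live status wins" as the aggregation rule, and uses made-up dict fields for each metric's state.

```python
# Severity order for the operator-derived statuses, least to most severe.
SEVERITY = ["healthy", "degraded", "unhealthy"]

def service_rollup(metrics: list[dict]) -> str:
    # Stale metrics are excluded from the live verdict entirely.
    live = [m for m in metrics if not m["stale"]]
    if not live:
        # Every metric is stale: the problem is on the monitoring side,
        # so the service is not marked unhealthy.
        return "monitoring_delayed"
    # Assumption for this sketch: the worst live status wins.
    return max((m["status"] for m in live), key=SEVERITY.index)

service_rollup([{"status": "healthy",  "stale": True},
                {"status": "unhealthy", "stale": True}])
# -> "monitoring_delayed"

service_rollup([{"status": "degraded",  "stale": False},
                {"status": "unhealthy", "stale": True}])
# -> "degraded": the stale unhealthy metric does not count
```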
