Threshold operators
How healthy / degraded / unhealthy is decided from a metric value.
A metric's status on every push follows a strict rule applied to the value the agent reported.
Rule
Each metric carries two operator-and-value pairs:
- healthy_operation and healthy_value
- unhealthy_operation and unhealthy_value
Operators are: over, under, equal.
The agent computes status as:
- If the value matches the healthy condition, status is healthy.
- Else, if the value matches the unhealthy condition, status is unhealthy.
- Otherwise, status is degraded.
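The rule can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual code: the Thresholds class and compute_status function are hypothetical names, and the comparisons assume the strict semantics described under Strict comparisons. The sample values come from the 5xx error ratio example.

```python
import operator
from dataclasses import dataclass

# Strict comparisons: "over" is >, "under" is <, "equal" is ==.
OPERATORS = {
    "over": operator.gt,
    "under": operator.lt,
    "equal": operator.eq,
}


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for a metric's two operator-and-value pairs."""
    healthy_operation: str
    healthy_value: float
    unhealthy_operation: str
    unhealthy_value: float


def compute_status(value: float, t: Thresholds) -> str:
    """Check healthy first, then unhealthy; anything else is degraded."""
    if OPERATORS[t.healthy_operation](value, t.healthy_value):
        return "healthy"
    if OPERATORS[t.unhealthy_operation](value, t.unhealthy_value):
        return "unhealthy"
    return "degraded"


# Values from the 5xx error ratio example.
error_ratio = Thresholds("under", 0.005, "over", 0.02)
print(compute_status(0.001, error_ratio))  # healthy
print(compute_status(0.005, error_ratio))  # degraded (under is strict)
print(compute_status(0.05, error_ratio))   # unhealthy
```

Because the healthy condition is checked first, a value that happens to satisfy both conditions resolves to healthy in this sketch.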
Strict comparisons
The over and under operators are strict everywhere. A value exactly
equal to the threshold matches neither of them; only equal matches it.
| Operator | Match condition |
|---|---|
| over | value > threshold (not >=) |
| under | value < threshold (not <=) |
| equal | value == threshold |
Examples
5xx error ratio
- healthy_operation: under, healthy_value: 0.005
- unhealthy_operation: over, unhealthy_value: 0.02
Reading: healthy under 0.5%; unhealthy over 2%; anything else is
degraded. A value of exactly 0.005 is degraded (not healthy)
because under is strict.
TLS certificate expiry
- healthy_operation: over, healthy_value: 30
- unhealthy_operation: under, unhealthy_value: 7
Reading: healthy when more than 30 days remain; unhealthy when fewer than 7 days remain; degraded from 7 to 30 days inclusive, because both comparisons are strict.
Queue depth
- healthy_operation: under, healthy_value: 100
- unhealthy_operation: over, unhealthy_value: 1000
Reading: healthy under 100 messages; unhealthy over 1000; degraded in between.
No-data and unknown
no_data and unknown are not part of the operator rule. They
arise from the agent's collection layer:
- no_data: the agent attempted a probe but could not produce a value (timeout, connection refused, query returned empty). The cloud records the reason code alongside the status.
- unknown: no recent push has arrived for the metric within the expected interval.
Stale data
A metric is stale when its last push timestamp is older than three
times its push interval, capped at 15 minutes. Stale and no_data
look identical in the database but mean different things:
- no_data: the agent ran the probe and the probe failed to return a value. This is a real signal about the customer's service. It counts in the SLO and surfaces in status rollups.
- stale: the agent has not pushed anything recently. The cause is on the monitoring side (cloud outage, agent crash, network partition between agent and cloud), not the customer's side. Stale metrics are excluded from the live status rollup, do not burn SLO budget, and do not fire metric.status_changed or metric.no_data webhooks.
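The staleness threshold is simple arithmetic: three times the push interval, but never more than 15 minutes. The sketch below illustrates it. The function names stale_after and is_stale are hypothetical, and it assumes "older than" means strictly greater than the threshold.

```python
from datetime import datetime, timedelta, timezone

STALE_MULTIPLIER = 3
STALE_CAP = timedelta(minutes=15)


def stale_after(push_interval: timedelta) -> timedelta:
    """Three times the push interval, capped at 15 minutes."""
    return min(STALE_MULTIPLIER * push_interval, STALE_CAP)


def is_stale(last_push: datetime, push_interval: timedelta, now: datetime) -> bool:
    """True when the last push is older than the staleness threshold.

    Assumption: the comparison is strict, so a push exactly at the
    threshold age is not yet stale.
    """
    return now - last_push > stale_after(push_interval)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(stale_after(timedelta(minutes=1)))   # 0:03:00
print(stale_after(timedelta(minutes=10)))  # 0:15:00 (cap applies)
print(is_stale(now - timedelta(minutes=4), timedelta(minutes=1), now))  # True
```

With a 1-minute push interval a metric goes stale after 3 minutes. With a 10-minute interval the cap takes over, so the threshold is 15 minutes rather than 30.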
When every metric on a service is stale, the service rolls up to
monitoring_delayed rather than unhealthy. See
Observer availability for the
contract that protects customer status pages from Observer's own
outages.
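The stale-aware rollup might look like the following sketch. Only two behaviors come from this page: stale metrics are excluded from the rollup, and a service whose metrics are all stale reports monitoring_delayed. Everything else is an assumption made for illustration: the MetricState class, the service_status function, the worst-status-wins ordering for live metrics, and the decision to handle only healthy, degraded, and unhealthy.

```python
from dataclasses import dataclass

# Assumed severity order for live metrics; this page does not define it.
# no_data and unknown are deliberately left out of this sketch.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


@dataclass(frozen=True)
class MetricState:
    """Hypothetical view of one metric: its last status and staleness."""
    status: str
    stale: bool


def service_status(metrics: list[MetricState]) -> str:
    if not metrics:
        raise ValueError("service has no metrics")
    # Stale metrics are excluded from the live rollup.
    live = [m for m in metrics if not m.stale]
    if not live:
        # Every metric is stale: the problem is on the monitoring side.
        return "monitoring_delayed"
    # Assumption: the worst live status wins.
    return max((m.status for m in live), key=SEVERITY.__getitem__)


print(service_status([MetricState("unhealthy", True),
                      MetricState("healthy", True)]))    # monitoring_delayed
print(service_status([MetricState("unhealthy", True),
                      MetricState("degraded", False)]))  # degraded
```

In the second call the stale unhealthy metric is ignored, so the last value it reported before going silent does not drag the service down.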