Observer availability
What happens when Observer Cloud is degraded, when the agent stops pushing, and why your customers will not see a red status page because of our outage.
A status page exists to be honest with your customers. If the status page itself becomes a source of misinformation when the monitoring infrastructure has a bad day, it is worse than no status page at all.
This document is the contract for what Observer does when Observer itself is degraded, when the agent in your network goes silent, or when the link between them is broken.
The two failure modes that look identical
A metric ends up without a recent value for two reasons. They look identical in the database; they mean very different things.
- The probe ran and got nothing. The agent reached out to your Prometheus, your HTTP endpoint, your TLS certificate, and could not produce a value. The query returned empty, the connection was refused, the TLS handshake failed. This is a real signal about your service. It is no_data with a reason code.
- The agent has not pushed anything recently. The agent crashed. The host running it lost the network. Observer Cloud could not accept the push. This is a signal about your monitoring infrastructure or ours, not your service. It is stale.
We treat these two cases differently.
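To make the distinction concrete, here is a rough sketch of how the two cases could be represented. The field names, metric name, and reason codes are illustrative only, not Observer's actual schema.

```python
# Illustrative shapes only; not Observer's actual wire format.

# Case 1: the probe ran and produced no value. The agent still pushes a sample,
# carrying an explicit no_data marker and a reason code.
no_data_sample = {
    "metric": "checkout-api.http_check",   # hypothetical metric name
    "pushed_at": "2025-01-07T12:00:05Z",
    "value": None,
    "status": "no_data",
    "reason": "connection_refused",        # or empty_query_result, tls_handshake_failed, ...
}

# Case 2: the agent has gone silent. Nothing is pushed at all. There is no row
# to inspect; "stale" is derived at read time from the age of the last push.
```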
What "stale" means
A metric is stale when its last push timestamp is older than three times its push interval, capped at 15 minutes. The 3× allows for the agent's normal retry-and-backoff window before declaring a problem. The 15-minute cap prevents a slow push cadence from masking a multi-hour outage.
Staleness is computed at read time. Nothing about the metric's stored row changes; the database still carries whatever the agent last pushed. The status page, the embed widget, and the SLO calculator each apply the rule independently when they load.
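As a minimal sketch of that read-time rule (the names here are chosen for illustration, not taken from Observer's code):

```python
from datetime import datetime, timedelta, timezone

STALENESS_CAP = timedelta(minutes=15)

def is_stale(last_push_at: datetime, push_interval: timedelta,
             now: datetime | None = None) -> bool:
    """Stale when the last push is older than 3x the push interval, capped at 15 minutes."""
    now = now or datetime.now(timezone.utc)
    threshold = min(3 * push_interval, STALENESS_CAP)
    return now - last_push_at > threshold
```

Under this rule a metric pushed every 30 seconds reads as stale after 90 seconds of silence, while a metric pushed every 10 minutes reads as stale after 15 minutes rather than 30, because of the cap.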
What happens when your agent stops pushing
Every metric driven by that agent becomes stale within minutes.
Your customer-facing status page does not flip to red. Each stale metric is excluded from the live rollup. If only some metrics on a service are stale, the service rollup uses the fresh metrics and shows a small "X of N metrics delayed" caption. If every metric on a service is stale, the service renders as Monitoring delayed, with a muted "Last known: Operational" (or whatever it was) pill alongside.
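A sketch of that rollup logic, again with invented names rather than Observer's internals:

```python
SEVERITY = {"operational": 0, "degraded": 1, "down": 2}   # illustrative status values

def service_rollup(metrics):
    """metrics: list of dicts with 'stale' (bool) and 'status' (str) for one service."""
    fresh = [m["status"] for m in metrics if not m["stale"]]
    delayed = len(metrics) - len(fresh)

    if not fresh:
        # Every metric is stale: render "Monitoring delayed" plus the muted
        # last-known pill; the service does not flip to red.
        return {"state": "monitoring_delayed", "caption": None}

    # Normal rollup over the fresh metrics only; stale ones are excluded.
    worst = max(fresh, key=lambda s: SEVERITY[s])
    caption = f"{delayed} of {len(metrics)} metrics delayed" if delayed else None
    return {"state": worst, "caption": caption}
```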
Your SLOs do not burn during the stale window. Observer's SLO calculation counts samples, not wall-clock time, so a quiet agent contributes zero to both the numerator (good samples) and the denominator (total samples). The error budget freezes in place until the agent resumes pushing.
We do not fire metric.status_changed or metric.no_data webhooks on staleness transitions. We do fire agent.lag_high and agent.offline; those are operator-facing and tell you the actual cause.
What happens when Observer Cloud is degraded
The agent's local SQLite buffer holds metric pushes for up to 24 hours of normal traffic. When the cloud receiver recovers, the agent drains the buffer in order. Status pages catch up to the real customer state as the backlog clears.
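The drain step can be pictured with a small sketch; the table name and the push callable are invented for illustration, not the agent's real schema:

```python
import sqlite3

def drain_buffer(db_path: str, push_to_cloud) -> None:
    """Replay buffered pushes in the order they were recorded, removing each
    row only once the cloud receiver has accepted it."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, payload FROM push_buffer ORDER BY id"   # hypothetical table
        ).fetchall()
        for row_id, payload in rows:
            push_to_cloud(payload)   # raises if the receiver is still unavailable
            conn.execute("DELETE FROM push_buffer WHERE id = ?", (row_id,))
            conn.commit()
    finally:
        conn.close()
```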
While the cloud is degraded:
- Status pages continue to serve whatever last-known status was cached by their last successful render. Most pages tolerate a full read-side outage for several minutes before any user-visible effect.
- The same staleness rule applies: as time without a fresh sample exceeds the threshold, services on the page roll up to Monitoring delayed rather than flipping to red.
- No false-positive webhooks fire. The operator-facing agent.lag_high event tracks the cloud-side outage from each agent's vantage point.
Will my SLO burn during these gaps?
No.
Observer's SLO computation is sample-counting, not time-counting. The error budget remaining is (good_samples - target × total_samples) / ((1 - target) × total_samples) × 100. When the agent is silent, no samples are added to either side of that fraction, so the budget remains exactly where it was when the last push arrived.
When the agent resumes, the new samples land in the same window and contribute on their own merits. A short gap during which your service was actually unhealthy does not get retroactively counted as a healthy window — there is just no data for it.
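A worked example, with invented numbers, shows the freeze-and-resume behaviour:

```python
def budget_remaining(good_samples: int, total_samples: int, target: float) -> float:
    """Error budget remaining (percent), per the formula above."""
    return (good_samples - target * total_samples) / ((1 - target) * total_samples) * 100

# 10,000 samples against a 99.9% target, 9,995 of them good:
budget_remaining(9_995, 10_000, 0.999)               # ~50% of the budget left

# The agent is silent for an hour: no new samples, so the fraction is unchanged.
budget_remaining(9_995, 10_000, 0.999)               # still ~50%

# Pushes resume and the next 600 samples include 3 bad ones:
budget_remaining(9_995 + 597, 10_000 + 600, 0.999)   # ~24.5%; the bad samples burn budget on their own merits
```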
This trade-off is deliberate: Observer is explicit about absence. When we cannot say, we say so. We do not infer healthy minutes between samples, and we do not infer unhealthy minutes either.
How is this different from a status page that just lies?
Some hosted status page platforms will hold a service at "All Systems Operational" indefinitely as long as no one publishes an incident. They are silent on the question of whether their monitoring is still working.
Observer is loud about it. Pages render Monitoring delayed when we cannot confirm health. The last-known status sits alongside in a muted pill so your customer can see the trajectory rather than a blank red square. We would rather say "we don't currently know" than guess.
Layered fallbacks (planned)
Two additional layers are scheduled post-launch and not part of the current contract. They harden the public read path against a full Observer Cloud outage:
- Edge-cached status pages. A Cloudflare edge cache with stale-while-revalidate semantics serves the last-rendered HTML during a cloud outage. Customers see the same page they would have seen a minute earlier, with a small "served from cache" tag.
- Independently-hosted static fallback. A Cloudflare Worker with status snapshots in KV serves a minimal page even if the origin is fully unreachable. Same staleness rules apply.
These are scheduled work, not promises. The current contract above is what you get today.