Observer

SLOs and error budgets

How service level objectives translate metric status into a contractual signal.

A Service Level Objective (SLO) is a commitment that a metric will remain healthy for a defined fraction of a rolling window. SLOs turn the binary "is this healthy right now" question into a running balance: the error budget, which is the remaining allowance of unhealthy time.

Definition

An SLO has three core fields:

  • Metric: which metric the SLO observes.
  • Target percentage: the fraction of the window the metric must be healthy. Common values: 99, 99.5, 99.9, 99.95, 99.99.
  • Window in days: the rolling period the target applies to. Common values: 7, 30, 90.

The window is rolling: at any instant, the SLO looks back N days and computes the fraction of that time the metric was healthy. There is no calendar boundary that resets the budget.
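
The look-back computation can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name healthy_fraction and the representation of unhealthy time as a list of (start, end) pairs are assumptions made for the example. It also assumes the intervals do not overlap, and that any time not marked unhealthy counts toward the target.

```python
from datetime import datetime, timedelta

def healthy_fraction(unhealthy_intervals, now, window_days):
    """Fraction of the trailing window in which the metric was not unhealthy.

    unhealthy_intervals: iterable of (start, end) datetimes (hypothetical
    input shape, assumed non-overlapping).
    """
    window = timedelta(days=window_days)
    window_start = now - window
    unhealthy = timedelta(0)
    for start, end in unhealthy_intervals:
        # Clip each interval to the window so older time is ignored.
        clipped_start = max(start, window_start)
        clipped_end = min(end, now)
        if clipped_end > clipped_start:
            unhealthy += clipped_end - clipped_start
    # Dividing one timedelta by another yields a float.
    return 1 - unhealthy / window
```

Because window_start is derived from now on every call, an old unhealthy interval gradually slides out of the window rather than disappearing at a calendar boundary.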

Error budget

Given a 99.9% target over 30 days, the budget allowance is:

allowance = 30 days * (1 - 99.9 / 100)
          = 30 days * 0.001
          = 43,200 minutes * 0.001
          = 43.2 minutes per 30-day window

The budget burns whenever the metric is in the unhealthy state. It does not burn for degraded, no_data, or unknown (the threshold operators reference covers each).
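
The same arithmetic, together with the rule that only unhealthy time burns budget, might look like this in code. The function names and the (status, minutes) sample shape are hypothetical and chosen only for illustration.

```python
# Only this state consumes budget; degraded, no_data, and unknown do not.
BURNING_STATES = {"unhealthy"}

def budget_allowance_minutes(target_pct, window_days):
    """Total unhealthy minutes the target permits over the window."""
    return window_days * 24 * 60 * (1 - target_pct / 100)

def burned_minutes(samples):
    """Sum the minutes spent in a burning state.

    samples: iterable of (status, minutes) pairs (hypothetical shape).
    """
    return sum(minutes for status, minutes in samples if status in BURNING_STATES)

def budget_remaining_pct(target_pct, window_days, unhealthy_minutes):
    """Percent of the budget left; negative once the budget is exhausted."""
    allowance = budget_allowance_minutes(target_pct, window_days)
    return 100 * (1 - unhealthy_minutes / allowance)
```

For a 99.9% target over 30 days, 10.8 unhealthy minutes consume a quarter of the 43.2-minute allowance, leaving roughly 75% of the budget.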

Burn events

A burn event opens when the metric flips to unhealthy and the SLO's remaining budget drops below 100%. It closes when the metric returns to healthy. Each burn event records its start time, its end time, and the percentage of the budget it consumed.

Webhook subscribers receive slo.burn_started when an event opens and slo.burn_resolved when it closes. Pair the two by their burn_event_id.
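
A subscriber could pair the two events along these lines. The event names and burn_event_id come from the description above; the payload shape (a dict with a "type" key) and the function name are assumptions for the sketch, so check the actual webhook payload before relying on them. The sketch tolerates out-of-order delivery by indexing on burn_event_id rather than on arrival order.

```python
def pair_burn_events(events):
    """Group webhook events by burn_event_id.

    events: iterable of dicts with "type" and "burn_event_id" keys
    (hypothetical payload shape).
    Returns (paired, still_open):
      paired      maps id -> {"started": event, "resolved": event}
      still_open  maps id -> the started event that has no resolution yet
    """
    by_id = {}
    for event in events:
        if event["type"] == "slo.burn_started":
            by_id.setdefault(event["burn_event_id"], {})["started"] = event
        elif event["type"] == "slo.burn_resolved":
            by_id.setdefault(event["burn_event_id"], {})["resolved"] = event
        # Any other event type is ignored.
    paired = {
        event_id: slot
        for event_id, slot in by_id.items()
        if "started" in slot and "resolved" in slot
    }
    still_open = {
        event_id: slot["started"]
        for event_id, slot in by_id.items()
        if "started" in slot and "resolved" not in slot
    }
    return paired, still_open
```

An entry left in still_open means a burn is in progress, or its resolution has not been delivered yet.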

Picking a target

The right SLO target reflects the availability the system actually achieved over the prior 90 days, adjusted by a margin for the behaviour you want to drive. Three common starting points:

  • 99.5% for a new service or unknown baseline. Loose enough that noise does not drive false alerts.
  • 99.9% for a service with a stable history and a reasonable remediation pipeline.
  • 99.99% for systems where customers feel every minute of unhealthy time. Requires investment in error-handling and rapid remediation; otherwise the target produces churn rather than signal.
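
One way to ground the choice is to compute the availability achieved over the look-back period and see which common targets it would have met. The sketch below assumes you already have the total unhealthy minutes for the prior 90 days; the function names are hypothetical, and the target list is taken from the common values listed under Definition.

```python
COMMON_TARGETS = [99.0, 99.5, 99.9, 99.95, 99.99]

def achieved_availability_pct(unhealthy_minutes, lookback_days=90):
    """Percent of the look-back period that was not unhealthy."""
    total_minutes = lookback_days * 24 * 60
    return 100 * (1 - unhealthy_minutes / total_minutes)

def targets_met(unhealthy_minutes, lookback_days=90):
    """Common targets the historical availability would have satisfied."""
    achieved = achieved_availability_pct(unhealthy_minutes, lookback_days)
    return [target for target in COMMON_TARGETS if achieved >= target]
```

A service with 100 unhealthy minutes in the last 90 days achieved about 99.92%, so it would have met 99.9% but not 99.95%. Whether to commit to the tighter or the looser of those is the margin decision described above.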

Per-customer targets

Different customers can sign different SLO targets against the same underlying metric. The model and configuration steps live in Customer scopes.
