# Observer Documentation (full content)
# Source: https://docs.use.observer
# Generated: 2026-05-14T06:26:59.026Z

---
url: https://docs.use.observer/docs/concepts/architecture
title: How Observer works
description: A high-level view of the agent, the cloud, and how they exchange data.
---

Observer has two parts:

- **Observer Agent**, a small process you run inside your network. It reads metrics from your existing observability stack (Prometheus, HTTP endpoints, TCP services, DNS, TLS certificates) and computes the status verdict locally.
- **Observer Cloud**, the control plane. It receives status pushes from agents, runs SLO evaluation, persists data, and renders status pages, dashboards, and the API.

```text
Observer Agent (your network)
  probes: Prometheus, HTTP, TCP, DNS, TLS
     │
     │  status push: metric_id, value, status, timestamp
     │  heartbeat / log
     ▼
Observer Cloud (control plane)
  SLO evaluation · status page · webhooks · API and audit
     │
     ▼
Customers
  status pages, API
```
status pages, API"] A -->|"status push: metric_id, value, status, timestamp"| C A -->|"heartbeat / log"| C C --> V`} /> ## What crosses the network Only the precomputed verdict crosses the boundary from your network to Observer Cloud. The push payload is: ```json { "metric_id": "", "value": , "status": "", "timestamp": "" } ``` Raw query strings (PromQL, HTTP request bodies, DNS resolver responses) do not leave the agent. The cloud has no path back into your network; it cannot pull from your Prometheus or hit your endpoints directly. ## What runs where | Concern | Location | | ------------------ | ------------------------------------------ | | Metric collection | Agent, in your network | | Status verdict | Agent, computed against the threshold rule | | SLO evaluation | Cloud, against pushed status | | Status page render | Cloud | | Webhook delivery | Cloud | | Audit log | Cloud | | Public API | Cloud | Observer Cloud is a closed-source SaaS. The Observer Agent is open source: source at [github.com/useobserver/agent](https://github.com/useobserver/agent), Apache-2.0 licensed. ## Operational implications - The agent must run in a network segment that can reach your metric sources. The cloud cannot reach them on your behalf. - The agent's own health is reported back to the cloud through heartbeats. The cloud surfaces a stalled agent as `agent.offline`. - The agent is stateless with respect to historical data: lost agents do not lose history, because all status pushes are persisted in the cloud. --- url: https://docs.use.observer/docs/concepts/metrics-vs-pings title: Why metrics, not pings description: The case for metric-based status over availability pings. --- Most status page tools assert availability with periodic pings: a GET request every 60 seconds against a public endpoint, with a green check when the response code is 2xx. Observer's default is to compute status from metrics you already collect, with pings as one source among many. The reasoning: ## Pings only see the public envelope A ping confirms a single endpoint accepted a single request at a single moment. It does not see: - The error rate served to actual customers in the last five minutes. - The 95th percentile latency under real load. - The depth of an internal queue draining slower than its inflow. - A degraded backend that has been masked by retries upstream. A page that reads green from pings while customers are filing support tickets is the standard failure mode of ping-based status. ## Metrics see the actual signal Observer's primary data source is your own metrics: Prometheus queries, HTTP probes that include body checks, TCP connection times, DNS resolution times, TLS certificate expiry. The status the public page shows is computed from the same numbers your on-call team already trusts on the internal Grafana dashboard. The result: when customers see red, the on-call's dashboard shows the same red, with the same threshold semantics. There is no gap. ## Pings still have a place For systems that do not emit metrics (third-party APIs, public DNS, certificates issued by external CAs), the agent supports HTTP, TCP, DNS, and TLS-cert probes directly. These produce a metric in the same shape as a Prometheus query: a numeric value with a timestamp, evaluated against thresholds. - Internal API health: Prometheus query against your own metrics. - Third-party API reachability: HTTP probe. - Certificate expiry: TLS-cert probe. - DNS health: DNS probe. Mix as many as the system needs. 
Each becomes a metric on the status page with its own threshold band. --- url: https://docs.use.observer/docs/concepts/slos-and-error-budgets title: SLOs and error budgets description: How service level objectives translate metric status into a contractual signal. --- A Service Level Objective (SLO) is a commitment that a metric will remain healthy for a defined fraction of a rolling window. SLOs turn the binary "is this healthy right now" question into a running balance: the **error budget**, which is the remaining allowance of unhealthy time. ## Definition An SLO has three core fields: - **Metric**: which metric the SLO observes. - **Target percentage**: the fraction of the window the metric must be `healthy`. Common values: 99, 99.5, 99.9, 99.95, 99.99. - **Window in days**: the rolling period the target applies to. Common values: 7, 30, 90. The window is rolling: at any instant, the SLO looks back N days and computes the fraction of that time the metric was healthy. There is no calendar boundary that resets the budget. ## Error budget Given a 99.9% target over 30 days, the budget allowance is: ```text allowance = 30 days * (1 - 99.9 / 100) = 30 days * 0.001 = 43.2 minutes per 30-day window ``` The budget burns whenever the metric is in the `unhealthy` state. It does not burn for `degraded`, `no_data`, or `unknown` (the [threshold operators reference](/docs/reference/threshold-operators) covers each). ## Burn events A burn event opens when the metric flips to `unhealthy` and the SLO drops below 100% remaining. It closes when the metric returns to healthy. Each burn event records its start, end, and the percent of the budget it consumed. Webhook subscribers receive `slo.burn_started` when an event opens and `slo.burn_resolved` when it closes. Pair the two by their `burn_event_id`. ## Picking a target The right SLO target reflects the system's actual achieved availability over the prior 90 days, plus a margin for the behaviour you want to drive. Three common starting points: - **99.5%** for a new service or unknown baseline. Loose enough that noise does not drive false alerts. - **99.9%** for a service with a stable history and a reasonable remediation pipeline. - **99.99%** for systems where customers feel every minute of unhealthy time. Requires investment in error-handling and rapid remediation; otherwise the target produces churn rather than signal. A target tighter than the system's achieved availability burns budget on noise and trains the on-call team to ignore alerts. Start at the 90-day baseline and only tighten as the underlying system improves. ## Per-customer targets Different customers can sign different SLO targets against the same underlying metric. The model and configuration steps live in [Customer scopes](/docs/concepts/customer-scopes). --- url: https://docs.use.observer/docs/concepts/customer-scopes title: Customer scopes description: Per-customer status pages, JWT-verified, with per-customer SLO targets. --- A customer-scoped page is one underlying status page that renders differently per customer. Each customer sees a filtered subset of metrics and SLOs, with optional per-customer thresholds applied at render time. The same page can therefore serve a `99.99%` agreement with one customer and a `99%` agreement with another, without duplicating the underlying metric work. ## Why customer scopes exist Enterprise contracts vary. The same backend that an SMB customer signs at `99.5%` may carry a `99.99%` clause for an enterprise customer with a higher-priced contract. 
Two implementation paths exist: 1. Duplicate the metric definition per customer, with different thresholds. 2. Define the metric once and apply per-customer thresholds at render time. Path 2 keeps a single source of truth for collection and evaluation. Customer scopes implement path 2. ## Identity model Customer scopes use JWT-based identity. The page's access mode is set to `customer_scoped`. Observer Cloud verifies tokens against the public key (or JWKS endpoint) configured on the page, then reads a configurable claim (typically `sub`, `customer_id`, or a custom claim) to determine which customer is viewing. A customer must be: 1. Defined in the organisation's customer list. 2. Bound to the page through the page's customer-binding list. A token whose claim does not resolve to a bound customer returns 403, even when the token's signature is valid. ## SLO overrides Each customer can carry per-SLO target overrides. When the customer-scoped page renders for that customer, the SLO strip uses the override target. Customers without an override see the default target. ```text SLO: checkout-api availability default target: 99.9% Customer A: no override. renders at 99.9% Customer B: override at 99.99%. renders at 99.99% Customer C: override at 99%. renders at 99% ``` The underlying metric and the burn evaluator remain unchanged. The only differences are the threshold the page renders and the per-customer error budget displayed. Customer scopes do not change the contractual obligation; they reflect it. The contract is what your legal team and the customer signed. Observer's customer scopes ensure each customer's view of the system mirrors the agreement they read. ## Configuration See [Configure customer-scoped pages](/docs/guides/customer-scoped-pages) for the step-by-step setup. --- url: https://docs.use.observer/docs/concepts/thresholds-and-dwell title: Thresholds and dwell description: How a metric's status is decided, and how dwell gating prevents flapping. --- A metric's status flips through three layers in order: 1. **Threshold evaluation**: the strict-operator rule applied to each pushed value (see [threshold operators](/docs/reference/threshold-operators)). 2. **Dwell gating**: a status only flips after holding the new state for the configured dwell period. 3. **Shadow mode (optional)**: a metric can be marked as shadowed for a window so it does not affect status pages or fire webhook events while operators tune it. This page covers steps 2 and 3. ## Why dwell exists A naive implementation would publish every status change the agent reports. In practice, metrics flap. A network blip pushes one bad sample, the next sample is fine, the on-call gets paged twice per minute. Dwell gating requires the new status to hold for a minimum duration before propagating. Configure two values per metric: - **Dwell to breach**: how long the metric must report the new status before flipping into a worse band (healthy → degraded or healthy → unhealthy). - **Dwell to recover**: how long the metric must report the new status before flipping into a better band (unhealthy → healthy). The defaults shipped with the create form are conservative: 60 seconds to breach, 300 seconds to recover. Asymmetric values ("quick to flag, slow to recover") are appropriate for systems where premature recovery announcements have a higher cost than a delayed unhealthy alert. 
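As a sketch, the gate can be thought of as a small state machine: a candidate status must hold continuously for its dwell period before it replaces the published status. The class and field names below are illustrative, not the agent's actual code (the agent is open source if you want the real thing):

```ts
type Status = "healthy" | "degraded" | "unhealthy";

const rank: Record<Status, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

// Illustrative dwell gate: a new status must hold for the configured dwell
// before it propagates. Defaults mirror the create form: 60s to breach,
// 300s to recover.
class DwellGate {
  private published: Status = "healthy";
  private candidate: Status | null = null;
  private candidateSince = 0;

  constructor(
    private dwellToBreachMs = 60_000,
    private dwellToRecoverMs = 300_000,
  ) {}

  observe(status: Status, nowMs: number): Status {
    if (status === this.published) {
      this.candidate = null; // a flap back to the published state resets the clock
      return this.published;
    }
    if (status !== this.candidate) {
      this.candidate = status; // new candidate: start its dwell clock
      this.candidateSince = nowMs;
    }
    const movingWorse = rank[status] > rank[this.published];
    const dwellMs = movingWorse ? this.dwellToBreachMs : this.dwellToRecoverMs;
    if (nowMs - this.candidateSince >= dwellMs) {
      this.published = status; // held long enough: commit the flip
      this.candidate = null;
    }
    return this.published;
  }
}
```

A single bad sample therefore changes nothing: its candidate clock starts, the next healthy sample resets it, and the published status never moves.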
## Status sources Three statuses come from the strict-operator rule: - `healthy` - `degraded` - `unhealthy` Two statuses come from collection-layer outcomes, not from values: - `no_data`: the agent attempted a probe but produced no value. The reason code is recorded alongside (`ECONNREFUSED`, `ETIMEDOUT`, `no_data_for_query`, etc.). - `unknown`: no recent push has arrived for the metric within the expected interval. `no_data` and `unknown` do not burn SLO budget by default; they are operational signals that surface in the agent dashboard and as `metric.no_data` webhook events. ## Stale data tolerance Dwell gating handles the small flaps. A separate read-time rule handles a much larger gap: what happens when no sample arrives at all, because the agent is crashed, the network between the agent and Observer Cloud is partitioned, or Observer Cloud itself is degraded. A metric is **stale** when its last push timestamp is older than three times its push interval, capped at 15 minutes: ``` threshold = min(3 × push_interval_minutes × 60s, 15 minutes) stale = (now - last_push_timestamp) > threshold ``` The 3× multiplier gives the agent one full retry-and-backoff window before a missing push is considered a problem. The 15-minute hard cap stops a slow push cadence from masking a multi-hour outage on a high-importance metric. Staleness is computed at read time. The database always carries whatever the agent last pushed. The status-page and embed renderers make the call independently each time they load. A stale metric is **excluded** from the service rollup. It is **not** counted toward SLO burn. The `metric.status_changed` and `metric.no_data` webhooks are **not** fired on staleness transitions. What does fire: `agent.lag_high` and `agent.offline`, which speak to the actual cause — they are operator-facing. When every metric on a service is stale, the service renders as `monitoring_delayed` with a "Last known: Operational" caption alongside. See [Observer availability](/docs/concepts/observer-availability) for the full trust contract. ## Shadow mode A metric can be marked shadowed until a future timestamp. While shadowed: - The metric still pushes status to the cloud. - Status pages do not consume the shadowed metric in the rolled-up page status. - Webhook events for the shadowed metric are suppressed. - The metric's history is still recorded for later inspection. Use shadow mode when introducing a new metric, tuning its threshold, or rolling out a new probe runtime. Once the metric behaves as expected, clear the shadow timestamp and it joins the public status surface. Threshold evaluation is strict in both the agent and in the read path that renders status pages. The same value cannot flip status depending on which surface read it. See the [threshold operators reference](/docs/reference/threshold-operators) for examples. --- url: https://docs.use.observer/docs/concepts/observer-availability title: Observer availability description: What happens when Observer Cloud is degraded, when the agent stops pushing, and why your customers will not see a red status page because of our outage. --- A status page exists to be honest with your customers. If the status page itself becomes a source of misinformation when the monitoring infrastructure has a bad day, it is worse than no status page at all. This document is the contract for what Observer does when Observer itself is degraded, when the agent in your network goes silent, or when the link between them is broken. 
## The two failure modes that look identical A metric ends up without a recent value for two reasons. They look identical in the database; they mean very different things. - **The probe ran and got nothing.** The agent reached out to your Prometheus, your HTTP endpoint, your TLS certificate, and could not produce a value. The query returned empty, the connection was refused, the TLS handshake failed. This is a real signal about your service. It is `no_data` with a reason code. - **The agent has not pushed anything recently.** The agent crashed. The host running it lost the network. Observer Cloud could not accept the push. This is a signal about our or your monitoring infrastructure, not your service. It is `stale`. We treat these two cases differently. ## What "stale" means A metric is stale when its last push timestamp is older than three times its push interval, capped at 15 minutes. The 3× allows for the agent's normal retry-and-backoff window before declaring a problem. The 15-minute cap prevents a slow push cadence from masking a multi-hour outage. Staleness is computed at read time. Nothing about the metric's stored row changes; the database still carries whatever the agent last pushed. The status page, the embed widget, and the SLO calculator each apply the rule independently when they load. ## What happens when your agent stops pushing Every metric driven by that agent becomes stale within minutes. Your customer-facing status page does **not** flip to red. Each stale metric is excluded from the live rollup. If only some metrics on a service are stale, the service rollup uses the fresh metrics and shows a small "X of N metrics delayed" caption. If every metric on a service is stale, the service renders as **Monitoring delayed**, with a muted "Last known: Operational" (or whatever it was) pill alongside. Your SLOs do **not** burn during the stale window. Observer's SLO calculation counts samples, not wall-clock time, so a quiet agent contributes zero to both the numerator (good samples) and the denominator (total samples). The error budget freezes in place until the agent resumes pushing. We do not fire `metric.status_changed` or `metric.no_data` webhooks on staleness transitions. We do fire `agent.lag_high` and `agent.offline` — those are operator-facing and tell you the actual cause. ## What happens when Observer Cloud is degraded The agent's local SQLite buffer holds metric pushes for up to 24 hours of normal traffic. When the cloud receiver recovers, the agent drains the buffer in order. Status pages catch up to the real customer state as the backlog clears. While the cloud is degraded: - Status pages continue to serve whatever last-known status was cached by their last successful render. Most pages tolerate a full read-side outage for several minutes before any user-visible effect. - The same staleness rule applies: as time without a fresh sample exceeds the threshold, services on the page roll up to **Monitoring delayed** rather than flipping to red. - No false-positive webhooks fire. The operator-facing `agent.lag_high` event tracks the cloud-side outage from each agent's vantage point. ## Will my SLO burn during these gaps? No. Observer's SLO computation is sample-counting, not time-counting. The error budget is `(good_samples - target × total_samples) / ((1 - target) × total_samples) × 100`. When the agent is silent, no samples are added to either side of that fraction, so the budget remains exactly where it was when the last push arrived. 
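The freeze property falls straight out of that formula. A sketch of the arithmetic (not Observer's implementation):

```ts
// Remaining error budget as a percentage, per the sample-counting formula above.
// target is a fraction, e.g. 0.999 for a 99.9% SLO.
function budgetRemainingPct(goodSamples: number, totalSamples: number, target: number): number {
  if (totalSamples === 0) return 100; // no samples yet: budget untouched
  return ((goodSamples - target * totalSamples) / ((1 - target) * totalSamples)) * 100;
}

// 30 days of 1-minute samples = 43,200 samples; a 99.9% target allows 43.2 bad ones.
budgetRemainingPct(43_200, 43_200, 0.999); // 100: fully healthy
budgetRemainingPct(43_178, 43_200, 0.999); // about 49: roughly half the budget gone
// A silent agent adds to neither argument, so the result does not move.
```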
When the agent resumes, the new samples land in the same window and contribute on their own merits. A short gap during which your service was actually unhealthy does not get retroactively counted as a healthy window — there is just no data for it. This trade-off has a name. Observer is **explicit about absence**: when we cannot say, we say so. We do not infer healthy minutes between samples and we do not infer unhealthy minutes either. ## How is this different from a status page that just lies? Some hosted status page platforms will hold a service at "All Systems Operational" indefinitely as long as no one publishes an incident. They are silent on the question of whether their monitoring is still working. Observer is loud about it. Pages render **Monitoring delayed** when we cannot confirm health. The last-known status sits alongside in a muted pill so your customer can see the trajectory rather than a blank red square. We would rather say "we don't currently know" than guess. The following are part of the published contract. We hold ourselves to them in production: - A stale metric is excluded from the service rollup, full stop. - A stale metric does not burn SLO budget. - Staleness transitions do not fire customer-facing webhooks. - The "Monitoring delayed" rollup carries the last-known status. - The push-interval policy is `>= 10 minutes` on Free/Starter and `>= 5 minutes` on Pro and Enterprise. Tightening below those bounds is gated by the plan validator. ## Layered fallbacks (planned) Two additional layers are scheduled post-launch and not part of the current contract. They harden the public read path against a full Observer Cloud outage: - **Edge-cached status pages.** A Cloudflare edge cache with stale-while-revalidate semantics serves the last-rendered HTML during a cloud outage. Customers see the same page they would have seen a minute earlier, with a small "served from cache" tag. - **Independently-hosted static fallback.** A Cloudflare Worker with status snapshots in KV serves a minimal page even if the origin is fully unreachable. Same staleness rules apply. These are scheduled work, not promises. The current contract above is what you get today. --- url: https://docs.use.observer/docs/concepts/incidents-and-metrics title: Incidents and metrics description: How customer-facing incidents relate to metric-driven status. --- Observer's status model has two layers: 1. **Metrics drive status by default.** A metric flips to `unhealthy` when its measured value crosses the threshold; the page status rolls up from the worst metric. No human action required. 2. **Incidents are the customer comm layer on top.** An incident is what the operator publishes to explain context — what is broken, what we know, what we are doing about it. Both layers can fire independently, and they often do. ## Why two layers A metric flip is an automated signal. The threshold breach happened at 14:32:18 because the agent reported 4.2% errors and the unhealthy rule said `over 2%`. That is precise, but it is not customer communication. Customers want to know: - Are you aware? - What is the impact? - When will it be fixed? - How will I know it is fixed? Those are operator-authored sentences. The metric flip cannot answer them on its own. ## How they relate at runtime The page status that customers see is **only** driven by metrics. Posting an incident does not change page status; resolving an incident does not change page status. Status is the measured truth; incidents are the human commentary. 
The exception is **manual metrics** (see [Manual metrics](/docs/concepts/manual-metrics)): when an open incident lists a service, the manual metrics on that service auto-set their status to mirror the incident severity. This is the case where incidents drive status — by design — because manual metrics have no probe to measure them.

## The "draft from metric" flow

When a metric flips unhealthy, the metric edit page surfaces a **Draft incident** CTA. One click pre-fills:

- Title: `Investigating: <metric name>`
- Severity: `major` if the metric is unhealthy, `minor` if degraded
- Affected services: every service that has an SLO bound to this metric
- Initial message: `Investigating <metric name>. Current status: <status>`

The operator reviews, edits if needed, and publishes. The "metric flipped → I need to update status" loop drops from minutes to one click.

The CTA is idempotent within 30 minutes: a second click on the same metric in the same window surfaces the existing draft instead of creating a duplicate.

## Auto-drafts (opt-in)

The same "draft from metric" path can run automatically. Opt a metric in via the **Automatic incident creation** section on its edit form (Pro+). When the metric flips unhealthy, Observer creates the draft for you and emails your org owners with publish / dismiss buttons.

The auto flow shares the same dedup rule as the manual CTA — if an open incident already affects the metric's service, a message is appended to the existing incident instead of opening a new one. Per-metric cooldown is one hour. Drafts that go unactioned for 24 hours auto-expire.

See the full setup walkthrough at [Auto-incident creation](/docs/guides/auto-incident-creation).

If you have not yet read the threshold semantics doc, read [Thresholds and dwell](/docs/concepts/thresholds-and-dwell) first. The "metric is unhealthy" claim assumes you understand the comparison rule and dwell gating.

---
url: https://docs.use.observer/docs/concepts/manual-metrics
title: Manual metrics
description: When the agent can't measure it, set the status explicitly.
---

Most Observer metrics are probed: an agent runs a check on a schedule and reports the result. Manual metrics are the escape hatch for signals that have no automation, or where the operator wants to control the status surface explicitly.

## When to use a manual metric

- A signal that has no observability today (a third-party SaaS outage, a vendor dependency, an internal system without instrumentation).
- A high-level rollup that should follow operator judgment, not a noisy underlying measurement.
- A service whose status is gated on a contract with a vendor (where Observer should reflect what the vendor says, not what we measure).

## What is different

| Aspect | Probed metric | Manual metric |
|---|---|---|
| `source_type` | `prometheus`, `http`, `tcp`, etc. | `manual` |
| Agent involvement | Agent runs the probe and pushes status. | Agent never sees the metric (filtered at the definitions endpoint). |
| Status transitions | Threshold + dwell gating against the measured value. | Explicit set via UI / API / incident. |
| Webhook payload | `source: "probe"` on flips. | `source: "manual"` or `"incident"`. |

## How status flips

Three paths set status on a manual metric:

1. **Console UI**: the metric detail page shows a clickable status pill. Owner-tier users can pick a new status from the dropdown.
2. **API**: `POST /api/v1/metrics/{id}/status` with `{"status": "unhealthy"}`. Scope: `write:metrics`.
3.
**Incident impact**: when an open incident lists a service that contains a manual metric, that metric auto-flips to mirror the incident severity. On resolve, it returns to its last explicitly-set status (default `healthy`). ## Threshold model Manual metrics carry no thresholds. Status is set directly. The `healthy_*` / `unhealthy_*` columns on the metric definition are ignored; the form hides the threshold section when source type is `manual`. Every manual transition writes an audit log row of action `metric.status.set_manually` with the actor, source, and old → new status. Useful when reconstructing why a public page rendered a particular status at a given time. --- url: https://docs.use.observer/docs/concepts/incident-slo-impact title: Incident SLO impact description: How the auto-impact panel computes burn rate and time to budget exhaustion. --- When an incident lists affected services, every SLO bound to those services contributes to the auto-impact panel. The panel updates every 30 seconds while an incident is open and freezes on resolve. ## What gets computed For each affected SLO: - **Burn during incident**: total seconds the metric was `unhealthy` between the incident's `published_at` and either `resolved_at` or now, whichever is earlier. - **Percent of budget consumed**: burn seconds divided by the SLO's total budget seconds. Total budget = window seconds × (1 − target%). - **Total budget remaining**: read from `slos.error_budget_remaining_pct` (populated by the SLO eval scheduler tick, not recomputed in the panel). - **Time to exhaust**: at the current burn rate (burn seconds / incident duration seconds), how long until the remaining budget reaches zero. Reported in minutes; null when the burn rate is zero. ## Caching Repeated panel polls within 30 seconds reuse the same computation (in-memory cache keyed by incident id). This protects the SLO eval pipeline from hammering when an open dashboard polls every 30s. ## Sources of error - The metric history table is the source of truth for burn. If the agent missed pushes during the incident, those gaps are not counted as unhealthy. - The remaining-budget % comes from the most recent SLO eval tick. If the scheduler fell behind, the value can be stale by a few minutes. The burn-during-incident value is always fresh. - Time-to-exhaust extrapolates a linear burn rate. Real systems rarely sustain a linear rate; treat the number as a rough budget rather than a precise countdown. ## Public visibility The auto-impact panel is console-only by default. A per-incident toggle exposes a slimmed view (burn % only, no time-to-exhaust) on the public incident page. Some operators choose to surface it for transparency; others view it as internal-only. The default is off. --- url: https://docs.use.observer/docs/quickstart/first-metric title: Define your first metric description: Install the agent, define a metric backed by a Prometheus query, and report status to Observer Cloud. --- This page walks through installing the Observer agent, defining a metric backed by a Prometheus query, and confirming that the cloud receives status pushes. ## Prerequisites - A Prometheus server reachable from the host or cluster that will run the agent. - A container runtime (Docker or Kubernetes) or a Linux host with systemd. - An Observer Cloud account. Sign up at [use.observer](https://use.observer). Observer also supports HTTP, TCP, DNS, and TLS certificate probes. 
Prometheus is documented first because most operators already run it, and the agent's reported value is straightforward to verify against a number you already trust.

## Steps

### Create an organisation

Sign in at [use.observer](https://use.observer/console/auth) and create an organisation. The organisation slug becomes the URL path under `/console/` and defines the tenant boundary for every resource below.

### Create an agent and copy its key

In the console, open **Agents**, then **New agent**. Provide a name (typically the hostname) and submit. The next screen reveals the agent key once. Copy it before navigating away.

The key has the form `obs_live_<43 base64url characters>`. Observer Cloud stores its hash, never the plaintext. A lost key requires rotation through the console.

### Run the agent

Pick the runtime that matches your environment. The container exposes a debug dashboard on port `10101`.

```bash title="docker run"
docker run -d \
  --name observer-agent \
  -p 10101:10101 \
  -e AGENT_KEY=obs_live_... \
  -e CLOUD_SERVER_URL=https://use.observer \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  ghcr.io/useobserver/agent:1.0.1
```

```yaml title="agent.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: observer-agent
spec:
  replicas: 1
  selector: { matchLabels: { app: observer-agent } }
  template:
    metadata: { labels: { app: observer-agent } }
    spec:
      containers:
        - name: agent
          image: ghcr.io/useobserver/agent:1.0.1
          ports: [{ containerPort: 10101 }]
          env:
            - name: AGENT_KEY
              valueFrom: { secretKeyRef: { name: observer, key: agent-key } }
            - name: CLOUD_SERVER_URL
              value: https://use.observer
            - name: PROMETHEUS_SERVER_URL
              value: http://prometheus.monitoring:9090
```

Verify the connection. With Docker, browse `http://localhost:10101` on the host running the container. In Kubernetes, port-forward the deployment with `kubectl port-forward deploy/observer-agent 10101:10101` and open `http://localhost:10101`. The dashboard's *Cloud* panel shows a recent `last_heartbeat_at`. The Agents page in the console marks the agent as **running** within roughly 90 seconds.

### Define a metric

In the console, open **Metrics**, then **New metric**. Select the agent created above and set the source type to **Prometheus**. Enter a query that returns a single scalar. A standard example is the five-minute 5xx error ratio:

```text title="PromQL"
rate(http_requests_total{job="checkout-api",status=~"5.."}[5m])
/
rate(http_requests_total{job="checkout-api"}[5m])
```

Set thresholds:

- Healthy: `under 0.005` (less than 0.5% errors).
- Unhealthy: `over 0.02` (greater than 2% errors).

Values that match neither boundary resolve to `degraded`.

Threshold operators are strict. A value exactly equal to a threshold under `over` or `under` does not match the band. Configure thresholds with this in mind: a 0.5% healthy boundary expressed as `under 0.005` does not call exactly `0.005` healthy.

Set **Interval** to `1` minute and save.

### Confirm reporting

Within one push interval the metric appears in the Metrics list with its current status. Open the metric to see the latest value, last push timestamp, and rolling history.

To verify the round trip, lower the unhealthy threshold below the current value. The metric flips to `unhealthy` on the next push. Restore the original threshold and the metric returns to `healthy`.

## Result

```text
Prometheus → Observer Agent → Observer Cloud → status pages
             (your network,    (control plane)
              debug on :10101)
```

The agent computes status client-side and pushes `{ metric_id, value, status, timestamp }` only. Raw query strings stay in your network.
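For clarity, the strict-operator rule the agent applies is small enough to sketch in full. This is an illustration with hypothetical names, not the agent's actual code:

```ts
type Status = "healthy" | "degraded" | "unhealthy";
type Op = "over" | "under";

interface Band { op: Op; value: number }

// Strict comparison: a sample exactly equal to the boundary matches neither band.
const matches = (sample: number, band: Band): boolean =>
  band.op === "over" ? sample > band.value : sample < band.value;

function verdict(sample: number, healthy: Band, unhealthy: Band): Status {
  if (matches(sample, unhealthy)) return "unhealthy"; // worst band wins
  if (matches(sample, healthy)) return "healthy";
  return "degraded"; // matched neither boundary
}

// With the quickstart thresholds (healthy under 0.005, unhealthy over 0.02):
const healthy: Band = { op: "under", value: 0.005 };
const unhealthy: Band = { op: "over", value: 0.02 };
verdict(0.001, healthy, unhealthy); // "healthy"
verdict(0.005, healthy, unhealthy); // "degraded": strict, equality does not match
verdict(0.04, healthy, unhealthy);  // "unhealthy"
```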
## Next

- [Define your first SLO](/docs/quickstart/first-slo)
- [Publish your first status page](/docs/quickstart/first-status-page)
- [Observer Agent reference](/agent)

---
url: https://docs.use.observer/docs/quickstart/first-metric-http
title: Define your first metric (HTTP probe)
description: Install the agent, define a metric backed by an HTTP probe, and report status to Observer Cloud.
---

This page walks through installing the Observer agent, defining a metric that probes an HTTP endpoint directly, and confirming that the cloud receives status pushes. Use this path when no Prometheus server is in place, or when the signal you want to measure is the endpoint's reachability and response time itself.

## Prerequisites

- An HTTP endpoint reachable from the host or cluster that will run the agent.
- A container runtime (Docker or Kubernetes) or a Linux host with systemd.
- An Observer Cloud account. Sign up at [use.observer](https://use.observer).

HTTP probes report `response_time_ms` for successful requests and `no_data` with a reason code on failure. Prometheus probes evaluate a PromQL query that already reflects the system's own observation of itself. Pick HTTP when the question is "is this endpoint reachable and fast"; pick Prometheus when the question is "is this metric within bounds". The [Prometheus quickstart](/docs/quickstart/first-metric) covers the latter.

## Steps

### Create an organisation

Sign in at [use.observer](https://use.observer/console/auth) and create an organisation. The organisation slug becomes the URL path under `/console/` and defines the tenant boundary for every resource below.

### Create an agent and copy its key

In the console, open **Agents**, then **New agent**. Provide a name (typically the hostname) and submit. The next screen reveals the agent key once. Copy it before navigating away.

### Run the agent

HTTP probes do not require Prometheus. Omit `PROMETHEUS_SERVER_URL` from the agent's environment when no Prometheus probes are defined.

```bash title="docker run"
docker run -d \
  --name observer-agent \
  -p 10101:10101 \
  -e AGENT_KEY=obs_live_... \
  -e CLOUD_SERVER_URL=https://use.observer \
  ghcr.io/useobserver/agent:1.0.1
```

```yaml title="agent.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: observer-agent
spec:
  replicas: 1
  selector: { matchLabels: { app: observer-agent } }
  template:
    metadata: { labels: { app: observer-agent } }
    spec:
      containers:
        - name: agent
          image: ghcr.io/useobserver/agent:1.0.1
          ports: [{ containerPort: 10101 }]
          env:
            - name: AGENT_KEY
              valueFrom: { secretKeyRef: { name: observer, key: agent-key } }
            - name: CLOUD_SERVER_URL
              value: https://use.observer
```

Verify the connection. With Docker, browse `http://localhost:10101` on the host running the container. In Kubernetes, port-forward the deployment with `kubectl port-forward deploy/observer-agent 10101:10101` and open `http://localhost:10101`. The dashboard's *Cloud* panel shows a recent `last_heartbeat_at`. The Agents page in the console marks the agent as **running** within roughly 90 seconds.

### Define an HTTP metric

In the console, open **Metrics**, then **New metric**. Select the agent created above and set the source type to **HTTP**. Configure the probe:

- **URL**: the full URL the agent should hit, for example `https://api.example.com/healthz`.
- **Method**: `GET`.
- **Expected status**: `200` (the probe reports `no_data` with `unexpected_status:<code>` for any other code).
- **Timeout (ms)**: `5000`. The probe reports `ETIMEDOUT` if the request takes longer.
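Conceptually, each probe run measures elapsed time and maps failures to reason codes. A sketch under stated assumptions (Node 18+ `fetch`; names are hypothetical, and the agent's real implementation and full reason-code list are in [Configure HTTP probes](/agent/guides/http-probes)):

```ts
type ProbeOutcome =
  | { ok: true; responseTimeMs: number } // pushed as the metric value
  | { ok: false; reason: string };       // pushed as no_data plus a reason code

async function runHttpProbe(url: string, expectedStatus = 200, timeoutMs = 5000): Promise<ProbeOutcome> {
  const started = Date.now();
  try {
    const res = await fetch(url, { method: "GET", signal: AbortSignal.timeout(timeoutMs) });
    if (res.status !== expectedStatus) {
      return { ok: false, reason: `unexpected_status:${res.status}` };
    }
    return { ok: true, responseTimeMs: Date.now() - started };
  } catch (err) {
    // Simplified: real reason codes distinguish timeouts, refused connections,
    // DNS failures, and so on.
    const timedOut = err instanceof Error && err.name === "TimeoutError";
    return { ok: false, reason: timedOut ? "ETIMEDOUT" : String(err) };
  }
}
```

A successful run yields `response_time_ms`, which then goes through the same threshold rule as any other metric value.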
Set thresholds against `response_time_ms`:

- Healthy: `under 500` (response under 500ms).
- Unhealthy: `over 2000` (response over 2 seconds).

Values that match neither boundary resolve to `degraded`.

For endpoints that return 200 even when the underlying service is degraded, set **Body match** to a marker string that only appears in the healthy response (for example, `"status":"ok"`). The probe reports `body_mismatch` if the response body is missing that string. Only the first 4KB of the response is read.

Set **Interval** to `1` minute and save. The probe runs every minute and pushes `response_time_ms` plus the resolved status to the cloud.

### Confirm reporting

Within one push interval the metric appears in the Metrics list with its current status. Open the metric to see the latest value, last push timestamp, and rolling history.

To verify the round trip, lower the unhealthy threshold below the current response time. The metric flips to `unhealthy` on the next push. Restore the original threshold and the metric returns to `healthy`.

## Probe behaviour

The agent computes status client-side. The cloud receives only the verdict:

```text
{ metric_id, value: <response_time_ms>, status: <verdict>, timestamp }
```

The full HTTP request runs from the agent's vantage point. The cloud has no path to the endpoint. Request bodies, response bodies, and headers stay in your network.

The full reason-code list and field reference are in [Configure HTTP probes](/agent/guides/http-probes).

## Next

- [Define your first SLO](/docs/quickstart/first-slo)
- [Publish your first status page](/docs/quickstart/first-status-page)
- [Configure HTTP probes](/agent/guides/http-probes) covers the full per-field reference: redirects, custom headers, TLS verification, and body matching.

---
url: https://docs.use.observer/docs/quickstart/first-slo
title: Define your first SLO
description: Attach a service level objective to a metric and read the error budget.
---

A Service Level Objective (SLO) wraps an existing metric in a target: a percentage of a rolling window during which the metric must report `healthy`. The gap between the target and the actual healthy time is tracked as an **error budget**. When the metric reports `unhealthy`, the budget burns. When it recovers, the burn stops. The budget surfaces on status pages and in webhook events.

## Prerequisites

- A reporting metric. If one is not in place, complete [Define your first metric](/docs/quickstart/first-metric) first.

## Steps

### Create a service

Services group related SLOs and render as a row on status pages. In the console, open **Services**, then **New service**. Name it after the system the SLOs describe, for example `checkout-api`. The description field is optional.

### Define the SLO

Open the service, then **SLOs**, then **New SLO**. Configure:

- **Metric**: the metric defined in the previous quickstart page.
- **Target**: percentage of the window the metric must remain healthy. A common starting value is `99.9`.
- **Window**: rolling window in days. A common starting value is `30`.
- **Public**: enables rendering on customer-facing status pages.

Save the SLO.

The right target depends on the system's achieved availability over the prior 90 days. If that data is not available, start with `99.5%` and tighten once the SLO has accumulated a few weeks of history. A target tighter than reality burns budget on noise and loses signal value.

### Read the burn timeline

Open the SLO. The detail page reports:

- **Error budget remaining**: percent of the window's allowance still available.
  At `99.9% / 30 days`, the allowance is roughly 43 minutes. Below 100% indicates the metric has been unhealthy for some of the window.
- **Latest burn event**: the current or most recent unhealthy stretch, including its start, end (or a marker indicating it is still open), and the percent of the budget burned.
- **History**: prior burn events in the window, with duration and budget cost.

The evaluator runs once per minute. If the metric flipped to unhealthy during the previous quickstart page, a burn event is visible here.

### Subscribe to webhook events

If the organisation's plan includes outbound webhooks, open **Webhooks**, then **New subscription**. The events relevant to SLOs are:

- `slo.burn_started`: an SLO crossed below its target. The payload includes the `slo_id`, `service_id`, `started_at`, and the current `error_budget_burned_pct`.
- `slo.burn_resolved`: the SLO recovered. The payload includes the matching `burn_event_id` and the `final_budget_remaining_pct`.

Wire deliveries to PagerDuty, Slack, or any HTTPS endpoint that accepts JSON. Endpoint quotas vary by plan.

## Calculation

Each evaluator tick reads the metric's last status, updates the moving window, and recomputes:

```text title="error budget"
budget_burned        = total seconds in unhealthy status, within the window
budget_total         = window_seconds * (1 - target / 100)
budget_remaining_pct = 100 * (1 - budget_burned / budget_total)
```

Statuses other than `unhealthy` (`degraded`, `no_data`, `unknown`) do not burn budget. Brief `degraded` flickers therefore do not consume the allowance on their own.
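The same arithmetic as a runnable sketch, using the `99.9% / 30 days` numbers from this page:

```ts
// Error budget per the formula above. targetPct is a percentage, e.g. 99.9.
function errorBudget(windowDays: number, targetPct: number, burnedSeconds: number) {
  const windowSeconds = windowDays * 86_400;
  const budgetTotalSeconds = windowSeconds * (1 - targetPct / 100);
  return {
    budgetTotalSeconds, // 2,592 s, roughly 43.2 minutes, for 99.9% / 30 days
    remainingPct: 100 * (1 - burnedSeconds / budgetTotalSeconds),
  };
}

errorBudget(30, 99.9, 0);    // remainingPct: 100
errorBudget(30, 99.9, 1296); // remainingPct: 50, half the 43.2-minute allowance burned
```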
Customers can carry per-SLO target overrides. See [Customer scopes](/docs/concepts/customer-scopes) for the model.

## Next

- [Publish your first status page](/docs/quickstart/first-status-page)
- [Webhook payload reference](/docs/reference/webhook-payloads) covers `slo.burn_started` and `slo.burn_resolved` payload shapes.

---
url: https://docs.use.observer/docs/quickstart/first-status-page
title: Publish your first status page
description: Compose services, metrics, and SLOs into a customer-facing status page on a subdomain.
---

Status pages are the customer-facing surface of the resources configured in the previous two quickstart pages: services, metrics, SLOs, and incident updates. This page covers creating a page, adding content blocks, and shipping it on a subdomain.

## Prerequisites

- A reporting metric (see [Define your first metric](/docs/quickstart/first-metric)).
- Optionally, an SLO (see [Define your first SLO](/docs/quickstart/first-slo)). Pages without SLOs render correctly but lose the rolling availability signal.

## Steps

### Create the page

In the console, open **Pages**, then **New page**. Configure:

- **Title**: the heading rendered at the top of the page.
- **Subdomain**: the URL path, served as `<subdomain>.use.observer`. Lowercase letters, digits, and hyphens are accepted. The values `admin` and `blog` are reserved.
- **Theme**: pick a preset; further customization is available in the page builder.

Save the page. It is now reachable but contains no content blocks.

### Add metric blocks

Open the page in the builder. Drag the **Metrics** block onto the canvas and select the metric defined earlier. Metrics can be grouped by namespace to mirror service topology:

```text title="example grouping"
api/
  checkout-api
  payment-router
web/
  dashboard
```

Each group's name renders as a section heading on the published page, and the metrics within share a status row.

### Add the SLO strip

Drag the **SLO strip** block. Select the SLO. The strip renders the target, the window, the remaining error budget, and the current burn event.

The block requires the SLO's **Public** flag to be enabled. If the strip does not render after publish, open the SLO and toggle public visibility on.

### Publish updates and incidents (optional)

The **Updates** block surfaces incident posts at the top of the page. Open the page's update feed, create an Update with type `Incident`, and the page renders the incident with its timeline and follow-up posts.

Updates are not required to publish the page; this step shows where they appear when an incident is in progress.

### Visit the page

`<subdomain>.use.observer` resolves to the rendered page. On a local development cloud the URL is `http://<subdomain>.localhost:3000`. In production it follows the wildcard DNS the operator has pointed at the cloud.

## Result

The page renders a header with the rolled-up status and any active SLOs, followed by a section per metric group, followed by an incident timeline. Every value reflects live metric data computed against the configured thresholds and SLO targets.

## Next

- [Observer Agent reference](/agent) covers probe types, on-host configuration, and dashboard panels.
- Guides on customer-scoped pages, password protection, and theme customization will appear in the sidebar as content is published.

---
url: https://docs.use.observer/docs/quickstart/first-incident
title: File your first incident
description: Walk a draft → publish → update → resolve incident through the console.
---

Incidents are the customer-facing comm layer that sits on top of metric-driven status. This page walks through filing one end to end: draft, publish, append a follow-up, resolve. Every step has an API equivalent (see the API tab) for IR automation.

## Prerequisites

- An organisation with at least one service. If services are not yet defined, create one under **Services** > **New service** before starting.
- Optional: an SLO bound to the service. The auto-impact panel on the incident detail page only renders when an affected service has at least one SLO.

## Steps

### Open the new-incident form

In the console, navigate to **Updates** > **Post update**. Pick **Incident** as the type. The form prompts for severity, title, affected services, and customer visibility.

### Fill the headline

- **Severity**: minor, major, or critical. The badge color on the public page follows this value.
- **Title**: one customer-facing sentence (avoid jargon and internal IDs; this is what visitors will see at the top of the timeline).
- **Affected services**: pick at least one. The auto-impact panel reads this list to compute SLO burn during the incident.
- **Visibility**: leave empty for a public incident. Pick specific customers to scope the incident to those tenants only.

### Decide draft vs publish

The form has two submit buttons:

- **Save as draft**: creates the row but does not publish. The incident is editable; nothing renders on the public page or fires to webhook subscribers.
- **Publish now**: sets `published_at = now()`, fires `incident.published`, and renders the incident on the public page.

For the first incident, choose **Save as draft** so you can review the form before going customer-facing. The next step publishes.

### Append the first message

Open the incident from the **Updates** list. The detail page shows an **Auto-impact** panel (live SLO burn, polled every 30 seconds), the message timeline, and a **New message** popover. Add a message of type **Investigating** with a brief description.
The public page renders messages in chronological order under the incident header. ### Publish the incident Use the right-side rail action to publish. The incident now renders on the public page. Webhook subscribers receive `incident.published` and `incident.message_added` events. ### Resolve When the underlying issue clears, append a final message of type **Resolved**. Observer auto-marks the parent incident resolved (`resolved_at` populated, lifecycle pill flips to `resolved`, `incident.resolved` webhook fires). Every step above maps to a `/api/v1/incidents/*` endpoint. The [API reference](/api) documents the full surface, including `POST /incidents/{id}/publish`, `POST /incidents/{id}/messages`, and `POST /incidents/{id}/resolve`. ## Related - [Incident lifecycle reference](/docs/reference/incident-lifecycle) - [Customer-scoped incidents](/docs/concepts/customer-scopes) - [Webhook payload reference](/docs/reference/webhook-payloads) --- url: https://docs.use.observer/docs/quickstart/first-maintenance title: Schedule your first maintenance description: Schedule a maintenance window with auto-start and auto-complete. --- Maintenance windows differ from incidents in two ways: they are planned in advance, and Observer auto-transitions them through their lifecycle on a cron tick (so you do not have to remember to mark "started" / "completed" manually). ## Steps ### Open the new-maintenance form In the console, navigate to **Updates** > **Post update**. Pick **Maintenance** as the type. The form replaces the severity field with a **Scheduled start** + **Scheduled end** pair. ### Set the window Pick the start and end times in your local timezone. The cron auto-transitions the maintenance through: - `scheduled` → `in_progress` when `now() >= scheduled_start_at`. - `in_progress` → `completed` when `now() >= scheduled_end_at`. A `maintenance.starting_soon` webhook fires one hour before `scheduled_start_at` (idempotent; once per maintenance row). ### Pick affected services The public page renders a banner for the maintenance starting within 24 hours and a sticky banner while in progress. The banner lists affected services so customers know which surfaces are impacted. ### Publish Maintenances always publish on save (drafts are an incident-only flow). The page banner appears within 24 hours of the scheduled start; subscribers receive `maintenance.scheduled` immediately and `maintenance.starting_soon` one hour out. ## Cancel a scheduled maintenance Open the maintenance row from **Updates**. The right-side rail has a **Cancel** action. `canceled_at` is set, the banner is removed, and the `maintenance.canceled` webhook fires. The lifecycle transitions can be triggered manually via the API (`POST /api/v1/maintenances/{id}/start` and `/complete`) when the cron schedule does not match the actual change-window timing. --- url: https://docs.use.observer/docs/quickstart/first-subscriber title: Add email subscribers to your status page description: Configure the subscribe block, set up double opt-in, and verify a test subscription. --- Public status pages can collect email subscribers. Each subscription goes through double opt-in (the visitor must click the link in a confirmation email) and includes per-message unsubscribe. ## Steps ### Enable subscriptions on the page Open the page in the builder. In the access settings, ensure **Allow subscriptions** is on. The setting is on by default but can be turned off per page. 
### Add the Subscribe block

In the page builder, drag the **Subscribe** block onto the canvas. The block renders an email signup field on the public page; visitors who submit an email receive a confirmation message.

### Verify with a test subscription

From the published page, submit a real email you control. Check the inbox for a confirmation email. Click the confirm link. The subscriber row's `confirmed_at` is set; future incidents fire notifications to this address.

### Verify unsubscribe

Click the unsubscribe link from any received email. The row's `unsubscribed_at` is set. Future events skip the recipient. The link is idempotent: re-clicking does nothing.

## Filtering

Subscribers can opt into specific services or metrics rather than the full page. The Subscribe block exposes a checkbox list when at least one service has been added to the page; the chosen scopes write to `subscriber_filters`.

The subscriber-per-page cap depends on the org's plan tier (see [Plans and quotas](/docs/reference/plans-and-quotas)). Daily email caps follow the same matrix; both are enforced server-side.

## Programmatic export

The console **Customers** > **Subscribers** page supports CSV export for an org's full active list. Useful when migrating between systems or generating opt-in audit trails.

---
url: https://docs.use.observer/docs/guides/outbound-webhooks
title: Configure outbound webhooks
description: Subscribe an HTTPS endpoint to status, SLO, and agent events.
---

Outbound webhooks deliver Observer events to an HTTPS endpoint as JSON. They are the integration path for paging tools, ticketing systems, and chat notifications.

## Available event types

| Type | Trigger |
|---|---|
| `metric.status_changed` | A metric's status flips after dwell gating. |
| `metric.no_data` | A metric enters `no_data` because the agent could not collect a sample. |
| `page.status_changed` | A status page's rolled-up status flips. |
| `slo.burn_started` | An SLO crosses below its target. |
| `slo.burn_resolved` | An SLO recovers. |
| `agent.offline` | An agent misses its expected heartbeat window. |

## Configure a subscription

### Create the subscription

In the console, open **Webhooks**, then **New subscription**. Configure:

- **Endpoint URL**: an HTTPS endpoint that accepts POST requests with a JSON body. HTTP is rejected.
- **Signing secret**: optional shared secret. When set, Observer signs every delivery with HMAC-SHA-256 in the `X-Observer-Signature` header.
- **Event types**: tick the events the endpoint should receive.

Save the subscription. The first delivery confirms reachability.

### Verify deliveries

The subscription detail page lists recent deliveries with their HTTP response code and round-trip time. Successful deliveries return a 2xx response within the timeout window. Failed deliveries are retried with exponential backoff.

### Verify the signature (recommended)

When a signing secret is set, every delivery includes the header:

```text
X-Observer-Signature: sha256=<digest>
```

Compute `HMAC-SHA-256(secret, raw_body)` on the receiving side and compare against the header value. Reject deliveries whose signatures do not match.

## Quotas

Webhook subscription quotas vary by plan. Endpoints over the cap on a downgrade remain configured; new endpoints are blocked until the plan is upgraded or an existing endpoint is removed.

Deliveries that fail every retry attempt land in the subscription's dead-letter list. Open the subscription, review the failures, and either fix the endpoint and replay, or discard.
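A receiver-side signature check is a few lines. A sketch using Node's built-in `crypto` (the hex digest encoding is an assumption; confirm the exact format against your own delivery headers):

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// rawBody must be the exact bytes received; parse JSON only after verifying.
function verifyObserverSignature(rawBody: Buffer, header: string, secret: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(header);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b); // constant-time compare
}
```

`timingSafeEqual` avoids leaking how many leading bytes of a forged signature were correct.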
## Payload reference

See the [webhook payload reference](/docs/reference/webhook-payloads) for the JSON shape of each event type.

---
url: https://docs.use.observer/docs/guides/custom-domain
title: Serve a status page on your own domain
description: Point status.yourdomain.com at Observer with automatic TLS.
---

Add a custom domain so the public status page lives at `status.yourdomain.com` instead of `something.use.observer`. Observer provisions a TLS certificate automatically. The default subdomain keeps working as a fallback.

## Prerequisites

- A Starter plan or higher. Free accounts cannot configure custom domains.
- A status page already created.
- Permission to add a CNAME record on the domain you want to use.

## Pick a subdomain, not the root

CNAME records cannot be set on the root of a domain. Use `status.yourdomain.com`, not `yourdomain.com`. If your status page is the only thing on that domain, `www.yourdomain.com` is a reasonable choice.

## Steps

### Add the custom domain in Observer

In the console, open the status page and choose **General**. Under **Custom domain**, type the hostname you want to use (for example `status.yourdomain.com`) and click **Add custom domain**. Observer creates the record in `dns_pending` state and starts checking every 30 seconds for the CNAME you'll add next.

### Add the CNAME at your DNS provider

Create a CNAME record with these values:

| Field | Value |
| ----- | ----- |
| Type  | `CNAME` |
| Name  | the subdomain (`status` if your domain is `status.yourdomain.com`) |
| Value | `cname.use.observer` |
| TTL   | `300` (or "automatic") |

The Observer UI shows provider-specific notes for Cloudflare, Route 53, GoDaddy, Namecheap, Vercel, and Netlify under the **DNS provider** dropdown.

**Cloudflare:** the CNAME record must be DNS-only (grey cloud), not proxied. A proxied CNAME ends in Cloudflare error 1014 ("CNAME Cross-User Banned") because the destination is on a different Cloudflare account.

### Verify

Wait a minute or two for DNS to propagate, then click **Check now** in the custom domain card. The state pill walks through:

- `dns_pending` — Observer hasn't seen the record yet.
- `dns_invalid` — your CNAME exists but points somewhere else. Fix the record and click Check now.
- `dns_verified` — DNS is right. Observer asks Let's Encrypt for a certificate.
- `cert_pending` — certificate issuance in progress (usually under a minute, sometimes up to an hour if the issuer is rate-limited).
- `active` — your domain serves the status page with a valid TLS cert.

## After it's active

The page serves at your custom hostname. The original `*.use.observer` URL keeps working — feel free to redirect from it in your own infrastructure if you want a single canonical URL.

Observer renews the certificate automatically about 30 days before expiry. The UI shows the next expiry date inside the custom domain card.

## Common failures

**`dns_pending` for more than 30 minutes.** Your DNS provider's TTL may be high (an hour or more). Wait it out, or temporarily lower the TTL.

**`dns_invalid` reporting that the CNAME points at another host.** Your CNAME is pointing at the wrong target. The correct value is `cname.use.observer`.

**`cert_failed` with "rate limited" in the message.** Let's Encrypt limits per-domain issuance. The cron tick retries every five minutes; the rate limit resets within an hour. Clicking Check now faster than that won't help.

**Cloudflare error 1014.** Your CNAME is proxied. Switch the record to DNS-only (grey cloud) in the Cloudflare DNS panel.
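To see what Observer's checker sees while you wait, you can resolve the record yourself. A minimal Node sketch (ESM, built-in `dns` module):

```ts
import { resolveCname } from "node:dns/promises";

// Should print [ 'cname.use.observer' ] once the record has propagated.
console.log(await resolveCname("status.yourdomain.com"));
```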
## Removing a custom domain

Click **Remove custom domain** in the General popover. The page stops serving on the custom hostname immediately and reverts to the default `*.use.observer` URL.

---
url: https://docs.use.observer/docs/guides/password-protected-pages
title: Password-protect a status page
description: Require visitors to enter a password before the page renders.
---

Status pages run in `public` access mode by default. The `password` mode requires a shared password before the page renders. Use it for internal-only or partner-only views that do not need per-customer scoping.

## Steps

### Switch the page to password mode

Open the page in the console, then **Access**. Set:

- **Mode**: `password`.
- **Password**: a shared secret you will distribute to authorised visitors.

Save. The page now redirects unauthenticated visitors to an unlock form.

### Distribute the password

Share the password through the channel that already gates access to the audience (e.g. a partner portal, an internal wiki, a signed email).

### Rotate the password

Open the page's access settings and update the password. Existing unlock cookies are invalidated at rotation, and visitors must re-enter the new password.

## Behaviour

- The unlock cookie is named `observer-page-access-<page-id>`.
- The cookie is signed against the current password hash. Rotating the password invalidates outstanding cookies.
- Cookie lifetime is one hour. After expiry, visitors re-enter the password.

Passwords are appropriate for low-stakes gating. For per-customer views, signed JWT access, or audit trails of who saw what, use [JWT-scoped access](/docs/guides/jwt-scoped-access) or [customer-scoped pages](/docs/guides/customer-scoped-pages).

---
url: https://docs.use.observer/docs/guides/jwt-scoped-access
title: Configure JWT-scoped access
description: Gate a status page behind a Bearer token verified against your public key or JWKS endpoint.
---

The `jwt` access mode gates a status page behind a Bearer token that Observer Cloud verifies against a public key (or a JWKS endpoint) you control. Use it when the audience already has an identity issued by your auth system, and you want the same identity to authorise status-page reads.

## Prerequisites

- A signing key (RS256, ES256, or any algorithm Observer's verifier supports). Either a single PEM public key or a JWKS endpoint Observer can fetch.
- A way to issue tokens for the audience (typically your auth service or an Identity Provider).

## Configure the page

### Switch the page to JWT mode

Open the page in the console, then **Access**. Set:

- **Mode**: `jwt`.
- **Public key** (PEM) **or** **JWKS URL**: whichever your issuer exposes.
- **Audience** (optional): the `aud` claim Observer should require.
- **Issuer** (optional): the `iss` claim Observer should require.

Save.

### Issue tokens

Sign tokens with the matching private key. Observer accepts:

- The `Authorization: Bearer <token>` header on requests to the page.
- The `?token=<jwt>` query parameter, for embed iframes that cannot set headers.

A typical claim set:

```json title="claims"
{
  "iss": "https://your-idp.example",
  "aud": "observer-status-page",
  "sub": "user-or-customer-identifier",
  "exp": 1716480000
}
```

### Validate the round trip

Open the page with the Bearer header set. Successful verification renders the page. A missing or invalid token returns 401.
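To exercise the round trip from a terminal, a minimal sketch (the host `acme.use.observer` and the token in `$JWT` are placeholders):

```bash
# 200 means the token verified and the page rendered; 401 means missing/invalid.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $JWT" \
  https://acme.use.observer
```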
When the audience is a set of distinct customers and each customer should see a different subset of metrics or different SLO thresholds, use [customer-scoped pages](/docs/guides/customer-scoped-pages) instead. That mode adds per-customer routing on top of the same JWT verification.

---
url: https://docs.use.observer/docs/guides/customer-scoped-pages
title: Configure customer-scoped pages
description: Render the same status page differently per customer, with per-customer SLO thresholds.
---

Customer-scoped pages let one underlying page serve multiple customers, with each customer's signed-in view filtered to the metrics, services, and SLOs in their contract. The same page can also apply per-customer SLO targets (for example, an enterprise customer contracted at `99.99%` reads against a different threshold than a standard customer contracted at `99.9%`).

## Prerequisites

- The audience already authenticates against an Identity Provider capable of signing JWTs.
- A list of customers in the console (open **Customers**, then **New customer**, and capture each customer's identifier).

## Configure the page

### Switch the page to customer-scoped mode

Open the page in the console, then **Access**. Set:

- **Mode**: `customer_scoped`.
- **Public key** or **JWKS URL**: same as JWT mode.
- **Customer claim**: the JWT claim Observer should read to identify the customer. Common choices are `sub`, `customer_id`, or a custom claim such as `obs_customer_id`.

Save.

### Bind customers to the page

Open the page, then **Access**, then **Customers**. Add each customer who is allowed to view the page. Customers without a binding receive a 403 even with a valid token.

### Apply per-customer SLO targets (optional)

Open a customer, then **SLO overrides**. Add an override with:

- The SLO whose target should be customer-specific.
- The customer's contracted target percentage.

When the customer-scoped page renders for that customer, the SLO strip uses the override target. Other customers viewing the same page see the default SLO target.

Customer scopes apply at render time. A single underlying metric can therefore back a `99%` agreement with one customer and a `99.99%` agreement with another, without duplicating the metric definition or the agent's collection work.

## Issuing tokens

Issue tokens for each customer with the agreed claim set. The customer-claim value must match a customer in the binding list.

```json title="claims"
{
  "iss": "https://your-idp.example",
  "aud": "observer-status-page",
  "sub": "user-1234",
  "obs_customer_id": "acme-cloud",
  "exp": 1716480000
}
```

## Behaviour

- Tokens for customers without a page binding return 403, even when the token is otherwise valid.
- An expired token returns 401, and the embedded view re-fetches a token.
- SLO overrides are read on every render and require no caching on the consumer side.

---
url: https://docs.use.observer/docs/guides/multiple-metric-sources
title: Use multiple metric sources
description: Mix Prometheus, HTTP, TCP, DNS, and TLS certificate probes in one Observer organisation.
---

Observer's agent supports several probe runtimes within one deployment. Pick the source that produces the most reliable signal for what you want to assert about the system.
## Source types | Source type | Returns | Typical use | |---|---|---| | `prometheus` | scalar from a PromQL query | latency / error rate / saturation against existing series | | `http` | response time in ms | reachability + body match against an endpoint | | `tcp` | connect time in ms | reachability for non-HTTP services (Redis, Postgres) | | `dns` | resolve time in ms | DNS resolution path with optional record-value match | | `tls_cert` | days until certificate expiry | leaf-cert validity for a hostname | Stubbed in the schema and reserved for future runtimes: `icmp`, `grpc`, `websocket`, `mtls_http`, `database`. Definitions using these source types are accepted by the cloud and stored, but the agent returns `no_data` until the runtime ships. ## Configure a non-Prometheus metric Open **Metrics**, then **New metric**, and pick the source type. Each source has its own configuration form: - **HTTP**: URL, expected status code(s), optional body match, optional headers, timeout, follow-redirects, verify-TLS toggle. - **TCP**: host, port, timeout. - **DNS**: domain, record type (`A`, `AAAA`, `CNAME`, `MX`, `TXT`, `NS`, `SRV`, `CAA`, `PTR`), optional expected value, optional resolver. - **TLS cert**: host, port (default `443`), warn-days, critical-days. The thresholds remain consistent: each metric has `healthy_*` and `unhealthy_*` operators applied to whatever value the source returns. - HTTP `response_time_ms`: healthy `under 500`, unhealthy `over 2000`. - TLS cert `days_until_expiry`: healthy `over 30`, unhealthy `under 7`. - DNS `resolve_time_ms`: healthy `under 100`, unhealthy `over 500`. ## Mixing sources on one page A status page can carry metrics from any combination of sources. The page renders each metric using its threshold band, regardless of the runtime that produced the value. Operators viewing the page see one consistent green / amber / red signal across heterogeneous checks. ## Agent reach The agent must be able to reach each source from its host. For Prometheus, that is your internal Prometheus URL. For HTTP probes, the URL must be reachable from wherever the agent runs (for example, an internal endpoint on a private network). The cloud never reaches your endpoints directly: the agent collects, computes status, and pushes the verdict. --- url: https://docs.use.observer/docs/guides/theme-customization title: Customise the status page theme description: Apply a built-in theme preset or override colours, typography, and spacing on a per-page basis. --- Status pages render against a token-driven theme. Every visible surface (background, foreground, accent, semantic colours, typography, spacing, border radius) is exposed as a CSS variable that a preset or per-page override can change without touching code. ## Pick a preset Open the page in the builder, then **Theme**. Each preset is a pre-baked combination of colours and typography intended for a particular brand register (warm-light, cool-dark, monochrome, and others). Selecting a preset writes its tokens to the page's `page_themes` row. ## Override individual tokens The theme editor surfaces every token the public page consumes: - **Background, surface, foreground**: page chrome. - **Accent**: status pill, primary buttons, link colour. - **Success, warning, danger**: status indicators (`healthy`, `degraded`, `unhealthy`). - **Border, muted, muted foreground**: dividers and secondary text. - **Heading, body, mono**: font families. The page builder picks Google Fonts by default; arbitrary CSS `font-family` strings are also accepted. 
- **Spacing scale, radius**: layout density.

Every override is persisted on the page and applied at render time. Preview changes in the page builder before saving.

## Custom CSS

If a token override is not enough, open **Theme**, then **Custom CSS**. The CSS you provide is injected into the rendered page after the preset and token overrides. Use it for narrow corrections (e.g. shifting a margin, hiding a block on small viewports) rather than re-skinning the page.

Token overrides do not auto-correct contrast. Pick foreground colours that meet WCAG AA against the chosen background. The built-in presets are validated against AA at seed time.

## Preset rollout

A theme preset selected through the **Theme** picker writes the preset's tokens to the page row. Subsequent updates to the preset itself do not retroactively rewrite pages that already adopted it. To apply a refreshed preset, re-select it on each page that should update.

---
url: https://docs.use.observer/docs/guides/define-a-manual-metric
title: Define a manual metric
description: The cleanest path for operators without metrics infrastructure.
---

If you have no Prometheus server, no observability for the target system, or simply want a status surface that follows operator judgment rather than a measurement, manual metrics are the right shape.

## Steps

### Create the metric

In the console, navigate to **Metrics** > **New metric**. In the **Source type** picker, choose **Manual**. The form hides the probe config and threshold sections; manual metrics carry neither. Fill in the title and description; pick the agent association, if any (manual metrics ignore the agent at runtime, but the field stays for ownership / audit).

### Set the initial status

Save. Open the metric. The detail page shows a clickable status pill. Pick the right initial status (`healthy` is the most common).

### Bind the metric to a service and (optionally) an SLO

Manual metrics fit the same service / SLO model as probed metrics. Open the service, define an SLO that points at the manual metric, set a target, and the budget will burn whenever the metric is in the unhealthy state — same machinery as a probed metric.

### Hook up automation (optional)

For systems with their own observability, you can drive a manual metric from outside Observer:

```bash
curl -X POST https://use.observer/api/v1/metrics/$METRIC_ID/status \
  -H "Authorization: Bearer obs_pub_..." \
  -H "Content-Type: application/json" \
  -d '{"status":"unhealthy","note":"Vendor incident #VND-12345"}'
```

The scope `write:metrics` is required. The note ends up in the audit log.

When an open incident lists a service that contains a manual metric, that metric auto-flips to mirror the incident's severity. This is intentional: manual metrics have no probe, so the only meaningful signal is what the operator says is true. See [Manual metrics](/docs/concepts/manual-metrics) for the full semantics.

---
url: https://docs.use.observer/docs/guides/incidents-via-api
title: Create incidents via API
description: For IR automation and ChatOps integrations.
---

Every console action on incidents has an API equivalent. Most IR teams wire their alerting (PagerDuty, Opsgenie) or ChatOps (Slack slash commands, GitHub Actions) to file and update Observer incidents directly without an operator touching the console.

## Auth

API keys are issued per organisation. Two scopes cover incident automation:

- `write:incidents` — create / patch / publish / resolve / delete.
- `write:maintenances` — create / patch / start / complete / cancel.
Both inherit from `read:incidents` for retrieval. ## File a new incident from a Slack slash command ```bash curl -X POST https://use.observer/api/v1/incidents \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{ "title": "Checkout API errors", "severity": "major", "affected_services": [""], "publish": true, "initial_message": { "type": "Investigating", "description": "Investigating elevated error rate on checkout." } }' ``` The response includes `id`, the projected lifecycle state, and the affected-service rollup. Use the `id` for follow-up calls. ## Append a status update ```bash curl -X POST https://use.observer/api/v1/incidents/$ID/messages \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{ "type": "Identified", "description": "Identified bad deploy. Rolling back." }' ``` A `Resolved` message auto-marks the parent incident resolved. ## Resolve ```bash curl -X POST https://use.observer/api/v1/incidents/$ID/resolve \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{"description": "Rollback complete. Error rate back to baseline."}' ``` ## Schedule a maintenance window ```bash curl -X POST https://use.observer/api/v1/maintenances \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{ "title": "Database upgrade", "scheduled_start_at": "2026-06-01T02:00:00Z", "scheduled_end_at": "2026-06-01T04:00:00Z", "affected_services": [""] }' ``` ## Idempotency The `from-metric` endpoint is dedupe-protected: ```bash curl -X POST https://use.observer/api/v1/incidents/from-metric/$METRIC_ID \ -H "Authorization: Bearer obs_pub_..." ``` Calling this twice for the same metric within 30 minutes returns the same draft incident id. Useful when an alert hook may fire duplicate webhooks. Every endpoint above corresponds to a documented state transition. See [Incident lifecycle reference](/docs/reference/incident-lifecycle) for the full state machine. --- url: https://docs.use.observer/docs/guides/auto-incident-creation title: Auto-incident creation description: Opt a metric in to automatic draft-incident creation when it flips unhealthy. Drafts ship with email CTAs so a human always verifies before customers see the incident. --- When a metric flips unhealthy in the middle of the night, the on-call already knows. The question is whether the customer-facing status page should be updated to reflect that. Auto-incident creation does the typing-out part for you — without ever publishing without a human pressing a button. ## How it works 1. You opt a metric in to the feature on its edit form (Pro+). 2. The metric flips unhealthy (with dwell gating, exactly as a manual status change would). 3. The auto-incident worker creates a **draft** incident on the metric's bound service. 4. Observer emails your org owners with two buttons: **Publish** (flip to published; customers see it) and **Dismiss** (soft-delete the draft). 5. If neither button is clicked within 24 hours, the draft auto-expires. Nothing ever reaches the public page without a human action. A draft incident is just a row in your database with `publishedAt = NULL`. Your status page renders only published incidents. The draft exists for you to verify and act on — it can be safely dismissed if it turned out to be noise. ## Enable for a metric 1. Open **Console → Metrics → \ → Edit**. 2. Scroll to the **Automatic incident creation** section. 3. Pick a **Policy**: - **Off** — auto-creation is disabled for this metric. 
- **On — create immediately** — a draft is created the moment the metric flips unhealthy.
- **On — wait then re-check** — Observer waits the configured number of seconds, then re-checks the metric's current status. If it's still unhealthy, the draft is created. If the metric recovered during the dwell window, nothing happens. This is the recommended setting for metrics that occasionally flap.

4. Pick a **Severity** (`minor` / `major` / `critical`). This value is stamped on every auto-drafted incident.
5. For dwell mode, pick a **Dwell seconds** value between 60 and 3600. Defaults to 300 (5 minutes).
6. Save.

## What gets created

When the worker fires, you get:

- A new incident row with:
  - `title`: `Investigating elevated errors on <metric title>`
  - `severity`: as configured on the metric
  - `affected_services`: every service that has an SLO pointing at the metric
  - `is_auto_drafted`: `true`
- An initial `Information` message describing the value vs the threshold and the timestamp.
- An audit row (`incident.auto_drafted` on the metric, plus the parent row on the incident itself).
- A webhook event `incident.auto_drafted` (separate from the manual `incident.created` so you can listen specifically).
- An email to every org owner who hasn't opted out (see [Notification preferences](#notification-preferences)).

## Email CTAs

Each email has two buttons:

- **Publish incident** — `GET /api/incidents/auto-action?token=…&action=publish`. Flips the draft to published. Fires `incident.auto_published`.
- **Dismiss draft** — `GET /api/incidents/auto-action?token=…&action=dismiss`. Soft-deletes the row. Fires `incident.auto_dismissed` with `reason: "operator_dismiss"`.

The token format is `base64url(body) + "." + base64url(sig)`, where the body is a pipe-delimited string of the signed fields and the signature is `HMAC-SHA-256(server_secret, body)`. The action is part of the signed body, not just the URL — you can't flip a publish link to dismiss (or vice versa) by editing the URL. Tokens expire after 24 hours. (A sketch of the token shape appears below, after the plan gate.)

Both endpoints are idempotent. Re-clicking publish after the incident is already published returns a success page. Re-clicking dismiss after it's already dismissed returns a success page.

## Dedup, cooldown, and expiry

Three guardrails keep the auto-incident flow from spamming you:

1. **Dedup against open incidents on the service.** If you (or a prior auto-draft) have already filed an incident affecting the metric's service, the worker appends a new Information message to the existing incident instead of creating a duplicate. Message text: `Metric is now unhealthy (auto-detected).`
2. **One auto-draft per metric per hour.** If a metric was already auto-drafted or auto-dismissed in the last hour, the worker skips. Flapping metrics never produce more than one draft per hour.
3. **24-hour auto-expiry.** Drafts older than 24 hours that haven't been published or dismissed are soft-deleted by a cron that runs every 15 minutes, audited as `incident.auto_expired`, and fire `incident.auto_dismissed` with `reason: "auto_expired"`.

## Notification preferences

Per-user opt-out lives at **Console → Settings → Notifications → Auto-incident draft emails**. Default is ON for org owners. Owners who toggle this off do not receive auto-incident emails (other email types are unaffected). The toggle stores as `users.notification_preferences.autoIncidentDrafts = false` on the user row.

## Plan gate

This feature is **Pro+ only**. Free and Starter plans see a locked-feature card on the metric edit form; the policy stays **Off** (the default) until the plan is upgraded.
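To make the signed-token description above concrete, a shell sketch of the shape (the pipe-delimited field layout `inc_123|publish|1716480000` is purely illustrative; the real fields are server-defined):

```bash
# Illustrative reconstruction of base64url(body) + "." + base64url(sig).
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }
body='inc_123|publish|1716480000'            # hypothetical field layout
sig=$(printf '%s' "$body" | openssl dgst -sha256 -hmac "$SERVER_SECRET" -binary | b64url)
printf '%s.%s\n' "$(printf '%s' "$body" | b64url)" "$sig"
```

Because the action sits inside the signed body, changing `action=` in the URL without re-signing invalidates the token.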
## Webhook events Three event types fire from the auto flow: - `incident.auto_drafted` — fires when the draft is created. - `incident.auto_published` — fires when the draft is published via the email link (or the equivalent API endpoint). - `incident.auto_dismissed` — fires for both the email-dismiss and the 24h auto-expiry paths. `reason` distinguishes them. Payloads are documented at [Webhook payload reference](/docs/reference/webhook-payloads#incidentauto_drafted). ## Recommended setup For most teams: - **Dwell mode with 300 seconds** for any latency or error-rate metric. The dwell window catches noisy alarms before they generate an email. - **Immediate mode** for binary signals (TLS expiry hit zero, a service is unreachable). These should not flap, so dwell adds nothing. - Leave auto-creation **off** for noisy dashboards that are not customer-visible. The console already shows unhealthy metrics; not every internal alarm deserves a draft. --- url: https://docs.use.observer/docs/guides/migrate-from-statuspage title: Migrate from Statuspage description: Move services, components, incidents, and subscribers from Atlassian Statuspage to Observer. --- This guide covers a structured migration from Atlassian Statuspage to Observer. The two products share a customer-facing surface, but their backing models differ: Statuspage records component state manually or through an API call; Observer derives status from metrics that an agent collects in your network. Plan the migration around that difference. ## Model differences to plan for | Statuspage concept | Observer equivalent | Notes | |---|---|---| | Component | Metric (one or more, behind a service) | A Statuspage component represents the operator's manual verdict. An Observer metric represents a measured value evaluated against thresholds. One Statuspage component often becomes one Observer service with two or three Observer metrics. | | Component group | Service | Logical grouping. Maps cleanly. | | Manual incident state | Update with `Incident` type | Same semantics: posted updates with timeline. | | Status indicator (operational, degraded, partial outage, major outage) | Rolled-up page status (`healthy`, `degraded`, `unhealthy`) | Page rollup uses `unhealthy=3 > degraded=2 > healthy=1`. Pick the worst child status. | | API-driven component update | Metric reported by the agent | Stop calling Statuspage's `PATCH /components/:id`. The agent's status push replaces it. | | Subscribers (email / SMS / Slack / webhook) | Page subscribers + outbound webhooks | Email subscribers move with the data export. SMS is not supported; Slack and PagerDuty are reachable through outbound webhook subscriptions. | | Public status page domain | Status page subdomain | Both products serve a customer-facing domain. Plan a DNS cutover window. | | Maintenance windows | Update with `Scheduled maintenance` type | Posted in advance, displays on the page during the window. | ## Steps ### Inventory the Statuspage account Pull the list of: - Components and component groups (one row per metric to define in Observer). - Past 90 days of incidents (for the changelog you publish on the Observer page). - Active subscribers, exported as CSV. - Webhook subscribers, with their endpoint URLs. Statuspage's REST API exposes each of these. The export from **Account** > **Audit log** captures incident history; the **Subscribers** page exports CSV directly. 
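If you prefer to script the inventory, the Statuspage REST API exposes the component list directly. A sketch (assumes `jq` is installed and `$PAGE_ID` / `$API_KEY` come from the Statuspage account; endpoint shape per Atlassian's public API docs):

```bash
# One line per component: id, group, name, current manual status.
curl -s "https://api.statuspage.io/v1/pages/$PAGE_ID/components" \
  -H "Authorization: OAuth $API_KEY" |
  jq -r '.[] | [.id, .group_id // "-", .name, .status] | @tsv'
```

Each row becomes a candidate Observer service plus one or more metrics.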
### Stand up the Observer side in parallel

Follow [Define your first metric](/docs/quickstart/first-metric) to install an agent and define a first reporting metric. Build out the remaining metrics, services, and SLOs without touching the Statuspage account. The two systems run side-by-side until the DNS cutover.

For each Statuspage component, decide on the source signal:

- A latency or error-rate query already in Prometheus (use the Prometheus probe).
- An HTTP endpoint that returns 200 when the component is healthy (use the HTTP probe).
- A TCP socket, DNS record, or TLS certificate (matching probe type).

If a Statuspage component has no measurable signal today, that is a monitoring gap: the component's "operational" state was only ever the operator's manual verdict. Pick the closest measurable proxy and document the gap.

### Build the status page

Open **Pages** > **New page** in the Observer console. Recreate the public-facing layout: title, theme, services, metrics, SLO strip. The page is reachable on its `<subdomain>.use.observer` URL immediately, before the DNS cutover.

If the Statuspage account uses customer-scoped views (visible under different domains per customer), see [Configure customer-scoped pages](/docs/guides/customer-scoped-pages).

### Backfill incidents

Observer renders updates posted on the page. To preserve the public changelog, post each historical Statuspage incident as an Update with type `Incident`, dated to its original `created_at`. The console's **Updates** > **New update** form accepts a custom timestamp.

For high-incident accounts, scripting this against the Statuspage incident export and the Observer API is the practical path; for a typical SMB account with under 50 incidents a year, manual entry is fast.

### Migrate subscribers

Observer accepts an email-subscriber import via the API. For webhook subscribers, recreate the subscription in **Webhooks** > **New subscription**, point it at the same endpoint URL, and pick the events that match what the consumer expects. Webhook payload shapes are documented in [Webhook payload reference](/docs/reference/webhook-payloads).

SMS subscribers need to be re-acquired. Email those subscribers during the migration window with a link to the Observer subscribe form on the new page.

### Cut over DNS

When the Observer page renders correctly and all subscribers are migrated, point your status subdomain (commonly `status.yourdomain.com`) at the Observer cloud's wildcard (CNAME to `cname.use.observer`; see [Serve a status page on your own domain](/docs/guides/custom-domain)). The page resolves immediately; visitors see no transition.

Disable updates from the Statuspage API in your alerting and CI systems; the agent's metric pushes now drive Observer's status verdict. Cancel the Statuspage subscription after one billing cycle of overlap to allow rollback if the migration surfaces any gap.

## API parity matrix

For teams wiring CI / IR automation, this table maps the Statuspage endpoint to its Observer equivalent. See [Create incidents via API](/docs/guides/incidents-via-api) for end-to-end examples.

| Statuspage | Observer | Notes |
|---|---|---|
| `GET /pages/{id}/incidents` | `GET /api/v1/incidents` | Same cursor-paged list shape; Observer adds `state` and `since` filters. |
| `POST /pages/{id}/incidents` | `POST /api/v1/incidents` | Observer adds `affected_services`, `visible_to_customer_ids`, `publish` flag, `initial_message`. |
| `PATCH /pages/{id}/incidents/{id}` | `PATCH /api/v1/incidents/{id}` + `POST /publish` | Statuspage rolls publish + edit into one call; Observer separates them so drafts are explicit. |
| `POST /pages/{id}/incidents/{id}/components` (set state on component) | (no direct equivalent) | Wired automatically when an incident lists `affected_services` containing manual metrics; see [Manual metrics](/docs/concepts/manual-metrics). |
| (no equivalent) | `POST /api/v1/incidents/from-metric/{metricId}` | Pre-fill a draft from a flipped metric. Observer-only. |
| `POST /pages/{id}/incidents/{id}/messages` | `POST /api/v1/incidents/{id}/messages` | Same shape; Observer's `Resolved` message also flips the parent state. |
| `DELETE /pages/{id}/incidents/{id}` | `DELETE /api/v1/incidents/{id}` | Observer is soft-delete (`deleted_at`); Statuspage is hard-delete. |
| `POST /pages/{id}/incidents/{id}/scheduled-maintenances` | `POST /api/v1/maintenances` | Observer auto-transitions `scheduled` → `in_progress` → `completed` on the configured times via cron; Statuspage requires manual start/complete. |
| `GET /pages/{id}/page-access-users` | (none) | Observer's customer-scoped access uses JWT claims; no per-customer API for the user list. |
| `POST /pages/{id}/subscribers` | `POST /status-page/{subdomain}/subscribe` | Public endpoint (no API key required). Confirmation flow is double opt-in. |

## Common questions

**Can both run in parallel during migration?** Yes, and that is the recommended path. The agent reports to Observer; the Statuspage API stays in place until DNS cutover. Subscribers can be on either system during overlap.

**What about historical metric values?** Observer's history starts when the agent first reports. Statuspage does not offer a metric export to backfill, because Statuspage does not store metric values; it stores manual verdicts. The 90-day incident timeline is what migrates.

**How do I keep on-call alerting unchanged?** Recreate the webhook subscription. Most alerting integrations (PagerDuty, Slack, Microsoft Teams) accept generic JSON webhooks with HMAC signatures. The signature scheme is described in [Webhook payload reference](/docs/reference/webhook-payloads#signature-verification).

For accounts with hundreds of components or high-volume subscriber lists, the Observer team can run the migration alongside you. Contact support before starting, and a migration engineer will be assigned.

---
url: https://docs.use.observer/docs/reference/plans-and-quotas
title: Plans and quotas
description: Per-plan limits for resources, retention, and API throughput.
---

Plan limits are enforced at create time. Existing rows over the cap on a downgrade remain readable; new creates are blocked until the plan is upgraded or an existing row is removed.
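For API-driven provisioning, the enforcement point is visible on the create call itself. An illustrative sketch (the metrics-create endpoint and exact response body are assumptions; the documented part is the `quota_exceeded` error code returned on a blocked create):

```bash
# A create over the plan cap is rejected server-side with quota_exceeded.
# Endpoint and body shape are illustrative, not a documented contract.
curl -s -X POST https://use.observer/api/v1/metrics \
  -H "Authorization: Bearer obs_pub_..." \
  -H "Content-Type: application/json" \
  -d '{"title": "one metric over the cap"}'
# Expect an error payload carrying "quota_exceeded" and an upgrade pointer.
```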
## Resource quotas | Capability | Free | Starter | Pro | Enterprise | |---|---|---|---|---| | Status pages | 1 | 1 | 3 | unlimited | | Services | 3 | 10 | 50 | unlimited | | Metrics | 10 | 50 | 500 | unlimited | | SLOs | 0 | 3 | unlimited | unlimited | | Custom domains | 1 | 3 | unlimited | unlimited | | Subscribers per page | 100 | 5,000 | 50,000 | 500,000 | | Customer-scoped pages | 0 | 0 | 25 | unlimited | | Customers | 0 | 0 | 25 | unlimited | | Agents | 1 | 3 | 10 | unlimited | | Webhook endpoints | 0 | 3 | 25 | unlimited | ## Daily caps | Capability | Free | Starter | Pro | Enterprise | |---|---|---|---|---| | Webhook deliveries / day | 0 | 1,000 | 100,000 | unlimited | | Public API requests / day | 0 | 10,000 | 100,000 | unlimited | | Subscriber emails / day | 0 | 1,000 | 100,000 | unlimited | ## Retention | Capability | Free | Starter | Pro | Enterprise | |---|---|---|---|---| | Metric history (days) | 7 | 30 | 90 | 365 | Resources that exceed the new plan's cap are not deleted. They remain visible and editable. The next attempted create on the affected capability returns a `quota_exceeded` error with a pointer to the upgrade path. --- url: https://docs.use.observer/docs/reference/webhook-payloads title: Webhook payload reference description: JSON shapes for every event type Observer emits. --- Every webhook delivery is a POST with a JSON body and the headers: ```text Content-Type: application/json X-Observer-Event: X-Observer-Delivery: X-Observer-Signature: sha256= (when a signing secret is configured) ``` The body is always: ```json { "event_type": "", "event_id": "", "occurred_at": "", "data": { ... } } ``` The `data` field shape varies by event type. The reference below documents each. ## metric.status_changed A metric's status flipped after dwell gating. ```json { "data": { "org_id": "org_...", "metric_id": "", "metric_title": "checkout-api 5xx ratio", "old_status": "healthy", "new_status": "unhealthy", "value": 0.024, "timestamp": "" } } ``` ## metric.no_data A metric entered `no_data`: the agent could not collect a sample. ```json { "data": { "org_id": "org_...", "metric_id": "", "metric_title": "checkout-api 5xx ratio", "reason": "ECONNREFUSED", "timestamp": "" } } ``` ## page.status_changed A status page's rolled-up status flipped. ```json { "data": { "org_id": "org_...", "page_id": "", "page_title": "Acme Cloud", "old_status": "healthy", "new_status": "degraded", "computed_at": "" } } ``` ## slo.burn_started An SLO crossed below its target. ```json { "data": { "org_id": "org_...", "slo_id": "", "slo_name": "checkout-api availability", "service_id": "", "service_name": "checkout-api", "burn_event_id": "", "started_at": "", "error_budget_burned_pct": 12.4, "target_pct": 99.9, "window_days": 30 } } ``` ## slo.burn_resolved An SLO recovered. The matching `burn_event_id` from the prior `slo.burn_started` is included so consumers can pair the two. ```json { "data": { "org_id": "org_...", "slo_id": "", "slo_name": "checkout-api availability", "service_id": "", "service_name": "checkout-api", "burn_event_id": "", "resolved_at": "", "final_budget_remaining_pct": 87.2, "target_pct": 99.9, "window_days": 30 } } ``` ## agent.offline An agent missed its expected heartbeat window. ```json { "data": { "org_id": "org_...", "agent_id": "", "agent_name": "agent-eu-west-1", "last_heartbeat_at": "", "version": "1.2.3" } } ``` ## incident.created A new incident row was created. Fires for both drafts and published incidents on insert. 
```json { "data": { "org_id": "org_...", "incident_id": "", "title": "...", "severity": "major", "state": "draft", "is_customer_scoped": false, "affected_service_ids": [""], "affected_service_names": ["checkout-api"] } } ``` ## incident.published A draft incident was published. The incident is now visible on the public page. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "...", "severity": "major", "state": "published", "published_at": "", "is_customer_scoped": false, "affected_service_ids": [""], "affected_service_names": ["checkout-api"] } } ``` ## incident.updated Title, severity, affected services, or visibility changed. ```json { "data": { "org_id": "org_...", "incident_id": "", "changed_fields": ["title", "severity"] } } ``` ## incident.message_added A new message was appended to an incident timeline. ```json { "data": { "org_id": "org_...", "incident_id": "", "message_id": "", "message_type": "Identified", "description": "...", "occurred_at": "" } } ``` ## incident.resolved `resolved_at` was set on the incident. Posting a Resolved message also fires this event because the appendMessage path auto-flips the parent state. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "...", "severity": "major", "resolved_at": "" } } ``` ## incident.deleted An incident was soft-deleted via DELETE. ```json { "data": { "org_id": "org_...", "incident_id": "", "deleted_at": "" } } ``` ## incident.auto_drafted The auto-incident worker created a DRAFT incident from an unhealthy metric flip. The draft is not visible to customers until it is published via the email CTA, the console, or the API. ```json { "data": { "org_id": "org_...", "incident_id": "", "metric_id": "", "metric_title": "checkout-api 5xx ratio", "severity": "major", "trigger_reason": "checkout-api 5xx ratio read 0.04 against threshold 0.02 at ", "value": 0.04, "threshold": 0.02, "affected_service_ids": [""], "url": "https://use.observer/console//updates/edit/" } } ``` ## incident.auto_published An auto-drafted incident was published via the signed-token email link. Equivalent to `incident.published` but distinguished so subscribers can listen specifically for the auto-publish flow. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "Investigating elevated errors on checkout-api 5xx ratio", "severity": "major", "state": "published", "published_at": "", "affected_service_ids": [""], "url": "https://use.observer/console//updates/edit/" } } ``` ## incident.auto_dismissed An auto-drafted incident was dismissed via the signed-token email link, OR auto-expired after 24h with no action. `reason` is `operator_dismiss` or `auto_expired`. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "Investigating elevated errors on checkout-api 5xx ratio", "dismissed_at": "", "reason": "auto_expired" } } ``` ## maintenance.scheduled A maintenance window was created. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "scheduled_start_at": "", "scheduled_end_at": "", "affected_service_ids": [""], "affected_service_names": ["checkout-api"] } } ``` ## maintenance.starting_soon Cron fires this once per maintenance row when `scheduled_start_at` is within the next hour. Idempotent via `maintenance_starting_soon_fired_at`. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "scheduled_start_at": "" } } ``` ## maintenance.started `actual_start_at` was set (manual API call or cron auto-transition). 
```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "actual_start_at": "", "scheduled_end_at": "" } } ``` ## maintenance.completed `actual_end_at` was set. Posting a Resolved message on a maintenance also fires this event because the appendMessage path flips the parent state. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "actual_start_at": "", "actual_end_at": "" } } ``` ## maintenance.canceled `canceled_at` was set before completion. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "canceled_at": "" } } ``` ## Signature verification When a signing secret is configured, every delivery carries: ```text X-Observer-Signature: sha256= ``` `hex` is `HMAC-SHA-256(secret, raw_body)`. Recompute on the receiving side and compare in constant time. Reject deliveries whose signatures do not match. Use `event_id` (also delivered as `X-Observer-Delivery`) as the idempotency key when persisting events. Retried deliveries reuse the same id, so a unique-key check prevents double-processing. --- url: https://docs.use.observer/docs/reference/audit-log-events title: Audit log events description: Categories of administrative events recorded in the audit log. --- Every administrative change in the console writes to an append-only audit log scoped to the organisation. Events are grouped into categories for filtering and retention. ## Categories | Category | Surface | |---|---| | `agent` | Agent create / rename / rotate-key / delete. | | `page` | Status page create / edit / theme change / access-mode change / delete. | | `webhook` | Webhook subscription create / edit / pause / delete; delivery retry / discard. | | `metric` | Metric definition create / edit / threshold change / delete. Manual metric status writes (`metric.status.set_manually`) also fall here. | | `slo` | SLO create / target change / window change / delete; burn open / resolve. | | `customer` | Customer create / edit / page binding change / SLO override / delete. | | `incident` | Incident create / publish / update / resolve / delete; message append. | | `maintenance` | Maintenance schedule / start / complete / cancel; starting-soon cron event. | | `subscriber` | Subscriber confirm / unsubscribe; per-event delivery audit lives in `subscriber_deliveries`, not `audit_log`. | | `org` | Organisation create / rename / member add / member remove. | | `auth` | User sign-in, sign-out, MFA enrol, password change. | | `subscription` | Plan change, payment method update, invoice generated. | | `billing` | Payment provider events (charge succeeded, refund, dispute, etc.). | | `api_key` | Org API key create / revoke. | | `other` | Any event whose action prefix does not match the categories above. | ## Event shape Each row carries: - `id`: opaque identifier. - `org_id`: the organisation the event scopes to. - `actor`: the user or system that performed the action. - `action`: dotted action string (for example `agent.created`, `slo.target_changed`). - `target_type` and `target_id`: the resource the action affected. - `metadata`: action-specific JSON payload. - `created_at`: timestamp. ## Filtering The audit log page in the console supports filtering by: - Time range. - Category. - Actor (user identifier). - Action (full dotted string). - Target id. ## Retention Audit log retention follows the same window as metric history (see [Plans and quotas](/docs/reference/plans-and-quotas)). 
Older entries are not deleted automatically; export them on the schedule your compliance team requires. --- url: https://docs.use.observer/docs/reference/threshold-operators title: Threshold operators description: How healthy / degraded / unhealthy is decided from a metric value. --- A metric's status on every push follows a strict rule applied to the value the agent reported. ## Rule Each metric carries two operator-and-value pairs: - `healthy_operation` and `healthy_value` - `unhealthy_operation` and `unhealthy_value` Operators are: `over`, `under`, `equal`. The agent computes status as: 1. If the value matches the healthy condition, status is `healthy`. 2. Else, if the value matches the unhealthy condition, status is `unhealthy`. 3. Otherwise, status is `degraded`. ## Strict comparisons Operators are strict everywhere. A value exactly equal to a threshold under `over` or `under` does not match. | Operator | Match condition | |---|---| | `over` | `value > threshold` (not `>=`) | | `under` | `value < threshold` (not `<=`) | | `equal` | `value == threshold` | The same comparison rule is applied both in the agent and in the read path the cloud uses to render status pages. A non-strict comparison in one location and a strict comparison in the other would cause the same value to flip status depending on the read surface. Strict everywhere keeps the metric's status consistent. ## Examples ### 5xx error ratio - `healthy_operation: under`, `healthy_value: 0.005` - `unhealthy_operation: over`, `unhealthy_value: 0.02` Reading: healthy under 0.5%; unhealthy over 2%; anything else is degraded. A value of exactly `0.005` is degraded (not healthy) because `under` is strict. ### TLS certificate expiry - `healthy_operation: over`, `healthy_value: 30` - `unhealthy_operation: under`, `unhealthy_value: 7` Reading: healthy when more than 30 days remain; unhealthy when fewer than 7 days remain; degraded in between (7 to 30 days). ### Queue depth - `healthy_operation: under`, `healthy_value: 100` - `unhealthy_operation: over`, `unhealthy_value: 1000` Reading: healthy under 100 messages; unhealthy over 1000; degraded in between. ## No-data and unknown `no_data` and `unknown` are not part of the operator rule. They arise from the agent's collection layer: - `no_data`: the agent attempted a probe but could not produce a value (timeout, connection refused, query returned empty). The cloud records the reason code alongside the status. - `unknown`: no recent push has arrived for the metric within the expected interval. ## Stale data A metric is `stale` when its last push timestamp is older than three times its push interval, capped at 15 minutes. Stale and `no_data` look identical in the database but mean different things: - `no_data`: the agent ran the probe and the probe failed to return a value. This is a real signal about the customer's service. It counts in the SLO and surfaces in status rollups. - `stale`: the agent has not pushed anything recently. The cause is on the monitoring side (cloud outage, agent crash, network partition between agent and cloud), not the customer's side. Stale metrics are excluded from the live status rollup, do not burn SLO budget, and do not fire `metric.status_changed` or `metric.no_data` webhooks. When every metric on a service is stale, the service rolls up to `monitoring_delayed` rather than `unhealthy`. See [Observer availability](/docs/concepts/observer-availability) for the contract that protects customer status pages from Observer's own outages. 
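To make the evaluation order and the strict comparisons concrete, a small shell sketch using the 5xx-ratio example above (illustrative only; the agent applies this rule natively):

```bash
# healthy: under 0.005, unhealthy: over 0.02, else degraded.
evaluate() {
  local value=$1
  if   awk "BEGIN { exit !($value < 0.005) }"; then echo healthy
  elif awk "BEGIN { exit !($value > 0.02)  }"; then echo unhealthy
  else echo degraded
  fi
}
evaluate 0.004   # healthy
evaluate 0.005   # degraded: strict, 0.005 is not < 0.005
evaluate 0.030   # unhealthy
```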
--- url: https://docs.use.observer/docs/reference/incident-lifecycle title: Incident and maintenance lifecycle description: States, transitions, and the events fired on each. --- Observer treats incidents and maintenances as instances of the same underlying row (the `updates` table). The lifecycle state is derived from a grid of timestamp columns rather than an explicit state column; this keeps the model coherent regardless of whether the transition happened via API, server action, or cron. ## States | State | Trigger | |---|---| | `draft` | Row exists but `published_at IS NULL`. Invisible to the public. | | `published` | `published_at` set. Renders on the public page. | | `resolved` | `resolved_at` set. Final lifecycle for an incident. | | `scheduled` | Maintenance with `scheduled_start_at` set, `actual_start_at NULL`. | | `in_progress` | Maintenance with `actual_start_at` set, `actual_end_at NULL`. | | `completed` | Maintenance with `actual_end_at` set. | | `canceled` | Either type with `canceled_at` set. | | `deleted` | Soft-delete via `deleted_at`. Permanent; not displayed anywhere. | ## Transitions ```text draft │ POST /publish ▼ published ──── POST /resolve ──▶ resolved │ │ DELETE ▼ deleted (maintenance only) scheduled ── cron @ scheduled_start_at ──▶ in_progress │ cron @ scheduled_end_at ▼ completed any state ── POST /cancel ──▶ canceled ``` ## Webhook events Each transition fires a webhook event. See [Webhook payload reference](/docs/reference/webhook-payloads) for exact body shapes. | Event | Fires when | |---|---| | `incident.created` | Row inserted (regardless of draft / publish). | | `incident.published` | `published_at` set. | | `incident.updated` | Title, severity, or affected services patched. | | `incident.message_added` | Message appended to timeline. | | `incident.resolved` | `resolved_at` set. | | `incident.deleted` | `deleted_at` set. | | `maintenance.scheduled` | Row inserted with `scheduled_start_at`. | | `maintenance.starting_soon` | Cron fires within 1h of `scheduled_start_at`. Once per row. | | `maintenance.started` | `actual_start_at` set (manual or cron). | | `maintenance.completed` | `actual_end_at` set. | | `maintenance.canceled` | `canceled_at` set. | ## Auto-message side effects Some lifecycle transitions append a system message to the timeline: - `maintenance.started` (cron or API) appends an Information message: "Maintenance started." - `maintenance.completed` (cron or API) appends a Resolved message: "Maintenance completed." - `maintenance.canceled` (API) appends an Information message: "Maintenance canceled." These are visible on the public page exactly like operator-authored messages. They exist so the timeline reflects every state change without requiring the operator to remember. Lifecycle transitions reject double-application: `POST /publish` on an already-published incident returns 409 (`already_published`). Same for resolve, start, complete, cancel. Soft-delete returns 200 (idempotent). --- url: https://docs.use.observer/docs/reference/subscriber-events title: Subscriber notification events description: Which incident and maintenance transitions trigger subscriber emails. --- The notification worker reads from `pgmq.notification_outbox` and fans out to confirmed subscribers on the affected page. The dispatch matrix below lists which event types trigger subscriber email and which do not. ## Trigger matrix | Event | Triggers subscriber email | |---|---| | `incident.created` | No (draft state). | | `incident.published` | Yes. 
| `incident.updated` | No (avoids notification spam on minor edits). |
| `incident.message_added` | Yes. |
| `incident.resolved` | Yes. |
| `incident.deleted` | No. |
| `maintenance.scheduled` | Yes. |
| `maintenance.starting_soon` | Yes (1h pre-warn). |
| `maintenance.started` | No (the starting_soon mail covered it). |
| `maintenance.completed` | Yes. |
| `maintenance.canceled` | Yes. |

## Filter scopes

When a subscriber has rows in `subscriber_filters`, the dispatch only fires when at least one of the incident's affected services or metrics intersects the filter list. Subscribers with no filters receive every relevant event for the page.

## Customer-scoped incidents

Incidents with rows in `update_customer_visibility` are scoped: only subscribers tied to one of the listed customers receive notifications. The customer-binding model on subscribers is still under design; today, customer-scoped incidents skip subscriber dispatch entirely.

The webhook layer is unaffected — outbound webhook subscribers always receive every event their subscription opted into, regardless of customer scoping.

## Per-attempt audit

Each delivery attempt writes one row to `subscriber_deliveries`:

```text
id             uuid
subscriber_id  uuid
event_type     text
event_id       uuid
status         text ('ok', 'error', 'skipped')
status_code    integer (Resend response, when applicable)
error          text (truncated body on error)
attempted_at   timestamptz
```

The console **Subscribers** view exposes the most-recent attempt per subscriber for triage.

---
url: https://docs.use.observer/docs/reference/feed
title: RSS / Atom feed reference
description: Public feed shape, caching headers, and exclusion rules.
---

Every public status page exposes both Atom and RSS feeds:

```text
GET https://<page-host>/feed.atom
GET https://<page-host>/feed.rss
```

`<page-host>` is the page's `*.use.observer` subdomain or its custom domain.

## Content

One entry per incident message + one per maintenance lifecycle event (scheduled / started / completed / canceled). The granularity matches what a feed reader expects: each update on a single incident is a separate item, sorted newest-first.

## Headers

```text
Content-Type: application/atom+xml; charset=utf-8 (or application/rss+xml)
Cache-Control: public, max-age=60
ETag: "obs-<entry-count>-<latest-change>"
```

The ETag is computed from the count of feed-eligible entries and the maximum of (`published_at`, `resolved_at`, `actual_start_at`, `actual_end_at`, `canceled_at`, message dates). A repeat fetch with matching `If-None-Match` returns `304 Not Modified` with no body.

## Exclusions

The feed excludes:

- Customer-scoped incidents (rows in `update_customer_visibility` are unconditionally hidden, even when a customer JWT is present — feeds have no auth).
- Drafts (`published_at IS NULL` and not a maintenance with `scheduled_start_at`).
- Soft-deleted rows.

## Discoverability

Status pages emit a `<link rel="alternate">` tag in the page `<head>` so most feed readers auto-detect the URL.

## Limit

Default 50 entries. Override with the `?limit=` query parameter (for example `?limit=200`). The route caps at the hard limit set on the underlying query (200 today).

---
url: https://docs.use.observer/docs/troubleshooting/page-renders-blank
title: Status page renders blank
description: Diagnose a public status page that returns 200 but shows no content blocks.
---

A status page that resolves to the right host but renders no content is almost always one of three problems: the page exists but has no blocks added, the metrics on the page have not yet reported, or the page's access mode is gating the visitor.

## Step 1: confirm the page exists

In the console, open **Pages** and verify the subdomain matches the URL the visitor reaches.
The subdomain field is unique per organisation; a typo on save produces a different URL than expected. The values `admin` and `blog` are reserved and are not valid status page subdomains.

## Step 2: confirm content blocks are present

Open the page in the builder. A page with the title and theme set but no blocks added renders an empty body. Drag a **Metrics** block onto the canvas, select at least one metric, and save.

A common variation: blocks were added but never saved. The builder's draft state is local until **Save** commits it.

## Step 3: confirm the metrics are reporting

If the page has metric blocks but the visitor sees no values, open each metric in the console and check the **Latest** column. If the metric has not received a value, the agent has not yet reported. Walk [Metric shows no data](/docs/troubleshooting/metric-shows-no-data).

## Step 4: confirm access mode

Under the page's **Access** tab, the access mode determines who can see the page:

| Mode | Who sees content |
|---|---|
| Public | Anyone with the URL. |
| Password | Visitors with the page's shared password. |
| IP allowlist | Visitors from configured IP ranges. |
| Customer-scoped (JWT) | Visitors with a valid JWT bound to a customer. |

A page rendering blank for the operator while logged into the console, but rendering content in an incognito window, often points at logged-in/logged-out cookie state. Open the page in a new private window to check.

## Step 5: check the browser console

A specific failure mode: a page whose custom CSS hides body content. Open the browser developer tools' **Network** tab and confirm the document body returns 200 with markup. Check the **Console** tab for hydration errors. If custom CSS is the cause, edit the page's **CSS** tab and remove the offending rules.

---
url: https://docs.use.observer/docs/troubleshooting/metric-shows-no-data
title: Metric shows no data
description: Diagnose a metric that displays no current value or status in the console.
---

A metric with no recent value is almost always one of three problems: no agent is assigned to the metric, the assigned agent is not running, or the agent is running but the probe itself returns an error.

## Step 1: confirm an agent is assigned

In the console, open **Metrics**, then the metric in question. The detail page shows the assigned agent. If the field is empty, the metric is defined but no agent is collecting it. Set the **Agent** field, save, and wait one push interval (default one minute). The agent refreshes its assignment list from the cloud's metric-definitions endpoint every five minutes; restart the agent to pull the updated list immediately.

## Step 2: confirm the agent is running

Open **Agents** in the console and verify the assigned agent shows status **running**. If it shows **stopped**, walk the [stalled agent diagnosis](/agent/guides/diagnose-stalled-agent).

## Step 3: confirm the probe is succeeding

If the metric reports `no_data` rather than no value at all, the agent ran the probe and the probe failed. The metric's detail page shows the latest `reason` string.

| `reason` substring | Probable cause |
|---|---|
| `ECONNREFUSED` | The target's port is closed or the host is unreachable. Verify network reachability from the agent's host. |
| `ENOTFOUND` | DNS resolution failed. Check `PROMETHEUS_SERVER_URL` or the probe's target hostname. |
| `ETIMEDOUT` | Target is reachable but did not respond within the configured timeout. |
| `HTTP 401` / `HTTP 403` | Authentication or authorization failed against the probe target. |
| `prometheus query empty` | The PromQL returned no series. The series name or label match probably does not exist. |

For Prometheus probes, run the query directly against the Prometheus server (the same URL the agent uses) and confirm it returns a single scalar.

## Step 4: confirm the threshold rule is correct

A metric that reports values but never reaches `healthy` (everything lands in `degraded`) typically has thresholds that do not cover the value range. Open the metric and verify:

- The healthy pair (`healthy_operation` / `healthy_value`) defines a band the value can actually reach.
- The unhealthy pair (`unhealthy_operation` / `unhealthy_value`) defines the failure band.
- Comparison operators are strict (`over` is `>`, not `>=`; `under` is `<`, not `<=`). A value exactly on a boundary does not match that band.

## Step 5: confirm the dashboard view

If the metric reports values in the console's metric detail page but a status page shows no data, verify:

- The metric is on the page (open the page builder).
- The metric's `is_public` flag is set (visible on the metric edit page).

Each probe type has a dedicated configuration guide: [Prometheus](/agent/guides/prometheus-source), [HTTP](/agent/guides/http-probes), [TCP](/agent/guides/tcp-probes), [DNS](/agent/guides/dns-probes), [TLS certificate](/agent/guides/tls-cert-probes). Each one covers the probe-specific failure modes in detail.

---
url: https://docs.use.observer/docs/troubleshooting/webhook-deliveries-failing
title: Webhook deliveries failing
description: Diagnose a webhook subscription whose deliveries do not reach the receiver, or whose receiver rejects them.
---

A failing webhook subscription presents in one of three ways: the delivery log shows non-2xx responses from the receiver, the log shows network errors before the receiver was reached, or the log is empty when the operator expected events.

## Step 1: read the delivery log

Open **Webhooks**, the subscription in question, then **Delivery log**. Each entry shows:

- `event_type` and `event_id`.
- `attempted_at`.
- `response_status` (or a network-level error string).
- `response_body` (truncated).

If the log is empty, the events the subscription is bound to have not fired since the subscription was created. Trigger a test event by changing a metric's threshold to flip status, or wait for an organic event.

## Step 2: non-2xx from the receiver

| Status | Probable cause |
|---|---|
| `400` | The receiver expects a different payload schema. Compare against [Webhook payload reference](/docs/reference/webhook-payloads). |
| `401` / `403` | Authentication required. Receivers like generic Slack apps or HMAC-protected endpoints require headers Observer does not set by default. |
| `404` | URL is wrong. Re-paste from the receiver's documentation. |
| `429` | Receiver is rate-limiting. Reduce subscription scope or contact the receiver's vendor. |
| `5xx` | Receiver is failing. The delivery worker retries with exponential backoff up to a fixed cap; deliveries are then moved to the dead-letter view. |

The delivery worker retries non-2xx responses. If retries exhaust, the entry moves to the **Dead letter** view; manual replay is available there.

## Step 3: network errors

If `response_status` is missing and the log shows a network-level error:

| Error substring | Probable cause |
|---|---|
| `ECONNREFUSED` | Receiver host is not listening on the configured port. |
| `ENOTFOUND` | DNS resolution failed. Verify the URL hostname. |
| `ETIMEDOUT` | Receiver did not respond within the request timeout. |
| | `CERT_HAS_EXPIRED` / `UNABLE_TO_VERIFY_LEAF_SIGNATURE` | The receiver's TLS certificate is expired or untrusted by the public CA bundle. Renew the certificate; Observer does not accept self-signed certificates against a public endpoint. | ## Step 4: signature verification on the receiver If the receiver computes a signature mismatch but the URL and secret are correct: - Confirm the secret is the value Observer's webhook subscription page shows, not a copy with whitespace. - Confirm the receiver computes `HMAC-SHA-256(secret, raw_body)` and compares against the `X-Observer-Signature` header value (after the `sha256=` prefix). - Confirm the receiver hashes the raw request body, not a re-serialised JSON. Some web frameworks parse and re-serialise request bodies on the way to the handler; the recomputed signature does not match. ## Step 5: subscription is disabled A subscription whose **enabled** flag is off does not deliver events. The subscription edit page exposes the toggle. If a subscription was disabled while debugging, re-enable it to resume deliveries; new events fire from the moment it is re-enabled, not backfilled. Webhook delivery actions write audit log rows under the `webhook` category. Audit log rows carry the failing receiver URL and the response status, which helps when correlating across multiple subscriptions. --- url: https://docs.use.observer/docs/troubleshooting/sso-not-working title: SSO not working description: Diagnose JWT-based access on customer-scoped pages and authentication issues for the console. --- Two distinct authentication paths exist in Observer. The **console** uses Observer's hosted authentication for operators. **Customer-scoped pages** verify JWTs that customers' identity providers issue. The two paths fail for different reasons; this page covers both. ## Customer-scoped pages: JWT verification fails A customer reaches a customer-scoped page and is denied access even with what they believe is a valid token. Walk: ### Step 1: confirm the page is in customer-scoped mode Open the page's **Access** tab. The mode must be **Customer-scoped (JWT)**. If it is set to anything else, the JWT header is ignored. ### Step 2: confirm the issuer keys match The page's access config holds either a static public key or a JWKS endpoint. The token must be signed by a key the page can verify. - Static keys: confirm the key the customer's IdP is using matches the value pasted into the access config. Re-paste from the IdP on a fresh copy. - JWKS endpoint: confirm the URL is reachable from Observer Cloud and returns valid JWKS JSON. Cache invalidation can cause stale keys; the configured cache TTL determines refresh frequency. ### Step 3: confirm the claim mapping The access config specifies which JWT claim resolves to a customer (typically `sub`, `customer_id`, or a custom claim). The token must carry that claim, and the value must match a customer in the organisation's list AND that customer must be bound to this page. A token whose claim does not resolve to a bound customer returns 403 even when the signature is valid. Open **Customers** and verify both: 1. A customer record exists with the value the JWT carries. 2. That customer is on the page's customer-binding list. ### Step 4: confirm the token has not expired The `exp` claim in the JWT is enforced. Tokens past expiry are rejected. The customer's IdP integration is responsible for issuing fresh tokens. 
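When pasting a customer's token into a third-party site is undesirable, the claims can be inspected locally instead. A minimal sketch, assuming Bun or Node 16+ (decode only; this does not verify the signature, so treat the output as the customer's unverified claims):

```ts
// Decode a JWT's payload segment for triage. Inspection only; no signature check.
function decodeJwtClaims(token: string): Record<string, unknown> {
  const segments = token.split(".");
  if (segments.length !== 3) throw new Error("not a JWT: expected three segments");
  return JSON.parse(Buffer.from(segments[1], "base64url").toString("utf8"));
}

const claims = decodeJwtClaims(process.argv[2] ?? "");
const exp = new Date(Number(claims.exp) * 1000);
console.log({
  sub: claims.sub, // or whichever claim the page's access config maps to a customer
  aud: claims.aud,
  iss: claims.iss,
  exp: exp.toISOString(),
  expired: exp.getTime() < Date.now(), // past-expiry tokens are rejected
});
```

The same decode surfaces the `aud` and `iss` values checked in the next step.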
### Step 5: confirm the audience and issuer If the access config sets `audience` or `issuer` constraints, the token must carry matching `aud` and `iss` claims. A token issued for a different audience returns 403. The fastest verification is to decode the customer's failing JWT at [jwt.io](https://jwt.io), inspect the claims, and compare against the page's access config field-by-field. The JWT itself is the signed source of truth for what the customer claims. ## Console SSO: operators cannot sign in Observer Cloud's console authentication is hosted. If an operator cannot sign in: ### Step 1: confirm the email and organisation The operator must be a member of the organisation. Sign-in fails silently in some browsers if the email matches no account or if the account exists but has no organisation membership. The organisation owner can re-invite the operator from **Settings** > **Members** > **Invite**. The invitation arrives by email; following the link binds the account to the organisation. ### Step 2: confirm the email provider is reachable If invitations do not arrive: 1. Check spam folders. First-time invitations from a new domain often filter aggressively. 2. Confirm the recipient's email provider accepts mail from Observer's sender domain. Corporate gateways occasionally block transactional senders by default. 3. Resend the invitation; each invitation has a distinct confirmation token. ### Step 3: confirm MFA is configured If the operator authenticates but is rejected at the MFA step: 1. Confirm the MFA enrolment is bound to the same account they are signing in to. 2. If a recovery code was used, advise the operator to re-enrol their second factor immediately; the recovery code only grants one-time access. ### Step 4: SAML or social provider integrations Observer's hosted authentication supports email + password and a configurable set of social providers. Provider availability is determined by the cloud's configuration; if the social button the operator expects is missing, the provider is not enabled on this deployment. Contact the cloud operator. If every administrator on an organisation has lost access (for example, MFA hardware reset across a team), contact Observer support with proof of organisation ownership. Recovery is a manual process. --- url: https://docs.use.observer/docs title: Documentation description: Reference and guidance for Observer, the metrics-driven status page platform. --- This section covers product setup, day-two operations, and concepts. The two adjacent tabs are scoped narrower: the [Observer Agent](/agent) tab covers the on-premise data plane, and the [API](/api) tab is the generated REST reference. ## Quickstart Pick the metric path that matches your environment, then proceed to SLO and status page. The two metric quickstarts are alternatives, not sequential; complete one of them. ## Guides, reference, and concepts The remaining sections appear in the sidebar as content is published: - **Guides** cover task-shaped configuration: customer-scoped pages, password protection, custom domains, outbound webhooks, theme customization, and migration from other status page tools. - **Reference** covers plan limits, webhook payload shapes, audit log event names, and customization options. - **Concepts** covers the operating model: metric-based status, service level objectives, customer scopes, threshold semantics. 
--- url: https://docs.use.observer/agent/concepts/agent-cloud-boundary title: Agent and cloud boundary description: What crosses the network and what does not. --- The Observer Agent is the only component that runs inside your network. The cloud sits across an HTTPS boundary and never reaches back into your network. This page is the explicit description of what crosses that boundary and what stays put. ## What the agent sends to the cloud ```text POST /api/agent/heartbeat every ~30 seconds. Self-state report (queue depth, uptime, active source types). See the heartbeat payload reference. GET /api/agent/metrics-definitions every 5 minutes. Pull of the metric definitions assigned to this agent. The response is the canonical list the agent schedules against. POST /api/agent/receiver per status push. The body is one row: { metric_id, value, status, timestamp, reason? } POST /api/agent/log (optional) only when BROADCAST_LOGS=true. Forwards a subset of agent log lines for surfacing on the agent detail page. PromQL query strings are always redacted to a SHA-256 prefix and length. ``` That is the entire surface. There are no other outbound calls. ## What the agent does not send - Raw PromQL query strings. - Raw HTTP request bodies, response bodies, or response headers beyond what the probe required. - DNS resolver responses beyond a substring match against `expected_value` if configured. - TLS certificate chains. Only `days_until_expiry` and a few metadata fields (subject CN, issuer CN, valid_to) are sent. - Any metric series outside the explicit metric definitions. ## What the cloud sends to the agent Only the response to `GET /api/agent/metrics-definitions`. The response shape is the projection in the public `@observer/protocol` package's `MetricDefinition` type. The cloud has no path back into your network. It cannot pull from your Prometheus, hit your endpoints, or query your DNS. Every probe runs from the agent's vantage point. Several Observer customers run in environments where outbound HTTPS is the only allowed network path (PCI scope, regulated banking, defence). The agent is designed for that constraint: one outbound HTTPS connection to the cloud, no inbound connections, no other egress. ## Trust assumptions - The agent trusts the cloud's TLS certificate by default. Set `SKIP_SSL_VERIFICATION=true` in development only. - The cloud trusts the agent only after the agent presents a valid `AGENT_KEY`. Keys are bound to a single agent identity and a single organisation. - A compromised agent key affects only that agent's pushes. The cloud restricts each request to the agent's own organisation; a stolen key cannot read or write across tenants. --- url: https://docs.use.observer/agent/concepts/probes-vs-scraping title: Probes vs scraping description: Why the agent runs probes from inside your network instead of having the cloud scrape endpoints. --- The classical observability pattern (Prometheus, Datadog, Grafana Cloud) is centralised scraping: a central system reaches out to your endpoints on a schedule and pulls metrics. Observer takes the inverse position: the agent runs in your network and pushes outcomes to the cloud. This page covers why. ## The constraints scraping puts on you Scraping requires the central system to reach every endpoint it measures. In practice this means at least one of: 1. Public exposure of internal endpoints, sometimes with a reverse proxy or TLS-terminating load balancer purely to accept the scrape. 2. 
A VPN or peering connection from the central system back into your network. 3. A scraping agent inside your network that the central system pulls from (essentially shifting the same problem one hop). Each option grows the network attack surface and the legal review surface. For organisations in regulated environments (PCI, HIPAA, defence, finance), opening any inbound path is a months-long compliance exercise. ## The probe model Observer's agent runs inside your network, hits its targets locally, and pushes the verdict to the cloud. No IP allowlist at your edge, no TLS-terminating proxy in front of internal endpoints, no reverse VPN. The exact request surface (which endpoints, which payloads, what stays put) is enumerated in [Agent and cloud boundary](/agent/concepts/agent-cloud-boundary). The trade-off: collection happens in your network, so collection runs on your hardware. The agent is small (single process, roughly 40MB image, roughly 64MB RSS at idle) and runs anywhere a container can run. ## When scraping is still preferable When the targets are themselves SaaS systems with public scrape endpoints (a third-party API, a public DNS server, a hosted queue), the central system would have a clean path. Observer still uses the agent for these cases for one reason: a single configuration surface. Mixing "some metrics scraped centrally, some pushed from an agent" doubles the operator's mental model without an offsetting benefit. The agent runs the probe from wherever it sits and reports the result. --- url: https://docs.use.observer/agent/concepts/local-queue title: The local queue description: Why the agent buffers status pushes locally, and how the buffer behaves under cloud unreachability. --- Status pushes are written to a local SQLite file before the agent attempts to deliver them to the cloud. The file is the agent's durability layer. Pushes survive container restarts, daemon restarts, and the cloud being temporarily unreachable. ## Behaviour - Every status push enqueues a row in the local SQLite file (`BUFFER_PATH`, default `./observer-agent-buffer.db`). - A background drain controller pulls batches from the queue and posts them to the cloud's `/api/agent/receiver` endpoint. - Successful posts ack and remove rows from the queue. - Failed posts back off exponentially. The queue continues to accept new pushes during the outage. - When the queue reaches `BUFFER_MAX_ROWS` (default `10000`), oldest entries are evicted to admit new ones. The cloud is the source of truth for historical data. The local queue is a write-ahead log that protects against transient cloud failures, not a long-term store. ## What the operator sees The agent dashboard's queue panel shows three live numbers: - `depth`: rows currently waiting. - `oldest_age_seconds`: age of the oldest pending row. - `drain_backoff_ms`: current backoff between drain attempts. A growing depth combined with a non-zero backoff is the signature of cloud unreachability. Once the cloud is reachable again, the queue drains and depth returns to near zero. ## Cloud-side signals The cloud's heartbeat receiver computes two derived signals from the queue numbers in every heartbeat: - `agent.lag_high`: opens when `queue_depth > 1000` or `queue_oldest_age_seconds > 300`. Surfaces in the agent detail page and as a webhook event when subscribed. - `agent.uptime_degraded`: opens when 24-hour uptime falls below 95%. Both signals clear with 60-second hysteresis to avoid flapping. The queue is a process-local file. 
Running two agent processes with the same `AGENT_KEY` splits the queue between them and confuses the cloud's per-agent uptime computation. Run a single replica per agent identity. --- url: https://docs.use.observer/agent/concepts/bun-distroless-design title: Bun and distroless: design choices description: Why the agent runs on Bun and ships in a distroless image. --- The agent's runtime choices are deliberate. They shape the image size, the operator surface, and the security posture in production. ## Bun The agent runs on Bun rather than Node.js. The decisions Bun makes for us: - **Native TypeScript execution.** Source files run directly, with no transpile step in the build pipeline. - **Embedded SQLite.** The `bun:sqlite` module replaces `better-sqlite3`. Loses the build dependency on `python`, `make`, and `g++`. The container image shrinks accordingly. - **Native fetch.** `axios` is gone. One fewer dependency, one fewer attack surface. - **Native scheduling.** `setInterval` is enough; `node-cron` is not in the dependency graph. - **Automatic `.env` loading.** `dotenv` is gone for the agent's needs. The trade-off is that Bun is younger than Node and the ecosystem's edge cases sometimes show up. The agent's dependencies are deliberately narrow to limit exposure. ## Distroless The runtime image is `oven/bun:1-distroless`. The trade-offs: - **No shell.** `sh`, `bash`, `busybox`, `curl`, `wget`, and every other utility are absent. An attacker who reaches the container has no shell to drop into. - **No package manager.** No `apk`, `apt`, or anything that can install code at runtime. - **Smaller surface.** The image contains the Bun binary, libc, and the agent's source. Nothing else. Operational consequence: do not `kubectl exec -it` into the container expecting a shell. Diagnose through the dashboard, through logs, and through restart-and-observe. Distroless containers cannot run a shell-based health check (`HEALTHCHECK` instruction with curl + sh). The agent relies on liveness derived from container exit and the cloud's heartbeat-based agent.offline detection rather than a container-local health probe. ## Single-file source The runtime entry is `src/index.ts`. Surrounding modules (`buffer.ts`, `drain.ts`, `dashboard.ts`, `status.ts`, `sources/*`) are ESM imports. There is no bundler step before shipping. The image carries the source verbatim, runs it under Bun, and that is the entire chain from `git clone` to running process. ## Image size The runtime image is roughly 40MB on `linux/amd64`. The Bun distroless base is most of that; the agent's own code adds a few hundred kilobytes. Pull time on a fresh node is dominated by the base layer; tag-based caching makes subsequent pulls nearly instant. ## Standalone binary `bun build --compile` produces a single-file executable per platform. The binary embeds the Bun runtime and every dependency, so a release-binary install reduces to "download, chmod, run" with no runtime install on the host. Per-tag CI publishes five binaries to [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases): `linux-x64`, `linux-arm64`, `darwin-x64`, `darwin-arm64`, and `windows-x64.exe`, plus a `SHA256SUMS` file for verification. The binary path is the lightest install but offers fewer guardrails than the container path. Use the container image when you want isolation (separate user namespace), a single upgrade mechanism shared with other services, or tag-based version pinning at the container-runtime layer. 
Use the binary on constrained or air-gapped hosts that should not run Docker at all. Binaries are produced from the same source as the container — the build entry is `src/index.ts` either way. Runtime flags can be forwarded to the embedded Bun via `BUN_OPTIONS`; see [bun.com/docs/bundler/executables](https://bun.com/docs/bundler/executables#runtime-arguments-via-bun_options). --- url: https://docs.use.observer/agent/quickstart/install-binary title: Install from a release binary description: Download a single-file executable from GitHub Releases. No runtime install required on the host. --- The agent is published as a single-file binary per platform. Each binary embeds the Bun runtime + every dependency, so the install collapses to "download, chmod, run". Use this path on minimal hosts that cannot or should not run Docker. ## Prerequisites - An agent key from the Observer console (**Agents** > **New agent**). - A reachable Prometheus URL (only required for Prometheus probes). ## Steps ### Pick the binary for your platform Releases live at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases). Each release publishes five binaries plus a `SHA256SUMS` file. | Platform | File | |-----------------|---------------------------------------| | Linux x64 | `observer-agent-linux-x64` | | Linux arm64 | `observer-agent-linux-arm64` | | macOS x64 | `observer-agent-darwin-x64` | | macOS arm64 | `observer-agent-darwin-arm64` | | Windows x64 | `observer-agent-windows-x64.exe` | ### Download and verify ```bash VERSION=1.0.4 curl -fLO https://github.com/useobserver/agent/releases/download/agent-v${VERSION}/observer-agent-linux-x64 curl -fLO https://github.com/useobserver/agent/releases/download/agent-v${VERSION}/SHA256SUMS shasum -a 256 -c SHA256SUMS --ignore-missing chmod +x observer-agent-linux-x64 sudo mv observer-agent-linux-x64 /usr/local/bin/observer-agent ``` ### Run ```bash AGENT_KEY=obs_live_... \ CLOUD_SERVER_URL=https://use.observer \ PROMETHEUS_SERVER_URL=http://prometheus.local:9090 \ observer-agent ``` The dashboard listens on `http://localhost:10101`. The console's Agents page marks the agent as **running** within 90 seconds. ### Run as a systemd service (optional) ```ini title="/etc/systemd/system/observer-agent.service" [Unit] Description=Observer agent Wants=network-online.target After=network-online.target [Service] Type=simple EnvironmentFile=/etc/observer-agent.env ExecStart=/usr/local/bin/observer-agent Restart=on-failure RestartSec=5s User=observer [Install] WantedBy=multi-user.target ``` ```bash sudo install -m 600 -o root -g root /dev/stdin /etc/observer-agent.env <<'EOF' AGENT_KEY=obs_live_... CLOUD_SERVER_URL=https://use.observer PROMETHEUS_SERVER_URL=http://prometheus.local:9090 EOF sudo useradd --system --no-create-home observer 2>/dev/null || true sudo systemctl daemon-reload sudo systemctl enable --now observer-agent ``` ## Forwarding flags to the embedded Bun runtime The binary ships with a copy of the Bun runtime baked in. Runtime flags reach it via the `BUN_OPTIONS` environment variable, not via command-line arguments. Example: ```bash BUN_OPTIONS="--smol" observer-agent ``` See [bun.com/docs/bundler/executables](https://bun.com/docs/bundler/executables#runtime-arguments-via-bun_options) for the full list of supported flags. 
## Upgrades ```bash VERSION=1.0.4 curl -fLO https://github.com/useobserver/agent/releases/download/agent-v${VERSION}/observer-agent-linux-x64 chmod +x observer-agent-linux-x64 sudo mv observer-agent-linux-x64 /usr/local/bin/observer-agent sudo systemctl restart observer-agent ``` Pin to an exact version in your install scripts and CI. New releases at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases). Roll forward deliberately; rollbacks are a one-line revert of the URL. Use the [Docker quickstart](/agent/quickstart/install-docker) when your host already runs containers, when you want pinning via image tags, or when you want the agent isolated from the host's user namespace. The binary path is the lightest install but offers fewer guardrails. --- url: https://docs.use.observer/agent/quickstart/install-docker title: Install on Docker description: Run the published image with three required environment variables. --- The fastest path to a running agent. Suitable for a development host, a single VM, or a quick proof of concept. ## Prerequisites - Docker installed. - An agent key from the Observer console (**Agents** > **New agent**). The key is shown once; copy it before navigating away. - A Prometheus URL the host can reach (only required if the agent will run Prometheus probes; HTTP / TCP / DNS / TLS-cert probes do not need it). ## Steps ### Pull the image ```bash docker pull ghcr.io/useobserver/agent:1.0.4 ``` ### Run the container ```bash docker run -d \ --name observer-agent \ --restart unless-stopped \ -p 10101:10101 \ -e AGENT_KEY=obs_live_... \ -e CLOUD_SERVER_URL=https://use.observer \ -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \ ghcr.io/useobserver/agent:1.0.4 ``` The image listens on port `10101` for the debug dashboard. If the agent will not run Prometheus probes, omit `PROMETHEUS_SERVER_URL`. ### Confirm the connection Open `http://<host>:10101` in a browser. The dashboard's *Cloud* panel shows a recent `last_heartbeat_at` timestamp once the agent has registered with the cloud (typically within 30 seconds). The Agents page in the console marks the agent as **running** within 90 seconds. For multi-host or cluster deployments, see [Install on Kubernetes](/agent/quickstart/install-kubernetes). The systemd wrap below works for bare-metal Linux hosts that don't run k8s. The Docker path is recommended only for single-host development. ## Compose For a Compose-driven setup, use the snippet below. It pins the image, binds the dashboard, and reads secrets from a `.env` file. Pin to an exact version (`agent:1.0.4`). Track new releases at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases) and roll forward when ready. ```yaml title="docker-compose.yml" services: observer-agent: image: ghcr.io/useobserver/agent:1.0.4 container_name: observer-agent restart: unless-stopped env_file: [.env] ports: - "10101:10101" ``` ```bash title=".env" AGENT_KEY=obs_live_... CLOUD_SERVER_URL=https://use.observer PROMETHEUS_SERVER_URL=http://prometheus:9090 ``` ## Run under systemd Use this when the host is bare-metal Linux without k8s and you want Docker isolation alongside systemd auto-restart + journal log capture. Prefer [Install from a release binary](/agent/quickstart/install-binary) when Docker isn't a hard requirement — fewer moving parts. ```bash sudo install -m 600 -o root -g root /dev/stdin /etc/observer-agent.env <<'EOF' AGENT_KEY=obs_live_...
CLOUD_SERVER_URL=https://use.observer PROMETHEUS_SERVER_URL=http://prometheus.local:9090 EOF sudo docker pull ghcr.io/useobserver/agent:1.0.4 ``` ```ini title="/etc/systemd/system/observer-agent.service" [Unit] Description=Observer agent Wants=network-online.target docker.service After=network-online.target docker.service [Service] Type=simple EnvironmentFile=/etc/observer-agent.env ExecStartPre=-/usr/bin/docker rm -f observer-agent ExecStart=/usr/bin/docker run --rm --name observer-agent \ --network host \ --env-file /etc/observer-agent.env \ ghcr.io/useobserver/agent:1.0.4 ExecStop=/usr/bin/docker stop observer-agent Restart=on-failure RestartSec=5s [Install] WantedBy=multi-user.target ``` ```bash sudo systemctl daemon-reload sudo systemctl enable --now observer-agent journalctl -u observer-agent -n 50 --no-pager ``` `--network host` lets the agent reach Prometheus and probe targets bound to the host's loopback or its private interface. If your targets sit on a Docker bridge instead, drop the host-network flag and use the bridge IP. --- url: https://docs.use.observer/agent/quickstart/install-kubernetes title: Install on Kubernetes description: Deployment manifest with Secret-bound credentials. --- The recommended deployment path for production. Single-replica Deployment, agent key delivered through a Kubernetes Secret. ## Prerequisites - A Kubernetes cluster you can deploy into. - An agent key from the Observer console. - A reachable Prometheus URL inside the cluster (typically a `Service` in the monitoring namespace). ## Steps ### Create a namespace and Secret ```bash kubectl create namespace observer kubectl create secret generic observer-agent \ --namespace observer \ --from-literal=agent-key='obs_live_...' ``` ### Apply the Deployment ```yaml title="agent.yaml" apiVersion: apps/v1 kind: Deployment metadata: name: observer-agent namespace: observer spec: replicas: 1 selector: matchLabels: { app: observer-agent } template: metadata: labels: { app: observer-agent } spec: containers: - name: agent image: ghcr.io/useobserver/agent:1.0.4 imagePullPolicy: IfNotPresent ports: - { name: dashboard, containerPort: 10101 } env: - name: AGENT_KEY valueFrom: secretKeyRef: { name: observer-agent, key: agent-key } - name: CLOUD_SERVER_URL value: https://use.observer - name: PROMETHEUS_SERVER_URL value: http://prometheus.monitoring.svc.cluster.local:9090 resources: requests: { cpu: "50m", memory: "64Mi" } limits: { cpu: "500m", memory: "256Mi" } ``` ```bash kubectl apply -f agent.yaml ``` ### Confirm the connection Port-forward the dashboard: ```bash kubectl -n observer port-forward deploy/observer-agent 10101:10101 ``` Open `http://localhost:10101`. The *Cloud* panel reports `last_heartbeat_at` within 30 seconds. The console's Agents page marks the agent as **running** within 90 seconds. Run a single replica per agent identity. The agent's local queue uses an embedded SQLite file inside the container; running two pods with the same key splits the queue and confuses the cloud's agent.offline detection. Pin to an exact version (`agent:1.0.4`). Track new releases at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases) and roll forward when ready. ## Optional: dashboard Service To expose the dashboard inside the cluster (without `port-forward`), add a ClusterIP Service. Do not expose the dashboard externally; the dashboard is operator-facing only. 
```yaml title="agent-service.yaml" apiVersion: v1 kind: Service metadata: name: observer-agent namespace: observer spec: type: ClusterIP selector: { app: observer-agent } ports: - { name: dashboard, port: 10101, targetPort: 10101 } ``` --- url: https://docs.use.observer/agent/guides/prometheus-source title: Configure Prometheus query metrics description: Define a metric whose value comes from a PromQL query the agent runs against your Prometheus. --- Prometheus is the most common metric source. The agent runs the PromQL query against the Prometheus URL configured at deploy time and reports the scalar result. ## Configuration shape A Prometheus metric carries a `source_config` of: ```json { "query": "rate(http_requests_total{job=\"checkout-api\",status=~\"5..\"}[5m]) / rate(http_requests_total{job=\"checkout-api\"}[5m])", "prometheus_url": "https://prometheus.example/ (optional override)" } ``` The optional `prometheus_url` overrides the agent's `PROMETHEUS_SERVER_URL` for this single metric. Use it when one agent serves multiple Prometheus servers. ## Query requirements - The query must return a single scalar value. Use aggregation (`sum`, `avg`, `rate`, etc.) to collapse vector results. - Empty results are reported as `no_data`; the metric does not flip status until the query produces a value. - The agent does not interpret the query content. The cloud's push payload contains the precomputed status, not the query string. ## Authentication When Prometheus requires basic auth, set on the agent: ```bash PROMETHEUS_BASIC_AUTH_ENABLED=true PROMETHEUS_USERNAME=... PROMETHEUS_PASSWORD=... ``` These apply to every Prometheus probe the agent runs. Grafana Cloud's hosted Prometheus uses basic auth. The credentials issued by Grafana for read access drop in here unchanged. See [Connect to Grafana Cloud](/agent/guides/connect-grafana-cloud). ## Threshold examples | Query | Healthy | Unhealthy | |---|---|---| | 5xx error ratio (last 5 min) | `under 0.005` | `over 0.02` | | p95 latency in ms | `under 500` | `over 2000` | | Queue depth | `under 100` | `over 1000` | | Replica count | `over 1` | `under 1` | The strict-comparison rule applies (see [threshold operators](/docs/reference/threshold-operators) in the Documentation tab). --- url: https://docs.use.observer/agent/guides/http-probes title: Configure HTTP probes description: Probe an HTTP endpoint and report response time as the metric value. --- HTTP probes hit a URL on the configured interval. The reported value is `response_time_ms` for successful requests, or `no_data` with a reason code when the request fails (timeout, connection refused, body mismatch, unexpected status). ## Configuration shape ```json { "url": "https://api.example.com/healthz", "method": "GET", "expected_status": 200, "timeout_ms": 5000, "headers": { "User-Agent": "observer-agent" }, "body_match": "ok", "follow_redirects": true, "verify_tls": true } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `url` | required | Full URL including scheme. | | `method` | `GET` | One of `GET`, `HEAD`, `POST`, `PUT`, `PATCH`, `DELETE`, `OPTIONS`. | | `expected_status` | `200` | Single integer or array. The probe matches if the response code is in the set. | | `timeout_ms` | `5000` | Aborts the request when exceeded. Reports `ETIMEDOUT`. | | `headers` | none | Extra request headers. Common use: API key for protected endpoints. | | `body_match` | none | Optional substring match against the first 4KB of the response body. Mismatch reports `body_mismatch`. 
| | `follow_redirects` | `true` | When `false`, redirect responses count against `expected_status`. | | `verify_tls` | `true` | When `false`, the probe accepts invalid TLS certificates. Useful for self-signed internal endpoints. | ## Reason codes The `reason` field on `no_data` results uses values from the HTTP client and Node socket layer: - `ETIMEDOUT`: request exceeded `timeout_ms`. - `ECONNREFUSED`: connection refused at the TCP layer. - `ENOTFOUND`, `EAI_AGAIN`: DNS resolution failed. - `unexpected_status:<code>`: status code not in `expected_status`. - `body_mismatch`: `body_match` was set and the response body did not contain it. Only the first 4KB of the response body is read. If the marker string is later in the response, the probe reports `body_mismatch`. Move the marker earlier in the response, or use a dedicated health endpoint that returns it in the first kilobyte. ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Endpoint reachable, fast | `under 500` | `over 2000` | | Endpoint reachable | `under 5000` | `over 10000` | For pure reachability with no latency requirement, set the unhealthy threshold equal to the timeout and rely on `no_data` for failures. --- url: https://docs.use.observer/agent/guides/tcp-probes title: Configure TCP probes description: Open a TCP connection and report connect time as the metric value. --- TCP probes are appropriate for non-HTTP services where reachability of a port is the signal: Redis, Postgres, RabbitMQ, internal RPC services. The agent opens a TCP connection, records the connect time in milliseconds, and closes the connection. ## Configuration shape ```json { "host": "redis.internal", "port": 6379, "timeout_ms": 2000 } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `host` | required | Hostname or IP. | | `port` | required | Integer in `1..65535`. | | `timeout_ms` | `2000` | Aborts the connection attempt when exceeded. | ## Reason codes | Reason | Meaning | |---|---| | `ETIMEDOUT` | Connection attempt did not complete within `timeout_ms`. | | `ECONNREFUSED` | TCP connection refused. | | `ENOTFOUND` / `EAI_AGAIN` | DNS resolution failed. | | `tcp_error` | Other socket error. The exact code is logged on the agent. | ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Reachable + fast handshake | `under 50` | `over 500` | | Reachable | `under 1000` | `over 1500` | Pure reachability with no latency requirement: set unhealthy at `timeout_ms - 1`, leaving anything below as healthy. --- url: https://docs.use.observer/agent/guides/dns-probes title: Configure DNS probes description: Resolve a record and report resolve time as the metric value. --- DNS probes resolve a domain through the agent's DNS resolver and report the resolution time in milliseconds. Optional value-match verifies the answer. ## Configuration shape ```json { "domain": "api.example.com", "record_type": "A", "expected_value": "203.0.113.10", "resolver": "1.1.1.1" } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `domain` | required | The domain to resolve. | | `record_type` | `A` | One of `A`, `AAAA`, `CNAME`, `MX`, `TXT`, `NS`, `SRV`, `CAA`, `PTR`. | | `expected_value` | none | Optional substring match against the resolved record. Mismatch reports `expected_value_mismatch`. | | `resolver` | system default | Optional override resolver IP. Useful for verifying a specific authoritative server.
| ## Reason codes The `reason` field surfaces standard Node DNS error codes: | Reason | Meaning | |---|---| | `ENOTFOUND` | The domain does not resolve. | | `ETIMEDOUT` | The resolver did not answer in time. | | `ESERVFAIL` | The resolver returned `SERVFAIL`. | | `expected_value_mismatch` | Resolution succeeded but the record did not contain `expected_value`. | | `dns_error` | Other resolver error. | ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Authoritative answer fast | `under 50` | `over 500` | | Resolution succeeds at all | `under 5000` | `over 10000` | When the test is purely "does the domain still resolve", `unhealthy_value` set to a timeout-equivalent threshold combined with `no_data` on `ENOTFOUND` covers the case. --- url: https://docs.use.observer/agent/guides/tls-cert-probes title: Configure TLS certificate probes description: Connect to a TLS endpoint and report days until certificate expiry. --- TLS certificate probes connect to a host on a TLS port, read the peer certificate, and report `days_until_expiry`. Use them to fire a clear signal before a public certificate lapses. ## Configuration shape ```json { "host": "api.example.com", "port": 443, "warn_days": 30, "critical_days": 7 } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `host` | required | Hostname (preferred) or IP. SNI is set automatically when the host is a hostname. | | `port` | `443` | TLS port to connect to. | | `warn_days` | `30` | Informational marker. The agent reports the value regardless; thresholds drive status. | | `critical_days` | `7` | Same as above. The relationship `warn_days >= critical_days` is enforced. | The probe accepts certificates that fail validation (expired, self-signed, hostname mismatch). The intent is to surface the problem rather than refuse the connection. Status is computed from `days_until_expiry`. ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Standard renewal cadence | `over 30` | `under 7` | | Aggressive (Let's Encrypt 90d) | `over 14` | `under 3` | Negative `days_until_expiry` indicates the certificate has already expired. Set `unhealthy` at `under 0` to treat that as a hard unhealthy. ## Reason codes | Reason | Meaning | |---|---| | `no_cert` | Server completed TLS but did not present a certificate. | | `bad_cert_date` | Certificate's `valid_to` could not be parsed. | | `ETIMEDOUT` | Connection did not complete in time. | | `ECONNREFUSED` | Connection refused at the TCP layer. | | `tls_error` | Other TLS-handshake error. | When `host` is an IP literal, the SNI hint is omitted (RFC 6066 forbids IPs as SNI values). Some virtual-hosted servers will not return the expected certificate without SNI. Probe a hostname whenever the system supports one. --- url: https://docs.use.observer/agent/guides/connect-grafana-cloud title: Connect to Grafana Cloud description: Use a Grafana Cloud Prometheus endpoint as the agent's metric source. --- Grafana Cloud's hosted Prometheus is a valid source for Observer agents. The connection uses basic auth with credentials issued by Grafana for read access. ## Steps 1. In Grafana Cloud, open the stack details for the Prometheus instance you want the agent to read. Note: - **URL**: the remote-read URL (e.g. `https://prometheus-prod-01-eu-west-0.grafana.net/api/prom`). - **Username**: the numeric user id (e.g. `123456`). - **Password**: a Grafana Cloud access policy token with `metrics:read` scope. 2. 
Set the agent's environment: ```bash title=".env" PROMETHEUS_SERVER_URL=https://prometheus-prod-01-eu-west-0.grafana.net/api/prom PROMETHEUS_BASIC_AUTH_ENABLED=true PROMETHEUS_USERNAME=123456 PROMETHEUS_PASSWORD=glc_eyJ... # access policy token ``` 3. Restart the agent. Heartbeats and probe queries now hit Grafana Cloud. ## Verification Define a Prometheus metric in the console (any working PromQL query against the data Grafana Cloud holds). Within one push interval the metric reports a value. Failures resolve to `no_data` with a reason: - `Unauthorized` when the access policy token is missing or lacking the required scope. - `BadQuery` when the PromQL string is invalid against the data. - `PromUpstream` for Grafana-side 5xx responses. When one agent reads from multiple Prometheus servers (Grafana Cloud + a local Prometheus, for example), the `PROMETHEUS_SERVER_URL` env var sets the default and individual metrics override it via the `prometheus_url` field on the metric definition. --- url: https://docs.use.observer/agent/guides/read-the-dashboard title: Read the agent dashboard description: How to interpret the panels exposed on the agent's debug HTTP surface. --- The agent serves a read-only debug dashboard on `http://<agent-host>:10101` by default. The dashboard polls the agent's in-process state every five seconds and never mutates anything. ## Panels ### Process Identifies the agent: version string, Bun runtime version, process uptime, and resident-set memory. Use this to confirm the running image matches the version you expected after an upgrade. ### Config Lists the environment variables the agent considers relevant to operation, with values masked to `first-4 + tail-4` characters. Variables outside the allowlist are not displayed regardless of their name. The masking applies even to values that are not themselves secrets, so screenshots of the dashboard never reveal a full token. ### Queue Reports the local SQLite queue's current depth, the age of the oldest pending push, the configured capacity, and the drain controller's current backoff in milliseconds. A growing queue combined with a non-zero backoff is the signature of cloud unreachability. ### Cloud Reports the configured `CLOUD_SERVER_URL` and the timestamps and results of the last heartbeat and the last metric push. Failed recent calls show their error reason here. ### Prometheus Reports the configured `PROMETHEUS_SERVER_URL`, the outcome of the last Prometheus probe (`success`, `no_data`, `error`, or `null` if the agent has not run a Prometheus probe yet), and the timestamp. ### Definitions One row per metric the cloud has assigned to this agent. Each row shows the metric id, source type, intervals, and the last reported status, value, timestamp, and reason. Use this to confirm a specific metric is being collected. ### Active source types The list of source types the agent has actually run since boot. Useful when verifying that a stubbed runtime (e.g. `database`) is not silently unused. ## Toggling the dashboard The dashboard is enabled by default. To disable it set `ENABLE_DEBUG_DASHBOARD=false` in the agent's environment. To change the bind address or port, set `DEBUG_DASHBOARD_HOST` and `DEBUG_DASHBOARD_PORT`. The dashboard is intended for the operators who run the agent. Do not expose it to the public internet. In Kubernetes, expose it through a `ClusterIP` Service rather than a `LoadBalancer` or `Ingress`.
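The Config panel's `first-4 + tail-4` masking is simple enough to restate in code when building tooling around dashboard output. A minimal sketch of the rule as described (illustrative; the exact separator and short-value handling are assumptions, not the agent's actual implementation):

```ts
// Mask a value to its first four and last four characters, per the
// dashboard's described first-4 + tail-4 rule. Values of eight characters
// or fewer are fully masked here (assumed edge-case handling).
function maskValue(value: string): string {
  if (value.length <= 8) return "*".repeat(value.length);
  return `${value.slice(0, 4)}...${value.slice(-4)}`;
}

console.log(maskValue("obs_live_abc123def456xyz")); // "obs_...6xyz"
```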
--- url: https://docs.use.observer/agent/guides/diagnose-stalled-agent title: Diagnose a stalled agent description: Triage path when an agent stops reporting or its queue depth grows. --- A stalled agent surfaces in two places: the cloud marks it as **stopped** on the Agents page (no heartbeat in the expected window), and the local dashboard's queue depth grows beyond a few pending entries. The triage below covers both. ## Step 1: confirm the failure mode Open the agent's dashboard at `http://<agent-host>:10101`. Four patterns are common: | Pattern | Probable cause | |---|---| | Process up, queue growing, last heartbeat recent but failing | Cloud reachability problem from this host. | | Process up, queue depth zero, heartbeat succeeding, but cloud says **stopped** | Clock skew or stale state in the console (refresh). | | Process up, no probes have run | The cloud has not assigned any metrics to this agent yet. | | Dashboard unreachable | The agent process is down. Container restart loop or host crash. | ## Step 2: agent down If the dashboard is unreachable: ```bash # Docker docker logs --tail 200 observer-agent # Kubernetes kubectl -n observer logs deploy/observer-agent --tail=200 # Linux systemd journalctl -u observer-agent -n 200 --no-pager ``` Look for a panic, an unhandled rejection, or a configuration error on startup. Common causes: missing `AGENT_KEY`, malformed `CLOUD_SERVER_URL`, port `10101` already in use. ## Step 3: cloud unreachable If the dashboard's *Cloud* panel shows a recent `last_heartbeat_error`, the issue is between the agent and the cloud. Verify in this order: 1. **DNS**: `getent hosts <cloud host>` from the agent's host (the distroless image has no shell to exec into). 2. **TCP**: `curl -v https://<cloud host>` from the same host. 3. **TLS**: certificate trust. Custom internal CAs need the container's trust store updated. 4. **Auth**: a recently rotated `AGENT_KEY` requires updating the agent's environment. The drain controller automatically retries with exponential backoff. The queue continues to accept pushes up to its capacity (`BUFFER_MAX_ROWS`, default `10000`). Once the cloud is reachable again, the queue drains. ## Step 4: queue saturation If the queue depth has hit `BUFFER_MAX_ROWS`, the oldest entries are dropped to admit new ones. The dashboard's queue panel shows the depth at the cap. After cloud reachability returns, the queue drains and depth returns to near zero. The cloud's `agent.offline` webhook fires when the heartbeat window is exceeded. Subscribe to this event when on-call needs an explicit alert. The agent only runs probes for metrics the cloud has assigned to it. If a freshly registered agent shows no probe activity, the cause is on the cloud side: open the metric in the console, confirm its **Agent** field is set, and save. --- url: https://docs.use.observer/agent/guides/rotate-agent-key title: Rotate the agent's authentication key description: Generate a new agent key, deploy it, and retire the old one with no observability gap. --- Agent keys can be rotated through the console without an observability gap. The cloud accepts both the new key and the previous key for a configurable grace window, so the deployment can roll over without strict synchronisation. ## Steps ### Generate a new key In the console, open **Agents**, select the agent, then **Rotate key**. The cloud: 1. Generates a new key, stores its hash, and returns the plaintext once. 2. Demotes the previous key to `previous_agent_key_hash` with a `previous_key_valid_until` timestamp (default: 24 hours from rotation). Copy the new key.
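Before deploying the new key, it helps to picture the acceptance rule the grace window creates, summarised under "What the cloud sees" below. A sketch using the field names this page describes (illustrative; the cloud's actual code is closed source):

```ts
import { createHash } from "node:crypto";

const sha256 = (key: string) => createHash("sha256").update(key).digest("hex");

// Field names follow this page's description of the cloud's key record.
interface AgentKeyRecord {
  agent_key_hash: string;
  previous_agent_key_hash?: string;
  previous_key_valid_until?: string; // ISO timestamp: end of the grace window
}

// Accept the current key always; accept the demoted previous key only
// while the grace window is still open.
function acceptsKey(record: AgentKeyRecord, presented: string, now = new Date()): boolean {
  const hash = sha256(presented);
  if (hash === record.agent_key_hash) return true;
  return (
    hash === record.previous_agent_key_hash &&
    record.previous_key_valid_until !== undefined &&
    new Date(record.previous_key_valid_until) > now
  );
}
```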
### Deploy the new key Update the agent's `AGENT_KEY` environment variable to the new value. The deployment path depends on your runtime: - **Docker**: `docker run -e AGENT_KEY=<new key>` and restart the container. - **Kubernetes**: update the `observer-agent` Secret and roll the Deployment (`kubectl rollout restart deploy/observer-agent`). - **systemd-managed Docker**: edit `/etc/observer-agent.env`, then `systemctl restart observer-agent`. The agent reconnects with the new key on its next heartbeat. ### Confirm the rotation took effect Open the agent's dashboard. The *Cloud* panel reports a successful heartbeat with the new key. The Agents page in the console shows the agent as **running** with the new key fingerprint. ### Retire the old key The previous key automatically becomes invalid at `previous_key_valid_until`. To retire it sooner, open the agent in the console and set the grace window to zero. Subsequent requests with the previous key are rejected. ## What the cloud sees - The cloud stores the SHA-256 of each key, never the plaintext. - A request with the new key matches `agent_key_hash` and succeeds. - A request with the previous key matches `previous_agent_key_hash`, and succeeds only while `previous_key_valid_until` is in the future. - A lost key cannot be recovered. Rotate to issue a replacement. Treat agent keys with the same care as any service credential. Rotate when an environment file is shared, when a developer with access leaves, or when a host's image is exported. The grace window makes rotation cheap; do it often. --- url: https://docs.use.observer/agent/reference/environment-variables title: Environment variables description: Every environment variable the agent reads, with defaults and meaning. --- The agent is configured through environment variables. There is no configuration file; this keeps the runtime container immutable and the deployment surface small. ## Required | Variable | Notes | |---|---| | `AGENT_KEY` | Authentication key issued by the cloud. Format `obs_live_<43 base64url chars>`. The cloud stores its hash, never the plaintext. | | `CLOUD_SERVER_URL` | Base URL of Observer Cloud. Defaults to `https://localhost:3000` (development only). Override in every real deployment. | ## Required for Prometheus probes | Variable | Notes | |---|---| | `PROMETHEUS_SERVER_URL` | Base URL of the Prometheus the agent should query. Used as the default for every Prometheus metric, overridable per-metric via `prometheus_url` in the metric's source config. | ## Optional Prometheus auth | Variable | Default | Notes | |---|---|---| | `PROMETHEUS_BASIC_AUTH_ENABLED` | `true` | Set to any value other than `true` to disable. When enabled, the agent sends `Authorization: Basic <base64(username:password)>` on every Prometheus request. | | `PROMETHEUS_USERNAME` | `admin` | Basic auth username. | | `PROMETHEUS_PASSWORD` | empty | Basic auth password. Treat as a secret. | ## Dashboard | Variable | Default | Notes | |---|---|---| | `ENABLE_DEBUG_DASHBOARD` | `true` | Set to `false` to disable the local debug dashboard. | | `DEBUG_DASHBOARD_HOST` | `0.0.0.0` | Bind address for the dashboard HTTP listener. | | `DEBUG_DASHBOARD_PORT` | `10101` | Port for the dashboard HTTP listener. | ## Logging | Variable | Default | Notes | |---|---|---| | `BROADCAST_LOGS` | `false` | When `true`, the agent forwards a subset of its log lines to the cloud for surfacing in the agent detail page. PromQL query strings are always redacted to a SHA-256 prefix and length, regardless of this flag.
| | `LOG_BROADCAST_LEVEL` | `INFO` | Minimum level forwarded when `BROADCAST_LOGS=true`. One of `DEBUG`, `INFO`, `WARN`, `ERROR`. | | `VERBOSE` | `false` | Local stdout verbosity. | ## Local queue | Variable | Default | Notes | |---|---|---| | `BUFFER_PATH` | `./observer-agent-buffer.db` | Path to the agent's local SQLite write-ahead queue file. | | `BUFFER_MAX_ROWS` | `10000` | Hard cap on queued pushes. When the queue reaches the cap, oldest entries are evicted to admit new ones. | ## Other | Variable | Default | Notes | |---|---|---| | `SKIP_SSL_VERIFICATION` | `false` | Disables TLS verification on cloud-bound requests. Development only. | | `NODE_ENV` | unset | Affects log formatting. Set to `production` in production deployments. | Variables visible on the debug dashboard are masked to `first-4 + tail-4` characters. Variables not in the dashboard's allowlist are omitted entirely. The allowlist is intentionally narrow; everything outside it does not appear in the dashboard regardless of value. --- url: https://docs.use.observer/agent/reference/probe-types title: Probe types description: Source types the agent supports, with their value semantics and runtime status. --- | `source_type` | Value reported | Status | |---|---|---| | `prometheus` | scalar from PromQL query | shipped | | `http` | response_time_ms | shipped | | `tcp` | connect_time_ms | shipped | | `dns` | resolve_time_ms | shipped | | `tls_cert` | days_until_expiry | shipped | | `icmp` | n/a | stubbed | | `grpc` | n/a | stubbed | | `websocket` | n/a | stubbed | | `mtls_http` | n/a | stubbed | | `database` | n/a | stubbed | ## Shipped runtimes Each shipped runtime has a dedicated guide: - [Prometheus](/agent/guides/prometheus-source) - [HTTP](/agent/guides/http-probes) - [TCP](/agent/guides/tcp-probes) - [DNS](/agent/guides/dns-probes) - [TLS certificate](/agent/guides/tls-cert-probes) ## Stubbed runtimes The cloud accepts metric definitions for stubbed source types and stores their `source_config`. The agent recognises them but reports `not_implemented` in the `reason` field on every probe. The metric remains in `no_data` until the runtime ships. | Source type | Why stubbed | |---|---| | `icmp` | Most container runtimes need `CAP_NET_RAW` to open raw sockets. The TCP probe is a better proxy for "is this host reachable" in cloud-native environments. | | `grpc` | Adds a `@grpc/grpc-js` dependency that is not yet justified by validated demand. | | `websocket` | Same: adds the `ws` library for a probe with limited validated demand. | | `mtls_http` | Requires a client-cert secret store. The auth model needs design work before runtime work. | | `database` | Requires per-driver client libraries and a connection-string secret store. | The cloud-side enum is defined in a database check constraint, the agent's dispatch table, and the Zod schema in `@observer/probe-config`. Adding a runtime is a coordinated change across these three sites plus a UI form for the parameters. ## Common contract Every source's runtime exports the same interface: ```ts title="ProbeSource" interface ProbeSource<TConfig> { validateConfig(config: unknown): null | string; execute(config: TConfig, env?: AgentEnv): Promise<ProbeResult>; } interface ProbeResult { value: number | null; timestamp: string; status_hint?: "no_data"; reason?: string; metadata?: Record<string, unknown>; } ``` Sources never throw. Network errors, malformed config, missing fields all resolve to `{ value: null, status_hint: "no_data", reason: "<code>" }`. The dispatcher applies the threshold rule only when `status_hint` is absent.
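For orientation, here is what a source written against that contract looks like: a hypothetical bare-bones HTTP reachability probe (illustrative only; the shipped `http` source has a richer config and reason-code mapping):

```ts
// A hypothetical source conforming to the ProbeSource contract above.
interface PingConfig {
  url: string;
  timeout_ms?: number;
}

const pingSource: ProbeSource<PingConfig> = {
  validateConfig(config) {
    const c = config as Partial<PingConfig>;
    return typeof c?.url === "string" ? null : "url is required";
  },
  async execute(config) {
    const started = Date.now();
    try {
      await fetch(config.url, { signal: AbortSignal.timeout(config.timeout_ms ?? 5000) });
      return { value: Date.now() - started, timestamp: new Date().toISOString() };
    } catch (err) {
      // Never throw: every failure resolves to a no_data result with a reason code.
      return {
        value: null,
        timestamp: new Date().toISOString(),
        status_hint: "no_data",
        reason: err instanceof Error ? err.name : "unknown_error",
      };
    }
  },
};
```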
--- url: https://docs.use.observer/agent/reference/dashboard-panels title: Dashboard panels description: Read-only state surface served on the agent's debug HTTP port. --- The agent exposes a read-only HTTP dashboard on `http://<agent-host>:10101`. Every panel reads from the agent's in-process state. Nothing on the page mutates anything. ## Panel reference ### `process` | Field | Meaning | |---|---| | `agent_started_at` | ISO timestamp of process start. | | `uptime_seconds` | Wall-clock seconds since `agent_started_at`. | | `memory_rss_mb` | Resident set size of the agent process. | | `version` | Build-time version string. | | `bun_version` | Bun runtime version reported by `Bun.version`. | ### `config` A map of environment variable names to masked values. The mask is applied to every value displayed (first-4 + tail-4). Names not in the dashboard's allowlist are not displayed regardless of value. ### `queue` | Field | Meaning | |---|---| | `depth` | Pushes currently waiting to be drained. | | `oldest_age_seconds` | Age of the oldest pending push. | | `capacity` | Configured `BUFFER_MAX_ROWS`. | | `drain_backoff_ms` | Current exponential backoff used by the drain controller. Zero means the next drain attempt is immediate. | ### `cloud` | Field | Meaning | |---|---| | `cloud_server_url` | Configured `CLOUD_SERVER_URL`. | | `last_heartbeat_at` | ISO timestamp of the last heartbeat attempt. | | `last_heartbeat_ok` | Boolean result of the last heartbeat. | | `last_heartbeat_error` | Error string when `last_heartbeat_ok` is false. | | `last_post_at` | Last metric-push attempt timestamp. | | `last_post_ok` | Boolean result of the last push. | | `last_post_error` | Error string when `last_post_ok` is false. | ### `prometheus` | Field | Meaning | |---|---| | `server_url` | Configured `PROMETHEUS_SERVER_URL`. | | `last_probe_outcome` | One of `success`, `no_data`, `error`, or `null` if no Prometheus probe has run yet. | | `last_probe_at` | ISO timestamp of the last Prometheus probe. | ### `definitions` One row per metric the cloud has assigned to this agent. | Field | Meaning | |---|---| | `id` | Metric definition id. | | `source_type` | One of the [probe types](/agent/reference/probe-types). | | `interval_minutes` | Configured collection interval. | | `push_interval_minutes` | Configured forced-push interval (status pushes also fire on every status change). | | `last_status` | Most recent status reported. | | `last_value` | Most recent reported value. | | `last_at` | ISO timestamp of the last push. | | `last_reason` | Reason code on the last `no_data` (or null when the last push was healthy). | ### `active_source_types` Distinct list of source types the agent has actually run since boot. A source type that appears in `definitions` but not here indicates the agent has not yet had time to run an instance. --- url: https://docs.use.observer/agent/reference/heartbeat-payload title: Heartbeat payload description: JSON shape the agent posts to /api/agent/heartbeat. --- The agent emits a heartbeat to the cloud on a fixed interval (typically every 30 seconds). The payload is the agent's view of its own runtime state. The cloud uses the payload to compute 24-hour uptime, restart counts, and lag alerts.
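The lag signal described under "Lag and uptime alerts" below is a small state machine with hysteresis. A sketch of the documented behaviour (thresholds are the documented values; the code itself is illustrative, not the cloud's):

```ts
// agent.lag_high as described: opens when either queue signal breaches its
// threshold; clears only after both stay below threshold for 60 seconds.
interface LagSignal {
  queue_depth: number;
  queue_oldest_age_seconds: number;
}

class LagHighState {
  private open = false;
  private belowSince: number | null = null; // epoch ms of first in-threshold heartbeat

  update(signal: LagSignal, nowMs = Date.now()): boolean {
    const breaching = signal.queue_depth > 1000 || signal.queue_oldest_age_seconds > 300;
    if (breaching) {
      this.open = true;
      this.belowSince = null;
    } else if (this.open) {
      this.belowSince ??= nowMs;
      if (nowMs - this.belowSince >= 60_000) this.open = false; // 60s clear hysteresis
    }
    return this.open;
  }
}
```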
## Endpoint ``` POST /api/agent/heartbeat Authorization: provided via Agent-Key header (key transport detail) Content-Type: application/json ``` ## Body ```json { "version": "1.2.3", "uptime_seconds": 12345, "buffer_size": 0, "buffer_oldest_age_seconds": 0, "queue_depth": 0, "queue_oldest_age_seconds": 0, "queue_capacity": 10000, "agent_started_at": "2026-05-09T12:00:00Z", "source_types_active": ["prometheus", "http", "tcp"] } ``` ## Field reference | Field | Type | Meaning | |---|---|---| | `version` | string | Build-time version of the agent. | | `uptime_seconds` | integer | Wall-clock seconds since process start. | | `buffer_size` | integer | Legacy alias for `queue_depth`. Accepted by older cloud builds; pre-21.5 fallback. | | `buffer_oldest_age_seconds` | integer | Legacy alias for `queue_oldest_age_seconds`. | | `queue_depth` | integer | Pushes currently waiting in the local queue. | | `queue_oldest_age_seconds` | integer | Age of the oldest pending push, in seconds. | | `queue_capacity` | integer | Hard cap on the queue (`BUFFER_MAX_ROWS`). | | `agent_started_at` | ISO timestamp | When the process started. The cloud uses changes to this value to detect restarts. | | `source_types_active` | string[] | Distinct source types the agent has actually run since boot. | ## Lag and uptime alerts The cloud's heartbeat receiver runs two state machines per agent: - **`agent.lag_high`**: opens when `queue_depth > 1000` or `queue_oldest_age_seconds > 300`. Clears when both signals stay below threshold for 60 seconds. - **`agent.uptime_degraded`**: opens when uptime over the last 24 hours falls below 95%. Same 60-second clear hysteresis. These signals surface in the cloud console's agent detail page and as `agent.offline` webhook events when subscribed. ## Versioning The payload shape is part of the cloud-agent wire contract, maintained in the public `@observer/protocol` package. Field additions are additive; field removals require a major version bump on `@observer/protocol`. --- url: https://docs.use.observer/agent title: Observer Agent description: The Observer data plane. Probes metric sources, computes status, pushes verdicts to the cloud. --- The Observer Agent is a small process that runs in your network. It reads from Prometheus or probes endpoints directly (HTTP, TCP, DNS, TLS certificates), computes status against thresholds, and pushes the verdict to Observer Cloud over an authenticated channel. The agent is open source. Source at [github.com/useobserver/agent](https://github.com/useobserver/agent), licensed Apache-2.0. The runtime is Bun on a distroless container image; the source is TypeScript. ## Quickstart ## Adjacent sections - **Guides** cover per-probe configuration, dashboard reading, key rotation, and diagnosis paths. - **Reference** lists every environment variable, every probe type, every dashboard panel, and the heartbeat payload shape. - **Concepts** covers the agent / cloud boundary, the local queue, and the design choices behind the runtime. Documentation for status pages, SLOs, organisation setup, and the REST API is in the [Documentation](/docs) and [API](/api) tabs. This tab covers the agent only. --- url: https://docs.use.observer/api/getting-started/auth title: Authentication description: API keys, scopes, and how to authenticate requests against the public API. --- Every request to `/api/v1` carries an API key in the `Authorization` header. 
## Adjacent sections

- **Guides** cover per-probe configuration, dashboard reading, key rotation, and diagnosis paths.
- **Reference** lists every environment variable, every probe type, every dashboard panel, and the heartbeat payload shape.
- **Concepts** covers the agent / cloud boundary, the local queue, and the design choices behind the runtime.

Documentation for status pages, SLOs, organisation setup, and the REST API is in the [Documentation](/docs) and [API](/api) tabs. This tab covers the agent only.

---
url: https://docs.use.observer/api/getting-started/auth
title: Authentication
description: API keys, scopes, and how to authenticate requests against the public API.
---

Every request to `/api/v1` carries an API key in the `Authorization` header.

## Headers

```text
Authorization: Bearer <api key>
Content-Type: application/json   (on POST / PUT / PATCH)
```

## Key format

Public API keys begin with `obs_pub_` followed by an opaque base64url string. Keys are issued per organisation in the console under **API keys**. Each key is shown once at creation; the cloud stores only its hash and cannot recover the plaintext.

## Scopes

Each key carries a fixed set of scopes that gate which endpoints the key may call. The scopes available today:

| Scope | Grants |
|---|---|
| `read:services` | Read service entities. |
| `read:metrics` | Read metric definitions, current values, and aggregated history. |
| `read:slos` | Read SLOs and their current burn state. |
| `read:incidents` | Read incident updates published on status pages. |

Scopes are additive. A request against an endpoint whose required scope is not on the key returns `403`.

## Errors

The API returns RFC 7807 problem-detail responses:

```json
{
  "type": "/errors/unauthorized",
  "title": "missing or invalid bearer token",
  "status": 401
}
```

Per-endpoint scope requirements appear on each operation page in the sidebar.

Keys can be rotated through the console. Existing keys remain valid until you revoke them; revoke whenever an incident or personnel change requires it.

---
url: https://docs.use.observer/api/services/get-services
title: GET /services
description: List services
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/services" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/services/get-services-by-id
title: GET /services/{id}
description: Get service by id
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `id` | path | yes | string | Service id. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/services/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/metrics/get-metrics
title: GET /metrics
description: List metrics
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |
| `status` | query | no | string | Filter by current status. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/metrics" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded
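Every list endpoint shares the `{items, next_cursor}` envelope, so cursor paging is the same loop everywhere. A sketch with `curl` and `jq`; the `limit` value is illustrative:

```bash
# Walk every page of a list endpoint by following next_cursor until it is null.
cursor=""
while :; do
  resp=$(curl -s "https://api.use.observer/api/v1/metrics?limit=100${cursor:+&cursor=$cursor}" \
    -H "Authorization: Bearer $OBSERVER_API_KEY")
  echo "$resp" | jq -c '.items[]'                      # process one page of items
  cursor=$(echo "$resp" | jq -r '.next_cursor // empty')
  [ -z "$cursor" ] && break                            # null cursor marks the last page
done
```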
"https://api.use.observer/api/v1/metrics/{id}" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json null ``` ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/metrics/get-metrics-by-id-history title: GET /metrics/{id}/history description: Aggregated metric values over a window (max 30 days) --- ## Parameters | Name | In | Required | Type | Description | |------|------|----------|------|-------------| | `id` | path | yes | string | | | `from` | query | yes | string | | | `to` | query | no | string | | | `resolution` | query | no | string | | ## Example request ```bash curl -X GET "https://api.use.observer/api/v1/metrics/{id}/history" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json null ``` ### 400 — invalid range / resolution ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/metrics/post-metrics-by-id-status title: POST /metrics/{id}/status description: Set status on a manual metric (source_type='manual') --- ## Request body ```json null ``` ## Example request ```bash curl -X POST "https://api.use.observer/api/v1/metrics/{id}/status" \ -H "Authorization: Bearer YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d "null" ``` ## Responses ### 200 — ok ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 409 — metric is probed (not manual) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/slos/get-slos title: GET /slos description: List SLOs --- ## Parameters | Name | In | Required | Type | Description | |------|------|----------|------|-------------| | `limit` | query | no | integer | | | `cursor` | query | no | string | | ## Example request ```bash curl -X GET "https://api.use.observer/api/v1/slos" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json { "items": [ null ], "next_cursor": null } ``` ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/slos/get-slos-by-id title: GET /slos/{id} description: Get SLO with latest burn event --- ## Parameters | Name | In | Required | Type | Description | |------|------|----------|------|-------------| | `id` | path | yes | string | | ## Example request ```bash curl -X GET "https://api.use.observer/api/v1/slos/{id}" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json null ``` ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/incidents/delete-incidents-by-id title: DELETE /incidents/{id} description: Soft-delete incident --- ## Example request ```bash curl -X DELETE "https://api.use.observer/api/v1/incidents/{id}" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — deleted ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/incidents/get-incidents title: GET /incidents description: List incidents --- 
---
url: https://docs.use.observer/api/incidents/get-incidents
title: GET /incidents
description: List incidents
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |
| `state` | query | no | string | Filter by state. |
| `since` | query | no | string | Only incidents since this timestamp. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/incidents" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/get-incidents-by-id
title: GET /incidents/{id}
description: Get incident
---

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/incidents/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/patch-incidents-by-id
title: PATCH /incidents/{id}
description: Patch incident (title, severity, affected services, visibility)
---

## Request body

```json
null
```

## Example request

```bash
curl -X PATCH "https://api.use.observer/api/v1/incidents/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents
title: POST /incidents
description: Create incident (draft or published)
---

## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — created

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-by-id-messages
title: POST /incidents/{id}/messages
description: Append a timeline message; type=Resolved auto-resolves the parent
---

## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/{id}/messages" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-by-id-publish
title: POST /incidents/{id}/publish
description: Publish a draft incident
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/{id}/publish" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 409 — already published

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-by-id-resolve
title: POST /incidents/{id}/resolve
description: Resolve an incident with optional final message
---
## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/{id}/resolve" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 409 — already resolved

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-from-metric-by-metricId
title: POST /incidents/from-metric/{metricId}
description: Pre-fill a draft incident from the metric's current state (idempotent within 30 minutes)
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/from-metric/{metricId}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/get-maintenances
title: GET /maintenances
description: List maintenances
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |
| `state` | query | no | string | Filter by state. |
| `since` | query | no | string | Only maintenances since this timestamp. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/maintenances" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/get-maintenances-by-id
title: GET /maintenances/{id}
description: Get maintenance
---

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/maintenances/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/patch-maintenances-by-id
title: PATCH /maintenances/{id}
description: Edit maintenance (only allowed before actual_start_at is set)
---

## Request body

```json
null
```

## Example request

```bash
curl -X PATCH "https://api.use.observer/api/v1/maintenances/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 409 — already started

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/post-maintenances
title: POST /maintenances
description: Schedule a maintenance window
---

## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded
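A maintenance window moves from scheduled to in_progress to completed, either on its own schedule or through the manual transition endpoints documented below. A sketch of the manual path; the request-body field names are hypothetical, since the schema is not reproduced on these pages:

```bash
BASE="https://api.use.observer/api/v1"
AUTH="Authorization: Bearer YOUR_API_KEY"

# Schedule a window (field names hypothetical; check the actual schema).
mid=$(curl -s -X POST "$BASE/maintenances" -H "$AUTH" -H "Content-Type: application/json" \
  -d '{"title": "Database upgrade", "starts_at": "2026-05-20T02:00:00Z", "ends_at": "2026-05-20T04:00:00Z"}' \
  | jq -r '.id')

# Manually open, then close, the window. Once it has started, PATCH edits
# are rejected with 409, and cancel is only available before completion.
curl -s -X POST "$BASE/maintenances/$mid/start" -H "$AUTH"
curl -s -X POST "$BASE/maintenances/$mid/complete" -H "$AUTH"
```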
---
url: https://docs.use.observer/api/maintenances/post-maintenances-by-id-cancel
title: POST /maintenances/{id}/cancel
description: Cancel a maintenance before completion
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances/{id}/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/post-maintenances-by-id-complete
title: POST /maintenances/{id}/complete
description: Manually transition an in-progress maintenance to completed
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances/{id}/complete" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/post-maintenances-by-id-start
title: POST /maintenances/{id}/start
description: Manually transition a scheduled maintenance to in_progress
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances/{id}/start" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api
title: API reference
description: Observer's public REST API. Authenticated with API keys scoped per organisation.
---

The Observer API lives at `https://api.use.observer/api/v1`. All endpoints require an `Authorization: Bearer <key>` header. Keys are scoped per organisation and per capability (`read:metrics`, `write:incidents`, etc.). Pick an operation from the sidebar to view its parameters, request / response schemas, and a working curl example.
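A quick smoke test for a fresh key (assumes `read:services` is on the key; `-i` prints the status line so a `401` or `403` is visible immediately):

```bash
# Expect 200 with an {items, next_cursor} body; 401 means a bad key,
# 403 means the read:services scope is missing from the key.
curl -i "https://api.use.observer/api/v1/services?limit=1" \
  -H "Authorization: Bearer YOUR_API_KEY"
```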