# Observer Documentation (full content)
# Source: https://docs.use.observer
# Generated: 2026-05-14T06:26:59.026Z

---
url: https://docs.use.observer/docs/concepts/architecture
title: How Observer works
description: A high-level view of the agent, the cloud, and how they exchange data.
---

Observer has two parts:

- **Observer Agent**, a small process you run inside your network. It reads metrics from your existing observability stack (Prometheus, HTTP endpoints, TCP services, DNS, TLS certificates) and computes the status verdict locally.
- **Observer Cloud**, the control plane. It receives status pushes from agents, runs SLO evaluation, persists data, and renders status pages, dashboards, and the API.

```text
Observer Agent (your network)
  probes: Prometheus, HTTP, TCP, DNS, TLS
     │
     │  status push: metric_id, value, status, timestamp
     │  heartbeat / log
     ▼
Observer Cloud (control plane)
  SLO evaluation · status page · webhooks · API and audit
     │
     ▼
Customers
  status pages, API
```
status pages, API"] A -->|"status push: metric_id, value, status, timestamp"| C A -->|"heartbeat / log"| C C --> V`} /> ## What crosses the network Only the precomputed verdict crosses the boundary from your network to Observer Cloud. The push payload is: ```json { "metric_id": "", "value": , "status": "", "timestamp": "" } ``` Raw query strings (PromQL, HTTP request bodies, DNS resolver responses) do not leave the agent. The cloud has no path back into your network; it cannot pull from your Prometheus or hit your endpoints directly. ## What runs where | Concern | Location | | ------------------ | ------------------------------------------ | | Metric collection | Agent, in your network | | Status verdict | Agent, computed against the threshold rule | | SLO evaluation | Cloud, against pushed status | | Status page render | Cloud | | Webhook delivery | Cloud | | Audit log | Cloud | | Public API | Cloud | Observer Cloud is a closed-source SaaS. The Observer Agent is open source: source at [github.com/useobserver/agent](https://github.com/useobserver/agent), Apache-2.0 licensed. ## Operational implications - The agent must run in a network segment that can reach your metric sources. The cloud cannot reach them on your behalf. - The agent's own health is reported back to the cloud through heartbeats. The cloud surfaces a stalled agent as `agent.offline`. - The agent is stateless with respect to historical data: lost agents do not lose history, because all status pushes are persisted in the cloud. --- url: https://docs.use.observer/docs/concepts/metrics-vs-pings title: Why metrics, not pings description: The case for metric-based status over availability pings. --- Most status page tools assert availability with periodic pings: a GET request every 60 seconds against a public endpoint, with a green check when the response code is 2xx. Observer's default is to compute status from metrics you already collect, with pings as one source among many. The reasoning: ## Pings only see the public envelope A ping confirms a single endpoint accepted a single request at a single moment. It does not see: - The error rate served to actual customers in the last five minutes. - The 95th percentile latency under real load. - The depth of an internal queue draining slower than its inflow. - A degraded backend that has been masked by retries upstream. A page that reads green from pings while customers are filing support tickets is the standard failure mode of ping-based status. ## Metrics see the actual signal Observer's primary data source is your own metrics: Prometheus queries, HTTP probes that include body checks, TCP connection times, DNS resolution times, TLS certificate expiry. The status the public page shows is computed from the same numbers your on-call team already trusts on the internal Grafana dashboard. The result: when customers see red, the on-call's dashboard shows the same red, with the same threshold semantics. There is no gap. ## Pings still have a place For systems that do not emit metrics (third-party APIs, public DNS, certificates issued by external CAs), the agent supports HTTP, TCP, DNS, and TLS-cert probes directly. These produce a metric in the same shape as a Prometheus query: a numeric value with a timestamp, evaluated against thresholds. - Internal API health: Prometheus query against your own metrics. - Third-party API reachability: HTTP probe. - Certificate expiry: TLS-cert probe. - DNS health: DNS probe. Mix as many as the system needs. 
Each becomes a metric on the status page with its own threshold band. --- url: https://docs.use.observer/docs/concepts/slos-and-error-budgets title: SLOs and error budgets description: How service level objectives translate metric status into a contractual signal. --- A Service Level Objective (SLO) is a commitment that a metric will remain healthy for a defined fraction of a rolling window. SLOs turn the binary "is this healthy right now" question into a running balance: the **error budget**, which is the remaining allowance of unhealthy time. ## Definition An SLO has three core fields: - **Metric**: which metric the SLO observes. - **Target percentage**: the fraction of the window the metric must be `healthy`. Common values: 99, 99.5, 99.9, 99.95, 99.99. - **Window in days**: the rolling period the target applies to. Common values: 7, 30, 90. The window is rolling: at any instant, the SLO looks back N days and computes the fraction of that time the metric was healthy. There is no calendar boundary that resets the budget. ## Error budget Given a 99.9% target over 30 days, the budget allowance is: ```text allowance = 30 days * (1 - 99.9 / 100) = 30 days * 0.001 = 43.2 minutes per 30-day window ``` The budget burns whenever the metric is in the `unhealthy` state. It does not burn for `degraded`, `no_data`, or `unknown` (the [threshold operators reference](/docs/reference/threshold-operators) covers each). ## Burn events A burn event opens when the metric flips to `unhealthy` and the SLO drops below 100% remaining. It closes when the metric returns to healthy. Each burn event records its start, end, and the percent of the budget it consumed. Webhook subscribers receive `slo.burn_started` when an event opens and `slo.burn_resolved` when it closes. Pair the two by their `burn_event_id`. ## Picking a target The right SLO target reflects the system's actual achieved availability over the prior 90 days, plus a margin for the behaviour you want to drive. Three common starting points: - **99.5%** for a new service or unknown baseline. Loose enough that noise does not drive false alerts. - **99.9%** for a service with a stable history and a reasonable remediation pipeline. - **99.99%** for systems where customers feel every minute of unhealthy time. Requires investment in error-handling and rapid remediation; otherwise the target produces churn rather than signal. A target tighter than the system's achieved availability burns budget on noise and trains the on-call team to ignore alerts. Start at the 90-day baseline and only tighten as the underlying system improves. ## Per-customer targets Different customers can sign different SLO targets against the same underlying metric. The model and configuration steps live in [Customer scopes](/docs/concepts/customer-scopes). --- url: https://docs.use.observer/docs/concepts/customer-scopes title: Customer scopes description: Per-customer status pages, JWT-verified, with per-customer SLO targets. --- A customer-scoped page is one underlying status page that renders differently per customer. Each customer sees a filtered subset of metrics and SLOs, with optional per-customer thresholds applied at render time. The same page can therefore serve a `99.99%` agreement with one customer and a `99%` agreement with another, without duplicating the underlying metric work. ## Why customer scopes exist Enterprise contracts vary. The same backend that an SMB customer signs at `99.5%` may carry a `99.99%` clause for an enterprise customer with a higher-priced contract. 
Two implementation paths exist: 1. Duplicate the metric definition per customer, with different thresholds. 2. Define the metric once and apply per-customer thresholds at render time. Path 2 keeps a single source of truth for collection and evaluation. Customer scopes implement path 2. ## Identity model Customer scopes use JWT-based identity. The page's access mode is set to `customer_scoped`. Observer Cloud verifies tokens against the public key (or JWKS endpoint) configured on the page, then reads a configurable claim (typically `sub`, `customer_id`, or a custom claim) to determine which customer is viewing. A customer must be: 1. Defined in the organisation's customer list. 2. Bound to the page through the page's customer-binding list. A token whose claim does not resolve to a bound customer returns 403, even when the token's signature is valid. ## SLO overrides Each customer can carry per-SLO target overrides. When the customer-scoped page renders for that customer, the SLO strip uses the override target. Customers without an override see the default target. ```text SLO: checkout-api availability default target: 99.9% Customer A: no override. renders at 99.9% Customer B: override at 99.99%. renders at 99.99% Customer C: override at 99%. renders at 99% ``` The underlying metric and the burn evaluator remain unchanged. The only differences are the threshold the page renders and the per-customer error budget displayed. Customer scopes do not change the contractual obligation; they reflect it. The contract is what your legal team and the customer signed. Observer's customer scopes ensure each customer's view of the system mirrors the agreement they read. ## Configuration See [Configure customer-scoped pages](/docs/guides/customer-scoped-pages) for the step-by-step setup. --- url: https://docs.use.observer/docs/concepts/thresholds-and-dwell title: Thresholds and dwell description: How a metric's status is decided, and how dwell gating prevents flapping. --- A metric's status flips through three layers in order: 1. **Threshold evaluation**: the strict-operator rule applied to each pushed value (see [threshold operators](/docs/reference/threshold-operators)). 2. **Dwell gating**: a status only flips after holding the new state for the configured dwell period. 3. **Shadow mode (optional)**: a metric can be marked as shadowed for a window so it does not affect status pages or fire webhook events while operators tune it. This page covers steps 2 and 3. ## Why dwell exists A naive implementation would publish every status change the agent reports. In practice, metrics flap. A network blip pushes one bad sample, the next sample is fine, the on-call gets paged twice per minute. Dwell gating requires the new status to hold for a minimum duration before propagating. Configure two values per metric: - **Dwell to breach**: how long the metric must report the new status before flipping into a worse band (healthy → degraded or healthy → unhealthy). - **Dwell to recover**: how long the metric must report the new status before flipping into a better band (unhealthy → healthy). The defaults shipped with the create form are conservative: 60 seconds to breach, 300 seconds to recover. Asymmetric values ("quick to flag, slow to recover") are appropriate for systems where premature recovery announcements have a higher cost than a delayed unhealthy alert. 
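As a sketch, the gate can be thought of as a small state machine: a candidate status must hold continuously for its dwell period before it replaces the published status. The class and field names below are illustrative, not the agent's actual code (the agent is open source if you want the real thing):

```ts
type Status = "healthy" | "degraded" | "unhealthy";

const rank: Record<Status, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

// Illustrative dwell gate: a new status must hold for the configured dwell
// before it propagates. Defaults mirror the create form: 60s to breach,
// 300s to recover.
class DwellGate {
  private published: Status = "healthy";
  private candidate: Status | null = null;
  private candidateSince = 0;

  constructor(
    private dwellToBreachMs = 60_000,
    private dwellToRecoverMs = 300_000,
  ) {}

  observe(status: Status, nowMs: number): Status {
    if (status === this.published) {
      this.candidate = null; // a flap back to the published state resets the clock
      return this.published;
    }
    if (status !== this.candidate) {
      this.candidate = status; // new candidate: start its dwell clock
      this.candidateSince = nowMs;
    }
    const movingWorse = rank[status] > rank[this.published];
    const dwellMs = movingWorse ? this.dwellToBreachMs : this.dwellToRecoverMs;
    if (nowMs - this.candidateSince >= dwellMs) {
      this.published = status; // held long enough: commit the flip
      this.candidate = null;
    }
    return this.published;
  }
}
```

A single bad sample therefore changes nothing: its candidate clock starts, the next healthy sample resets it, and the published status never moves.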
## Status sources Three statuses come from the strict-operator rule: - `healthy` - `degraded` - `unhealthy` Two statuses come from collection-layer outcomes, not from values: - `no_data`: the agent attempted a probe but produced no value. The reason code is recorded alongside (`ECONNREFUSED`, `ETIMEDOUT`, `no_data_for_query`, etc.). - `unknown`: no recent push has arrived for the metric within the expected interval. `no_data` and `unknown` do not burn SLO budget by default; they are operational signals that surface in the agent dashboard and as `metric.no_data` webhook events. ## Stale data tolerance Dwell gating handles the small flaps. A separate read-time rule handles a much larger gap: what happens when no sample arrives at all, because the agent is crashed, the network between the agent and Observer Cloud is partitioned, or Observer Cloud itself is degraded. A metric is **stale** when its last push timestamp is older than three times its push interval, capped at 15 minutes: ``` threshold = min(3 × push_interval_minutes × 60s, 15 minutes) stale = (now - last_push_timestamp) > threshold ``` The 3× multiplier gives the agent one full retry-and-backoff window before a missing push is considered a problem. The 15-minute hard cap stops a slow push cadence from masking a multi-hour outage on a high-importance metric. Staleness is computed at read time. The database always carries whatever the agent last pushed. The status-page and embed renderers make the call independently each time they load. A stale metric is **excluded** from the service rollup. It is **not** counted toward SLO burn. The `metric.status_changed` and `metric.no_data` webhooks are **not** fired on staleness transitions. What does fire: `agent.lag_high` and `agent.offline`, which speak to the actual cause — they are operator-facing. When every metric on a service is stale, the service renders as `monitoring_delayed` with a "Last known: Operational" caption alongside. See [Observer availability](/docs/concepts/observer-availability) for the full trust contract. ## Shadow mode A metric can be marked shadowed until a future timestamp. While shadowed: - The metric still pushes status to the cloud. - Status pages do not consume the shadowed metric in the rolled-up page status. - Webhook events for the shadowed metric are suppressed. - The metric's history is still recorded for later inspection. Use shadow mode when introducing a new metric, tuning its threshold, or rolling out a new probe runtime. Once the metric behaves as expected, clear the shadow timestamp and it joins the public status surface. Threshold evaluation is strict in both the agent and in the read path that renders status pages. The same value cannot flip status depending on which surface read it. See the [threshold operators reference](/docs/reference/threshold-operators) for examples. --- url: https://docs.use.observer/docs/concepts/observer-availability title: Observer availability description: What happens when Observer Cloud is degraded, when the agent stops pushing, and why your customers will not see a red status page because of our outage. --- A status page exists to be honest with your customers. If the status page itself becomes a source of misinformation when the monitoring infrastructure has a bad day, it is worse than no status page at all. This document is the contract for what Observer does when Observer itself is degraded, when the agent in your network goes silent, or when the link between them is broken. 
## The two failure modes that look identical A metric ends up without a recent value for two reasons. They look identical in the database; they mean very different things. - **The probe ran and got nothing.** The agent reached out to your Prometheus, your HTTP endpoint, your TLS certificate, and could not produce a value. The query returned empty, the connection was refused, the TLS handshake failed. This is a real signal about your service. It is `no_data` with a reason code. - **The agent has not pushed anything recently.** The agent crashed. The host running it lost the network. Observer Cloud could not accept the push. This is a signal about our or your monitoring infrastructure, not your service. It is `stale`. We treat these two cases differently. ## What "stale" means A metric is stale when its last push timestamp is older than three times its push interval, capped at 15 minutes. The 3× allows for the agent's normal retry-and-backoff window before declaring a problem. The 15-minute cap prevents a slow push cadence from masking a multi-hour outage. Staleness is computed at read time. Nothing about the metric's stored row changes; the database still carries whatever the agent last pushed. The status page, the embed widget, and the SLO calculator each apply the rule independently when they load. ## What happens when your agent stops pushing Every metric driven by that agent becomes stale within minutes. Your customer-facing status page does **not** flip to red. Each stale metric is excluded from the live rollup. If only some metrics on a service are stale, the service rollup uses the fresh metrics and shows a small "X of N metrics delayed" caption. If every metric on a service is stale, the service renders as **Monitoring delayed**, with a muted "Last known: Operational" (or whatever it was) pill alongside. Your SLOs do **not** burn during the stale window. Observer's SLO calculation counts samples, not wall-clock time, so a quiet agent contributes zero to both the numerator (good samples) and the denominator (total samples). The error budget freezes in place until the agent resumes pushing. We do not fire `metric.status_changed` or `metric.no_data` webhooks on staleness transitions. We do fire `agent.lag_high` and `agent.offline` — those are operator-facing and tell you the actual cause. ## What happens when Observer Cloud is degraded The agent's local SQLite buffer holds metric pushes for up to 24 hours of normal traffic. When the cloud receiver recovers, the agent drains the buffer in order. Status pages catch up to the real customer state as the backlog clears. While the cloud is degraded: - Status pages continue to serve whatever last-known status was cached by their last successful render. Most pages tolerate a full read-side outage for several minutes before any user-visible effect. - The same staleness rule applies: as time without a fresh sample exceeds the threshold, services on the page roll up to **Monitoring delayed** rather than flipping to red. - No false-positive webhooks fire. The operator-facing `agent.lag_high` event tracks the cloud-side outage from each agent's vantage point. ## Will my SLO burn during these gaps? No. Observer's SLO computation is sample-counting, not time-counting. The error budget is `(good_samples - target × total_samples) / ((1 - target) × total_samples) × 100`. When the agent is silent, no samples are added to either side of that fraction, so the budget remains exactly where it was when the last push arrived. 
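The freeze property falls straight out of that formula. A sketch of the arithmetic (not Observer's implementation):

```ts
// Remaining error budget as a percentage, per the sample-counting formula above.
// target is a fraction, e.g. 0.999 for a 99.9% SLO.
function budgetRemainingPct(goodSamples: number, totalSamples: number, target: number): number {
  if (totalSamples === 0) return 100; // no samples yet: budget untouched
  return ((goodSamples - target * totalSamples) / ((1 - target) * totalSamples)) * 100;
}

// 30 days of 1-minute samples = 43,200 samples; a 99.9% target allows 43.2 bad ones.
budgetRemainingPct(43_200, 43_200, 0.999); // 100: fully healthy
budgetRemainingPct(43_178, 43_200, 0.999); // about 49: roughly half the budget gone
// A silent agent adds to neither argument, so the result does not move.
```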
When the agent resumes, the new samples land in the same window and contribute on their own merits. A short gap during which your service was actually unhealthy does not get retroactively counted as a healthy window — there is just no data for it. This trade-off has a name. Observer is **explicit about absence**: when we cannot say, we say so. We do not infer healthy minutes between samples and we do not infer unhealthy minutes either. ## How is this different from a status page that just lies? Some hosted status page platforms will hold a service at "All Systems Operational" indefinitely as long as no one publishes an incident. They are silent on the question of whether their monitoring is still working. Observer is loud about it. Pages render **Monitoring delayed** when we cannot confirm health. The last-known status sits alongside in a muted pill so your customer can see the trajectory rather than a blank red square. We would rather say "we don't currently know" than guess. The following are part of the published contract. We hold ourselves to them in production: - A stale metric is excluded from the service rollup, full stop. - A stale metric does not burn SLO budget. - Staleness transitions do not fire customer-facing webhooks. - The "Monitoring delayed" rollup carries the last-known status. - The push-interval policy is `>= 10 minutes` on Free/Starter and `>= 5 minutes` on Pro and Enterprise. Tightening below those bounds is gated by the plan validator. ## Layered fallbacks (planned) Two additional layers are scheduled post-launch and not part of the current contract. They harden the public read path against a full Observer Cloud outage: - **Edge-cached status pages.** A Cloudflare edge cache with stale-while-revalidate semantics serves the last-rendered HTML during a cloud outage. Customers see the same page they would have seen a minute earlier, with a small "served from cache" tag. - **Independently-hosted static fallback.** A Cloudflare Worker with status snapshots in KV serves a minimal page even if the origin is fully unreachable. Same staleness rules apply. These are scheduled work, not promises. The current contract above is what you get today. --- url: https://docs.use.observer/docs/concepts/incidents-and-metrics title: Incidents and metrics description: How customer-facing incidents relate to metric-driven status. --- Observer's status model has two layers: 1. **Metrics drive status by default.** A metric flips to `unhealthy` when its measured value crosses the threshold; the page status rolls up from the worst metric. No human action required. 2. **Incidents are the customer comm layer on top.** An incident is what the operator publishes to explain context — what is broken, what we know, what we are doing about it. Both layers can fire independently, and they often do. ## Why two layers A metric flip is an automated signal. The threshold breach happened at 14:32:18 because the agent reported 4.2% errors and the unhealthy rule said `over 2%`. That is precise, but it is not customer communication. Customers want to know: - Are you aware? - What is the impact? - When will it be fixed? - How will I know it is fixed? Those are operator-authored sentences. The metric flip cannot answer them on its own. ## How they relate at runtime The page status that customers see is **only** driven by metrics. Posting an incident does not change page status; resolving an incident does not change page status. Status is the measured truth; incidents are the human commentary. 
The exception is **manual metrics** (see [Manual metrics](/docs/concepts/manual-metrics)): when an open incident lists a service, the manual metrics on that service auto-set their status to mirror the incident severity. This is the case where incidents drive status — by design — because manual metrics have no probe to measure them.

## The "draft from metric" flow

When a metric flips unhealthy, the metric edit page surfaces a **Draft incident** CTA. One click pre-fills:

- Title: `Investigating: <metric name>`
- Severity: `major` if the metric is unhealthy, `minor` if degraded
- Affected services: every service that has an SLO bound to this metric
- Initial message: `Investigating <metric name>. Current status: <status>`

The operator reviews, edits if needed, and publishes. The "metric flipped → I need to update status" loop drops from minutes to one click.

The CTA is idempotent within 30 minutes: a second click on the same metric in the same window surfaces the existing draft instead of creating a duplicate.

## Auto-drafts (opt-in)

The same "draft from metric" path can run automatically. Opt a metric in via the **Automatic incident creation** section on its edit form (Pro+). When the metric flips unhealthy, Observer creates the draft for you and emails your org owners with publish / dismiss buttons.

The auto flow shares the same dedup rule as the manual CTA — if an open incident already affects the metric's service, a message is appended to the existing incident instead of opening a new one. Per-metric cooldown is one hour. Drafts that go unactioned for 24 hours auto-expire.

See the full setup walkthrough at [Auto-incident creation](/docs/guides/auto-incident-creation).

If you have not yet read the threshold semantics doc, read [Thresholds and dwell](/docs/concepts/thresholds-and-dwell) first. The "metric is unhealthy" claim assumes you understand the comparison rule and dwell gating.

---
url: https://docs.use.observer/docs/concepts/manual-metrics
title: Manual metrics
description: When the agent can't measure it, set the status explicitly.
---

Most Observer metrics are probed: an agent runs a check on a schedule and reports the result. Manual metrics are the escape hatch for signals that have no automation, or where the operator wants to control the status surface explicitly.

## When to use a manual metric

- A signal that has no observability today (a third-party SaaS outage, a vendor dependency, an internal system without instrumentation).
- A high-level rollup that should follow operator judgment, not a noisy underlying measurement.
- A service whose status is gated on a contract with a vendor (where Observer should reflect what the vendor says, not what we measure).

## What is different

| Aspect | Probed metric | Manual metric |
|---|---|---|
| `source_type` | `prometheus`, `http`, `tcp`, etc. | `manual` |
| Agent involvement | Agent runs the probe and pushes status. | Agent never sees the metric (filtered at the definitions endpoint). |
| Status transitions | Threshold + dwell gating against the measured value. | Explicit set via UI / API / incident. |
| Webhook payload | `source: "probe"` on flips. | `source: "manual"` or `"incident"`. |

## How status flips

Three paths set status on a manual metric:

1. **Console UI**: the metric detail page shows a clickable status pill. Owner-tier users can pick a new status from the dropdown.
2. **API**: `POST /api/v1/metrics/{id}/status` with `{"status": "unhealthy"}`. Scope: `write:metrics`.
3.
**Incident impact**: when an open incident lists a service that contains a manual metric, that metric auto-flips to mirror the incident severity. On resolve, it returns to its last explicitly-set status (default `healthy`). ## Threshold model Manual metrics carry no thresholds. Status is set directly. The `healthy_*` / `unhealthy_*` columns on the metric definition are ignored; the form hides the threshold section when source type is `manual`. Every manual transition writes an audit log row of action `metric.status.set_manually` with the actor, source, and old → new status. Useful when reconstructing why a public page rendered a particular status at a given time. --- url: https://docs.use.observer/docs/concepts/incident-slo-impact title: Incident SLO impact description: How the auto-impact panel computes burn rate and time to budget exhaustion. --- When an incident lists affected services, every SLO bound to those services contributes to the auto-impact panel. The panel updates every 30 seconds while an incident is open and freezes on resolve. ## What gets computed For each affected SLO: - **Burn during incident**: total seconds the metric was `unhealthy` between the incident's `published_at` and either `resolved_at` or now, whichever is earlier. - **Percent of budget consumed**: burn seconds divided by the SLO's total budget seconds. Total budget = window seconds × (1 − target%). - **Total budget remaining**: read from `slos.error_budget_remaining_pct` (populated by the SLO eval scheduler tick, not recomputed in the panel). - **Time to exhaust**: at the current burn rate (burn seconds / incident duration seconds), how long until the remaining budget reaches zero. Reported in minutes; null when the burn rate is zero. ## Caching Repeated panel polls within 30 seconds reuse the same computation (in-memory cache keyed by incident id). This protects the SLO eval pipeline from hammering when an open dashboard polls every 30s. ## Sources of error - The metric history table is the source of truth for burn. If the agent missed pushes during the incident, those gaps are not counted as unhealthy. - The remaining-budget % comes from the most recent SLO eval tick. If the scheduler fell behind, the value can be stale by a few minutes. The burn-during-incident value is always fresh. - Time-to-exhaust extrapolates a linear burn rate. Real systems rarely sustain a linear rate; treat the number as a rough budget rather than a precise countdown. ## Public visibility The auto-impact panel is console-only by default. A per-incident toggle exposes a slimmed view (burn % only, no time-to-exhaust) on the public incident page. Some operators choose to surface it for transparency; others view it as internal-only. The default is off. --- url: https://docs.use.observer/docs/quickstart/first-metric title: Define your first metric description: Install the agent, define a metric backed by a Prometheus query, and report status to Observer Cloud. --- This page walks through installing the Observer agent, defining a metric backed by a Prometheus query, and confirming that the cloud receives status pushes. ## Prerequisites - A Prometheus server reachable from the host or cluster that will run the agent. - A container runtime (Docker or Kubernetes) or a Linux host with systemd. - An Observer Cloud account. Sign up at [use.observer](https://use.observer). Observer also supports HTTP, TCP, DNS, and TLS certificate probes. 
Prometheus is documented first because most operators already run it, and the agent's reported value is straightforward to verify against a number you already trust.

## Steps

### Create an organisation

Sign in at [use.observer](https://use.observer/console/auth) and create an organisation. The organisation slug becomes the URL path under `/console/` and defines the tenant boundary for every resource below.

### Create an agent and copy its key

In the console, open **Agents**, then **New agent**. Provide a name (typically the hostname) and submit. The next screen reveals the agent key once. Copy it before navigating away.

The key has the form `obs_live_<43 base64url characters>`. Observer Cloud stores its hash, never the plaintext. A lost key requires rotation through the console.

### Run the agent

Pick the runtime that matches your environment. The container exposes a debug dashboard on port `10101`.

```bash title="docker run"
docker run -d \
  --name observer-agent \
  -p 10101:10101 \
  -e AGENT_KEY=obs_live_... \
  -e CLOUD_SERVER_URL=https://use.observer \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  ghcr.io/useobserver/agent:1.0.1
```

```yaml title="agent.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: observer-agent
spec:
  replicas: 1
  selector: { matchLabels: { app: observer-agent } }
  template:
    metadata: { labels: { app: observer-agent } }
    spec:
      containers:
        - name: agent
          image: ghcr.io/useobserver/agent:1.0.1
          ports: [{ containerPort: 10101 }]
          env:
            - name: AGENT_KEY
              valueFrom: { secretKeyRef: { name: observer, key: agent-key } }
            - name: CLOUD_SERVER_URL
              value: https://use.observer
            - name: PROMETHEUS_SERVER_URL
              value: http://prometheus.monitoring:9090
```

Verify the connection. With Docker, browse `http://localhost:10101` on the host running the container. In Kubernetes, port-forward the deployment with `kubectl port-forward deploy/observer-agent 10101:10101` and open `http://localhost:10101`. The dashboard's *Cloud* panel shows a recent `last_heartbeat_at`. The Agents page in the console marks the agent as **running** within roughly 90 seconds.

### Define a metric

In the console, open **Metrics**, then **New metric**. Select the agent created above and set the source type to **Prometheus**. Enter a query that returns a single scalar. A standard example is the five-minute 5xx error ratio:

```text title="PromQL"
rate(http_requests_total{job="checkout-api",status=~"5.."}[5m])
/
rate(http_requests_total{job="checkout-api"}[5m])
```

Set thresholds:

- Healthy: `under 0.005` (less than 0.5% errors).
- Unhealthy: `over 0.02` (greater than 2% errors).

Values that match neither boundary resolve to `degraded`.

Threshold operators are strict. A value exactly equal to a threshold under `over` or `under` does not match the band. Configure thresholds with this in mind: a 0.5% healthy boundary expressed as `under 0.005` does not call exactly `0.005` healthy.

Set **Interval** to `1` minute and save.

### Confirm reporting

Within one push interval the metric appears in the Metrics list with its current status. Open the metric to see the latest value, last push timestamp, and rolling history.

To verify the round trip, lower the unhealthy threshold below the current value. The metric flips to `unhealthy` on the next push. Restore the original threshold and the metric returns to `healthy`.

## Result

```text
Prometheus → Observer Agent → Observer Cloud → status pages
             (your network,    (control plane)
              debug on :10101)
```

The agent computes status client-side and pushes `{ metric_id, value, status, timestamp }` only. Raw query strings stay in your network.
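For clarity, the strict-operator rule the agent applies is small enough to sketch in full. This is an illustration with hypothetical names, not the agent's actual code:

```ts
type Status = "healthy" | "degraded" | "unhealthy";
type Op = "over" | "under";

interface Band { op: Op; value: number }

// Strict comparison: a sample exactly equal to the boundary matches neither band.
const matches = (sample: number, band: Band): boolean =>
  band.op === "over" ? sample > band.value : sample < band.value;

function verdict(sample: number, healthy: Band, unhealthy: Band): Status {
  if (matches(sample, unhealthy)) return "unhealthy"; // worst band wins
  if (matches(sample, healthy)) return "healthy";
  return "degraded"; // matched neither boundary
}

// With the quickstart thresholds (healthy under 0.005, unhealthy over 0.02):
const healthy: Band = { op: "under", value: 0.005 };
const unhealthy: Band = { op: "over", value: 0.02 };
verdict(0.001, healthy, unhealthy); // "healthy"
verdict(0.005, healthy, unhealthy); // "degraded": strict, equality does not match
verdict(0.04, healthy, unhealthy);  // "unhealthy"
```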
## Next

- [Define your first SLO](/docs/quickstart/first-slo)
- [Publish your first status page](/docs/quickstart/first-status-page)
- [Observer Agent reference](/agent)

---
url: https://docs.use.observer/docs/quickstart/first-metric-http
title: Define your first metric (HTTP probe)
description: Install the agent, define a metric backed by an HTTP probe, and report status to Observer Cloud.
---

This page walks through installing the Observer agent, defining a metric that probes an HTTP endpoint directly, and confirming that the cloud receives status pushes. Use this path when no Prometheus server is in place, or when the signal you want to measure is the endpoint's reachability and response time itself.

## Prerequisites

- An HTTP endpoint reachable from the host or cluster that will run the agent.
- A container runtime (Docker or Kubernetes) or a Linux host with systemd.
- An Observer Cloud account. Sign up at [use.observer](https://use.observer).

HTTP probes report `response_time_ms` for successful requests and `no_data` with a reason code on failure. Prometheus probes evaluate a PromQL query that already reflects the system's own observation of itself. Pick HTTP when the question is "is this endpoint reachable and fast"; pick Prometheus when the question is "is this metric within bounds". The [Prometheus quickstart](/docs/quickstart/first-metric) covers the latter.

## Steps

### Create an organisation

Sign in at [use.observer](https://use.observer/console/auth) and create an organisation. The organisation slug becomes the URL path under `/console/` and defines the tenant boundary for every resource below.

### Create an agent and copy its key

In the console, open **Agents**, then **New agent**. Provide a name (typically the hostname) and submit. The next screen reveals the agent key once. Copy it before navigating away.

### Run the agent

HTTP probes do not require Prometheus. Omit `PROMETHEUS_SERVER_URL` from the agent's environment when no Prometheus probes are defined.

```bash title="docker run"
docker run -d \
  --name observer-agent \
  -p 10101:10101 \
  -e AGENT_KEY=obs_live_... \
  -e CLOUD_SERVER_URL=https://use.observer \
  ghcr.io/useobserver/agent:1.0.1
```

```yaml title="agent.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: observer-agent
spec:
  replicas: 1
  selector: { matchLabels: { app: observer-agent } }
  template:
    metadata: { labels: { app: observer-agent } }
    spec:
      containers:
        - name: agent
          image: ghcr.io/useobserver/agent:1.0.1
          ports: [{ containerPort: 10101 }]
          env:
            - name: AGENT_KEY
              valueFrom: { secretKeyRef: { name: observer, key: agent-key } }
            - name: CLOUD_SERVER_URL
              value: https://use.observer
```

Verify the connection. With Docker, browse `http://localhost:10101` on the host running the container. In Kubernetes, port-forward the deployment with `kubectl port-forward deploy/observer-agent 10101:10101` and open `http://localhost:10101`. The dashboard's *Cloud* panel shows a recent `last_heartbeat_at`. The Agents page in the console marks the agent as **running** within roughly 90 seconds.

### Define an HTTP metric

In the console, open **Metrics**, then **New metric**. Select the agent created above and set the source type to **HTTP**. Configure the probe:

- **URL**: the full URL the agent should hit, for example `https://api.example.com/healthz`.
- **Method**: `GET`.
- **Expected status**: `200` (the probe reports `no_data` with `unexpected_status:<code>` for any other code).
- **Timeout (ms)**: `5000`. The probe reports `ETIMEDOUT` if the request takes longer.
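Conceptually, each probe run measures elapsed time and maps failures to reason codes. A sketch under stated assumptions (Node 18+ `fetch`; names are hypothetical, and the agent's real implementation and full reason-code list are in [Configure HTTP probes](/agent/guides/http-probes)):

```ts
type ProbeOutcome =
  | { ok: true; responseTimeMs: number } // pushed as the metric value
  | { ok: false; reason: string };       // pushed as no_data plus a reason code

async function runHttpProbe(url: string, expectedStatus = 200, timeoutMs = 5000): Promise<ProbeOutcome> {
  const started = Date.now();
  try {
    const res = await fetch(url, { method: "GET", signal: AbortSignal.timeout(timeoutMs) });
    if (res.status !== expectedStatus) {
      return { ok: false, reason: `unexpected_status:${res.status}` };
    }
    return { ok: true, responseTimeMs: Date.now() - started };
  } catch (err) {
    // Simplified: real reason codes distinguish timeouts, refused connections,
    // DNS failures, and so on.
    const timedOut = err instanceof Error && err.name === "TimeoutError";
    return { ok: false, reason: timedOut ? "ETIMEDOUT" : String(err) };
  }
}
```

A successful run yields `response_time_ms`, which then goes through the same threshold rule as any other metric value.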
Set thresholds against `response_time_ms`:

- Healthy: `under 500` (response under 500ms).
- Unhealthy: `over 2000` (response over 2 seconds).

Values that match neither boundary resolve to `degraded`.

For endpoints that return 200 even when the underlying service is degraded, set **Body match** to a marker string that only appears in the healthy response (for example, `"status":"ok"`). The probe reports `body_mismatch` if the response body is missing that string. Only the first 4KB of the response is read.

Set **Interval** to `1` minute and save. The probe runs every minute and pushes `response_time_ms` plus the resolved status to the cloud.

### Confirm reporting

Within one push interval the metric appears in the Metrics list with its current status. Open the metric to see the latest value, last push timestamp, and rolling history.

To verify the round trip, lower the unhealthy threshold below the current response time. The metric flips to `unhealthy` on the next push. Restore the original threshold and the metric returns to `healthy`.

## Probe behaviour

The agent computes status client-side. The cloud receives only the verdict:

```text
{ metric_id, value: <response_time_ms>, status: <verdict>, timestamp }
```

The full HTTP request runs from the agent's vantage point. The cloud has no path to the endpoint. Request bodies, response bodies, and headers stay in your network.

The full reason-code list and field reference are in [Configure HTTP probes](/agent/guides/http-probes).

## Next

- [Define your first SLO](/docs/quickstart/first-slo)
- [Publish your first status page](/docs/quickstart/first-status-page)
- [Configure HTTP probes](/agent/guides/http-probes) covers the full per-field reference: redirects, custom headers, TLS verification, and body matching.

---
url: https://docs.use.observer/docs/quickstart/first-slo
title: Define your first SLO
description: Attach a service level objective to a metric and read the error budget.
---

A Service Level Objective (SLO) wraps an existing metric in a target: a percentage of a rolling window during which the metric must report `healthy`. The gap between the target and the actual healthy time is tracked as an **error budget**. When the metric reports `unhealthy`, the budget burns. When it recovers, the burn stops. The budget surfaces on status pages and in webhook events.

## Prerequisites

- A reporting metric. If one is not in place, complete [Define your first metric](/docs/quickstart/first-metric) first.

## Steps

### Create a service

Services group related SLOs and render as a row on status pages. In the console, open **Services**, then **New service**. Name it after the system the SLOs describe, for example `checkout-api`. The description field is optional.

### Define the SLO

Open the service, then **SLOs**, then **New SLO**. Configure:

- **Metric**: the metric defined in the previous quickstart page.
- **Target**: percentage of the window the metric must remain healthy. A common starting value is `99.9`.
- **Window**: rolling window in days. A common starting value is `30`.
- **Public**: enables rendering on customer-facing status pages.

Save the SLO.

The right target depends on the system's achieved availability over the prior 90 days. If that data is not available, start with `99.5%` and tighten once the SLO has accumulated a few weeks of history. A target tighter than reality burns budget on noise and loses signal value.

### Read the burn timeline

Open the SLO. The detail page reports:

- **Error budget remaining**: percent of the window's allowance still available.
  At `99.9% / 30 days`, the allowance is roughly 43 minutes. Below 100% indicates the metric has been unhealthy for some of the window.
- **Latest burn event**: the current or most recent unhealthy stretch, including its start, end (or a marker indicating it is still open), and the percent of the budget burned.
- **History**: prior burn events in the window, with duration and budget cost.

The evaluator runs once per minute. If the metric flipped to unhealthy during the previous quickstart page, a burn event is visible here.

### Subscribe to webhook events

If the organisation's plan includes outbound webhooks, open **Webhooks**, then **New subscription**. The events relevant to SLOs are:

- `slo.burn_started`: an SLO crossed below its target. The payload includes the `slo_id`, `service_id`, `started_at`, and the current `error_budget_burned_pct`.
- `slo.burn_resolved`: the SLO recovered. The payload includes the matching `burn_event_id` and the `final_budget_remaining_pct`.

Wire deliveries to PagerDuty, Slack, or any HTTPS endpoint that accepts JSON. Endpoint quotas vary by plan.

## Calculation

Each evaluator tick reads the metric's last status, updates the moving window, and recomputes:

```text title="error budget"
budget_burned        = total seconds in unhealthy status, within the window
budget_total         = window_seconds * (1 - target / 100)
budget_remaining_pct = 100 * (1 - budget_burned / budget_total)
```

Statuses other than `unhealthy` (`degraded`, `no_data`, `unknown`) do not burn budget. Brief `degraded` flickers therefore do not consume the allowance on their own.
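The same arithmetic as a runnable sketch, using the `99.9% / 30 days` numbers from this page:

```ts
// Error budget per the formula above. targetPct is a percentage, e.g. 99.9.
function errorBudget(windowDays: number, targetPct: number, burnedSeconds: number) {
  const windowSeconds = windowDays * 86_400;
  const budgetTotalSeconds = windowSeconds * (1 - targetPct / 100);
  return {
    budgetTotalSeconds, // 2,592 s, roughly 43.2 minutes, for 99.9% / 30 days
    remainingPct: 100 * (1 - burnedSeconds / budgetTotalSeconds),
  };
}

errorBudget(30, 99.9, 0);    // remainingPct: 100
errorBudget(30, 99.9, 1296); // remainingPct: 50, half the 43.2-minute allowance burned
```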
Customers can carry per-SLO target overrides. See [Customer scopes](/docs/concepts/customer-scopes) for the model.

## Next

- [Publish your first status page](/docs/quickstart/first-status-page)
- [Webhook payload reference](/docs/reference/webhook-payloads) covers `slo.burn_started` and `slo.burn_resolved` payload shapes.

---
url: https://docs.use.observer/docs/quickstart/first-status-page
title: Publish your first status page
description: Compose services, metrics, and SLOs into a customer-facing status page on a subdomain.
---

Status pages are the customer-facing surface of the resources configured in the previous two quickstart pages: services, metrics, SLOs, and incident updates. This page covers creating a page, adding content blocks, and shipping it on a subdomain.

## Prerequisites

- A reporting metric (see [Define your first metric](/docs/quickstart/first-metric)).
- Optionally, an SLO (see [Define your first SLO](/docs/quickstart/first-slo)). Pages without SLOs render correctly but lose the rolling availability signal.

## Steps

### Create the page

In the console, open **Pages**, then **New page**. Configure:

- **Title**: the heading rendered at the top of the page.
- **Subdomain**: the URL path, served as `<subdomain>.use.observer`. Lowercase letters, digits, and hyphens are accepted. The values `admin` and `blog` are reserved.
- **Theme**: pick a preset; further customization is available in the page builder.

Save the page. It is now reachable but contains no content blocks.

### Add metric blocks

Open the page in the builder. Drag the **Metrics** block onto the canvas and select the metric defined earlier. Metrics can be grouped by namespace to mirror service topology:

```text title="example grouping"
api/
  checkout-api
  payment-router
web/
  dashboard
```

Each group's name renders as a section heading on the published page, and the metrics within share a status row.

### Add the SLO strip

Drag the **SLO strip** block. Select the SLO. The strip renders the target, the window, the remaining error budget, and the current burn event.

The block requires the SLO's **Public** flag to be enabled. If the strip does not render after publish, open the SLO and toggle public visibility on.

### Publish updates and incidents (optional)

The **Updates** block surfaces incident posts at the top of the page. Open the page's update feed, create an Update with type `Incident`, and the page renders the incident with its timeline and follow-up posts.

Updates are not required to publish the page; this step shows where they appear when an incident is in progress.

### Visit the page

`<subdomain>.use.observer` resolves to the rendered page. On a local development cloud the URL is `http://<subdomain>.localhost:3000`. In production it follows the wildcard DNS the operator has pointed at the cloud.

## Result

The page renders a header with the rolled-up status and any active SLOs, followed by a section per metric group, followed by an incident timeline. Every value reflects live metric data computed against the configured thresholds and SLO targets.

## Next

- [Observer Agent reference](/agent) covers probe types, on-host configuration, and dashboard panels.
- Guides on customer-scoped pages, password protection, and theme customization will appear in the sidebar as content is published.

---
url: https://docs.use.observer/docs/quickstart/first-incident
title: File your first incident
description: Walk a draft → publish → update → resolve incident through the console.
---

Incidents are the customer-facing comm layer that sits on top of metric-driven status. This page walks through filing one end to end: draft, publish, append a follow-up, resolve. Every step has an API equivalent (see the API tab) for IR automation.

## Prerequisites

- An organisation with at least one service. If services are not yet defined, create one under **Services** > **New service** before starting.
- Optional: an SLO bound to the service. The auto-impact panel on the incident detail page only renders when an affected service has at least one SLO.

## Steps

### Open the new-incident form

In the console, navigate to **Updates** > **Post update**. Pick **Incident** as the type. The form prompts for severity, title, affected services, and customer visibility.

### Fill the headline

- **Severity**: minor, major, or critical. The badge color on the public page follows this value.
- **Title**: one customer-facing sentence (avoid jargon and internal IDs; this is what visitors will see at the top of the timeline).
- **Affected services**: pick at least one. The auto-impact panel reads this list to compute SLO burn during the incident.
- **Visibility**: leave empty for a public incident. Pick specific customers to scope the incident to those tenants only.

### Decide draft vs publish

The form has two submit buttons:

- **Save as draft**: creates the row but does not publish. The incident is editable; nothing renders on the public page or fires to webhook subscribers.
- **Publish now**: sets `published_at = now()`, fires `incident.published`, and renders the incident on the public page.

For the first incident, choose **Save as draft** so you can review the form before going customer-facing. The next step publishes.

### Append the first message

Open the incident from the **Updates** list. The detail page shows an **Auto-impact** panel (live SLO burn, polled every 30 seconds), the message timeline, and a **New message** popover. Add a message of type **Investigating** with a brief description.
The public page renders messages in chronological order under the incident header. ### Publish the incident Use the right-side rail action to publish. The incident now renders on the public page. Webhook subscribers receive `incident.published` and `incident.message_added` events. ### Resolve When the underlying issue clears, append a final message of type **Resolved**. Observer auto-marks the parent incident resolved (`resolved_at` populated, lifecycle pill flips to `resolved`, `incident.resolved` webhook fires). Every step above maps to a `/api/v1/incidents/*` endpoint. The [API reference](/api) documents the full surface, including `POST /incidents/{id}/publish`, `POST /incidents/{id}/messages`, and `POST /incidents/{id}/resolve`. ## Related - [Incident lifecycle reference](/docs/reference/incident-lifecycle) - [Customer-scoped incidents](/docs/concepts/customer-scopes) - [Webhook payload reference](/docs/reference/webhook-payloads) --- url: https://docs.use.observer/docs/quickstart/first-maintenance title: Schedule your first maintenance description: Schedule a maintenance window with auto-start and auto-complete. --- Maintenance windows differ from incidents in two ways: they are planned in advance, and Observer auto-transitions them through their lifecycle on a cron tick (so you do not have to remember to mark "started" / "completed" manually). ## Steps ### Open the new-maintenance form In the console, navigate to **Updates** > **Post update**. Pick **Maintenance** as the type. The form replaces the severity field with a **Scheduled start** + **Scheduled end** pair. ### Set the window Pick the start and end times in your local timezone. The cron auto-transitions the maintenance through: - `scheduled` → `in_progress` when `now() >= scheduled_start_at`. - `in_progress` → `completed` when `now() >= scheduled_end_at`. A `maintenance.starting_soon` webhook fires one hour before `scheduled_start_at` (idempotent; once per maintenance row). ### Pick affected services The public page renders a banner for the maintenance starting within 24 hours and a sticky banner while in progress. The banner lists affected services so customers know which surfaces are impacted. ### Publish Maintenances always publish on save (drafts are an incident-only flow). The page banner appears within 24 hours of the scheduled start; subscribers receive `maintenance.scheduled` immediately and `maintenance.starting_soon` one hour out. ## Cancel a scheduled maintenance Open the maintenance row from **Updates**. The right-side rail has a **Cancel** action. `canceled_at` is set, the banner is removed, and the `maintenance.canceled` webhook fires. The lifecycle transitions can be triggered manually via the API (`POST /api/v1/maintenances/{id}/start` and `/complete`) when the cron schedule does not match the actual change-window timing. --- url: https://docs.use.observer/docs/quickstart/first-subscriber title: Add email subscribers to your status page description: Configure the subscribe block, set up double opt-in, and verify a test subscription. --- Public status pages can collect email subscribers. Each subscription goes through double opt-in (the visitor must click the link in a confirmation email) and includes per-message unsubscribe. ## Steps ### Enable subscriptions on the page Open the page in the builder. In the access settings, ensure **Allow subscriptions** is on. The setting is on by default but can be turned off per page. 
### Add the Subscribe block

In the page builder, drag the **Subscribe** block onto the canvas. The block renders an email signup field on the public page; visitors who submit an email receive a confirmation message.

### Verify with a test subscription

From the published page, submit a real email you control. Check the inbox for a confirmation email. Click the confirm link. The subscriber row's `confirmed_at` is set; future incidents fire notifications to this address.

### Verify unsubscribe

Click the unsubscribe link from any received email. The row's `unsubscribed_at` is set. Future events skip the recipient. The link is idempotent: re-clicking does nothing.

## Filtering

Subscribers can opt into specific services or metrics rather than the full page. The Subscribe block exposes a checkbox list when at least one service has been added to the page; the chosen scopes write to `subscriber_filters`.

The subscriber-per-page cap depends on the org's plan tier (see [Plans and quotas](/docs/reference/plans-and-quotas)). Daily email caps follow the same matrix; both are enforced server-side.

## Programmatic export

The console **Customers** > **Subscribers** page supports CSV export for an org's full active list. Useful when migrating between systems or generating opt-in audit trails.

---
url: https://docs.use.observer/docs/guides/outbound-webhooks
title: Configure outbound webhooks
description: Subscribe an HTTPS endpoint to status, SLO, and agent events.
---

Outbound webhooks deliver Observer events to an HTTPS endpoint as JSON. They are the integration path for paging tools, ticketing systems, and chat notifications.

## Available event types

| Type | Trigger |
|---|---|
| `metric.status_changed` | A metric's status flips after dwell gating. |
| `metric.no_data` | A metric enters `no_data` because the agent could not collect a sample. |
| `page.status_changed` | A status page's rolled-up status flips. |
| `slo.burn_started` | An SLO crosses below its target. |
| `slo.burn_resolved` | An SLO recovers. |
| `agent.offline` | An agent misses its expected heartbeat window. |

## Configure a subscription

### Create the subscription

In the console, open **Webhooks**, then **New subscription**. Configure:

- **Endpoint URL**: an HTTPS endpoint that accepts POST requests with a JSON body. HTTP is rejected.
- **Signing secret**: optional shared secret. When set, Observer signs every delivery with HMAC-SHA-256 in the `X-Observer-Signature` header.
- **Event types**: tick the events the endpoint should receive.

Save the subscription. The first delivery confirms reachability.

### Verify deliveries

The subscription detail page lists recent deliveries with their HTTP response code and round-trip time. Successful deliveries return a 2xx response within the timeout window. Failed deliveries are retried with exponential backoff.

### Verify the signature (recommended)

When a signing secret is set, every delivery includes the header:

```text
X-Observer-Signature: sha256=<digest>
```

Compute `HMAC-SHA-256(secret, raw_body)` on the receiving side and compare against the header value. Reject deliveries whose signatures do not match.

## Quotas

Webhook subscription quotas vary by plan. Endpoints over the cap on a downgrade remain configured; new endpoints are blocked until the plan is upgraded or an existing endpoint is removed.

Deliveries that fail every retry attempt land in the subscription's dead-letter list. Open the subscription, review the failures, and either fix the endpoint and replay, or discard.
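A receiver-side signature check is a few lines. A sketch using Node's built-in `crypto` (the hex digest encoding is an assumption; confirm the exact format against your own delivery headers):

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// rawBody must be the exact bytes received; parse JSON only after verifying.
function verifyObserverSignature(rawBody: Buffer, header: string, secret: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(header);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b); // constant-time compare
}
```

`timingSafeEqual` avoids leaking how many leading bytes of a forged signature were correct.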
## Payload reference

See the [webhook payload reference](/docs/reference/webhook-payloads) for the JSON shape of each event type.

---
url: https://docs.use.observer/docs/guides/custom-domain
title: Serve a status page on your own domain
description: Point status.yourdomain.com at Observer with automatic TLS.
---

Add a custom domain so the public status page lives at `status.yourdomain.com` instead of `something.use.observer`. Observer provisions a TLS certificate automatically. The default subdomain keeps working as a fallback.

## Prerequisites

- A Starter plan or higher. Free accounts cannot configure custom domains.
- A status page already created.
- Permission to add a CNAME record on the domain you want to use.

## Pick a subdomain, not the root

CNAME records cannot be set on the root of a domain. Use `status.yourdomain.com`, not `yourdomain.com`. If your status page is the only thing on that domain, `www.yourdomain.com` is a reasonable choice.

## Steps

### Add the custom domain in Observer

In the console, open the status page and choose **General**. Under **Custom domain**, type the hostname you want to use (for example `status.yourdomain.com`) and click **Add custom domain**. Observer creates the record in `dns_pending` state and starts checking every 30 seconds for the CNAME you'll add next.

### Add the CNAME at your DNS provider

Create a CNAME record with these values:

| Field | Value |
| ----- | ----- |
| Type  | `CNAME` |
| Name  | the subdomain (`status` if your domain is `status.yourdomain.com`) |
| Value | `cname.use.observer` |
| TTL   | `300` (or "automatic") |

The Observer UI shows provider-specific notes for Cloudflare, Route 53, GoDaddy, Namecheap, Vercel, and Netlify under the **DNS provider** dropdown.

**Cloudflare:** the CNAME record must be DNS-only (grey cloud), not proxied. A proxied CNAME ends in Cloudflare error 1014 ("CNAME Cross-User Banned") because the destination is on a different Cloudflare account.

### Verify

Wait a minute or two for DNS to propagate, then click **Check now** in the custom domain card. The state pill walks through:

- `dns_pending` — Observer hasn't seen the record yet.
- `dns_invalid` — your CNAME exists but points somewhere else. Fix the record and click Check now.
- `dns_verified` — DNS is right. Observer asks Let's Encrypt for a certificate.
- `cert_pending` — certificate issuance in progress (usually under a minute, sometimes up to an hour if the issuer is rate-limited).
- `active` — your domain serves the status page with a valid TLS cert.

## After it's active

The page serves at your custom hostname. The original `*.use.observer` URL keeps working — feel free to redirect from it in your own infrastructure if you want a single canonical URL.

Observer renews the certificate automatically about 30 days before expiry. The UI shows the next expiry date inside the custom domain card.

## Common failures

**`dns_pending` for more than 30 minutes.** Your DNS provider's TTL may be high (an hour or more). Wait it out, or temporarily lower the TTL.

**`dns_invalid` reporting that the CNAME points at another host.** Your CNAME is pointing at the wrong target. The correct value is `cname.use.observer`.

**`cert_failed` with "rate limited" in the message.** Let's Encrypt limits per-domain issuance. The cron tick retries every five minutes; the rate limit resets within an hour. Clicking Check now faster than that won't help.

**Cloudflare error 1014.** Your CNAME is proxied. Switch the record to DNS-only (grey cloud) in the Cloudflare DNS panel.
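To see what Observer's checker sees while you wait, you can resolve the record yourself. A minimal Node sketch (ESM, built-in `dns` module):

```ts
import { resolveCname } from "node:dns/promises";

// Should print [ 'cname.use.observer' ] once the record has propagated.
console.log(await resolveCname("status.yourdomain.com"));
```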
## Removing a custom domain

Click **Remove custom domain** in the General popover. The page stops serving on the custom hostname immediately and reverts to the default `*.use.observer` URL.

---
url: https://docs.use.observer/docs/guides/password-protected-pages
title: Password-protect a status page
description: Require visitors to enter a password before the page renders.
---

Status pages run in `public` access mode by default. The `password` mode requires a shared password before the page renders. Use it for internal-only or partner-only views that do not need per-customer scoping.

## Steps

### Switch the page to password mode

Open the page in the console, then **Access**. Set:

- **Mode**: `password`.
- **Password**: a shared secret you will distribute to authorised visitors.

Save. The page now redirects unauthenticated visitors to an unlock form.

### Distribute the password

Share the password through the channel that already gates access to the audience (e.g. a partner portal, an internal wiki, a signed email).

### Rotate the password

Open the page's access settings and update the password. Existing unlock cookies are invalidated at rotation, and visitors must re-enter the new password.

## Behaviour

- The unlock cookie is named `observer-page-access-<page-id>`.
- The cookie is signed against the current password hash. Rotating the password invalidates outstanding cookies.
- Cookie lifetime is one hour. After expiry, visitors re-enter the password.

Passwords are appropriate for low-stakes gating. For per-customer views, signed JWT access, or audit trails of who saw what, use [JWT-scoped access](/docs/guides/jwt-scoped-access) or [customer-scoped pages](/docs/guides/customer-scoped-pages).

---
url: https://docs.use.observer/docs/guides/jwt-scoped-access
title: Configure JWT-scoped access
description: Gate a status page behind a Bearer token verified against your public key or JWKS endpoint.
---

The `jwt` access mode gates a status page behind a Bearer token that Observer Cloud verifies against a public key (or a JWKS endpoint) you control. Use it when the audience already has an identity issued by your auth system, and you want the same identity to authorise status-page reads.

## Prerequisites

- A signing key (RS256, ES256, or any algorithm Observer's verifier supports). Either a single PEM public key or a JWKS endpoint Observer can fetch.
- A way to issue tokens for the audience (typically your auth service or an Identity Provider).

## Configure the page

### Switch the page to JWT mode

Open the page in the console, then **Access**. Set:

- **Mode**: `jwt`.
- **Public key** (PEM) **or** **JWKS URL**: whichever your issuer exposes.
- **Audience** (optional): the `aud` claim Observer should require.
- **Issuer** (optional): the `iss` claim Observer should require.

Save.

### Issue tokens

Sign tokens with the matching private key. Observer accepts:

- The `Authorization: Bearer <token>` header on requests to the page.
- The `?token=<jwt>` query parameter, for embed iframes that cannot set headers.

A typical claim set:

```json title="claims"
{
  "iss": "https://your-idp.example",
  "aud": "observer-status-page",
  "sub": "user-or-customer-identifier",
  "exp": 1716480000
}
```

### Validate the round trip

Open the page with the Bearer header set. Successful verification renders the page. A missing or invalid token returns 401.
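To exercise the round trip from a terminal, a minimal sketch (the host `acme.use.observer` and the token in `$JWT` are placeholders):

```bash
# 200 means the token verified and the page rendered; 401 means missing/invalid.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $JWT" \
  https://acme.use.observer
```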
When the audience is a set of distinct customers and each customer should see a different subset of metrics or different SLO thresholds, use [customer-scoped pages](/docs/guides/customer-scoped-pages) instead. That mode adds per-customer routing on top of the same JWT verification.

---
url: https://docs.use.observer/docs/guides/customer-scoped-pages
title: Configure customer-scoped pages
description: Render the same status page differently per customer, with per-customer SLO thresholds.
---

Customer-scoped pages let one underlying page serve multiple customers, with each customer's signed-in view filtered to the metrics, services, and SLOs in their contract. The same page can also apply per-customer SLO targets (for example, an enterprise customer contracted at `99.99%` reads against a different threshold than a standard customer contracted at `99.9%`).

## Prerequisites

- The audience already authenticates against an Identity Provider capable of signing JWTs.
- A list of customers in the console (open **Customers**, then **New customer**, and capture each customer's identifier).

## Configure the page

### Switch the page to customer-scoped mode

Open the page in the console, then **Access**. Set:

- **Mode**: `customer_scoped`.
- **Public key** or **JWKS URL**: same as JWT mode.
- **Customer claim**: the JWT claim Observer should read to identify the customer. Common choices are `sub`, `customer_id`, or a custom claim such as `obs_customer_id`.

Save.

### Bind customers to the page

Open the page, then **Access**, then **Customers**. Add each customer who is allowed to view the page. Customers without a binding receive a 403 even with a valid token.

### Apply per-customer SLO targets (optional)

Open a customer, then **SLO overrides**. Add an override with:

- The SLO whose target should be customer-specific.
- The customer's contracted target percentage.

When the customer-scoped page renders for that customer, the SLO strip uses the override target. Other customers viewing the same page see the default SLO target.

Customer scopes apply at render time. A single underlying metric can therefore back a `99%` agreement with one customer and a `99.99%` agreement with another, without duplicating the metric definition or the agent's collection work.

## Issuing tokens

Issue tokens for each customer with the agreed claim set. The customer-claim value must match a customer in the binding list.

```json title="claims"
{
  "iss": "https://your-idp.example",
  "aud": "observer-status-page",
  "sub": "user-1234",
  "obs_customer_id": "acme-cloud",
  "exp": 1716480000
}
```

## Behaviour

- Tokens for customers without a page binding return 403, even when the token is otherwise valid.
- An expired token returns 401, and the embedded view re-fetches a token.
- SLO overrides are read on every render and require no caching on the consumer side.

---
url: https://docs.use.observer/docs/guides/multiple-metric-sources
title: Use multiple metric sources
description: Mix Prometheus, HTTP, TCP, DNS, and TLS certificate probes in one Observer organisation.
---

Observer's agent supports several probe runtimes within one deployment. Pick the source that produces the most reliable signal for what you want to assert about the system.
## Source types | Source type | Returns | Typical use | |---|---|---| | `prometheus` | scalar from a PromQL query | latency / error rate / saturation against existing series | | `http` | response time in ms | reachability + body match against an endpoint | | `tcp` | connect time in ms | reachability for non-HTTP services (Redis, Postgres) | | `dns` | resolve time in ms | DNS resolution path with optional record-value match | | `tls_cert` | days until certificate expiry | leaf-cert validity for a hostname | Stubbed in the schema and reserved for future runtimes: `icmp`, `grpc`, `websocket`, `mtls_http`, `database`. Definitions using these source types are accepted by the cloud and stored, but the agent returns `no_data` until the runtime ships. ## Configure a non-Prometheus metric Open **Metrics**, then **New metric**, and pick the source type. Each source has its own configuration form: - **HTTP**: URL, expected status code(s), optional body match, optional headers, timeout, follow-redirects, verify-TLS toggle. - **TCP**: host, port, timeout. - **DNS**: domain, record type (`A`, `AAAA`, `CNAME`, `MX`, `TXT`, `NS`, `SRV`, `CAA`, `PTR`), optional expected value, optional resolver. - **TLS cert**: host, port (default `443`), warn-days, critical-days. The thresholds remain consistent: each metric has `healthy_*` and `unhealthy_*` operators applied to whatever value the source returns. - HTTP `response_time_ms`: healthy `under 500`, unhealthy `over 2000`. - TLS cert `days_until_expiry`: healthy `over 30`, unhealthy `under 7`. - DNS `resolve_time_ms`: healthy `under 100`, unhealthy `over 500`. ## Mixing sources on one page A status page can carry metrics from any combination of sources. The page renders each metric using its threshold band, regardless of the runtime that produced the value. Operators viewing the page see one consistent green / amber / red signal across heterogeneous checks. ## Agent reach The agent must be able to reach each source from its host. For Prometheus, that is your internal Prometheus URL. For HTTP probes, the URL must be reachable from wherever the agent runs (for example, an internal endpoint on a private network). The cloud never reaches your endpoints directly: the agent collects, computes status, and pushes the verdict. --- url: https://docs.use.observer/docs/guides/theme-customization title: Customise the status page theme description: Apply a built-in theme preset or override colours, typography, and spacing on a per-page basis. --- Status pages render against a token-driven theme. Every visible surface (background, foreground, accent, semantic colours, typography, spacing, border radius) is exposed as a CSS variable that a preset or per-page override can change without touching code. ## Pick a preset Open the page in the builder, then **Theme**. Each preset is a pre-baked combination of colours and typography intended for a particular brand register (warm-light, cool-dark, monochrome, and others). Selecting a preset writes its tokens to the page's `page_themes` row. ## Override individual tokens The theme editor surfaces every token the public page consumes: - **Background, surface, foreground**: page chrome. - **Accent**: status pill, primary buttons, link colour. - **Success, warning, danger**: status indicators (`healthy`, `degraded`, `unhealthy`). - **Border, muted, muted foreground**: dividers and secondary text. - **Heading, body, mono**: font families. The page builder picks Google Fonts by default; arbitrary CSS `font-family` strings are also accepted. 
- **Spacing scale, radius**: layout density.

Every override is persisted on the page and applied at render time. Preview changes in the page builder before saving.

## Custom CSS

If a token override is not enough, open **Theme**, then **Custom CSS**. The CSS you provide is injected into the rendered page after the preset and token overrides. Use it for narrow corrections (e.g. shifting a margin, hiding a block on small viewports) rather than re-skinning the page.

Token overrides do not auto-correct contrast. Pick foreground colours that meet WCAG AA against the chosen background. The built-in presets are validated against AA at seed time.

## Preset rollout

A theme preset selected through the **Theme** picker writes the preset's tokens to the page row. Subsequent updates to the preset itself do not retroactively rewrite pages that already adopted it. To apply a refreshed preset, re-select it on each page that should update.

---
url: https://docs.use.observer/docs/guides/define-a-manual-metric
title: Define a manual metric
description: The cleanest path for operators without metrics infrastructure.
---

If you have no Prometheus server, no observability for the target system, or simply want a status surface that follows operator judgment rather than a measurement, manual metrics are the right shape.

## Steps

### Create the metric

In the console, navigate to **Metrics** > **New metric**. In the **Source type** picker, choose **Manual**. The form hides the probe config and threshold sections; manual metrics carry neither. Fill in the title and description; pick the agent association, if any (manual metrics ignore the agent at runtime, but the field stays for ownership / audit).

### Set the initial status

Save. Open the metric. The detail page shows a clickable status pill. Pick the right initial status (`healthy` is the most common).

### Bind the metric to a service and (optionally) an SLO

Manual metrics fit the same service / SLO model as probed metrics. Open the service, define an SLO that points at the manual metric, set a target, and the budget will burn whenever the metric is in the unhealthy state — same machinery as a probed metric.

### Hook up automation (optional)

For systems with their own observability, you can drive a manual metric from outside Observer:

```bash
curl -X POST https://use.observer/api/v1/metrics/$METRIC_ID/status \
  -H "Authorization: Bearer obs_pub_..." \
  -H "Content-Type: application/json" \
  -d '{"status":"unhealthy","note":"Vendor incident #VND-12345"}'
```

The scope `write:metrics` is required. The note ends up in the audit log.

When an open incident lists a service that contains a manual metric, that metric auto-flips to mirror the incident's severity. This is intentional: manual metrics have no probe, so the only meaningful signal is what the operator says is true. See [Manual metrics](/docs/concepts/manual-metrics) for the full semantics.

---
url: https://docs.use.observer/docs/guides/incidents-via-api
title: Create incidents via API
description: For IR automation and ChatOps integrations.
---

Every console action on incidents has an API equivalent. Most IR teams wire their alerting (PagerDuty, Opsgenie) or ChatOps (Slack slash commands, GitHub Actions) to file and update Observer incidents directly without an operator touching the console.

## Auth

API keys are issued per organisation. Two scopes cover incident automation:

- `write:incidents` — create / patch / publish / resolve / delete.
- `write:maintenances` — create / patch / start / complete / cancel.
Both inherit from `read:incidents` for retrieval. ## File a new incident from a Slack slash command ```bash curl -X POST https://use.observer/api/v1/incidents \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{ "title": "Checkout API errors", "severity": "major", "affected_services": [""], "publish": true, "initial_message": { "type": "Investigating", "description": "Investigating elevated error rate on checkout." } }' ``` The response includes `id`, the projected lifecycle state, and the affected-service rollup. Use the `id` for follow-up calls. ## Append a status update ```bash curl -X POST https://use.observer/api/v1/incidents/$ID/messages \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{ "type": "Identified", "description": "Identified bad deploy. Rolling back." }' ``` A `Resolved` message auto-marks the parent incident resolved. ## Resolve ```bash curl -X POST https://use.observer/api/v1/incidents/$ID/resolve \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{"description": "Rollback complete. Error rate back to baseline."}' ``` ## Schedule a maintenance window ```bash curl -X POST https://use.observer/api/v1/maintenances \ -H "Authorization: Bearer obs_pub_..." \ -H "Content-Type: application/json" \ -d '{ "title": "Database upgrade", "scheduled_start_at": "2026-06-01T02:00:00Z", "scheduled_end_at": "2026-06-01T04:00:00Z", "affected_services": [""] }' ``` ## Idempotency The `from-metric` endpoint is dedupe-protected: ```bash curl -X POST https://use.observer/api/v1/incidents/from-metric/$METRIC_ID \ -H "Authorization: Bearer obs_pub_..." ``` Calling this twice for the same metric within 30 minutes returns the same draft incident id. Useful when an alert hook may fire duplicate webhooks. Every endpoint above corresponds to a documented state transition. See [Incident lifecycle reference](/docs/reference/incident-lifecycle) for the full state machine. --- url: https://docs.use.observer/docs/guides/auto-incident-creation title: Auto-incident creation description: Opt a metric in to automatic draft-incident creation when it flips unhealthy. Drafts ship with email CTAs so a human always verifies before customers see the incident. --- When a metric flips unhealthy in the middle of the night, the on-call already knows. The question is whether the customer-facing status page should be updated to reflect that. Auto-incident creation does the typing-out part for you — without ever publishing without a human pressing a button. ## How it works 1. You opt a metric in to the feature on its edit form (Pro+). 2. The metric flips unhealthy (with dwell gating, exactly as a manual status change would). 3. The auto-incident worker creates a **draft** incident on the metric's bound service. 4. Observer emails your org owners with two buttons: **Publish** (flip to published; customers see it) and **Dismiss** (soft-delete the draft). 5. If neither button is clicked within 24 hours, the draft auto-expires. Nothing ever reaches the public page without a human action. A draft incident is just a row in your database with `publishedAt = NULL`. Your status page renders only published incidents. The draft exists for you to verify and act on — it can be safely dismissed if it turned out to be noise. ## Enable for a metric 1. Open **Console → Metrics → \ → Edit**. 2. Scroll to the **Automatic incident creation** section. 3. Pick a **Policy**: - **Off** — auto-creation is disabled for this metric. 
- **On — create immediately** — a draft is created the moment the metric flips unhealthy.
- **On — wait then re-check** — Observer waits the configured number of seconds, then re-checks the metric's current status. If it's still unhealthy, the draft is created. If the metric recovered during the dwell window, nothing happens. This is the recommended setting for metrics that occasionally flap.

4. Pick a **Severity** (`minor` / `major` / `critical`). This value is stamped on every auto-drafted incident.
5. For dwell mode, pick a **Dwell seconds** value between 60 and 3600. Defaults to 300 (5 minutes).
6. Save.

## What gets created

When the worker fires, you get:

- A new incident row with:
  - `title`: `Investigating elevated errors on <metric title>`
  - `severity`: as configured on the metric
  - `affected_services`: every service that has an SLO pointing at the metric
  - `is_auto_drafted`: `true`
- An initial `Information` message describing the value vs the threshold and the timestamp.
- An audit row (`incident.auto_drafted` on the metric, plus the parent row on the incident itself).
- A webhook event `incident.auto_drafted` (separate from the manual `incident.created` so you can listen specifically).
- An email to every org owner who hasn't opted out (see [Notification preferences](#notification-preferences)).

## Email CTAs

Each email has two buttons:

- **Publish incident** — `GET /api/incidents/auto-action?token=…&action=publish`. Flips the draft to published. Fires `incident.auto_published`.
- **Dismiss draft** — `GET /api/incidents/auto-action?token=…&action=dismiss`. Soft-deletes the row. Fires `incident.auto_dismissed` with `reason: "operator_dismiss"`.

The token format is `base64url(body) + "." + base64url(sig)`, where the body is a pipe-delimited string of the signed fields and the signature is `HMAC-SHA-256(server_secret, body)`. The action is part of the signed body, not just the URL — you can't flip a publish link to dismiss (or vice versa) by editing the URL. Tokens expire after 24 hours. (A sketch of the token shape appears below, after the plan gate.)

Both endpoints are idempotent. Re-clicking publish after the incident is already published returns a success page. Re-clicking dismiss after it's already dismissed returns a success page.

## Dedup, cooldown, and expiry

Three guardrails keep the auto-incident flow from spamming you:

1. **Dedup against open incidents on the service.** If you (or a prior auto-draft) have already filed an incident affecting the metric's service, the worker appends a new Information message to the existing incident instead of creating a duplicate. Message text: `Metric is now unhealthy (auto-detected).`
2. **One auto-draft per metric per hour.** If a metric was already auto-drafted or auto-dismissed in the last hour, the worker skips. Flapping metrics never produce more than one draft per hour.
3. **24-hour auto-expiry.** Drafts older than 24 hours that haven't been published or dismissed are soft-deleted by a cron that runs every 15 minutes, audited as `incident.auto_expired`, and fire `incident.auto_dismissed` with `reason: "auto_expired"`.

## Notification preferences

Per-user opt-out lives at **Console → Settings → Notifications → Auto-incident draft emails**. Default is ON for org owners. Owners who toggle this off do not receive auto-incident emails (other email types are unaffected). The toggle stores as `users.notification_preferences.autoIncidentDrafts = false` on the user row.

## Plan gate

This feature is **Pro+ only**. Free and Starter plans see a locked-feature card on the metric edit form; the policy stays **Off** (the default) until the plan is upgraded.
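To make the signed-token description above concrete, a shell sketch of the shape (the pipe-delimited field layout `inc_123|publish|1716480000` is purely illustrative; the real fields are server-defined):

```bash
# Illustrative reconstruction of base64url(body) + "." + base64url(sig).
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }
body='inc_123|publish|1716480000'            # hypothetical field layout
sig=$(printf '%s' "$body" | openssl dgst -sha256 -hmac "$SERVER_SECRET" -binary | b64url)
printf '%s.%s\n' "$(printf '%s' "$body" | b64url)" "$sig"
```

Because the action sits inside the signed body, changing `action=` in the URL without re-signing invalidates the token.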
## Webhook events Three event types fire from the auto flow: - `incident.auto_drafted` — fires when the draft is created. - `incident.auto_published` — fires when the draft is published via the email link (or the equivalent API endpoint). - `incident.auto_dismissed` — fires for both the email-dismiss and the 24h auto-expiry paths. `reason` distinguishes them. Payloads are documented at [Webhook payload reference](/docs/reference/webhook-payloads#incidentauto_drafted). ## Recommended setup For most teams: - **Dwell mode with 300 seconds** for any latency or error-rate metric. The dwell window catches noisy alarms before they generate an email. - **Immediate mode** for binary signals (TLS expiry hit zero, a service is unreachable). These should not flap, so dwell adds nothing. - Leave auto-creation **off** for noisy dashboards that are not customer-visible. The console already shows unhealthy metrics; not every internal alarm deserves a draft. --- url: https://docs.use.observer/docs/guides/migrate-from-statuspage title: Migrate from Statuspage description: Move services, components, incidents, and subscribers from Atlassian Statuspage to Observer. --- This guide covers a structured migration from Atlassian Statuspage to Observer. The two products share a customer-facing surface, but their backing models differ: Statuspage records component state manually or through an API call; Observer derives status from metrics that an agent collects in your network. Plan the migration around that difference. ## Model differences to plan for | Statuspage concept | Observer equivalent | Notes | |---|---|---| | Component | Metric (one or more, behind a service) | A Statuspage component represents the operator's manual verdict. An Observer metric represents a measured value evaluated against thresholds. One Statuspage component often becomes one Observer service with two or three Observer metrics. | | Component group | Service | Logical grouping. Maps cleanly. | | Manual incident state | Update with `Incident` type | Same semantics: posted updates with timeline. | | Status indicator (operational, degraded, partial outage, major outage) | Rolled-up page status (`healthy`, `degraded`, `unhealthy`) | Page rollup uses `unhealthy=3 > degraded=2 > healthy=1`. Pick the worst child status. | | API-driven component update | Metric reported by the agent | Stop calling Statuspage's `PATCH /components/:id`. The agent's status push replaces it. | | Subscribers (email / SMS / Slack / webhook) | Page subscribers + outbound webhooks | Email subscribers move with the data export. SMS is not supported; Slack and PagerDuty are reachable through outbound webhook subscriptions. | | Public status page domain | Status page subdomain | Both products serve a customer-facing domain. Plan a DNS cutover window. | | Maintenance windows | Update with `Scheduled maintenance` type | Posted in advance, displays on the page during the window. | ## Steps ### Inventory the Statuspage account Pull the list of: - Components and component groups (one row per metric to define in Observer). - Past 90 days of incidents (for the changelog you publish on the Observer page). - Active subscribers, exported as CSV. - Webhook subscribers, with their endpoint URLs. Statuspage's REST API exposes each of these. The export from **Account** > **Audit log** captures incident history; the **Subscribers** page exports CSV directly. 
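If you prefer to script the inventory, the Statuspage REST API exposes the component list directly. A sketch (assumes `jq` is installed and `$PAGE_ID` / `$API_KEY` come from the Statuspage account; endpoint shape per Atlassian's public API docs):

```bash
# One line per component: id, group, name, current manual status.
curl -s "https://api.statuspage.io/v1/pages/$PAGE_ID/components" \
  -H "Authorization: OAuth $API_KEY" |
  jq -r '.[] | [.id, .group_id // "-", .name, .status] | @tsv'
```

Each row becomes a candidate Observer service plus one or more metrics.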
### Stand up the Observer side in parallel

Follow [Define your first metric](/docs/quickstart/first-metric) to install an agent and define a first reporting metric. Build out the remaining metrics, services, and SLOs without touching the Statuspage account. The two systems run side-by-side until the DNS cutover.

For each Statuspage component, decide on the source signal:

- A latency or error-rate query already in Prometheus (use the Prometheus probe).
- An HTTP endpoint that returns 200 when the component is healthy (use the HTTP probe).
- A TCP socket, DNS record, or TLS certificate (matching probe type).

If a Statuspage component has no measurable signal today, that is a monitoring gap: the component's "operational" state was only ever the operator's manual verdict. Pick the closest measurable proxy and document the gap.

### Build the status page

Open **Pages** > **New page** in the Observer console. Recreate the public-facing layout: title, theme, services, metrics, SLO strip. The page is reachable on its `<subdomain>.use.observer` URL immediately, before the DNS cutover.

If the Statuspage account uses customer-scoped views (visible under different domains per customer), see [Configure customer-scoped pages](/docs/guides/customer-scoped-pages).

### Backfill incidents

Observer renders updates posted on the page. To preserve the public changelog, post each historical Statuspage incident as an Update with type `Incident`, dated to its original `created_at`. The console's **Updates** > **New update** form accepts a custom timestamp.

For high-incident accounts, scripting this against the Statuspage incident export and the Observer API is the practical path; for a typical SMB account with under 50 incidents a year, manual entry is fast.

### Migrate subscribers

Observer accepts an email-subscriber import via the API. For webhook subscribers, recreate the subscription in **Webhooks** > **New subscription**, point it at the same endpoint URL, and pick the events that match what the consumer expects. Webhook payload shapes are documented in [Webhook payload reference](/docs/reference/webhook-payloads).

SMS subscribers need to be re-acquired. Email those subscribers during the migration window with a link to the Observer subscribe form on the new page.

### Cut over DNS

When the Observer page renders correctly and all subscribers are migrated, point your status subdomain (commonly `status.yourdomain.com`) at the Observer cloud's wildcard (CNAME to `cname.use.observer`; see [Serve a status page on your own domain](/docs/guides/custom-domain)). The page resolves immediately; visitors see no transition.

Disable updates from the Statuspage API in your alerting and CI systems; the agent's metric pushes now drive Observer's status verdict. Cancel the Statuspage subscription after one billing cycle of overlap to allow rollback if the migration surfaces any gap.

## API parity matrix

For teams wiring CI / IR automation, this table maps the Statuspage endpoint to its Observer equivalent. See [Create incidents via API](/docs/guides/incidents-via-api) for end-to-end examples.

| Statuspage | Observer | Notes |
|---|---|---|
| `GET /pages/{id}/incidents` | `GET /api/v1/incidents` | Same cursor-paged list shape; Observer adds `state` and `since` filters. |
| `POST /pages/{id}/incidents` | `POST /api/v1/incidents` | Observer adds `affected_services`, `visible_to_customer_ids`, `publish` flag, `initial_message`. |
| `PATCH /pages/{id}/incidents/{id}` | `PATCH /api/v1/incidents/{id}` + `POST /publish` | Statuspage rolls publish + edit into one call; Observer separates them so drafts are explicit. |
| `POST /pages/{id}/incidents/{id}/components` (set state on component) | (no direct equivalent) | Wired automatically when an incident lists `affected_services` containing manual metrics; see [Manual metrics](/docs/concepts/manual-metrics). |
| (no equivalent) | `POST /api/v1/incidents/from-metric/{metricId}` | Pre-fill a draft from a flipped metric. Observer-only. |
| `POST /pages/{id}/incidents/{id}/messages` | `POST /api/v1/incidents/{id}/messages` | Same shape; Observer's `Resolved` message also flips the parent state. |
| `DELETE /pages/{id}/incidents/{id}` | `DELETE /api/v1/incidents/{id}` | Observer is soft-delete (`deleted_at`); Statuspage is hard-delete. |
| `POST /pages/{id}/incidents/{id}/scheduled-maintenances` | `POST /api/v1/maintenances` | Observer auto-transitions `scheduled` → `in_progress` → `completed` on the configured times via cron; Statuspage requires manual start/complete. |
| `GET /pages/{id}/page-access-users` | (none) | Observer's customer-scoped access uses JWT claims; no per-customer API for the user list. |
| `POST /pages/{id}/subscribers` | `POST /status-page/{subdomain}/subscribe` | Public endpoint (no API key required). Confirmation flow is double opt-in. |

## Common questions

**Can both run in parallel during migration?** Yes, and that is the recommended path. The agent reports to Observer; the Statuspage API stays in place until DNS cutover. Subscribers can be on either system during overlap.

**What about historical metric values?** Observer's history starts when the agent first reports. Statuspage does not offer a metric export to backfill, because Statuspage does not store metric values; it stores manual verdicts. The 90-day incident timeline is what migrates.

**How do I keep on-call alerting unchanged?** Recreate the webhook subscription. Most alerting integrations (PagerDuty, Slack, Microsoft Teams) accept generic JSON webhooks with HMAC signatures. The signature scheme is described in [Webhook payload reference](/docs/reference/webhook-payloads#signature-verification).

For accounts with hundreds of components or high-volume subscriber lists, the Observer team can run the migration alongside you. Contact support before starting, and a migration engineer will be assigned.

---
url: https://docs.use.observer/docs/reference/plans-and-quotas
title: Plans and quotas
description: Per-plan limits for resources, retention, and API throughput.
---

Plan limits are enforced at create time. Existing rows over the cap on a downgrade remain readable; new creates are blocked until the plan is upgraded or an existing row is removed.
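For API-driven provisioning, the enforcement point is visible on the create call itself. An illustrative sketch (the metrics-create endpoint and exact response body are assumptions; the documented part is the `quota_exceeded` error code returned on a blocked create):

```bash
# A create over the plan cap is rejected server-side with quota_exceeded.
# Endpoint and body shape are illustrative, not a documented contract.
curl -s -X POST https://use.observer/api/v1/metrics \
  -H "Authorization: Bearer obs_pub_..." \
  -H "Content-Type: application/json" \
  -d '{"title": "one metric over the cap"}'
# Expect an error payload carrying "quota_exceeded" and an upgrade pointer.
```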
## Resource quotas | Capability | Free | Starter | Pro | Enterprise | |---|---|---|---|---| | Status pages | 1 | 1 | 3 | unlimited | | Services | 3 | 10 | 50 | unlimited | | Metrics | 10 | 50 | 500 | unlimited | | SLOs | 0 | 3 | unlimited | unlimited | | Custom domains | 1 | 3 | unlimited | unlimited | | Subscribers per page | 100 | 5,000 | 50,000 | 500,000 | | Customer-scoped pages | 0 | 0 | 25 | unlimited | | Customers | 0 | 0 | 25 | unlimited | | Agents | 1 | 3 | 10 | unlimited | | Webhook endpoints | 0 | 3 | 25 | unlimited | ## Daily caps | Capability | Free | Starter | Pro | Enterprise | |---|---|---|---|---| | Webhook deliveries / day | 0 | 1,000 | 100,000 | unlimited | | Public API requests / day | 0 | 10,000 | 100,000 | unlimited | | Subscriber emails / day | 0 | 1,000 | 100,000 | unlimited | ## Retention | Capability | Free | Starter | Pro | Enterprise | |---|---|---|---|---| | Metric history (days) | 7 | 30 | 90 | 365 | Resources that exceed the new plan's cap are not deleted. They remain visible and editable. The next attempted create on the affected capability returns a `quota_exceeded` error with a pointer to the upgrade path. --- url: https://docs.use.observer/docs/reference/webhook-payloads title: Webhook payload reference description: JSON shapes for every event type Observer emits. --- Every webhook delivery is a POST with a JSON body and the headers: ```text Content-Type: application/json X-Observer-Event: X-Observer-Delivery: X-Observer-Signature: sha256= (when a signing secret is configured) ``` The body is always: ```json { "event_type": "", "event_id": "", "occurred_at": "", "data": { ... } } ``` The `data` field shape varies by event type. The reference below documents each. ## metric.status_changed A metric's status flipped after dwell gating. ```json { "data": { "org_id": "org_...", "metric_id": "", "metric_title": "checkout-api 5xx ratio", "old_status": "healthy", "new_status": "unhealthy", "value": 0.024, "timestamp": "" } } ``` ## metric.no_data A metric entered `no_data`: the agent could not collect a sample. ```json { "data": { "org_id": "org_...", "metric_id": "", "metric_title": "checkout-api 5xx ratio", "reason": "ECONNREFUSED", "timestamp": "" } } ``` ## page.status_changed A status page's rolled-up status flipped. ```json { "data": { "org_id": "org_...", "page_id": "", "page_title": "Acme Cloud", "old_status": "healthy", "new_status": "degraded", "computed_at": "" } } ``` ## slo.burn_started An SLO crossed below its target. ```json { "data": { "org_id": "org_...", "slo_id": "", "slo_name": "checkout-api availability", "service_id": "", "service_name": "checkout-api", "burn_event_id": "", "started_at": "", "error_budget_burned_pct": 12.4, "target_pct": 99.9, "window_days": 30 } } ``` ## slo.burn_resolved An SLO recovered. The matching `burn_event_id` from the prior `slo.burn_started` is included so consumers can pair the two. ```json { "data": { "org_id": "org_...", "slo_id": "", "slo_name": "checkout-api availability", "service_id": "", "service_name": "checkout-api", "burn_event_id": "", "resolved_at": "", "final_budget_remaining_pct": 87.2, "target_pct": 99.9, "window_days": 30 } } ``` ## agent.offline An agent missed its expected heartbeat window. ```json { "data": { "org_id": "org_...", "agent_id": "", "agent_name": "agent-eu-west-1", "last_heartbeat_at": "", "version": "1.2.3" } } ``` ## incident.created A new incident row was created. Fires for both drafts and published incidents on insert. 
```json { "data": { "org_id": "org_...", "incident_id": "", "title": "...", "severity": "major", "state": "draft", "is_customer_scoped": false, "affected_service_ids": [""], "affected_service_names": ["checkout-api"] } } ``` ## incident.published A draft incident was published. The incident is now visible on the public page. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "...", "severity": "major", "state": "published", "published_at": "", "is_customer_scoped": false, "affected_service_ids": [""], "affected_service_names": ["checkout-api"] } } ``` ## incident.updated Title, severity, affected services, or visibility changed. ```json { "data": { "org_id": "org_...", "incident_id": "", "changed_fields": ["title", "severity"] } } ``` ## incident.message_added A new message was appended to an incident timeline. ```json { "data": { "org_id": "org_...", "incident_id": "", "message_id": "", "message_type": "Identified", "description": "...", "occurred_at": "" } } ``` ## incident.resolved `resolved_at` was set on the incident. Posting a Resolved message also fires this event because the appendMessage path auto-flips the parent state. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "...", "severity": "major", "resolved_at": "" } } ``` ## incident.deleted An incident was soft-deleted via DELETE. ```json { "data": { "org_id": "org_...", "incident_id": "", "deleted_at": "" } } ``` ## incident.auto_drafted The auto-incident worker created a DRAFT incident from an unhealthy metric flip. The draft is not visible to customers until it is published via the email CTA, the console, or the API. ```json { "data": { "org_id": "org_...", "incident_id": "", "metric_id": "", "metric_title": "checkout-api 5xx ratio", "severity": "major", "trigger_reason": "checkout-api 5xx ratio read 0.04 against threshold 0.02 at ", "value": 0.04, "threshold": 0.02, "affected_service_ids": [""], "url": "https://use.observer/console//updates/edit/" } } ``` ## incident.auto_published An auto-drafted incident was published via the signed-token email link. Equivalent to `incident.published` but distinguished so subscribers can listen specifically for the auto-publish flow. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "Investigating elevated errors on checkout-api 5xx ratio", "severity": "major", "state": "published", "published_at": "", "affected_service_ids": [""], "url": "https://use.observer/console//updates/edit/" } } ``` ## incident.auto_dismissed An auto-drafted incident was dismissed via the signed-token email link, OR auto-expired after 24h with no action. `reason` is `operator_dismiss` or `auto_expired`. ```json { "data": { "org_id": "org_...", "incident_id": "", "title": "Investigating elevated errors on checkout-api 5xx ratio", "dismissed_at": "", "reason": "auto_expired" } } ``` ## maintenance.scheduled A maintenance window was created. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "scheduled_start_at": "", "scheduled_end_at": "", "affected_service_ids": [""], "affected_service_names": ["checkout-api"] } } ``` ## maintenance.starting_soon Cron fires this once per maintenance row when `scheduled_start_at` is within the next hour. Idempotent via `maintenance_starting_soon_fired_at`. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "scheduled_start_at": "" } } ``` ## maintenance.started `actual_start_at` was set (manual API call or cron auto-transition). 
```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "actual_start_at": "", "scheduled_end_at": "" } } ``` ## maintenance.completed `actual_end_at` was set. Posting a Resolved message on a maintenance also fires this event because the appendMessage path flips the parent state. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "actual_start_at": "", "actual_end_at": "" } } ``` ## maintenance.canceled `canceled_at` was set before completion. ```json { "data": { "org_id": "org_...", "maintenance_id": "", "title": "...", "canceled_at": "" } } ``` ## Signature verification When a signing secret is configured, every delivery carries: ```text X-Observer-Signature: sha256= ``` `hex` is `HMAC-SHA-256(secret, raw_body)`. Recompute on the receiving side and compare in constant time. Reject deliveries whose signatures do not match. Use `event_id` (also delivered as `X-Observer-Delivery`) as the idempotency key when persisting events. Retried deliveries reuse the same id, so a unique-key check prevents double-processing. --- url: https://docs.use.observer/docs/reference/audit-log-events title: Audit log events description: Categories of administrative events recorded in the audit log. --- Every administrative change in the console writes to an append-only audit log scoped to the organisation. Events are grouped into categories for filtering and retention. ## Categories | Category | Surface | |---|---| | `agent` | Agent create / rename / rotate-key / delete. | | `page` | Status page create / edit / theme change / access-mode change / delete. | | `webhook` | Webhook subscription create / edit / pause / delete; delivery retry / discard. | | `metric` | Metric definition create / edit / threshold change / delete. Manual metric status writes (`metric.status.set_manually`) also fall here. | | `slo` | SLO create / target change / window change / delete; burn open / resolve. | | `customer` | Customer create / edit / page binding change / SLO override / delete. | | `incident` | Incident create / publish / update / resolve / delete; message append. | | `maintenance` | Maintenance schedule / start / complete / cancel; starting-soon cron event. | | `subscriber` | Subscriber confirm / unsubscribe; per-event delivery audit lives in `subscriber_deliveries`, not `audit_log`. | | `org` | Organisation create / rename / member add / member remove. | | `auth` | User sign-in, sign-out, MFA enrol, password change. | | `subscription` | Plan change, payment method update, invoice generated. | | `billing` | Payment provider events (charge succeeded, refund, dispute, etc.). | | `api_key` | Org API key create / revoke. | | `other` | Any event whose action prefix does not match the categories above. | ## Event shape Each row carries: - `id`: opaque identifier. - `org_id`: the organisation the event scopes to. - `actor`: the user or system that performed the action. - `action`: dotted action string (for example `agent.created`, `slo.target_changed`). - `target_type` and `target_id`: the resource the action affected. - `metadata`: action-specific JSON payload. - `created_at`: timestamp. ## Filtering The audit log page in the console supports filtering by: - Time range. - Category. - Actor (user identifier). - Action (full dotted string). - Target id. ## Retention Audit log retention follows the same window as metric history (see [Plans and quotas](/docs/reference/plans-and-quotas)). 
Older entries are not deleted automatically; export them on the schedule your compliance team requires. --- url: https://docs.use.observer/docs/reference/threshold-operators title: Threshold operators description: How healthy / degraded / unhealthy is decided from a metric value. --- A metric's status on every push follows a strict rule applied to the value the agent reported. ## Rule Each metric carries two operator-and-value pairs: - `healthy_operation` and `healthy_value` - `unhealthy_operation` and `unhealthy_value` Operators are: `over`, `under`, `equal`. The agent computes status as: 1. If the value matches the healthy condition, status is `healthy`. 2. Else, if the value matches the unhealthy condition, status is `unhealthy`. 3. Otherwise, status is `degraded`. ## Strict comparisons Operators are strict everywhere. A value exactly equal to a threshold under `over` or `under` does not match. | Operator | Match condition | |---|---| | `over` | `value > threshold` (not `>=`) | | `under` | `value < threshold` (not `<=`) | | `equal` | `value == threshold` | The same comparison rule is applied both in the agent and in the read path the cloud uses to render status pages. A non-strict comparison in one location and a strict comparison in the other would cause the same value to flip status depending on the read surface. Strict everywhere keeps the metric's status consistent. ## Examples ### 5xx error ratio - `healthy_operation: under`, `healthy_value: 0.005` - `unhealthy_operation: over`, `unhealthy_value: 0.02` Reading: healthy under 0.5%; unhealthy over 2%; anything else is degraded. A value of exactly `0.005` is degraded (not healthy) because `under` is strict. ### TLS certificate expiry - `healthy_operation: over`, `healthy_value: 30` - `unhealthy_operation: under`, `unhealthy_value: 7` Reading: healthy when more than 30 days remain; unhealthy when fewer than 7 days remain; degraded in between (7 to 30 days). ### Queue depth - `healthy_operation: under`, `healthy_value: 100` - `unhealthy_operation: over`, `unhealthy_value: 1000` Reading: healthy under 100 messages; unhealthy over 1000; degraded in between. ## No-data and unknown `no_data` and `unknown` are not part of the operator rule. They arise from the agent's collection layer: - `no_data`: the agent attempted a probe but could not produce a value (timeout, connection refused, query returned empty). The cloud records the reason code alongside the status. - `unknown`: no recent push has arrived for the metric within the expected interval. ## Stale data A metric is `stale` when its last push timestamp is older than three times its push interval, capped at 15 minutes. Stale and `no_data` look identical in the database but mean different things: - `no_data`: the agent ran the probe and the probe failed to return a value. This is a real signal about the customer's service. It counts in the SLO and surfaces in status rollups. - `stale`: the agent has not pushed anything recently. The cause is on the monitoring side (cloud outage, agent crash, network partition between agent and cloud), not the customer's side. Stale metrics are excluded from the live status rollup, do not burn SLO budget, and do not fire `metric.status_changed` or `metric.no_data` webhooks. When every metric on a service is stale, the service rolls up to `monitoring_delayed` rather than `unhealthy`. See [Observer availability](/docs/concepts/observer-availability) for the contract that protects customer status pages from Observer's own outages. 
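To make the evaluation order and the strict comparisons concrete, a small shell sketch using the 5xx-ratio example above (illustrative only; the agent applies this rule natively):

```bash
# healthy: under 0.005, unhealthy: over 0.02, else degraded.
evaluate() {
  local value=$1
  if   awk "BEGIN { exit !($value < 0.005) }"; then echo healthy
  elif awk "BEGIN { exit !($value > 0.02)  }"; then echo unhealthy
  else echo degraded
  fi
}
evaluate 0.004   # healthy
evaluate 0.005   # degraded: strict, 0.005 is not < 0.005
evaluate 0.030   # unhealthy
```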
--- url: https://docs.use.observer/docs/reference/incident-lifecycle title: Incident and maintenance lifecycle description: States, transitions, and the events fired on each. --- Observer treats incidents and maintenances as instances of the same underlying row (the `updates` table). The lifecycle state is derived from a grid of timestamp columns rather than an explicit state column; this keeps the model coherent regardless of whether the transition happened via API, server action, or cron. ## States | State | Trigger | |---|---| | `draft` | Row exists but `published_at IS NULL`. Invisible to the public. | | `published` | `published_at` set. Renders on the public page. | | `resolved` | `resolved_at` set. Final lifecycle for an incident. | | `scheduled` | Maintenance with `scheduled_start_at` set, `actual_start_at NULL`. | | `in_progress` | Maintenance with `actual_start_at` set, `actual_end_at NULL`. | | `completed` | Maintenance with `actual_end_at` set. | | `canceled` | Either type with `canceled_at` set. | | `deleted` | Soft-delete via `deleted_at`. Permanent; not displayed anywhere. | ## Transitions ```text draft │ POST /publish ▼ published ──── POST /resolve ──▶ resolved │ │ DELETE ▼ deleted (maintenance only) scheduled ── cron @ scheduled_start_at ──▶ in_progress │ cron @ scheduled_end_at ▼ completed any state ── POST /cancel ──▶ canceled ``` ## Webhook events Each transition fires a webhook event. See [Webhook payload reference](/docs/reference/webhook-payloads) for exact body shapes. | Event | Fires when | |---|---| | `incident.created` | Row inserted (regardless of draft / publish). | | `incident.published` | `published_at` set. | | `incident.updated` | Title, severity, or affected services patched. | | `incident.message_added` | Message appended to timeline. | | `incident.resolved` | `resolved_at` set. | | `incident.deleted` | `deleted_at` set. | | `maintenance.scheduled` | Row inserted with `scheduled_start_at`. | | `maintenance.starting_soon` | Cron fires within 1h of `scheduled_start_at`. Once per row. | | `maintenance.started` | `actual_start_at` set (manual or cron). | | `maintenance.completed` | `actual_end_at` set. | | `maintenance.canceled` | `canceled_at` set. | ## Auto-message side effects Some lifecycle transitions append a system message to the timeline: - `maintenance.started` (cron or API) appends an Information message: "Maintenance started." - `maintenance.completed` (cron or API) appends a Resolved message: "Maintenance completed." - `maintenance.canceled` (API) appends an Information message: "Maintenance canceled." These are visible on the public page exactly like operator-authored messages. They exist so the timeline reflects every state change without requiring the operator to remember. Lifecycle transitions reject double-application: `POST /publish` on an already-published incident returns 409 (`already_published`). Same for resolve, start, complete, cancel. Soft-delete returns 200 (idempotent). --- url: https://docs.use.observer/docs/reference/subscriber-events title: Subscriber notification events description: Which incident and maintenance transitions trigger subscriber emails. --- The notification worker reads from `pgmq.notification_outbox` and fans out to confirmed subscribers on the affected page. The dispatch matrix below lists which event types trigger subscriber email and which do not. ## Trigger matrix | Event | Triggers subscriber email | |---|---| | `incident.created` | No (draft state). | | `incident.published` | Yes. 
| `incident.updated` | No (avoids notification spam on minor edits). |
| `incident.message_added` | Yes. |
| `incident.resolved` | Yes. |
| `incident.deleted` | No. |
| `maintenance.scheduled` | Yes. |
| `maintenance.starting_soon` | Yes (1h pre-warn). |
| `maintenance.started` | No (the starting_soon mail covered it). |
| `maintenance.completed` | Yes. |
| `maintenance.canceled` | Yes. |

## Filter scopes

When a subscriber has rows in `subscriber_filters`, the dispatch only fires when at least one of the incident's affected services or metrics intersects the filter list. Subscribers with no filters receive every relevant event for the page.

## Customer-scoped incidents

Incidents with rows in `update_customer_visibility` are scoped: only subscribers tied to one of the listed customers receive notifications. The customer-binding model on subscribers is still under design; today, customer-scoped incidents skip subscriber dispatch entirely.

The webhook layer is unaffected — outbound webhook subscribers always receive every event their subscription opted into, regardless of customer scoping.

## Per-attempt audit

Each delivery attempt writes one row to `subscriber_deliveries`:

```text
id             uuid
subscriber_id  uuid
event_type     text
event_id       uuid
status         text ('ok', 'error', 'skipped')
status_code    integer (Resend response, when applicable)
error          text (truncated body on error)
attempted_at   timestamptz
```

The console **Subscribers** view exposes the most-recent attempt per subscriber for triage.

---
url: https://docs.use.observer/docs/reference/feed
title: RSS / Atom feed reference
description: Public feed shape, caching headers, and exclusion rules.
---

Every public status page exposes both Atom and RSS feeds:

```text
GET https://<page-host>/feed.atom
GET https://<page-host>/feed.rss
```

`<page-host>` is the page's `*.use.observer` subdomain or its custom domain.

## Content

One entry per incident message + one per maintenance lifecycle event (scheduled / started / completed / canceled). The granularity matches what a feed reader expects: each update on a single incident is a separate item, sorted newest-first.

## Headers

```text
Content-Type: application/atom+xml; charset=utf-8 (or application/rss+xml)
Cache-Control: public, max-age=60
ETag: "obs-<entry-count>-<latest-change>"
```

The ETag is computed from the count of feed-eligible entries and the maximum of (`published_at`, `resolved_at`, `actual_start_at`, `actual_end_at`, `canceled_at`, message dates). A repeat fetch with matching `If-None-Match` returns `304 Not Modified` with no body.

## Exclusions

The feed excludes:

- Customer-scoped incidents (rows in `update_customer_visibility` are unconditionally hidden, even when a customer JWT is present — feeds have no auth).
- Drafts (`published_at IS NULL` and not a maintenance with `scheduled_start_at`).
- Soft-deleted rows.

## Discoverability

Status pages emit a `<link rel="alternate">` tag in the page `<head>` so most feed readers auto-detect the URL.

## Limit

Default 50 entries. Override with the `?limit=` query parameter (for example `?limit=200`). The route caps at the hard limit set on the underlying query (200 today).

---
url: https://docs.use.observer/docs/troubleshooting/page-renders-blank
title: Status page renders blank
description: Diagnose a public status page that returns 200 but shows no content blocks.
---

A status page that resolves to the right host but renders no content is almost always one of three problems: the page exists but has no blocks added, the metrics on the page have not yet reported, or the page's access mode is gating the visitor.

## Step 1: confirm the page exists

In the console, open **Pages** and verify the subdomain matches the URL the visitor reaches.
The subdomain field is unique per organisation; a typo on save produces a different URL than expected. The values `admin` and `blog` are reserved and are not valid status page subdomains.

## Step 2: confirm content blocks are present

Open the page in the builder. A page with the title and theme set but no blocks added renders an empty body. Drag a **Metrics** block onto the canvas, select at least one metric, and save.

A common variation: blocks were added but never saved. The builder's draft state is local until **Save** commits it.

## Step 3: confirm the metrics are reporting

If the page has metric blocks but the visitor sees no values, open each metric in the console and check the **Latest** column. If the metric has not received a value, the agent has not yet reported. Walk [Metric shows no data](/docs/troubleshooting/metric-shows-no-data).

## Step 4: confirm access mode

Under the page's **Access** tab, the access mode determines who can see the page:

| Mode | Who sees content |
|---|---|
| Public | Anyone with the URL. |
| Password | Visitors with the page's shared password. |
| IP allowlist | Visitors from configured IP ranges. |
| Customer-scoped (JWT) | Visitors with a valid JWT bound to a customer. |

A page rendering blank for the operator while logged into the console, but rendering content in an incognito window, often points at logged-in/logged-out cookie state. Open the page in a new private window to check.

## Step 5: check the browser console

A specific failure mode: a page whose custom CSS hides body content. Open the browser developer tools' **Network** tab and confirm the document body returns 200 with markup. Check the **Console** tab for hydration errors. If custom CSS is the cause, edit the page's **CSS** tab and remove the offending rules.

---
url: https://docs.use.observer/docs/troubleshooting/metric-shows-no-data
title: Metric shows no data
description: Diagnose a metric that displays no current value or status in the console.
---

A metric with no recent value is almost always one of three problems: no agent is assigned to the metric, the assigned agent is not running, or the agent is running but the probe itself returns an error.

## Step 1: confirm an agent is assigned

In the console, open **Metrics**, then the metric in question. The detail page shows the assigned agent. If the field is empty, the metric is defined but no agent is collecting it. Set the **Agent** field, save, and wait one push interval (default one minute). The agent refreshes its assignment list from the cloud's metric-definitions endpoint every five minutes; restart the agent to pull the updated list immediately.

## Step 2: confirm the agent is running

Open **Agents** in the console and verify the assigned agent shows status **running**. If it shows **stopped**, walk the [stalled agent diagnosis](/agent/guides/diagnose-stalled-agent).

## Step 3: confirm the probe is succeeding

If the metric reports `no_data` rather than no value at all, the agent ran the probe and the probe failed. The metric's detail page shows the latest `reason` string.

| `reason` substring | Probable cause |
|---|---|
| `ECONNREFUSED` | The target's port is closed or the host is unreachable. Verify network reachability from the agent's host. |
| `ENOTFOUND` | DNS resolution failed. Check `PROMETHEUS_SERVER_URL` or the probe's target hostname. |
| `ETIMEDOUT` | Target is reachable but did not respond within the configured timeout. |
| `HTTP 401` / `HTTP 403` | Authentication or authorization failed against the probe target. |
| `prometheus query empty` | The PromQL returned no series. The series name or label match probably does not exist. |

For Prometheus probes, run the query directly against the Prometheus server (the same URL the agent uses) and confirm it returns a single scalar.

## Step 4: confirm the threshold rule is correct

A metric that reports values but never reaches `healthy` (everything lands in `degraded`) typically has thresholds that do not cover the value range. Open the metric and verify:

- The healthy pair (`healthy_operation` / `healthy_value`) defines a band the value can actually reach.
- The unhealthy pair (`unhealthy_operation` / `unhealthy_value`) defines the failure band.
- Comparison operators are strict (`over` is `>`, not `>=`; `under` is `<`, not `<=`). A value exactly on a boundary does not match that band.

## Step 5: confirm the dashboard view

If the metric reports values in the console's metric detail page but a status page shows no data, verify:

- The metric is on the page (open the page builder).
- The metric's `is_public` flag is set (visible on the metric edit page).

Each probe type has a dedicated configuration guide: [Prometheus](/agent/guides/prometheus-source), [HTTP](/agent/guides/http-probes), [TCP](/agent/guides/tcp-probes), [DNS](/agent/guides/dns-probes), [TLS certificate](/agent/guides/tls-cert-probes). Each one covers the probe-specific failure modes in detail.

---
url: https://docs.use.observer/docs/troubleshooting/webhook-deliveries-failing
title: Webhook deliveries failing
description: Diagnose a webhook subscription whose deliveries do not reach the receiver, or whose receiver rejects them.
---

A failing webhook subscription presents in one of three ways: the delivery log shows non-2xx responses from the receiver, the log shows network errors before the receiver was reached, or the log is empty when the operator expected events.

## Step 1: read the delivery log

Open **Webhooks**, the subscription in question, then **Delivery log**. Each entry shows:

- `event_type` and `event_id`.
- `attempted_at`.
- `response_status` (or a network-level error string).
- `response_body` (truncated).

If the log is empty, the events the subscription is bound to have not fired since the subscription was created. Trigger a test event by changing a metric's threshold to flip status, or wait for an organic event.

## Step 2: non-2xx from the receiver

| Status | Probable cause |
|---|---|
| `400` | The receiver expects a different payload schema. Compare against [Webhook payload reference](/docs/reference/webhook-payloads). |
| `401` / `403` | Authentication required. Receivers like generic Slack apps or HMAC-protected endpoints require headers Observer does not set by default. |
| `404` | URL is wrong. Re-paste from the receiver's documentation. |
| `429` | Receiver is rate-limiting. Reduce subscription scope or contact the receiver's vendor. |
| `5xx` | Receiver is failing. The delivery worker retries with exponential backoff up to a fixed cap; deliveries are then moved to the dead-letter view. |

The delivery worker retries non-2xx responses. If retries exhaust, the entry moves to the **Dead letter** view; manual replay is available there.

## Step 3: network errors

If `response_status` is missing and the log shows a network-level error:

| Error substring | Probable cause |
|---|---|
| `ECONNREFUSED` | Receiver host is not listening on the configured port. |
| `ENOTFOUND` | DNS resolution failed. Verify the URL hostname. |
| `ETIMEDOUT` | Receiver did not respond within the request timeout. |
| | `CERT_HAS_EXPIRED` / `UNABLE_TO_VERIFY_LEAF_SIGNATURE` | The receiver's TLS certificate is expired or untrusted by the public CA bundle. Renew the certificate; Observer does not accept self-signed certificates against a public endpoint. | ## Step 4: signature verification on the receiver If the receiver computes a signature mismatch but the URL and secret are correct: - Confirm the secret is the value Observer's webhook subscription page shows, not a copy with whitespace. - Confirm the receiver computes `HMAC-SHA-256(secret, raw_body)` and compares against the `X-Observer-Signature` header value (after the `sha256=` prefix). - Confirm the receiver hashes the raw request body, not a re-serialised JSON. Some web frameworks parse and re-serialise request bodies on the way to the handler; the recomputed signature does not match. ## Step 5: subscription is disabled A subscription whose **enabled** flag is off does not deliver events. The subscription edit page exposes the toggle. If a subscription was disabled while debugging, re-enable it to resume deliveries; new events fire from the moment it is re-enabled, not backfilled. Webhook delivery actions write audit log rows under the `webhook` category. Audit log rows carry the failing receiver URL and the response status, which helps when correlating across multiple subscriptions. --- url: https://docs.use.observer/docs/troubleshooting/sso-not-working title: SSO not working description: Diagnose JWT-based access on customer-scoped pages and authentication issues for the console. --- Two distinct authentication paths exist in Observer. The **console** uses Observer's hosted authentication for operators. **Customer-scoped pages** verify JWTs that customers' identity providers issue. The two paths fail for different reasons; this page covers both. ## Customer-scoped pages: JWT verification fails A customer reaches a customer-scoped page and is denied access even with what they believe is a valid token. Walk: ### Step 1: confirm the page is in customer-scoped mode Open the page's **Access** tab. The mode must be **Customer-scoped (JWT)**. If it is set to anything else, the JWT header is ignored. ### Step 2: confirm the issuer keys match The page's access config holds either a static public key or a JWKS endpoint. The token must be signed by a key the page can verify. - Static keys: confirm the key the customer's IdP is using matches the value pasted into the access config. Re-paste from the IdP on a fresh copy. - JWKS endpoint: confirm the URL is reachable from Observer Cloud and returns valid JWKS JSON. Cache invalidation can cause stale keys; the configured cache TTL determines refresh frequency. ### Step 3: confirm the claim mapping The access config specifies which JWT claim resolves to a customer (typically `sub`, `customer_id`, or a custom claim). The token must carry that claim, and the value must match a customer in the organisation's list AND that customer must be bound to this page. A token whose claim does not resolve to a bound customer returns 403 even when the signature is valid. Open **Customers** and verify both: 1. A customer record exists with the value the JWT carries. 2. That customer is on the page's customer-binding list. ### Step 4: confirm the token has not expired The `exp` claim in the JWT is enforced. Tokens past expiry are rejected. The customer's IdP integration is responsible for issuing fresh tokens. 
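When pasting a customer's token into a third-party site is undesirable, the claims can be inspected locally instead. A minimal sketch, assuming Bun or Node 16+ (decode only; this does not verify the signature, so treat the output as the customer's unverified claims):

```ts
// Decode a JWT's payload segment for triage. Inspection only; no signature check.
function decodeJwtClaims(token: string): Record<string, unknown> {
  const segments = token.split(".");
  if (segments.length !== 3) throw new Error("not a JWT: expected three segments");
  return JSON.parse(Buffer.from(segments[1], "base64url").toString("utf8"));
}

const claims = decodeJwtClaims(process.argv[2] ?? "");
const exp = new Date(Number(claims.exp) * 1000);
console.log({
  sub: claims.sub, // or whichever claim the page's access config maps to a customer
  aud: claims.aud,
  iss: claims.iss,
  exp: exp.toISOString(),
  expired: exp.getTime() < Date.now(), // past-expiry tokens are rejected
});
```

The same decode surfaces the `aud` and `iss` values checked in the next step.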
### Step 5: confirm the audience and issuer If the access config sets `audience` or `issuer` constraints, the token must carry matching `aud` and `iss` claims. A token issued for a different audience returns 403. The fastest verification is to decode the customer's failing JWT at [jwt.io](https://jwt.io), inspect the claims, and compare against the page's access config field-by-field. The JWT itself is the signed source of truth for what the customer claims. ## Console SSO: operators cannot sign in Observer Cloud's console authentication is hosted. If an operator cannot sign in: ### Step 1: confirm the email and organisation The operator must be a member of the organisation. Sign-in fails silently in some browsers if the email matches no account or if the account exists but has no organisation membership. The organisation owner can re-invite the operator from **Settings** > **Members** > **Invite**. The invitation arrives by email; following the link binds the account to the organisation. ### Step 2: confirm the email provider is reachable If invitations do not arrive: 1. Check spam folders. First-time invitations from a new domain often filter aggressively. 2. Confirm the recipient's email provider accepts mail from Observer's sender domain. Corporate gateways occasionally block transactional senders by default. 3. Resend the invitation; each invitation has a distinct confirmation token. ### Step 3: confirm MFA is configured If the operator authenticates but is rejected at the MFA step: 1. Confirm the MFA enrolment is bound to the same account they are signing in to. 2. If a recovery code was used, advise the operator to re-enrol their second factor immediately; the recovery code only grants one-time access. ### Step 4: SAML or social provider integrations Observer's hosted authentication supports email + password and a configurable set of social providers. Provider availability is determined by the cloud's configuration; if the social button the operator expects is missing, the provider is not enabled on this deployment. Contact the cloud operator. If every administrator on an organisation has lost access (for example, MFA hardware reset across a team), contact Observer support with proof of organisation ownership. Recovery is a manual process. --- url: https://docs.use.observer/docs title: Documentation description: Reference and guidance for Observer, the metrics-driven status page platform. --- This section covers product setup, day-two operations, and concepts. The two adjacent tabs are scoped narrower: the [Observer Agent](/agent) tab covers the on-premise data plane, and the [API](/api) tab is the generated REST reference. ## Quickstart Pick the metric path that matches your environment, then proceed to SLO and status page. The two metric quickstarts are alternatives, not sequential; complete one of them. ## Guides, reference, and concepts The remaining sections appear in the sidebar as content is published: - **Guides** cover task-shaped configuration: customer-scoped pages, password protection, custom domains, outbound webhooks, theme customization, and migration from other status page tools. - **Reference** covers plan limits, webhook payload shapes, audit log event names, and customization options. - **Concepts** covers the operating model: metric-based status, service level objectives, customer scopes, threshold semantics. 
--- url: https://docs.use.observer/agent/concepts/agent-cloud-boundary title: Agent and cloud boundary description: What crosses the network and what does not. --- The Observer Agent is the only component that runs inside your network. The cloud sits across an HTTPS boundary and never reaches back into your network. This page is the explicit description of what crosses that boundary and what stays put. ## What the agent sends to the cloud ```text POST /api/agent/heartbeat every ~30 seconds. Self-state report (queue depth, uptime, active source types). See the heartbeat payload reference. GET /api/agent/metrics-definitions every 5 minutes. Pull of the metric definitions assigned to this agent. The response is the canonical list the agent schedules against. POST /api/agent/receiver per status push. The body is one row: { metric_id, value, status, timestamp, reason? } POST /api/agent/log (optional) only when BROADCAST_LOGS=true. Forwards a subset of agent log lines for surfacing on the agent detail page. PromQL query strings are always redacted to a SHA-256 prefix and length. ``` That is the entire surface. There are no other outbound calls. ## What the agent does not send - Raw PromQL query strings. - Raw HTTP request bodies, response bodies, or response headers beyond what the probe required. - DNS resolver responses beyond a substring match against `expected_value` if configured. - TLS certificate chains. Only `days_until_expiry` and a few metadata fields (subject CN, issuer CN, valid_to) are sent. - Any metric series outside the explicit metric definitions. ## What the cloud sends to the agent Only the response to `GET /api/agent/metrics-definitions`. The response shape is the projection in the public `@observer/protocol` package's `MetricDefinition` type. The cloud has no path back into your network. It cannot pull from your Prometheus, hit your endpoints, or query your DNS. Every probe runs from the agent's vantage point. Several Observer customers run in environments where outbound HTTPS is the only allowed network path (PCI scope, regulated banking, defence). The agent is designed for that constraint: one outbound HTTPS connection to the cloud, no inbound connections, no other egress. ## Trust assumptions - The agent trusts the cloud's TLS certificate by default. Set `SKIP_SSL_VERIFICATION=true` in development only. - The cloud trusts the agent only after the agent presents a valid `AGENT_KEY`. Keys are bound to a single agent identity and a single organisation. - A compromised agent key affects only that agent's pushes. The cloud restricts each request to the agent's own organisation; a stolen key cannot read or write across tenants. --- url: https://docs.use.observer/agent/concepts/probes-vs-scraping title: Probes vs scraping description: Why the agent runs probes from inside your network instead of having the cloud scrape endpoints. --- The classical observability pattern (Prometheus, Datadog, Grafana Cloud) is centralised scraping: a central system reaches out to your endpoints on a schedule and pulls metrics. Observer takes the inverse position: the agent runs in your network and pushes outcomes to the cloud. This page covers why. ## The constraints scraping puts on you Scraping requires the central system to reach every endpoint it measures. In practice this means at least one of: 1. Public exposure of internal endpoints, sometimes with a reverse proxy or TLS-terminating load balancer purely to accept the scrape. 2. 
A VPN or peering connection from the central system back into your network. 3. A scraping agent inside your network that the central system pulls from (essentially shifting the same problem one hop). Each option grows the network attack surface and the legal review surface. For organisations in regulated environments (PCI, HIPAA, defence, finance), opening any inbound path is a months-long compliance exercise. ## The probe model Observer's agent runs inside your network, hits its targets locally, and pushes the verdict to the cloud. No IP allowlist at your edge, no TLS-terminating proxy in front of internal endpoints, no reverse VPN. The exact request surface (which endpoints, which payloads, what stays put) is enumerated in [Agent and cloud boundary](/agent/concepts/agent-cloud-boundary). The trade-off: collection happens in your network, so collection runs on your hardware. The agent is small (single process, roughly 40MB image, roughly 64MB RSS at idle) and runs anywhere a container can run. ## When scraping is still preferable When the targets are themselves SaaS systems with public scrape endpoints (a third-party API, a public DNS server, a hosted queue), the central system would have a clean path. Observer still uses the agent for these cases for one reason: a single configuration surface. Mixing "some metrics scraped centrally, some pushed from an agent" doubles the operator's mental model without an offsetting benefit. The agent runs the probe from wherever it sits and reports the result. --- url: https://docs.use.observer/agent/concepts/local-queue title: The local queue description: Why the agent buffers status pushes locally, and how the buffer behaves under cloud unreachability. --- Status pushes are written to a local SQLite file before the agent attempts to deliver them to the cloud. The file is the agent's durability layer. Pushes survive container restarts, daemon restarts, and the cloud being temporarily unreachable. ## Behaviour - Every status push enqueues a row in the local SQLite file (`BUFFER_PATH`, default `./observer-agent-buffer.db`). - A background drain controller pulls batches from the queue and posts them to the cloud's `/api/agent/receiver` endpoint. - Successful posts ack and remove rows from the queue. - Failed posts back off exponentially. The queue continues to accept new pushes during the outage. - When the queue reaches `BUFFER_MAX_ROWS` (default `10000`), oldest entries are evicted to admit new ones. The cloud is the source of truth for historical data. The local queue is a write-ahead log that protects against transient cloud failures, not a long-term store. ## What the operator sees The agent dashboard's queue panel shows three live numbers: - `depth`: rows currently waiting. - `oldest_age_seconds`: age of the oldest pending row. - `drain_backoff_ms`: current backoff between drain attempts. A growing depth combined with a non-zero backoff is the signature of cloud unreachability. Once the cloud is reachable again, the queue drains and depth returns to near zero. ## Cloud-side signals The cloud's heartbeat receiver computes two derived signals from the queue numbers in every heartbeat: - `agent.lag_high`: opens when `queue_depth > 1000` or `queue_oldest_age_seconds > 300`. Surfaces in the agent detail page and as a webhook event when subscribed. - `agent.uptime_degraded`: opens when 24-hour uptime falls below 95%. Both signals clear with 60-second hysteresis to avoid flapping. The queue is a process-local file. 
Running two agent processes with the same `AGENT_KEY` splits the queue between them and confuses the cloud's per-agent uptime computation. Run a single replica per agent identity. --- url: https://docs.use.observer/agent/concepts/bun-distroless-design title: Bun and distroless: design choices description: Why the agent runs on Bun and ships in a distroless image. --- The agent's runtime choices are deliberate. They shape the image size, the operator surface, and the security posture in production. ## Bun The agent runs on Bun rather than Node.js. The decisions Bun makes for us: - **Native TypeScript execution.** Source files run directly, with no transpile step in the build pipeline. - **Embedded SQLite.** The `bun:sqlite` module replaces `better-sqlite3`. Loses the build dependency on `python`, `make`, and `g++`. The container image shrinks accordingly. - **Native fetch.** `axios` is gone. One fewer dependency, one fewer attack surface. - **Native scheduling.** `setInterval` is enough; `node-cron` is not in the dependency graph. - **Automatic `.env` loading.** `dotenv` is gone for the agent's needs. The trade-off is that Bun is younger than Node and the ecosystem's edge cases sometimes show up. The agent's dependencies are deliberately narrow to limit exposure. ## Distroless The runtime image is `oven/bun:1-distroless`. The trade-offs: - **No shell.** `sh`, `bash`, `busybox`, `curl`, `wget`, and every other utility are absent. An attacker who reaches the container has no shell to drop into. - **No package manager.** No `apk`, `apt`, or anything that can install code at runtime. - **Smaller surface.** The image contains the Bun binary, libc, and the agent's source. Nothing else. Operational consequence: do not `kubectl exec -it` into the container expecting a shell. Diagnose through the dashboard, through logs, and through restart-and-observe. Distroless containers cannot run a shell-based health check (`HEALTHCHECK` instruction with curl + sh). The agent relies on liveness derived from container exit and the cloud's heartbeat-based agent.offline detection rather than a container-local health probe. ## Single-file source The runtime entry is `src/index.ts`. Surrounding modules (`buffer.ts`, `drain.ts`, `dashboard.ts`, `status.ts`, `sources/*`) are ESM imports. There is no bundler step before shipping. The image carries the source verbatim, runs it under Bun, and that is the entire chain from `git clone` to running process. ## Image size The runtime image is roughly 40MB on `linux/amd64`. The Bun distroless base is most of that; the agent's own code adds a few hundred kilobytes. Pull time on a fresh node is dominated by the base layer; tag-based caching makes subsequent pulls nearly instant. ## Standalone binary `bun build --compile` produces a single-file executable per platform. The binary embeds the Bun runtime and every dependency, so a release-binary install reduces to "download, chmod, run" with no runtime install on the host. Per-tag CI publishes five binaries to [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases): `linux-x64`, `linux-arm64`, `darwin-x64`, `darwin-arm64`, and `windows-x64.exe`, plus a `SHA256SUMS` file for verification. The binary path is the lightest install but offers fewer guardrails than the container path. Use the container image when you want isolation (separate user namespace), a single upgrade mechanism shared with other services, or tag-based version pinning at the container-runtime layer. 
Use the binary on constrained or air-gapped hosts that should not run Docker at all. Binaries are produced from the same source as the container — the build entry is `src/index.ts` either way. Runtime flags can be forwarded to the embedded Bun via `BUN_OPTIONS`; see [bun.com/docs/bundler/executables](https://bun.com/docs/bundler/executables#runtime-arguments-via-bun_options). --- url: https://docs.use.observer/agent/quickstart/install-binary title: Install from a release binary description: Download a single-file executable from GitHub Releases. No runtime install required on the host. --- The agent is published as a single-file binary per platform. Each binary embeds the Bun runtime + every dependency, so the install collapses to "download, chmod, run". Use this path on minimal hosts that cannot or should not run Docker. ## Prerequisites - An agent key from the Observer console (**Agents** > **New agent**). - A reachable Prometheus URL (only required for Prometheus probes). ## Steps ### Pick the binary for your platform Releases live at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases). Each release publishes five binaries plus a `SHA256SUMS` file. | Platform | File | |-----------------|---------------------------------------| | Linux x64 | `observer-agent-linux-x64` | | Linux arm64 | `observer-agent-linux-arm64` | | macOS x64 | `observer-agent-darwin-x64` | | macOS arm64 | `observer-agent-darwin-arm64` | | Windows x64 | `observer-agent-windows-x64.exe` | ### Download and verify ```bash VERSION=1.0.4 curl -fLO https://github.com/useobserver/agent/releases/download/agent-v${VERSION}/observer-agent-linux-x64 curl -fLO https://github.com/useobserver/agent/releases/download/agent-v${VERSION}/SHA256SUMS shasum -a 256 -c SHA256SUMS --ignore-missing chmod +x observer-agent-linux-x64 sudo mv observer-agent-linux-x64 /usr/local/bin/observer-agent ``` ### Run ```bash AGENT_KEY=obs_live_... \ CLOUD_SERVER_URL=https://use.observer \ PROMETHEUS_SERVER_URL=http://prometheus.local:9090 \ observer-agent ``` The dashboard listens on `http://localhost:10101`. The console's Agents page marks the agent as **running** within 90 seconds. ### Run as a systemd service (optional) ```ini title="/etc/systemd/system/observer-agent.service" [Unit] Description=Observer agent Wants=network-online.target After=network-online.target [Service] Type=simple EnvironmentFile=/etc/observer-agent.env ExecStart=/usr/local/bin/observer-agent Restart=on-failure RestartSec=5s User=observer [Install] WantedBy=multi-user.target ``` ```bash sudo install -m 600 -o root -g root /dev/stdin /etc/observer-agent.env <<'EOF' AGENT_KEY=obs_live_... CLOUD_SERVER_URL=https://use.observer PROMETHEUS_SERVER_URL=http://prometheus.local:9090 EOF sudo useradd --system --no-create-home observer 2>/dev/null || true sudo systemctl daemon-reload sudo systemctl enable --now observer-agent ``` ## Forwarding flags to the embedded Bun runtime The binary ships with a copy of the Bun runtime baked in. Runtime flags reach it via the `BUN_OPTIONS` environment variable, not via command-line arguments. Example: ```bash BUN_OPTIONS="--smol" observer-agent ``` See [bun.com/docs/bundler/executables](https://bun.com/docs/bundler/executables#runtime-arguments-via-bun_options) for the full list of supported flags. 
## Upgrades ```bash VERSION=1.0.4 curl -fLO https://github.com/useobserver/agent/releases/download/agent-v${VERSION}/observer-agent-linux-x64 chmod +x observer-agent-linux-x64 sudo mv observer-agent-linux-x64 /usr/local/bin/observer-agent sudo systemctl restart observer-agent ``` Pin to an exact version in your install scripts and CI. New releases at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases). Roll forward deliberately; rollbacks are a one-line revert of the URL. Use the [Docker quickstart](/agent/quickstart/install-docker) when your host already runs containers, when you want pinning via image tags, or when you want the agent isolated from the host's user namespace. The binary path is the lightest install but offers fewer guardrails. --- url: https://docs.use.observer/agent/quickstart/install-docker title: Install on Docker description: Run the published image with three required environment variables. --- The fastest path to a running agent. Suitable for a development host, a single VM, or a quick proof of concept. ## Prerequisites - Docker installed. - An agent key from the Observer console (**Agents** > **New agent**). The key is shown once; copy it before navigating away. - A Prometheus URL the host can reach (only required if the agent will run Prometheus probes; HTTP / TCP / DNS / TLS-cert probes do not need it). ## Steps ### Pull the image ```bash docker pull ghcr.io/useobserver/agent:1.0.4 ``` ### Run the container ```bash docker run -d \ --name observer-agent \ --restart unless-stopped \ -p 10101:10101 \ -e AGENT_KEY=obs_live_... \ -e CLOUD_SERVER_URL=https://use.observer \ -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \ ghcr.io/useobserver/agent:1.0.4 ``` The image listens on port `10101` for the debug dashboard. If the agent will not run Prometheus probes, omit `PROMETHEUS_SERVER_URL`. ### Confirm the connection Open `http://<host>:10101` in a browser. The dashboard's *Cloud* panel shows a recent `last_heartbeat_at` timestamp once the agent has registered with the cloud (typically within 30 seconds). The Agents page in the console marks the agent as **running** within 90 seconds. For multi-host or cluster deployments, see [Install on Kubernetes](/agent/quickstart/install-kubernetes). The systemd wrap below works for bare-metal Linux hosts that don't run k8s. The Docker path is recommended only for single-host development. ## Compose For a Compose-driven setup, use the snippet below. It pins the image, binds the dashboard, and reads secrets from a `.env` file. Pin to an exact version (`agent:1.0.4`). Track new releases at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases) and roll forward when ready. ```yaml title="docker-compose.yml" services: observer-agent: image: ghcr.io/useobserver/agent:1.0.4 container_name: observer-agent restart: unless-stopped env_file: [.env] ports: - "10101:10101" ``` ```bash title=".env" AGENT_KEY=obs_live_... CLOUD_SERVER_URL=https://use.observer PROMETHEUS_SERVER_URL=http://prometheus:9090 ``` ## Run under systemd Use this when the host is bare-metal Linux without k8s and you want Docker isolation alongside systemd auto-restart + journal log capture. Prefer [Install from a release binary](/agent/quickstart/install-binary) when Docker isn't a hard requirement — fewer moving parts. ```bash sudo install -m 600 -o root -g root /dev/stdin /etc/observer-agent.env <<'EOF' AGENT_KEY=obs_live_...
CLOUD_SERVER_URL=https://use.observer PROMETHEUS_SERVER_URL=http://prometheus.local:9090 EOF sudo docker pull ghcr.io/useobserver/agent:1.0.4 ``` ```ini title="/etc/systemd/system/observer-agent.service" [Unit] Description=Observer agent Wants=network-online.target docker.service After=network-online.target docker.service [Service] Type=simple EnvironmentFile=/etc/observer-agent.env ExecStartPre=-/usr/bin/docker rm -f observer-agent ExecStart=/usr/bin/docker run --rm --name observer-agent \ --network host \ --env-file /etc/observer-agent.env \ ghcr.io/useobserver/agent:1.0.4 ExecStop=/usr/bin/docker stop observer-agent Restart=on-failure RestartSec=5s [Install] WantedBy=multi-user.target ``` ```bash sudo systemctl daemon-reload sudo systemctl enable --now observer-agent journalctl -u observer-agent -n 50 --no-pager ``` `--network host` lets the agent reach Prometheus and probe targets bound to the host's loopback or its private interface. If your targets sit on a Docker bridge instead, drop the host-network flag and use the bridge IP. --- url: https://docs.use.observer/agent/quickstart/install-kubernetes title: Install on Kubernetes description: Deployment manifest with Secret-bound credentials. --- The recommended deployment path for production. Single-replica Deployment, agent key delivered through a Kubernetes Secret. ## Prerequisites - A Kubernetes cluster you can deploy into. - An agent key from the Observer console. - A reachable Prometheus URL inside the cluster (typically a `Service` in the monitoring namespace). ## Steps ### Create a namespace and Secret ```bash kubectl create namespace observer kubectl create secret generic observer-agent \ --namespace observer \ --from-literal=agent-key='obs_live_...' ``` ### Apply the Deployment ```yaml title="agent.yaml" apiVersion: apps/v1 kind: Deployment metadata: name: observer-agent namespace: observer spec: replicas: 1 selector: matchLabels: { app: observer-agent } template: metadata: labels: { app: observer-agent } spec: containers: - name: agent image: ghcr.io/useobserver/agent:1.0.4 imagePullPolicy: IfNotPresent ports: - { name: dashboard, containerPort: 10101 } env: - name: AGENT_KEY valueFrom: secretKeyRef: { name: observer-agent, key: agent-key } - name: CLOUD_SERVER_URL value: https://use.observer - name: PROMETHEUS_SERVER_URL value: http://prometheus.monitoring.svc.cluster.local:9090 resources: requests: { cpu: "50m", memory: "64Mi" } limits: { cpu: "500m", memory: "256Mi" } ``` ```bash kubectl apply -f agent.yaml ``` ### Confirm the connection Port-forward the dashboard: ```bash kubectl -n observer port-forward deploy/observer-agent 10101:10101 ``` Open `http://localhost:10101`. The *Cloud* panel reports `last_heartbeat_at` within 30 seconds. The console's Agents page marks the agent as **running** within 90 seconds. Run a single replica per agent identity. The agent's local queue uses an embedded SQLite file inside the container; running two pods with the same key splits the queue and confuses the cloud's agent.offline detection. Pin to an exact version (`agent:1.0.4`). Track new releases at [github.com/useobserver/agent/releases](https://github.com/useobserver/agent/releases) and roll forward when ready. ## Optional: dashboard Service To expose the dashboard inside the cluster (without `port-forward`), add a ClusterIP Service. Do not expose the dashboard externally; the dashboard is operator-facing only. 
```yaml title="agent-service.yaml" apiVersion: v1 kind: Service metadata: name: observer-agent namespace: observer spec: type: ClusterIP selector: { app: observer-agent } ports: - { name: dashboard, port: 10101, targetPort: 10101 } ``` --- url: https://docs.use.observer/agent/guides/prometheus-source title: Configure Prometheus query metrics description: Define a metric whose value comes from a PromQL query the agent runs against your Prometheus. --- Prometheus is the most common metric source. The agent runs the PromQL query against the Prometheus URL configured at deploy time and reports the scalar result. ## Configuration shape A Prometheus metric carries a `source_config` of: ```json { "query": "rate(http_requests_total{job=\"checkout-api\",status=~\"5..\"}[5m]) / rate(http_requests_total{job=\"checkout-api\"}[5m])", "prometheus_url": "https://prometheus.example/ (optional override)" } ``` The optional `prometheus_url` overrides the agent's `PROMETHEUS_SERVER_URL` for this single metric. Use it when one agent serves multiple Prometheus servers. ## Query requirements - The query must return a single scalar value. Use aggregation (`sum`, `avg`, `rate`, etc.) to collapse vector results. - Empty results are reported as `no_data`; the metric does not flip status until the query produces a value. - The agent does not interpret the query content. The cloud's push payload contains the precomputed status, not the query string. ## Authentication When Prometheus requires basic auth, set on the agent: ```bash PROMETHEUS_BASIC_AUTH_ENABLED=true PROMETHEUS_USERNAME=... PROMETHEUS_PASSWORD=... ``` These apply to every Prometheus probe the agent runs. Grafana Cloud's hosted Prometheus uses basic auth. The credentials issued by Grafana for read access drop in here unchanged. See [Connect to Grafana Cloud](/agent/guides/connect-grafana-cloud). ## Threshold examples | Query | Healthy | Unhealthy | |---|---|---| | 5xx error ratio (last 5 min) | `under 0.005` | `over 0.02` | | p95 latency in ms | `under 500` | `over 2000` | | Queue depth | `under 100` | `over 1000` | | Replica count | `over 1` | `under 1` | The strict-comparison rule applies (see [threshold operators](/docs/reference/threshold-operators) in the Documentation tab). --- url: https://docs.use.observer/agent/guides/http-probes title: Configure HTTP probes description: Probe an HTTP endpoint and report response time as the metric value. --- HTTP probes hit a URL on the configured interval. The reported value is `response_time_ms` for successful requests, or `no_data` with a reason code when the request fails (timeout, connection refused, body mismatch, unexpected status). ## Configuration shape ```json { "url": "https://api.example.com/healthz", "method": "GET", "expected_status": 200, "timeout_ms": 5000, "headers": { "User-Agent": "observer-agent" }, "body_match": "ok", "follow_redirects": true, "verify_tls": true } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `url` | required | Full URL including scheme. | | `method` | `GET` | One of `GET`, `HEAD`, `POST`, `PUT`, `PATCH`, `DELETE`, `OPTIONS`. | | `expected_status` | `200` | Single integer or array. The probe matches if the response code is in the set. | | `timeout_ms` | `5000` | Aborts the request when exceeded. Reports `ETIMEDOUT`. | | `headers` | none | Extra request headers. Common use: API key for protected endpoints. | | `body_match` | none | Optional substring match against the first 4KB of the response body. Mismatch reports `body_mismatch`. 
| | `follow_redirects` | `true` | When `false`, redirect responses count against `expected_status`. | | `verify_tls` | `true` | When `false`, the probe accepts invalid TLS certificates. Useful for self-signed internal endpoints. | ## Reason codes The `reason` field on `no_data` results uses values from the HTTP client and Node socket layer: - `ETIMEDOUT`: request exceeded `timeout_ms`. - `ECONNREFUSED`: connection refused at the TCP layer. - `ENOTFOUND`, `EAI_AGAIN`: DNS resolution failed. - `unexpected_status:<code>`: status code not in `expected_status`. - `body_mismatch`: `body_match` was set and the response body did not contain it. Only the first 4KB of the response body is read. If the marker string is later in the response, the probe reports `body_mismatch`. Move the marker earlier in the response, or use a dedicated health endpoint that returns it in the first kilobyte. ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Endpoint reachable, fast | `under 500` | `over 2000` | | Endpoint reachable | `under 5000` | `over 10000` | For pure reachability with no latency requirement, set the unhealthy threshold equal to the timeout and rely on `no_data` for failures. --- url: https://docs.use.observer/agent/guides/tcp-probes title: Configure TCP probes description: Open a TCP connection and report connect time as the metric value. --- TCP probes are appropriate for non-HTTP services where reachability of a port is the signal: Redis, Postgres, RabbitMQ, internal RPC services. The agent opens a TCP connection, records the connect time in milliseconds, and closes the connection. ## Configuration shape ```json { "host": "redis.internal", "port": 6379, "timeout_ms": 2000 } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `host` | required | Hostname or IP. | | `port` | required | Integer in `1..65535`. | | `timeout_ms` | `2000` | Aborts the connection attempt when exceeded. | ## Reason codes | Reason | Meaning | |---|---| | `ETIMEDOUT` | Connection attempt did not complete within `timeout_ms`. | | `ECONNREFUSED` | TCP connection refused. | | `ENOTFOUND` / `EAI_AGAIN` | DNS resolution failed. | | `tcp_error` | Other socket error. The exact code is logged on the agent. | ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Reachable + fast handshake | `under 50` | `over 500` | | Reachable | `under 1000` | `over 1500` | Pure reachability with no latency requirement: set unhealthy at `timeout_ms - 1`, leaving anything below as healthy. --- url: https://docs.use.observer/agent/guides/dns-probes title: Configure DNS probes description: Resolve a record and report resolve time as the metric value. --- DNS probes resolve a domain through the agent's DNS resolver and report the resolution time in milliseconds. Optional value-match verifies the answer. ## Configuration shape ```json { "domain": "api.example.com", "record_type": "A", "expected_value": "203.0.113.10", "resolver": "1.1.1.1" } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `domain` | required | The domain to resolve. | | `record_type` | `A` | One of `A`, `AAAA`, `CNAME`, `MX`, `TXT`, `NS`, `SRV`, `CAA`, `PTR`. | | `expected_value` | none | Optional substring match against the resolved record. Mismatch reports `expected_value_mismatch`. | | `resolver` | system default | Optional override resolver IP. Useful for verifying a specific authoritative server.
| ## Reason codes The `reason` field surfaces standard Node DNS error codes: | Reason | Meaning | |---|---| | `ENOTFOUND` | The domain does not resolve. | | `ETIMEDOUT` | The resolver did not answer in time. | | `ESERVFAIL` | The resolver returned `SERVFAIL`. | | `expected_value_mismatch` | Resolution succeeded but the record did not contain `expected_value`. | | `dns_error` | Other resolver error. | ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Authoritative answer fast | `under 50` | `over 500` | | Resolution succeeds at all | `under 5000` | `over 10000` | When the test is purely "does the domain still resolve", `unhealthy_value` set to a timeout-equivalent threshold combined with `no_data` on `ENOTFOUND` covers the case. --- url: https://docs.use.observer/agent/guides/tls-cert-probes title: Configure TLS certificate probes description: Connect to a TLS endpoint and report days until certificate expiry. --- TLS certificate probes connect to a host on a TLS port, read the peer certificate, and report `days_until_expiry`. Use them to fire a clear signal before a public certificate lapses. ## Configuration shape ```json { "host": "api.example.com", "port": 443, "warn_days": 30, "critical_days": 7 } ``` ## Field reference | Field | Default | Notes | |---|---|---| | `host` | required | Hostname (preferred) or IP. SNI is set automatically when the host is a hostname. | | `port` | `443` | TLS port to connect to. | | `warn_days` | `30` | Informational marker. The agent reports the value regardless; thresholds drive status. | | `critical_days` | `7` | Same as above. The relationship `warn_days >= critical_days` is enforced. | The probe accepts certificates that fail validation (expired, self-signed, hostname mismatch). The intent is to surface the problem rather than refuse the connection. Status is computed from `days_until_expiry`. ## Threshold examples | Goal | Healthy | Unhealthy | |---|---|---| | Standard renewal cadence | `over 30` | `under 7` | | Aggressive (Let's Encrypt 90d) | `over 14` | `under 3` | Negative `days_until_expiry` indicates the certificate has already expired. Set `unhealthy` at `under 0` to treat that as a hard unhealthy. ## Reason codes | Reason | Meaning | |---|---| | `no_cert` | Server completed TLS but did not present a certificate. | | `bad_cert_date` | Certificate's `valid_to` could not be parsed. | | `ETIMEDOUT` | Connection did not complete in time. | | `ECONNREFUSED` | Connection refused at the TCP layer. | | `tls_error` | Other TLS-handshake error. | When `host` is an IP literal, the SNI hint is omitted (RFC 6066 forbids IPs as SNI values). Some virtual-hosted servers will not return the expected certificate without SNI. Probe a hostname whenever the system supports one. --- url: https://docs.use.observer/agent/guides/connect-grafana-cloud title: Connect to Grafana Cloud description: Use a Grafana Cloud Prometheus endpoint as the agent's metric source. --- Grafana Cloud's hosted Prometheus is a valid source for Observer agents. The connection uses basic auth with credentials issued by Grafana for read access. ## Steps 1. In Grafana Cloud, open the stack details for the Prometheus instance you want the agent to read. Note: - **URL**: the remote-read URL (e.g. `https://prometheus-prod-01-eu-west-0.grafana.net/api/prom`). - **Username**: the numeric user id (e.g. `123456`). - **Password**: a Grafana Cloud access policy token with `metrics:read` scope. 2. 
Set the agent's environment: ```bash title=".env" PROMETHEUS_SERVER_URL=https://prometheus-prod-01-eu-west-0.grafana.net/api/prom PROMETHEUS_BASIC_AUTH_ENABLED=true PROMETHEUS_USERNAME=123456 PROMETHEUS_PASSWORD=glc_eyJ... # access policy token ``` 3. Restart the agent. Heartbeats and probe queries now hit Grafana Cloud. ## Verification Define a Prometheus metric in the console (any working PromQL query against the data Grafana Cloud holds). Within one push interval the metric reports a value. Failures resolve to `no_data` with a reason: - `Unauthorized` when the access policy token is missing or lacking the required scope. - `BadQuery` when the PromQL string is invalid against the data. - `PromUpstream` for Grafana-side 5xx responses. When one agent reads from multiple Prometheus servers (Grafana Cloud + a local Prometheus, for example), the `PROMETHEUS_SERVER_URL` env var sets the default and individual metrics override it via the `prometheus_url` field on the metric definition. --- url: https://docs.use.observer/agent/guides/read-the-dashboard title: Read the agent dashboard description: How to interpret the panels exposed on the agent's debug HTTP surface. --- The agent serves a read-only debug dashboard on `http://<agent-host>:10101` by default. The dashboard polls the agent's in-process state every five seconds and never mutates anything. ## Panels ### Process Identifies the agent: version string, Bun runtime version, process uptime, and resident-set memory. Use this to confirm the running image matches the version you expected after an upgrade. ### Config Lists the environment variables the agent considers relevant to operation, with values masked to `first-4 + tail-4` characters. Variables outside the allowlist are not displayed regardless of their name. The masking applies even to values that are not themselves secrets, so screenshots of the dashboard never reveal a full token. ### Queue Reports the local SQLite queue's current depth, the age of the oldest pending push, the configured capacity, and the drain controller's current backoff in milliseconds. A growing queue combined with a non-zero backoff is the signature of cloud unreachability. ### Cloud Reports the configured `CLOUD_SERVER_URL` and the timestamps and results of the last heartbeat and the last metric push. Failed recent calls show their error reason here. ### Prometheus Reports the configured `PROMETHEUS_SERVER_URL`, the outcome of the last Prometheus probe (`success`, `no_data`, `error`, or `null` if the agent has not run a Prometheus probe yet), and the timestamp. ### Definitions One row per metric the cloud has assigned to this agent. Each row shows the metric id, source type, intervals, and the last reported status, value, timestamp, and reason. Use this to confirm a specific metric is being collected. ### Active source types The list of source types the agent has actually run since boot. Useful when verifying that a stubbed runtime (e.g. `database`) is not silently unused. ## Toggling the dashboard The dashboard is enabled by default. To disable it set `ENABLE_DEBUG_DASHBOARD=false` in the agent's environment. To change the bind address or port, set `DEBUG_DASHBOARD_HOST` and `DEBUG_DASHBOARD_PORT`. The dashboard is intended for the operators who run the agent. Do not expose it to the public internet. In Kubernetes, expose it through a `ClusterIP` Service rather than a `LoadBalancer` or `Ingress`.
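The Config panel's `first-4 + tail-4` masking is simple enough to restate in code when building tooling around dashboard output. A minimal sketch of the rule as described (illustrative; the exact separator and short-value handling are assumptions, not the agent's actual implementation):

```ts
// Mask a value to its first four and last four characters, per the
// dashboard's described first-4 + tail-4 rule. Values of eight characters
// or fewer are fully masked here (assumed edge-case handling).
function maskValue(value: string): string {
  if (value.length <= 8) return "*".repeat(value.length);
  return `${value.slice(0, 4)}...${value.slice(-4)}`;
}

console.log(maskValue("obs_live_abc123def456xyz")); // "obs_...6xyz"
```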
--- url: https://docs.use.observer/agent/guides/diagnose-stalled-agent title: Diagnose a stalled agent description: Triage path when an agent stops reporting or its queue depth grows. --- A stalled agent surfaces in two places: the cloud marks it as **stopped** on the Agents page (no heartbeat in the expected window), and the local dashboard's queue depth grows beyond a few pending entries. The triage below covers both. ## Step 1: confirm the failure mode Open the agent's dashboard at `http://<agent-host>:10101`. Four patterns are common: | Pattern | Probable cause | |---|---| | Process up, queue growing, last heartbeat recent but failing | Cloud reachability problem from this host. | | Process up, queue depth zero, heartbeat succeeding, but cloud says **stopped** | Clock skew or stale state in the console (refresh). | | Process up, no probes have run | The cloud has not assigned any metrics to this agent yet. | | Dashboard unreachable | The agent process is down. Container restart loop or host crash. | ## Step 2: agent down If the dashboard is unreachable: ```bash # Docker docker logs --tail 200 observer-agent # Kubernetes kubectl -n observer logs deploy/observer-agent --tail=200 # Linux systemd journalctl -u observer-agent -n 200 --no-pager ``` Look for a panic, an unhandled rejection, or a configuration error on startup. Common causes: missing `AGENT_KEY`, malformed `CLOUD_SERVER_URL`, port `10101` already in use. ## Step 3: cloud unreachable If the dashboard's *Cloud* panel shows a recent `last_heartbeat_error`, the issue is between the agent and the cloud. Verify in this order: 1. **DNS**: `getent hosts <cloud host>` from the agent's host (the distroless image has no shell to exec into). 2. **TCP**: `curl -v https://<cloud host>` from the same host. 3. **TLS**: certificate trust. Custom internal CAs need the container's trust store updated. 4. **Auth**: a recently rotated `AGENT_KEY` requires updating the agent's environment. The drain controller automatically retries with exponential backoff. The queue continues to accept pushes up to its capacity (`BUFFER_MAX_ROWS`, default `10000`). Once the cloud is reachable again, the queue drains. ## Step 4: queue saturation If the queue depth has hit `BUFFER_MAX_ROWS`, the oldest entries are dropped to admit new ones. The dashboard's queue panel shows the depth at the cap. After cloud reachability returns, the queue drains and depth returns to near zero. The cloud's `agent.offline` webhook fires when the heartbeat window is exceeded. Subscribe to this event when on-call needs an explicit alert. The agent only runs probes for metrics the cloud has assigned to it. If a freshly registered agent shows no probe activity, the cause is on the cloud side: open the metric in the console, confirm its **Agent** field is set, and save. --- url: https://docs.use.observer/agent/guides/rotate-agent-key title: Rotate the agent's authentication key description: Generate a new agent key, deploy it, and retire the old one with no observability gap. --- Agent keys can be rotated through the console without an observability gap. The cloud accepts both the new key and the previous key for a configurable grace window, so the deployment can roll over without strict synchronisation. ## Steps ### Generate a new key In the console, open **Agents**, select the agent, then **Rotate key**. The cloud: 1. Generates a new key, stores its hash, and returns the plaintext once. 2. Demotes the previous key to `previous_agent_key_hash` with a `previous_key_valid_until` timestamp (default: 24 hours from rotation). Copy the new key.
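Before deploying the new key, it helps to picture the acceptance rule the grace window creates, summarised under "What the cloud sees" below. A sketch using the field names this page describes (illustrative; the cloud's actual code is closed source):

```ts
import { createHash } from "node:crypto";

const sha256 = (key: string) => createHash("sha256").update(key).digest("hex");

// Field names follow this page's description of the cloud's key record.
interface AgentKeyRecord {
  agent_key_hash: string;
  previous_agent_key_hash?: string;
  previous_key_valid_until?: string; // ISO timestamp: end of the grace window
}

// Accept the current key always; accept the demoted previous key only
// while the grace window is still open.
function acceptsKey(record: AgentKeyRecord, presented: string, now = new Date()): boolean {
  const hash = sha256(presented);
  if (hash === record.agent_key_hash) return true;
  return (
    hash === record.previous_agent_key_hash &&
    record.previous_key_valid_until !== undefined &&
    new Date(record.previous_key_valid_until) > now
  );
}
```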
### Deploy the new key Update the agent's `AGENT_KEY` environment variable to the new value. The deployment path depends on your runtime: - **Docker**: `docker run -e AGENT_KEY=<new key>` and restart the container. - **Kubernetes**: update the `observer-agent` Secret and roll the Deployment (`kubectl rollout restart deploy/observer-agent`). - **systemd-managed Docker**: edit `/etc/observer-agent.env`, then `systemctl restart observer-agent`. The agent reconnects with the new key on its next heartbeat. ### Confirm the rotation took effect Open the agent's dashboard. The *Cloud* panel reports a successful heartbeat with the new key. The Agents page in the console shows the agent as **running** with the new key fingerprint. ### Retire the old key The previous key automatically becomes invalid at `previous_key_valid_until`. To retire it sooner, open the agent in the console and set the grace window to zero. Subsequent requests with the previous key are rejected. ## What the cloud sees - The cloud stores the SHA-256 of each key, never the plaintext. - A request with the new key matches `agent_key_hash` and succeeds. - A request with the previous key matches `previous_agent_key_hash`, and succeeds only while `previous_key_valid_until` is in the future. - A lost key cannot be recovered. Rotate to issue a replacement. Treat agent keys with the same care as any service credential. Rotate when an environment file is shared, when a developer with access leaves, or when a host's image is exported. The grace window makes rotation cheap; do it often. --- url: https://docs.use.observer/agent/reference/environment-variables title: Environment variables description: Every environment variable the agent reads, with defaults and meaning. --- The agent is configured through environment variables. There is no configuration file; this keeps the runtime container immutable and the deployment surface small. ## Required | Variable | Notes | |---|---| | `AGENT_KEY` | Authentication key issued by the cloud. Format `obs_live_<43 base64url chars>`. The cloud stores its hash, never the plaintext. | | `CLOUD_SERVER_URL` | Base URL of Observer Cloud. Defaults to `https://localhost:3000` (development only). Override in every real deployment. | ## Required for Prometheus probes | Variable | Notes | |---|---| | `PROMETHEUS_SERVER_URL` | Base URL of the Prometheus the agent should query. Used as the default for every Prometheus metric, overridable per-metric via `prometheus_url` in the metric's source config. | ## Optional Prometheus auth | Variable | Default | Notes | |---|---|---| | `PROMETHEUS_BASIC_AUTH_ENABLED` | `true` | Set to any value other than `true` to disable. When enabled, the agent sends `Authorization: Basic <base64(username:password)>` on every Prometheus request. | | `PROMETHEUS_USERNAME` | `admin` | Basic auth username. | | `PROMETHEUS_PASSWORD` | empty | Basic auth password. Treat as a secret. | ## Dashboard | Variable | Default | Notes | |---|---|---| | `ENABLE_DEBUG_DASHBOARD` | `true` | Set to `false` to disable the local debug dashboard. | | `DEBUG_DASHBOARD_HOST` | `0.0.0.0` | Bind address for the dashboard HTTP listener. | | `DEBUG_DASHBOARD_PORT` | `10101` | Port for the dashboard HTTP listener. | ## Logging | Variable | Default | Notes | |---|---|---| | `BROADCAST_LOGS` | `false` | When `true`, the agent forwards a subset of its log lines to the cloud for surfacing in the agent detail page. PromQL query strings are always redacted to a SHA-256 prefix and length, regardless of this flag.
| | `LOG_BROADCAST_LEVEL` | `INFO` | Minimum level forwarded when `BROADCAST_LOGS=true`. One of `DEBUG`, `INFO`, `WARN`, `ERROR`. | | `VERBOSE` | `false` | Local stdout verbosity. | ## Local queue | Variable | Default | Notes | |---|---|---| | `BUFFER_PATH` | `./observer-agent-buffer.db` | Path to the agent's local SQLite write-ahead queue file. | | `BUFFER_MAX_ROWS` | `10000` | Hard cap on queued pushes. When the queue reaches the cap, oldest entries are evicted to admit new ones. | ## Other | Variable | Default | Notes | |---|---|---| | `SKIP_SSL_VERIFICATION` | `false` | Disables TLS verification on cloud-bound requests. Development only. | | `NODE_ENV` | unset | Affects log formatting. Set to `production` in production deployments. | Variables visible on the debug dashboard are masked to `first-4 + tail-4` characters. Variables not in the dashboard's allowlist are omitted entirely. The allowlist is intentionally narrow; everything outside it does not appear in the dashboard regardless of value. --- url: https://docs.use.observer/agent/reference/probe-types title: Probe types description: Source types the agent supports, with their value semantics and runtime status. --- | `source_type` | Value reported | Status | |---|---|---| | `prometheus` | scalar from PromQL query | shipped | | `http` | response_time_ms | shipped | | `tcp` | connect_time_ms | shipped | | `dns` | resolve_time_ms | shipped | | `tls_cert` | days_until_expiry | shipped | | `icmp` | n/a | stubbed | | `grpc` | n/a | stubbed | | `websocket` | n/a | stubbed | | `mtls_http` | n/a | stubbed | | `database` | n/a | stubbed | ## Shipped runtimes Each shipped runtime has a dedicated guide: - [Prometheus](/agent/guides/prometheus-source) - [HTTP](/agent/guides/http-probes) - [TCP](/agent/guides/tcp-probes) - [DNS](/agent/guides/dns-probes) - [TLS certificate](/agent/guides/tls-cert-probes) ## Stubbed runtimes The cloud accepts metric definitions for stubbed source types and stores their `source_config`. The agent recognises them but reports `not_implemented` in the `reason` field on every probe. The metric remains in `no_data` until the runtime ships. | Source type | Why stubbed | |---|---| | `icmp` | Most container runtimes need `CAP_NET_RAW` to open raw sockets. The TCP probe is a better proxy for "is this host reachable" in cloud-native environments. | | `grpc` | Adds a `@grpc/grpc-js` dependency that is not yet justified by validated demand. | | `websocket` | Same: adds the `ws` library for a probe with limited validated demand. | | `mtls_http` | Requires a client-cert secret store. The auth model needs design work before runtime work. | | `database` | Requires per-driver client libraries and a connection-string secret store. | The cloud-side enum is defined in a database check constraint, the agent's dispatch table, and the Zod schema in `@observer/probe-config`. Adding a runtime is a coordinated change across these three sites plus a UI form for the parameters. ## Common contract Every source's runtime exports the same interface: ```ts title="ProbeSource" interface ProbeSource<TConfig> { validateConfig(config: unknown): null | string; execute(config: TConfig, env?: AgentEnv): Promise<ProbeResult>; } interface ProbeResult { value: number | null; timestamp: string; status_hint?: "no_data"; reason?: string; metadata?: Record<string, unknown>; } ``` Sources never throw. Network errors, malformed config, missing fields all resolve to `{ value: null, status_hint: "no_data", reason: "<code>" }`. The dispatcher applies the threshold rule only when `status_hint` is absent.
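For orientation, here is what a source written against that contract looks like: a hypothetical bare-bones HTTP reachability probe (illustrative only; the shipped `http` source has a richer config and reason-code mapping):

```ts
// A hypothetical source conforming to the ProbeSource contract above.
interface PingConfig {
  url: string;
  timeout_ms?: number;
}

const pingSource: ProbeSource<PingConfig> = {
  validateConfig(config) {
    const c = config as Partial<PingConfig>;
    return typeof c?.url === "string" ? null : "url is required";
  },
  async execute(config) {
    const started = Date.now();
    try {
      await fetch(config.url, { signal: AbortSignal.timeout(config.timeout_ms ?? 5000) });
      return { value: Date.now() - started, timestamp: new Date().toISOString() };
    } catch (err) {
      // Never throw: every failure resolves to a no_data result with a reason code.
      return {
        value: null,
        timestamp: new Date().toISOString(),
        status_hint: "no_data",
        reason: err instanceof Error ? err.name : "unknown_error",
      };
    }
  },
};
```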
--- url: https://docs.use.observer/agent/reference/dashboard-panels title: Dashboard panels description: Read-only state surface served on the agent's debug HTTP port. --- The agent exposes a read-only HTTP dashboard on `http://<agent-host>:10101`. Every panel reads from the agent's in-process state. Nothing on the page mutates anything. ## Panel reference ### `process` | Field | Meaning | |---|---| | `agent_started_at` | ISO timestamp of process start. | | `uptime_seconds` | Wall-clock seconds since `agent_started_at`. | | `memory_rss_mb` | Resident set size of the agent process. | | `version` | Build-time version string. | | `bun_version` | Bun runtime version reported by `Bun.version`. | ### `config` A map of environment variable names to masked values. The mask is applied to every value displayed (first-4 + tail-4). Names not in the dashboard's allowlist are not displayed regardless of value. ### `queue` | Field | Meaning | |---|---| | `depth` | Pushes currently waiting to be drained. | | `oldest_age_seconds` | Age of the oldest pending push. | | `capacity` | Configured `BUFFER_MAX_ROWS`. | | `drain_backoff_ms` | Current exponential backoff used by the drain controller. Zero means the next drain attempt is immediate. | ### `cloud` | Field | Meaning | |---|---| | `cloud_server_url` | Configured `CLOUD_SERVER_URL`. | | `last_heartbeat_at` | ISO timestamp of the last heartbeat attempt. | | `last_heartbeat_ok` | Boolean result of the last heartbeat. | | `last_heartbeat_error` | Error string when `last_heartbeat_ok` is false. | | `last_post_at` | Last metric-push attempt timestamp. | | `last_post_ok` | Boolean result of the last push. | | `last_post_error` | Error string when `last_post_ok` is false. | ### `prometheus` | Field | Meaning | |---|---| | `server_url` | Configured `PROMETHEUS_SERVER_URL`. | | `last_probe_outcome` | One of `success`, `no_data`, `error`, or `null` if no Prometheus probe has run yet. | | `last_probe_at` | ISO timestamp of the last Prometheus probe. | ### `definitions` One row per metric the cloud has assigned to this agent. | Field | Meaning | |---|---| | `id` | Metric definition id. | | `source_type` | One of the [probe types](/agent/reference/probe-types). | | `interval_minutes` | Configured collection interval. | | `push_interval_minutes` | Configured forced-push interval (status pushes also fire on every status change). | | `last_status` | Most recent status reported. | | `last_value` | Most recent reported value. | | `last_at` | ISO timestamp of the last push. | | `last_reason` | Reason code on the last `no_data` (or null when the last push was healthy). | ### `active_source_types` Distinct list of source types the agent has actually run since boot. A source type that appears in `definitions` but not here indicates the agent has not yet had time to run an instance. --- url: https://docs.use.observer/agent/reference/heartbeat-payload title: Heartbeat payload description: JSON shape the agent posts to /api/agent/heartbeat. --- The agent emits a heartbeat to the cloud on a fixed interval (typically every 30 seconds). The payload is the agent's view of its own runtime state. The cloud uses the payload to compute 24-hour uptime, restart counts, and lag alerts.
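The lag signal described under "Lag and uptime alerts" below is a small state machine with hysteresis. A sketch of the documented behaviour (thresholds are the documented values; the code itself is illustrative, not the cloud's):

```ts
// agent.lag_high as described: opens when either queue signal breaches its
// threshold; clears only after both stay below threshold for 60 seconds.
interface LagSignal {
  queue_depth: number;
  queue_oldest_age_seconds: number;
}

class LagHighState {
  private open = false;
  private belowSince: number | null = null; // epoch ms of first in-threshold heartbeat

  update(signal: LagSignal, nowMs = Date.now()): boolean {
    const breaching = signal.queue_depth > 1000 || signal.queue_oldest_age_seconds > 300;
    if (breaching) {
      this.open = true;
      this.belowSince = null;
    } else if (this.open) {
      this.belowSince ??= nowMs;
      if (nowMs - this.belowSince >= 60_000) this.open = false; // 60s clear hysteresis
    }
    return this.open;
  }
}
```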
## Endpoint ``` POST /api/agent/heartbeat Authorization: provided via Agent-Key header (key transport detail) Content-Type: application/json ``` ## Body ```json { "version": "1.2.3", "uptime_seconds": 12345, "buffer_size": 0, "buffer_oldest_age_seconds": 0, "queue_depth": 0, "queue_oldest_age_seconds": 0, "queue_capacity": 10000, "agent_started_at": "2026-05-09T12:00:00Z", "source_types_active": ["prometheus", "http", "tcp"] } ``` ## Field reference | Field | Type | Meaning | |---|---|---| | `version` | string | Build-time version of the agent. | | `uptime_seconds` | integer | Wall-clock seconds since process start. | | `buffer_size` | integer | Legacy alias for `queue_depth`. Accepted by older cloud builds; pre-21.5 fallback. | | `buffer_oldest_age_seconds` | integer | Legacy alias for `queue_oldest_age_seconds`. | | `queue_depth` | integer | Pushes currently waiting in the local queue. | | `queue_oldest_age_seconds` | integer | Age of the oldest pending push, in seconds. | | `queue_capacity` | integer | Hard cap on the queue (`BUFFER_MAX_ROWS`). | | `agent_started_at` | ISO timestamp | When the process started. The cloud uses changes to this value to detect restarts. | | `source_types_active` | string[] | Distinct source types the agent has actually run since boot. | ## Lag and uptime alerts The cloud's heartbeat receiver runs two state machines per agent: - **`agent.lag_high`**: opens when `queue_depth > 1000` or `queue_oldest_age_seconds > 300`. Clears when both signals stay below threshold for 60 seconds. - **`agent.uptime_degraded`**: opens when uptime over the last 24 hours falls below 95%. Same 60-second clear hysteresis. These signals surface in the cloud console's agent detail page and as `agent.offline` webhook events when subscribed. ## Versioning The payload shape is part of the cloud-agent wire contract, maintained in the public `@observer/protocol` package. Field additions are additive; field removals require a major version bump on `@observer/protocol`. --- url: https://docs.use.observer/agent title: Observer Agent description: The Observer data plane. Probes metric sources, computes status, pushes verdicts to the cloud. --- The Observer Agent is a small process that runs in your network. It reads from Prometheus or probes endpoints directly (HTTP, TCP, DNS, TLS certificates), computes status against thresholds, and pushes the verdict to Observer Cloud over an authenticated channel. The agent is open source. Source at [github.com/useobserver/agent](https://github.com/useobserver/agent), licensed Apache-2.0. The runtime is Bun on a distroless container image; the source is TypeScript. ## Quickstart ## Adjacent sections - **Guides** cover per-probe configuration, dashboard reading, key rotation, and diagnosis paths. - **Reference** lists every environment variable, every probe type, every dashboard panel, and the heartbeat payload shape. - **Concepts** covers the agent / cloud boundary, the local queue, and the design choices behind the runtime. Documentation for status pages, SLOs, organisation setup, and the REST API is in the [Documentation](/docs) and [API](/api) tabs. This tab covers the agent only. --- url: https://docs.use.observer/api/getting-started/auth title: Authentication description: API keys, scopes, and how to authenticate requests against the public API. --- Every request to `/api/v1` carries an API key in the `Authorization` header. 
## Adjacent sections

- **Guides** cover per-probe configuration, dashboard reading, key rotation, and diagnosis paths.
- **Reference** lists every environment variable, every probe type, every dashboard panel, and the heartbeat payload shape.
- **Concepts** covers the agent / cloud boundary, the local queue, and the design choices behind the runtime.

Documentation for status pages, SLOs, organisation setup, and the REST API is in the [Documentation](/docs) and [API](/api) tabs. This tab covers the agent only.

---
url: https://docs.use.observer/api/getting-started/auth
title: Authentication
description: API keys, scopes, and how to authenticate requests against the public API.
---

Every request to `/api/v1` carries an API key in the `Authorization` header.

## Headers

```text
Authorization: Bearer <api key>
Content-Type: application/json   (on POST / PUT / PATCH)
```

## Key format

Public API keys begin with `obs_pub_` followed by an opaque base64url string. Keys are issued per organisation in the console under **API keys**. Each key is shown once at creation; the cloud stores only its hash and cannot recover the plaintext.

## Scopes

Each key carries a fixed set of scopes that gate which endpoints the key may call. The scopes available today:

| Scope | Grants |
|---|---|
| `read:services` | Read service entities. |
| `read:metrics` | Read metric definitions, current values, and aggregated history. |
| `read:slos` | Read SLOs and their current burn state. |
| `read:incidents` | Read incident updates published on status pages. |

Scopes are additive. A request against an endpoint whose required scope is not on the key returns `403`.

## Errors

The API returns RFC 7807 problem-detail responses:

```json
{
  "type": "/errors/unauthorized",
  "title": "missing or invalid bearer token",
  "status": 401
}
```

Per-endpoint scope requirements appear on each operation page in the sidebar.

Keys can be rotated through the console. Existing keys remain valid until you revoke them; revoke whenever an incident or personnel change requires it.

---
url: https://docs.use.observer/api/services/get-services
title: GET /services
description: List services
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/services" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/services/get-services-by-id
title: GET /services/{id}
description: Get service by id
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `id` | path | yes | string | Service id. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/services/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/metrics/get-metrics
title: GET /metrics
description: List metrics
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |
| `status` | query | no | string | Filter by current status. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/metrics" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded
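Every list endpoint shares the `{items, next_cursor}` envelope, so cursor paging is the same loop everywhere. A sketch with `curl` and `jq`; the `limit` value is illustrative:

```bash
# Walk every page of a list endpoint by following next_cursor until it is null.
cursor=""
while :; do
  resp=$(curl -s "https://api.use.observer/api/v1/metrics?limit=100${cursor:+&cursor=$cursor}" \
    -H "Authorization: Bearer $OBSERVER_API_KEY")
  echo "$resp" | jq -c '.items[]'                      # process one page of items
  cursor=$(echo "$resp" | jq -r '.next_cursor // empty')
  [ -z "$cursor" ] && break                            # null cursor marks the last page
done
```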
"https://api.use.observer/api/v1/metrics/{id}" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json null ``` ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/metrics/get-metrics-by-id-history title: GET /metrics/{id}/history description: Aggregated metric values over a window (max 30 days) --- ## Parameters | Name | In | Required | Type | Description | |------|------|----------|------|-------------| | `id` | path | yes | string | | | `from` | query | yes | string | | | `to` | query | no | string | | | `resolution` | query | no | string | | ## Example request ```bash curl -X GET "https://api.use.observer/api/v1/metrics/{id}/history" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json null ``` ### 400 — invalid range / resolution ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/metrics/post-metrics-by-id-status title: POST /metrics/{id}/status description: Set status on a manual metric (source_type='manual') --- ## Request body ```json null ``` ## Example request ```bash curl -X POST "https://api.use.observer/api/v1/metrics/{id}/status" \ -H "Authorization: Bearer YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d "null" ``` ## Responses ### 200 — ok ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 409 — metric is probed (not manual) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/slos/get-slos title: GET /slos description: List SLOs --- ## Parameters | Name | In | Required | Type | Description | |------|------|----------|------|-------------| | `limit` | query | no | integer | | | `cursor` | query | no | string | | ## Example request ```bash curl -X GET "https://api.use.observer/api/v1/slos" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json { "items": [ null ], "next_cursor": null } ``` ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/slos/get-slos-by-id title: GET /slos/{id} description: Get SLO with latest burn event --- ## Parameters | Name | In | Required | Type | Description | |------|------|----------|------|-------------| | `id` | path | yes | string | | ## Example request ```bash curl -X GET "https://api.use.observer/api/v1/slos/{id}" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — ok ```json null ``` ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/incidents/delete-incidents-by-id title: DELETE /incidents/{id} description: Soft-delete incident --- ## Example request ```bash curl -X DELETE "https://api.use.observer/api/v1/incidents/{id}" \ -H "Authorization: Bearer YOUR_API_KEY" ``` ## Responses ### 200 — deleted ### 401 — missing or invalid bearer token ### 403 — missing required scope ### 404 — not found (or cross-tenant — same response) ### 429 — rate limit exceeded --- url: https://docs.use.observer/api/incidents/get-incidents title: GET /incidents description: List incidents --- 
---
url: https://docs.use.observer/api/incidents/get-incidents
title: GET /incidents
description: List incidents
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |
| `state` | query | no | string | Filter by state. |
| `since` | query | no | string | Only incidents since this timestamp. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/incidents" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/get-incidents-by-id
title: GET /incidents/{id}
description: Get incident
---

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/incidents/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/patch-incidents-by-id
title: PATCH /incidents/{id}
description: Patch incident (title, severity, affected services, visibility)
---

## Request body

```json
null
```

## Example request

```bash
curl -X PATCH "https://api.use.observer/api/v1/incidents/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents
title: POST /incidents
description: Create incident (draft or published)
---

## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — created

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-by-id-messages
title: POST /incidents/{id}/messages
description: Append a timeline message; type=Resolved auto-resolves the parent
---

## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/{id}/messages" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-by-id-publish
title: POST /incidents/{id}/publish
description: Publish a draft incident
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/{id}/publish" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 409 — already published

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-by-id-resolve
title: POST /incidents/{id}/resolve
description: Resolve an incident with optional final message
---
## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/{id}/resolve" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 409 — already resolved

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/incidents/post-incidents-from-metric-by-metricId
title: POST /incidents/from-metric/{metricId}
description: Pre-fill a draft incident from the metric's current state (idempotent within 30 minutes)
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/incidents/from-metric/{metricId}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/get-maintenances
title: GET /maintenances
description: List maintenances
---

## Parameters

| Name | In | Required | Type | Description |
|------|------|----------|------|-------------|
| `limit` | query | no | integer | Page size. |
| `cursor` | query | no | string | Opaque cursor from a previous page's `next_cursor`. |
| `state` | query | no | string | Filter by state. |
| `since` | query | no | string | Only maintenances since this timestamp. |

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/maintenances" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
{
  "items": [
    null
  ],
  "next_cursor": null
}
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/get-maintenances-by-id
title: GET /maintenances/{id}
description: Get maintenance
---

## Example request

```bash
curl -X GET "https://api.use.observer/api/v1/maintenances/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/patch-maintenances-by-id
title: PATCH /maintenances/{id}
description: Edit maintenance (only allowed before actual_start_at is set)
---

## Request body

```json
null
```

## Example request

```bash
curl -X PATCH "https://api.use.observer/api/v1/maintenances/{id}" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 409 — already started

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/post-maintenances
title: POST /maintenances
description: Schedule a maintenance window
---

## Request body

```json
null
```

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "null"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded
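A maintenance window moves from scheduled to in_progress to completed, either on its own schedule or through the manual transition endpoints documented below. A sketch of the manual path; the request-body field names are hypothetical, since the schema is not reproduced on these pages:

```bash
BASE="https://api.use.observer/api/v1"
AUTH="Authorization: Bearer YOUR_API_KEY"

# Schedule a window (field names hypothetical; check the actual schema).
mid=$(curl -s -X POST "$BASE/maintenances" -H "$AUTH" -H "Content-Type: application/json" \
  -d '{"title": "Database upgrade", "starts_at": "2026-05-20T02:00:00Z", "ends_at": "2026-05-20T04:00:00Z"}' \
  | jq -r '.id')

# Manually open, then close, the window. Once it has started, PATCH edits
# are rejected with 409, and cancel is only available before completion.
curl -s -X POST "$BASE/maintenances/$mid/start" -H "$AUTH"
curl -s -X POST "$BASE/maintenances/$mid/complete" -H "$AUTH"
```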
---
url: https://docs.use.observer/api/maintenances/post-maintenances-by-id-cancel
title: POST /maintenances/{id}/cancel
description: Cancel a maintenance before completion
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances/{id}/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/post-maintenances-by-id-complete
title: POST /maintenances/{id}/complete
description: Manually transition an in-progress maintenance to completed
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances/{id}/complete" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api/maintenances/post-maintenances-by-id-start
title: POST /maintenances/{id}/start
description: Manually transition a scheduled maintenance to in_progress
---

## Example request

```bash
curl -X POST "https://api.use.observer/api/v1/maintenances/{id}/start" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

## Responses

### 200 — ok

```json
null
```

### 401 — missing or invalid bearer token

### 403 — missing required scope

### 404 — not found (or cross-tenant — same response)

### 429 — rate limit exceeded

---
url: https://docs.use.observer/api
title: API reference
description: Observer's public REST API. Authenticated with API keys scoped per organisation.
---

The Observer API lives at `https://api.use.observer/api/v1`. All endpoints require an `Authorization: Bearer <key>` header. Keys are scoped per organisation and per capability (`read:metrics`, `write:incidents`, etc.). Pick an operation from the sidebar to view its parameters, request / response schemas, and a working curl example.
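A quick smoke test for a fresh key (assumes `read:services` is on the key; `-i` prints the status line so a `401` or `403` is visible immediately):

```bash
# Expect 200 with an {items, next_cursor} body; 401 means a bad key,
# 403 means the read:services scope is missing from the key.
curl -i "https://api.use.observer/api/v1/services?limit=1" \
  -H "Authorization: Bearer YOUR_API_KEY"
```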