Observer

Auto-incident creation

Opt a metric in to automatic draft-incident creation when it flips unhealthy. Drafts ship with email CTAs so a human always verifies before customers see the incident.

When a metric flips unhealthy in the middle of the night, the on-call already knows. The question is whether the customer-facing status page should be updated to reflect that. Auto-incident creation does the typing-out part for you, and it never publishes anything until a human presses a button.

How it works

  1. You opt a metric in to the feature on its edit form (Pro+).
  2. The metric flips unhealthy (with dwell gating, exactly as a manual status change would).
  3. The auto-incident worker creates a draft incident on the metric's bound service.
  4. Observer emails your org owners with two buttons: Publish (flip to published; customers see it) and Dismiss (soft-delete the draft).
  5. If neither button is clicked within 24 hours, the draft auto-expires. Nothing ever reaches the public page without a human action.

Enable for a metric

  1. Open Console → Metrics → <your metric> → Edit.
  2. Scroll to the Automatic incident creation section.
  3. Pick a Policy:
    • Off — auto-creation is disabled for this metric.
    • On — create immediately — a draft is created the moment the metric flips unhealthy.
    • On — wait then re-check — Observer waits the configured number of seconds, then re-checks the metric's current status. If it's still unhealthy, the draft is created. If the metric recovered during the dwell window, nothing happens. This is the recommended setting for metrics that occasionally flap.
  4. Pick a Severity (minor / major / critical). This value is stamped on every auto-drafted incident.
  5. For dwell-mode, pick a Dwell seconds value between 60 and 3600. Defaults to 300 (5 minutes).
  6. Save.
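
The form rules above can be sketched as a small validation helper. This is only an illustration: the policy identifiers ("off", "immediate", "dwell"), the class name, and the function names are hypothetical and are not Observer's actual API. Only the severity values, the 60 to 3600 range, and the 300-second default come from this page.

```python
from dataclasses import dataclass

# Values taken from the form description above.
DWELL_MIN_SECONDS = 60
DWELL_MAX_SECONDS = 3600
DWELL_DEFAULT_SECONDS = 300
SEVERITIES = ("minor", "major", "critical")

# Hypothetical identifiers for the three Policy options.
POLICIES = ("off", "immediate", "dwell")


@dataclass(frozen=True)
class AutoIncidentSettings:
    policy: str
    severity: str
    dwell_seconds: int = DWELL_DEFAULT_SECONDS


def validate_settings(settings: AutoIncidentSettings) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if settings.policy not in POLICIES:
        errors.append(f"unknown policy: {settings.policy!r}")
    if settings.severity not in SEVERITIES:
        errors.append(f"unknown severity: {settings.severity!r}")
    # The dwell range only matters when the dwell policy is selected.
    if settings.policy == "dwell" and not (
        DWELL_MIN_SECONDS <= settings.dwell_seconds <= DWELL_MAX_SECONDS
    ):
        errors.append("dwell_seconds must be between 60 and 3600")
    return errors


def should_create_draft(settings: AutoIncidentSettings, unhealthy_after_dwell: bool) -> bool:
    """Decide whether a flip to unhealthy should produce a draft.

    `unhealthy_after_dwell` is the result of re-checking the metric once
    the dwell window has elapsed; it is ignored for the other policies.
    """
    if settings.policy == "off":
        return False
    if settings.policy == "immediate":
        return True
    if settings.policy == "dwell":
        return unhealthy_after_dwell
    raise ValueError(f"unknown policy: {settings.policy!r}")
```

With the dwell policy, a metric that recovers inside the window yields `should_create_draft(...) == False`, which matches the "nothing happens" behaviour described for flapping metrics.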

What gets created

When the worker fires, you get:

  • A new incident row with:
    • title: Investigating elevated errors on <metric title>
    • severity: as configured on the metric
    • affected_services: every service that has an SLO pointing at the metric
    • is_auto_drafted: true
    • An initial Information message describing the value vs the threshold and the timestamp.
  • An audit row (incident.auto_drafted on the metric, plus the parent row on the incident itself).
  • A webhook event incident.auto_drafted (separate from the manual incident.created so you can listen specifically).
  • An email to every org owner who hasn't opted out (see Notification preferences).
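
For orientation, the draft described above might look roughly like the record built below. The field names title, severity, affected_services, and is_auto_drafted come from this page; the function name, the `messages` key, and the exact wording of the Information message are assumptions made for illustration only.

```python
from datetime import datetime, timezone


def build_auto_draft(metric_title, severity, service_ids, value, threshold, observed_at):
    """Assemble an illustrative draft-incident record (hypothetical shape)."""
    return {
        "title": f"Investigating elevated errors on {metric_title}",
        "severity": severity,
        "affected_services": list(service_ids),
        "is_auto_drafted": True,
        # Assumed representation of the initial Information message.
        "messages": [
            {
                "type": "Information",
                "body": (
                    f"Observed value {value} against threshold {threshold} "
                    f"at {observed_at.isoformat()}."
                ),
            }
        ],
    }


draft = build_auto_draft(
    metric_title="API error rate",
    severity="major",
    service_ids=["svc_api"],
    value=7.5,
    threshold=5.0,
    observed_at=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
)
```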

Email CTAs

Each email has two buttons:

  • Publish incident: GET /api/incidents/auto-action?token=…&action=publish, with the action also carried inside the signed token. Flips the draft to published. Fires incident.auto_published.
  • Dismiss draft: GET /api/incidents/auto-action?token=…&action=dismiss. Soft-deletes the row. Fires incident.auto_dismissed with reason: "operator_dismiss".

The token format is base64url(body) + "." + base64url(sig) with body <incidentId>|<action>|<expiresAtMs> and signature HMAC-SHA-256(server_secret, body). Action is part of the signed body, not the URL — you can't flip a publish link to dismiss (or vice versa) by editing the URL. Tokens expire after 24 hours.

Both endpoints are idempotent. Re-clicking publish after the incident is already published returns a success page. Re-clicking dismiss after it's already dismissed returns a success page.
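
One way to picture the idempotency rule is a handler that treats "already in the requested state" as success. The sketch below is hypothetical: the status names and return values are illustrative, and the "conflict" result for cross-actions (for example, publish after dismiss) is an assumption, since this page does not specify that case.

```python
def apply_action(incident: dict, action: str) -> str:
    """Apply a publish or dismiss action to an in-memory incident record."""
    targets = {"publish": "published", "dismiss": "dismissed"}
    if action not in targets:
        raise ValueError(f"unknown action: {action!r}")
    target = targets[action]
    if incident["status"] == target:
        # Re-click: nothing changes, the caller still shows a success page.
        return "already_done"
    if incident["status"] != "draft":
        # Assumption: actions on a non-draft in a different state are rejected.
        return "conflict"
    incident["status"] = target
    return "applied"


incident = {"id": "inc_123", "status": "draft"}
first = apply_action(incident, "publish")
second = apply_action(incident, "publish")
```

In this sketch only the "applied" result would trigger a webhook event, so a repeated click does not fire incident.auto_published twice.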

Dedup, cooldown, and expiry

Three guardrails keep the auto-incident flow from spamming you:

  1. Dedup against open incidents on the service. If you (or a prior auto-draft) have already filed an incident affecting the metric's service, the worker appends a new Information message to the existing incident instead of creating a duplicate. Message text: Metric <name> is now unhealthy (auto-detected).
  2. One auto-draft per metric per hour. If a metric was already auto-drafted or auto-dismissed in the last hour, the worker skips it. Flapping metrics never produce more than one draft per hour.
  3. 24-hour auto-expiry. Drafts older than 24 hours that haven't been published or dismissed are soft-deleted by a 15-minute cron, audited as incident.auto_expired, and fire incident.auto_dismissed with reason: "auto_expired".
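
The three guardrails can be summarised as a small decision function plus an expiry check. This is a sketch under assumptions: the order in which dedup and cooldown are evaluated is not stated on this page (dedup is checked first here), and all names and return values are hypothetical.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=1)    # one auto-draft per metric per hour
DRAFT_TTL = timedelta(hours=24)  # drafts auto-expire after 24 hours


def decide(now, has_open_incident_on_service, last_auto_event_at):
    """Return what the worker should do when a metric flips unhealthy.

    `last_auto_event_at` is the time of the metric's most recent
    auto-draft or auto-dismiss, or None if there has never been one.
    """
    if has_open_incident_on_service:
        return "append_message"  # dedup: reuse the existing incident
    if last_auto_event_at is not None and now - last_auto_event_at < COOLDOWN:
        return "skip"            # cooldown: at most one draft per hour
    return "create_draft"


def is_expired(now, created_at, status):
    """True for drafts older than 24 hours that were never acted on."""
    return status == "draft" and now - created_at > DRAFT_TTL
```

A periodic job (every 15 minutes, per the description above) could call `is_expired` on each open draft and soft-delete the ones that return True.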

Notification preferences

Per-user opt-out lives at Console → Settings → Notifications → Auto-incident draft emails. Default is ON for org owners. Owners who toggle this off do not receive auto-incident emails (any other type of email is unaffected).

The toggle is stored as users.notification_preferences.autoIncidentDrafts = false on the user row.
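
Since the default is ON, a missing preference should be read as opted in. A minimal sketch of selecting recipients, assuming each owner is available as a dictionary with an `email` key and an optional `notification_preferences` mapping (both shapes are hypothetical):

```python
def draft_email_recipients(owners):
    """Return the emails of owners who have not opted out of draft emails."""
    recipients = []
    for owner in owners:
        prefs = owner.get("notification_preferences") or {}
        # Absent key means the default (ON); only an explicit False opts out.
        if prefs.get("autoIncidentDrafts", True) is not False:
            recipients.append(owner["email"])
    return recipients


owners = [
    {"email": "a@example.com"},
    {"email": "b@example.com", "notification_preferences": {"autoIncidentDrafts": False}},
    {"email": "c@example.com", "notification_preferences": None},
    {"email": "d@example.com", "notification_preferences": {"autoIncidentDrafts": True}},
]
recipients = draft_email_recipients(owners)
```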

Plan gate

This feature is Pro+ only. Free and Starter plans see a locked-feature card on the metric edit form. On lower plans, leave the metric's policy set to Off (the default), or upgrade.

Webhook events

Three event types fire from the auto flow:

  • incident.auto_drafted — fires when the draft is created.
  • incident.auto_published — fires when the draft is published via the email link (or the equivalent API endpoint).
  • incident.auto_dismissed — fires for both the email-dismiss and the 24h auto-expiry paths. reason distinguishes them.

Payloads are documented at Webhook payload reference.
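
A webhook consumer would typically branch on the event type and, for dismissals, on reason. The sketch below assumes the payload exposes the event name under a `type` key and the reason under a top-level `reason` key. Both key names are guesses for illustration; check the Webhook payload reference for the real shape.

```python
def describe_event(event: dict) -> str:
    """Map an auto-incident webhook event to a short description."""
    kind = event.get("type")  # assumed key name
    if kind == "incident.auto_drafted":
        return "draft created"
    if kind == "incident.auto_published":
        return "draft published"
    if kind == "incident.auto_dismissed":
        reason = event.get("reason")  # assumed key name
        if reason == "operator_dismiss":
            return "draft dismissed by an operator"
        if reason == "auto_expired":
            return "draft expired after 24 hours"
        return "draft dismissed (unrecognised reason)"
    return "ignored"
```

Returning "ignored" for unknown types keeps the consumer tolerant of event types it does not handle, such as the manual incident.created.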

For most teams:

  • Dwell mode with 300 seconds for any latency or error-rate metric. The dwell window catches noisy alarms before they generate an email.
  • Immediate mode for binary signals (TLS expiry hit zero, a service is unreachable). These should not flap, so dwell adds nothing.
  • Leave auto-creation off for noisy dashboards that are not customer-visible. The console already shows unhealthy metrics; not every internal alarm deserves a draft.