Observer
Observer Agent

Read metrics from a database query

Configure the agent to run a single read-only query against PostgreSQL, MySQL, Redis, or MongoDB on each cron tick and report the scalar result.

The agent runs one read-only query per cron tick against a customer database and reports the scalar return value. One metric definition maps to one query against one database. Supported engines in v1:

  • PostgreSQL and MySQL: single SELECT statement returning one row with one column.
  • Redis: single read-only command from an allowlist (DBSIZE, LLEN, SCARD, ZCARD, HLEN, STRLEN, TTL, PTTL, BITCOUNT, EXISTS, GET).
  • MongoDB: a JSON specification with db, collection, and op (countDocuments or estimatedDocumentCount); an optional filter narrows the count.

This source is for health signals that live inside the database itself: replication lag, connection saturation, queue depth, or business KPIs like pending orders. For metrics that already exist in Prometheus or CloudWatch, use those sources instead.

When NOT to use this

  • The signal already exists in Prometheus (via postgres_exporter, mysqld_exporter, or your application's metrics endpoint). Use the Prometheus source. It avoids running a query against the production database on the cron interval.
  • The query is expensive. A probe runs every minute by default; a query that takes 500 ms is fine, a query that takes 5 seconds is a recipe for a database-CPU regression. The agent enforces a hard statement timeout (default 5 s, max 30 s), but the right answer for expensive queries is to materialize them into a summary table the probe reads cheaply.
  • You need transaction-aware metrics. The probe issues a single one-shot query; it cannot observe values inside a transaction at the caller's isolation level.

Critical: provision a read-only role

The agent parser rejects non-SELECT statements before they leave the host, but that parser is belt-and-suspenders. The actual security boundary is the database role the agent connects as. Provision a dedicated read-only role and use it for every database probe.

PostgreSQL

-- Run as a superuser (or the database owner).
CREATE ROLE observer_agent WITH LOGIN PASSWORD 'replace-with-strong-password';
GRANT CONNECT ON DATABASE your_database TO observer_agent;

-- pg_monitor covers the catalog views the sample queries below read
-- (pg_stat_activity, pg_stat_replication, etc.). The pg_class and
-- pg_namespace tables are publicly readable by default, so no extra
-- grants are needed for the catalog-only samples.
GRANT pg_monitor TO observer_agent;

-- For business-metric queries against your application schema, grant
-- SELECT on the specific tables you want to read. Skip this block if
-- all your metrics target system catalogs only.
GRANT USAGE ON SCHEMA public TO observer_agent;
GRANT SELECT ON public.orders TO observer_agent;

-- Optional: auto-grant SELECT on future tables created by the
-- application owner role. Replace `app_owner_role` with whichever
-- role owns your tables. ALTER DEFAULT PRIVILEGES is per-creator;
-- without `FOR ROLE` it only affects tables that the current session
-- (the superuser running this block) creates later, which is rarely
-- what you want.
ALTER DEFAULT PRIVILEGES FOR ROLE app_owner_role IN SCHEMA public
  GRANT SELECT ON TABLES TO observer_agent;

Then build a DSN with this role and export it on the agent host:

Use sslmode=verify-full when the agent host trusts the database's CA; fall back to sslmode=require only if certificate distribution isn't in place. require encrypts the connection but doesn't verify the server identity.

MySQL

-- Run as root (or a user with GRANT permission).
CREATE USER 'observer_agent'@'%' IDENTIFIED BY 'replace-with-strong-password';

-- Read access on performance_schema covers the server-wide health
-- queries below (connection counts, replication status, slow query
-- counter). On MySQL 8.0+, these views are gated by SELECT on
-- performance_schema, not by the legacy PROCESS / REPLICATION CLIENT
-- privileges.
GRANT SELECT ON performance_schema.* TO 'observer_agent'@'%';

-- For business-metric queries, grant SELECT on the specific schema:
GRANT SELECT ON your_database.* TO 'observer_agent'@'%';

FLUSH PRIVILEGES;

If you target MySQL 5.7 (out of mainstream support), add GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'observer_agent'@'%'; so the replication views work. On MySQL 8.0+ the SELECT grant above is sufficient and avoids letting the role see queries from every other session.

Export the DSN on the agent host:

Redis

Provision a read-only ACL user. Redis 6+ supports per-command ACLs; restrict the agent user to the same allowlist the agent enforces.

ACL SETUSER observer_agent on >replace-with-strong-password ~* \
    +DBSIZE +LLEN +SCARD +ZCARD +HLEN +STRLEN +TTL +PTTL +BITCOUNT +EXISTS +GET

For Redis 5 (no ACLs) use a requirepass and rely on the agent's client-side allowlist; the parser rejects every mutating command at save time, so a misconfigured legacy user still can't fire FLUSHALL.

Export the DSN on the agent host:

# Or with TLS:

MongoDB

Provision a read-only role on the specific database (or admin if you need cross-database reads).

use admin
db.createUser({
  user: "observer_agent",
  pwd: "replace-with-strong-password",
  roles: [
    { role: "read", db: "appdb" }
  ]
})

Export the DSN on the agent host. SRV records work too:

# Or with Atlas / SRV:

Configure a metric in the Observer console

Create a metric, pick SQL query, fill in:

  • Database: PostgreSQL, MySQL, Redis, or MongoDB.

  • Statement timeout: per-query timeout in milliseconds. Default 5,000; max 30,000. The agent applies this at the database connection level (Postgres statement_timeout, MySQL MAX_EXECUTION_TIME) so a stuck query aborts even if the agent-side timeout is late to fire.

  • Connection string env var: name of an env var on the agent host (OBSERVER_PG_PROD_DSN in the examples above). The cloud never sees the DSN; the agent reads the value at query time and never logs it.

  • Query: shape depends on the database kind.

    • PostgreSQL / MySQL: a single SELECT (or WITH … SELECT) returning exactly one row with one column.
    • Redis: a single command from the allowlist (DBSIZE, LLEN, SCARD, ZCARD, HLEN, STRLEN, TTL, PTTL, BITCOUNT, EXISTS, GET). Mutating commands are rejected at save time.
    • MongoDB: a JSON object with db, collection, op (countDocuments or estimatedDocumentCount), and an optional filter. Server-side JS operators ($where, $function, $expr) are rejected.

    See Sample queries and Result shape.

Result shape

The query must return exactly one row and one column. The value is coerced to a number:

Returned typeCoercionExample
Number / integer / floatUsed directly42
Booleantrue → 1, false → 0true
Numeric stringParsed'97'
BigIntCast to Number when within Number.MAX_SAFE_INTEGER42n
NULLRejected (db_null_value)NULL
Non-numeric stringRejected (db_non_numeric)'healthy'
Multiple rowsRejected (db_multi_row)n/a
Multiple columnsRejected (db_multi_column)n/a

When the metric you care about is naturally multi-row (e.g. one row per replication slot), reduce it inside the query (max, sum, count) so a single scalar reaches the agent.

Sample queries

PostgreSQL

-- Replication lag, in seconds. NULL on a primary (returns no_data).
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));

-- Long-running active queries (count).
SELECT count(*)
  FROM pg_stat_activity
 WHERE state = 'active'
   AND query_start < now() - interval '5 minutes';

-- Total backend connections.
SELECT count(*) FROM pg_stat_activity;

-- Largest table size, bytes.
SELECT max(pg_relation_size(c.oid))
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind = 'r'
   AND n.nspname NOT IN ('pg_catalog','information_schema');

-- Business metric: pending orders.
SELECT count(*) FROM orders WHERE status = 'pending';

MySQL

-- Replication lag, in seconds. Replica-only: returns zero rows on a
-- primary, which the agent surfaces as db_empty_result. Pin the
-- metric definition to replica hosts.
SELECT TIMESTAMPDIFF(
         SECOND,
         APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
         NOW()
       )
  FROM performance_schema.replication_applier_status_by_worker
 WHERE APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP IS NOT NULL
 ORDER BY APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP DESC
 LIMIT 1;

-- Total connected threads.
SELECT VARIABLE_VALUE + 0
  FROM performance_schema.global_status
 WHERE VARIABLE_NAME = 'Threads_connected';

-- Slow queries, cumulative since server start.
SELECT VARIABLE_VALUE + 0
  FROM performance_schema.global_status
 WHERE VARIABLE_NAME = 'Slow_queries';

-- InnoDB row-lock waits.
SELECT VARIABLE_VALUE + 0
  FROM performance_schema.global_status
 WHERE VARIABLE_NAME = 'Innodb_row_lock_waits';

For business metrics, query your application schema the same way you would in any other SELECT.

Redis

Each command is one line; the agent rejects anything outside the read-only allowlist before it reaches the server.

# Total number of keys in the selected DB.
DBSIZE

# Queue depth.
LLEN myqueue

# Active sessions tracked in a set.
SCARD active_users

# Hash field count.
HLEN session:abc

# Time until a distributed lock expires.
TTL lock:job-42

# A numeric counter stored as a string value.
GET counter:errors

Coercion: integers come back as integers; string replies (e.g. from GET) are parsed as numbers. Non-numeric strings surface as db_non_numeric.

MongoDB

Each metric definition stores a JSON object that names the database, collection, and operation. Whitespace inside the JSON is preserved verbatim, so format it however you prefer.

{
  "db": "appdb",
  "collection": "orders",
  "op": "estimatedDocumentCount"
}
{
  "db": "appdb",
  "collection": "orders",
  "op": "countDocuments",
  "filter": { "status": "pending" }
}
{
  "db": "appdb",
  "collection": "sessions",
  "op": "countDocuments",
  "filter": { "created_at": { "$gte": "2026-05-20T00:00:00Z" } }
}

The filter accepts any BSON-serialisable JSON, with two exceptions: $where, $function, $accumulator, and $expr are rejected at save time. These operators trigger server-side JavaScript or full- collection evaluation and don't belong in a per-cron probe.

estimatedDocumentCount reads from cached metadata and never accepts a filter; use it when an exact count isn't required.

Reason codes specific to SQL

The reason field on no_data results carries one of:

  • db_dsn_missing: the env var named in connection_string_ref is not set on the agent. Export it and restart the agent.
  • db_query_not_allowed: the parser rejected the query (e.g. not a SELECT, multi-statement body, or write inside a CTE).
  • db_empty_result: the query returned zero rows. Fix the WHERE clause or change the metric to a count(*) so it always returns one row.
  • db_multi_row / db_multi_column: the query returned more than one row or column. Reduce with count, max, sum, or use a subquery.
  • db_null_value: the single returned cell was NULL. Use coalesce(value, 0) if zero is the right fallback for your metric, or add a WHERE clause that filters out the NULL state.
  • db_non_numeric: the cell wasn't a number, boolean, or numeric string. Cast inside the query (::int, ::numeric, CAST(... AS UNSIGNED)).
  • db_timeout: the query exceeded statement_timeout_ms. Optimize the query or raise the timeout (cap is 30,000 ms).
  • db_access_denied: the role lacks SELECT privilege on the table. Grant it explicitly.
  • db_auth_failed: the DSN's username / password is wrong.
  • db_connection_failed: the database is unreachable (ECONNREFUSED, ENOTFOUND, ETIMEDOUT, or driver-level network errors). The metric goes no_data, not unhealthy. The database being down is a separate concern from the metric being out of range.
  • db_syntax_error: the query is malformed or references a missing table / column.
  • db_error: catch-all for uncategorized driver errors.

Performance

  • One probe = one query per cron interval. At the default 1-minute interval, every metric def runs 1,440 queries per day. Keep each query under a few hundred milliseconds.
  • The agent caps its connection pool at 2 connections per unique (DSN, timeout) pair. Even with 50 metric defs against the same database, the agent never holds more than 2 connections open.
  • Pool entries idle out after 10 seconds, so a probe that runs every 5 minutes typically holds zero open connections between ticks.
  • The agent never retries a failed query inside a tick; failures surface as no_data and the next cron tick is the next attempt.

Troubleshooting

  • db_dsn_missing even though I set the env var. The agent reads the variable at query time from process.env, so it must be present in the agent process's environment, not just in your shell. For Docker / Kubernetes deployments, set it on the container or pod spec.
  • db_query_not_allowed: UPDATE statements are not allowed. The parser is strict. Even if your role is read-only, the parser rejects the statement before sending it. Rewrite the query as a SELECT.
  • db_timeout on a query that runs fine in psql. The database server may be applying a per-session statement_timeout that's lower than the agent's. Check the role's statement_timeout setting (SHOW statement_timeout) and either raise it or lower the agent's statement_timeout_ms.
  • db_connection_failed from inside a Kubernetes pod, but the same DSN works from a debug pod. Most often a NetworkPolicy or egress rule blocks the agent's pod. Confirm by execing into the agent pod and curling the database host:port.
  • db_multi_row on a query that returns one row in psql. Postgres's psql sometimes treats multi-row results differently with \g vs \gx. The agent reads every returned row; if your query has any path that returns multiple rows, reduce with LIMIT 1 or an aggregate.

Known limits

  • PostgreSQL, MySQL, Redis, and MongoDB only. SQLite, BigQuery, Snowflake, and ClickHouse are not implemented. The Redis runtime supports a fixed allowlist of read-only commands; the MongoDB runtime supports countDocuments and estimatedDocumentCount only. Aggregation pipelines + Mongo find are deferred until customer demand surfaces.
  • No parameterized queries. The query is taken from configuration verbatim. There is no string interpolation surface, so there's no injection vector, but it also means runtime values (e.g. "alert if more than X rows match") have to be baked into the SQL.
  • No "test connection" button in the form. The cloud has no path to the customer's network. Save the metric and watch the agent dashboard for the first probe result; failures surface as a no_data row with a specific reason code within one cron tick.
Was this page helpful?