Read metrics from a database query
Configure the agent to run a single read-only query against PostgreSQL, MySQL, Redis, or MongoDB on each cron tick and report the scalar result.
The agent runs one read-only query per cron tick against a customer database and reports the scalar return value. One metric definition maps to one query against one database. Supported engines in v1:
- PostgreSQL and MySQL: single
SELECTstatement returning one row with one column. - Redis: single read-only command from an allowlist (DBSIZE, LLEN, SCARD, ZCARD, HLEN, STRLEN, TTL, PTTL, BITCOUNT, EXISTS, GET).
- MongoDB: a JSON specification with
db,collection, andop(countDocumentsorestimatedDocumentCount); an optionalfilternarrows the count.
This source is for health signals that live inside the database itself: replication lag, connection saturation, queue depth, or business KPIs like pending orders. For metrics that already exist in Prometheus or CloudWatch, use those sources instead.
When NOT to use this
- The signal already exists in Prometheus (via
postgres_exporter,mysqld_exporter, or your application's metrics endpoint). Use the Prometheus source. It avoids running a query against the production database on the cron interval. - The query is expensive. A probe runs every minute by default; a query that takes 500 ms is fine, a query that takes 5 seconds is a recipe for a database-CPU regression. The agent enforces a hard statement timeout (default 5 s, max 30 s), but the right answer for expensive queries is to materialize them into a summary table the probe reads cheaply.
- You need transaction-aware metrics. The probe issues a single one-shot query; it cannot observe values inside a transaction at the caller's isolation level.
Critical: provision a read-only role
The agent parser rejects non-SELECT statements before they leave the host, but that parser is belt-and-suspenders. The actual security boundary is the database role the agent connects as. Provision a dedicated read-only role and use it for every database probe.
PostgreSQL
-- Run as a superuser (or the database owner).
CREATE ROLE observer_agent WITH LOGIN PASSWORD 'replace-with-strong-password';
GRANT CONNECT ON DATABASE your_database TO observer_agent;
-- pg_monitor covers the catalog views the sample queries below read
-- (pg_stat_activity, pg_stat_replication, etc.). The pg_class and
-- pg_namespace tables are publicly readable by default, so no extra
-- grants are needed for the catalog-only samples.
GRANT pg_monitor TO observer_agent;
-- For business-metric queries against your application schema, grant
-- SELECT on the specific tables you want to read. Skip this block if
-- all your metrics target system catalogs only.
GRANT USAGE ON SCHEMA public TO observer_agent;
GRANT SELECT ON public.orders TO observer_agent;
-- Optional: auto-grant SELECT on future tables created by the
-- application owner role. Replace `app_owner_role` with whichever
-- role owns your tables. ALTER DEFAULT PRIVILEGES is per-creator;
-- without `FOR ROLE` it only affects tables that the current session
-- (the superuser running this block) creates later, which is rarely
-- what you want.
ALTER DEFAULT PRIVILEGES FOR ROLE app_owner_role IN SCHEMA public
GRANT SELECT ON TABLES TO observer_agent;
Then build a DSN with this role and export it on the agent host:
Use sslmode=verify-full when the agent host trusts the database's
CA; fall back to sslmode=require only if certificate distribution
isn't in place. require encrypts the connection but doesn't verify
the server identity.
MySQL
-- Run as root (or a user with GRANT permission).
CREATE USER 'observer_agent'@'%' IDENTIFIED BY 'replace-with-strong-password';
-- Read access on performance_schema covers the server-wide health
-- queries below (connection counts, replication status, slow query
-- counter). On MySQL 8.0+, these views are gated by SELECT on
-- performance_schema, not by the legacy PROCESS / REPLICATION CLIENT
-- privileges.
GRANT SELECT ON performance_schema.* TO 'observer_agent'@'%';
-- For business-metric queries, grant SELECT on the specific schema:
GRANT SELECT ON your_database.* TO 'observer_agent'@'%';
FLUSH PRIVILEGES;
If you target MySQL 5.7 (out of mainstream support), add
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'observer_agent'@'%';
so the replication views work. On MySQL 8.0+ the SELECT grant above
is sufficient and avoids letting the role see queries from every
other session.
Export the DSN on the agent host:
Redis
Provision a read-only ACL user. Redis 6+ supports per-command ACLs; restrict the agent user to the same allowlist the agent enforces.
ACL SETUSER observer_agent on >replace-with-strong-password ~* \
+DBSIZE +LLEN +SCARD +ZCARD +HLEN +STRLEN +TTL +PTTL +BITCOUNT +EXISTS +GET
For Redis 5 (no ACLs) use a requirepass and rely on the agent's
client-side allowlist; the parser rejects every mutating command at
save time, so a misconfigured legacy user still can't fire FLUSHALL.
Export the DSN on the agent host:
# Or with TLS:
MongoDB
Provision a read-only role on the specific database (or admin if
you need cross-database reads).
use admin
db.createUser({
user: "observer_agent",
pwd: "replace-with-strong-password",
roles: [
{ role: "read", db: "appdb" }
]
})
Export the DSN on the agent host. SRV records work too:
# Or with Atlas / SRV:
Configure a metric in the Observer console
Create a metric, pick SQL query, fill in:
-
Database:
PostgreSQL,MySQL,Redis, orMongoDB. -
Statement timeout: per-query timeout in milliseconds. Default 5,000; max 30,000. The agent applies this at the database connection level (Postgres
statement_timeout, MySQLMAX_EXECUTION_TIME) so a stuck query aborts even if the agent-side timeout is late to fire. -
Connection string env var: name of an env var on the agent host (
OBSERVER_PG_PROD_DSNin the examples above). The cloud never sees the DSN; the agent reads the value at query time and never logs it. -
Query: shape depends on the database kind.
- PostgreSQL / MySQL: a single
SELECT(orWITH … SELECT) returning exactly one row with one column. - Redis: a single command from the allowlist (
DBSIZE,LLEN,SCARD,ZCARD,HLEN,STRLEN,TTL,PTTL,BITCOUNT,EXISTS,GET). Mutating commands are rejected at save time. - MongoDB: a JSON object with
db,collection,op(countDocumentsorestimatedDocumentCount), and an optionalfilter. Server-side JS operators ($where,$function,$expr) are rejected.
See Sample queries and Result shape.
- PostgreSQL / MySQL: a single
Result shape
The query must return exactly one row and one column. The value is coerced to a number:
| Returned type | Coercion | Example |
|---|---|---|
| Number / integer / float | Used directly | 42 |
| Boolean | true → 1, false → 0 | true |
| Numeric string | Parsed | '97' |
| BigInt | Cast to Number when within Number.MAX_SAFE_INTEGER | 42n |
| NULL | Rejected (db_null_value) | NULL |
| Non-numeric string | Rejected (db_non_numeric) | 'healthy' |
| Multiple rows | Rejected (db_multi_row) | n/a |
| Multiple columns | Rejected (db_multi_column) | n/a |
When the metric you care about is naturally multi-row (e.g. one row
per replication slot), reduce it inside the query (max, sum,
count) so a single scalar reaches the agent.
Sample queries
PostgreSQL
-- Replication lag, in seconds. NULL on a primary (returns no_data).
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));
-- Long-running active queries (count).
SELECT count(*)
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < now() - interval '5 minutes';
-- Total backend connections.
SELECT count(*) FROM pg_stat_activity;
-- Largest table size, bytes.
SELECT max(pg_relation_size(c.oid))
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname NOT IN ('pg_catalog','information_schema');
-- Business metric: pending orders.
SELECT count(*) FROM orders WHERE status = 'pending';
MySQL
-- Replication lag, in seconds. Replica-only: returns zero rows on a
-- primary, which the agent surfaces as db_empty_result. Pin the
-- metric definition to replica hosts.
SELECT TIMESTAMPDIFF(
SECOND,
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
NOW()
)
FROM performance_schema.replication_applier_status_by_worker
WHERE APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP IS NOT NULL
ORDER BY APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP DESC
LIMIT 1;
-- Total connected threads.
SELECT VARIABLE_VALUE + 0
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Threads_connected';
-- Slow queries, cumulative since server start.
SELECT VARIABLE_VALUE + 0
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Slow_queries';
-- InnoDB row-lock waits.
SELECT VARIABLE_VALUE + 0
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_row_lock_waits';
For business metrics, query your application schema the same way you would in any other SELECT.
Redis
Each command is one line; the agent rejects anything outside the read-only allowlist before it reaches the server.
# Total number of keys in the selected DB.
DBSIZE
# Queue depth.
LLEN myqueue
# Active sessions tracked in a set.
SCARD active_users
# Hash field count.
HLEN session:abc
# Time until a distributed lock expires.
TTL lock:job-42
# A numeric counter stored as a string value.
GET counter:errors
Coercion: integers come back as integers; string replies (e.g. from
GET) are parsed as numbers. Non-numeric strings surface as
db_non_numeric.
MongoDB
Each metric definition stores a JSON object that names the database, collection, and operation. Whitespace inside the JSON is preserved verbatim, so format it however you prefer.
{
"db": "appdb",
"collection": "orders",
"op": "estimatedDocumentCount"
}
{
"db": "appdb",
"collection": "orders",
"op": "countDocuments",
"filter": { "status": "pending" }
}
{
"db": "appdb",
"collection": "sessions",
"op": "countDocuments",
"filter": { "created_at": { "$gte": "2026-05-20T00:00:00Z" } }
}
The filter accepts any BSON-serialisable JSON, with two exceptions:
$where, $function, $accumulator, and $expr are rejected at
save time. These operators trigger server-side JavaScript or full-
collection evaluation and don't belong in a per-cron probe.
estimatedDocumentCount reads from cached metadata and never accepts
a filter; use it when an exact count isn't required.
Reason codes specific to SQL
The reason field on no_data results carries one of:
db_dsn_missing: the env var named inconnection_string_refis not set on the agent. Export it and restart the agent.db_query_not_allowed: the parser rejected the query (e.g. not a SELECT, multi-statement body, or write inside a CTE).db_empty_result: the query returned zero rows. Fix the WHERE clause or change the metric to acount(*)so it always returns one row.db_multi_row/db_multi_column: the query returned more than one row or column. Reduce withcount,max,sum, or use a subquery.db_null_value: the single returned cell was NULL. Usecoalesce(value, 0)if zero is the right fallback for your metric, or add a WHERE clause that filters out the NULL state.db_non_numeric: the cell wasn't a number, boolean, or numeric string. Cast inside the query (::int,::numeric,CAST(... AS UNSIGNED)).db_timeout: the query exceededstatement_timeout_ms. Optimize the query or raise the timeout (cap is 30,000 ms).db_access_denied: the role lacks SELECT privilege on the table. Grant it explicitly.db_auth_failed: the DSN's username / password is wrong.db_connection_failed: the database is unreachable (ECONNREFUSED,ENOTFOUND,ETIMEDOUT, or driver-level network errors). The metric goesno_data, notunhealthy. The database being down is a separate concern from the metric being out of range.db_syntax_error: the query is malformed or references a missing table / column.db_error: catch-all for uncategorized driver errors.
Performance
- One probe = one query per cron interval. At the default 1-minute interval, every metric def runs 1,440 queries per day. Keep each query under a few hundred milliseconds.
- The agent caps its connection pool at 2 connections per unique (DSN, timeout) pair. Even with 50 metric defs against the same database, the agent never holds more than 2 connections open.
- Pool entries idle out after 10 seconds, so a probe that runs every 5 minutes typically holds zero open connections between ticks.
- The agent never retries a failed query inside a tick; failures
surface as
no_dataand the next cron tick is the next attempt.
Troubleshooting
db_dsn_missingeven though I set the env var. The agent reads the variable at query time fromprocess.env, so it must be present in the agent process's environment, not just in your shell. For Docker / Kubernetes deployments, set it on the container or pod spec.db_query_not_allowed: UPDATE statements are not allowed. The parser is strict. Even if your role is read-only, the parser rejects the statement before sending it. Rewrite the query as a SELECT.db_timeouton a query that runs fine inpsql. The database server may be applying a per-sessionstatement_timeoutthat's lower than the agent's. Check the role'sstatement_timeoutsetting (SHOW statement_timeout) and either raise it or lower the agent'sstatement_timeout_ms.db_connection_failedfrom inside a Kubernetes pod, but the same DSN works from a debug pod. Most often a NetworkPolicy or egress rule blocks the agent's pod. Confirm by execing into the agent pod and curling the database host:port.db_multi_rowon a query that returns one row inpsql. Postgres'spsqlsometimes treats multi-row results differently with\gvs\gx. The agent reads every returned row; if your query has any path that returns multiple rows, reduce withLIMIT 1or an aggregate.
Known limits
- PostgreSQL, MySQL, Redis, and MongoDB only. SQLite, BigQuery,
Snowflake, and ClickHouse are not implemented. The Redis runtime
supports a fixed allowlist of read-only commands; the MongoDB
runtime supports
countDocumentsandestimatedDocumentCountonly. Aggregation pipelines + Mongofindare deferred until customer demand surfaces. - No parameterized queries. The query is taken from configuration verbatim. There is no string interpolation surface, so there's no injection vector, but it also means runtime values (e.g. "alert if more than X rows match") have to be baked into the SQL.
- No "test connection" button in the form. The cloud has no path
to the customer's network. Save the metric and watch the agent
dashboard for the first probe result; failures surface as a
no_datarow with a specific reason code within one cron tick.