In database triage, the workflow can be represented as a Directed Acyclic Graph (DAG) of checks rather than a flat, linear script. Navigating this graph requires gathering data from a hierarchy of sources: logs, query statistics, active sessions, and execution plans. Based on the evidence a specific edge has to be traversed to the next investigation task.
Running the wrong database command at the wrong time can saturate CPU or lock catalog tables or gather the wrong evidence. To safely automate triage, we must separate the mechanical tasks (codified in local tools) from the branching decisions (evaluated by the agent). The agent’s job is to navigate the path, not to execute arbitrary SQL.
The Runbook as a DAG: Tasks vs. Decisions
A database triage runbook is structured as a Directed Acyclic Graph (DAG) where each node divides labor between the runner and the agent:
- The Task (Mechanical): Executing CLI commands, verifying file existence, checking database connectivity, parsing log strings, or querying system catalog views. This is purely mechanical and belongs in codified tool logic.
- The Decision (Agentic): Evaluating the task outputs and deciding which path in the DAG to traverse next. This requires reasoning under context and is the sole responsibility of the agent.
graph TD
Start[Root: Inspect Environment] --> TaskInspect[Task: pg-logstats inspect]
TaskInspect --> DecScope{Agent Decision: Historical, Live, or Both?}
DecScope -->|Historical Only / Logs Present| PathHist[Path A: Parse Logs]
DecScope -->|Live Only / DSN Present| PathLive[Path B: Query pg_stat_activity]
DecScope -->|Both Present| PathBoth[Path C: Prioritize & Correlate]
DecScope -->|Telemetry Missing| PathAbort[Abort & Prompt Operator]
PathHist --> TaskQueryFamilies[Task: group & rank queries]
TaskQueryFamilies --> DecAnalyze{Agent Decision: Which query family is the culprit?}
PathLive --> TaskLiveStats[Task: get active session waits]
TaskLiveStats --> DecLock{Agent Decision: Is there lock contention?}To see how this division of labor works under load, look at how a typical slow-query incident branches:
- Checking telemetry availability is mechanical. Validating whether Postgres logs are readable or testing if the DSN environment variable works requires no reasoning. These are utility tasks.
- Choosing where to focus is agentic. If both logs and a live connection are available, the agent must evaluate the symptoms:
- If database CPU is currently pinned at 100%, it should follow the live path (
pg_stat_activity) to catch query locks active right now. - If the CPU spike has already subsided, it must pivot to the historical path (parsing logs) to find past durations.
- If a firewall blocks live database access, it must decide to proceed in offline-only mode and log that limitation in its final summary.
- If database CPU is currently pinned at 100%, it should follow the live path (
The tool executes the nodes (reads the files, queries the catalogs); the agent simply decides which edge to follow next.
The Triage Sequence: Why Order and Cost Matter
Within the database triage DAG, nodes have different performance costs and operational risks. When all telemetry sources are available, the agent should traverse the DAG in a specific sequence that minimizes database impact. This is the recommended sequence of triage:
Logs (Offline) -> pg_stat_statements (Low) & pg_stat_activity (Low) -> EXPLAIN (Medium) -> EXPLAIN ANALYZE (High/Risk)
Each source answers a different question:
| Source | Diagnostic Value | Performance Cost | Primary Risk |
|---|---|---|---|
| Postgres Logs | Historical query text, exact durations, connection spikes | None (offline parse) | High token consumption if un-normalized |
pg_stat_statements | Aggregate query statistics and frequency | Negligible | Requires database connection |
pg_stat_activity | Live running queries, lock waits, active sessions | Low | Snapshot representation (misses transient spikes) |
EXPLAIN | Query plan estimates | Low | Executes planner only, can still lock tables in rare cases |
EXPLAIN ANALYZE | Actual execution plan and buffer usage | High | Executes the query. Will mutate data on writes and saturate CPU on reads |
The order of execution matters.
If an agent starts by querying live state (pg_stat_activity), it will miss transient query spikes that occurred five minutes ago. If it jumps directly to running EXPLAIN ANALYZE on a suspected slow query during an active database stall, it can consume the remaining CPU capacity, worsening the outage.
This sequence enforces that we move from offline, low-cost evidence to live, higher-cost diagnostics only as needed.
The Risk of Cold-Start Database Access
When an agent starts triage without a runbook, it faces the “cold-start” problem: it must learn the database schema and operational boundaries while operating inside them.
This leads to several failure modes:
- Inefficient Token Usage: An LLM trying to parse raw multi-line PostgreSQL logs via regular expressions consumes thousands of tokens and frequently fails on complex stack traces or nested query literals.
- Arbitrary SQL Execution: Without constraints, the agent will write ad-hoc queries against
pg_catalogtables. Running unindexed scans across system catalogs likepg_classholdsAccessShareLockon database metadata. If a migration or DDL statement runs concurrently, it will queue behind the catalog read to acquire anAccessExclusiveLock, blocking all subsequent application queries and causing a full database hang. - Safety Violations: An agent cannot be trusted to consistently remember safety rules (like wrapping write queries in a
ROLLBACKtransaction before runningEXPLAIN ANALYZE) solely based on a system prompt.
The mechanical tasks—parsing logs, normalizing SQL literals, ranking queries by runtime, and enforcing execution constraints—should remain in compiled code. The agent’s role should be limited to making decisions at explicit branch points.
Codifying the Runbook: pg-logstats
pg-logstats is a Rust-based tool designed to act as the boundary between the agent and the database. It structures triage into a stateful, step-by-step workflow:
inspect -> triage report -> next_actions -> execute_action -> updated report
Instead of executing arbitrary SQL, the agent interacts with pg-logstats by calling predefined command-line actions. The tool parses the raw logs, generates structured JSON reports, and exposes a set of safe, allowed transitions.
Phase 1: Inspecting the Environment
Triage begins by verifying the available telemetry. Running the inspect command evaluates the logs and determines which operating mode is safe:
pg-logstats --workspace /tmp/triage-workspace inspect production_stderr.log
The tool produces a structured state file (inspect.json):
{
"workflow": "inspect",
"operating_mode": "log_backed_only",
"limitations": [
"live_database_checks_unavailable"
],
"next_actions": [
{
"action_id": "inspect.top_query_families",
"action_type": "run_workflow",
"status": "allowed",
"reason": "Log file contains parsed statements and durations."
}
]
}
By analyzing the log format, pg-logstats determines that a live database connection (DSN) is missing. It restricts the agent’s environment to log_backed_only and explicitly blocks any live database queries.
Phase 2: Isolating Query Families
Once inspection is complete, the agent is permitted to run the next allowed workflow:
pg-logstats --workspace /tmp/triage-workspace query-families production_stderr.log
The tool parses the logs, groups queries by stripping literal values (e.g., converting id = 42 to id = ?), and ranks them by execution cost:
{
"id": "qf_51125b8829ab1fdf",
"rank": 1,
"reason": "12 executions contributed 14.2 seconds total runtime; max execution was 2.4 seconds",
"query_family": {
"normalized_sql": "SELECT * FROM transactions WHERE user_id = ? AND status = ?;",
"database": "payment_prod",
"user": "billing_service",
"application_name": "worker_node_1"
}
}
The normalization and grouping happen locally in Rust, consuming zero LLM tokens. The agent receives a clean, structured digest of the problem query family without having to parse raw log strings.
Phase 3: Enforcing Boundaries (EXPLAIN vs. EXPLAIN ANALYZE)
A database connection is active, but the runbook still restricts the agent’s SQL execution boundaries. If the agent wants to inspect the query plan for a slow query family, the runbook allows standard EXPLAIN planning, but blocks EXPLAIN ANALYZE if the query is not a read-only SELECT or if the statement is too complex:
pg-logstats --workspace /tmp/triage-workspace explain qf_51125b8829ab1fdf
The tool checks the query semantics and active locks, returning:
{
"workflow": "explain",
"query_family_id": "qf_51125b8829ab1fdf",
"next_actions": [
{
"action_id": "explain.standard",
"status": "allowed",
"command": "EXPLAIN SELECT * FROM transactions WHERE user_id = $1;"
},
{
"action_id": "explain.analyze",
"status": "blocked",
"reason": "EXPLAIN ANALYZE is blocked because the target table (transactions) has write lock contention. Running it would worsen active CPU saturation."
}
]
}
The tool acts as a policy gate. Standard EXPLAIN runs instantly in the planner without executing the query, so it is safe. EXPLAIN ANALYZE actually executes the query, which could worsen CPU starvation on a locked table. By blocking this transition at the node layer, the tool protects the database from unsafe agent choices.
Practical Rule
When designing agentic database tooling:
Codify the diagnostics in the tool; use the agent only to choose the path.
| Step | Responsible Layer | Action | Safety Guardrail |
|---|---|---|---|
| Parsing & Grouping | Rust CLI (pg-logstats) | String normalization, duration sorting | No LLM hallucinations on query statistics |
| State Verification | Rust CLI (pg-logstats) | Verifying log paths and DSN connectivity | Agent cannot execute SQL without a verified inspect run |
| Path Selection | AI Agent | Choosing to request live access or stop triage | LLM focuses on operator interaction and final recommendation |
| SQL Execution | Rust CLI (pg-logstats) | Running parametrized queries on pg_stat_* | Arbitrary writes and unsafe EXPLAIN ANALYZE calls are blocked |
Moving the operational state from the agent’s prompt into a stateful CLI contract keeps the database safe. It also stops the agent from wasting tokens writing custom SQL or trying to parse multi-line log streams with raw regex.