How to Map RTO and RPO for PostgreSQL Clusters

This page solves one concrete task inside the broader RTO and RPO mapping frameworks methodology: translating two abstract continuity targets into deterministic PostgreSQL cluster behaviors, then proving compliance with a drill that measures them empirically rather than assuming them. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are only meaningful once each is bound to a specific database subsystem — RPO to Write-Ahead Log (WAL) generation and replication topology, RTO to the discrete phases of standby promotion and replay. Ad-hoc failover runbooks neither expose configuration drift nor emit machine-parseable evidence, so a drill that “passed” tells you nothing an auditor can trust. The orchestrator below decouples the two objectives, asserts WAL lag before it will promote, classifies each drill outcome into a status tier consumable by downstream error categorization frameworks, and only ever reports success against an integrity-verified backup — the same guarantee a checksum validation pipeline provides upstream. Time-bounded promotion also depends on the point-in-time recovery targeting that fixes exactly which WAL position the standby replays to.

Architecture and Execution Model

Figure. The PostgreSQL drill orchestrator chaining backup verification, WAL gap measurement, standby promotion, and empirical RTO and RPO scoring.

The orchestrator treats backup integrity as a hard precondition, measures the RPO surface (replication flush lag) before touching the PostgreSQL cluster, and only then executes the RTO-bearing operation (promotion). RPO is evaluated by inspection; RTO is evaluated by timing the real recovery, because the only defensible RTO figure is one the live cluster actually produced under drill conditions. The two verdicts are reported independently — a drill can meet RTO while breaching RPO, and conflating them hides exactly the failure mode you ran the drill to find.

Prerequisites

Python 3.8+ — the type hints and datetime.timezone usage are stable from 3.8 onward.
The PostgreSQL driver, installed into the automation environment:
bash
```
pip install "psycopg2-binary>=2.9"
```
pgbackrest (or pg_verifybackup) on the automation host, with a configured stanza whose repository the host can reach for the pre-drill check.
A streaming standby reachable from the automation host, lagging within the RPO tolerance defined for the drill. Promote the standby, never the primary.
A dedicated drill account with the attributes needed to read replication state and promote. Membership in pg_monitor exposes the lag columns of pg_stat_replication; pg_promote() is restricted to superusers by default, so EXECUTE on it must be granted explicitly:
sql
```
CREATE ROLE dr_admin LOGIN PASSWORD '<from-vault>';
GRANT pg_monitor TO dr_admin;
GRANT EXECUTE ON FUNCTION pg_promote(boolean, integer) TO dr_admin;
```
OS-level access to the data directory if you promote with pg_ctl rather than pg_promote() — the drill user must own or be able to signal the target PGDATA.
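The host-side prerequisites above can be checked mechanically before a drill starts. The following is a minimal sketch, assuming the tools are expected on PATH; `missing_tools` is a hypothetical helper name and is not part of the orchestrator.

```python
import shutil
from typing import Iterable, List


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # pgbackrest and pg_ctl are the two binaries the drill shells out to.
    absent = missing_tools(["pgbackrest", "pg_ctl"])
    if absent:
        print("missing prerequisites:", ", ".join(absent))
    else:
        print("all prerequisite binaries found")
```

Running this as the same OS user that will run the drill matters, because PATH and file permissions can differ between accounts.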

Mapping RPO to WAL Synchronization Controls

RPO defines the maximum tolerable data-loss window, and in PostgreSQL it maps directly to WAL generation velocity, archival cadence, and replication synchronization topology. To enforce an RPO of five seconds or less, set synchronous_commit to remote_apply and deploy a synchronous standby quorum; synchronous replication guarantees zero loss of committed transactions at the cost of commit latency. PostgreSQL does not fall back to asynchronous replication on its own: while synchronous_standby_names is set and no synchronous standby is available, commits wait. When a storage subsystem introduces latency the cluster cannot absorb, degrading to asynchronous replication must therefore be an explicit operator or automation action, paired with a compensating archive_timeout so an idle primary still ships a WAL segment on a bounded schedule:

ini
```
synchronous_commit = remote_apply
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
archive_timeout = 60s
archive_command = 'pgbackrest --stanza=prod archive-push %p'
```
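The interaction between archive_timeout and the RPO target can be made concrete with a little arithmetic. The sketch below assumes recovery from the WAL archive alone, with no surviving standby, and a hypothetical measured push time of 2.5 seconds. Under those assumptions a 60-second archive_timeout bounds loss at roughly a minute, so only streaming replication can carry a five-second RPO. The function name and the push-time figure are illustrative, not taken from the orchestrator.

```python
def archive_rpo_bound_sec(archive_timeout_sec: float, archive_push_sec: float) -> float:
    """Rough upper bound on the archive-only data-loss window, in seconds.

    Assumes archive_command is succeeding. If it is failing, the backlog of
    unarchived WAL, and therefore the window, grows without bound.
    """
    if archive_timeout_sec <= 0:
        # archive_timeout = 0 disables forced segment switches.
        raise ValueError("archive_timeout must be > 0 for a bounded window")
    if archive_push_sec < 0:
        raise ValueError("archive_push_sec cannot be negative")
    return archive_timeout_sec + archive_push_sec


bound = archive_rpo_bound_sec(60, 2.5)  # 2.5 s push time is a made-up figure
print(bound, bound <= 5)  # prints: 62.5 False
```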

Validation must parse replication metrics continuously and assert lag thresholds before permitting any failover drill. Query pg_stat_replication on the primary for the per-standby write and flush lag, and treat flush_lag as the RPO-bearing figure: it measures how long recently generated WAL takes to be confirmed as flushed to durable storage on the standby:

sql
```
SELECT application_name, state, sync_state,
       extract(epoch from write_lag) AS write_lag_sec,
       extract(epoch from flush_lag) AS flush_lag_sec
FROM pg_stat_replication
WHERE state = 'streaming';
```

The drill must raise a hard failure when flush_lag_sec exceeds the defined RPO, so a promotion never executes against a standby cluster with stale replication state.
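One way to express that hard failure is a small gate that runs on the query result before any promotion is attempted. This is a sketch under stated assumptions, not the orchestrator's own code: `RPOBreachError` and `assert_rpo` are hypothetical names, an empty result is treated as a failure because it means no standby is streaming, and a NULL lag (which PostgreSQL reports for an idle, caught-up standby) is treated as zero.

```python
from typing import Iterable, Optional, Tuple


class RPOBreachError(RuntimeError):
    """Raised when replication state cannot satisfy the RPO."""


def assert_rpo(rows: Iterable[Tuple[str, Optional[float]]], rpo_sec: float) -> float:
    """Return the worst flush lag in seconds, or raise RPOBreachError.

    `rows` holds (application_name, flush_lag_sec) pairs taken from the
    pg_stat_replication query. A None lag is treated as 0 (assumption).
    """
    pairs = [(name, lag if lag is not None else 0.0) for name, lag in rows]
    if not pairs:
        # No rows means no streaming standby, which is not the same as zero lag.
        raise RPOBreachError("no streaming standby reported by the primary")
    worst_name, worst_lag = max(pairs, key=lambda pair: pair[1])
    if worst_lag > rpo_sec:
        raise RPOBreachError(
            f"{worst_name}: flush_lag {worst_lag:.3f}s exceeds RPO {rpo_sec}s"
        )
    return worst_lag


print(assert_rpo([("standby1", 1.842), ("standby2", None)], rpo_sec=5))
```

Raising instead of returning a boolean means a caller cannot forget to check the result before it promotes.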

Mapping RTO to Recovery-Phase Orchestration

RTO governs maximum permissible downtime, and mapping it means isolating recovery into discrete, measurable phases: standby provisioning, WAL replay, shared-buffer warming, and application routing. For an RTO under ten minutes, promote a pre-warmed read replica rather than restoring from cold storage. A fresh Point-in-Time Recovery (PITR) must restore the base backup and then replay every WAL segment up to recovery_target_time, so its duration scales with database size and WAL volume and is unlikely to meet a sub-ten-minute target on a large cluster. When replay is the critical path, tune checkpoint behavior and parallelize WAL retrieval:

ini
```
max_wal_size = 4GB
checkpoint_completion_target = 0.9
restore_command = 'pgbackrest --stanza=prod archive-get %f "%p"'
```

The RTO the drill reports is the wall-clock interval from issuing promotion to the instance accepting writes — measured, not estimated. Consult the PostgreSQL documentation on WAL recovery configuration for parameter precedence and restart requirements before tuning these values in production.
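To see which recovery phase consumes the RTO budget, each phase can be timed separately and summed. The sketch below is illustrative only: `PhaseTimer` is a hypothetical helper, the phase names are examples, and the `time.sleep` calls stand in for real recovery work. It uses a monotonic clock so a wall-clock adjustment during the drill cannot distort the measurement.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhaseTimer:
    """Accumulate elapsed seconds for named recovery phases."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the phase even if it raised, so a stalled phase stays visible.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self) -> float:
        """Sum of all recorded phases: the figure to compare against the RTO."""
        return sum(self.durations.values())


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("promotion"):
        time.sleep(0.01)  # stand-in for pg_ctl promote plus recovery exit
    with timer.phase("application_routing"):
        time.sleep(0.01)  # stand-in for repointing clients
    print(timer.durations, timer.total())
```

The injectable clock is a design choice that keeps the timer testable without real waiting.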

Production Implementation

The orchestrator chains backup verification, WAL-gap measurement, and promotion timing into a single headless run that emits a structured result and a strict POSIX exit code the DR pipeline can branch on. It verifies the backup first, refuses to proceed if integrity fails, measures flush lag against the RPO limit, promotes the standby, and times the real recovery to compute empirical RTO. RPO and RTO verdicts are reported separately.

python
```
#!/usr/bin/env python3
"""PostgreSQL DR drill orchestrator.

Maps RTO/RPO targets to measured cluster behavior and gates failover.

Exit codes (consumed by the DR pipeline):
    0  backup verified, RPO met, and RTO met   -> drill passed
    1  backup verified but RPO and/or RTO breached -> quarantine, escalate
    2  backup integrity check failed or usage error -> abort pipeline
"""
import json
import logging
import subprocess
import sys
import time
from contextlib import closing
from datetime import datetime, timezone

import psycopg2

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger("pg_dr_drill")


class DRDrillOrchestrator:
    def __init__(self, primary_dsn, standby_dsn, pgdata,
                 stanza="prod", rpo_sec=5, rto_sec=600, poll_timeout=900):
        self.primary_dsn = primary_dsn
        self.standby_dsn = standby_dsn
        self.pgdata = pgdata
        self.stanza = stanza
        self.rpo_limit = rpo_sec
        self.rto_limit = rto_sec
        self.poll_timeout = poll_timeout

    def validate_backup_integrity(self) -> bool:
        # Hard precondition: never promote against an unverified backup.
        try:
            result = subprocess.run(
                ["pgbackrest", f"--stanza={self.stanza}", "check"],
                capture_output=True, text=True,
            )
        except OSError as exc:
            # pgbackrest missing or not executable: treat as a failed check.
            logger.error("Backup integrity check could not run: %s", exc)
            return False
        if result.returncode != 0:
            logger.error("Backup integrity check failed: %s", result.stderr.strip())
        return result.returncode == 0

    def measure_wal_gap(self) -> float:
        # Worst-case flush lag across all streaming standbys, in seconds.
        # COALESCE maps both NULL lag and "no rows" to 0, so this must run
        # on the primary with at least one standby streaming.
        # psycopg2's connection context manager only ends the transaction;
        # closing() is what actually closes the connection.
        with closing(psycopg2.connect(self.primary_dsn)) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT COALESCE(MAX(EXTRACT(EPOCH FROM flush_lag)), 0) "
                    "FROM pg_stat_replication WHERE state = 'streaming';"
                )
                return float(cur.fetchone()[0])

    def orchestrate_failover(self) -> dict:
        # Monotonic clock: the interval is immune to wall-clock adjustments.
        start_ts = time.monotonic()
        subprocess.run(["pg_ctl", "promote", "-D", self.pgdata], check=True)

        # RTO is the measured interval until the instance leaves recovery.
        while time.monotonic() - start_ts < self.poll_timeout:
            try:
                with closing(psycopg2.connect(self.standby_dsn)) as conn:
                    with conn.cursor() as cur:
                        cur.execute("SELECT pg_is_in_recovery();")
                        if not cur.fetchone()[0]:
                            rto_actual = time.monotonic() - start_ts
                            return {"rto_sec": rto_actual,
                                    "met": rto_actual <= self.rto_limit}
            except psycopg2.OperationalError:
                pass  # standby not accepting connections yet; retry
            # Pace every poll, not only the failed ones, to avoid a busy loop.
            time.sleep(1)

        # Timed out before the standby accepted writes.
        return {"rto_sec": self.poll_timeout, "met": False}

    def run_drill(self) -> dict:
        wal_lag = self.measure_wal_gap()
        rpo_met = wal_lag <= self.rpo_limit
        if not rpo_met:
            logger.warning("RPO breached: flush_lag %.3fs exceeds limit %ss",
                           wal_lag, self.rpo_limit)

        recovery = self.orchestrate_failover()
        if not recovery["met"]:
            logger.warning("RTO breached: recovery %.1fs exceeds limit %ss",
                           recovery["rto_sec"], self.rto_limit)

        return {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "rpo_met": rpo_met,
            "wal_lag_sec": wal_lag,
            "rto_met": recovery["met"],
            "rto_actual_sec": recovery["rto_sec"],
        }


def main() -> int:
    if len(sys.argv) != 2:
        logger.error("Usage: pg_dr_drill.py <drill_config.json>")
        return 2

    try:
        with open(sys.argv[1], "r", encoding="utf-8") as handle:
            cfg = json.load(handle)
        orchestrator = DRDrillOrchestrator(
            primary_dsn=cfg["primary_dsn"],
            standby_dsn=cfg["standby_dsn"],
            pgdata=cfg["pgdata"],
            stanza=cfg.get("stanza", "prod"),
            rpo_sec=cfg.get("rpo_sec", 5),
            rto_sec=cfg.get("rto_sec", 600),
            poll_timeout=cfg.get("poll_timeout", 900),
        )
    except (OSError, json.JSONDecodeError, KeyError) as exc:
        # A missing required key is a usage error, not a breached objective.
        logger.error("Configuration error: %s", exc)
        return 2

    if not orchestrator.validate_backup_integrity():
        logger.critical("Backup validation failed. Aborting drill.")
        return 2

    report = orchestrator.run_drill()
    logger.info("Drill report: %s", json.dumps(report))

    if report["rpo_met"] and report["rto_met"]:
        logger.info("Drill passed: RTO and RPO both met.")
        return 0

    logger.critical("Drill failed: RTO and/or RPO breached. Quarantine and escalate.")
    return 1


if __name__ == "__main__":
    sys.exit(main())
```

The configuration file supplies the two DSNs, the data directory to promote, the stanza, and the objective limits in seconds:

json
```
{
  "primary_dsn": "host=primary dbname=postgres user=dr_admin",
  "standby_dsn": "host=standby dbname=postgres user=dr_admin",
  "pgdata": "/var/lib/postgresql/data",
  "stanza": "prod",
  "rpo_sec": 5,
  "rto_sec": 600
}
```

Step-by-Step Execution Walkthrough

Render the config from your secret store. Inject the dr_admin credentials into primary_dsn and standby_dsn at deploy time; never commit them.
Confirm the standby is streaming and within tolerance. Run the pg_stat_replication query above and verify flush_lag_sec is below rpo_sec before starting — the drill will fail it anyway, but checking first avoids burning a promotion.
Run the orchestrator against a drill environment, capturing the exit code:
bash
```
python3 pg_dr_drill.py drill_config.json; echo "exit=$?"
```
Read the structured report line. The log line prefixed with Drill report: carries a JSON object with rpo_met, wal_lag_sec, rto_met, and rto_actual_sec, which is the empirical evidence the audit store needs.
Branch on the exit code. 0 records a clean drill, 1 quarantines the backup and escalates on a breached objective, 2 aborts because the backup failed verification or the invocation was malformed.
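The report line from the walkthrough can be pulled out of captured log output with a few lines of Python. This is a sketch, assuming the log format shown in this article; `parse_drill_report` is a hypothetical helper. Note that `logging.basicConfig` writes to stderr by default, so that is the stream to capture.

```python
import json
from typing import Optional

MARKER = "Drill report: "


def parse_drill_report(log_text: str) -> Optional[dict]:
    """Return the JSON payload of the last 'Drill report:' line, or None."""
    report = None
    for line in log_text.splitlines():
        _, found, payload = line.partition(MARKER)
        if found:
            report = json.loads(payload)
    return report


sample = (
    '2026-07-05 04:20:03 | INFO | pg_dr_drill | Drill report: '
    '{"timestamp": "2026-07-05T04:20:03+00:00", "rpo_met": true, '
    '"wal_lag_sec": 1.842, "rto_met": true, "rto_actual_sec": 47.9}'
)
report = parse_drill_report(sample)
print(report["rpo_met"], report["rto_actual_sec"])  # prints: True 47.9
```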

Verification and Expected Output

A passing drill verifies the backup, measures lag under the limit, times a fast recovery, and exits 0:

text
```
2026-07-05 04:20:03 | INFO | pg_dr_drill | Drill report: {"timestamp": "2026-07-05T04:20:03+00:00", "rpo_met": true, "wal_lag_sec": 1.842, "rto_met": true, "rto_actual_sec": 47.9}
2026-07-05 04:20:03 | INFO | pg_dr_drill | Drill passed: RTO and RPO both met.
```

A breached objective logs a WARNING naming the objective, then the Drill report line (omitted here), then a CRITICAL summary, and the run exits 1:

text
```
2026-07-05 04:20:03 | WARNING | pg_dr_drill | RPO breached: flush_lag 9.310s exceeds limit 5s
2026-07-05 04:20:03 | CRITICAL | pg_dr_drill | Drill failed: RTO and/or RPO breached. Quarantine and escalate.
```

The exit code is the contract the pipeline reads:

0 — backup verified and both objectives met. Record the drill as green.
1 — backup verified but RPO and/or RTO breached. Quarantine and escalate.
2 — backup integrity check failed, or the invocation was malformed. Abort the pipeline.
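A pipeline can turn this contract into a status tier with a small lookup. The sketch below is one possible mapping; the `DrillStatus` names are hypothetical and not defined by the orchestrator. Any code outside the contract, such as the negative value `subprocess` reports on POSIX when a process is killed by a signal, is mapped to an explicit unknown tier rather than guessed at.

```python
from enum import Enum


class DrillStatus(Enum):
    PASSED = "passed"                          # exit 0
    OBJECTIVE_BREACHED = "objective_breached"  # exit 1
    ABORTED = "aborted"                        # exit 2
    UNKNOWN = "unknown"                        # anything outside the contract


_EXIT_TO_STATUS = {
    0: DrillStatus.PASSED,
    1: DrillStatus.OBJECTIVE_BREACHED,
    2: DrillStatus.ABORTED,
}


def classify_exit(code: int) -> DrillStatus:
    """Map a drill exit code to a status tier; unmapped codes are UNKNOWN."""
    return _EXIT_TO_STATUS.get(code, DrillStatus.UNKNOWN)


for code in (0, 1, 2, -9):
    print(code, classify_exit(code).value)
```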

Failure Modes and Troubleshooting

Symptom	Cause	Remediation
Exit `2` with `Backup integrity check failed`	`pgbackrest check` cannot reach the repo or the archive is behind	Confirm the stanza config and repository path; run `pgbackrest --stanza=prod check` by hand and read `stderr`
`wal_lag_sec` is always `0`	No rows in `pg_stat_replication` — the standby is not streaming or the query ran on the standby	Run `measure_wal_gap` against the primary DSN; verify `state = 'streaming'` for the standby
`RPO breached` on a write-heavy window	Async replication or a saturated network link inflates `flush_lag`	Move to `synchronous_commit = remote_apply`; run drills off peak; verify `synchronous_standby_names` names the standby
`pg_ctl: could not send promote signal`	Drill user cannot signal `PGDATA`, or wrong data directory	Point `pgdata` at the standby’s real `PGDATA`; run as the `postgres` OS user or use `pg_promote()` instead
`RTO breached` with `rto_actual_sec == poll_timeout`	Replay stalled on I/O, or the standby never left recovery	Raise `poll_timeout` for large clusters; tune `checkpoint_completion_target`; parallelize `restore_command`
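`psycopg2.OperationalError` loops until timeout	The standby is unreachable at `standby_dsn`: wrong host or port, a `pg_hba.conf` rejection, or the instance is down. Promotion itself does not restart the server	Transient errors are retried by the poll; if they persist for the full `poll_timeout`, verify `standby_dsn` and read the standby's server log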

Integration Notes

The orchestrator is built for headless scheduling, and its strict exit codes let any scheduler own the drill without extra glue:

Airflow — invoke it from a BashOperator, or a PythonOperator that shells out and inspects returncode; a non-zero exit fails the task and short-circuits any downstream promotion step, keeping the DAG run history as the audit trail.
Celery — wrap the call in a task that raises on non-zero so the broker records the failure and event-driven drills (fired when a fresh verified backup lands) get low-latency dispatch.
cron / systemd — schedule the wrapper directly; because it returns POSIX codes, a systemd OnFailure= handler can route quarantine alerts natively.
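All three integrations above reduce to the same pattern: run the drill as a child process and surface a non-zero exit as a failure the scheduler understands. Here is a scheduler-agnostic sketch, assuming the script is invoked as shown in the walkthrough; `DrillFailed`, `run_drill_or_raise`, and the 1800-second timeout are hypothetical choices, not part of the orchestrator.

```python
import subprocess
import sys
from typing import Sequence


class DrillFailed(RuntimeError):
    """Raised when the drill process exits non-zero."""

    def __init__(self, returncode: int) -> None:
        super().__init__(f"drill exited with status {returncode}")
        self.returncode = returncode


def run_drill_or_raise(argv: Sequence[str], timeout: float = 1800.0) -> None:
    """Run the drill command and raise DrillFailed on any non-zero exit.

    subprocess.TimeoutExpired propagates if the drill outlives `timeout`.
    """
    completed = subprocess.run(list(argv), timeout=timeout)
    if completed.returncode != 0:
        raise DrillFailed(completed.returncode)


if __name__ == "__main__":
    # Invocation taken from the walkthrough; paths are relative to the cwd.
    run_drill_or_raise([sys.executable, "pg_dr_drill.py", "drill_config.json"])
```

A Celery task or an Airflow PythonOperator callable could call `run_drill_or_raise` directly, since both treat an uncaught exception as a failed task.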

Feed the JSON report into the broader Core DR Architecture & Validation Fundamentals audit store so every drill carries immutable evidence of measured RTO, measured RPO, and the backup version it ran against. For compliance alignment, cross-reference the recorded objectives with NIST SP 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems so auditors can trace each measured recovery figure to a stated requirement.

Frequently Asked Questions

Why measure RTO empirically instead of estimating it from replay throughput?

An estimate assumes the critical path you modeled is the one that actually runs. In practice promotion time is dominated by whichever phase stalls — checkpoint completion, shared-buffer warming, connection draining, or replay I/O contention — and those interact non-linearly under load. Timing the real promotion until the instance leaves recovery is the only figure an auditor can trust, and the only one that surfaces regressions when a config change quietly lengthens recovery.

Why is flush_lag the RPO-bearing metric rather than write_lag or replay_lag?

write_lag only proves the standby received the WAL and wrote it out, not that it was flushed to durable storage; replay_lag measures visibility to readers, which is an RTO concern. flush_lag is the interval until the WAL is durably on the standby's storage, the point past which a primary loss would lose committed data. That makes it the quantity RPO bounds, so the drill asserts against it.

Should I run the drill against production or a promoted clone?

Run it against an isolated drill environment built from the same backups and replication topology, not production. The orchestrator issues a real promotion, which breaks replication for the promoted node; doing that to a live standby removes redundancy from the production cluster for the duration. Provision a sandbox that mirrors the topology, run the drill there, and treat the measured figures as representative.

What distinguishes exit code 1 from exit code 2?

Exit 2 means the run never produced a verdict: the backup failed its integrity check, or the config was malformed. Exit 1 means the drill ran end to end and at least one objective was breached. Orchestrators treat 2 as "fix the backup or the invocation" and 1 as "the recovery envelope was violated — quarantine and escalate."

RTO vs RPO mapping frameworks — the parent methodology this PostgreSQL mapping instantiates.
Point-in-time targeting for MongoDB backups — the same recovery-target discipline applied to oplog-based engines.
Handling page corruption in PostgreSQL backups — the storage-level integrity check that precedes a trustworthy drill.
Python script for MySQL checksum validation — the logical-integrity gate that gives “backup verified” its meaning.
Fallback chain design for Kubernetes clusters — where a breached objective routes recovery to the next tier.
