
Implementing Zero-Trust Boundaries in DR Sandboxes

This page implements one concrete task inside Security Boundaries for DR Environments: a Python orchestrator that asserts a zero-trust perimeter around a disaster-recovery drill before any restored database accepts a connection, then revokes it deterministically when validation finishes. Historically, DR sandboxes have run on implicit trust, on the assumption that logical isolation removes the need for granular access controls. During automated backup validation that assumption becomes a lateral-movement vector between restored instances, application proxies, and validation agents. Consistent with the wider Core DR Architecture & Validation Fundamentals, every restored workload here is treated as untrusted until it is network-fenced and policy-evaluated. Three rules follow. The same sandbox provisioning automation that stands up disposable infrastructure must also stand up a default-deny boundary around it. A backup crosses that boundary only after a checksum validation pipeline proves it untampered. Any boundary violation is mapped onto the shared error categorization framework, so a credential leak and a slow mount are never triaged alike. The operational objective is to synchronize network segmentation, database authentication, and validation-agent access without adding latency that violates the recovery windows defined in your RTO and RPO mapping.

Architecture and Execution Model

The orchestrator sequences four operations in a fixed order: fence the namespace, mint short-lived credentials, run read-only checks, then tear everything down. The teardown branch runs whether validation passes or fails, so a crashed drill never leaves a credential lease or an open network path behind.

Figure. The ordered interactions where the orchestrator applies network isolation, injects TTL-bound credentials, runs validation, and tears down the sandbox.
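
The ordering guarantee can be sketched with a context manager. This is a minimal illustration rather than the orchestrator itself: `fenced_drill`, `run_drill`, and the `fence`/`teardown` callables are hypothetical names standing in for the policy and lease operations.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def fenced_drill(
    fence: Callable[[], None], teardown: Callable[[], None]
) -> Iterator[None]:
    """Apply the boundary, yield to the drill body, and always tear down."""
    fence()  # if fencing itself fails, nothing was created, so nothing to undo
    try:
        yield
    finally:
        teardown()  # runs on success, on a failed check, and on exceptions


def run_drill(validate: Callable[[], bool], events: List[str]) -> bool:
    """Hypothetical driver that records the order of operations in `events`."""
    try:
        with fenced_drill(
            lambda: events.append("fence"),
            lambda: events.append("teardown"),
        ):
            events.append("validate")
            return validate()
    except Exception:
        return False  # a crashed validation counts as a failed drill
```

Whether `validate` returns or raises, `events` ends with `"teardown"`, which is the property the production code relies on.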

Two enforcement layers do the work. The first is an ephemeral network slice: an isolated namespace or VPC subnet with default-deny ingress and egress, so nothing in the sandbox can reach production and nothing in production can reach the sandbox. The second is a drill-scoped credential lifecycle: no static passwords or long-lived IAM roles, only leases bound to the sandbox and expiring inside the validation window.

Ephemeral Micro-Segmentation

Static firewall rules and broad security groups are replaced with programmatically generated policies bound to backup-manifest metadata, so each drill iteration receives its own uniquely named network slice. For Kubernetes sandboxes, apply a namespace-scoped NetworkPolicy that restricts traffic exclusively to the validation orchestrator’s pod selector and the target database port; see the Kubernetes NetworkPolicy specification for exact API semantics.

yaml

# dr-sandbox-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dr-drill-isolation
  namespace: dr-sandbox-
spec:
  podSelector:
    matchLabels:
      app: restored-db
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: validation-agent
      ports:
        - protocol: TCP
          port: 5432
  egress: []

The empty egress list is the load-bearing detail: it forbids the restored database from initiating any outbound connection, which is what stops a compromised or misconfigured restore from phoning home or reaching a production endpoint during the drill.
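
One way to generate the policy per drill and to assert the egress property before applying it is sketched below. It assumes the namespace follows the `dr-sandbox-<drill_id>` pattern used by the orchestrator; `render_policy` and `denies_all_egress` are hypothetical helper names, and the policy is built as a plain dict rather than through a Kubernetes client.

```python
from typing import Any, Dict


def render_policy(drill_id: str, port: int = 5432) -> Dict[str, Any]:
    """Build the NetworkPolicy body for one drill as a plain dict."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "dr-drill-isolation",
            "namespace": f"dr-sandbox-{drill_id}",
        },
        "spec": {
            "podSelector": {"matchLabels": {"app": "restored-db"}},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [
                {
                    "from": [
                        {"podSelector": {"matchLabels": {"role": "validation-agent"}}}
                    ],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
            "egress": [],
        },
    }


def denies_all_egress(policy: Dict[str, Any]) -> bool:
    """True when Egress is a declared policy type and no egress rule is listed."""
    spec = policy.get("spec", {})
    return "Egress" in spec.get("policyTypes", []) and not spec.get("egress")
```

Checking `denies_all_egress(render_policy(drill_id))` in CI catches a template change that drops `Egress` from `policyTypes`, which would silently reopen outbound traffic.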

Drill-Scoped Credential Lifecycle

A centralized secrets manager (HashiCorp Vault, AWS Secrets Manager) issues credentials with automatic TTL expiration — typically 15–30 minutes for a validation window. The orchestrator requests them immediately after the restore completes; the secrets engine binds the generated role to the ephemeral slice and enforces strict session boundaries. This eliminates credential reuse across drill iterations and satisfies least-privilege without a human ever seeing a password.
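
The TTL arithmetic can be checked before a drill starts. The sketch below assumes the lease duration is available in seconds (Vault reports it as `lease_duration`) and takes the 30-minute upper bound from the range quoted above; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def lease_expiry(issued_at: datetime, lease_duration: int) -> datetime:
    """Moment at which a lease issued at `issued_at` stops being valid."""
    return issued_at + timedelta(seconds=lease_duration)


def lease_covers_window(
    lease_duration: int,
    expected_validation_seconds: int,
    max_window_seconds: int = 1800,  # 30 minutes, the upper bound assumed here
) -> bool:
    """The lease must outlast validation without exceeding the allowed window."""
    return expected_validation_seconds <= lease_duration <= max_window_seconds


issued = datetime(2026, 7, 5, 4, 12, 2, tzinfo=timezone.utc)
print(lease_expiry(issued, 900).isoformat())
```

A lease that is too short fails validation mid-run, and one that is too long leaves a usable credential after teardown if revocation is ever skipped, so both bounds are worth asserting.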

Prerequisites

Python 3.10+ (the orchestrator uses structural typing and modern typing syntax).
In-cluster execution: the orchestrator runs as a pod with a service account bound to a role granting create/delete on networkpolicies in the sandbox namespace only.
A reachable Vault with a configured database secrets engine and a role whose default_ttl/max_ttl bound the validation window.
A sandbox CA bundle mounted into the pod so the validator can pin TLS to the sandbox certificate authority.
Install the client libraries:

bash

pip install "kubernetes>=29.0" "hvac>=2.1" "psycopg2-binary>=2.9" "pydantic>=2.6"

The service account must have no standing grant on production namespaces, and the Vault token must be a short-lived drill token, not a root token.
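
The token requirement can be checked up front. The sketch below is an assumption-laden illustration: it expects the `data` object from a Vault token self-lookup, passed in as a plain dict, and relies on root tokens carrying the `root` policy and non-expiring tokens reporting a `ttl` of 0. The function name and the 30-minute ceiling are hypothetical.

```python
from typing import Any, Dict


def is_drill_scoped_token(
    token_data: Dict[str, Any], max_ttl_seconds: int = 1800
) -> bool:
    """Reject root tokens and tokens that never expire or outlive the drill."""
    policies = token_data.get("policies", [])
    ttl = token_data.get("ttl", 0)
    if "root" in policies:
        return False
    # ttl == 0 means the token does not expire, which is never drill-scoped.
    return 0 < ttl <= max_ttl_seconds
```

Running this check before the orchestrator is constructed turns a policy violation into an early abort instead of a drill that passes with an over-privileged token.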

Production Implementation

The orchestrator validates its manifest with pydantic, applies the policy with the official Kubernetes Python client, mints credentials with hvac, runs a read-only validation connection with psycopg2, and always executes teardown. Every gating path resolves to an explicit POSIX exit code.

python

#!/usr/bin/env python3
"""Zero-trust DR sandbox orchestrator: fence, inject, validate, tear down."""
import json
import logging
import sys
from typing import Optional

import hvac  # HashiCorp Vault client for credential generation
import psycopg2
from kubernetes import client, config
from kubernetes.client.rest import ApiException
from pydantic import BaseModel, ValidationError

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | zt_orchestrator | %(message)s",
)

# Explicit, orchestrator-readable exit contract.
EXIT_OK = 0        # sandbox validated; safe to record a passing drill
EXIT_FAILED = 1    # validation ran and failed; quarantine the backup
EXIT_ABORT = 2     # misconfiguration; the drill never judged the backup


class DrillManifest(BaseModel):
    drill_id: str
    backup_snapshot: str
    db_engine: str
    db_host: str
    db_name: str
    target_port: int = 5432
    orchestrator_cidr: str
    vault_addr: str
    vault_token: str
    vault_role: str
    sandbox_ca: str
    ttl_seconds: int = 900


class ZeroTrustDrillOrchestrator:
    def __init__(self, manifest: DrillManifest):
        self.manifest = manifest
        self.namespace = f"dr-sandbox-{manifest.drill_id}"
        self.vault_client = hvac.Client(
            url=manifest.vault_addr, token=manifest.vault_token
        )
        config.load_incluster_config()
        self.k8s_net = client.NetworkingV1Api()
        self._lease_id: Optional[str] = None

    def apply_network_policy(self) -> None:
        policy = client.V1NetworkPolicy(
            api_version="networking.k8s.io/v1",
            kind="NetworkPolicy",
            metadata=client.V1ObjectMeta(
                name="dr-drill-isolation", namespace=self.namespace
            ),
            spec=client.V1NetworkPolicySpec(
                pod_selector=client.V1LabelSelector(
                    match_labels={"app": "restored-db"}
                ),
                policy_types=["Ingress", "Egress"],
                ingress=[
                    client.V1NetworkPolicyIngressRule(
                        _from=[
                            client.V1NetworkPolicyPeer(
                                pod_selector=client.V1LabelSelector(
                                    match_labels={"role": "validation-agent"}
                                )
                            )
                        ],
                        ports=[
                            client.V1NetworkPolicyPort(
                                protocol="TCP", port=self.manifest.target_port
                            )
                        ],
                    )
                ],
                egress=[],  # default-deny egress, matching the YAML manifest
            ),
        )
        self.k8s_net.create_namespaced_network_policy(
            namespace=self.namespace, body=policy
        )
        logging.info("Ephemeral NetworkPolicy applied to %s", self.namespace)

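    def inject_credentials(self) -> dict:
        # TTL is enforced on the Vault role (default_ttl/max_ttl), not per request;
        # generate_credentials() takes only the role name (and optional mount_point).
        response = self.vault_client.secrets.database.generate_credentials(
            name=self.manifest.vault_role
        )
        self._lease_id = response["lease_id"]
        creds = response["data"]
        # Log the TTL Vault actually granted, not the value the manifest asked for.
        granted_ttl = response.get("lease_duration", self.manifest.ttl_seconds)
        if granted_ttl > self.manifest.ttl_seconds:
            logging.warning(
                "Lease TTL %ds exceeds manifest ttl_seconds %ds",
                granted_ttl,
                self.manifest.ttl_seconds,
            )
        logging.info(
            "Short-lived credentials issued (lease %s, TTL %ds)",
            self._lease_id,
            granted_ttl,
        )
        return creds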

    def run_validation(self, creds: dict) -> bool:
        # Read-only session, TLS pinned to the sandbox CA, hard statement timeout.
        conn = psycopg2.connect(
            host=self.manifest.db_host,
            port=self.manifest.target_port,
            dbname=self.manifest.db_name,
            user=creds["username"],
            password=creds["password"],
            sslmode="verify-full",
            sslrootcert=self.manifest.sandbox_ca,
            connect_timeout=10,
            options="-c default_transaction_read_only=on -c statement_timeout=30000",
        )
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT count(*) FROM information_schema.tables "
                    "WHERE table_schema NOT IN ('pg_catalog', 'information_schema')"
                )
                table_count = cur.fetchone()[0]
            logging.info(
                "Read-only validation reachable; %d user tables visible", table_count
            )
            return table_count > 0
        finally:
            conn.close()

    def cleanup(self) -> None:
        # Each teardown step is isolated so one failure cannot skip the other,
        # and nothing raised here can mask the original validation error.
        try:
            self.k8s_net.delete_namespaced_network_policy(
                name="dr-drill-isolation", namespace=self.namespace
            )
            logging.info("NetworkPolicy removed; sandbox teardown initiated")
        except ApiException as exc:
            logging.warning("Policy cleanup skipped: %s", exc.reason)
        except Exception as exc:  # noqa: BLE001 - teardown must never raise
            logging.warning("Policy cleanup skipped: %s", exc)
        if self._lease_id:
            try:
                self.vault_client.sys.revoke_lease(lease_id=self._lease_id)
                logging.info("Credential lease %s revoked", self._lease_id)
            except Exception as exc:  # noqa: BLE001 - teardown must never raise
                logging.warning("Lease revocation skipped: %s", exc)


def execute_drill(manifest_path: str) -> int:
    try:
        with open(manifest_path, "r", encoding="utf-8") as handle:
            manifest = DrillManifest(**json.load(handle))
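    # OSError covers missing and unreadable files; TypeError covers JSON that
    # parses but is not an object (for example a top-level list).
    except (ValidationError, OSError, json.JSONDecodeError, TypeError) as exc: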
        logging.error("Invalid drill manifest: %s", exc)
        return EXIT_ABORT

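    try:
        # The constructor loads in-cluster config, so it belongs inside the
        # abort gate: running outside a pod is a misconfiguration, not a verdict.
        orchestrator = ZeroTrustDrillOrchestrator(manifest)
        orchestrator.apply_network_policy()
    except config.ConfigException as exc:
        logging.error("Kubernetes in-cluster config unavailable: %s", exc)
        return EXIT_ABORT
    except ApiException as exc:
        logging.error("NetworkPolicy creation failed: %s", exc.reason)
        return EXIT_ABORT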

    try:
        creds = orchestrator.inject_credentials()
        success = orchestrator.run_validation(creds)
    except Exception as exc:  # noqa: BLE001 - any failure fails the gate
        logging.error("Validation error under zero-trust boundary: %s", exc)
        success = False
    finally:
        orchestrator.cleanup()

    if success:
        logging.info("DR drill %s passed zero-trust validation", manifest.drill_id)
        return EXIT_OK
    logging.error("DR drill %s failed validation checks", manifest.drill_id)
    return EXIT_FAILED


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: zt_orchestrator.py <manifest.json>", file=sys.stderr)
        sys.exit(EXIT_ABORT)
    sys.exit(execute_drill(sys.argv[1]))

Step-by-Step Execution Walkthrough

Provision the sandbox namespace with the same automation that mounts the restored snapshot, labelling the database pod app: restored-db and the validator role: validation-agent so the policy selectors match.
Render the manifest (drill_manifest.json) from your secret store, injecting vault_token as a drill-scoped token and pointing sandbox_ca at the mounted CA bundle — never commit these values.

Run the orchestrator and capture its status:

bash

python3 zt_orchestrator.py drill_manifest.json; echo "exit=$?"

Observe the ordered log lines — policy applied, credentials issued, validation reachable, policy removed, lease revoked. The teardown lines appear even on a failing run.
Branch on the exit code. 0 records a passing drill and admits the backup, 1 quarantines the backup and escalates, 2 signals a misconfiguration to fix before re-running.
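
The manifest-rendering step can be sketched as follows. This is an illustration under stated assumptions, not the site's rendering tool: it assumes the template uses `${NAME}` placeholders (the form referenced in the troubleshooting table) and that values arrive as a mapping, for example from the environment; `render_manifest` is a hypothetical name.

```python
import json
import re
from typing import Any, Dict, Mapping

PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render_manifest(template: str, values: Mapping[str, str]) -> Dict[str, Any]:
    """Substitute ${NAME} placeholders, then refuse any that remain."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            return match.group(0)  # leave it in place so it is reported below
        # json.dumps escapes quotes and backslashes; strip its outer quotes
        # because the placeholder already sits inside a JSON string.
        return json.dumps(values[name])[1:-1]

    rendered = PLACEHOLDER.sub(substitute, template)
    leftover = sorted(set(PLACEHOLDER.findall(rendered)))
    if leftover:
        raise ValueError(f"unresolved placeholders: {leftover}")
    return json.loads(rendered)
```

Failing at render time keeps an unresolved `${VAULT_TOKEN}` from ever reaching the orchestrator, where it would otherwise surface as a confusing authentication error.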

Verification and Expected Output

A clean run emits the full lifecycle and exits 0:

text

2026-07-05 04:12:01 | INFO | zt_orchestrator | Ephemeral NetworkPolicy applied to dr-sandbox-2026-07-05-a
2026-07-05 04:12:02 | INFO | zt_orchestrator | Short-lived credentials issued (lease database/creds/dr-ro/9fbc, TTL 900s)
2026-07-05 04:12:03 | INFO | zt_orchestrator | Read-only validation reachable; 142 user tables visible
2026-07-05 04:12:03 | INFO | zt_orchestrator | NetworkPolicy removed; sandbox teardown initiated
2026-07-05 04:12:03 | INFO | zt_orchestrator | Credential lease database/creds/dr-ro/9fbc revoked
2026-07-05 04:12:03 | INFO | zt_orchestrator | DR drill 2026-07-05-a passed zero-trust validation

The exit code is the contract the orchestrator reads:

0 — the boundary held, credentials were scoped, and the restored instance answered read-only queries. Admit the backup.
1 — the boundary held but validation failed (unreachable, empty, or write-attempted). Quarantine the backup and escalate.
2 — malformed manifest or a policy the service account was not allowed to create. Fix the invocation; no backup was judged.

Because teardown runs in a finally block, a 1 or an exception still logs NetworkPolicy removed and lease ... revoked before the process exits.
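
A caller written in Python can encode the same contract explicitly. The sketch below is hypothetical: the action names are illustrative, and treating any unknown status as an abort is an assumption chosen to match the wrapper's catch-all branch.

```python
from enum import IntEnum


class DrillExit(IntEnum):
    OK = 0       # boundary held and validation passed
    FAILED = 1   # validation ran and failed
    ABORT = 2    # misconfiguration; no backup was judged


ACTIONS = {
    DrillExit.OK: "admit",
    DrillExit.FAILED: "quarantine",
    DrillExit.ABORT: "fix-invocation",
}


def next_action(returncode: int) -> str:
    """Map a process exit status onto the action the scheduler should take."""
    try:
        code = DrillExit(returncode)
    except ValueError:
        # Unknown statuses (signals, interpreter crashes) are treated as an
        # abort so that an unjudged backup is never quarantined or admitted.
        code = DrillExit.ABORT
    return ACTIONS[code]
```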

Failure Modes and Troubleshooting

Symptom	Cause	Remediation
`ApiException: 403 Forbidden` on policy create	Service account lacks `create` on `networkpolicies` in the sandbox namespace	Bind a namespaced Role granting `networkpolicies` verbs; never widen it to production namespaces
Validation hangs then `connect_timeout` fires	`NetworkPolicy` ingress selector does not match the validator pod labels	Confirm the validator carries `role: validation-agent` and the DB pod carries `app: restored-db`
`psycopg2.errors.ReadOnlySqlTransaction`	A validation query attempted a write under the read-only session	Keep validation strictly `SELECT`; the read-only option is a safety net, not a licence to mutate
`hvac.exceptions.InvalidRequest: lease is not renewable`	Role `max_ttl` shorter than the validation window	Raise the Vault role `max_ttl` above the longest expected validation, or shorten the checks
Credential works after teardown	Lease revocation was skipped and the DB caches the grant	Confirm `revoke_lease` ran in the logs; check that the Vault role's `revocation_statements` drop the user and end its open sessions
Exit `2` with `JSONDecodeError`	Malformed manifest or unresolved `${VAULT_*}` placeholder	Render secrets before invocation; validate the manifest JSON in CI before deploy

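Latency introduced by policy generation and credential issuance must stay within the drill’s recovery budget: cache the policy template, pre-warm the Vault connection, and, where the budget is tight, issue credentials asynchronously during the restore phase rather than after it. Early issuance starts the lease TTL before validation begins, so the role's TTL must then cover the remaining restore time as well as the validation window.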
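
Boundary overhead is easier to hold inside the budget when it is measured per step. The sketch below is a hypothetical helper, not part of the orchestrator: `StepTimer` wraps each step in a timing context and accepts an injectable clock so it can be tested deterministically.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class StepTimer:
    """Record how long each boundary step takes against a latency budget."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def step(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even when the step raises, so failures are timed too.
            self.durations[name] = self._clock() - start

    def overhead(self) -> float:
        return sum(self.durations.values())

    def within_budget(self, budget_seconds: float) -> bool:
        return self.overhead() <= budget_seconds
```

Wrapping the policy and credential calls in `with timer.step("policy"):` and `with timer.step("credentials"):` gives a per-drill overhead figure that can be compared with the recovery budget.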

Integration Notes

The orchestrator is headless and returns strict POSIX codes, so a thin wrapper turns its verdict into a promotion decision and an alert:

bash

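#!/usr/bin/env bash
set -euo pipefail

# Capture the status explicitly: under `set -e` a bare non-zero exit would
# terminate the script before the case statement could branch on it.
rc=0
python3 zt_orchestrator.py drill_manifest.json || rc=$?
case "$rc" in
  0) echo "[$(date -u)] boundary held and validation passed; admitting backup" ;;
  1) curl -s -X POST "$PAGERDUTY_WEBHOOK" -d '{"event":"dr_boundary_validation_failed"}' \
       || echo "[$(date -u)] alert delivery failed" >&2
     exit 1 ;;
  *) echo "[$(date -u)] orchestrator misconfigured; aborting drill"; exit 2 ;;
esac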

Wire the wrapper into whichever scheduler owns the drill:

Airflow — invoke from a BashOperator (or a PythonOperator that shells out and inspects returncode); a non-zero exit fails the task and short-circuits the downstream promotion task, keeping the DAG run history as the audit trail.
Celery — wrap the call in a task that raises on non-zero so the broker records the failure and event-driven drills (fired when a fresh snapshot lands) dispatch with low latency.
cron / systemd — schedule the wrapper directly; strict POSIX codes let OnFailure handlers route quarantine alerts with no extra glue.
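
For schedulers that call Python rather than a shell, the "raise on non-zero" pattern might look like the sketch below. It uses only the standard library; `DrillFailed`, `run_drill_command`, and the 30-minute timeout are hypothetical, and the function would be called from inside whatever task or operator body your scheduler provides.

```python
import subprocess
from typing import Sequence


class DrillFailed(RuntimeError):
    """Raised when the orchestrator exits non-zero; carries the status."""

    def __init__(self, returncode: int) -> None:
        super().__init__(f"drill exited with status {returncode}")
        self.returncode = returncode


def run_drill_command(argv: Sequence[str], timeout: float = 1800.0) -> None:
    """Run the orchestrator and raise so the scheduler marks the task failed."""
    completed = subprocess.run(list(argv), check=False, timeout=timeout)
    if completed.returncode != 0:
        raise DrillFailed(completed.returncode)
```

A task body would call `run_drill_command(["python3", "zt_orchestrator.py", "drill_manifest.json"])` and let the exception propagate. Keeping the status on the exception lets a failure handler distinguish quarantine (1) from misconfiguration (2).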

Feed each drill’s boundary result into the broader Automated Backup Integrity Check Implementation audit store so every promotion decision carries immutable evidence of which perimeter was asserted, which lease was issued, and when it was revoked. Align continuous policy evaluation and cryptographic attestation with the NIST SP 800-207 Zero Trust Architecture guidance so the boundary remains enforced throughout the validation window.
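
What a single evidence record could contain is sketched below. The field names and the SHA-256 digest are assumptions for illustration; the audit store's real schema is not described here. The digest makes later tampering detectable but does not by itself make the store immutable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BoundaryEvidence:
    """One drill's boundary result, frozen so it cannot be edited in memory."""

    drill_id: str
    namespace: str
    lease_id: str
    exit_code: int
    revoked_at: str  # ISO-8601 UTC timestamp of lease revocation

    def digest(self) -> str:
        """SHA-256 over a canonical JSON encoding of the record."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the digest alongside the record lets a later audit recompute it and confirm that the perimeter, lease, and revocation time attached to a promotion decision were not altered.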

Frequently Asked Questions

Why apply a default-deny NetworkPolicy instead of relying on namespace isolation?

A Kubernetes namespace is an organizational boundary, not a network boundary — by default any pod can reach any other pod cluster-wide. Zero trust requires that the restored database accept traffic only from the validation agent and initiate no outbound connections at all. The empty egress list is what prevents a compromised or misconfigured restore from reaching a production endpoint during the drill, which namespace isolation alone does not stop.

Why mint TTL-bound credentials per drill rather than reuse a read-only account?

A standing read-only account is a long-lived secret that survives every drill, leaks across iterations, and must be rotated by hand. A Vault-issued lease is created after the restore, bound to the sandbox, and revoked at teardown, so no credential outlives the validation window. That satisfies least privilege and gives the audit trail a concrete lease id for every promotion decision.

What guarantees teardown runs if validation crashes?

The validation and credential-injection block is wrapped in try/finally, and cleanup() runs in the finally clause. Whether validation returns false or raises, the NetworkPolicy is deleted and the Vault lease is revoked before the process exits, so a crashed drill never leaves an open path or a live credential behind. cleanup() itself swallows its own errors so teardown can never mask the original failure.

What causes exit code 2 versus exit code 1?

Exit 2 is a pipeline-abort condition: a malformed manifest, unreadable file, or a NetworkPolicy the service account was not permitted to create — the drill never judged the backup. Exit 1 is a validation verdict: the boundary was asserted, credentials were scoped, and the restored instance still failed the read-only checks. Orchestrators treat 2 as "fix the invocation" and 1 as "quarantine the backup and escalate."

Security Boundaries for DR Environments — the parent reference this orchestrator enforces at runtime.
Automating sandbox provisioning with Terraform — how the disposable namespace and labelled pods this policy targets get created.
Fallback chain design for Kubernetes clusters — what the orchestrator does when a fenced drill fails validation.
Python script for MySQL checksum validation — the integrity gate that runs inside this boundary before a backup is admitted.

This orchestrator is one component of the broader Security Boundaries for DR Environments reference within Core DR Architecture & Validation Fundamentals.