Sandbox Provisioning Automation

A restore drill is only trustworthy if the environment it runs in is disposable, isolated, and reconstructed from code on every execution — a hand-built staging box that lingers between drills accumulates drift and quietly invalidates the recovery metrics measured against it. Sandbox provisioning automation is the phase that stands up that disposable environment: it ingests a versioned drill manifest, materializes compute, network, and storage boundaries that mirror production topology without inheriting production identity, attaches restored backup volumes, and tags every resource for automatic garbage collection. Within the broader Restore Drill Orchestration & Environment Isolation framework, this is the foundational execution layer that every later phase depends on — nothing can be validated until there is a segregated environment to validate it in.

The provisioning contract is precise: reconstruct enough of production for validation to be meaningful, inherit none of production’s blast radius, and leave zero residue when the drill ends. That contract binds this phase to the ones around it. The remaining recovery budget derived from RTO/RPO mapping caps how much topology it is worth reconstructing; a corrupt artifact caught by the upstream checksum validation pipeline must never reach the provisioner at all; and any provisioning failure is mapped onto the shared error categorization framework so a network-policy rejection and a volume-attach timeout escalate through the same severity contract. For DBAs, SREs, and Python automation engineers, this phase is where “we have a backup” becomes “we have a running, isolated, query-ready instance of that backup.”

Architecture and Execution Workflow

Figure. Ephemeral sandbox lifecycle from manifest ingestion and state locking through provisioning, data restore, recovery targeting, validation, and reverse-order teardown.

Provisioning is implemented as a state machine over a session manifest, not as a linear shell script. When the orchestrator triggers a drill, the provisioning subsystem ingests a manifest that pins the backup artifact identity, the target recovery coordinate, the topology to reconstruct, compute sizing, storage-class mappings, and the lifecycle tags that will later drive teardown. State is versioned and locked per session so two concurrent drills against the same artifact allocate distinct sandboxes and never race on shared infrastructure state. Each phase below is idempotent and emits a structured transition record, so a crashed provision resumes at the last committed edge — reattaching an already-created VPC rather than leaking a second one — instead of restarting against a dirty account. The phases decompose into independent engineering concerns: manifest resolution and locking, infrastructure materialization, volume attachment and restore, endpoint registration and cache warming, and teardown.
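
The resume-at-the-last-committed-edge behavior described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: the phase names in `PHASES`, the `Journal` class, and `run_phases` are hypothetical, and the in-memory journal stands in for whatever locked, versioned state store a real deployment uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical phase names mirroring the phases listed above.
PHASES = ["resolve_manifest", "materialize", "attach_restore",
          "register_endpoints", "teardown"]


@dataclass
class Journal:
    """In-memory stand-in for the session's committed transition records."""
    committed: List[str] = field(default_factory=list)

    def commit(self, phase: str) -> None:
        if phase not in self.committed:      # committing twice is a no-op
            self.committed.append(phase)


def run_phases(journal: Journal,
               handlers: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every phase not yet committed, in order; return the phases executed."""
    executed = []
    for phase in PHASES:
        if phase in journal.committed:
            continue                          # resume past the last committed edge
        handlers[phase]()                     # may raise; nothing is committed then
        journal.commit(phase)
        executed.append(phase)
    return executed
```

Because a phase is committed only after its handler returns, a crash inside a handler leaves the journal pointing at the previous phase, and the next run re-enters exactly there. That only works if each handler is itself idempotent, as the text requires.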

Phase-by-Phase Breakdown

Manifest Resolution and State Locking

The entry phase resolves the drill manifest and acquires an exclusive lock on the session’s state file before a single resource is created. The manifest is treated as immutable for the lifetime of the session, which is what makes the whole run replayable: the same manifest always resolves to the same topology and the same recovery coordinate. Locking is mandatory because provisioning is not atomic — a partially provisioned sandbox that another worker also tries to provision produces orphaned VPCs and duplicate volumes that survive teardown and bill indefinitely. Backends such as an S3-plus-DynamoDB state lock or a Terraform Cloud workspace enforce a single writer per session key; a failure to acquire the lock is a fast, explicit abort, not a retry loop that races the existing holder.
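
One way to make manifest immutability checkable is to pin a content digest when the lock is acquired and compare it before each phase. The sketch below assumes the manifest is plain JSON-serializable data; `manifest_digest` and `assert_unchanged` are hypothetical helper names, not part of any library.

```python
import hashlib
import json
from typing import Any, Mapping


def manifest_digest(manifest: Mapping[str, Any]) -> str:
    """Return a stable SHA-256 digest of a manifest, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(manifest: Mapping[str, Any], pinned_digest: str) -> None:
    """Abort if the manifest no longer matches the digest pinned at lock time."""
    if manifest_digest(manifest) != pinned_digest:
        raise ValueError("manifest changed after the session was locked")
```

Sorting keys before hashing means two manifests that differ only in field order produce the same digest, so the digest can also serve as a replay key for the session.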

Isolated Infrastructure Materialization

The second phase materializes compute and network boundaries from declarative templates that abstract cloud-provider specifics. Modular infrastructure-as-code parameterizes the VPC or namespace, subnet layout, security-group egress rules, and database instance class so the same module reconstructs a Postgres, MySQL, or document-store topology by configuration alone. Isolation is enforced structurally here, before any data lands: a dedicated VPC with no peering to production, security groups that deny egress to production endpoints by default, and IAM roles scoped exclusively to the session and revoked at teardown. Lifecycle guards such as prevent_destroy on any accidentally referenced production resource make Terraform reject a plan that would destroy it. The guard does not block in-place updates, so it is the session-scoped IAM role, not the lifecycle block, that keeps a misconfigured module from mutating live state during a restore; the official Terraform resource lifecycle documentation specifies the exact semantics. The security boundaries for DR environments reference defines the network-policy and key-propagation invariants this phase must satisfy before it is allowed to advance.
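
The default-deny-egress invariant can be checked before any resource is created. The following is a small sketch using the standard `ipaddress` module; the function name `egress_violations` and the idea of feeding it the planned egress CIDRs are assumptions for illustration, and the production CIDR list would come from wherever the organization records it.

```python
import ipaddress
from typing import Iterable, List


def egress_violations(allowed_egress: Iterable[str],
                      production_cidrs: Iterable[str]) -> List[str]:
    """Return every allowed egress CIDR that overlaps a production CIDR."""
    prod = [ipaddress.ip_network(c) for c in production_cidrs]
    violations = []
    for cidr in allowed_egress:
        net = ipaddress.ip_network(cidr)
        # Only compare networks of the same IP version.
        if any(net.version == p.version and net.overlaps(p) for p in prod):
            violations.append(cidr)
    return violations
```

A non-empty result would be treated as an isolation breach and abort the phase. Note that a wildcard rule such as `0.0.0.0/0` overlaps every IPv4 production range and is therefore always reported.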

Volume Attachment and Data Restore

Once the infrastructure reaches a ready state, the pipeline attaches backup artifacts and restores data onto the provisioned instances. This stage coordinates with backup retention systems and transaction-log archives: snapshots are attached read-only where possible, and restore proceeds against the ephemeral instance rather than any shared volume. The restore target is rarely a static snapshot — realistic drills demand a specific failure window — so this phase hands the resolved coordinate to point-in-time recovery targeting, which replays transaction logs up to the exact timestamp the manifest specifies. Attachment sequencing matters: data volumes must be bound and mounted before the engine process starts, or the instance boots against an empty data directory and the restore silently produces a hollow sandbox.
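
The attachment-sequencing rule can be enforced with a guard that refuses to start the engine against an empty data directory. This is a sketch under stated assumptions: `data_dir_ready` and `start_engine_guarded` are hypothetical names, the `start` callable stands in for whatever launches the engine, and the marker file is engine-specific (PostgreSQL, for example, keeps a `PG_VERSION` file at the root of its data directory).

```python
from pathlib import Path
from typing import Callable


def data_dir_ready(data_dir: Path, marker: str) -> bool:
    """True only if the directory exists and contains the engine's marker file."""
    return data_dir.is_dir() and (data_dir / marker).is_file()


def start_engine_guarded(data_dir: Path, marker: str,
                         start: Callable[[], None]) -> None:
    """Call start() only after the restored data directory is verifiably populated."""
    if not data_dir_ready(data_dir, marker):
        raise RuntimeError(
            f"refusing to start engine: {data_dir} is empty or unmounted")
    start()
```

Failing loudly here converts the silent hollow-sandbox case into an explicit provisioning error that the orchestrator can classify and escalate.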

Endpoint Registration and Cache Warming

With the engine running against restored data, the phase registers the sandbox’s endpoints into the drill’s routing registry and pre-warms caches so validation is not measuring cold-start latency. Endpoint addresses are extracted from infrastructure state outputs and published for the smoke-test routing logic layer to consume, which is what confines synthetic traffic to the sandbox instead of leaking it into production ingress. Cache pre-warming replays a representative set of index scans and query plans immediately after restore, so the functional validation that follows reflects steady-state behavior rather than an artificially cold instance. If the primary sandbox degrades during this window, the pre-declared fallback chain configuration redirects the drill onto a secondary compute pool without operator intervention.
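
Endpoint extraction from state outputs might look like the sketch below. It assumes the outputs arrive in the shape produced by `terraform output -json`, where each output is an object with a `value` field; the function name `sandbox_endpoints`, the host names, and the production-suffix check are illustrative assumptions rather than part of the framework.

```python
import json
from typing import Dict, Iterable


def sandbox_endpoints(outputs_json: str,
                      production_suffixes: Iterable[str]) -> Dict[str, str]:
    """Extract string-valued outputs and refuse any that point at production."""
    outputs = json.loads(outputs_json)
    endpoints = {name: body["value"] for name, body in outputs.items()
                 if isinstance(body.get("value"), str)}
    suffixes = tuple(production_suffixes)
    for name, address in endpoints.items():
        host = address.split(":")[0]          # drop an optional :port
        if host.endswith(suffixes):
            raise ValueError(
                f"output {name!r} resolves to production host {host}")
    return endpoints
```

Rejecting a production-looking address at registration time gives the routing layer a second line of defense in addition to the network-level egress rules.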

Reverse-Dependency Teardown

The terminal phase destroys the sandbox in reverse dependency order the instant validation completes or the drill times out. Endpoints are deregistered, engine processes are stopped, restored volumes are detached and deleted, compute instances are terminated, and finally the network and IAM artifacts are torn down — the exact inverse of creation order, so nothing is destroyed while a still-live resource depends on it. Teardown is driven by the session’s lifecycle tags and is itself idempotent: a second teardown pass over an already-clean session is a no-op, which lets a garbage-collection sweep safely reap sessions whose orchestrator crashed mid-drill. This is what guarantees zero residual cost and a clean audit boundary between drill sessions.
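
When the dependency graph is not a simple chain, the teardown order can be derived rather than hard-coded. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+); the function name `teardown_order` and the resource names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def teardown_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return resources ordered so each is destroyed before what it depends on.

    depends_on maps a resource to the set of resources it requires.
    """
    # static_order() yields dependencies first, which is creation order.
    creation_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(creation_order))
```

For a graph such as `{"network": set(), "instance": {"network"}, "volume": {"instance"}, "endpoint": {"volume"}}` this yields endpoint, volume, instance, network, matching the inverse-of-creation sequence described above. A cyclic graph raises `graphlib.CycleError`, which is a useful early signal that the manifest's topology is malformed.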

Python Implementation Patterns

Python models this cleanly: the abc module expresses each cloud target as a pluggable provisioner behind one interface, a dataclass carries the immutable manifest, concurrent.futures parallelizes the independent volume attachments, and strict POSIX exit codes let the whole provision gate a shell-driven DR runbook. The orchestrator below is complete and runnable: it reads a manifest JSON, acquires a session lock, provisions the sandbox, attaches volumes concurrently with bounded exponential backoff, records the resources it created for teardown, and returns an explicit exit code. Provisioning specifics are behind the Provisioner interface so the same control flow drives AWS, GCP, or a Kubernetes backend unchanged.

python

#!/usr/bin/env python3
"""Provision a disposable restore-drill sandbox from a session manifest.

Exit codes (consumed by the DR drill orchestrator):
    0  sandbox provisioned and endpoints registered -> proceed to validation
    1  provisioning failed -> tear down partial resources and escalate
    2  usage / configuration error (bad manifest, lock unavailable) -> abort
"""
from __future__ import annotations

import itertools
import json
import os
import sys
import time
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List


@dataclass(frozen=True)
class DrillManifest:
    """Immutable, versioned description of one drill session."""

    session_id: str
    artifact_id: str
    recovery_target: str          # ISO-8601 point-in-time coordinate
    instance_class: str
    volume_ids: List[str]
    lifecycle_tags: Dict[str, str] = field(default_factory=dict)


class ProvisioningError(RuntimeError):
    """Raised when a cloud operation fails after exhausting retries."""


class Provisioner(ABC):
    """Uniform interface so the orchestrator can target any backend."""

    @abstractmethod
    def create_network(self, m: DrillManifest) -> str:
        """Create an isolated, default-deny-egress network; return its id."""

    @abstractmethod
    def create_instance(self, m: DrillManifest, network_id: str) -> str:
        """Launch a session-scoped compute instance; return its id."""

    @abstractmethod
    def attach_volume(self, instance_id: str, volume_id: str) -> str:
        """Attach one restored backup volume; return the attachment id."""

    @abstractmethod
    def destroy(self, resource_id: str) -> None:
        """Idempotently destroy a single resource by id."""


def with_backoff(fn, *args, attempts: int = 5, base: float = 0.5) -> Any:
    """Retry a cloud call with bounded exponential backoff for rate limits."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except ProvisioningError:
            if attempt == attempts - 1:
                raise
            time.sleep(base * (2 ** attempt))
    raise ProvisioningError("unreachable")  # pragma: no cover


class SessionLock:
    """Exclusive per-session lock; a held lock is a hard abort, not a retry."""

    def __init__(self, session_id: str, root: Path) -> None:
        self._path = root / f"{session_id}.lock"

    def __enter__(self) -> "SessionLock":
        try:
            # O_CREAT | O_EXCL fails atomically if another worker holds it.
            fd = os.open(str(self._path), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
        except FileExistsError as exc:
            raise ProvisioningError(f"session already locked: {self._path}") from exc
        return self

    def __exit__(self, *exc: Any) -> None:
        self._path.unlink(missing_ok=True)


def provision(m: DrillManifest, p: Provisioner) -> Dict[str, Any]:
    """Materialize the sandbox and return the created-resource record."""
    created: List[str] = []
    try:
        network_id = with_backoff(p.create_network, m)
        created.append(network_id)

        instance_id = with_backoff(p.create_instance, m, network_id)
        created.append(instance_id)

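        # Volume attachments are independent -> parallelize them, but bound
        # concurrency so a large manifest cannot saturate the cloud API.
        attachments: List[str] = []
        first_error = None
        with ThreadPoolExecutor(max_workers=min(8, len(m.volume_ids) or 1)) as pool:
            futures = {pool.submit(with_backoff, p.attach_volume, instance_id, v): v
                       for v in m.volume_ids}
            for fut in as_completed(futures):
                try:
                    attachment_id = fut.result()
                except ProvisioningError as exc:
                    # Keep draining: attachments that still succeed must be
                    # recorded, or the rollback below would leak them.
                    first_error = first_error or exc
                    continue
                attachments.append(attachment_id)
                created.append(attachment_id)  # record as soon as it exists
        if first_error is not None:
            raise first_error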

        return {"session_id": m.session_id, "network_id": network_id,
                "instance_id": instance_id, "attachments": attachments,
                "created": created}
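    except ProvisioningError:
        # Reverse-order teardown of whatever was created before re-raising,
        # so a failed provision never leaks billable resources.
        for resource_id in reversed(created):
            try:
                p.destroy(resource_id)
            except ProvisioningError as leak:
                # Do not mask the original failure, but never hide a leak:
                # surface the id so a garbage-collection sweep can reap it.
                print(f"teardown failed, leaked {resource_id}: {leak}",
                      file=sys.stderr)
        raise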


class LocalProvisioner(Provisioner):
    """Reference backend that simulates a cloud, so the script runs as-is.

    A production deployment swaps this for a boto3, google-cloud, or
    kubernetes-client implementation; the orchestrator control flow is
    identical because it only depends on the Provisioner interface.
    """

    def __init__(self) -> None:
        self._ids = itertools.count(1)

    def create_network(self, m: DrillManifest) -> str:
        return f"net-{m.session_id}-{next(self._ids)}"

    def create_instance(self, m: DrillManifest, network_id: str) -> str:
        return f"inst-{m.session_id}-{next(self._ids)}"

    def attach_volume(self, instance_id: str, volume_id: str) -> str:
        return f"att-{instance_id}-{volume_id}"

    def destroy(self, resource_id: str) -> None:
        # Idempotent by construction: destroying an unknown id is a no-op.
        return None


def load_manifest(path: Path) -> DrillManifest:
    raw = json.loads(path.read_text())
    return DrillManifest(
        session_id=raw["session_id"],
        artifact_id=raw["artifact_id"],
        recovery_target=raw["recovery_target"],
        instance_class=raw["instance_class"],
        volume_ids=list(raw["volume_ids"]),
        lifecycle_tags=dict(raw.get("lifecycle_tags", {})),
    )


def main(provisioner: Provisioner) -> int:
    if len(sys.argv) != 3:
        print("usage: provision_sandbox.py <manifest.json> <lock-dir>",
              file=sys.stderr)
        return 2
    try:
        manifest = load_manifest(Path(sys.argv[1]))
        lock_dir = Path(sys.argv[2])
        lock_dir.mkdir(parents=True, exist_ok=True)
    except (OSError, KeyError, ValueError, json.JSONDecodeError) as exc:
        print(f"config error: {exc}", file=sys.stderr)
        return 2

    try:
        with SessionLock(manifest.session_id, lock_dir):
            record = provision(manifest, provisioner)
    except ProvisioningError as exc:
        # A held lock is a configuration/timing abort; a real cloud failure
        # is a provisioning failure. Both are non-zero, distinguished here.
        code = 2 if "already locked" in str(exc) else 1
        print(f"provisioning error: {exc}", file=sys.stderr)
        return code

    Path(f"sandbox-{manifest.session_id}.json").write_text(json.dumps(record))
    print(f"PROVISIONED {manifest.session_id} "
          f"instance={record['instance_id']} volumes={len(record['attachments'])}")
    return 0


if __name__ == "__main__":
    # Swap LocalProvisioner for a boto3/google-cloud/kubernetes backend in
    # production; the control flow above never changes.
    raise SystemExit(main(provisioner=LocalProvisioner()))

On failure the orchestrator rolls back what it has recorded: any ProvisioningError triggers a reverse-order teardown of every resource id in the created list before the exit code propagates, so a mid-provision API outage does not strand a half-built sandbox. Only ProvisioningError is handled, so a concrete backend should wrap its SDK exceptions in that type; anything else bypasses the rollback and is left to the tag-driven garbage-collection sweep. A new backend, such as a Kubernetes namespace instead of a VPC, is added by implementing one Provisioner subclass and injecting it at the entrypoint, without touching the locking, backoff, or teardown logic. The injected Provisioner is the single seam where the concrete cloud SDK is bound.

Integration with DR Drill Orchestration

Provisioning sits at the head of the drill chain and both consumes and produces gates. Upstream, the checksum validation pipeline must exit 0 before the provisioner runs, because standing up an isolated instance and restoring a corrupt artifact wastes the entire provisioning budget on a sandbox that can never validate. The topology depth the provisioner reconstructs is bounded by the remaining budget from RTO/RPO mapping, and which validation model will run inside the sandbox determines whether a single node suffices or the full replica topology is required.

Downstream, the sandbox record this phase persists is the input every later phase reads: point-in-time recovery targeting restores onto the instance it created, smoke-test routing logic resolves the endpoints it registered, and fallback chain configuration reroutes to a secondary pool if the primary sandbox degrades under load. A provisioning failure short-circuits the chain: the orchestrator tears down partial resources, marks the drill unrunnable, and escalates with the failing resource id attached so operators see exactly which materialization step broke.
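
The exit-code gating between phases can be sketched as a short-circuiting chain. This is an illustrative outline only: `run_chain` is a hypothetical helper, and the gate names and command lines passed to it are placeholders for the real checksum, provisioning, and validation entrypoints.

```python
import subprocess
from typing import List, Sequence, Tuple


def run_chain(gates: Sequence[Tuple[str, List[str]]]) -> Tuple[int, str]:
    """Run gates in order; stop at the first non-zero exit and report which one.

    Returns (0, "") when every gate passes, else (exit_code, failing_gate_name).
    """
    for name, argv in gates:
        code = subprocess.run(argv).returncode
        if code != 0:
            return code, name          # short-circuit: later gates never run
    return 0, ""
```

Returning the failing gate's name alongside its code lets the orchestrator attach the broken step to the escalation, and keeps the 0/1/2 contract of the provisioning script intact for whatever consumes the chain.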

Error Classification and Threshold Management

A provisioning failure is not automatically a page. Severity depends on which step failed and whether it threatens isolation or merely wastes a drill cycle. A security-group or IAM misconfiguration that could expose production is an unconditional escalation because it breaches the isolation contract, whereas a transient API rate limit that the backoff loop absorbs is recorded as telemetry only and never raised as an alert. Classification happens after the exit code is resolved, so the deterministic provisioning core stays free of alerting policy and tolerance windows can be retuned without touching provisioning code. Failures map onto the shared error categorization framework so every phase reports through one severity contract.

Tier	Trigger condition	Tolerance	Orchestrator action
`CRITICAL`	Isolation breach: egress rule, IAM scope, or peering that exposes production; leaked resources after teardown	Zero	Halt all drills, quarantine session, page on-call
`WARNING`	Volume-attach or instance-launch failure after retries; lock contention on a stale session	Bounded retries	Retry once on a fresh session id, annotate audit trail, raise ticket
`INFO`	Transient rate-limit absorbed by backoff; budget-forced single-node topology downgrade	Unbounded	Record only, feed capacity-planning trend

Tolerance is expressed per failure class, not as a single global count: a stuck-lock abort must never be retried against the same session id (it races the holder), while a rate-limited attach is retried automatically inside the backoff loop. Encoding tolerance per class keeps the classifier stable as new backends are added.
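
Per-class tolerance can be encoded as a lookup table kept outside the provisioning core. The sketch below follows the tiers in the table above; the failure-class keys, the `Policy` fields, and the choice to treat unknown classes as CRITICAL are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRITICAL = "CRITICAL"
    WARNING = "WARNING"
    INFO = "INFO"


@dataclass(frozen=True)
class Policy:
    tier: Tier
    retry_same_session: bool     # may the same session id be retried?
    retry_fresh_session: int     # how many retries on a new session id


# Hypothetical failure-class keys; tiers and tolerances follow the table above.
POLICIES = {
    "isolation_breach":   Policy(Tier.CRITICAL, False, 0),
    "leaked_resources":   Policy(Tier.CRITICAL, False, 0),
    "volume_attach":      Policy(Tier.WARNING, False, 1),
    "instance_launch":    Policy(Tier.WARNING, False, 1),
    "lock_contention":    Policy(Tier.WARNING, False, 1),
    "rate_limited":       Policy(Tier.INFO, True, 0),   # retried inside backoff
    "topology_downgrade": Policy(Tier.INFO, False, 0),
}


def classify(failure_class: str) -> Policy:
    """Unknown classes fail closed to CRITICAL rather than being ignored."""
    return POLICIES.get(failure_class, Policy(Tier.CRITICAL, False, 0))
```

Adding a backend then means adding rows to the table, not branches to the classifier, which is what keeps it stable over time.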

Telemetry and Compliance Output

Every provision emits structured telemetry so provisioning latency, leak rate, and fallback frequency are visible as trends rather than discovered during an incident. Metrics are exported through Prometheus-compatible endpoints and feed both capacity planning and regulatory evidence.

Metric	Type	Purpose
`dr_sandbox_provision_duration_seconds`	Histogram	Wall-clock provisioning cost against the recovery budget
`dr_sandbox_provision_total`	Counter	Provisions by outcome (success, failed, aborted) and backend
`dr_sandbox_leaked_resources_total`	Counter	Resources surviving teardown — the isolation-integrity signal
`dr_sandbox_volume_attach_seconds`	Histogram	Per-volume attach latency, for restore-throughput planning
`dr_sandbox_fallback_total`	Counter	Drills rerouted to a secondary pool during provisioning
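
To show what the exported counters look like on the wire, here is a dependency-free sketch that renders one labelled counter in the Prometheus text exposition format. `CounterMetric` is a hypothetical class written for this example; a real deployment would more likely use an established client library, and this sketch does not escape special characters in label values.

```python
from collections import Counter


class CounterMetric:
    """Tiny labelled counter that renders Prometheus text exposition format."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Counter = Counter()

    def inc(self, **labels: str) -> None:
        # Sort labels so the same label set always maps to the same series.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self) -> str:
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for labels, value in sorted(self._values.items()):
            rendered = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{self.name}{{{rendered}}} {value}")
        return "\n".join(lines) + "\n"
```

Calling `inc(outcome="success", backend="local")` twice and rendering produces a series line of the form `dr_sandbox_provision_total{backend="local",outcome="success"} 2`, which is the shape a scraper expects.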

The audit trail is written to write-once, append-only storage and signed, capturing the manifest version, the resources created, the isolation policy applied, and the teardown result. Promotion decisions and post-incident reviews read these records, so they must be tamper-evident: any alteration after the fact has to be detectable. This aligns provisioning output with the demonstrable-and-repeatable evidence expectations of frameworks such as NIST SP 800-34 Rev. 1, SOC 2, and ISO 22301, which require proof that recovery was verified inside a controlled environment, not merely asserted.
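
A signed, append-only trail can be approximated with a hash chain in which each record's signature covers the previous one. The sketch below uses the standard `hmac` and `hashlib` modules; the record layout, the `GENESIS` constant, and the function names are assumptions for illustration, and key management is out of scope.

```python
import hashlib
import hmac
import json
from typing import Any, Dict, List

GENESIS = "0" * 64   # placeholder "previous signature" for the first record


def _sign(prev: str, payload: Dict[str, Any], key: bytes) -> str:
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def append_record(log: List[Dict[str, Any]], payload: Dict[str, Any],
                  key: bytes) -> None:
    """Append a record whose signature covers the payload and the previous signature."""
    prev = log[-1]["sig"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "sig": _sign(prev, payload, key)})


def verify(log: List[Dict[str, Any]], key: bytes) -> bool:
    """Recompute every signature; an edit, mid-chain removal, or reorder fails."""
    prev = GENESIS
    for rec in log:
        expected = _sign(prev, rec["payload"], key)
        if rec["prev"] != prev or not hmac.compare_digest(rec["sig"], expected):
            return False
        prev = rec["sig"]
    return True
```

The chain alone does not detect truncation of the most recent records, which is one reason the text pairs signing with write-once storage.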

Operational Best Practices

Lock before you create. Acquire an exclusive session lock before any resource is provisioned; a held lock is a hard abort with exit code 2, never a retry that races the existing holder into duplicate VPCs.
Isolate structurally, before data lands. Enforce a dedicated network with default-deny egress and session-scoped IAM at materialization time, so no restore ever runs against a boundary that could reach production.
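Guard production references. Apply prevent_destroy and read-only mounts to any production resource a module can reach, so a misconfigured template cannot destroy or overwrite live state during a restore. Because prevent_destroy only rejects plans that would destroy a resource and does not block in-place updates, pair it with session-scoped IAM that has no write access to production.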
Attach volumes before the engine starts. Sequence data-volume attachment ahead of the database process, or the instance boots against an empty data directory and produces a hollow sandbox that validates nothing.
Tear down on failure, not just success. Wire reverse-order teardown into the failure path so a mid-provision outage never strands billable resources; treat any dr_sandbox_leaked_resources_total increment as a critical isolation defect.
Make every phase idempotent. Key resources by session id and make teardown a no-op on already-clean sessions, so a garbage-collection sweep can safely reap sandboxes whose orchestrator crashed.

By treating the sandbox as code that is rebuilt and destroyed on every drill, teams eliminate environment drift as a source of false recovery confidence. Provisioning automation turns “we think staging looks like production” into a reconstructed, isolated, disposable environment whose fidelity is defined by a version-controlled manifest.

Frequently Asked Questions

Why rebuild the sandbox on every drill instead of keeping a warm environment?

A persistent environment accumulates configuration drift, stale data, and manual patches between drills, so the recovery metrics measured against it no longer reflect a clean restore. Rebuilding from a versioned manifest guarantees the sandbox is defined entirely by code, resolves to the same topology and recovery coordinate on every run, and carries no residue from previous drills. The trade-off is provisioning latency, which is why the topology depth is bounded by the remaining recovery budget rather than always reconstructing the full production stack.

How is the session state file locked to prevent concurrent-drill races?

Each session acquires an exclusive lock keyed by session id before any resource is created, using an atomic create-if-not-exists primitive — an S3-plus-DynamoDB state lock, a Terraform Cloud workspace, or an O_CREAT | O_EXCL lock file. If the lock is already held, provisioning aborts immediately with a configuration exit code rather than retrying, because two workers provisioning the same session key produce orphaned VPCs and duplicate volumes that survive teardown.

What happens to resources when provisioning fails partway through?

The orchestrator records every resource id as it is created, and any provisioning error triggers a reverse-order teardown of that record before the non-zero exit code propagates. A network created first is destroyed last, after the instance and volumes that depend on it are already gone. This guarantees a mid-provision API outage never strands billable resources, and any resource that does survive teardown increments a leaked-resources counter that is treated as a critical isolation defect.

How does provisioning enforce isolation from production?

Isolation is structural and applied before data lands: a dedicated VPC or namespace with no peering to production, security groups that deny egress to production endpoints by default, session-scoped IAM roles revoked at teardown, and prevent_destroy guards on any production resource a module could reach. Only two paths cross the boundary: a read-only mount of the backup snapshot and the synthetic smoke-test ingress. Restored data therefore has no route to live systems, and a misconfigured template is stopped by the session role's lack of production write access, with prevent_destroy as a backstop against destroy plans.

Point-in-time recovery targeting — resolves the exact coordinate the provisioned instance is restored to.
Smoke-test routing logic — consumes the endpoints this phase registers to confine synthetic traffic to the sandbox.
Fallback chain configuration — reroutes the drill to a secondary pool when the primary sandbox degrades.
Security boundaries for DR environments — the isolation invariants materialization must satisfy before data restore.
Validation model selection — decides how much topology the sandbox must reconstruct for the chosen validation depth.

This topic is one component of the broader Restore Drill Orchestration & Environment Isolation framework.
