Page Corruption Scanning Techniques

Physical page degradation is the failure mode most likely to survive a backup pipeline undetected and detonate mid-restore, and closing that gap is the specific job of this stage within an Automated Backup Integrity Check Implementation. Unlike logical inconsistencies that surface as constraint violations, a corrupted storage page carries a broken header or a stale checksum that stays invisible until a query or a recovery replay actually reads that page. This page defines the scanning contract that turns raw block-level anomalies into deterministic gate signals: the scanner reads backup artifacts directly from cold storage, reconstructs exact page boundaries, recomputes each engine’s embedded checksum, and emits a machine-readable verdict that feeds the checksum validation pipeline and the downstream error categorization layer. The design constraint is unforgiving: scan multi-terabyte archives fast enough to fit inside the recovery window your RTO and RPO mapping already committed to, without exhausting host memory and without generating the false positives that train on-call engineers to ignore the alert.

Architecture and Execution Workflow

Page scanning is a streaming pipeline, not a batch load. Each artifact is memory-mapped, walked page by page, and every non-empty page is verified against its own embedded integrity code before any logical restore is attempted. Failures are routed through classification so that a stale checksum from an unclean shutdown never triggers the same escalation as a zeroed heap page.

Figure. The page scanning process that memory maps the artifact, reconstructs pages, verifies embedded checksums, and routes failures through categorization to DR drill gating.

The workflow decomposes into four discrete execution phases. Each phase has a single responsibility and a well-defined hand-off, which is what makes the scanner testable and lets you swap engine-specific logic without rewriting the traversal core.

Phase 1 — Artifact Resolution and Boundary Reconstruction

The scanner first resolves the physical layout of the backup target. A pg_basebackup tarball, a WAL-G delta, an XtraBackup stream, and a raw block snapshot all present pages differently on disk, so the resolution phase determines page size, segment file boundaries, and the mapping from file offset to logical block number. Most relational engines, and many document engines, store data in fixed-size pages: a header holding metadata (page number, LSN, free-space pointers), a variable-length row region, and a checksum or CRC that sits in the header for some engines and in a trailer for others. Reconstructing boundaries means seeking to block * page_size and slicing exactly one page; getting this offset arithmetic wrong is the single most common source of phantom corruption reports, because a one-byte misalignment shifts every subsequent checksum computation. Segment files (PostgreSQL splits relations at 1 GB) carry only a segment number in the filename suffix, so the resolver must convert it to a base block number (segment number × blocks per segment, which is 131072 for 8 kB pages) and fold that offset back in before the block number is mixed into the checksum.
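The segment arithmetic can be sketched as follows. This is a minimal illustration, not the resolver itself: the helper names `segment_base_block` and `page_offset` are hypothetical, and the filename pattern assumes PostgreSQL-style relation files (`24576`, `24576.2`, `24576_fsm.1`) with the default 8 kB page and 1 GB segment sizes.

```python
import re
from pathlib import Path

# Hypothetical helpers for illustration; names and defaults are assumptions.
# Matches PostgreSQL-style relation files: "<relfilenode>[_fork][.segment]".
_SEGMENT_RE = re.compile(r"^(?P<rel>\d+)(?:_(?P<fork>fsm|vm|init))?(?:\.(?P<seg>\d+))?$")


def segment_base_block(path: Path, page_size: int = 8192,
                       segment_bytes: int = 1 << 30) -> int:
    """Return the logical block number of the first page in a segment file."""
    match = _SEGMENT_RE.match(path.name)
    if match is None:
        raise ValueError(f"not a relation segment file: {path.name}")
    segment_number = int(match.group("seg") or 0)  # no suffix means segment 0
    return segment_number * (segment_bytes // page_size)


def page_offset(block: int, base_block: int, page_size: int = 8192) -> int:
    """Return the byte offset of a logical block inside its segment file."""
    local = block - base_block
    if local < 0:
        raise ValueError("block precedes this segment")
    return local * page_size
```

The value returned by `segment_base_block` is what a caller would pass as `base_block` when scanning the second and later segments of a relation.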

Phase 2 — Checksum Extraction and Recomputation

Each page’s stored checksum is lifted from a fixed header (or trailer) offset, the field is zeroed in a working copy, and the engine’s checksum is recomputed over the page. This is deterministic and side-effect free: given identical bytes the recomputation always yields the same value, which is what lets the pipeline treat a mismatch as hard evidence of bit-rot, torn writes, or a truncated snapshot stream rather than as noise. Engine parity matters here: PostgreSQL data pages use a custom 16-bit block checksum derived from FNV-1a, with the block number mixed in, not the CRC32C it uses for WAL records and the control file, so a scanner that naively applies CRC32C will flag every valid page. The recomputation must be a bit-for-bit reimplementation of the server algorithm, which is exactly what the PostgreSQL page-corruption handler documents in full.
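The lift-zero-recompute sequence looks roughly like this. The sketch assumes a PostgreSQL-style header in which the 16-bit checksum field sits at byte offset 8, directly after the 8-byte LSN, and a little-endian host. The `placeholder_checksum` function is a stand-in that only demonstrates the plumbing; it is explicitly not PostgreSQL's algorithm, and a real validator would plug in a bit-for-bit port of the server code.

```python
import struct
import zlib
from typing import Callable, Tuple

CHECKSUM_OFFSET = 8      # assumption: checksum field follows the 8-byte LSN
CHECKSUM_FORMAT = "<H"   # assumption: 16-bit field written by a little-endian host


def split_checksum(page: bytes) -> Tuple[int, bytes]:
    """Return (stored checksum, working copy with the checksum field zeroed)."""
    (stored,) = struct.unpack_from(CHECKSUM_FORMAT, page, CHECKSUM_OFFSET)
    working = bytearray(page)  # copy, so the original page is never mutated
    struct.pack_into(CHECKSUM_FORMAT, working, CHECKSUM_OFFSET, 0)
    return stored, bytes(working)


def placeholder_checksum(zeroed_page: bytes, block: int) -> int:
    """Stand-in only, NOT the PostgreSQL algorithm: mixes the block number
    into a CRC32 and folds the result into the range 1..65535."""
    return ((zlib.crc32(zeroed_page) ^ block) % 65535) + 1


def recompute_matches(page: bytes, block: int,
                      algo: Callable[[bytes, int], int] = placeholder_checksum) -> bool:
    """True when the recomputed checksum equals the stored one."""
    stored, working = split_checksum(page)
    return algo(working, block) == stored
```

Passing the algorithm as a callable keeps the extraction step identical across engines, so only the `algo` argument changes when a real engine checksum is ported.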

Phase 3 — Differential Verdict and Quarantine

The extracted and recomputed values are compared, and a PageVerdict is produced per page. Pages that are entirely zero are legitimately unused and are skipped rather than flagged. The test is the whole page, not the checksum field alone: with checksums enabled, PostgreSQL never stores a zero checksum on an initialised page, but a CRC-based format can legitimately produce one. This exclusion is the difference between a scanner that runs clean and one that pages the on-call rotation over sparse-file holes. Mismatches are recorded with the file, offset, block, and both checksum values so the quarantine manifest is self-describing and forensically useful. The verdict stream is the pipeline’s contract: everything downstream consumes verdicts, never raw pages.
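A self-describing quarantine entry could take the following shape. `MismatchRecord` and `compare` are hypothetical names introduced for illustration; the fields are simply the ones listed above (file, offset, block, and both checksum values).

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MismatchRecord:
    """One quarantine-manifest entry (hypothetical schema)."""
    file: str
    offset: int
    block: int
    stored: int
    computed: int


def compare(file: str, offset: int, block: int, page: bytes,
            stored: int, computed: int) -> Optional[MismatchRecord]:
    """Return a record for a mismatching page, or None when nothing is wrong."""
    if not any(page):          # all-zero page: unused, never reported
        return None
    if stored == computed:     # checksum agrees: page is clean
        return None
    return MismatchRecord(file, offset, block, stored, computed)
```

`asdict(record)` turns an entry into a plain dictionary that serialises directly into the JSON manifest.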

Phase 4 — State Persistence and Gate Emission

The scanner serialises its verdicts to a durable manifest (JSON on immutable object storage) and returns a POSIX exit code that the orchestration layer reads directly: 0 for a clean artifact, 2 for detected corruption, and a sysexits-style usage code for operator error. Persisting the manifest before emitting the exit code guarantees that a gate failure is always accompanied by an auditable record — a compliance requirement, not a nicety.
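The ordering guarantee can be sketched like this. A local path stands in for the immutable object store, which is an assumption made to keep the example self-contained, and `persist_then_gate` is a hypothetical name. The point is only the sequence: the manifest is durably written and renamed into place before any exit code is computed.

```python
import json
import os
import tempfile
from pathlib import Path

EXIT_OK = 0
EXIT_CORRUPT = 2


def persist_then_gate(manifest: dict, dest: Path) -> int:
    """Write the manifest durably, then derive the exit code from it."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory so the rename is atomic.
    fd, tmp = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(manifest, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())   # force the bytes to stable storage
        os.replace(tmp, dest)       # atomic publish of the finished manifest
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Only reached once the manifest exists at its final location.
    return EXIT_CORRUPT if manifest.get("corrupt") else EXIT_OK
```

If the write fails, the exception propagates and no gate code is returned at all, so the orchestrator never sees a verdict without a record behind it.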

Python Implementation Patterns

The traversal core is engine-agnostic; engine specifics live behind a small pluggable interface. This keeps the memory-mapped scanner — the performance-critical, well-tested part — completely independent of how any single database lays out its pages. The base contract requires only two predicates from each engine: whether a page is unused, and whether a page is valid.

python

#!/usr/bin/env python3
"""Pluggable page-corruption scanner core.

Defines an engine-agnostic PageValidator contract and a memory-mapped scanner
that streams fixed-size pages past a registered validator without loading whole
relations into RAM.
"""
from __future__ import annotations

import abc
import mmap
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator, Type


@dataclass(frozen=True)
class PageVerdict:
    block: int
    offset: int
    ok: bool
    reason: str = ""


class PageValidator(abc.ABC):
    """Engine-specific page contract. Subclasses declare page size and the
    predicate that decides whether one raw page is structurally sound."""

    page_size: int = 8192

    @abc.abstractmethod
    def is_unused(self, page: bytes) -> bool:
        """True for legitimately zeroed/unallocated pages (never corruption)."""

    @abc.abstractmethod
    def verify(self, page: bytes, block: int) -> PageVerdict:
        """Validate a single page belonging to logical block number `block`."""


REGISTRY: Dict[str, Type[PageValidator]] = {}


def register(engine: str):
    """Class decorator that binds an engine name to its validator."""
    def _wrap(cls: Type[PageValidator]) -> Type[PageValidator]:
        REGISTRY[engine] = cls
        return cls
    return _wrap


def scan_file(path: Path, validator: PageValidator, base_block: int = 0) -> Iterator[PageVerdict]:
    """Stream every page of `path` through `validator` using a read-only mmap.

    Peak memory is one page regardless of relation size, because the mmap is a
    view over the kernel page cache rather than a full read into the heap.
    """
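    psize = validator.page_size
    size = path.stat().st_size
    if size == 0:
        # An empty file holds no pages and cannot be mmapped; nothing to verify.
        return
    if size % psize != 0:
        yield PageVerdict(base_block, 0, False, f"unaligned size {size} for page {psize}")
        return
    with path.open("rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for i in range(size // psize):
                off = i * psize
                page = mm[off:off + psize]  # copies exactly one page into bytes
                if validator.is_unused(page):
                    continue
                verdict = validator.verify(page, base_block + i)
                # Report the offset within this file. Validators only see the
                # logical block, which differs from i once base_block is non-zero.
                yield PageVerdict(verdict.block, off, verdict.ok, verdict.reason)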

Memory-mapped traversal is the load-bearing choice. The official Python documentation for the mmap module describes how a read-only mapping lets the scanner index into a multi-gigabyte segment as if it were a bytes object while the kernel pages data in on demand. Slicing the mapping copies one page at a time into the Python heap, so a 1 GB relation costs one page of heap memory per worker rather than a gigabyte, and the mapped pages themselves live in the reclaimable page cache. A concrete validator only has to implement the two predicates. Many custom backup formats store a little-endian CRC32 of the payload in the final four bytes of each page; that layout drops in cleanly:

python

import zlib


@register("generic-crc32")
class GenericCrc32Validator(PageValidator):
    """Validates pages that store a little-endian CRC32 of the payload in the
    final 4 bytes of each page — a common custom on-disk layout."""

    page_size = 8192

    def is_unused(self, page: bytes) -> bool:
        # A fully zeroed page is an unallocated hole in a sparse file.
        return not any(page)

    def verify(self, page: bytes, block: int) -> PageVerdict:
        payload, trailer = page[:-4], page[-4:]
        stored = int.from_bytes(trailer, "little")
        computed = zlib.crc32(payload) & 0xFFFFFFFF
        if computed != stored:
            return PageVerdict(
                block,
                block * self.page_size,
                False,
                f"crc mismatch stored={stored:#010x} computed={computed:#010x}",
            )
        return PageVerdict(block, block * self.page_size, True)

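One detail of the traversal is worth spelling out: slicing an mmap object returns a bytes copy of the page. Where even that per-page copy matters, a memoryview over the mapping avoids it. The following is a sketch of that variant, not part of the scanner above; `page_crcs` is a hypothetical helper that returns the CRC32 of every page.

```python
import mmap
import zlib
from pathlib import Path
from typing import List


def page_crcs(path: Path, page_size: int = 8192) -> List[int]:
    """CRC32 of each page, computed over memoryview slices without copying."""
    if path.stat().st_size == 0:
        return []  # an empty file cannot be mmapped
    crcs: List[int] = []
    with path.open("rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Every view must be released before the mapping closes, otherwise
            # mmap.close() raises BufferError; the with-blocks guarantee that.
            with memoryview(mm) as view:
                for off in range(0, len(view), page_size):
                    with view[off:off + page_size] as page:
                        crcs.append(zlib.crc32(page))
    return crcs
```

The trade-off is lifetime management: a view is only valid while the mapping is open, so it cannot be yielded out of the function the way a bytes copy can.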
Sequential scanning of an enterprise archive is a throughput bottleneck that blows the recovery window, so the runner fans files out across a bounded worker pool. Because scan_file is CPU-light but I/O-heavy, offloading each file to the default executor keeps the event loop responsive while saturating disk bandwidth; the semaphore bounds concurrency so the scanner never overwhelms the host kernel or the storage backend’s IOPS ceiling. The same partitioning discipline underpins async batching for large datasets, where monolithic artifacts are split into independently verifiable chunks.

python

#!/usr/bin/env python3
"""Concurrent scan runner that gates a DR drill via POSIX exit codes."""
import asyncio
import json
import sys
from pathlib import Path
from typing import List

EXIT_OK = 0          # artifact is clean, promotion may proceed
EXIT_CORRUPT = 2     # corruption detected, quarantine the artifact
EXIT_USAGE = 64      # EX_USAGE from sysexits.h — operator error


async def scan_path(path: Path, validator: PageValidator,
                    sem: asyncio.Semaphore) -> List[PageVerdict]:
    """Scan one file in the executor, returning only failing verdicts."""
    async with sem:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, lambda: [v for v in scan_file(path, validator) if not v.ok]
        )


async def run(root: Path, engine: str, concurrency: int = 8) -> int:
    validator = REGISTRY[engine]()
    sem = asyncio.Semaphore(concurrency)
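    # Assumes every file under root is page-structured; filter out configs,
    # WAL, and manifests here if the artifact mixes file types.
    files = [p for p in root.rglob("*") if p.is_file()]
    batches = await asyncio.gather(*(scan_path(p, validator, sem) for p in files))
    # gather() preserves input order, so each batch pairs with its source file.
    corrupt = [
        {"file": str(path), **vars(v)}
        for path, batch in zip(files, batches)
        for v in batch
    ]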
    json.dump(
        {"engine": engine, "scanned": len(files), "corrupt": corrupt},
        sys.stdout,
        indent=2,
    )
    sys.stdout.write("\n")
    return EXIT_CORRUPT if corrupt else EXIT_OK


def main() -> int:
    if len(sys.argv) != 3:
        print("usage: scan.py <backup-root> <engine>", file=sys.stderr)
        return EXIT_USAGE
    root = Path(sys.argv[1])
    if not root.is_dir():
        print(f"not a directory: {root}", file=sys.stderr)
        return EXIT_USAGE
    if sys.argv[2] not in REGISTRY:
        print(f"unknown engine: {sys.argv[2]}", file=sys.stderr)
        return EXIT_USAGE
    return asyncio.run(run(root, sys.argv[2]))


if __name__ == "__main__":
    sys.exit(main())

Integration with DR Drill Orchestration

Page scanning is a gate, not a report. The orchestration engine evaluates the scanner’s exit code against a recovery-readiness matrix before it provisions a single compute instance for restoration. A clean exit (0) releases the drill to WAL replay and promotion; a corruption exit (2) halts the drill, triggers a fallback to the most recent verified snapshot, and opens a forensic storage audit. This is the same contract the PostgreSQL page-corruption handler implements at the engine level, and it is why the scanner runs before any restore target is materialised inside a provisioned sandbox. Gating on a mathematically verified signal means a drill consumes expensive orchestration resources only when success is probable, and it prevents a corrupted base backup from silently poisoning every downstream validation stage.
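On the orchestrator side, the gate reduces to mapping an exit code onto an action. The sketch below assumes the scanner is launched as a subprocess; `DrillAction` and `gate` are hypothetical names, and treating every unexpected code as an abort is a fail-closed assumption rather than something the scanner contract specifies.

```python
import subprocess
import sys
from enum import Enum
from typing import Sequence


class DrillAction(Enum):
    PROCEED = "release drill to WAL replay and promotion"
    FALLBACK = "halt, fall back to last verified snapshot, open storage audit"
    ABORT = "scanner did not produce a verdict; do not promote"


def gate(scanner_cmd: Sequence[str]) -> DrillAction:
    """Run the scanner and translate its exit code into a drill action."""
    code = subprocess.run(scanner_cmd, check=False).returncode
    if code == 0:
        return DrillAction.PROCEED
    if code == 2:
        return DrillAction.FALLBACK
    # 64 (usage error), signals, and anything unexpected fail closed.
    return DrillAction.ABORT
```

A call such as `gate([sys.executable, "scan.py", "/backups/latest", "generic-crc32"])` would then decide whether any restore compute is provisioned; the path shown is a placeholder.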

The scanner’s JSON manifest is the interface to adjacent pipelines. The checksum validation pipeline consumes the same verdict schema to correlate page-level failures with file-level hash divergence, and the error categorization framework consumes the reason strings to assign severity. Because every stage speaks the same verdict vocabulary, the orchestrator can reason about corruption uniformly whether it originated from a torn page, a truncated stream, or a hash mismatch.

Error Classification and Threshold Management

Not every mismatch is a catastrophe, and treating them uniformly is how alert fatigue erodes trust in the gate. Verdicts are bucketed into severity tiers with distinct tolerance windows and escalation paths. A stale checksum on a single page after an unclean shutdown is recoverable; a zeroed heap page or a corrupt page header is not. Tolerance is expressed as a rate against a rolling baseline, so a scanner that finds one soft anomaly in ten million pages reports and moves on, while a database cluster crossing its hard-fail threshold escalates immediately.
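One way to bucket verdicts is by isolation. The sketch below is illustrative only: it assumes that a mismatch with clean neighbouring blocks is a soft anomaly, that a run of adjacent mismatches is treated as hard, and that the scanner's "unaligned size" reason marks a structural fault. The neighbour rule is an assumption made for this example, and `classify` is a hypothetical name.

```python
from typing import Dict, Iterable, Tuple


def classify(failures: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Map each failing block number to a severity tier.

    `failures` yields (block, reason) pairs taken from failing verdicts.
    """
    items = list(failures)
    bad_blocks = {block for block, _ in items}
    tiers: Dict[int, str] = {}
    for block, reason in items:
        if reason.startswith("unaligned size"):
            tiers[block] = "structural"
        elif (block - 1) in bad_blocks or (block + 1) in bad_blocks:
            tiers[block] = "hard"   # assumption: adjacent failures are not isolated
        else:
            tiers[block] = "soft"   # single page with clean neighbours
    return tiers
```

The rate-based degraded tier is not decided per page; it only emerges when soft counts are compared against the baseline across drills.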

Figure. Verdict severity drives the gate state machine: soft anomalies self-loop below tolerance and escalate to degraded only when their rate crosses the rolling baseline, while hard and structural faults transition straight to fail.

Severity tier	Trigger condition	Tolerance window	Gate action
`soft`	Stale checksum, single page, clean neighbours	≤ 1 page per 10M scanned	Log to manifest, continue scan
`degraded`	Soft-anomaly rate exceeds rolling baseline	Rate over 3-drill trailing window	Warn, page on-call, allow promotion with flag
`hard`	Zeroed heap page, corrupt page header, torn write	Zero tolerance	Exit `2`, halt drill, quarantine artifact
`structural`	Segment size unaligned to page size	Zero tolerance	Exit `2`, halt drill, storage integrity audit

Threshold calibration is empirical. Seed the baseline from historical scan telemetry and storage-vendor error rates rather than a guessed constant, and re-derive it as the fleet grows — a static threshold either misses a slow-burning corruption trend or floods the channel when a new high-write cluster joins. Suppressing the soft tier below its tolerance window is deliberate noise reduction, not negligence: the manifest still records every anomaly for later trend analysis, but only rate-crossing events reach a human.
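The gate decision can be sketched as a small pure function. The constant comes from the table's soft tolerance (1 page per 10M scanned) and the window from its 3-drill trailing baseline; combining the two by taking the larger value is an assumption of this sketch, and `gate_state` is a hypothetical name.

```python
from statistics import mean
from typing import Sequence

SOFT_TOLERANCE = 1 / 10_000_000   # at most 1 soft anomaly per 10M pages
TRAILING_WINDOW = 3               # number of previous drills in the baseline


def gate_state(scanned: int, soft: int, hard: int, structural: int,
               trailing_rates: Sequence[float]) -> str:
    """Return "pass", "degraded", or "fail" for one drill."""
    if scanned <= 0:
        raise ValueError("no pages were scanned; nothing was verified")
    if hard or structural:
        return "fail"             # zero tolerance for hard and structural faults
    rate = soft / scanned
    recent = list(trailing_rates)[-TRAILING_WINDOW:]
    baseline = mean(recent) if recent else SOFT_TOLERANCE
    # Assumption: the effective threshold never drops below the static tolerance.
    threshold = max(baseline, SOFT_TOLERANCE)
    return "degraded" if rate > threshold else "pass"
```

Because `trailing_rates` is supplied by the caller, the same function works whether the baseline is seeded from historical telemetry or recomputed after the fleet changes.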

Telemetry and Compliance Output

Every scan emits Prometheus metrics so corruption trends are observable across drills rather than discovered one crisis at a time. The minimal instrument set is a counter of pages scanned, a counter of corrupt pages partitioned by severity, and a histogram of scan duration per artifact. Because the scanner is a short-lived batch process, these values have to be pushed (for example through a Pushgateway) or written out for a textfile collector rather than scraped from the process directly. They feed the same dashboards that track recovery-readiness, letting an SRE see corruption rate and scan throughput on one pane.

python

from prometheus_client import Counter, Histogram

PAGES_SCANNED = Counter(
    "backup_pages_scanned_total",
    "Pages inspected by the corruption scanner",
    ["engine"],
)
PAGES_CORRUPT = Counter(
    "backup_pages_corrupt_total",
    "Pages that failed checksum verification",
    ["engine", "severity"],
)
SCAN_SECONDS = Histogram(
    "backup_scan_duration_seconds",
    "Wall-clock duration of a single artifact scan",
    ["engine"],
)


def record(engine: str, scanned: int, failures: dict, elapsed: float) -> None:
    """Emit one scan's counters and timing to the Prometheus registry."""
    PAGES_SCANNED.labels(engine=engine).inc(scanned)
    for severity, count in failures.items():
        PAGES_CORRUPT.labels(engine=engine, severity=severity).inc(count)
    SCAN_SECONDS.labels(engine=engine).observe(elapsed)

The audit trail is the compliance artifact. Each scan writes a signed JSON manifest — artifact identifier, scan timestamp, engine, page counts, and the full corruption list — to write-once object storage. NIST SP 800-34 contingency-planning controls and SOC 2 availability criteria both expect demonstrable evidence that a backup was verified before it was relied upon for recovery; an immutable, timestamped manifest per drill is exactly that evidence. Retaining manifests for the full regulatory window turns “we believe our backups are good” into “here is the cryptographic record that this artifact passed structural verification at this instant.”
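The text does not say which signing scheme is used, so the following sketch picks HMAC-SHA256 purely as an assumption; an asymmetric signature may suit an audit trail better because verifiers then never hold the signing key. The function names and the envelope layout are hypothetical.

```python
import hashlib
import hmac
import json


def canonical(manifest: dict) -> bytes:
    """Serialise deterministically so the same manifest always signs the same."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, key: bytes) -> dict:
    """Wrap the manifest in an envelope carrying its HMAC-SHA256 tag."""
    tag = hmac.new(key, canonical(manifest), hashlib.sha256).hexdigest()
    return {"manifest": manifest, "hmac_sha256": tag}


def verify_manifest(envelope: dict, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key, canonical(envelope["manifest"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["hmac_sha256"])
```

Canonical serialisation matters here: without sorted keys and fixed separators, two semantically identical manifests could produce different tags.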

Operational Best Practices

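Enable checksums before you need them. PostgreSQL page verification requires data checksums, which are enabled with initdb --data-checksums at cluster creation or, from PostgreSQL 12 onward, offline with pg_checksums --enable; a database cluster running without them cannot be scanned by this method. Confirm that SHOW data_checksums reports on before you rely on the gate.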
Verify page-size alignment for block snapshots. Scanning a raw block device or LVM snapshot demands page-aligned reads; a misaligned mapping produces struct-level parse failures that masquerade as corruption. Treat an unaligned segment as structural severity, not hard.
Skip zeroed pages explicitly. Never treat an all-zero page as corruption — it is an unallocated hole in a sparse file. Enforce the is_unused predicate in every engine validator.
Bound concurrency to the storage IOPS ceiling. Tune the semaphore to available cores and the backend’s throughput limit; over-subscribing S3 or NFS inflates scan time and destabilises the host.
Persist the manifest before emitting the exit code. A gate failure must always leave an auditable record; write the manifest to immutable storage first, then return the POSIX exit code.
Re-derive thresholds from telemetry. Seed tolerance windows from historical scan metrics and recompute them as the fleet changes, so the soft tier stays quiet and the hard tier stays trusted.
Pin engine algorithm parity. Reimplement each engine’s checksum bit-for-bit and test it against a known-good page; a CRC32C-vs-FNV-1a mix-up flags an entire artifact as corrupt.
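A parity test needs a known-good page and a deliberately damaged one. For the generic CRC32 trailer layout described earlier, both can be synthesised; the helpers below are hypothetical fixtures for that layout only. For PostgreSQL parity the known-good page has to come from a real cluster with checksums enabled, because its algorithm is not reproduced here.

```python
import zlib

PAGE_SIZE = 8192


def build_page(payload: bytes, page_size: int = PAGE_SIZE) -> bytes:
    """Build a page whose final 4 bytes hold the little-endian CRC32 of the rest."""
    body_len = page_size - 4
    if len(payload) > body_len:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_len, b"\x00")
    return body + zlib.crc32(body).to_bytes(4, "little")


def crc_ok(page: bytes) -> bool:
    """Recompute the trailer CRC32 and compare it with the stored value."""
    return zlib.crc32(page[:-4]) == int.from_bytes(page[-4:], "little")


def flip_bit(page: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of the page with a single bit inverted."""
    damaged = bytearray(page)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)
```

A single flipped bit is the smallest useful negative case, since CRC32 is guaranteed to detect it; a validator that accepts `flip_bit(build_page(b"row"), 10)` is not computing the checksum it claims to.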

Handling Page Corruption in PostgreSQL Backups — the engine-specific FNV-1a scanner this stage delegates to.
Checksum Validation Pipelines — file-level hash verification that consumes the same verdict schema.
Async Batching for Large Datasets — partitioning multi-terabyte artifacts for concurrent verification.
Error Categorization Frameworks — severity assignment and escalation routing for scanner verdicts.
Sandbox Provisioning Automation — the isolated targets a passing scan releases a restore into.

This scanning stage is one control within the broader Automated Backup Integrity Check Implementation workflow.
