Handling Page Corruption in PostgreSQL Backups

Silent page corruption in PostgreSQL physical backups bypasses filesystem-level integrity checks and storage array snapshots. Logical inconsistencies within 8KB page headers frequently survive standard pg_verifybackup routines, allowing structurally compromised base backups to enter the promotion queue during disaster recovery drills. For DBAs, SREs, and disaster recovery planners, the operational mandate requires shifting from post-restore verification to strict pre-restore quarantine. A production-grade Automated Backup Integrity Check Implementation must intercept structural anomalies before they trigger cascading index failures or heap page panics during WAL replay.

Pre-Restore Quarantine Architecture

When data_checksums=on is configured at cluster initialization, PostgreSQL stores a 16-bit checksum in each page header (pd_checksum), computed with a custom FNV-1a block algorithm — not CRC32C, which PostgreSQL reserves for WAL records and the control file. Validation requires reading physical blocks directly, reconstructing the 24-byte page header, zeroing the stored checksum field, and recomputing the page checksum (mixing in the block number) against the raw payload. This methodology forms the foundation of modern Page Corruption Scanning Techniques, where validation executes against archived tarballs or uncompressed base directories prior to DR promotion.

The validation pipeline must operate independently of the running PostgreSQL instance. It reads directly from backup storage targets (S3, NFS, or block snapshots), processes files concurrently, and outputs a structured quarantine manifest. Synchronous traversal introduces unacceptable latency for multi-terabyte clusters; therefore, async-driven I/O with bounded worker pools is mandatory for SLA-compliant DR drills.

Production Async Validation Scanner

flowchart TD
  A["Walk backup dir skip pg wal"] --> B["Match relation files"]
  B --> C["Read 32 page batch via executor"]
  C --> D["Parse 24 byte header struct"]
  D --> E{"Stored checksum is zero"}
  E -->|"yes"| F["Skip unused page"]
  E -->|"no"| G["Recompute FNV-1a with block number"]
  G --> H{"Computed equals stored"}
  H -->|"yes"| I["Mark page valid"]
  H -->|"no"| J["Record corrupted page"]
  J --> K["exit code 2 quarantine"]
  I --> L["exit code 0 proceed replay"]

Figure. The async PostgreSQL scanner batching page reads, parsing headers, skipping zeroed pages, and recomputing the FNV-1a checksum to gate promotion via exit codes 0 and 2.

The following implementation provides a production-ready, memory-efficient scanner. It batches file reads, parses page headers using struct, and validates checksums without loading entire relations into memory. It requires only Python 3.8 or newer and the standard library — no third-party packages — and reimplements PostgreSQL’s exact page-checksum algorithm for bit-for-bit parity with the server.

python
#!/usr/bin/env python3
"""
PostgreSQL Backup Page Corruption Scanner
Validates each page's stored pd_checksum using PostgreSQL's own FNV-1a page
checksum algorithm (NOT CRC32C) for all relation files in a base backup.
"""

import asyncio
import struct
import os
import re
import sys
import json
from pathlib import Path
from typing import AsyncGenerator, Tuple, Dict

PAGE_SIZE = 8192
# PostgreSQL PageHeaderData layout: pd_lsn(8) + pd_checksum(2) + pd_flags(2) + 
# pd_lower(2) + pd_upper(2) + pd_special(2) + pd_pagesize_version(2) + pd_prune_xid(4)
HEADER_FORMAT = '<QHHHHHHI'
HEADER_SIZE = 24
CHECKSUM_OFFSET = 8  # pd_checksum starts at byte 8 in the header
RELSEG_SIZE = 131072  # 8KB pages per 1GB relation segment file

# PostgreSQL's data-page checksum is a custom FNV-1a block checksum (see
# src/include/storage/checksum_impl.h) — it is NOT CRC32C. These are the exact
# constants and mixing the server uses, so results match pd_checksum bit-for-bit.
N_SUMS = 32
FNV_PRIME = 0x01000193  # 16777619
CHECKSUM_BASE_OFFSETS = (
    0x5B1F36E9, 0xB8525960, 0x02AB50AA, 0x1DE66D2A,
    0x79FF467A, 0x9BB9F8A3, 0x217E7CD2, 0x83E13D2C,
    0xF8D4474F, 0xE39EB970, 0x42C6AE16, 0x993216FA,
    0x7B093B5D, 0x98DAFF3C, 0xF718902A, 0x0B1C9CDB,
    0xE58F764B, 0x187636BC, 0x5D7B3BB1, 0xE73DE7DE,
    0x92BEC979, 0xCCA6C0B2, 0x304A0979, 0x85AA43D4,
    0x783125BB, 0x6CA8EAA2, 0xE407EAC6, 0x4B5CFC3E,
    0x9FBF8C76, 0x15CA20BE, 0xF2CA9FD3, 0x959BD756,
)
RELATION_RE = re.compile(r"^[0-9]+(\.[0-9]+)?(_fsm|_vm|_init)?$")


def pg_checksum_page(page: bytes, blkno: int) -> int:
    """Reimplements PostgreSQL's pg_checksum_page() for one 8KB page.

    Zeroes the pd_checksum field, folds 32 FNV-1a partial sums over the page,
    mixes in the block number, and reduces to the stored 1..65535 range.
    """
    buf = bytearray(page)
    buf[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 2] = b"\x00\x00"
    words = struct.unpack_from(f"<{PAGE_SIZE // 4}I", buf)

    sums = list(CHECKSUM_BASE_OFFSETS)
    rows = PAGE_SIZE // (4 * N_SUMS)  # 64 rows of 32 uint32s

    def comp(checksum: int, value: int) -> int:
        tmp = (checksum ^ value) & 0xFFFFFFFF
        return ((tmp * FNV_PRIME) ^ (tmp >> 17)) & 0xFFFFFFFF

    for i in range(rows):
        base = i * N_SUMS
        for j in range(N_SUMS):
            sums[j] = comp(sums[j], words[base + j])
    # Two extra rounds of zero mixing, exactly as the server does.
    for _ in range(2):
        for j in range(N_SUMS):
            sums[j] = comp(sums[j], 0)

    result = 0
    for s in sums:
        result ^= s
    result ^= blkno & 0xFFFFFFFF
    return (result % 65535) + 1


def relation_base_block(file_path: Path) -> int:
    """First block of a relation segment file ('12345.3' -> 3 * RELSEG_SIZE)."""
    m = re.match(r"^[0-9]+\.([0-9]+)$", file_path.name)
    return int(m.group(1)) * RELSEG_SIZE if m else 0

async def read_page_batch(file_path: Path, offset: int, batch_size: int = 32) -> bytes:
    """Reads a batch of pages using executor to avoid blocking the event loop."""
    def _sync_read():
        with open(file_path, 'rb') as f:
            f.seek(offset)
            return f.read(PAGE_SIZE * batch_size)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _sync_read)

async def validate_file_pages(file_path: Path) -> AsyncGenerator[Tuple[int, bool, str], None]:
    """Yields (page_offset, is_valid, error_msg) for each page in the file."""
    try:
        file_size = os.path.getsize(file_path)
        if file_size == 0 or file_size % PAGE_SIZE != 0:
            yield (0, False, f"File size {file_size} not aligned to {PAGE_SIZE}B")
            return
    except OSError as e:
        yield (0, False, f"OS error: {e}")
        return

    total_pages = file_size // PAGE_SIZE
    batch_size = 32
    base_block = relation_base_block(file_path)

    for batch_start in range(0, total_pages, batch_size):
        offset = batch_start * PAGE_SIZE
        raw_data = await read_page_batch(file_path, offset, batch_size)
        pages_in_batch = min(batch_size, total_pages - batch_start)

        for i in range(pages_in_batch):
            page_offset = i * PAGE_SIZE
            page_data = raw_data[page_offset:page_offset + PAGE_SIZE]
            
            if len(page_data) < PAGE_SIZE:
                yield (offset + page_offset, False, "Truncated page at EOF")
                continue

            # Parse header and extract stored checksum
            header_fields = struct.unpack_from(HEADER_FORMAT, page_data)
            stored_checksum = header_fields[1]

            # Skip empty/zeroed pages (common in sparse files). PostgreSQL never
            # writes a checksum of 0, so a zero here marks an unused page.
            if stored_checksum == 0:
                continue

            # Recompute with PostgreSQL's own algorithm, mixing in this page's
            # block number relative to the relation fork.
            blkno = base_block + batch_start + i
            computed = pg_checksum_page(page_data, blkno)

            if computed != stored_checksum:
                yield (offset + page_offset, False, f"Checksum mismatch: stored={stored_checksum:#06x}, computed={computed:#06x}")
            else:
                yield (offset + page_offset, True, "")

async def scan_directory(backup_dir: str, max_concurrency: int = 8):
    """Orchestrates concurrent scanning across all relation files."""
    base_path = Path(backup_dir)
    semaphore = asyncio.Semaphore(max_concurrency)
    results: Dict[str, list] = {"valid": [], "corrupted": [], "skipped": []}

    async def process_file(file_path: Path):
        async with semaphore:
            async for page_offset, is_valid, msg in validate_file_pages(file_path):
                if msg and "Truncated" in msg:
                    results["skipped"].append({"file": str(file_path), "offset": page_offset, "reason": msg})
                elif not is_valid:
                    results["corrupted"].append({"file": str(file_path), "offset": page_offset, "reason": msg})
                # Valid pages are not logged to avoid massive output

    tasks = []
    # Only relation files (base/, global/, pg_tblspc/) carry 8KB page checksums.
    # WAL segments use a separate CRC scheme, so pg_wal/ is skipped.
    for root, _, files in os.walk(base_path):
        if "pg_wal" in Path(root).parts or "archive_status" in root:
            continue
        for f in files:
            fpath = Path(root) / f
            if RELATION_RE.match(fpath.name):
                tasks.append(process_file(fpath))

    await asyncio.gather(*tasks)
    return results

def main():
    if len(sys.argv) != 2:
        print("Usage: python pg_backup_validator.py /path/to/base/backup", file=sys.stderr)
        sys.exit(1)
    
    backup_dir = sys.argv[1]
    print(f"Starting validation scan: {backup_dir}")
    results = asyncio.run(scan_directory(backup_dir))
    
    print(json.dumps(results, indent=2))
    if results["corrupted"]:
        sys.exit(2)  # Non-zero exit for pipeline quarantine triggers
    sys.exit(0)

if __name__ == "__main__":
    main()

DR Drill Orchestration & Pipeline Integration

Automated validation must execute deterministically within disaster recovery runbooks. The scanner integrates into CI/CD pipelines or cron-driven DR rehearsal schedules. The following orchestration pattern enforces quarantine, generates audit trails, and gates promotion workflows.

1. Backup Extraction & Validation Gate

bash
#!/usr/bin/env bash
set -euo pipefail

BACKUP_PATH="/mnt/dr-staging/base_backup_$(date +%Y%m%d_%H%M%S)"
VALIDATION_LOG="/var/log/pg_dr/validation_$(date +%Y%m%d_%H%M%S).json"
QUARANTINE_FLAG="/tmp/pg_dr_quarantine_active"

# Extract base backup (example using pg_basebackup or WAL-G)
# wal-g backup-fetch "$BACKUP_PATH" LATEST

# Run async validator
python3 pg_backup_validator.py "$BACKUP_PATH" > "$VALIDATION_LOG" 2>&1 || {
    EXIT_CODE=$?
    if [ $EXIT_CODE -eq 2 ]; then
        echo "[$(date -u)] CRITICAL: Page corruption detected. Promotion halted." | tee -a /var/log/pg_dr/alert.log
        touch "$QUARANTINE_FLAG"
        # Notify DR planners via webhook/email
        curl -s -X POST https://alerts.internal/dr-quarantine \
             -H "Content-Type: application/json" \
             -d "{\"status\":\"quarantined\",\"log\":\"$VALIDATION_LOG\",\"timestamp\":\"$(date -u)\"}"
        exit 1
    else
        # Any other non-zero exit (crash, bad arguments) must also block promotion.
        echo "[$(date -u)] ERROR: validator failed with exit $EXIT_CODE. Promotion halted." | tee -a /var/log/pg_dr/alert.log
        exit "$EXIT_CODE"
    fi
}

echo "[$(date -u)] Validation passed. Proceeding to WAL replay and promotion."

2. Systemd Service for Continuous Validation

For environments requiring continuous integrity monitoring of backup repositories, deploy a systemd timer that triggers the scanner against the latest snapshot.

ini
# /etc/systemd/system/pg-backup-validator.service
[Unit]
Description=PostgreSQL Backup Page Corruption Validator
After=network-online.target

[Service]
Type=oneshot
ExecStart=/opt/scripts/pg_backup_validator.sh
User=postgres
Group=postgres
StandardOutput=journal
StandardError=journal
ini
# /etc/systemd/system/pg-backup-validator.timer
[Unit]
Description=Run backup validation every 6 hours

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true

[Install]
WantedBy=timers.target

3. DR Runbook Integration Steps

  1. Pre-Drill Validation: Execute the scanner against the target backup set. Exit code 0 permits promotion. Exit code 2 triggers automatic quarantine and alerts the SRE on-call.
  2. WAL Replay Gating: Only after validation passes, initiate pg_rewind or pg_waldump replay. Corrupted base backups will cause PANIC: could not locate a valid checkpoint record during startup. Pre-validation eliminates this failure mode.
  3. Audit Trail Retention: Store JSON validation manifests in immutable object storage. Compliance frameworks require cryptographic proof that backups were structurally verified prior to failover execution.

Operational Considerations

  • Checksum Prerequisite: This validation pipeline requires data_checksums=on at cluster initialization. Clusters initialized without checksums cannot be validated using this method. Refer to the PostgreSQL documentation on the data_checksums parameter for initialization parameters.
  • I/O Alignment: When scanning block-level snapshots, ensure the underlying storage presents 8KB-aligned reads. Misaligned reads will cause struct.unpack failures and false positives.
  • Memory Footprint: The batch size (32 pages) limits peak memory to ~256KB per worker thread, regardless of table size. Adjust max_concurrency based on available CPU cores and IOPS limits.
  • False Positive Mitigation: Pages with pd_checksum=0 are intentionally skipped. These represent unallocated or zeroed pages in sparse files. Do not treat them as corruption.

Implementing this validation workflow eliminates the silent failure vector in PostgreSQL disaster recovery. By enforcing pre-restore quarantine and integrating async scanning into DR orchestration, teams guarantee that only structurally sound backups enter the promotion pipeline.