Handling Page Corruption in PostgreSQL Backups

This page implements a standalone scanner that recomputes every 8KB page checksum inside a PostgreSQL physical backup and quarantines the backup before it is promoted — the concrete task that anchors the broader Page Corruption Scanning Techniques discipline. Silent page corruption bypasses filesystem-level integrity checks and storage-array snapshots, and logical inconsistencies inside page headers routinely survive pg_verifybackup, allowing a structurally compromised base backup into the promotion queue during a DR drill. The scanner here runs as a strict pre-restore gate: it reads relation files directly from backup storage, reimplements PostgreSQL’s own FNV-1a page-checksum algorithm for bit-for-bit parity with the server, and emits a POSIX exit code the orchestrator can branch on. It slots beside the live-replica work in the checksum validation pipeline, feeds structured findings into your error categorization taxonomy, and is budgeted against the recovery windows fixed by your RTO/RPO mapping.

Architecture and Execution Model

When data checksums are enabled (initdb --data-checksums at cluster creation, or pg_checksums --enable on a stopped cluster in PostgreSQL 12 and later; the read-only data_checksums setting reports the state), PostgreSQL stores a 16-bit checksum in each page header (pd_checksum). It is computed with a custom FNV-1a-based block algorithm, not CRC32C, which the server reserves for WAL records and the control file. Validation reads physical blocks directly, parses the 24-byte page header, zeroes the stored checksum field, and recomputes the checksum over the full 8KB page, mixing in the block number. Because the scanner reads from the backup target (S3, NFS, or a block snapshot) rather than from a running instance, it should process files concurrently: a synchronous file-by-file traversal adds avoidable latency on multi-terabyte clusters, so the implementation below uses asyncio with a bounded number of in-flight files to stay inside the drill's SLA.

Figure. The async PostgreSQL scanner batching page reads, parsing headers, skipping zeroed pages, and recomputing the FNV-1a checksum to gate promotion via exit codes 0 and 2.
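
As a minimal sketch of the header parsing described above, assuming the little-endian 24-byte PageHeaderData layout used throughout this article. The helper name parse_page_header and the sample field values are hypothetical and exist only for illustration:

```python
import struct
from typing import NamedTuple

PAGE_SIZE = 8192
# pd_lsn, pd_checksum, pd_flags, pd_lower, pd_upper, pd_special,
# pd_pagesize_version, pd_prune_xid
HEADER_FORMAT = "<QHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes, no padding with "<"


class PageHeader(NamedTuple):
    # pd_lsn is unpacked as one 64-bit integer purely to consume 8 bytes;
    # the server stores it as two 32-bit halves, so this is not a usable LSN.
    pd_lsn: int
    pd_checksum: int
    pd_flags: int
    pd_lower: int
    pd_upper: int
    pd_special: int
    pd_pagesize_version: int
    pd_prune_xid: int


def parse_page_header(page: bytes) -> PageHeader:
    """Unpack the fixed-size header at the start of one page."""
    if len(page) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes, got {len(page)}")
    return PageHeader(*struct.unpack_from(HEADER_FORMAT, page))


# Build a synthetic page (illustrative values only) and read it back.
sample = bytearray(PAGE_SIZE)
struct.pack_into(HEADER_FORMAT, sample, 0, 0, 0x8F2A, 0, 24, 8192, 8192, 0x2004, 0)
header = parse_page_header(bytes(sample))
print(hex(header.pd_checksum), header.pd_lower, header.pd_upper)
```

Only pd_checksum (bytes 8 and 9) takes part in the gate decision; the other fields are unpacked so the offsets stay self-documenting.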

Prerequisites

Python 3.9+ — the scanner uses only the standard library (asyncio, struct, pathlib); there is nothing to pip install.
A base backup taken from a cluster that has data checksums enabled. Verify with SELECT current_setting('data_checksums') on the source cluster, or pg_controldata "$PGDATA" | grep "Data page checksum". Clusters without checksums cannot be validated by this method.
Read access to the extracted backup directory — an uncompressed pg_basebackup output, a wal-g backup-fetch staging directory, or a mounted block snapshot. The scanner never connects to a running PostgreSQL instance.
8KB-aligned reads on the storage target. When scanning block-level snapshots, confirm the device presents 8192-byte-aligned blocks so struct.unpack does not fault on a short read.
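
The alignment prerequisite can be checked before a full scan. The following is a rough preflight sketch, not part of the scanner: the helper name find_misaligned_files is hypothetical, and it assumes relation files can be recognized by a leading digit in the file name.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Tuple

PAGE_SIZE = 8192


def find_misaligned_files(backup_dir: str) -> List[Tuple[str, int]]:
    """Return (path, size) for relation-like files whose size is not a multiple of 8KB."""
    misaligned = []
    for root, _, files in os.walk(backup_dir):
        if "pg_wal" in Path(root).parts:
            continue  # WAL segments are not page-checksummed relation files
        for name in files:
            # Assumption: relation files are named after their relfilenode,
            # so they start with a digit; everything else is skipped.
            if not name[0].isdigit():
                continue
            path = Path(root) / name
            size = path.stat().st_size
            if size % PAGE_SIZE != 0:
                misaligned.append((str(path), size))
    return sorted(misaligned)


# Demo on a throwaway directory: one aligned file, one truncated file.
with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp) / "base" / "16384"
    db_dir.mkdir(parents=True)
    (db_dir / "24591").write_bytes(bytes(PAGE_SIZE * 2))
    (db_dir / "24592").write_bytes(bytes(PAGE_SIZE + 100))
    report = [(Path(p).name, size) for p, size in find_misaligned_files(tmp)]
    print(report)
```

An empty result means every candidate file is a whole number of pages, which is the precondition for the batched reads used later.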

Production Implementation

The scanner batches file reads, parses page headers with struct, and validates checksums without loading whole relations into memory. It reimplements pg_checksum_page() exactly as the server computes it (see src/include/storage/checksum_impl.h), skips zeroed pages, resolves the correct block number for multi-segment relations, and exits non-zero when any page fails.

python

#!/usr/bin/env python3
"""
PostgreSQL Backup Page Corruption Scanner
Validates each page's stored pd_checksum using PostgreSQL's own FNV-1a page
checksum algorithm (NOT CRC32C) for all relation files in a base backup.
"""

import asyncio
import struct
import os
import re
import sys
import json
from pathlib import Path
from typing import AsyncGenerator, Tuple, Dict

PAGE_SIZE = 8192
# PostgreSQL PageHeaderData layout: pd_lsn(8) + pd_checksum(2) + pd_flags(2) + 
# pd_lower(2) + pd_upper(2) + pd_special(2) + pd_pagesize_version(2) + pd_prune_xid(4)
HEADER_FORMAT = '<QHHHHHHI'
HEADER_SIZE = 24
CHECKSUM_OFFSET = 8  # pd_checksum starts at byte 8 in the header
RELSEG_SIZE = 131072  # 8KB pages per 1GB relation segment file

# PostgreSQL's data-page checksum is a custom FNV-1a block checksum (see
# src/include/storage/checksum_impl.h) — it is NOT CRC32C. These are the exact
# constants and mixing the server uses, so results match pd_checksum bit-for-bit.
N_SUMS = 32
FNV_PRIME = 0x01000193  # 16777619
CHECKSUM_BASE_OFFSETS = (
    0x5B1F36E9, 0xB8525960, 0x02AB50AA, 0x1DE66D2A,
    0x79FF467A, 0x9BB9F8A3, 0x217E7CD2, 0x83E13D2C,
    0xF8D4474F, 0xE39EB970, 0x42C6AE16, 0x993216FA,
    0x7B093B5D, 0x98DAFF3C, 0xF718902A, 0x0B1C9CDB,
    0xE58F764B, 0x187636BC, 0x5D7B3BB1, 0xE73DE7DE,
    0x92BEC979, 0xCCA6C0B2, 0x304A0979, 0x85AA43D4,
    0x783125BB, 0x6CA8EAA2, 0xE407EAC6, 0x4B5CFC3E,
    0x9FBF8C76, 0x15CA20BE, 0xF2CA9FD3, 0x959BD756,
)
RELATION_RE = re.compile(r"^[0-9]+(\.[0-9]+)?(_fsm|_vm|_init)?$")


def pg_checksum_page(page: bytes, blkno: int) -> int:
    """Reimplements PostgreSQL's pg_checksum_page() for one 8KB page.

    Zeroes the pd_checksum field, folds 32 FNV-1a partial sums over the page,
    mixes in the block number, and reduces to the stored 1..65535 range.
    """
    buf = bytearray(page)
    buf[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 2] = b"\x00\x00"
    words = struct.unpack_from(f"<{PAGE_SIZE // 4}I", buf)

    sums = list(CHECKSUM_BASE_OFFSETS)
    rows = PAGE_SIZE // (4 * N_SUMS)  # 64 rows of 32 uint32s

    def comp(checksum: int, value: int) -> int:
        tmp = (checksum ^ value) & 0xFFFFFFFF
        return ((tmp * FNV_PRIME) ^ (tmp >> 17)) & 0xFFFFFFFF

    for i in range(rows):
        base = i * N_SUMS
        for j in range(N_SUMS):
            sums[j] = comp(sums[j], words[base + j])
    # Two extra rounds of zero mixing, exactly as the server does.
    for _ in range(2):
        for j in range(N_SUMS):
            sums[j] = comp(sums[j], 0)

    result = 0
    for s in sums:
        result ^= s
    result ^= blkno & 0xFFFFFFFF
    return (result % 65535) + 1


def relation_base_block(file_path: Path) -> int:
    """First block of a relation segment file ('12345.3' -> 3 * RELSEG_SIZE)."""
    m = re.match(r"^[0-9]+\.([0-9]+)$", file_path.name)
    return int(m.group(1)) * RELSEG_SIZE if m else 0

async def read_page_batch(file_path: Path, offset: int, batch_size: int = 32) -> bytes:
    """Reads a batch of pages using executor to avoid blocking the event loop."""
    def _sync_read():
        with open(file_path, 'rb') as f:
            f.seek(offset)
            return f.read(PAGE_SIZE * batch_size)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _sync_read)

async def validate_file_pages(file_path: Path) -> AsyncGenerator[Tuple[int, bool, str], None]:
    """Yields (page_offset, is_valid, error_msg) for each page in the file."""
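    try:
        file_size = os.path.getsize(file_path)
    except OSError as e:
        yield (0, False, f"OS error: {e}")
        return
    # Zero-length relation files are normal (for example, empty tables), so
    # there is nothing to verify and nothing to report.
    if file_size == 0:
        return
    if file_size % PAGE_SIZE != 0:
        yield (0, False, f"File size {file_size} not aligned to {PAGE_SIZE}B")
        return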

    total_pages = file_size // PAGE_SIZE
    batch_size = 32
    base_block = relation_base_block(file_path)

    for batch_start in range(0, total_pages, batch_size):
        offset = batch_start * PAGE_SIZE
        raw_data = await read_page_batch(file_path, offset, batch_size)
        pages_in_batch = min(batch_size, total_pages - batch_start)

        for i in range(pages_in_batch):
            page_offset = i * PAGE_SIZE
            page_data = raw_data[page_offset:page_offset + PAGE_SIZE]
            
            if len(page_data) < PAGE_SIZE:
                yield (offset + page_offset, False, "Truncated page at EOF")
                continue

            # Parse header and extract stored checksum
            header_fields = struct.unpack_from(HEADER_FORMAT, page_data)
            stored_checksum = header_fields[1]

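            # Skip never-initialized pages (common in sparse files). PostgreSQL
            # never writes a checksum of 0, so only an all-zero page is treated
            # as unused; a zero checksum on a page that has content is verified
            # below and reported as a mismatch.
            if stored_checksum == 0 and not any(page_data):
                continue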

            # Recompute with PostgreSQL's own algorithm, mixing in this page's
            # block number relative to the relation fork.
            blkno = base_block + batch_start + i
            computed = pg_checksum_page(page_data, blkno)

            if computed != stored_checksum:
                yield (offset + page_offset, False, f"Checksum mismatch: stored={stored_checksum:#06x}, computed={computed:#06x}")
            else:
                yield (offset + page_offset, True, "")

async def scan_directory(backup_dir: str, max_concurrency: int = 8):
    """Orchestrates concurrent scanning across all relation files."""
    base_path = Path(backup_dir)
    semaphore = asyncio.Semaphore(max_concurrency)
    results: Dict[str, list] = {"valid": [], "corrupted": [], "skipped": []}

    async def process_file(file_path: Path):
        async with semaphore:
            async for page_offset, is_valid, msg in validate_file_pages(file_path):
                if msg and "Truncated" in msg:
                    results["skipped"].append({"file": str(file_path), "offset": page_offset, "reason": msg})
                elif not is_valid:
                    results["corrupted"].append({"file": str(file_path), "offset": page_offset, "reason": msg})
                # Valid pages are not logged to avoid massive output

    tasks = []
    # Only relation files (base/, global/, pg_tblspc/) carry 8KB page checksums.
    # WAL segments use a separate CRC scheme, so pg_wal/ is skipped.
    for root, _, files in os.walk(base_path):
        if "pg_wal" in Path(root).parts or "archive_status" in root:
            continue
        for f in files:
            fpath = Path(root) / f
            if RELATION_RE.match(fpath.name):
                tasks.append(process_file(fpath))

    await asyncio.gather(*tasks)
    return results

def main():
    if len(sys.argv) != 2:
        print("Usage: python pg_backup_validator.py /path/to/base/backup", file=sys.stderr)
        sys.exit(1)
    
    backup_dir = sys.argv[1]
    # Progress goes to stderr so that stdout stays a pure JSON document.
    print(f"Starting validation scan: {backup_dir}", file=sys.stderr)
    results = asyncio.run(scan_directory(backup_dir))
    
    print(json.dumps(results, indent=2))
    if results["corrupted"]:
        sys.exit(2)  # Non-zero exit for pipeline quarantine triggers
    sys.exit(0)

if __name__ == "__main__":
    main()

Step-by-Step Execution Walkthrough

Stage the backup. Extract or mount the base backup to a scratch path, e.g. wal-g backup-fetch /mnt/dr-staging/base LATEST. Never validate against a live $PGDATA that is still receiving writes — an in-flight page rewrite produces a false mismatch.
Confirm checksums are present. Run pg_controldata /mnt/dr-staging/base | grep -i checksum and verify the version is non-zero. If it reads 0, the database cluster has no page checksums and this scanner cannot gate it.
Run the scanner. Execute python3 pg_backup_validator.py /mnt/dr-staging/base > validation.json. Traversal skips pg_wal/ and archive_status/ automatically and processes up to eight relation files concurrently.
Read the exit code. Immediately capture echo $?. Exit 0 means every page validated; exit 2 means at least one page failed and the backup is quarantined; exit 1 means bad arguments.
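Inspect the manifest. Parse validation.json for the corrupted array. Each entry names the relation file, the byte offset of the failing page within that file, and the stored-vs-computed checksum. The file name is the relfilenode (not necessarily the table OID) and its parent directory is the database OID, so map it onto a relation by running SELECT relname FROM pg_class WHERE relfilenode = <relfilenode> in that database on a reference cluster.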
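
The manifest inspection in the last step can be sketched as follows. This is an illustrative helper, not part of the scanner: locate_page and PATH_RE are hypothetical names, the path pattern assumes relations in the default tablespace (base/<database OID>/<relfilenode>), and the second sample entry is made up to show a multi-segment file.

```python
import json
import re
from typing import Dict

PAGE_SIZE = 8192
RELSEG_SIZE = 131072  # pages per 1GB segment, same constant as the scanner

# Assumed layout: .../base/<database oid>/<relfilenode>[.<segment>]
PATH_RE = re.compile(r"/base/(?P<db>\d+)/(?P<node>\d+)(?:\.(?P<seg>\d+))?$")


def locate_page(entry: Dict) -> Dict:
    """Translate one 'corrupted' entry into database OID, relfilenode, and block number."""
    m = PATH_RE.search(entry["file"])
    if m is None:
        raise ValueError(f"not a per-database relation path: {entry['file']}")
    segment = int(m.group("seg") or 0)
    return {
        "database_oid": int(m.group("db")),
        "relfilenode": int(m.group("node")),
        # Block number within the relation, not within the segment file.
        "block": segment * RELSEG_SIZE + entry["offset"] // PAGE_SIZE,
    }


# First entry is the example from this article; the second is invented.
manifest = json.loads("""
{"valid": [], "skipped": [], "corrupted": [
  {"file": "/mnt/dr-staging/base/base/16384/24591", "offset": 40960,
   "reason": "Checksum mismatch: stored=0x8f2a, computed=0x1c4d"},
  {"file": "/mnt/dr-staging/base/base/16384/24591.2", "offset": 8192,
   "reason": "Checksum mismatch (illustrative entry)"}
]}
""")
located = [locate_page(e) for e in manifest["corrupted"]]
for row in located:
    print(row)
```

The resulting block number is relative to the whole relation, which is the form PostgreSQL itself uses when it reports an invalid page.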

Verification and Expected Output

A clean backup prints a JSON document with empty corrupted and skipped arrays and returns 0:

json

{
  "valid": [],
  "corrupted": [],
  "skipped": []
}

A quarantined backup returns 2 and enumerates each failing page so the finding is actionable rather than a bare boolean:

json

{
  "valid": [],
  "corrupted": [
    {
      "file": "/mnt/dr-staging/base/base/16384/24591",
      "offset": 40960,
      "reason": "Checksum mismatch: stored=0x8f2a, computed=0x1c4d"
    }
  ],
  "skipped": []
}

Wire the exit code into a pre-promotion gate. Any non-0 result must halt WAL replay and page the on-call SRE:

bash

#!/usr/bin/env bash
set -euo pipefail

BACKUP_PATH="/mnt/dr-staging/base"
VALIDATION_LOG="/var/log/pg_dr/validation_$(date +%Y%m%d_%H%M%S).json"

# Keep stderr (progress lines, tracebacks) out of the JSON manifest.
python3 pg_backup_validator.py "$BACKUP_PATH" > "$VALIDATION_LOG" || {
    EXIT_CODE=$?
    if [ "$EXIT_CODE" -eq 2 ]; then
        echo "[$(date -u)] CRITICAL: page corruption detected. Promotion halted." | tee -a /var/log/pg_dr/alert.log
        curl -s -X POST https://alerts.internal/dr-quarantine \
             -H "Content-Type: application/json" \
             -d "{\"status\":\"quarantined\",\"log\":\"$VALIDATION_LOG\"}"
        exit 2
    fi
    # Any other non-zero exit (crash, bad arguments) must also block promotion.
    echo "[$(date -u)] ERROR: validator failed with exit $EXIT_CODE. Promotion halted." | tee -a /var/log/pg_dr/alert.log
    exit "$EXIT_CODE"
}

echo "[$(date -u)] Validation passed. Proceeding to WAL replay and promotion."
exit 0
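
For orchestrators written in Python, the same branching can be expressed without a shell wrapper. The sketch below is an assumption-laden illustration: gate_decision, run_gate, and fake_validator are hypothetical names, and the demo substitutes a child process that simply exits with a chosen code for the real scanner.

```python
import subprocess
import sys
from typing import List


def gate_decision(exit_code: int) -> str:
    """Map the validator's exit code onto a pipeline action."""
    if exit_code == 0:
        return "promote"
    if exit_code == 2:
        return "quarantine"
    # Bad arguments (1), crashes, or signals: never promote on an unknown state.
    return "halt"


def run_gate(command: List[str]) -> str:
    """Run a validator command and return the resulting decision."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return gate_decision(completed.returncode)


def fake_validator(code: int) -> List[str]:
    """Stand-in for the real scanner: a child process that exits with `code`."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]


decisions = {code: run_gate(fake_validator(code)) for code in (0, 1, 2)}
print(decisions)
```

In a real pipeline the command would be the scanner invocation from the walkthrough, with stdout captured to the manifest file instead of discarded.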

Failure Modes and Troubleshooting

Symptom	Cause	Remediation
`File size N not aligned to 8192B`	Truncated download, or a non-relation file matched by accident	Re-fetch the backup; confirm the object’s byte length matches the source segment before scanning
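Every page reports a mismatch	Cluster was initialized without `data_checksums`, so `pd_checksum` holds stale bytes	Run `pg_controldata` against the staged backup and confirm the data page checksum version is non-zero (step 2 of the walkthrough); a cluster without checksums cannot be gated by this scanner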
Mismatch on exactly one relation, clustered offsets	Genuine media/storage corruption in that relfilenode	Restore that relation from an earlier backup or rebuild it; do not promote the backup
`struct.error: unpack requires a buffer`	Misaligned block-snapshot reads (device not 8KB-aligned)	Re-present the snapshot with 8192-byte alignment, or copy to a filesystem before scanning
Scan saturates the storage target	`max_concurrency` exceeds the backup store’s IOPS ceiling	Lower `max_concurrency`; batch size caps peak memory at ~256KB per worker regardless of table size
Pages with `pd_checksum=0` flagged	Misreading unused/sparse pages as corrupt	None needed — the scanner already skips zeroed pages; PostgreSQL never writes a checksum of `0`
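
To tell scattered failures from the clustered-offset pattern in the table above, the corrupted entries can be grouped into runs of adjacent pages. This is a small triage sketch; cluster_failures is a hypothetical helper and the sample offsets are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PAGE_SIZE = 8192


def cluster_failures(corrupted: List[Dict]) -> Dict[str, List[Tuple[int, int]]]:
    """Group failing pages per file into inclusive (first_page, last_page) runs."""
    pages_by_file = defaultdict(set)
    for entry in corrupted:
        pages_by_file[entry["file"]].add(entry["offset"] // PAGE_SIZE)

    runs = {}
    for path, pages in pages_by_file.items():
        ordered = sorted(pages)
        file_runs = []
        start = prev = ordered[0]
        for page in ordered[1:]:
            if page != prev + 1:  # gap found: close the current run
                file_runs.append((start, prev))
                start = page
            prev = page
        file_runs.append((start, prev))
        runs[path] = file_runs
    return runs


# Invented sample: three adjacent pages, one isolated page, one other file.
sample = [
    {"file": "base/16384/24591", "offset": 40960},
    {"file": "base/16384/24591", "offset": 49152},
    {"file": "base/16384/24591", "offset": 57344},
    {"file": "base/16384/24591", "offset": 819200},
    {"file": "base/16384/24600", "offset": 0},
]
clusters = cluster_failures(sample)
print(clusters)
```

Page indexes here are relative to the segment file. Long contiguous runs in a single file point toward a damaged storage extent, while isolated single pages across many files suggest a different cause.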

Integration Notes

The scanner is headless and idempotent, so it drops into any orchestrator that can branch on a POSIX exit code. In Airflow, wrap it in a BashOperator whose non-zero exit fails the task and short-circuits the downstream promote_replica task; publish validation.json as an XCom so the categorization stage can classify each finding. In Celery, invoke it from a task that re-raises on exit 2 so the DR chain aborts rather than swallowing the failure. For unattended repository monitoring, a systemd timer runs it against the latest snapshot on a fixed cadence:

ini

# /etc/systemd/system/pg-backup-validator.timer
[Unit]
Description=Run PostgreSQL backup page-corruption validation every 6 hours

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true

[Install]
WantedBy=timers.target

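Downstream, a clean exit is the precondition for WAL replay. A corrupted data page otherwise stays latent until the server first reads it, at which point a checksum-enabled cluster logs page verification failed and raises invalid page in block N of relation <path>; pre-validation catches that class of failure before promotion rather than in production. The scanner does not inspect WAL or pg_control, so a startup failure such as PANIC: could not locate a valid checkpoint record still has to be caught by the restore itself. Store each validation.json in immutable object storage so there is an audit record that the backup was structurally verified before failover, which supports the backup-testing expectations of contingency-planning guidance such as NIST SP 800-34 Rev. 1. This scanner is one step in the wider drill: the same manifests that gate promotion here feed the sandbox provisioning step that stands up the isolated environment for application-level DR tests.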

Frequently Asked Questions

Why reimplement the checksum instead of running pg_checksums or pg_verifybackup?

pg_checksums operates on a cleanly shut down local $PGDATA, and pg_verifybackup checks each file against the checksum recorded in the backup manifest (CRC32C by default, optionally a SHA-2 digest), not the internal pd_checksum of each 8KB page. Neither reads directly from a remote object-store or block-snapshot target with bounded concurrency. Reimplementing pg_checksum_page() lets the scanner run against S3, NFS, or a mounted snapshot with no PostgreSQL binary present. It also catches a case the manifest cannot: a page that was already corrupt when the backup was taken matches its file-level checksum yet fails its own page checksum.

Why is the algorithm FNV-1a and not CRC32C?

PostgreSQL uses two different checksums. WAL records and the control file use CRC32C, but data pages use a custom FNV-1a block checksum defined in src/include/storage/checksum_impl.h — 32 partial sums folded over the page, mixed with the block number, and reduced to the 1..65535 range. Validating a data page with CRC32C would fail every page. The constants and mixing in this scanner match the server exactly, so a passing page is bit-for-bit identical to what PostgreSQL would compute.
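
The difference from textbook FNV-1a, and the final reduction into 1..65535, can be seen in isolation. The sketch below mirrors the comp() step and the last line of pg_checksum_page() from the scanner in this article; the function names fnv1a_step, pg_comp, and reduce_to_stored are hypothetical, and the textbook round is applied to a 32-bit word here only for comparison (the standard hash consumes bytes).

```python
FNV_PRIME = 0x01000193
MASK32 = 0xFFFFFFFF


def fnv1a_step(h: int, value: int) -> int:
    """Textbook FNV-1a round: XOR the input in, then multiply by the prime."""
    return ((h ^ value) * FNV_PRIME) & MASK32


def pg_comp(h: int, value: int) -> int:
    """The scanner's mixing round: FNV-1a plus an extra fold of the high bits."""
    tmp = (h ^ value) & MASK32
    return ((tmp * FNV_PRIME) ^ (tmp >> 17)) & MASK32


def reduce_to_stored(folded: int, blkno: int) -> int:
    """Mix in the block number and reduce to the 16-bit range that excludes 0."""
    return ((folded ^ blkno) & MASK32) % 65535 + 1


# An input with bit 17 set makes the extra shift term visible.
print(hex(fnv1a_step(0, 1 << 17)), hex(pg_comp(0, 1 << 17)))
# The reduction can produce 1 and 65535 but never 0.
print(reduce_to_stored(0, 0), reduce_to_stored(65534, 0), reduce_to_stored(MASK32, 0))
```

Because the reduction adds 1 after taking the remainder modulo 65535, the same page content stamped for two adjacent block numbers always yields different stored values.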

Why does the scanner skip pages whose stored checksum is zero?

PostgreSQL never writes a page checksum of 0: the algorithm reduces into the range 1..65535. A stored value of 0 therefore marks a page that was never initialized, typically an all-zero page in a sparse relation file. Treating those as corruption would flood the manifest with false positives, so they are skipped rather than reported as failed.

What must happen when the scanner exits with code 2?

Exit 2 is a hard quarantine: WAL replay and promotion must not proceed. The backup contains at least one page whose on-disk checksum no longer matches its contents, which the restored server would reject with an invalid-page error the first time it reads that block. The gate should halt the pipeline, persist the JSON manifest to immutable storage for audit, and page the on-call SRE to restore the affected relation from an earlier backup rather than promote the compromised set.

Page Corruption Scanning Techniques — the parent discipline this scanner implements one method of.
Python Script for MySQL Checksum Validation — the live-replica counterpart for InnoDB backup sets.
Async Batching Strategies with Python Multiprocessing — scaling this batched read pattern across multi-terabyte relations.
Error Categorization Frameworks — the severity taxonomy that turns a corrupted-page manifest into an action.
How to Map RTO and RPO for PostgreSQL Clusters — the recovery windows this pre-restore gate is budgeted against.
