Schema Validation Pipelines

A schema validation pipeline is the gate that stands between raw, normalized catalog output and the drift diff engine that acts on it. Its single job is to guarantee that every privilege tuple crossing into the comparison stage is structurally sound, semantically resolvable, deterministically ordered, and carries provenance — so that a diff is a trustworthy signal rather than an artifact of a malformed read. The failure scenario this section prevents is the silently invalid matrix: an extraction run where a role was renamed mid-read, an ACL string arrived truncated, or three of forty targets timed out, and the pipeline emits the result anyway. The diff engine then reads the missing grants as mass revocations, fires a page at 3 a.m., and burns the on-call’s trust in the entire system. Validation turns those conditions from silent corruption into loud, quarantined, actionable errors at the boundary — before a single REVOKE is ever proposed.

The tuples this gate inspects are produced upstream: the canonical shape is defined by the Cross-Environment Privilege Extraction & Parsing domain, and the per-engine translation that populates it is the concern of Cross-DB Parser Adapters. This section owns what happens after normalization and before the diff — enforcing the three invariants (determinism, completeness, provenance) that every downstream drift decision depends on, and refusing to publish when they do not hold.

Figure — The validation pipeline as four sequential gates: schema conformance, referential resolution, invariant enforcement, and canonical hashing. Any failure routes the record to a reason-tagged quarantine, and a non-empty quarantine or a missing target blocks publication of the whole matrix.

Figure — The same four gates seen as a funnel: each gate lets fewer tuples through and chutes its rejects, tagged with a distinct reason, into one shared quarantine bin. The valve to the diff engine opens only when that bin is empty and every target reported.

Prerequisites and Scope

The techniques below target PostgreSQL 12+ and MySQL 8.0.16+, the versions whose role-membership catalogs (pg_auth_members, mysql.role_edges) and object-grant views (information_schema.role_table_grants, mysql.tables_priv) behave as described. On the Python side, use Python 3.11+ (for datetime.UTC and tomllib; structural pattern matching has been available since 3.10) and pydantic 2.x, whose Rust-backed core makes per-tuple validation cheap enough to run inline on hundreds of thousands of grants. See the pydantic validators documentation for the field- and model-validator hooks this pipeline leans on.

Validation is a read-only stage. It never connects with credentials that can mutate a catalog; its only database interaction is a set of referential lookups against pg_roles / mysql.user to confirm that the grantee and object named in a tuple actually exist. Grant those lookups to the same read-only detection principal used for extraction — on PostgreSQL, membership in pg_monitor plus read on information_schema; on MySQL, SELECT on mysql.user, mysql.role_edges, and the relevant privilege tables. The extraction and normalization that feed this gate are owned by System Catalog Query Optimization (safe, low-contention catalog reads) and Async Privilege Batching (the fan-out that produces the target roster this gate checks for completeness). This section assumes you can already materialize a normalized tuple stream and focuses on proving it is safe to diff.

Core Implementation Walkthrough

Validation is four gates in series: conform the shape, resolve the references, enforce the invariants, then canonicalize and hash. A record that clears all four joins the published matrix; a record that fails any gate is quarantined with a machine-readable reason, and a non-empty quarantine blocks the whole run.

Step 1 — Conform the shape with a strict pydantic contract

The first gate rejects anything that is not structurally a privilege tuple. A frozen model with field validators catches missing or empty grantees, unmapped privilege verbs, unrecognized object types, and naive (timezone-less) timestamps; a truncated ACL string typically surfaces here as a missing or empty field. extra="forbid" ensures a parser that starts emitting an unexpected field fails loudly here rather than silently polluting the diff.

from __future__ import annotations

from datetime import datetime, timezone
from pydantic import BaseModel, ConfigDict, field_validator

# Canonical verb ontology — the only privileges the diff engine understands.
VERB_ONTOLOGY: frozenset[str] = frozenset({
    "SELECT", "INSERT", "UPDATE", "DELETE", "TRUNCATE", "REFERENCES",
    "TRIGGER", "USAGE", "EXECUTE", "CREATE", "CONNECT", "TEMPORARY",
})
OBJECT_TYPES: frozenset[str] = frozenset({
    "table", "view", "schema", "sequence", "routine", "database",
})


class ValidatedGrant(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True, str_strip_whitespace=True)

    environment: str
    engine: str
    grantee: str
    object_type: str
    object_name: str
    privilege: str
    grantable: bool = False
    source_catalog: str            # provenance — mandatory, never defaulted
    extracted_at: datetime

    @field_validator("privilege")
    @classmethod
    def _known_verb(cls, v: str) -> str:
        upper = v.upper()
        if upper not in VERB_ONTOLOGY:
            raise ValueError(f"privilege {v!r} outside canonical ontology")
        return upper

    @field_validator("object_type")
    @classmethod
    def _known_object(cls, v: str) -> str:
        if v not in OBJECT_TYPES:
            raise ValueError(f"object_type {v!r} not recognized")
        return v

    @field_validator("grantee", "object_name")
    @classmethod
    def _non_empty(cls, v: str) -> str:
        if not v:
            raise ValueError("identity field must be non-empty")
        return v

    @field_validator("extracted_at")
    @classmethod
    def _tz_aware(cls, v: datetime) -> datetime:
        if v.tzinfo is None:
            raise ValueError("extracted_at must be timezone-aware (UTC)")
        return v.astimezone(timezone.utc)

    def identity(self) -> tuple[str, str, str, str, str]:
        """The key the diff compares on — excludes grantable and provenance."""
        return (self.environment, self.grantee,
                self.object_type, self.object_name, self.privilege)

Running the whole stream through the model separates the clean tuples from the rejects, collecting every failure instead of aborting on the first bad row:

from pydantic import ValidationError

def conform(rows: list[dict]) -> tuple[list[ValidatedGrant], list[dict]]:
    valid: list[ValidatedGrant] = []
    quarantined: list[dict] = []
    for row in rows:
        try:
            valid.append(ValidatedGrant(**row))
        except ValidationError as exc:
            quarantined.append({"row": row, "stage": "schema",
                                "reason": exc.errors(include_url=False)})
    return valid, quarantined

Expected output on a clean batch is an empty quarantined list; any entry names the exact field and rule that failed, which is the first line of the validation report an auditor or on-call reads.

Step 2 — Resolve references against the live catalog

A structurally valid tuple can still be a lie: it may name a grantee that was dropped or an object that was renamed between the parser’s read and this check. The second gate confirms every tuple’s grantee and object resolve in the live catalog. On PostgreSQL, one round trip against pg_roles validates the entire grantee set:

-- Which of the extracted grantees still exist as real roles?
SELECT rolname
FROM   pg_roles
WHERE  rolname = ANY($1::text[]);   -- $1 = distinct grantees from the batch

The Python side diffs the returned set against the batch’s grantees; anything missing is a dangling reference — the exact signature of a role renamed or dropped mid-extraction — and is quarantined rather than diffed:

import psycopg

def resolve_grantees(dsn: str, grants: list[ValidatedGrant]
                     ) -> tuple[list[ValidatedGrant], list[dict]]:
    wanted = sorted({g.grantee for g in grants})
    with psycopg.connect(dsn, autocommit=True) as conn:
        with conn.cursor() as cur:
            cur.execute("SET default_transaction_read_only = on")
            cur.execute("SELECT rolname FROM pg_roles WHERE rolname = ANY(%s)",
                        (wanted,))
            live = {r[0] for r in cur.fetchall()}
    resolved, dangling = [], []
    for g in grants:
        if g.grantee in live:
            resolved.append(g)
        else:
            dangling.append({"row": g.model_dump(mode="json"),
                             "stage": "referential",
                             "reason": f"grantee {g.grantee!r} not in pg_roles"})
    return resolved, dangling

On MySQL the same check reads mysql.user (and mysql.role_edges for role-only identities), matching on the user@host pair that the adapter normalized into grantee. A dangling reference is not always an error — a role legitimately dropped between runs should disappear from the matrix — so this stage distinguishes a grantee that vanished (drop it from the matrix, note it in the report) from an object that vanished mid-read (a genuine schema-evolution event). Handling that second, harder case — a table dropped or a schema renamed while the extractor is still walking it — is worked through end to end in Handling schema drift during catalog extraction.
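The MySQL variant of the grantee check can be sketched as a plain set difference. This is a minimal illustration, not the pipeline's actual code: the function name find_dangling_accounts is hypothetical, user_rows stands in for the rows a query such as SELECT User, Host FROM mysql.user might return, and it assumes the adapter rendered every account as the literal string "user@host".

```python
def find_dangling_accounts(
    grantees: set[str],
    user_rows: list[tuple[str, str]],
) -> set[str]:
    """Return grantees that have no matching (User, Host) row.

    No database is contacted here; user_rows is supplied by the caller.
    """
    # Assumption: the adapter normalized each account to "user@host".
    live = {f"{user}@{host}" for user, host in user_rows}
    return grantees - live


# Hand-written sample data for illustration only.
rows = [("app_rw", "10.0.%"), ("reporting", "%"), ("auditor", "localhost")]
batch = {"app_rw@10.0.%", "reporting@%", "etl_legacy@%"}
dangling = find_dangling_accounts(batch, rows)
print(sorted(dangling))  # ['etl_legacy@%']
```

Matching on the full pair matters because MySQL treats the same user name at two different hosts as two distinct accounts; comparing user names alone would hide a dropped account behind a surviving namesake.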

Step 3 — Enforce the completeness, determinism, and provenance invariants

The third gate checks properties of the batch as a whole, not individual rows. Completeness: every target on the roster must have contributed at least the roles the membership graph implies, so a grant reaching a user through a two-hop role chain is present as a tuple. Determinism: no field that varies between two runs of an unchanged database may appear in the identity key. Provenance: every tuple carries a source_catalog. The membership-expansion check is the one most often skipped and the most dangerous to skip — an object grant read without expanding pg_auth_members looks complete and audits clean while hiding inherited access.

from dataclasses import dataclass

@dataclass(frozen=True)
class InvariantReport:
    complete: bool
    deterministic: bool
    has_provenance: bool
    missing_membership: frozenset[tuple[str, str]]  # (member, granted_role)

def check_invariants(grants: list[ValidatedGrant],
                     membership_edges: set[tuple[str, str]],
                     ) -> InvariantReport:
    # Provenance: every tuple must name the catalog view it came from.
    has_provenance = all(g.source_catalog for g in grants)

    # Determinism: identity keys must be unique. Two rows that share a key
    # differ only in non-identity fields, so their relative order after the
    # stable sort (and therefore the hash) would depend on input order.
    ids = [g.identity() for g in grants]
    deterministic = len(ids) == len(set(ids))

    # Completeness: every membership edge must be represented so inherited
    # privilege is materialized, not silently dropped.
    grantees_seen = {g.grantee for g in grants}
    missing = {(m, r) for (m, r) in membership_edges if m not in grantees_seen}

    return InvariantReport(
        complete=not missing,
        deterministic=deterministic,
        has_provenance=has_provenance,
        missing_membership=frozenset(missing),
    )

The membership_edges set comes straight from pg_auth_members (or mysql.role_edges), and the flattening of those edges into effective grants follows the propagation rules in Role Hierarchy Design; the reduction of a raw grant to its effective, inheritance-resolved form is the same normalization formalized in Privilege Scope Mapping. If any invariant fails, the pipeline does not publish — it emits the report and stops, because a batch that violates completeness or determinism will manufacture drift on the next diff.
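To make the two-hop case concrete, the following sketch expands (member, granted_role) edges into the full set of roles each member reaches. It is an illustration under assumptions: the function name effective_roles is hypothetical, and the edges are taken to have the same (member, granted_role) shape as membership_edges above. It is not the flattening logic defined in Role Hierarchy Design, only a plain graph walk.

```python
from collections import defaultdict


def effective_roles(edges: set[tuple[str, str]]) -> dict[str, frozenset[str]]:
    """Map each member to every role it reaches directly or transitively."""
    direct: dict[str, set[str]] = defaultdict(set)
    for member, role in edges:
        direct[member].add(role)

    def reach(start: str) -> frozenset[str]:
        seen: set[str] = set()
        stack = list(direct.get(start, ()))
        while stack:
            role = stack.pop()
            if role in seen:
                continue  # already expanded; also stops membership cycles
            seen.add(role)
            stack.extend(direct.get(role, ()))
        seen.discard(start)  # a member is not a role granted to itself
        return frozenset(seen)

    return {member: reach(member) for member in list(direct)}


# alice -> analysts -> readers is the two-hop chain described above.
sample_edges = {("alice", "analysts"), ("analysts", "readers")}
roles = effective_roles(sample_edges)
print(sorted(roles["alice"]))  # ['analysts', 'readers']
```

A completeness check built on this expansion could require a tuple for every privilege held by any role in a member's reachable set, which is stricter than confirming the member appears somewhere in the batch.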

Step 4 — Canonicalize, hash, and gate publication

The final gate imposes a total order and a single case-handling rule, then content-addresses the batch. A stable sort plus a SHA-256 over the serialized tuples gives the batch a fingerprint that is byte-identical across runs of an unchanged database — the property that lets the diff engine skip reconciliation entirely when the fingerprint matches the last accepted one (the idempotency shortcut).

import hashlib
import json

def canonicalize(grants: list[ValidatedGrant]) -> tuple[list[dict], str]:
    ordered = sorted(grants, key=lambda g: g.identity())
    payload = [
        {"identity": g.identity(), "grantable": g.grantable,
         "source_catalog": g.source_catalog}
        for g in ordered
    ]
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(blob.encode()).hexdigest()
    return payload, digest


def publish_or_block(payload, digest, quarantine, expected_targets,
                     seen_targets) -> dict:
    missing_targets = set(expected_targets) - set(seen_targets)
    ok = not quarantine and not missing_targets
    return {
        "published": ok,
        "matrix_sha256": digest,
        "tuple_count": len(payload),
        "quarantined": len(quarantine),
        "missing_targets": sorted(missing_targets),
        "matrix": payload if ok else None,   # withheld unless the gate opens
    }

The gate opens only when the quarantine is empty and every expected target reported. A partial batch — three targets timed out — is withheld with its missing_targets listed, so the diff engine never sees a hole it would read as revocations. The validated matrix, once published, is the input to Drift Detection Engines & Diff Logic, where each surviving delta is weighted by Rule-Based Drift Scoring and the resulting GRANT/REVOKE plan is sequenced by Grant and Revoke Chain Logic.
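The order-independence that the fingerprint relies on can be checked in isolation. The sketch below uses bare identity tuples rather than ValidatedGrant objects, and the helper name fingerprint is hypothetical; it mirrors the sort-then-hash idea of canonicalize without reproducing it.

```python
import hashlib
import json


def fingerprint(rows: list[tuple[str, ...]]) -> str:
    """SHA-256 over the rows after imposing a total order."""
    ordered = sorted(rows)
    blob = json.dumps(ordered, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()


# Hand-written sample identity tuples.
rows = [
    ("prod", "alice", "table", "public.orders", "SELECT"),
    ("prod", "bob", "table", "public.orders", "INSERT"),
    ("prod", "alice", "schema", "public", "USAGE"),
]
same_rows_reordered = list(reversed(rows))
one_more_grant = rows + [("prod", "bob", "table", "public.orders", "DELETE")]

# Arrival order does not matter; content does.
assert fingerprint(rows) == fingerprint(same_rows_reordered)
assert fingerprint(rows) != fingerprint(one_more_grant)
```

If the first assertion ever fails for a real batch, the usual suspect is a key that does not define a total order, for example duplicate identities whose relative position depends on the order in which targets reported.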

Idempotency and the Dry-Run Safety Contract

The pipeline is designed to run on a cron every few minutes, forever, and never mutate a catalog. The contract has three clauses. First, determinism as a fixed point: because canonicalization imposes a total order and hashes the result, two runs against an unchanged database produce the same matrix_sha256. A pipeline that stores the last accepted digest can short-circuit — if the new digest matches, there is no drift to compute, so the diff engine is never invoked. That shortcut is only sound because Step 3 proved determinism; a timestamp in the identity key would break the fixed point and manufacture a new hash on every run.
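One way to implement the short-circuit is to keep the last accepted digest in a small state file. This is a sketch under assumptions: the helper names needs_diff and record_accepted and the file name last_accepted.sha256 are hypothetical, and a real deployment might keep the digest in a database or object store instead.

```python
import tempfile
from pathlib import Path


def needs_diff(digest: str, state_file: Path) -> bool:
    """True when the digest differs from the last accepted one (or none is stored)."""
    last = state_file.read_text().strip() if state_file.exists() else None
    return digest != last


def record_accepted(digest: str, state_file: Path) -> None:
    """Persist the digest of a matrix that was published and diffed."""
    state_file.write_text(digest + "\n")


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "last_accepted.sha256"
    first = needs_diff("abc123", state)   # nothing stored yet
    record_accepted("abc123", state)
    second = needs_diff("abc123", state)  # unchanged database
    third = needs_diff("def456", state)   # the matrix changed

print(first, second, third)  # True False True
```

Recording the digest only after the diff has completed keeps a crashed run from being mistaken for an accepted one; the next run sees a mismatch and retries.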

Second, read-only by construction. The only database access this stage performs is the referential lookup in Step 2, and it runs with default_transaction_read_only = on under the detection principal. Validation holds no credential that can issue DDL, so a validation run — even against a wrong manifest — cannot alter state. In CI, the pipeline runs in dry-run mode: it validates a captured snapshot and asserts an empty quarantine and a full target roster, failing the build if either is violated, so a pull request that would ship a malformed extractor is rejected before merge.

def assert_publishable(result: dict) -> None:
    if not result["published"]:
        raise SystemExit(
            f"validation gate closed: {result['quarantined']} quarantined "
            f"tuple(s), missing targets {result['missing_targets']}"
        )   # non-zero exit fails CI

Third, quarantine is inspectable, not silent. Every rejected record carries its stage and reason, so the quarantine buffer is an evidence artifact in its own right: it tells you why a run was blocked, which is the difference between a pipeline you trust and one you learn to ignore. Re-running after the underlying cause is fixed converges to an empty quarantine and a clean publish — the same input always yields the same verdict.
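Because every entry carries a stage and a reason, the quarantine can be rolled up into a short report. The sketch below is illustrative: summarize_quarantine is a hypothetical helper, and the sample entries are hand-written to imitate the shapes built in Steps 1 and 2 (a list of error dicts with loc and type keys for schema rejects, a plain string for referential ones).

```python
from collections import Counter


def summarize_quarantine(quarantine: list[dict]) -> Counter:
    """Count quarantined records per (stage, reason label) pair."""
    counts: Counter = Counter()
    for entry in quarantine:
        reason = entry["reason"]
        if isinstance(reason, list):
            # Schema-stage reasons: one error dict per failed field.
            for err in reason:
                field = ".".join(str(part) for part in err.get("loc", ()))
                label = f"{field}: {err.get('type', 'error')}"
                counts[(entry["stage"], label)] += 1
        else:
            counts[(entry["stage"], str(reason))] += 1
    return counts


# Hand-written sample entries, not real pipeline output.
sample = [
    {"row": {}, "stage": "schema",
     "reason": [{"loc": ("privilege",), "type": "value_error"}]},
    {"row": {}, "stage": "schema",
     "reason": [{"loc": ("privilege",), "type": "value_error"}]},
    {"row": {}, "stage": "referential",
     "reason": "grantee 'etl_legacy' not in pg_roles"},
]
summary = summarize_quarantine(sample)
for (stage, label), n in summary.most_common():
    print(f"{n:>3}  {stage:<12} {label}")
```

A summary dominated by one field and one rule usually points at the adapter rather than at the data, which matches the second row of the troubleshooting matrix.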

Compliance Alignment and Evidence Artifacts

The validation gate is the evidence-generation layer for the controls that demand provable access data, not merely present access data. Its two artifacts — the published matrix’s provenance manifest and the quarantine/validation report — map directly onto framework clauses.

SOC 2 — CC6.1 and CC7.2. CC6.1 requires that logical access be restricted and reviewed; the content-addressed matrix with per-tuple source_catalog provenance is contemporaneous evidence that each asserted privilege traces to a system-of-record catalog view. CC7.2 (monitoring for anomalies) is satisfied by the validation report itself — a blocked run with a non-empty quarantine is a detected, logged anomaly with a root cause attached.
HIPAA — §164.312(b). The audit-controls standard requires mechanisms that record and examine access-relevant activity. The signed provenance manifest, hashed and timestamped, is a tamper-evident record that the access data underpinning every ePHI drift decision was validated before use.
PCI DSS — Requirements 7 and 10. Requirement 7 expects a machine-verifiable access matrix restricted to least privilege; the published matrix is that artifact, regenerated and revalidated every run. Requirement 10’s logging and integrity-monitoring clauses are met by the matrix_sha256 fingerprint, which lets an assessor prove the matrix reviewed at audit time is byte-for-byte the one produced by the run.

Because provenance and the control mapping are explicit metadata on each tuple, evidence generation is a projection over the validated matrix rather than a manual reconciliation, mirroring the classification-to-access model in Privilege Scope Mapping. The canonical control language behind these mappings is the NIST SP 800-53 Rev. 5 Access Control family (AC-6, least privilege; AU-family for audit). Legitimately-authorized deviations — a sanctioned break-glass grant that would otherwise read as a validation-passing anomaly — are matched and suppressed downstream by Exception Routing and Whitelisting, and the cross-environment parity the validated matrix enables is consumed by Environment Comparison Workflows.
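As an illustration of evidence generation as a projection, the sketch below derives a small manifest from a published payload and its digest. The function name build_manifest and the manifest field names are assumptions made for this example, the payload shape follows canonicalize in Step 4, and no signing is performed here.

```python
from collections import Counter
from datetime import datetime, timezone


def build_manifest(payload: list[dict], digest: str) -> dict:
    """Project a published matrix into a provenance summary."""
    per_catalog = Counter(t["source_catalog"] for t in payload)
    return {
        "matrix_sha256": digest,
        "tuple_count": len(payload),
        # Sorted so the manifest body is stable across runs.
        "tuples_by_source_catalog": dict(sorted(per_catalog.items())),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


# Hand-written sample payload in the shape produced by canonicalize.
sample_payload = [
    {"identity": ["prod", "alice", "table", "public.orders", "SELECT"],
     "grantable": False,
     "source_catalog": "information_schema.role_table_grants"},
    {"identity": ["prod", "bob", "table", "public.orders", "INSERT"],
     "grantable": False,
     "source_catalog": "information_schema.role_table_grants"},
    {"identity": ["prod", "app@%", "table", "shop.orders", "SELECT"],
     "grantable": True,
     "source_catalog": "mysql.tables_priv"},
]
manifest = build_manifest(sample_payload, "0" * 64)
print(manifest["tuples_by_source_catalog"])
```

The generated_at field is deliberately kept outside the hashed matrix, so recording when the evidence was produced does not disturb the fingerprint.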

Troubleshooting Matrix

Failure scenario	Root-cause signature	Remediation
Every run manufactures fresh drift on an unchanged database	`matrix_sha256` differs each run; identity keys duplicate in Step 3	A non-identity field (timestamp, OID, locale-folded case) leaked into `identity()`. Confirm the identity key is the five stable fields and that case-folding is applied per engine before hashing.
Whole batch blocked, quarantine full of `schema`-stage rejects	pydantic errors cite the same field across many rows	The parser adapter changed its output shape or started emitting an unmapped verb. Update `VERB_ONTOLOGY`/`OBJECT_TYPES` or fix the adapter — do not loosen `extra="forbid"`.
Grants for a real role quarantined as `referential` dangling	Grantee absent from `pg_roles`/`mysql.user` at Step 2	The role was renamed or dropped between the parser read and validation. If intentional, let it drop from the matrix; if mid-extraction, treat as schema drift and re-extract the affected target.
Matrix withheld, `missing_targets` non-empty	Fewer targets reported than the roster expects	One or more targets timed out during batching. The gate is working — never publish a partial matrix. Re-queue the failed targets via backoff/checkpoint before republishing.
Completeness fails though object grants look present	`missing_membership` non-empty in the invariant report	The batch read direct object grants but never expanded `pg_auth_members`/`mysql.role_edges`. Add membership expansion to the adapter so inherited privilege materializes as tuples.
Snowflake tuples flagged as stale/inconsistent	Provenance names an `ACCOUNT_USAGE` view; counts lag the live grant set	The lagging catalog (latency varies by view and can reach roughly two hours) was read as real-time. Tag those tuples with a freshness tolerance and validate them against a same-lag baseline, not a fresh manifest.

Schema Validation Questions Engineers Ask

Where does validation sit relative to parsing and diffing? Strictly between them. Parser adapters emit normalized tuples; validation is the gate that decides whether that batch is safe to compare; the diff engine only ever sees a matrix that cleared all four gates. Putting validation after the diff would mean the diff already acted on bad data — the whole point is to fail before a REVOKE is proposed.

Why quarantine a malformed record instead of dropping it? Because a dropped grant is indistinguishable from a revoked grant to the diff engine, and a silent drop is the most dangerous failure mode — it looks like a clean audit. Quarantining keeps the record, tags it with the stage and reason it failed, and blocks the whole publish so a human sees why rather than inheriting a hole in the matrix.

Is it safe to run the validation pipeline against production? Yes. Its only database access is the read-only referential lookup in Step 2, run under default_transaction_read_only = on with the detection principal. It holds no DDL-capable credential, so even a wrong manifest cannot mutate a catalog. Everything else — schema conformance, invariant checks, hashing — is pure computation over the extracted snapshot.

How does validation make the diff idempotent? By canonicalizing and hashing the batch into a stable matrix_sha256. Two runs of an unchanged database produce the same fingerprint, so a pipeline that remembers the last accepted digest can skip the diff entirely when they match. That shortcut is only sound because Step 3 first proves no non-deterministic field is in the identity key.

What is the difference between a dangling grantee and true schema drift? A dangling grantee is a role that no longer resolves in pg_roles/mysql.user — often a legitimate drop between runs, so the tuple simply leaves the matrix. True schema drift is an object (table, schema) that changed during the read, yielding orphaned references mid-batch; that requires re-extraction and is handled in Handling schema drift during catalog extraction.

Cross-DB Parser Adapters — produces the normalized tuples this gate validates.
System Catalog Query Optimization — the low-contention catalog reads whose output feeds validation.
Async Privilege Batching — the fan-out that defines the target roster the completeness gate checks.
Handling schema drift during catalog extraction — reconciling objects that change mid-read before they reach this gate.
Drift Detection Engines & Diff Logic — the consumer of the validated, published matrix.
