Cross-Environment Privilege Extraction & Parsing

Automated RBAC drift detection begins with a single, non-negotiable prerequisite: accurate, synchronized privilege extraction across every database environment in your topology. For Database Reliability Engineers, compliance officers, and platform operators, the gap between declared infrastructure-as-code state and the grants actually resident in a catalog is where regulatory exposure silently accumulates. This section owns the first stage of the drift pipeline — turning fragmented, vendor-specific catalog output from PostgreSQL, MySQL, Oracle, and cloud data warehouses into one canonical, diff-ready dataset that every downstream engine can trust. Get this stage wrong and everything after it — scoring, alerting, remediation, audit evidence — inherits the error.

Figure — Extraction architecture: vendor-specific catalogs are read through async, rate-limited batches, normalized by per-engine parser adapters, validated, and emitted as a single canonical privilege matrix that feeds the diff engine.

What a Canonical Privilege Matrix Actually Is

Before any code runs, the pipeline needs a precise data model, because every guarantee downstream depends on it. The canonical unit of this domain is the privilege tuple: a fully-qualified statement that a specific grantee holds a specific privilege on a specific object in a specific environment. Concretely, a tuple is (environment, engine, grantee, object_type, object_name, privilege, grantable). The canonical privilege matrix is the deduplicated set of all such tuples across your estate. It is deliberately flat — inheritance chains, ACL arrays, and role membership graphs are all resolved into explicit, atomic tuples so that a diff becomes a set-difference operation rather than a graph-comparison problem.

Three invariants govern this stage and every page beneath it depends on them holding:

Determinism. Two extraction runs against an unchanged database must produce byte-identical canonical output, apart from the extraction timestamp carried as provenance. Non-determinism (unsorted result sets, timestamp fields inside the identity key, locale-dependent case folding) manifests downstream as phantom drift that erodes trust in the whole system.
Completeness. Every effective privilege must appear as at least one tuple. A grant that reaches a user only through a two-hop role chain is exactly as real as a direct grant, and the matrix must contain it. Silent omission is the most dangerous failure mode because it looks like a clean audit.
Provenance. Every tuple carries the catalog view and extraction timestamp it came from. When an auditor asks “how do you know this is true,” provenance is the answer.

The identity of a tuple (the fields a diff keys on) is five of the seven fields: (environment, grantee, object_type, object_name, privilege). The engine field, the grantable flag, and the provenance metadata are attributes of a tuple, not part of its identity. That matters because a change to WITH GRANT OPTION is a modification of an existing tuple, not a delete-plus-insert. Modeling that distinction correctly is the difference between an accurate delta and a noisy one. The distinction between a raw grant and its effective, inheritance-resolved form is the same one formalized in Privilege Scope Mapping, and the resolution of who-inherits-what follows the propagation rules in Role Hierarchy Design.

A minimal, strongly-typed manifest of the tuple keeps the contract explicit across the whole pipeline:

from __future__ import annotations

from datetime import datetime, timezone
from pydantic import BaseModel, Field


class PrivilegeGrant(BaseModel):
    environment: str          # "prod" | "staging" | "dev"
    engine: str               # "postgresql" | "mysql" | "oracle" | "snowflake"
    grantee: str              # role or user name, engine-normalized
    object_type: str          # "table" | "view" | "schema" | "sequence" | "routine"
    object_name: str          # fully qualified: schema.object
    privilege: str            # canonical verb: SELECT, INSERT, USAGE, EXECUTE, ...
    grantable: bool = False   # WITH GRANT OPTION / admin_option
    source_catalog: str       # provenance: e.g. "information_schema.role_table_grants"
    extracted_at: datetime = Field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def identity(self) -> tuple[str, str, str, str, str]:
        """The tuple key a drift diff compares on. Deliberately excludes
        grantable and provenance so an option change reads as a modify."""
        return (
            self.environment,
            self.grantee,
            self.object_type,
            self.object_name,
            self.privilege,
        )
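
As a rough illustration of the determinism invariant, the sketch below serializes a set of tuples in a fixed order and hashes the result, leaving the extraction timestamp out so that two runs against an unchanged catalog compare equal. It works on plain dicts rather than the PrivilegeGrant model to stay dependency-free; canonical_digest and CANONICAL_FIELDS are hypothetical names introduced here, not part of any library.

```python
import hashlib
import json

# Hypothetical serialization order. extracted_at is deliberately absent so
# that re-running against an unchanged catalog yields the same bytes.
CANONICAL_FIELDS = (
    "environment", "engine", "grantee", "object_type",
    "object_name", "privilege", "grantable", "source_catalog",
)


def canonical_digest(rows: list[dict]) -> str:
    """Return a SHA-256 hex digest over a sorted, deduplicated row set."""
    # Project each row onto the canonical fields; the set removes duplicates.
    projected = {tuple(row[field] for field in CANONICAL_FIELDS) for row in rows}
    # Sorting the tuples fixes a total order independent of fetch order.
    payload = json.dumps(sorted(projected), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Comparing the digest of two consecutive runs is one cheap way to notice phantom drift before it reaches the diff engine: if nothing changed in the catalog, the digests should match.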

Reading Grants Across PostgreSQL, MySQL, Oracle, and Cloud Warehouses

Modern database estates rarely run on a single engine, and each engine exposes role and privilege metadata through a different set of system tables, views, and procedural APIs. The extraction stage must speak all of them while emitting one shape. The table below is the source of truth for which catalog view answers which question on each engine; the parser adapters described later are written directly against it.

Figure — The source of truth for extraction: each cell is the exact catalog view a parser adapter reads to answer one question on one engine. Blank future-grant cells (MySQL, Oracle) have no native catalog for them.

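On PostgreSQL, object-level grants are cleanest to read from information_schema.role_table_grants, which already explodes the ACL arrays stored in pg_class.relacl into one row per privilege. The view lists only grants whose grantor or grantee is a role enabled for the current session, so the extraction role needs broad enough role membership (or superuser) for the result to be complete: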

SELECT
    grantee,
    table_schema,
    table_name,
    privilege_type,
    is_grantable
FROM information_schema.role_table_grants
WHERE grantee <> 'PUBLIC'
ORDER BY grantee, table_schema, table_name, privilege_type;

Role membership — the inheritance edges the matrix must flatten — lives in pg_auth_members, joined back to pg_roles to resolve OIDs to names:

SELECT
    member_role.rolname   AS member,
    granted_role.rolname  AS granted_role,
    m.admin_option
FROM pg_auth_members m
JOIN pg_roles member_role  ON member_role.oid  = m.member
JOIN pg_roles granted_role ON granted_role.oid = m.roleid
ORDER BY member, granted_role;

On MySQL 8.0+, object grants come from information_schema privilege views (SCHEMA_PRIVILEGES, TABLE_PRIVILEGES), while role membership — the analogue of pg_auth_members — lives in mysql.role_edges:

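-- In mysql.role_edges the FROM_* columns name the role being granted and
-- the TO_* columns name the account (or role) that receives it.
SELECT
    TO_USER    AS member_user,
    TO_HOST    AS member_host,
    FROM_USER  AS granted_role,
    FROM_HOST  AS granted_role_host,
    WITH_ADMIN_OPTION
FROM mysql.role_edges
ORDER BY member_user, member_host, granted_role, granted_role_host;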

The PostgreSQL pg_auth_members versus MySQL mysql.role_edges split is the single most common source of extraction bugs, because MySQL identities are user@host pairs while PostgreSQL identities are bare role names — the adapter has to normalize both into one grantee field without losing the host scope that MySQL relies on for access decisions.
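
One possible way to fold both identity shapes into a single grantee string is sketched below. The user@host convention and the normalize_grantee helper are assumptions made for illustration; a real adapter might prefer a structured field instead of a formatted string.

```python
from typing import Optional


def normalize_grantee(engine: str, user: str, host: Optional[str] = None) -> str:
    """Fold an engine-specific principal into one grantee string.

    Assumed convention: MySQL principals keep their host scope as user@host,
    with a missing host treated as the '%' wildcard (as MySQL itself does for
    account names). Every other engine uses the bare role name.
    """
    if engine == "mysql":
        return f"{user}@{host or '%'}"
    if host is not None:
        # A host on a non-MySQL principal signals a mis-routed row.
        raise ValueError(f"{engine} principals carry no host scope: {host!r}")
    return user
```

Keeping the host inside the grantee means two MySQL accounts that share a user name but differ by host stay distinct tuples, which is what the diff needs.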

On Oracle, there is no single grants view; effective privilege reconstruction requires unioning DBA_SYS_PRIVS (system privileges), DBA_TAB_PRIVS (object privileges), and DBA_ROLE_PRIVS (role membership). The full walk-through of that reconstruction, including the recursive role expansion Oracle needs, is covered in Extracting user grants from Oracle data dictionary. Cloud warehouses add their own dialects — Snowflake exposes SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES with a materialization lag, and Redshift overlays SVV_RELATION_PRIVILEGES on a PostgreSQL-derived catalog with subtle column differences.
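
Whichever engine supplies the membership edges, flattening them is a transitive-closure problem. The sketch below assumes the edges have already been read as (member, granted_role) pairs, as the pg_auth_members query above returns them; effective_roles is a hypothetical helper, and it ignores engine-specific switches such as PostgreSQL's NOINHERIT.

```python
from collections import defaultdict


def effective_roles(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Expand (member, granted_role) edges into every role each member
    reaches, directly or through a chain. Cycles are tolerated."""
    direct: dict[str, set[str]] = defaultdict(set)
    for member, role in edges:
        direct[member].add(role)

    closure: dict[str, set[str]] = {}
    for member in list(direct):
        seen: set[str] = set()
        stack = list(direct[member])
        while stack:
            role = stack.pop()
            if role in seen:
                continue  # already expanded; also breaks cycles
            seen.add(role)
            stack.extend(direct.get(role, ()))
        closure[member] = seen
    return closure
```

An adapter could then re-emit each reached role's object grants as explicit tuples for the member, which is what makes a two-hop grant visible in the flat matrix.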

Querying these catalogs against live production is not free. Catalog reads can contend for locks, invalidate metadata caches, and exhaust connection pools under load. Constraining extraction to indexed metadata views, read-only transactions, and statement timeouts is the job of System Catalog Query Optimization, which keeps extraction windows inside SLA even on high-throughput OLTP primaries.

The same tuple must resolve identically whether it was read from a production primary, a staging replica, or a developer sandbox. Environment is part of the identity key precisely so that the same grantee holding the same privilege in two environments produces two distinct tuples — that is what lets the diff engine answer “is staging still a faithful mirror of production?” This cross-environment parity check is the foundation the Environment Comparison Workflows build on.
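
The parity question reduces to two set differences once the environment field is stripped from the identity key. A minimal sketch, assuming identities are the five-field tuples described earlier with environment in first position; parity_gaps is a hypothetical name:

```python
def parity_gaps(
    matrix: set[tuple[str, str, str, str, str]],
    baseline_env: str,
    candidate_env: str,
) -> tuple[set, set]:
    """Compare two environments' grants with the environment field removed.

    Returns (missing_in_candidate, extra_in_candidate).
    """
    def strip_env(env: str) -> set:
        # Drop the leading environment field so tuples become comparable.
        return {key[1:] for key in matrix if key[0] == env}

    baseline = strip_env(baseline_env)
    candidate = strip_env(candidate_env)
    return baseline - candidate, candidate - baseline
```

Two empty sets mean the candidate environment mirrors the baseline exactly; anything else is a concrete list of grants to investigate.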

Where Python Pipelines Hook Into Extraction

The automation surface of this domain is a fan-out over targets and a fan-in into one matrix. When you orchestrate extraction across dozens of clusters spanning development, staging, and production, sequential execution becomes a structural bottleneck; a single slow replica stalls the entire compliance cycle. Parallelizing catalog reads under a bounded concurrency limit is the responsibility of Async Privilege Batching, which dispatches concurrent, rate-limited extraction jobs and aggregates their results into a unified staging buffer. Using non-blocking I/O and connection multiplexing decouples extraction latency from the reporting cadence.

The core pattern is a semaphore-bounded gather over asyncpg connections. Note that concurrency is capped so extraction never saturates a target’s connection limit, and per-target failures are isolated rather than aborting the batch:

import asyncio
import asyncpg

CATALOG_QUERY = """
    SELECT grantee, table_schema, table_name, privilege_type, is_grantable
    FROM information_schema.role_table_grants
    WHERE grantee <> 'PUBLIC'
"""


async def extract_one(
    environment: str, dsn: str, sem: asyncio.Semaphore
) -> list[dict]:
    async with sem:  # bound concurrency to protect target connection pools
        conn = await asyncpg.connect(dsn, timeout=10)
        try:
            await conn.execute("SET statement_timeout = '15s'")
            rows = await conn.fetch(CATALOG_QUERY)
        finally:
            await conn.close()
    return [{"environment": environment, **dict(row)} for row in rows]


async def extract_all(
    targets: dict[str, str], max_concurrency: int = 8
) -> tuple[list[dict], list[str]]:
    sem = asyncio.Semaphore(max_concurrency)
    tasks = [extract_one(env, dsn, sem) for env, dsn in targets.items()]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    grants: list[dict] = []
    failed: list[str] = []
    for env, result in zip(targets, results):
        if isinstance(result, BaseException):  # also catches CancelledError
            failed.append(env)   # re-queued by the caller, never silently dropped
        else:
            grants.extend(result)
    return grants, failed

The exact production-hardened form of this scraper — connection reuse, per-engine dispatch, and checkpointing — is worked through in Python scripts for async batch privilege scraping. For the underlying event-loop and driver semantics, the asyncio documentation is the authoritative reference.

Raw rows are not yet a matrix. Cross-DB Parser Adapters translate each engine’s grant syntax, role hierarchy, and object-level permissions into the shared PrivilegeGrant shape — flattening inheritance chains, resolving PUBLIC grants, and mapping engine-specific verbs onto one standardized ontology. Every adapter implements the same interface so the orchestrator never branches on engine at the call site:

from typing import Protocol


class PrivilegeAdapter(Protocol):
    engine: str

    def normalize(self, raw_row: dict, environment: str) -> PrivilegeGrant:
        """Map one engine-specific catalog row onto the canonical tuple."""
        ...


class PostgresAdapter:
    engine = "postgresql"

    # Engine verbs -> canonical ontology. Kept explicit so a new engine
    # privilege fails loudly rather than leaking through unmapped.
    _VERB = {
        "SELECT": "SELECT", "INSERT": "INSERT", "UPDATE": "UPDATE",
        "DELETE": "DELETE", "TRUNCATE": "TRUNCATE", "REFERENCES": "REFERENCES",
        "TRIGGER": "TRIGGER", "USAGE": "USAGE", "EXECUTE": "EXECUTE",
    }

    def normalize(self, raw_row: dict, environment: str) -> PrivilegeGrant:
        return PrivilegeGrant(
            environment=environment,
            engine=self.engine,
            grantee=raw_row["grantee"],
            object_type="table",  # role_table_grants also lists views; both are reported as "table" here
            object_name=f'{raw_row["table_schema"]}.{raw_row["table_name"]}',
            privilege=self._VERB[raw_row["privilege_type"]],
            grantable=raw_row["is_grantable"] == "YES",
            source_catalog="information_schema.role_table_grants",
        )

Once normalized, the matrix feeds an idempotent apply stage. Reconciliation never issues blind REVOKE ALL or blanket GRANT; it computes the set difference between desired and actual tuples and emits only the minimal delta. Because tuple identity is well-defined, the delta computation is a pure set operation:

def compute_delta(
    desired: list[PrivilegeGrant], actual: list[PrivilegeGrant]
) -> tuple[set, set]:
    desired_ids = {g.identity() for g in desired}
    actual_ids = {g.identity() for g in actual}
    to_grant = desired_ids - actual_ids   # present in IaC, missing in catalog
    to_revoke = actual_ids - desired_ids  # present in catalog, absent from IaC
    return to_grant, to_revoke
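
compute_delta compares identities only, so a change that touches nothing but the grantable flag produces no entry in either set. One way to surface it as a modify, sketched under the assumption that each side has been reduced to a mapping from identity to grantable flag (compute_delta_with_modifies is a hypothetical name):

```python
def compute_delta_with_modifies(
    desired: dict[tuple, bool], actual: dict[tuple, bool]
) -> tuple[set, set, set]:
    """desired/actual map a tuple identity to its grantable flag.

    Returns (to_grant, to_revoke, to_modify).
    """
    to_grant = set(desired.keys() - actual.keys())
    to_revoke = set(actual.keys() - desired.keys())
    # Same identity on both sides, but the grant option differs.
    to_modify = {
        key for key in desired.keys() & actual.keys()
        if desired[key] != actual[key]
    }
    return to_grant, to_revoke, to_modify
```

The mappings could be built with {g.identity(): g.grantable for g in grants}, which keeps the identity contract of PrivilegeGrant unchanged.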

The DDL generated from that delta is wrapped in explicit transactions with existence-check guards — pg_auth_members / mysql.role_edges lookups for membership, information_schema checks for object grants — so repeated runs converge to the declared state with no side effects. The cascade semantics of the resulting GRANT/REVOKE statements, including dependent-object handling, follow Grant and Revoke Chain Logic. The full extraction-to-scoring handoff is owned by Drift Detection Engines & Diff Logic, where each surviving delta is weighted by Rule-Based Drift Scoring.

Which Controls This Extraction Stage Satisfies

Extraction is not merely a technical convenience; it is the evidence-generation layer for several regulatory controls. Because the canonical matrix is deterministic and carries provenance, it produces audit artifacts that map directly onto framework requirements.

SOC 2 — CC6.1 and CC6.3. The Trust Services Criteria require that logical access controls be documented, enforced, and reviewed. A timestamped canonical matrix, diffed against a version-controlled manifest, is contemporaneous evidence that access was reviewed on a defined cadence. The source_catalog provenance field satisfies the auditor’s need to trace each asserted privilege back to a system-of-record catalog view.
HIPAA — §164.312(a)(1). The access control standard for electronic protected health information demands that only authorized principals reach ePHI. Cross-environment extraction proves the same least-privilege posture holds in every environment where ePHI is present, including non-production copies, which are a frequent audit blind spot.
PCI DSS — Requirement 7. Access to cardholder data must be restricted to a documented business need. The extracted matrix, tagged with data classification per object, is the machine-verifiable “access matrix” Requirement 7 expects — and unlike a spreadsheet it is regenerated on every run rather than drifting from reality.

The mapping from a tuple to a control is explicit metadata, so evidence generation is a projection over the matrix rather than a manual reconciliation. Each tuple can be enriched with the control it supports, and the classification-to-access alignment mirrors the model in Privilege Scope Mapping. For the canonical control language behind these mappings, the NIST SP 800-53 Rev. 5 Access Control family (AC-6, least privilege) is the reference standard. Suppression of legitimately-authorized deviations — so that a sanctioned break-glass grant does not read as a control failure — is handled downstream by Exception Routing and Whitelisting.
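
To make "evidence generation is a projection" concrete, the sketch below filters a matrix down to the tuples that support one control. The classification labels, the CONTROLS_BY_CLASSIFICATION table, and evidence_for are all illustrative assumptions; which controls apply to which data tier is a decision for your compliance team, not something this code settles.

```python
# Hypothetical classification -> control mapping, for illustration only.
CONTROLS_BY_CLASSIFICATION = {
    "ephi": ["HIPAA 164.312(a)(1)", "SOC2 CC6.1"],
    "cardholder": ["PCI DSS Req 7", "SOC2 CC6.1"],
    "internal": ["SOC2 CC6.1"],
}


def evidence_for(
    control: str,
    grants: list[dict],
    classification_of: dict[str, str],
) -> list[dict]:
    """Project the matrix down to the tuples that support one control."""
    selected = []
    for grant in grants:
        # Objects without a classification never count as evidence.
        tier = classification_of.get(grant["object_name"], "unclassified")
        if control in CONTROLS_BY_CLASSIFICATION.get(tier, []):
            selected.append({**grant, "classification": tier, "control": control})
    # Stable ordering keeps the evidence report reproducible between runs.
    return sorted(
        selected,
        key=lambda g: (g["grantee"], g["object_name"], g["privilege"]),
    )
```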

Failure Modes That Break Naive Extractors

Extraction runs against live, distributed, version-skewed infrastructure, so the failure surface is large. The following are the recurring ways a naive implementation produces a matrix that is quietly wrong — the most dangerous kind of wrong, because it passes review.

Partial extraction that looks complete. A batch where three of forty targets timed out must not emit a matrix, or the diff engine will read the missing targets’ grants as mass revocations. The orchestrator classifies failures as recoverable (timeout, rate limit, transient lock) or terminal (revoked credentials, schema version drift). Recoverable failures are re-queued and terminal failures quarantine the target; in both cases the orchestrator refuses to publish a matrix while any target is still retrying or sits in quarantine. Exponential backoff with jitter, circuit breakers, and idempotent checkpointing let interrupted runs resume without duplicating or dropping tuples.
Inheritance flattening gaps. A grant that reaches a user only through nested role membership is invisible if the adapter reads object grants but never expands the membership graph. On Oracle this is acute because default roles and PUBLIC grants both contribute effective privileges that no single view lists. The completeness invariant is only real if membership expansion is part of every adapter, not an afterthought.
Non-deterministic ordering and case. Unsorted result sets, or folding "MixedCase" quoted identifiers to lowercase on one engine but not another, produce tuples that differ only cosmetically and register as drift on every run. Canonicalization must fix a total order and a single case-handling rule per engine.
Catalog materialization lag. Snowflake’s ACCOUNT_USAGE views lag real grants; Snowflake documents a latency of up to roughly two hours for GRANTS_TO_ROLES. An extractor that treats them as real-time will diff stale data against a fresh manifest and manufacture drift. Lagging catalogs must be read with a freshness tolerance and their tuples tagged accordingly.
Schema evolution during the read. A table dropped or a role renamed mid-extraction yields dangling object references. Detecting and reconciling those mid-flight changes — rather than crashing or emitting orphans — is the concern of Schema Validation Pipelines and is walked through end-to-end in Handling schema drift during catalog extraction.
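
The retry and publish-gate behaviour from the first failure mode can be sketched in a few lines. This uses the "full jitter" variant of exponential backoff, which is one common choice rather than the only one; backoff_delays, publishable, and the state labels are hypothetical.

```python
import random
from typing import Optional


def backoff_delays(
    attempts: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Full-jitter exponential backoff: delay i is drawn uniformly from
    [0, min(cap, base * 2**i)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** i)) for i in range(attempts)]


def publishable(target_states: dict[str, str]) -> bool:
    """Allow publication only when there is at least one target and every
    target finished in the assumed 'ok' state."""
    return bool(target_states) and all(
        state == "ok" for state in target_states.values()
    )
```

Passing a seeded random.Random makes the delay schedule reproducible in tests, while production code can rely on the default generator.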

Validation is the gate that turns these failure modes from silent corruption into loud, actionable errors. Structural checks — mandatory field presence, verb-ontology membership, referential integrity between grants and the roles they reference — run before any tuple enters the diff engine, so malformed ACL strings, orphaned role references, and grants to decommissioned service accounts are caught at the boundary rather than in an audit finding.
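
A minimal sketch of such a boundary check follows, covering the three structural checks named above. It operates on plain dicts, reuses the verb set from the PostgresAdapter mapping, and assumes the caller supplies the set of roles known to exist; validate_rows, REQUIRED_FIELDS, and KNOWN_VERBS are names invented for this example.

```python
REQUIRED_FIELDS = (
    "environment", "engine", "grantee",
    "object_type", "object_name", "privilege",
)
KNOWN_VERBS = {
    "SELECT", "INSERT", "UPDATE", "DELETE", "TRUNCATE",
    "REFERENCES", "TRIGGER", "USAGE", "EXECUTE",
}


def validate_rows(
    rows: list[dict], known_roles: set[str]
) -> tuple[list[dict], list[tuple[int, str]]]:
    """Split rows into (valid, errors); errors are (row index, reason)."""
    valid: list[dict] = []
    errors: list[tuple[int, str]] = []
    for index, row in enumerate(rows):
        # Empty strings count as missing, not just absent keys.
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        if missing:
            errors.append((index, f"missing fields: {', '.join(missing)}"))
        elif row["privilege"] not in KNOWN_VERBS:
            errors.append((index, f"unknown privilege: {row['privilege']}"))
        elif row["grantee"] not in known_roles:
            errors.append((index, f"orphaned grantee: {row['grantee']}"))
        else:
            valid.append(row)
    return valid, errors
```

Returning the rejected rows with reasons, rather than raising on the first one, lets the orchestrator report every problem in a run at once.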

Figure — Every target runs this machine independently. Recoverable faults re-queue through backoff so an interrupted run resumes without dropping tuples; a terminal fault quarantines the target and withholds the matrix rather than emitting a partial one that reads as mass revocation.

The Sections Beneath This Domain

This domain resolves into four working areas, each owning one stage of the extraction pipeline. Together they take you from a live catalog to a validated, diff-ready matrix.

System Catalog Query Optimization — how to read privilege metadata from production without lock contention or SLA impact: indexed metadata views, read-only transactional boundaries, and statement timeouts. Its worked example, Extracting user grants from Oracle data dictionary, reconstructs effective Oracle privileges by unioning DBA_SYS_PRIVS, DBA_TAB_PRIVS, and DBA_ROLE_PRIVS.
Async Privilege Batching — fanning extraction out across dozens of targets under bounded concurrency, with per-target failure isolation and checkpointing. Its implementation guide, Python scripts for async batch privilege scraping, is copy-ready for a production scraper.
Cross-DB Parser Adapters — the dialect-normalization layer that maps PostgreSQL, MySQL, Oracle, and warehouse grant syntax onto the single canonical tuple, flattening inheritance and resolving PUBLIC.
Schema Validation Pipelines — the structural gate that enforces the determinism, completeness, and provenance invariants before any tuple reaches the diff engine, including Handling schema drift during catalog extraction for mid-flight schema changes.

Read together, these four give you a deterministic baseline: read safely, read in parallel, normalize into one shape, and validate before you trust. That baseline is what makes every downstream drift decision defensible to an auditor.

Core RBAC Architecture & Privilege Fundamentals — the role and privilege model the canonical tuple encodes.
Drift Detection Engines & Diff Logic — the consumer of the canonical matrix, where deltas are scored and routed.
Privilege Scope Mapping — aligning object grants to data-classification tiers.
Environment Comparison Workflows — comparing extracted matrices across prod, staging, and dev.
Grant and Revoke Chain Logic — the cascade semantics of the DDL the delta generates.
