Potential database inconsistency with very large transactions

13 May 2026
Product Affected Versions Related Issues Fixed In
YSQL All #30772 v2025.2.3.0, v2025.1.4.0, v2024.2.9.0

Description

A rare race condition can cause rows written in a very large transaction (more than 100K storage records) to overwrite rows written by later, smaller transactions when a specific sequence of SST flush, restart, and compaction events occurs.

This issue was observed during DDL operations in the following sequence:

  1. A full-database ANALYZE runs on a database whose tables have a total column count greater than or equal to 4300.
  2. A restart of the YB-Master process(es) happens before the system catalog is flushed to disk.
  3. A DDL statement such as ALTER TABLE runs.
  4. In the background, a flush and compaction occurs.
  5. Another DDL runs. At this point, caches are invalidated, and the inconsistency can cause the following effects:
    • Queries against the altered table can fail in v2025.2.0.0 and later, unless additional catalog cache preload flags are enabled on YB-TServer.
    • In other cases, all connections to the database can fail.

If your cluster may have gone through steps 1–4 above but not hit the actual connection failures yet (step 5), you can run the following query to identify this potential failure waiting to happen. The query checks each YSQL database for mismatches between pg_class and related PostgreSQL system catalogs; a clean result (no rows) means none of the targeted inconsistencies were found.

-- Run via ysqlsh against any YSQL database.  A clean result (zero rows) means
-- no pg_class partial-revert corruption was detected.  Any returned rows
-- identify the affected relation, the specific inconsistency, the actual
-- divergent values, and the DDL family that likely caused it.
BEGIN READ ONLY;
SELECT c.oid AS rel_oid, c.relname, v.check_name, v.description
FROM pg_class c
CROSS JOIN LATERAL (VALUES
  (CASE WHEN c.relkind IN ('r','m','t','p')
        AND c.relhasindex IS DISTINCT FROM
            EXISTS(SELECT 1 FROM pg_index WHERE indrelid = c.oid)
   THEN 'relhasindex_mismatch' END,
   format('pg_class.relhasindex=%s but pg_index %s entries for this relation. '
          'Possible DDL: CREATE INDEX, ALTER TABLE ADD UNIQUE/PRIMARY KEY',
          c.relhasindex,
          CASE WHEN EXISTS(SELECT 1 FROM pg_index WHERE indrelid = c.oid)
               THEN 'has' ELSE 'has no' END)),
  (CASE WHEN c.relkind IN ('r','m','t','p')
        AND c.relnatts::int IS DISTINCT FROM
            (SELECT COUNT(*)::int FROM pg_attribute
             WHERE attrelid = c.oid AND attnum > 0)
   THEN 'relnatts_mismatch' END,
   format('pg_class.relnatts=%s but pg_attribute has %s live user columns. '
          'Possible DDL: ALTER TABLE ADD/DROP COLUMN',
          c.relnatts,
          (SELECT COUNT(*) FROM pg_attribute
           WHERE attrelid = c.oid AND attnum > 0))),
  (CASE WHEN c.relkind IN ('r','m','t','p')
        AND c.relchecks::int IS DISTINCT FROM
            (SELECT COUNT(*)::int FROM pg_constraint
             WHERE conrelid = c.oid AND contype = 'c')
   THEN 'relchecks_mismatch' END,
   format('pg_class.relchecks=%s but pg_constraint has %s check constraints. '
          'Possible DDL: ALTER TABLE ADD/DROP CHECK CONSTRAINT',
          c.relchecks,
          (SELECT COUNT(*) FROM pg_constraint
           WHERE conrelid = c.oid AND contype = 'c'))),
  (CASE WHEN c.relkind IN ('r','m','t','p')
        AND c.relhastriggers IS DISTINCT FROM
            EXISTS(SELECT 1 FROM pg_trigger WHERE tgrelid = c.oid)
   THEN 'relhastriggers_mismatch' END,
   format('pg_class.relhastriggers=%s but pg_trigger %s entries for this relation. '
          'Possible DDL: ALTER TABLE ADD FOREIGN KEY, CREATE/DROP TRIGGER',
          c.relhastriggers,
          CASE WHEN EXISTS(SELECT 1 FROM pg_trigger WHERE tgrelid = c.oid)
               THEN 'has' ELSE 'has no' END)),
  (CASE WHEN c.relkind IN ('r','m','v','p')
        AND c.relhasrules IS DISTINCT FROM
            EXISTS(SELECT 1 FROM pg_rewrite WHERE ev_class = c.oid)
   THEN 'relhasrules_mismatch' END,
   format('pg_class.relhasrules=%s but pg_rewrite %s rules for this relation. '
          'Possible DDL: CREATE/DROP RULE',
          c.relhasrules,
          CASE WHEN EXISTS(SELECT 1 FROM pg_rewrite WHERE ev_class = c.oid)
               THEN 'has' ELSE 'has no' END)),
  (CASE WHEN c.reltype != 0
        AND c.relname IS DISTINCT FROM
            (SELECT typname FROM pg_type WHERE oid = c.reltype)
   THEN 'relname_typname_mismatch' END,
   format('pg_class.relname=%L but pg_type.typname=%L for the associated row type. '
          'Possible DDL: ALTER TABLE/MATERIALIZED VIEW RENAME TO',
          c.relname,
          (SELECT typname FROM pg_type WHERE oid = c.reltype))),
  (CASE WHEN c.reltype != 0
        AND c.relnamespace IS DISTINCT FROM
            (SELECT typnamespace FROM pg_type WHERE oid = c.reltype)
   THEN 'relnamespace_typnamespace_mismatch' END,
   format('pg_class.relnamespace=%s (%s) but pg_type.typnamespace=%s (%s). '
          'Possible DDL: ALTER TABLE SET SCHEMA',
          c.relnamespace,
          (SELECT nspname FROM pg_namespace WHERE oid = c.relnamespace),
          (SELECT typnamespace FROM pg_type WHERE oid = c.reltype),
          (SELECT n.nspname FROM pg_type t
           JOIN pg_namespace n ON n.oid = t.typnamespace
           WHERE t.oid = c.reltype))),
  (CASE WHEN c.reltype != 0
        AND c.relowner IS DISTINCT FROM
            (SELECT typowner FROM pg_type WHERE oid = c.reltype)
   THEN 'relowner_typowner_mismatch' END,
   format('pg_class.relowner=%s (%s) but pg_type.typowner=%s (%s). '
          'Possible DDL: ALTER TABLE/MATERIALIZED VIEW OWNER TO',
          c.relowner,
          (SELECT rolname FROM pg_roles WHERE oid = c.relowner),
          (SELECT typowner FROM pg_type WHERE oid = c.reltype),
          (SELECT r.rolname FROM pg_type t
           JOIN pg_roles r ON r.oid = t.typowner
           WHERE t.oid = c.reltype))),
  (CASE WHEN c.relkind IN ('r','m','t','p')
        AND c.relispartition IS DISTINCT FROM
            EXISTS(SELECT 1 FROM pg_inherits WHERE inhrelid = c.oid)
   THEN 'relispartition_mismatch' END,
   format('pg_class.relispartition=%s but pg_inherits %s entry for this relation. '
          'Possible DDL: ALTER TABLE ATTACH/DETACH PARTITION',
          c.relispartition,
          CASE WHEN EXISTS(SELECT 1 FROM pg_inherits WHERE inhrelid = c.oid)
               THEN 'has an' ELSE 'has no' END)),
  (CASE WHEN c.relkind = 'i'
        AND EXISTS(SELECT 1 FROM pg_constraint con
                   WHERE con.conindid = c.oid
                     AND con.contype IN ('u','p','x')
                     AND con.conname != c.relname)
   THEN 'constraint_index_name_mismatch' END,
   format('Backing index pg_class.relname=%L but pg_constraint.conname=%L for '
          'constraint oid %s. Possible DDL: ALTER TABLE ADD CONSTRAINT USING INDEX, '
          'ALTER TABLE RENAME CONSTRAINT',
          c.relname,
          (SELECT con.conname FROM pg_constraint con
           WHERE con.conindid = c.oid AND con.contype IN ('u','p','x')
           ORDER BY con.oid LIMIT 1),
          (SELECT con.oid FROM pg_constraint con
           WHERE con.conindid = c.oid AND con.contype IN ('u','p','x')
           ORDER BY con.oid LIMIT 1))),
  (CASE WHEN c.relkind = 'i'
        AND NOT EXISTS(SELECT 1 FROM pg_index WHERE indexrelid = c.oid)
   THEN 'index_pg_index_mismatch' END,
   format('Index %L (oid %s) exists in pg_class but has no pg_index entry. '
          'Possible DDL: reverted DROP INDEX',
          c.relname, c.oid))
) v(check_name, description)
WHERE v.check_name IS NOT NULL
  AND c.relnamespace NOT IN (
    SELECT oid FROM pg_namespace
    WHERE nspname IN ('pg_catalog', 'information_schema', 'pg_toast')
  );
COMMIT;

Mitigation

To reduce the likelihood of encountering this issue on affected versions:

  • Do not run full-database ANALYZE on versions earlier than v2025.2.3.0, v2025.1.4.0, or v2024.2.9.0. Instead, run ANALYZE on individual tables in batches. On v2025.2.0.0 and later, prefer Auto Analyze, which is enabled by default when the cost-based optimizer (CBO) is enabled.
  • Upgrade to a release that contains the fix. Upgrading does not correct any data inconsistencies which have already occurred.
  • If you have already encountered the issue and are at Step 5 as described above with database connection failures, setting the YB-TServer flag --ysql_minimal_catalog_caches_preload=true should allow you to connect to the database, though queries on certain tables may still fail. The pg_class entries for affected tables may require manual correction; contact Yugabyte Support for assistance.

Details

A sequence of events was identified in which changes from a later transaction can be partially undone by an earlier large transaction that updates more than 100K RocksDB rows. This scenario in theory can happen on data tables but is unlikely on ordinary user data tables and more likely when DDL modifies the PostgreSQL catalog on databases with roughly 4300 or more columns in total, because those paths use the same transaction infrastructure.

The following DDL example describes the sequence of events that can trigger this bug. The example uses a full-database ANALYZE as the large, earlier transaction and ALTER TABLE ADD COLUMN as the later transaction.

  1. Large transaction commits. A large transaction commits with an update of more than 100K RocksDB rows in unpacked form.
    • A full-database ANALYZE can reach this limit when the total number of columns across all tables in the database is approximately 4300 or greater.
  2. Server restart. Before those rows are flushed to SST files, the YB-Master or the YB-TServer holding the transaction rows restarts.
  3. Partial WAL replay. During restart, WAL replay of these rows occurs. The transaction transfer code path erroneously causes only the first batch of 100K RocksDB rows to be transferred to regular RocksDB. The remaining rows of the transaction still stay in the IntentsDB memtable, awaiting the next WAL replay.
    • In this example, remaining rows from the full-database ANALYZE can include pg_class rows that describe important table metadata, such as the number of columns (relnatts) and the relfilenode.
  4. Newer transaction commits. A newer transaction can update some of those rows that still reside in the intents memtable. No issues are seen at this point, and reads continue to behave as expected because reads merge pending intents with newer updates. Note that on commit, these newer writes move from intents to regular RocksDB.
    • In this example, an ALTER TABLE ADD COLUMN can update the pg_class entry to update the number of columns in the table (relnatts column of pg_class).
  5. Flush and compaction occurs. When enough rows accumulate, a flush and compaction can then occur on the regular RocksDB. During compaction, newer updates (such as, from ALTER TABLE) that use per-column format are merged into existing packed rows. The resulting packed row can retain an older timestamp after per-column updates are applied.
  6. Inconsistency becomes visible. After this compaction, the remaining writes from the older large transaction in IntentsDB, still waiting to transfer to regular RocksDB, can have the highest timestamp among remaining entries and become visible to readers.
    • In this example, the compaction step causes the pg_class rows from the earlier ANALYZE transaction in IntentsDB to appear to have the latest timestamp. Consequently, the pg_class row reverts to the column count from the ANALYZE, undoing the ALTER TABLE ADD COLUMN change. This catalog inconsistency causes PostgreSQL backends to error out.
  7. Inconsistency is persisted. A subsequent restart replays the remaining writes correctly, and they are eventually compacted into the existing packed row. However, they merge into the packed row after a later transaction was already merged.
    • In this example, this means that the PostgreSQL catalog now contains the new pg_attribute definition of the new columns added by an ALTER TABLE ADD COLUMN but the relnatts column of pg_class is inconsistent with this new column.