An xCluster target universe's YB-Master may crash when certain types of schema changes are applied after database upgrade
| Product | Affected Versions | Related Issues | Fixed In |
|---|---|---|---|
| YugabyteDB | v2025.1.1.0+, v2025.2.0.0 to v2025.2.3.0 | #24990, #27275 | Upcoming releases: v2025.2.3.1+, v2025.2.4+, v2026.1 |
Description
After an xCluster setup (of any xCluster type) is upgraded to v2025.1.1.0 or later, running certain DDLs can cause the target universe's YB-Master leader process to crash with SIGSEGV and enter a crash loop. The affected DDLs are those that modify a table's dependent objects, such as commands that create, drop, or modify indexes, foreign keys, views, or triggers. This can also occur during the upgrade process.
The YB-Master core dump contains a stack trace similar to the following:
* thread #1, name = 'yb-master', stop reason = signal SIGSEGV
* frame #0: 0x000055faa5e1b2c3 yb-master`yb::master::CatalogManager::DoUpdateConsumerOnProducerMetadata(this=0x000071bc7fc09f80, replication_group_id=0x00007f9a07fd6a10, stream_id=0x00007f9a07fd69e0, producer_schema_version=6, consumer_schema_version=<unavailable>, colocation_id=<unavailable>, check_min_consumer_schema_version=<unavailable>, resp=0x000071bcff5c3508) at xrepl_catalog_manager.cc:4037:39
frame #1: 0x000055faa5e1a8d7 yb-master`yb::master::CatalogManager::UpdateConsumerOnProducerMetadata(this=0x000071bc7fc09f80, req=0x000071bcff5c34c0, resp=0x000071bcff5c3508, rpc=0x00007f9a07fd6bc0) at xrepl_catalog_manager.cc:3975:10
frame #2: 0x000055faa5c206bc yb-master`std::__1::__function::__func<void yb::master::MasterServiceBase::HandleIn<yb::master::CatalogManager, yb::master::UpdateConsumerOnProducerMetadataRequestPB, yb::master::UpdateConsumerOnProducerMetadataResponsePB>(...)::'lambda'(), ...>::operator()() [inlined] void yb::master::MasterServiceBase::HandleIn<yb::master::CatalogManager, yb::master::UpdateConsumerOnProducerMetadataRequestPB, yb::master::UpdateConsumerOnProducerMetadataResponsePB>(...)::'lambda'()::operator()() const at master_service_base-internal.h:158:14
...
Root cause
Earlier DDLs may have put the universe (and certain tables in particular) in a problematic state. When in such a state, after an upgrade, any additional DDL on the specifically affected tables causes a crash loop.
This applies only to tables that meet both of the following conditions:
- The table has had DDLs applied that affect its dependent objects (for example, its indexes, foreign keys, views, or triggers); and
- Only one or two such DDLs (and not more) were applied to the table before the upgrade.
Specific DDLs that can lead to this issue include, for example, CREATE INDEX and ADD CONSTRAINT. DDLs that modify the primary tables, which actually hold the physical rows of data (for example, ALTER TABLE), do not lead to this issue.
Mitigation
Use the verification script to check whether your universe (and some particular tables) are in a problematic state. You can run this script either before or after upgrading; running it before upgrading is strongly recommended.
Before you upgrade
Run the verification script on any universe with xCluster enabled. The script is read-only and only inspects some configuration settings; it has negligible impact on the universe's performance regardless of the universe's size and scale.
If the script identifies no affected schemas, proceed with the upgrade before performing any additional DDLs. You will not be affected by this issue during the upgrade, and after upgrading you will never encounter it, even with additional DDLs. In short, upgrading permanently mitigates the problem.
If the script identifies affected schemas, then do one of the following:
- Upgrade only to a version with the fix (recommended).
- For each affected table, do the following:
  1. Run the following `ADD CONSTRAINT` on the source and on the target. For bidirectional setups, run on both sides: `ALTER TABLE <table> ADD CONSTRAINT xcluster_schema_verification_fix CHECK (true) NOT VALID;`
  2. After step 1 is complete, run the following `DROP CONSTRAINT` on the source and on the target. For bidirectional setups, run on both sides: `ALTER TABLE <table> DROP CONSTRAINT xcluster_schema_verification_fix;`
  3. Run the script again to verify. You should find no affected schemas. If you do, contact Yugabyte Support.
  4. Upgrade (to any version). After upgrading, you will never encounter this problem again, even with additional DDLs.
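When several tables are affected, the add-then-drop statements from the steps above can be generated rather than typed by hand. The following is a minimal sketch, not part of the verification script: the helper name `workaround_statements` and the table names are hypothetical, and the table names are assumed to be already schema-qualified and quoted where needed. It only builds the SQL text; running the statements on the source and target (and waiting for step 1 to complete before step 2) is still up to you.

```python
# Constraint name taken from the workaround steps in this advisory.
CONSTRAINT_NAME = "xcluster_schema_verification_fix"


def workaround_statements(tables):
    """Return (add_statements, drop_statements) for the given table names.

    Hypothetical helper: table names are inserted as given, so they must
    already be valid, trusted identifiers.
    """
    add_statements = [
        f"ALTER TABLE {table} ADD CONSTRAINT {CONSTRAINT_NAME} CHECK (true) NOT VALID;"
        for table in tables
    ]
    drop_statements = [
        f"ALTER TABLE {table} DROP CONSTRAINT {CONSTRAINT_NAME};"
        for table in tables
    ]
    return add_statements, drop_statements


# Hypothetical list of affected tables, as reported by the verification script.
affected_tables = ["public.orders", "public.customers"]
add_stmts, drop_stmts = workaround_statements(affected_tables)

print("-- Step 1: run on the source and on the target")
for stmt in add_stmts:
    print(stmt)

print("-- Step 2: run only after step 1 is complete")
for stmt in drop_stmts:
    print(stmt)
```

Keeping the two batches separate mirrors the ordering requirement in the steps: every `ADD CONSTRAINT` is issued before any `DROP CONSTRAINT`.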
After you upgrade
If, after upgrading, your universe is hitting the crash loop:
- Set the YB-Master flag `xcluster_skip_schema_compatibility_checks_on_alter` to `true` on the YB-Masters to stop the crash looping.
- Delete and recreate your xCluster setup.
- Set the `xcluster_skip_schema_compatibility_checks_on_alter` flag back to `false` on YB-Masters.
Thereafter, you will not encounter this problem again, even with additional DDLs.
If, after upgrading, your universe is not hitting the crash loop:
- Run the verification script to check whether any problematic schemas exist.
- If the script finds none, there are no issues. You will not encounter this problem (with the crash loop). Continue operations (including DDLs) as usual.
- If the script finds affected schemas, avoid running any DDL on the affected tables. Doing so will cause crash loops. Upgrade to a version with the fix as soon as possible.
Details
xCluster stores historical schema versions to correctly update packed row information. Previously, it only stored the current and previous schema versions; however, newer code requires a longer history of previous schema versions.
Commit 11482d6e incorrectly changed the protobuf field old_consumer_schema_version from a non-repeated field to a repeated field. When old_consumer_schema_version is set to the default value of 0 in older versions, but old_producer_schema_version is non-zero, then the new code only sees the non-zero value. This results in old_consumer_schema_versions and old_producer_schema_versions having different sizes in the new code, which then tries to iterate over both, causing a segfault.
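The size mismatch described above can be illustrated with a small model. This is a simplified sketch in Python, not the actual YugabyteDB C++ code: the function names are hypothetical, and the "zero value is not seen by the new code" behavior is modeled directly from the description rather than from the protobuf wire format. In Python the unchecked walk raises `IndexError`; the analogous out-of-bounds access in C++ is what surfaces as a segfault.

```python
def read_as_repeated(old_value):
    """Model how the new code sees an old single-valued field.

    Assumption taken from the description above: a value left at the
    default of 0 is not seen, so it produces an empty list.
    """
    return [old_value] if old_value != 0 else []


def pair_versions_unchecked(producer_versions, consumer_versions):
    """Walk both lists by index, assuming they have the same size."""
    return [
        (producer_versions[i], consumer_versions[i])
        for i in range(len(producer_versions))
    ]


def pair_versions_checked(producer_versions, consumer_versions):
    """Same pairing, but reject mismatched sizes before iterating."""
    if len(producer_versions) != len(consumer_versions):
        raise ValueError(
            f"size mismatch: {len(producer_versions)} producer versions "
            f"vs {len(consumer_versions)} consumer versions"
        )
    return list(zip(producer_versions, consumer_versions))


# State written by an older version: non-zero producer version, consumer at 0.
old_producer_versions = read_as_repeated(5)
old_consumer_versions = read_as_repeated(0)

try:
    pair_versions_unchecked(old_producer_versions, old_consumer_versions)
except IndexError as err:
    print(f"unchecked iteration failed: {err}")

try:
    pair_versions_checked(old_producer_versions, old_consumer_versions)
except ValueError as err:
    print(f"checked iteration rejected the input: {err}")
```

The checked variant shows one general way to guard such paired iteration; it is not a description of how the actual fix is implemented.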
This issue affects all variations of xCluster, including YSQL (both transactional and non-transactional, used uni-directionally and bi-directionally) and YCQL. It affects both xCluster Replication and xCluster DR.