Failure of upgrades to release versions 2.18 and 2.20

19 Mar 2024
Product: YBDB, YB Anywhere
Affected Versions: v2.18, v2.20
Related Issues: #21491
Fixed In: v2.20.2.1, v2.18.7.0

Description

Upgrading from prior versions (other than v2.14 and v2.16) to v2.18 or v2.20 can fail due to a race condition during post-upgrade. While the yb-tservers themselves can be healthy and their Raft configurations can remain intact, they fail to heartbeat to the yb-master. The probability of hitting the race is low: it requires YugabyteDB Anywhere to execute the post-upgrade action of un-blacklisting yb-tservers at the same time as yb-master runs the background task that generates the universe_uuid field. The issue is less likely in v2.20 because the post-upgrade actions take much longer there than in v2.18, which significantly reduces the probability of hitting it.

v2.14 and v2.16 releases are not impacted by this issue. This is because the flag master_enable_universe_uuid_heartbeat_check is not auto-promoted and so the functionality is OFF by default until you explicitly turn it ON.

Mitigation

Set the master_enable_universe_uuid_heartbeat_check flag on yb-master to false. This can be performed as a non-rolling, non-restart YBA upgrade after the database upgrade is complete. After the flag change is applied, upgrade to a release with the fix and then re-enable the flag. Re-enabling the flag requires running a yb-ts-cli command to clear the universe_uuid on all nodes. After the universe_uuid is cleared, the flag can be re-enabled on yb-master.

Details

The universe_uuid field was added to ClusterConfig as part of #17904. This is essentially an identity for the universe which all the yb-tservers inherit from the yb-master as part of the heartbeat. If set, this value is not meant to change on either the yb-tservers or yb-masters and provides a way for the yb-master to reject any heartbeats from a different universe.
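The heartbeat check described above can be pictured with a small model. This is only an illustrative sketch, not YugabyteDB code: the class ToyMaster, its method handle_heartbeat, and the check_enabled attribute (standing in for master_enable_universe_uuid_heartbeat_check) are hypothetical names, and the real heartbeat protocol carries far more state.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ToyMaster:
    """Hypothetical stand-in for a yb-master that owns the universe identity."""
    universe_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Assumption: models the master_enable_universe_uuid_heartbeat_check flag.
    check_enabled: bool = True

    def handle_heartbeat(self, reported: Optional[str]) -> Tuple[bool, Optional[str]]:
        """Return (accepted, universe_uuid the tserver should hold afterwards)."""
        if reported is None:
            # A tserver with no identity yet inherits the master's universe_uuid.
            return True, self.universe_uuid
        if self.check_enabled and reported != self.universe_uuid:
            # Heartbeat claims a different universe: reject it.
            return False, reported
        return True, reported


master = ToyMaster()
ok, inherited = master.handle_heartbeat(None)          # first heartbeat
same_ok, _ = master.handle_heartbeat(inherited)        # later heartbeats match
foreign_ok, _ = master.handle_heartbeat("some-other-universe")
print(ok, same_ok, foreign_ok)
```

In this model, turning check_enabled off makes the master accept every heartbeat again, which is the effect the mitigation relies on.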

For universes upgrading from an older release to one that includes the preceding commit, the catalog manager generates a new universe_uuid and propagates it to the yb-tservers. However, the version number is not incremented before the universe_uuid is persisted in ClusterConfig. As a result, the following race is possible:

  1. Cluster gets upgraded to a release with commit fb98e56 and the feature master_enable_universe_uuid_heartbeat_check is enabled due to promotion of flags.
  2. YBA reads the cluster configuration (ClusterConfig) at version 'X'.
  3. Catalog manager background thread runs and generates a new universe_uuid, persists it in ClusterConfig and propagates it to all the yb-tservers.
  4. YBA, using the configuration it read in Step 2, updates the ClusterConfig using ChangeMasterClusterConfigRequestPB with version 'X' (to un-blacklist nodes).
  5. The update from Step 4 succeeds because the ClusterConfig version 'X' on disk matches the version 'X' in the request, effectively overwriting the universe_uuid generated in Step 3 with the empty value that YBA read in Step 2.
  6. Catalog manager background thread runs again and because the universe_uuid is empty, it generates a new one again.

After the new universe_uuid is generated on the catalog manager in Step 6, yb-master starts rejecting heartbeats from all the yb-tservers, which keep reporting the previous universe_uuid that the catalog manager generated in Step 3.
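The sequence in Steps 2 through 6 is a lost update under optimistic version checking, and it can be replayed in a few lines. The sketch below is a simplified model under stated assumptions: ToyCatalogManager, its methods, and the bump_version_on_uuid switch are hypothetical, the starting version 7 is arbitrary, and the switch only illustrates what happens when the missing version increment is present. It does not claim to mirror the actual change made for #21491.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClusterConfig:
    version: int
    universe_uuid: Optional[str] = None
    blacklist: Tuple[str, ...] = ()


class ToyCatalogManager:
    """Hypothetical model of the master-side ClusterConfig handling."""

    def __init__(self, bump_version_on_uuid: bool):
        self.config = ClusterConfig(version=7, blacklist=("ts-1",))
        self.bump_version_on_uuid = bump_version_on_uuid

    def read_config(self) -> ClusterConfig:
        return self.config

    def background_generate_uuid(self) -> str:
        # Background task: generate a universe_uuid only if none is persisted.
        if self.config.universe_uuid is None:
            step = 1 if self.bump_version_on_uuid else 0
            self.config = replace(
                self.config,
                universe_uuid=str(uuid.uuid4()),
                version=self.config.version + step,
            )
        return self.config.universe_uuid

    def change_config(self, requested: ClusterConfig) -> bool:
        # Optimistic check: accept only if the caller saw the current version.
        if requested.version != self.config.version:
            return False
        self.config = replace(requested, version=requested.version + 1)
        return True


def run_race(bump_version_on_uuid: bool) -> Tuple[bool, bool]:
    cm = ToyCatalogManager(bump_version_on_uuid)
    snapshot = cm.read_config()                    # Step 2: YBA reads version 'X'
    tserver_uuid = cm.background_generate_uuid()   # Step 3: uuid sent to tservers
    accepted = cm.change_config(replace(snapshot, blacklist=()))  # Steps 4 and 5
    master_uuid = cm.background_generate_uuid()    # Step 6: regenerate if empty
    return accepted, tserver_uuid == master_uuid


print(run_race(bump_version_on_uuid=False))  # stale write accepted, uuids diverge
print(run_race(bump_version_on_uuid=True))   # stale write rejected, uuids agree
```

Without the version increment, the stale un-blacklist request passes the version check and erases the universe_uuid, so the master and the yb-tservers end up with different values. With the increment, the stale request is rejected and the caller would have to re-read the configuration and retry.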