Risk of data loss when upgrading to or from version 2.20.3.0 or 2.21.0

06 May 2024
Product: YSQL
Affected Versions: v2.20.3.0, v2.21.0
Related Issues: #22057
Fixed In: v2.20.3.1, v2.20.4, v2.21.1 (planned)

Description

During a rolling upgrade of a YugabyteDB cluster to or from either the v2.20.3.0 or the v2.21.0 release, if the cluster has an active YSQL write workload, there is a risk of data loss due to an issue in the row-locking feature.

Mitigation

If you have already upgraded to or from v2.20.3.0 or v2.21.0, contact Yugabyte Support for steps to identify which tablets have been affected and how to repair them.

If you have created a new universe on v2.20.3.0 or v2.21.0, perform the following steps to ensure the issue does not occur when upgrading to a different version.

  1. Manually override the YB-TServer flag ysql_skip_row_lock_for_update to false, using the JSON flags override page as follows (for clusters not managed through this page, see the command-line sketch after these steps):

    {"ysql_skip_row_lock_for_update":"false"}
    
  2. Upgrade the universe to a version with the fix.

  3. After the upgrade succeeds, you can safely remove the YB-TServer flag override for ysql_skip_row_lock_for_update.
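
If your universe's YB-TServer processes are not managed through the flags override page, the same override can be applied on the yb-tserver command line. The following is a minimal sketch only: the flag name comes from this advisory, while the master addresses, data directories, and other flags shown are placeholders to replace with your deployment's values.

    # Sketch: start each yb-tserver with the row-locking override in place
    # until the upgrade to a fixed version completes. All values other than
    # ysql_skip_row_lock_for_update are placeholders.
    ./bin/yb-tserver \
      --tserver_master_addrs 127.0.0.1:7100 \
      --fs_data_dirs /mnt/d0 \
      --ysql_skip_row_lock_for_update=false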

Details

v2.20.3 introduced the row-locking feature to address issues arising from concurrent updates. With this change, UPDATE operations now acquire a row-level lock, similar to PostgreSQL, instead of per-column locks. As part of this feature, a subtle change was made to how the Raft replication messages for the related DocDB write operations are handled.
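
As an illustration of the behavioral difference, consider two sessions updating different columns of the same row. The table and statements below are hypothetical and only sketch the kind of write workload the change affects: with per-column locks the two updates do not contend, whereas with the row-level lock one update waits for the other to release the row.

    # Hypothetical workload; table t exists only for this illustration.
    ./bin/ysqlsh -c "CREATE TABLE t (k INT PRIMARY KEY, a INT, b INT); INSERT INTO t VALUES (1, 0, 0);"

    # Two concurrent updates against different columns of the same row.
    # Before v2.20.3 these take per-column locks and do not contend; from
    # v2.20.3 each takes a row-level lock, so one waits for the other.
    ./bin/ysqlsh -c "UPDATE t SET a = a + 1 WHERE k = 1;" &
    ./bin/ysqlsh -c "UPDATE t SET b = b + 1 WHERE k = 1;" &
    wait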

During a rolling upgrade, some nodes run the previous version while others run the new version. This can lead to a scenario where a tablet leader runs the new version (using row locking), while at least one follower remains on the older version (using per-column locking). In that case, the older follower receives Raft replication messages produced by the new logic on the leader. Nodes on the old version can mishandle writes generated by the new version, potentially corrupting table data or losing data. If an affected follower later becomes the leader, you may observe missing data for the updated rows.

Additionally, during this transition period, tablet replicas on the new and old versions can become inconsistent with each other. The inconsistency persists until the affected rows are fully overwritten. As a result, you may observe either the presence or absence of rows depending on which replica is serving as the leader at any given moment.
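
One way this can surface, sketched below against a hypothetical table, is that the same query returns different results before and after tablet leadership moves (for example, around a node restart). This is only an illustration; use the steps provided by Yugabyte Support to reliably identify affected tablets.

    # Hypothetical illustration only: the same query may return different
    # results if tablet leadership moves to a replica whose data diverged
    # during the mixed-version window.
    ./bin/ysqlsh -c "SELECT count(*) FROM t;"
    # ... tablet leadership changes (for example, a node restarts) ...
    ./bin/ysqlsh -c "SELECT count(*) FROM t;"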