Missing rows on xCluster target on range-sharded tables with multiple tablets

03 Jun 2025

Product	Affected Versions	Related Issues	Fixed In
YSQL	v2.20.x, v2024.1, v2024.2	#27380	v2.20.11.0, v2024.1.6.1, v2024.2.3.1

Description

When the following conditions are true, the xCluster target could be missing some of the rows from the xCluster source permanently with no way to recover except to rebootstrap xCluster.

The table or index involved in xCluster replication is range-sharded. (Hash-sharded tables or indexes are not affected by this issue).
The table or index has multiple tablets on the target side. Writes to the table/index are transactional. For YSQL this is almost always true.
The changes do get replicated from xCluster source to target, but the final apply step of transactions gets skipped on the xCluster target for all tablet key ranges except one - resulting in effectively only 1 tablet getting all the writes. As a result of this, all of the xCluster troubleshooting views and metrics on YugabyteDB Anywhere show the replication lag as 0 and replication status as Healthy. However, a data comparison between source and target will indicate a difference in row count/data.

A few additional considerations to assess impact are:

Range sharding is currently not enabled by default, so if a table/index is created without any qualifiers on the primary key, it is hash-sharded by default.
YCQL tables cannot be created with range sharding.

To explicitly create a range-sharded table/index, you would have to specify ASC or DESC as part of the primary key specification as described in the following example:

CREATE TABLE range(k1 int,
                   k2 int,
                   v1 int,
                   v2 text,
                   PRIMARY KEY (k1 ASC, k2 DESC));

Range sharding is also enabled by default for new tables with the pg_parity flags.
Default tablet count with range sharding is 1, so if automatic tablet splitting is NOT enabled, then you will not experience this issue. Note that tablet splitting is enabled by default.

To identify whether you are impacted:

The problem occurs almost always when a range partitioned table has multiple tablets. The only scenario where it may not happen is if all writes are going to a single tablet which is unlikely. Another unlikely case is where the workload is not using transactions at all.
For tables, row counts can be compared on both source and target tables. As the rows on the target are lost permanently, the inconsistency is permanent.
For hash-sharded tables with range-sharded indexes, the row count may match as it scans the table, but an index only scan would show a difference between the source and target tables.
While depending on the key type, it might be possible to identify exactly which key range/tablet is correctly replicated and the remaining ones would be missing data - this may not really help because all tablets except 1 tablet would be missing data.
The issue is caused by an implementation bug in determining whether a given key belongs to the tablet's key ranges during the apply of transactions on the xCluster target.

Mitigation

The mitigation is to fully rebootstrap the xCluster target after applying the fix or using one of the following alternative approaches. If you're using range sharding with xCluster, it is recommended that you apply the fix to prevent the issue from happening.

Alternatives to avoid encountering the issue:

Use hash sharding instead of range sharding, if that is a viable option.
Use a single tablet for the range-sharded table/index on the xCluster target and disable automatic tablet splitting to avoid creating multiple tablets.
Instead of a range-sharded table, use a partitioned table with a single tablet per partition and disable automatic tablet splitting.

Details

A bug in the xCluster target intent application process causes the system to skip applying intents for range-sharded tables. While xCluster tablet mappings are accurate and all records replicate correctly to the target cluster, the results in telemetry and UI views incorrectly report a healthy system status, masking data inconsistency.

Transactional updates generate a sequence of intent records followed by a single APPLY record in the Write-Ahead Log (WAL) for each transaction affecting a tablet. Typically, the APPLY operation commits all associated transaction intents to the regular database tables.

xCluster replicates these intent and APPLY records to the target cluster as they appear on the source. In scenarios with multiple producer tablets replicating to a single consumer tablet, the consumer receives interleaved sequences of intents and APPLY operations from each producer.

For example:

Tablet -> T
Intent -> I

Given the following sequence of operations:

T1 I1, T1 I2, T2 I3, T1 APPLY, T2 I4, T2 APPLY

If processed without modification, the T1 APPLY operation may incorrectly cause the T2 I3 intent to be moved to the regular database.

To mitigate this, a fix was implemented to attach the key range to each APPLY record, ensuring only the intents within that specific key range are applied when processing the corresponding APPLY.

However, a flaw exists in the logic that determines which intent record corresponds to its key range for range-partitioned tables. The logic incorrectly uses a suffix of the key instead of the entire key. This results in only writes to a single tablet being correctly applied.

A test case with the initial fix did not adequately cover transactional writes combined with range-partitioned tables, allowing this flaw to persist.