Partial Streaming Stall After Tablet Split

3 Dec 2024

Description

The YugabyteDB gRPC (Debezium) Connector may stop sending checkpoints to YugabyteDB, causing data streaming to stall for certain tablets.

This issue is more likely to occur when a single task is configured to poll all the tablets of a table while the tablets are being split. If the checkpoint for a tablet is not updated and that tablet gets split, the connector fails to transition to the child tablets. As a result, no data from the child tablets will be streamed, causing the streaming process to halt for those tablets.

Mitigation

To mitigate this issue, either upgrade the connector to dz.1.9.5.yb.grpc.2024.2, or restart the connector using the following steps:

  1. Run the get_unpolled_tablets.sh script to get a list of unpolled tablets. For best results, run the script iteratively every 15 minutes and track the list of tablets it outputs.
  2. If there are tablets that are output across multiple runs of the script and never disappear, those tablets are stuck, and you should restart the connector.

Details

When a Kafka commit occurs, the connector snapshots the current offsets across all source partitions. It identifies the highest offset for each tablet, logs these as checkpoints, and then updates YugabyteDB with the latest progress tracking information.

The problem is how higher offsets are calculated: instead of merging new values into the existing global offset map, the map is completely overwritten. As a result, in cases of hot shards or actively written tablet ranges, some tablets might not receive a callback. In other words, offsets for certain tablets may be inadvertently skipped or filtered out.