Incorrectly configured ulimit along with snapshot operations can cause data loss
| Product | Affected Versions | Related Issues | Fixed In |
|---|---|---|---|
| YugabyteDB, YugabyteDB Anywhere | v2.20, v2024.1, v2024.2 | #26910 | v2.20.11.0, v2024.2.4.0 |
Description
In YugabyteDB, a snapshot is a consistent state of data taken across all nodes in a cluster at a point in time. The final phase of a snapshot operation involves a few disk writes. If a ulimit is incorrectly configured, the YB-TServer may crash during these writes, which can result in loss of data on that node.
Mitigation
To mitigate the issue, perform one of the following:
- Ensure ulimits are correctly configured. For more details, see Set ulimits.
- Upgrade to a version that includes the fix: v2.20.11.0 or v2024.2.4.0.
Note that until the correct ulimit configuration is set, snapshot operations remain at risk. Snapshots are taken during backups and when Point-in-Time Recovery (PITR) snapshot schedules are enabled. This includes setting up xCluster, which triggers a backup, and transactional xCluster, which enables PITR snapshot schedules on both clusters.
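As a quick sanity check, the limits seen by a process can be read programmatically. The following is a minimal sketch using Python's standard `resource` module. The `EXAMPLE_MINIMUMS` values and the helper names `is_sufficient` and `check_limits` are assumptions made for illustration; take the authoritative values from Set ulimits.

```python
import resource

# Placeholder minimums for illustration only (assumed values); use the
# numbers from the "Set ulimits" documentation for your version.
EXAMPLE_MINIMUMS = {
    "RLIMIT_NOFILE": 1048576,  # maximum open files
    "RLIMIT_NPROC": 12000,     # maximum user processes
}


def is_sufficient(soft, minimum):
    """Return True if a soft limit meets the minimum; unlimited always does."""
    return soft == resource.RLIM_INFINITY or soft >= minimum


def check_limits(minimums):
    """Return {limit_name: (soft_limit, ok)} for the current process."""
    report = {}
    for name, minimum in minimums.items():
        soft, _hard = resource.getrlimit(getattr(resource, name))
        report[name] = (soft, is_sufficient(soft, minimum))
    return report


if __name__ == "__main__":
    for name, (soft, ok) in check_limits(EXAMPLE_MINIMUMS).items():
        shown = "unlimited" if soft == resource.RLIM_INFINITY else soft
        print(f"{name}: soft={shown} {'OK' if ok else 'TOO LOW'}")
```

This reports the limits of the process that runs the script, which are inherited from the invoking shell or service. They can differ from the limits of an already running YB-TServer. On Linux, the limits of a running process can be read from `/proc/<pid>/limits`.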
Details
A snapshot operation is performed on every tablet in the database during operations such as backups and PITR snapshot schedules.
As part of applying SNAPSHOT_OP (the internal snapshot operation applied on each tablet), YugabyteDB updates the flushed frontier in the RocksDB manifest file via Tablet::ModifyFlushedFrontier. At this point, earlier write operations have been applied to the RocksDB instance but might not yet have been flushed to an SST file. If ulimits are incorrectly set, the SST flush can fail, causing the YB-TServer process to crash. Upon restart, the process reads the manifest data and incorrectly assumes the flush has already completed successfully, instead of retrying the operation. This can result in loss of data on that node.
In the fixed versions, YugabyteDB performs a synchronous flush of RocksDB before calling Tablet::ModifyFlushedFrontier to update the flushed frontier in the manifest file.
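The effect of the ordering can be illustrated with a toy model. This is a sketch, not YugabyteDB code: `ToyTablet`, its fields, and the `run` helper are hypothetical, and the recovery rule (on restart, replay only logged writes newer than the flushed frontier) is a simplifying assumption made for illustration.

```python
class FlushError(Exception):
    """Stands in for a flush that fails, e.g. because a resource limit is hit."""


class ToyTablet:
    def __init__(self):
        self.wal = []              # durable log of (op_id, key, value)
        self.memtable = {}         # applied but unflushed data; lost on crash
        self.sst = {}              # flushed, durable data
        self.flushed_frontier = 0  # durable marker: ops up to here are in SST

    def write(self, op_id, key, value):
        self.wal.append((op_id, key, value))
        self.memtable[key] = value

    def flush(self, fail=False):
        if fail:
            raise FlushError("simulated flush failure")
        self.sst.update(self.memtable)
        self.memtable.clear()

    def snapshot_op(self, op_id, flush_first, flush_fails=False):
        if flush_first:
            # Ordering in the fixed versions: flush, then advance the frontier.
            self.flush(fail=flush_fails)
            self.flushed_frontier = op_id
        else:
            # Ordering in the affected versions: frontier advances before
            # the data is known to be flushed.
            self.flushed_frontier = op_id
            self.flush(fail=flush_fails)

    def restart(self):
        """Simulate crash and restart: drop memtable, replay log past frontier."""
        self.memtable = {}
        for op_id, key, value in self.wal:
            if op_id > self.flushed_frontier:
                self.memtable[key] = value

    def read(self, key):
        return self.memtable.get(key, self.sst.get(key))


def run(flush_first):
    tablet = ToyTablet()
    tablet.write(1, "k", "v")
    try:
        tablet.snapshot_op(op_id=2, flush_first=flush_first, flush_fails=True)
    except FlushError:
        pass  # the real process would crash at this point
    tablet.restart()
    return tablet.read("k")


print(run(flush_first=False))  # None: the write is lost after restart
print(run(flush_first=True))   # v: the write is recovered after restart
```

In the model, advancing the frontier first means a failed flush leaves a durable marker claiming the data is safe, so recovery skips it. Flushing first means a failed flush leaves the frontier untouched, and recovery reapplies the write.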