Recover YB-TServer from crash loop
When a YB-TServer process or node has failed, YugabyteDB automatically triggers a remote bootstrap for most types of tablet data corruption or failures. However, in some cases the automatic bootstrap may not solve the problem, resulting in a crash loop. This can happen when a condition leads to a code path encoded as a crash (for example, FATAL) or when the cause of a crash is unknown (for example, code errors that lead to SIGSEGV). In all of these cases, the root cause would likely repeat itself when the process is restarted. Moreover, when using YugabyteDB Anywhere, crashed processes are automatically restarted to ensure minimum downtime.
Because a server stuck in a crash loop typically cannot stay up long enough to safely accept runtime commands, the administrator must intervene manually to bring the YB-TServer back to a healthy state.
To do this, the administrator needs to find all the faulty tablets, look for their data on disk (possibly spread across multiple disks, depending on your fs_data_dirs), and then remove it.
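Because the data may sit on several disks, it can help to list the matches before deleting anything. The following is a minimal sketch, not an official tool: the function name list_tablet_files is hypothetical, and it assumes the directories are passed as a comma-separated list in the same form as fs_data_dirs. FOO stands in for the tablet UUID.

```sh
#!/bin/sh
# Hypothetical helper: print every file or directory whose name contains
# the tablet UUID, across all data directories. Nothing is deleted.
#   $1: comma-separated directory list (assumed to mirror fs_data_dirs)
#   $2: tablet UUID (FOO is a placeholder)
list_tablet_files() {
  # Split the list on commas, one directory per line.
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r dir; do
    # Skip entries that are not existing directories.
    [ -d "$dir" ] || continue
    find "$dir" -name "*$2*"
  done
}
```

For example, `list_tablet_files "/mnt/disk1,/mnt/disk2" FOO` prints the paths that the removal step would target on both disks.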
The following are the steps to address this scenario:
Stop the YB-TServer process to prevent new restarts during operations. For YugabyteDB Anywhere, execute the yb-server-ctl tserver stop command.
Find the tablets that are encountering these problems. You may consult logs to get the UUID of tablets. In the described scenario, the UUID of the tablet is
Find and remove all the tablet files, as follows:
find /mnt/disk1 -name '*FOO*' | xargs rm -rf
Repeat the preceding command for each disk in
Restart the YB-TServer process. In YugabyteDB Anywhere, execute the yb-server-ctl tserver start command.
When completed, the YB-TServer should be able to start, stay alive, and rejoin the cluster, while the centralized load balancer re-replicates or redistributes copies of any affected tablets.
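The steps above can be strung together in a small script. This is a sketch under stated assumptions rather than a supported procedure: the function name recover_tserver and the DRY_RUN and YB_CTL variables are hypothetical, yb-server-ctl is assumed to be on the PATH, and the directory list is assumed to be comma-separated. It defaults to a dry run that only prints what it would do.

```sh
#!/bin/sh
# Hypothetical wrapper: stop the YB-TServer, remove one tablet's files
# from every data directory, then start the YB-TServer again.
#   $1: comma-separated data directories (assumed to mirror fs_data_dirs)
#   $2: tablet UUID (FOO is a placeholder)
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually delete.
recover_tserver() {
  data_dirs=$1
  tablet_id=$2
  ctl=${YB_CTL:-yb-server-ctl}   # assumption: the command is on PATH

  # Refuse an empty UUID: the pattern '**' would match every file.
  [ -n "$tablet_id" ] || { echo "tablet UUID required" >&2; return 1; }

  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "would run: $ctl tserver stop"
  else
    "$ctl" tserver stop || return 1
  fi

  printf '%s\n' "$data_dirs" | tr ',' '\n' | while IFS= read -r dir; do
    [ -d "$dir" ] || continue
    if [ "${DRY_RUN:-1}" = 1 ]; then
      # -prune keeps find from descending into a matched directory.
      find "$dir" -name "*${tablet_id}*" -prune -print
    else
      find "$dir" -name "*${tablet_id}*" -prune -exec rm -rf {} +
    fi
  done

  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "would run: $ctl tserver start"
  else
    "$ctl" tserver start
  fi
}
```

Run it once with the default dry run, check the listed paths, and only then repeat with `DRY_RUN=0`. Using `find -exec` instead of a pipe to `xargs` avoids problems with unusual characters in path names.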