Handle node alerts
Universes deployed using YugabyteDB Anywhere include the following node alerts by default:
- DB Instance Down. A node is unreachable or down.
- DB Node Restart. The operating system rebooted.
- DB Instance Restart. A YugabyteDB process restarted (outside of a universe update).
If you are notified of one of these alerts, you can take the following steps.
DB Instance Down
This alert fires when Prometheus has been unable to scrape metrics from a node for more than 15 minutes (the default threshold).
What to do
- Check the Status column on the universe Nodes tab.
- If the status is Unreachable, confirm whether the host is up and reachable via your cloud provider (for example, AWS EC2) or on-premises environment.
  - If the host is down, restart it from the cloud console or equivalent. If necessary, follow the steps in Replace a live or unreachable node to replace the node.
  - If the host is running, check the status of the node_exporter process. Prometheus scrapes node metrics from the node_exporter service, so if the service is down, the node appears unreachable.
    - Connect to the node and check the status of node_exporter by running the following commands:
      ps -ef | grep node_exporter
      systemctl status node_exporter
    - Restart node_exporter if needed:
      sudo systemctl restart node_exporter
    - Confirm node_exporter access at http://<node_ip>:9300.
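The reachability check on node_exporter can also be scripted. The following is a minimal sketch, assuming curl is installed and that node_exporter serves metrics at its usual /metrics path on port 9300. The function names node_exporter_url and check_node_exporter are hypothetical helpers for illustration, not part of YugabyteDB Anywhere.

```shell
#!/usr/bin/env bash

# Build the node_exporter metrics URL for a node.
# The port defaults to 9300; the /metrics path is an assumption.
node_exporter_url() {
  local node_ip="$1" port="${2:-9300}"
  printf 'http://%s:%s/metrics\n' "$node_ip" "$port"
}

# Return 0 if the endpoint answers successfully within 5 seconds,
# and a non-zero status otherwise.
check_node_exporter() {
  curl --silent --fail --max-time 5 --output /dev/null \
    "$(node_exporter_url "$@")"
}

# Example usage (replace <node_ip> with the address of the node):
# check_node_exporter <node_ip> && echo "node_exporter reachable" \
#   || echo "node_exporter unreachable"
```

Running the check from the YugabyteDB Anywhere host rather than from the node itself is closer to what Prometheus sees, because it also exercises the network path and any firewall rules between the two.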
DB Node Restart
This alert tracks OS-level restarts using the node_boot_time metric.
What to do
- SSH into the host and check if the reboot was planned or due to an issue.
- Check logs as follows:
  cat /var/log/messages | grep -i reboot
  For Ubuntu:
  journalctl --list-boots
- Check for other causes, such as power loss, kernel panic, or scheduled reboots.
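To compare the time of the reboot with the time of the alert, it can help to read the boot time directly on the host. The following is a minimal sketch, assuming a Linux host where the kernel reports boot time as the btime field in /proc/stat. The function name boot_time_epoch is a hypothetical helper, and the optional file argument is there only so the function can be tried against a sample file.

```shell
#!/usr/bin/env bash

# Print the boot time (Unix epoch seconds) from a /proc/stat-style file.
# Reads /proc/stat unless another file is passed as the first argument.
boot_time_epoch() {
  local stat_file="${1:-/proc/stat}"
  awk '$1 == "btime" { print $2 }' "$stat_file"
}

# Example usage on the node (GNU date is assumed for the -d option):
# date -d "@$(boot_time_epoch)"
#
# Other standard commands that report the last boot:
# who -b
# last -x reboot
```

If the boot time falls inside a known maintenance window, the reboot was probably planned; otherwise, continue with the log checks above.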
DB Instance Restart
This alert fires when a YugabyteDB process (TServer or Master) restarts without a planned update.
What to do
- SSH into the node and inspect the following logs:
  - YugabyteDB logs (/home/yugabyte/tserver/logs/ or /mnt/d0/yb-data/tserver/logs/).
  - OS logs, for memory pressure or crash signals.
  Look for FATAL logs. A FATAL log file whose timestamp matches the time of the failure indicates a crash, and analyzing it will likely point to the root cause.
- Check whether a core dump or out-of-memory event occurred.
- If a user or automation restarted the process, confirm the intent and whether alerts can be tuned.
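The search for FATAL logs and out-of-memory events can be sketched as follows. This is a minimal example, assuming that FATAL log files carry the word FATAL in their file names and that kernel messages are readable with dmesg. The function name find_fatal_logs is a hypothetical helper for illustration.

```shell
#!/usr/bin/env bash

# List files whose names contain "FATAL" under a log directory,
# sorted by path. The naming pattern is an assumption.
find_fatal_logs() {
  local log_dir="$1"
  find "$log_dir" -type f -name '*FATAL*' | sort
}

# Example usage on the node:
# find_fatal_logs /home/yugabyte/tserver/logs/
#
# Kernel messages about out-of-memory kills (may require root):
# dmesg -T | grep -i -E 'out of memory|oom'
```

Compare the timestamps of any files found with the time the alert fired before treating them as related to the restart.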