Handle node alerts

What to do when you get a node alert

Universes deployed using YugabyteDB Anywhere include the following node alerts by default:

DB Instance Down. A node is unreachable or down.
DB Node Restart. The operating system rebooted.
DB Instance Restart. A YugabyteDB process restarted (outside of a universe update).

If you are notified of one of these alerts, you can take the following steps.

DB Instance Down

This alert fires when Prometheus has been unable to scrape a node for metrics for more than 15 minutes (the default threshold).

Check the universe Nodes tab Status column.
If the status is Unreachable, confirm whether the host is up and reachable via your cloud provider (for example, AWS EC2) or on-premises environment.
If the host is down, restart it from the cloud console or equivalent. If necessary, follow the steps in Replace a live or unreachable node to replace the node.
If the host is running, check the status of the node_exporter process. Prometheus uses the node_exporter service to export metrics, and if the service is down, the node will appear unreachable.
- Connect to the node and check the status of node_exporter by running the following commands:
```
ps -ef | grep node_exporter
systemctl status node_exporter
```
- Restart node_exporter if needed:
```
sudo systemctl restart node_exporter
```
- Confirm node_exporter access at http://<node_ip>:9300.
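The last check above can also be scripted. The following is a minimal sketch, not an official tool: `check_node_exporter` is a hypothetical helper name, the port defaults to the 9300 used above, and it assumes node_exporter serves metrics at its standard `/metrics` path and that `curl` is available on the machine running the check.

```shell
#!/usr/bin/env bash
# Sketch: probe the node_exporter metrics endpoint from a host that can reach the node.
# check_node_exporter is a hypothetical helper; 9300 is the port from the step above,
# and /metrics is assumed to be the metrics path.
check_node_exporter() {
  local node_ip="$1"
  local port="${2:-9300}"
  local body
  # --fail makes curl return non-zero on HTTP errors; --max-time bounds the wait.
  if body="$(curl --silent --fail --max-time 5 "http://${node_ip}:${port}/metrics" 2>/dev/null)" \
      && grep -q '^node_' <<<"$body"; then
    echo "node_exporter OK on ${node_ip}:${port}"
  else
    echo "node_exporter NOT responding on ${node_ip}:${port}"
    return 1
  fi
}

# Example (replace the placeholder with the node's address):
# check_node_exporter <node_ip>
```

A non-zero return means the endpoint did not answer with node metrics within five seconds, which is consistent with the node appearing unreachable to Prometheus.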

DB Node Restart

This alert tracks OS-level restarts using the node_boot_time metric.

Check logs as follows:

```
cat /var/log/messages | grep -i reboot
```

For Ubuntu:

```
journalctl --list-boots
```
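To compare the host's boot time with the time of the alert, you can read the boot timestamp directly on the node. The following is a minimal sketch for Linux hosts; `boot_epoch` and `rebooted_since` are hypothetical helper names, and it assumes the kernel's `btime` value in `/proc/stat` reflects the same boot time that the alert's metric reports.

```shell
#!/usr/bin/env bash
# Sketch: read the kernel boot time and compare it with a reference time.
# Both function names are hypothetical helpers, not part of any YugabyteDB tooling.

# Print the boot time as a Unix epoch (Linux only; reads btime from /proc/stat).
# An alternate file can be passed as the first argument.
boot_epoch() {
  awk '/^btime / {print $2}' "${1:-/proc/stat}"
}

# Succeed if the boot time is at or after the given epoch,
# meaning the host rebooted since that moment.
rebooted_since() {
  local boot="$1"
  local since="$2"
  [ "$boot" -ge "$since" ]
}

# Example: did the host reboot in the last hour?
# rebooted_since "$(boot_epoch)" "$(( $(date +%s) - 3600 ))" && echo "rebooted recently"
```

On systems with GNU date, `date -d "@$(boot_epoch)"` prints the same timestamp in readable form.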

DB Instance Restart

This alert fires when a YugabyteDB process (TServer or Master) restarts without a planned update.

SSH into the node and inspect the following logs:
- YugabyteDB logs (/home/yugabyte/tserver/logs/ or /mnt/d0/yb-data/tserver/logs/).
- OS logs, for memory pressure or crash signals.
Look for FATAL logs. A FATAL log file whose timestamp matches the time of the failure is a strong indicator of a crash, and analyzing it will likely point to the root cause.
Check whether a core dump or out of memory event occurred.
If a user or automation restarted the process, confirm the intent and whether alerts can be tuned.
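The log checks above can be sketched as two small shell helpers. This is an illustration under assumptions rather than a supported script: `list_fatal_logs` and `oom_events` are hypothetical names, FATAL log files are assumed to contain `FATAL` in their file names, and the grep pattern covers common Linux kernel out-of-memory messages but may not match every kernel version.

```shell
#!/usr/bin/env bash
# Sketch: locate FATAL log files and kernel out-of-memory messages.
# Both function names are hypothetical helpers.

# List files with FATAL in their name in the given log directory, newest first.
# Prints nothing if no such files exist.
list_fatal_logs() {
  local log_dir="$1"
  ls -1t -- "$log_dir"/*FATAL* 2>/dev/null
}

# Filter stdin for common kernel OOM messages (pattern is an assumption).
oom_events() {
  grep -iE 'out of memory|oom-kill'
}

# Examples, using a log path from the steps above:
# list_fatal_logs /home/yugabyte/tserver/logs
# sudo dmesg -T | oom_events
# sudo journalctl -k | oom_events
```

Compare the modification time of the newest FATAL file and the timestamps of any OOM messages with the time the alert fired to decide which cause is more likely.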