Universes deployed using YugabyteDB Anywhere include the following node alerts by default:

  • DB Instance Down. A node is unreachable or down.
  • DB Node Restart. The operating system rebooted.
  • DB Instance Restart. A YugabyteDB process restarted (outside of a universe update).

If you are notified of one of these alerts, you can take the following steps.

DB Instance Down

This alert fires when Prometheus is unable to scrape a node for metrics for (by default) more than 15 minutes.

What to do

  1. Navigate to the universe's Nodes tab and check the Status column.

  2. If the status is Unreachable, confirm whether the host is up and reachable via your cloud provider (for example, AWS EC2) or on-premises environment.
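
    For example, on AWS you can check the instance state with the AWS CLI (the instance ID below is a placeholder):

    aws ec2 describe-instance-status --instance-ids i-0abcd1234example --include-all-instances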

  3. If the host is down, restart it from the cloud console or equivalent.
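
    For example, with the AWS CLI you can start a stopped instance or reboot an unresponsive one (the instance ID is a placeholder):

    aws ec2 start-instances --instance-ids i-0abcd1234example
    aws ec2 reboot-instances --instance-ids i-0abcd1234example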

    If necessary, follow the steps in Replace a live or unreachable node to replace the node.

  4. If the host is running, check the status of the node_exporter process. Prometheus scrapes metrics exported by the node_exporter service, and if the service is down, the node appears unreachable.

    • Connect to the node and check the status of node_exporter by running the following commands:

      ps -ef | grep node_exporter
      systemctl status node_exporter
      
    • Restart node_exporter if needed:

      sudo systemctl restart node_exporter
      
    • Confirm node_exporter access at http://<node_ip>:9300.
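
      For example, one quick way to verify that the metrics endpoint responds:

      curl -s http://<node_ip>:9300/metrics | head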

DB Node Restart

This alert tracks OS-level restarts using the node_boot_time metric.

What to do

  1. SSH into the host and check whether the reboot was planned or caused by an issue.
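
    For example, you can confirm when the host last booted and review recent reboot records:

    who -b
    last reboot | head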

  2. Check the system logs for evidence of the reboot.

    For RHEL-based systems:

    grep -i reboot /var/log/messages

    For Ubuntu:

    journalctl --list-boots

  3. Check for other causes, such as power loss, kernel panic, or scheduled reboots.
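
    For example, on hosts with a persistent systemd journal, you can review error-level messages from the previous boot, which often capture kernel panics or unexpected shutdowns:

    journalctl -b -1 -p err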

DB Instance Restart

This alert fires when a YugabyteDB process (TServer or Master) restarts without a planned update.

What to do

  1. SSH into the node and inspect the following logs:

    • YugabyteDB logs (/home/yugabyte/tserver/logs/ or /mnt/d0/yb-data/tserver/logs/).
    • OS logs, for signs of memory pressure or crash-related messages.

    Look for FATAL logs. A FATAL log file with a timestamp matching the time of the failure is a strong indicator of a crash, and analyzing that log will likely point to the root cause.
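
    For example, you can list any FATAL log files written around the time of the restart (adjust the path if your universe stores data under a different mount point, such as /mnt/d0):

    ls -lt /home/yugabyte/tserver/logs/*FATAL* | head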

  2. Check whether a core dump or out-of-memory (OOM) event occurred.
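
    For example, kernel OOM kills are recorded in the kernel log, and on hosts using systemd-coredump you can list recent core dumps:

    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
    coredumpctl list | tail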

  3. If a user or automation restarted the process, confirm the intent and whether alerts can be tuned.