Handling region failures

Be resilient to failures of entire regions

YugabyteDB is resilient to the failure of a single fault domain in a deployment with a replication factor (RF) of 3. To survive the failure of an entire region, you deploy across multiple regions. Let's see how YugabyteDB survives a region failure.

RF 3 vs RF 5

Although you could use an RF 3 cluster, an RF 5 cluster provides quicker failover: with two replicas in each of the first two preferred regions, when a leader fails, a follower in the same region can be elected as the new leader, rather than a follower in a different region.
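If you manage replica placement yourself with yb-admin (rather than letting yugabyted configure it), an RF 5 placement with two replicas in each of the first two preferred regions might look like the following sketch. The zone names match the ones used later in this tutorial; the master addresses are placeholders.

./bin/yb-admin \
        --master_addresses <master-ip1>:7100,<master-ip2>:7100,<master-ip3>:7100 \
        modify_placement_info aws.us-east.us-east-1a:2,aws.us-central.us-central-1a:2,aws.us-west.us-west-1a:1 5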

Setup

Consider a scenario where you have deployed your database across three regions: us-west, us-east, and us-central. Typically, you choose one region as the preferred region for your database; this is the region where your applications are active. Then determine which region is closest to the preferred region, as it will serve as the failover region for your applications, and set that region as the second preferred region for the database. The third region needs no explicit setting and automatically becomes the third preferred region.

Set up a local cluster

If a local universe is currently running, first destroy it.
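For example, if you previously created nodes using the base directories used in this section, you can destroy them as follows:

./bin/yugabyted destroy --base_dir=${HOME}/var/node1
./bin/yugabyted destroy --base_dir=${HOME}/var/node2
./bin/yugabyted destroy --base_dir=${HOME}/var/node3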

Start a local three-node universe with an RF of 3 by first creating a single node, as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east.us-east-1a

On macOS, the additional nodes need loopback addresses configured, as follows:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Next, add more nodes to the universe, joining each new node to the first node. yugabyted automatically applies a replication factor of 3 when the third node is added.

Start the second node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.2 \
                --base_dir=${HOME}/var/node2 \
                --cloud_location=aws.us-central.us-central-1a \
                --join=127.0.0.1

Start the third node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.3 \
                --base_dir=${HOME}/var/node3 \
                --cloud_location=aws.us-west.us-west-1a \
                --join=127.0.0.1

After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:

./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=region

This command can be executed on any node where you already started YugabyteDB.
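You can also designate us-east as the preferred region and us-central as the second preferred region, so that leaders (and leadership after a failover) land where your applications run. The following is a sketch using yb-admin, assuming the default master RPC port of 7100 and a YugabyteDB version that supports zone priorities:

./bin/yb-admin \
        --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
        set_preferred_zones aws.us-east.us-east-1a:1 aws.us-central.us-central-1a:2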

To check the status of a running multi-node universe, run the following command:

./bin/yugabyted status --base_dir=${HOME}/var/node1
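You can also list the tablet servers that have registered with the masters using yb-admin (a sketch, again assuming the default master RPC port of 7100):

./bin/yb-admin \
        --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
        list_all_tablet_servers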

Alternatively, to set up a universe in YugabyteDB Anywhere, refer to Set up a YugabyteDB Anywhere universe.

All illustrations adhere to the legend outlined in Legend for illustrations.

In the following illustration, the leaders are in us-east (the preferred region), which is also where the applications are active. The standby application is in us-central, which will act as the failover target and has been set as the second preferred region for the database.

Sync replication setup - Handling region outage

Users reach the application via a router/load balancer, which points to the active application in us-east.

When the application writes data, the leaders replicate it to the followers in both us-west and us-central.

Because us-central is closer to us-east than us-west is, the followers in us-central acknowledge writes faster and stay closely up-to-date with the leaders in us-east.

Third region failure

If the third (least preferred) region fails, availability is not affected at all. The leaders are in the preferred region and a follower is still available in the second preferred region, so every write can still reach a majority (2 of 3 replicas). There is no data loss and no recovery is needed.

Simulate failure of the third region locally

To simulate the failure of the third region locally, stop the third node, as follows:

./bin/yugabyted stop --base_dir=${HOME}/var/node3

To stop a node in YB Anywhere, see YBA - Manage nodes.

The following illustration shows the scenario of the third region (us-west) failing. As us-east has been set as the preferred region, there are no leaders in us-west. When this region fails, availability of the applications is not affected at all, as there is still an active follower (in us-central) and replication continues correctly.

Third region failure - Handling region outage
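If you want to continue with the next simulations, bring the stopped node back first so that the universe is at full strength again. Assuming yugabyted keeps the node configuration in its base directory, restarting with the same base directory is enough; the node rejoins the universe and its followers catch up automatically:

./bin/yugabyted start --base_dir=${HOME}/var/node3

The same applies to the other nodes stopped in the following sections.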

Secondary region failure

When the second preferred region fails, availability is not affected at all, because there are no leaders in this region: all the leaders are in the preferred region. However, your write latency increases, as every write to a leader now has to wait for acknowledgment from the follower in the third region, which is farther away. Reads are not affected, as the primary application reads from the leaders in the preferred region. There is no data loss at all.

Simulate failure of the secondary region locally

To simulate the failure of the secondary region locally, stop the second node, as follows:

./bin/yugabyted stop --base_dir=${HOME}/var/node2

To stop a node in YB Anywhere, see YBA - Manage nodes.

In the following illustration, you can see that because us-central has failed, writes are now acknowledged by the followers in us-west. This leads to higher write latencies, as us-west is farther from us-east than us-central is, but there is no data loss.

Second region failure - Handling region outage
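While the second region is down, you can verify that the universe still accepts writes by connecting to the surviving leader node with ysqlsh. This is a minimal sketch using the default local connection settings; the table is purely illustrative, and the higher write latency is only observable in a real multi-region deployment, not on loopback addresses:

./bin/ysqlsh -h 127.0.0.1 -c "CREATE TABLE IF NOT EXISTS region_failover_demo (k INT PRIMARY KEY, v TEXT)"
./bin/ysqlsh -h 127.0.0.1 -c "INSERT INTO region_failover_demo (k, v) VALUES (1, 'written while us-central is down')"
./bin/ysqlsh -h 127.0.0.1 -c "SELECT * FROM region_failover_demo"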

Preferred region failure

When the preferred region fails, there is no data loss, but availability is affected for a short time. This is because all the leaders are located in this region, and your applications are also active in this region. At the moment of the preferred region failure, all tablets are momentarily without a leader.

Simulate failure of the primary region locally

To simulate the failure of the primary region locally, stop the first node, as follows:

./bin/yugabyted stop --base_dir=${HOME}/var/node1

To stop a node in YB Anywhere, see YBA - Manage nodes.

The following illustration shows the failure of the preferred region, us-east.

Preferred region failure - Handling region outage

Because the tablets no longer have leaders, leader election is triggered, and the followers in the second preferred region (us-central) are promoted to leaders. Leader election typically takes about 3 seconds.

The following illustration shows that the followers in us-central have been elected as leaders.

Preferred region failure - Leader election
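The 3-second figure comes from Raft failure detection: a follower starts a new election after missing several leader heartbeats. Assuming the common yb-tserver defaults of raft_heartbeat_interval_ms=500 and leader_failure_max_missed_heartbeat_periods=6, detection takes roughly 500 ms x 6 = 3 seconds. If you want to experiment with faster detection on a local universe, one way (a sketch; the values are illustrative, and setting them too low can cause spurious elections) is to pass the flags when starting each node:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east.us-east-1a \
                --tserver_flags=raft_heartbeat_interval_ms=300,leader_failure_max_missed_heartbeat_periods=5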

You now have to fail over your applications. Because you have already set up a standby application in us-central, you only have to update your load balancer to route user traffic to the secondary region. The following illustration shows the load balancer routing user requests to the application instance in us-central.

Preferred region failure - Leader election

Smart driver

You can opt to use a YugabyteDB Smart Driver in your applications. The major advantage of this driver is that it is aware of the cluster topology and automatically routes user requests to the secondary region if the preferred region fails.

The following illustration shows how the primary application in us-east (assuming it is still running) automatically routes user requests to the secondary region when the preferred region fails, without any change needed in the load balancer.

Preferred region failure - Leader election

Region failure in a two-region setup

There may be scenarios where you want to deploy the database in just one region. It is quite common for enterprises to have one data center as their primary and another data center just for failover. For this scenario, you can deploy YugabyteDB in your primary data center and set up another cluster in the second data center that gets the data from the primary cluster via asynchronous replication. This is also known as the 2DC or xCluster model.

You can set this up by following the instructions in the Active-Active Single-Master pattern.
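For orientation only, setting up such one-way replication with yb-admin typically involves creating a replication stream on the target (standby) cluster that points at the source (primary) cluster. The following is a rough sketch; the universe UUID, master addresses, and table IDs are placeholders, and the pattern linked above describes the full procedure, including bootstrapping existing data:

./bin/yb-admin \
        --master_addresses <target-master-ip1>:7100,<target-master-ip2>:7100 \
        setup_universe_replication <source-universe-uuid> \
        <source-master-ip1>:7100,<source-master-ip2>:7100 \
        <comma-separated-source-table-ids>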