Fault tolerance
YugabyteDB can automatically handle failures and therefore provides high availability. We will create YSQL tables in a cluster with a replication factor (RF) of 3, which allows a fault tolerance of 1: each tablet needs a majority (2 of 3) of its replicas available to accept writes, so the cluster remains available for both reads and writes even if one node fails. However, if another node fails, bringing the number of failures to 2, writes become unavailable on the cluster in order to preserve data consistency.
If you haven't installed YugabyteDB yet, do so first by following the Quick Start guide.
1. Create a universe
If you have a previously running local universe, destroy it using the following.
$ ./bin/yb-ctl destroy
Start a new local cluster - a 3-node universe with a replication factor of 3.
$ ./bin/yb-ctl --rf 3 create
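Before running a workload, you can optionally confirm that all 3 nodes are up (this assumes your yb-ctl version provides a status command).
$ ./bin/yb-ctl status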
2. Run the sample key-value app
Download the sample app JAR file.
$ wget https://github.com/yugabyte/yb-sample-apps/releases/download/v1.2.0/yb-sample-apps.jar?raw=true -O yb-sample-apps.jar
Run the SqlInserts sample key-value app against the local universe by typing the following command.
$ java -jar ./yb-sample-apps.jar --workload SqlInserts \
--nodes 127.0.0.1:5433 \
--num_threads_write 1 \
--num_threads_read 4
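Note that the command above points the app at a single node. Since later steps remove nodes, you may prefer to pass all three node addresses so the app can fail over between them. This is a sketch based on two assumptions: that the --nodes flag accepts a comma-separated list, and that yb-ctl placed the nodes at its default addresses 127.0.0.1 through 127.0.0.3.
$ java -jar ./yb-sample-apps.jar --workload SqlInserts \
--nodes 127.0.0.1:5433,127.0.0.2:5433,127.0.0.3:5433 \
--num_threads_write 1 \
--num_threads_read 4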
The sample application prints some stats while running, as shown below. You can read more details about the output of the sample applications here.
2018-05-10 09:10:19,538 [INFO|...] Read: 8988.22 ops/sec (0.44 ms/op), 818159 total ops | Write: 1095.77 ops/sec (0.91 ms/op), 97120 total ops | ...
2018-05-10 09:10:24,539 [INFO|...] Read: 9110.92 ops/sec (0.44 ms/op), 863720 total ops | Write: 1034.06 ops/sec (0.97 ms/op), 102291 total ops | ...
3. Observe even load across all nodes
You can check the per-node stats by browsing to the tablet-servers page, which shows the total read and write IOPS for each node. Note that both the reads and the writes are roughly the same across all the nodes, indicating uniform usage across the cluster.
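With yb-ctl defaults, the yb-master Admin UI for the first node should be available on port 7000, so the page can be opened directly. The exact URL is an assumption based on those defaults.
$ open http://127.0.0.1:7000/tablet-servers   # macOS; on Linux, use xdg-open or paste the URL into a browser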
4. Remove a node and observe continuous write availability
Remove a node from the universe.
$ ./bin/yb-ctl remove_node 3
Refresh the tablet-servers page to see the stats update. The Time since heartbeat value for the removed node will keep increasing. Once it crosses 60 seconds (1 minute), YugabyteDB changes the status of that node from ALIVE to DEAD. Note that at this point the universe is running in an under-replicated state for some subset of tablets.
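The 60-second threshold is governed by a yb-master flag rather than being hard-coded (assumption: the flag is tserver_unresponsive_timeout_ms, specified in milliseconds). If you want to experiment with a shorter window, the universe could be created with the flag overridden, for example:
# Assumes yb-ctl's --master_flags option and the flag name above
$ ./bin/yb-ctl --rf 3 create --master_flags "tserver_unresponsive_timeout_ms=30000"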
5. Remove another node and observe write unavailability
Remove another node from the universe.
$ ./bin/yb-ctl remove_node 2
Refresh the tablet-servers page to see the stats update. Writes are now unavailable, but reads can continue to be served for whichever tablets are available on the remaining node.
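To restore write availability, it should be enough to bring back one of the stopped nodes so that a majority (2 of 3) of the replicas for each tablet is alive again. This sketch assumes yb-ctl offers a start_node command that restarts a previously removed node.
# Assumes yb-ctl supports start_node; node 3 was removed above
$ ./bin/yb-ctl start_node 3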
6. Clean up (optional)
Optionally, you can shutdown the local cluster created in Step 1.
$ ./bin/yb-ctl destroy
1. Create universe
If you have a previously running local universe, destroy it using the following.
$ ./yb-docker-ctl destroy
Start a new local universe with replication factor 5.
$ ./yb-docker-ctl create --rf 5
Connect to cqlsh on node 1.
$ docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh
Connected to local cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>
Create a Cassandra keyspace and a table.
cqlsh> CREATE KEYSPACE users;
cqlsh> CREATE TABLE users.profile (id bigint PRIMARY KEY,
email text,
password text,
profile frozen<map<text, text>>);
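To double-check the schema, you can describe the table from the same session. DESCRIBE is a cqlsh shell command and should also work in the cqlsh bundled with YugabyteDB.
cqlsh> DESC TABLE users.profile;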
2. Insert data through node 1
Now insert some data by typing the following into the cqlsh shell we connected to above.
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
(1000, '[email protected]', 'licensed2Kill',
{'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
);
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
(2000, '[email protected]', 'itsElementary',
{'firstname': 'Sherlock', 'lastname': 'Holmes'}
);
Query all the rows.
cqlsh> SELECT email, profile FROM users.profile;
email | profile
------------------------------+---------------------------------------------------------------
[email protected] | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
[email protected] | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
(2 rows)
3. Read data through another node
Let us now query the data from node 5.
$ docker exec -it yb-tserver-n5 /home/yugabyte/bin/cqlsh
cqlsh> SELECT email, profile FROM users.profile;
email | profile
------------------------------+---------------------------------------------------------------
[email protected] | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
[email protected] | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
(2 rows)
4. Verify that one node failure has no impact
We have 5 nodes in this universe. You can verify this by running the following.
$ ./yb-docker-ctl status
Let us simulate node 5 failure by doing the following.
$ ./yb-docker-ctl remove_node 5
Now running the status command should show only 4 nodes:
$ ./yb-docker-ctl status
Now connect to node 4.
$ docker exec -it yb-tserver-n4 /home/yugabyte/bin/cqlsh
Let us insert some data.
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
(3000, '[email protected]', 'imGroovy',
{'firstname': 'Austin', 'lastname': 'Powers'});
Now query the data.
cqlsh> SELECT email, profile FROM users.profile;
email | profile
------------------------------+---------------------------------------------------------------
[email protected] | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
[email protected] | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
[email protected] | {'firstname': 'Austin', 'lastname': 'Powers'}
(3 rows)
5. Verify that second node failure has no impact
This cluster was created with a replication factor of 5 and hence needs only 3 replicas (a majority of 5) to reach consensus. Therefore, it is resilient to 2 failures without any data loss. Let us simulate another node failure.
$ ./yb-docker-ctl remove_node 1
We can check the status to verify:
$ ./yb-docker-ctl status
Now let us connect to node 2.
$ docker exec -it yb-tserver-n2 /home/yugabyte/bin/cqlsh
Insert some data.
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
(4000, '[email protected]', 'iCanFly',
{'firstname': 'Clark', 'lastname': 'Kent'});
Run the query.
cqlsh> SELECT email, profile FROM users.profile;
email | profile
------------------------------+---------------------------------------------------------------
[email protected] | {'firstname': 'Clark', 'lastname': 'Kent'}
[email protected] | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
[email protected] | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
[email protected] | {'firstname': 'Austin', 'lastname': 'Powers'}
(4 rows)
6. Clean up (optional)
Optionally, you can shutdown the local cluster created in Step 1.
$ ./yb-docker-ctl destroy
1. Create universe
If you have a previously running local universe, destroy it using the following.
$ kubectl delete -f yugabyte-statefulset.yaml
Start a new local cluster. By default, this will create a 3-node universe with a replication factor of 3.
$ kubectl apply -f yugabyte-statefulset.yaml
Check the Kubernetes dashboard to see the 3 yb-tserver and 3 yb-master pods representing the 3 nodes of the cluster.
$ minikube dashboard
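If you prefer the command line to the dashboard, listing the pods should show the same three yb-master and three yb-tserver pods.
$ kubectl get pods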
Connect to cqlsh on node 1.
$ kubectl exec -it yb-tserver-0 /home/yugabyte/bin/cqlsh
Connected to local cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>
Create a Cassandra keyspace and a table.
cqlsh> CREATE KEYSPACE users;
cqlsh> CREATE TABLE users.profile (id bigint PRIMARY KEY,
email text,
password text,
profile frozen<map<text, text>>);
2. Insert data through node 1
Now insert some data by typing the following into the cqlsh shell we connected to above.
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
(1000, '[email protected]', 'licensed2Kill',
{'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
);
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
(2000, '[email protected]', 'itsElementary',
{'firstname': 'Sherlock', 'lastname': 'Holmes'}
);
Query all the rows.
cqlsh> SELECT email, profile FROM users.profile;
email | profile
------------------------------+---------------------------------------------------------------
[email protected] | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
[email protected] | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
(2 rows)
3. Read data through another node
Let us now query the data from node 3.
$ kubectl exec -it yb-tserver-2 /home/yugabyte/bin/cqlsh
cqlsh> SELECT email, profile FROM users.profile;
email | profile
------------------------------+---------------------------------------------------------------
[email protected] | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
[email protected] | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
(2 rows)
cqlsh> exit;
4. Verify one node failure has no impact
This cluster was created with a replication factor of 3 and hence needs only 2 replicas (a majority of 3) to reach consensus. Therefore, it is resilient to 1 failure without any data loss. Let us simulate a failure of node 3.
$ kubectl delete pod yb-tserver-2
Running the status command should now show that the yb-tserver-2 pod is Terminating.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
yb-master-0 1/1 Running 0 33m
yb-master-1 1/1 Running 0 33m
yb-master-2 1/1 Running 0 33m
yb-tserver-0 1/1 Running 1 33m
yb-tserver-1 1/1 Running 1 33m
yb-tserver-2 1/1 Terminating 0 33m
Now connect to node 2.
$ kubectl exec -it yb-tserver-1 /home/yugabyte/bin/cqlsh
Let us insert some data to ensure that the loss of a node hasn't impacted the ability of the universe to take writes.
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
(3000, '[email protected]', 'imGroovy',
{'firstname': 'Austin', 'lastname': 'Powers'});
Now query the data. We see that all the data inserted so far is returned and the loss of the node has no impact on data integrity.
cqlsh> SELECT email, profile FROM users.profile;
email | profile
------------------------------+---------------------------------------------------------------
[email protected] | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
[email protected] | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
[email protected] | {'firstname': 'Austin', 'lastname': 'Powers'}
(3 rows)
5. Verify that Kubernetes brought back the failed node
We can now check the cluster status to verify that Kubernetes has indeed brought back the yb-tserver-2
node that had failed before. This is because the replica count currently effective in Kubernetes for the yb-tserver
StatefulSet is 3 and there were only 2 nodes remaining after 1 node failure.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
yb-master-0 1/1 Running 0 34m
yb-master-1 1/1 Running 0 34m
yb-master-2 1/1 Running 0 34m
yb-tserver-0 1/1 Running 1 34m
yb-tserver-1 1/1 Running 1 34m
yb-tserver-2 1/1 Running 0 7s
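You can also confirm the desired versus ready replica counts at the StatefulSet level (assuming the StatefulSets are named yb-master and yb-tserver, matching the pod names above).
$ kubectl get statefulsets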
YugabyteDB's fault tolerance, combined with Kubernetes's automated operations, makes it possible to run planet-scale applications with ease while maintaining extreme data resilience.
6. Clean up (optional)
Optionally, you can shutdown the local cluster created in Step 1.
$ kubectl delete -f yugabyte-statefulset.yaml
Further, to destroy the persistent volume claims (you will lose all the data if you do this), run:
$ kubectl delete pvc -l app=yb-master
$ kubectl delete pvc -l app=yb-tserver
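To confirm that the claims were removed, list the remaining persistent volume claims.
$ kubectl get pvc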