Fault tolerance

YugabyteDB automatically handles failures and therefore provides high availability. In this tutorial, we will create YSQL tables with a replication factor of 3, which gives a fault tolerance of 1. This means the cluster remains available for both reads and writes even if one node fails. However, if another node fails, bringing the number of failed nodes to 2, writes become unavailable on the cluster in order to preserve data consistency.
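
As a quick check on the arithmetic (this follows the standard majority-quorum rule that YugabyteDB's replication uses), a cluster with replication factor RF stays writable as long as a majority of replicas is reachable, so it tolerates floor((RF - 1) / 2) node failures:

RF = 3  ->  majority = 2  ->  tolerates 1 failure   (this walkthrough)
RF = 5  ->  majority = 3  ->  tolerates 2 failures  (the Docker walkthrough below)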

If you haven't installed YugabyteDB yet, do so first by following the Quick Start guide.

The steps below are shown for each of the following environments:

  • macOS / Linux
  • Docker
  • Kubernetes

macOS / Linux

1. Create a universe

If you have a previously running local universe, destroy it using the following.

$ ./bin/yb-ctl destroy

Start a new local cluster - a 3-node universe with a replication factor of 3.

$ ./bin/yb-ctl --rf 3 create
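
Before starting the workload, you can optionally confirm that all 3 nodes came up; yb-ctl has a status subcommand for this (shown here as a quick check, output omitted):

$ ./bin/yb-ctl status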

2. Run the sample key-value app

Download the sample app JAR file.

$ wget https://github.com/yugabyte/yb-sample-apps/releases/download/v1.2.0/yb-sample-apps.jar?raw=true -O yb-sample-apps.jar 

Run the SqlInserts sample key-value app against the local universe by typing the following command.

$ java -jar ./yb-sample-apps.jar --workload SqlInserts \
                                    --nodes 127.0.0.1:5433 \
                                    --num_threads_write 1 \
                                    --num_threads_read 4

The sample application prints some statistics while running, as shown below. You can read more details about the output of the sample applications in their documentation.

2018-05-10 09:10:19,538 [INFO|...] Read: 8988.22 ops/sec (0.44 ms/op), 818159 total ops  |  Write: 1095.77 ops/sec (0.91 ms/op), 97120 total ops  | ... 
2018-05-10 09:10:24,539 [INFO|...] Read: 9110.92 ops/sec (0.44 ms/op), 863720 total ops  |  Write: 1034.06 ops/sec (0.97 ms/op), 102291 total ops  | ...

3. Observe even load across all nodes

You can check the per-node stats by browsing to the tablet-servers page; it should look like the screenshot below, with the total read and write IOPS per node highlighted. Note that both the reads and the writes are roughly the same across all the nodes, indicating uniform usage.

Read and write IOPS with 3 nodes
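
The tablet-servers page is served by the YB-Master admin UI. For a local cluster created with yb-ctl, it is normally reachable at the address below (assuming the default master web port of 7000):

http://127.0.0.1:7000/tablet-servers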

4. Remove a node and observe continuous write availability

Remove a node from the universe.

$ ./bin/yb-ctl remove_node 3

Refresh the tablet-servers page to see the stats update. The Time since heartbeat value for that node will keep increasing. Once that number reaches 60s (i.e. 1 minute), YugabyteDB will change the status of that node from ALIVE to DEAD. Note that at this time the universe is running in an under-replicated state for some subset of tablets.

Read and write IOPS with 3rd node dead

5. Remove another node and observe write unavailability

Remove another node from the universe.

$ ./bin/yb-ctl remove_node 2

Refresh the tablet-servers page to see the stats update. Writes are now unavailable, but reads can continue to be served for whichever tablets are available on the remaining node.

Read and write IOPS with 2nd node removed

6. [Optional] Clean up

Optionally, you can shut down the local cluster created in Step 1.

$ ./bin/yb-ctl destroy

Docker

1. Create a universe

If you have a previously running local universe, destroy it using the following.

$ ./yb-docker-ctl destroy

Start a new local universe with replication factor 5.

$ ./yb-docker-ctl create --rf 5 

Connect to cqlsh on node 1.

$ docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh
Connected to local cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>

Create a Cassandra keyspace and a table.

cqlsh> CREATE KEYSPACE users;
cqlsh> CREATE TABLE users.profile (id bigint PRIMARY KEY,
	                               email text,
	                               password text,
	                               profile frozen<map<text, text>>);
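
If you want to confirm that the keyspace and table were created as intended, the standard cqlsh DESCRIBE command should work here as a quick check (output omitted):

cqlsh> DESCRIBE TABLE users.profile;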

2. Insert data through node 1

Now insert some data by typing the following into the cqlsh shell we connected to above.

cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
  (1000, 'james.bond@yugabyte.com', 'licensed2Kill',
   {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
  );
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
  (2000, 'sherlock.holmes@yugabyte.com', 'itsElementary',
   {'firstname': 'Sherlock', 'lastname': 'Holmes'}
  );

Query all the rows.

cqlsh> SELECT email, profile FROM users.profile;
 email                        | profile
------------------------------+---------------------------------------------------------------
      james.bond@yugabyte.com | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
 sherlock.holmes@yugabyte.com |               {'firstname': 'Sherlock', 'lastname': 'Holmes'}

(2 rows)

3. Read data through another node

Let us now query the data from node 5.

$ docker exec -it yb-tserver-n5 /home/yugabyte/bin/cqlsh
cqlsh> SELECT email, profile FROM users.profile;
 email                        | profile
------------------------------+---------------------------------------------------------------
      james.bond@yugabyte.com | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
 sherlock.holmes@yugabyte.com |               {'firstname': 'Sherlock', 'lastname': 'Holmes'}

(2 rows)

4. Verify that one node failure has no impact

We have 5 nodes in this universe. You can verify this by running the following.

$ ./yb-docker-ctl status

Let us simulate node 5 failure by doing the following.

$ ./yb-docker-ctl remove_node 5

Now running the status command should show only 4 nodes:

$ ./yb-docker-ctl status

Now connect to node 4.

$ docker exec -it yb-tserver-n4 /home/yugabyte/bin/cqlsh

Let us insert some data.

cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES 
  (3000, 'austin.powers@yugabyte.com', 'imGroovy',
   {'firstname': 'Austin', 'lastname': 'Powers'});

Now query the data.

cqlsh> SELECT email, profile FROM users.profile;
 email                        | profile
------------------------------+---------------------------------------------------------------
      james.bond@yugabyte.com | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
 sherlock.holmes@yugabyte.com |               {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com |                 {'firstname': 'Austin', 'lastname': 'Powers'}

(3 rows)

5. Verify that second node failure has no impact

This cluster was created with a replication factor of 5 and hence needs only 3 replicas to reach consensus. Therefore, it is resilient to 2 failures without any data loss. Let us simulate another node failure.

$ ./yb-docker-ctl remove_node 1

We can check the status to verify:

$ ./yb-docker-ctl status

Now let us connect to node 2.

$ docker exec -it yb-tserver-n2 /home/yugabyte/bin/cqlsh

Insert some data.

cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
  (4000, 'superman@yugabyte.com', 'iCanFly',
   {'firstname': 'Clark', 'lastname': 'Kent'});

Run the query.

cqlsh> SELECT email, profile FROM users.profile;
 email                        | profile
------------------------------+---------------------------------------------------------------
        superman@yugabyte.com |                    {'firstname': 'Clark', 'lastname': 'Kent'}
      james.bond@yugabyte.com | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
 sherlock.holmes@yugabyte.com |               {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com |                 {'firstname': 'Austin', 'lastname': 'Powers'}

(4 rows)

6. Clean up (optional)

Optionally, you can shut down the local cluster created in Step 1.

$ ./yb-docker-ctl destroy

Kubernetes

1. Create a universe

If you have a previously running local universe, destroy it using the following.

$ kubectl delete -f yugabyte-statefulset.yaml

Start a new local cluster - by default, this will create a 3-node universe with a replication factor of 3.

$ kubectl apply -f yugabyte-statefulset.yaml

Check the Kubernetes dashboard to see the 3 yb-tserver and 3 yb-master pods representing the 3 nodes of the cluster.

$ minikube dashboard

Kubernetes Dashboard
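
If you prefer the command line to the dashboard, the same information is available from kubectl; this lists both the yb-master and yb-tserver pods:

$ kubectl get pods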

Connect to cqlsh on node 1.

$ kubectl exec -it yb-tserver-0 /home/yugabyte/bin/cqlsh
Connected to local cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>

Create a Cassandra keyspace and a table.

cqlsh> CREATE KEYSPACE users;
cqlsh> CREATE TABLE users.profile (id bigint PRIMARY KEY,
	                               email text,
	                               password text,
	                               profile frozen<map<text, text>>);

2. Insert data through node 1

Now insert some data by typing the following into the cqlsh shell we connected to above.

cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
  (1000, 'james.bond@yugabyte.com', 'licensed2Kill',
   {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
  );
cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
  (2000, 'sherlock.holmes@yugabyte.com', 'itsElementary',
   {'firstname': 'Sherlock', 'lastname': 'Holmes'}
  );

Query all the rows.

cqlsh> SELECT email, profile FROM users.profile;
 email                        | profile
------------------------------+---------------------------------------------------------------
      james.bond@yugabyte.com | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
 sherlock.holmes@yugabyte.com |               {'firstname': 'Sherlock', 'lastname': 'Holmes'}

(2 rows)

3. Read data through another node

Let us now query the data from node 3.

$ kubectl exec -it yb-tserver-2 /home/yugabyte/bin/cqlsh
cqlsh> SELECT email, profile FROM users.profile;
 email                        | profile
------------------------------+---------------------------------------------------------------
      james.bond@yugabyte.com | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
 sherlock.holmes@yugabyte.com |               {'firstname': 'Sherlock', 'lastname': 'Holmes'}

(2 rows)

cqlsh> exit;

4. Verify one node failure has no impact

This cluster was created with a replication factor of 3 and hence needs only 2 replicas to reach consensus. Therefore, it is resilient to 1 failure without any data loss. Let us simulate node 3 failure.

$ kubectl delete pod yb-tserver-2

Now running the status command would show that the yb-tserver-2 pod is Terminating.

$ kubectl get pods
NAME           READY     STATUS        RESTARTS   AGE
yb-master-0    1/1       Running       0          33m
yb-master-1    1/1       Running       0          33m
yb-master-2    1/1       Running       0          33m
yb-tserver-0   1/1       Running       1          33m
yb-tserver-1   1/1       Running       1          33m
yb-tserver-2   1/1       Terminating   0          33m
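
Kubernetes will shortly recreate this pod (see step 5 below). If you want to watch that happen in real time, kubectl's standard watch flag works; press Ctrl-C to stop watching:

$ kubectl get pods -w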

Now connect to node 2.

$ kubectl exec -it yb-tserver-1 /home/yugabyte/bin/cqlsh

Let us insert some data to ensure that the loss of a node hasn't impacted the ability of the universe to take writes.

cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES 
  (3000, 'austin.powers@yugabyte.com', 'imGroovy',
   {'firstname': 'Austin', 'lastname': 'Powers'});

Now query the data. We see that all the data inserted so far is returned and the loss of the node has no impact on data integrity.

cqlsh> SELECT email, profile FROM users.profile;
 email                        | profile
------------------------------+---------------------------------------------------------------
      james.bond@yugabyte.com | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
 sherlock.holmes@yugabyte.com |               {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com |                 {'firstname': 'Austin', 'lastname': 'Powers'}

(3 rows)

5. Verify that Kubernetes brought back the failed node

We can now check the cluster status to verify that Kubernetes has indeed brought back the yb-tserver-2 node that failed earlier. This happens because the replica count currently in effect for the yb-tserver StatefulSet is 3, while only 2 pods were left running after the node failure.

$ kubectl get pods
NAME           READY     STATUS    RESTARTS   AGE
yb-master-0    1/1       Running   0          34m
yb-master-1    1/1       Running   0          34m
yb-master-2    1/1       Running   0          34m
yb-tserver-0   1/1       Running   1          34m
yb-tserver-1   1/1       Running   1          34m
yb-tserver-2   1/1       Running   0          7s
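
You can also confirm the desired replica count on the StatefulSet itself (assuming the tserver StatefulSet is named yb-tserver, as the pod names suggest):

$ kubectl get statefulset yb-tserver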

YugabyteDB's fault tolerance, combined with Kubernetes' automated operations, allows planet-scale applications to run with ease while maintaining extreme data resilience.

6. Clean up (optional)

Optionally, you can shut down the local cluster created in Step 1.

$ kubectl delete -f yugabyte-statefulset.yaml

Further, to destroy the persistent volume claims (you will lose all the data if you do this), run:

$ kubectl delete pvc -l app=yb-master
$ kubectl delete pvc -l app=yb-tserver
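
As a follow-up check, listing the persistent volume claims should show no remaining yb-master or yb-tserver entries once the deletion completes:

$ kubectl get pvc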