Configurable data sharding

Understand the various data distribution schemes

YugabyteDB splits table data into smaller pieces called tablets. Sharding is the process in which rows of a table are distributed across the various tablets in the cluster. The mapping of a row to a tablet is deterministic and based on the primary key of the row. The deterministic nature of sharding enables fast access to a row for a given primary key. Sharding is one of the foundational techniques that enable the scalability of YugabyteDB.

YugabyteDB offers two schemes for data sharding, namely Hash and Range.

 Functionality                                        | Range sharding                  | Hash sharding
------------------------------------------------------+---------------------------------+-----------------------------
 Distribution of data based on                        | Actual value of the primary key | Hash of the primary key
 Ordering of data in a tablet                         | Actual value of the primary key | Hash of the primary key
 Location of contiguous keys (e.g., 1,2,3,4)          | Mostly on the same tablet       | Mostly on different tablets
 Performance of point lookups (e.g., k=100)           | High                            | High
 Performance of range lookups (e.g., k>150 and k<180) | High                            | Low
 Tablet splitting                                     | Manual and automatic            | Automatic
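
The sharding scheme is chosen per table in the primary-key clause. As a quick preview of the syntax explored below, the two tables used in the examples later in this section are defined (in compact form) as follows:

-- Range sharding: rows are ordered by the actual value of id
CREATE TABLE user_range (id int, name VARCHAR, PRIMARY KEY(id ASC));

-- Hash sharding: rows are distributed by the hash of id
CREATE TABLE user_hash (id int, name VARCHAR, PRIMARY KEY(id HASH));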

Cluster setup

To understand how these sharding schemes work, set up a local RF 3 cluster for testing.

Set up a local cluster

If a local universe is currently running, first destroy it.

Start a local three-node universe with an RF of 3 by first creating a single node, as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east-2.us-east-2a

On macOS, the additional nodes need loopback addresses configured, as follows:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Next, join additional nodes to the first node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.

Start the second node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.2 \
                --base_dir=${HOME}/var/node2 \
                --cloud_location=aws.us-east-2.us-east-2b \
                --join=127.0.0.1

Start the third node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.3 \
                --base_dir=${HOME}/var/node3 \
                --cloud_location=aws.us-east-2.us-east-2c \
                --join=127.0.0.1

After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:

./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone

This command can be executed on any node where you already started YugabyteDB.

To check the status of a running multi-node universe, run the following command:

./bin/yugabyted status --base_dir=${HOME}/var/node1

Range sharding

In range sharding, the data is split into contiguous ranges of the primary key (in the sort order defined for the table). Typically, a table starts as a single tablet, and as the data grows and reaches a size threshold, the tablet dynamically splits into two. Tables can also be pre-split during creation.

The primary advantage of range sharding is that it is fast for both range scans and point lookups. The downside of this scheme is that a table starts with only one tablet even though there could be multiple nodes in the cluster. The other concern is that if the data being inserted is already sorted (for example, monotonically increasing keys), all new inserts go to the same tablet and create a hotspot, even though there are multiple tablets in the cluster.
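
To mitigate the single-tablet start, a range-sharded table can be pre-split at creation by listing explicit split points. The following is a minimal sketch; the table name and split values are hypothetical and would normally be chosen to match the expected key distribution:

CREATE TABLE user_range_presplit (
    id int,
    name VARCHAR,
    PRIMARY KEY(id ASC)
) SPLIT AT VALUES ((100), (200), (300));  -- creates 4 tablets: id < 100, [100, 200), [200, 300), and id >= 300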

Behavior

Consider a basic table of users for which the user ID is the primary key.

CREATE TABLE user_range (
    id int,
    name VARCHAR,
    PRIMARY KEY(id ASC)
);

The word ASC has been explicitly added to the primary key definition to specify that this is a range-sharded table. This defines the ordering of the rows in the tablet. You can specify DESC if you want the rows to be stored in descending order.
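
For example, a hypothetical variant of the same table that stores rows in descending order of id would be declared as follows:

CREATE TABLE user_range_desc (
    id int,
    name VARCHAR,
    PRIMARY KEY(id DESC)  -- rows stored in descending order of id
);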

Add 6 rows to the table:

INSERT INTO user_range VALUES (1, 'John Wick'), (2, 'Harry Potter'), (3, 'Bruce Lee'),
                              (4, 'Indiana Jones'), (5, 'Vito Corleone'), (6, 'Jack Sparrow');

Fetch all the rows from the table:

SELECT * FROM user_range;

You will see an output similar to the following:

 id |     name
----+---------------
  1 | John Wick
  2 | Harry Potter
  3 | Bruce Lee
  4 | Indiana Jones
  5 | Vito Corleone
  6 | Jack Sparrow

Note that all the rows are ordered in ascending order of ID. This is because you specified id ASC in the primary key definition.

Even though you have set up a cluster with 3 nodes, only 1 tablet has been created by default. A range-sharded table is not pre-split into multiple tablets at creation, because the range of the primary key is not fixed (it is effectively unbounded) and the system cannot efficiently decide in advance which key ranges should be stored in which tablet. As a result, all the inserted rows are stored on this one tablet.
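
One way to confirm the tablet count is to query the table properties; this is a sketch that assumes the yb_table_properties function is available in your YugabyteDB version:

-- num_tablets is expected to be 1 for this range-sharded table
SELECT num_tablets FROM yb_table_properties('user_range'::regclass);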

To verify this, execute the same command with EXPLAIN ANALYZE as follows:

EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_range;

You will see an output like the following:

                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on user_range (actual time=0.865..0.867 rows=6 loops=1)
...
 Storage Read Requests: 1
 Storage Rows Scanned: 6
...

The Storage Rows Scanned: 6 metric shows that 6 rows were read, and rows=6 on the first line shows that 6 rows were returned.

Now pay attention to Storage Read Requests: 1. There was just one read request to the storage subsystem, because there is only one tablet and all the rows are stored in it.

Now, to see how point lookups perform in a range-sharded table, fetch the row for a specific ID:

EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_range WHERE id = 5;

You should see an output similar to the following:

                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using user_range_pkey on user_range (actual time=1.480..1.488 rows=1 loops=1)
   Index Cond: (id = 5)
...
 Storage Read Requests: 1
 Storage Rows Scanned: 1
...

This shows that only one row was scanned (via Storage Rows Scanned: 1) to retrieve one row (via rows=1), making point lookups very efficient in a range-sharded table.

To understand how range scans are efficient in a range-sharded table, fetch the rows with id > 2 and id < 6 as follows:

EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_range WHERE id > 2 AND id < 6;

You should see an output similar to the following:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using user_range_pkey on user_range (actual time=1.843..1.849 rows=3 loops=1)
   Index Cond: ((id > 2) AND (id < 6))
...
 Storage Rows Scanned: 3
...

Exactly 3 rows were scanned (via Storage Rows Scanned: 3) to retrieve 3 rows. This is because the data is ordered on the primary key and hence an index scan was done for the range query (via Index Scan using user_range_pkey).

Remember

Range sharding is the preferred sharding scheme in YugabyteDB because it performs well for both range queries and point lookups. In a range-sharded table, data is stored in the sort order of the primary key.

Hash sharding

In hash sharding, the data is distributed evenly (and effectively at random) across the tablets based on the hash of the primary key. Each tablet owns a range of hash values in the 2-byte hash space, 0x0000 to 0xFFFF. For instance, in a 3-tablet table, the hash space could be split as 0x0000-0x5554, 0x5555-0xAAA9, and 0xAAAA-0xFFFF, with each tablet responsible for one of the ranges.
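
When creating a hash-sharded table, you can also set the tablet count explicitly with a SPLIT INTO clause. A minimal sketch, using a hypothetical table name:

CREATE TABLE user_hash_presplit (
    id int,
    name VARCHAR,
    PRIMARY KEY(id HASH)
) SPLIT INTO 3 TABLETS;  -- the hash space 0x0000-0xFFFF is divided across 3 tablets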

In a hash-sharded table, even though the data is distributed well across all available tablets, within each tablet the data is stored in the sorted order of the hash of the primary key. This leads to fast point lookups but lower performance for range queries such as greater-than and less-than conditions (for example, age < 20).

Behavior

Consider a basic table of users for which the user ID is the primary key.

CREATE TABLE user_hash (
    id int,
    name VARCHAR,
    PRIMARY KEY(id HASH)
);

For clarity, HASH has been explicitly added to the primary key definition to specify that this is a hash-sharded table. Hash sharding is the default distribution scheme. You can change the default using the yb_use_hash_splitting_by_default configuration parameter.
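
For example, to make range sharding the default for tables created in the current session, you could turn the parameter off; this is a sketch, assuming a session-level SET fits your use case:

-- With the parameter off, a primary key declared without HASH or ASC/DESC
-- defaults to range sharding on its first column.
SET yb_use_hash_splitting_by_default = false;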

Add 6 rows to the table:

INSERT INTO user_hash VALUES (1, 'John Wick'), (2, 'Harry Potter'), (3, 'Bruce Lee'),
                             (4, 'Indiana Jones'), (5, 'Vito Corleone'), (6, 'Jack Sparrow');

Fetch all the rows from the table:

SELECT * FROM user_hash;

You will see an output like the following:

 id |     name
----+---------------
  5 | Vito Corleone
  1 | John Wick
  6 | Jack Sparrow
  4 | Indiana Jones
  2 | Harry Potter
  3 | Bruce Lee

Note that the rows are not ordered by the primary key id. This is because the rows are internally ordered by the hash of id, which can be computed using the yb_hash_code(id) function.
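
To see this ordering directly, you can select the hash code alongside each row, for example:

-- Shows the internal hash value that determines each row's tablet and ordering
SELECT id, name, yb_hash_code(id) FROM user_hash ORDER BY yb_hash_code(id);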

Because you set up a cluster with 3 nodes, 3 tablets were created by default. A hash-sharded table can easily be split into multiple tablets because the hash range is fixed (0x0000 to 0xFFFF, exactly 65536 values), so it is straightforward to divide this range across different tablets. The inserted rows have therefore been distributed across the 3 tablets.

To verify this, execute the same command with EXPLAIN ANALYZE as follows:

EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_hash;

You will see an output like the following:

                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on user_hash (actual time=0.526..2.054 rows=6 loops=1)
...
 Storage Read Requests: 3
 Storage Rows Scanned: 6
...

The Storage Rows Scanned: 6 metric shows that 6 rows were read, and rows=6 on the first line shows that 6 rows were returned. Now pay attention to Storage Read Requests: 3. There were 3 separate read requests to the storage subsystem, because a request was sent to each of the 3 tablets to retrieve the rows.

To see how efficient point lookups are in a hash-sharded table, execute the following:

EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_hash WHERE id = 5;

You will see an output similar to the following:

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Index Scan using user_hash_pkey on user_hash (actual time=1.479..1.487 rows=1 loops=1)
   Index Cond: (id = 5)
...
 Storage Read Requests: 1
 Storage Rows Scanned: 1
...

The primary key index, user_hash_pkey, was used to look up id = 5 and only one row was scanned to retrieve one row. This shows that point lookups are very efficient on a hash-sharded table.

To see the performance of a range scan on a hash-sharded table, fetch the rows with id > 2 and id < 6 as follows:

EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_hash WHERE id > 2 AND id < 6;

You should see an output similar to the following:

                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on user_hash (actual time=2.535..2.547 rows=3 loops=1)
   Remote Filter: ((id > 2) AND (id < 6))
...
 Storage Rows Scanned: 6
...

You will immediately notice that this was a sequential scan (via Seq Scan on user_hash) and all 6 rows in the table were scanned (via Storage Rows Scanned: 6) to retrieve 3 rows. If there were a million rows in the table, all of them would have been scanned. Hence, hash-sharded tables don't perform well for range queries.

Remember

Hash sharding is good for point lookups. In a hash-sharded table, data is evenly distributed across tablets and, within each tablet, stored in the order of the hash of the primary key.

Learn more