Configurable data sharding
YugabyteDB splits table data into smaller pieces called tablets. Sharding is the process by which the rows of a table are distributed across the various tablets in the cluster. The mapping of a row to a tablet is deterministic and based on the primary key of the row. The deterministic nature of sharding enables fast access to a row for a given primary key. Sharding is one of the foundational techniques that enable the scalability of YugabyteDB.
If a table is created without a primary key, an internal row_id is generated to be used as the primary key. This row_id is not accessible by users.
Let's look at the two sharding schemes offered by YugabyteDB.
Cluster setup
To understand how these sharding schemes work, let's set up a local RF3 cluster for testing.
If a local universe is currently running, first destroy it.
Start a local three-node universe with an RF of 3 by first creating a single node, as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.1 \
--base_dir=${HOME}/var/node1 \
--cloud_location=aws.us-east-2.us-east-2a
On macOS, the additional nodes need loopback addresses configured, as follows:
sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3
Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.
Start the second node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.2 \
--base_dir=${HOME}/var/node2 \
--cloud_location=aws.us-east-2.us-east-2b \
--join=127.0.0.1
Start the third node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.3 \
--base_dir=${HOME}/var/node3 \
--cloud_location=aws.us-east-2.us-east-2c \
--join=127.0.0.1
After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:
./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone
This command can be executed on any node where you already started YugabyteDB.
To check the status of a running multi-node universe, run the following command:
./bin/yugabyted status --base_dir=${HOME}/var/node1
Hash sharding
In hash sharding, data is randomly and evenly distributed across the tablets. Each tablet owns a range of values in the 2-byte hash space, 0x0000 - 0xFFFF. For instance, in a 3-tablet table, the hash space could be split as 0x0000 - 0x5554, 0x5555 - 0xAAA9, and 0xAAAA - 0xFFFF, with each tablet being responsible for one of the ranges.
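The split arithmetic can be sketched in Python. This is an illustration only; in a real cluster, YugabyteDB chooses the tablet boundaries itself:

```python
def split_hash_space(n_tablets: int, space: int = 0x10000):
    """Evenly split the 16-bit hash space [0x0000, 0xFFFF] into contiguous ranges."""
    return [(i * space // n_tablets, (i + 1) * space // n_tablets - 1)
            for i in range(n_tablets)]

# A 3-tablet table yields the ranges described above.
for lo, hi in split_hash_space(3):
    print(f"0x{lo:04X} - 0x{hi:04X}")
# 0x0000 - 0x5554
# 0x5555 - 0xAAA9
# 0xAAAA - 0xFFFF
```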
In a hash-sharded table, even though the data is distributed well across all available tablets, within each tablet the data is stored in sorted order of the hash of the primary key. This leads to fast lookups but lower performance for range queries such as greater than and less than (for example, age < 20).
Behavior
Let's understand the behavior of hash sharding via an example. Consider a simple table of users for which the user-id is the primary key.
CREATE TABLE user_hash (
id int,
name VARCHAR,
PRIMARY KEY(id HASH)
);
HASH has been explicitly added to the primary key definition to specify that this is a hash-sharded table. Hash sharding is the default distribution scheme, so the keyword could be omitted; you can change the default using the yb_use_hash_splitting_by_default flag.
Now, let's add 6 rows to the table.
INSERT INTO user_hash VALUES (1, 'John Wick'), (2, 'Harry Potter'), (3, 'Bruce Lee'),
(4, 'India Jones'), (5, 'Vito Corleone'), (6, 'Jack Sparrow');
Now fetch all the rows from the table.
SELECT * FROM user_hash;
You will see output like the following:
id | name
----+---------------
5 | Vito Corleone
1 | John Wick
6 | Jack Sparrow
4 | India Jones
2 | Harry Potter
3 | Bruce Lee
Note that the rows are not ordered by the primary key id. This is because the rows are internally ordered on the hash of id, which can be retrieved using the function yb_hash_code(id).
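The effect can be mimicked in Python. The hash function below is a hypothetical stand-in, not YugabyteDB's actual yb_hash_code, but any deterministic 16-bit hash shows why a scan returns rows in hash order rather than id order:

```python
import hashlib

# Stand-in for yb_hash_code: a deterministic 16-bit hash of the key.
# (This is NOT YugabyteDB's real hash function; it only illustrates the idea.)
def hash16(key: int) -> int:
    return int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:2], "big")

users = [(1, "John Wick"), (2, "Harry Potter"), (3, "Bruce Lee"),
         (4, "India Jones"), (5, "Vito Corleone"), (6, "Jack Sparrow")]

# Within each tablet, rows are kept sorted by the hash of the primary key,
# so a full scan returns them in hash order, not id order.
for uid, name in sorted(users, key=lambda row: hash16(row[0])):
    print(uid, name)
```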
Because the cluster has 3 nodes, 3 tablets would have been created by default. A hash-sharded table can be split into multiple tablets easily because the hash range is fixed, [0x0000, 0xFFFF], exactly 65,536 values, so it is easy to divide this range across different tablets. The inserted rows would therefore have been distributed across the 3 tablets. Let's verify this.
Now execute the same query with EXPLAIN ANALYZE:
EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_hash;
You will see an output like:
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on user_hash (actual time=0.526..2.054 rows=6 loops=1)
Storage Table Read Requests: 3
Storage Table Read Execution Time: 1.943 ms
Storage Table Rows Scanned: 6
Planning Time: 0.042 ms
Execution Time: 2.085 ms
Storage Read Requests: 3
Storage Read Execution Time: 1.943 ms
Storage Rows Scanned: 6
Storage Write Requests: 0
Catalog Read Requests: 0
Catalog Write Requests: 0
Storage Flush Requests: 0
Storage Execution Time: 1.943 ms
Peak Memory Usage: 24 kB
The Storage Rows Scanned: 6 metric shows that 6 rows were read, and rows=6 on the first line shows that 6 rows were returned. Now pay attention to Storage Table Read Requests: 3. This says there were 3 different read requests to the storage subsystem, because 3 separate requests were sent to the 3 different tablets to retrieve the rows from each of them.
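One way to picture this: each tablet owns one hash range, so a point lookup can be routed to a single tablet, while a full scan has no hash code to route by and must ask every tablet. A minimal sketch, using the 3-tablet ranges from this example:

```python
from bisect import bisect_right

# Start of each tablet's hash range for the 3-tablet example table.
tablet_starts = [0x0000, 0x5555, 0xAAAA]

def tablet_for(hash_code: int) -> int:
    """Route a primary-key hash to the tablet whose range contains it."""
    return bisect_right(tablet_starts, hash_code) - 1

# A point lookup (WHERE id = ...) hashes the key and reads one tablet;
# a full scan must issue one read request per tablet: 3 tablets, 3 requests.
print(tablet_for(0x5554), tablet_for(0xABCD))  # 0 2
```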
Remember
Hash sharding is good for point lookups. In a hash-sharded table, data is evenly distributed and, within each tablet, ordered by the hash of the primary key.
Range sharding
In range sharding, no hashes are involved; the data is split into contiguous ranges of the primary key, in the sort order defined in the table. Typically, a table starts as a single tablet, and as the data grows and reaches a size threshold, the tablet dynamically splits into two. Tables can also be pre-split during creation.
The primary advantage of range sharding is that it is fast for both range scans and point lookups. The downside of this scheme is that it starts with only one tablet even though there could be multiple nodes in the cluster. The other concern is that if the data being inserted already arrives in sorted order, all new inserts go to only one tablet, even though there are multiple tablets in the cluster.
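This last concern can be sketched in Python. The split points below are hypothetical, standing in for the boundaries a range-sharded table acquires as it splits:

```python
from bisect import bisect_right

# Hypothetical split points after the table has grown into 3 tablets:
# (-inf, 100), [100, 200), [200, +inf)
split_points = [100, 200]

def route_key(pk: int) -> int:
    """Route a primary key to the tablet whose range contains it."""
    return bisect_right(split_points, pk)

# Scattered keys spread across tablets...
print({route_key(k) for k in (17, 142, 305)})   # {0, 1, 2}
# ...but monotonically increasing keys (a sequence, a timestamp) all land
# on the last tablet, making it a write hotspot.
print({route_key(k) for k in range(200, 210)})  # {2}
```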
Behavior
Let's understand the behavior of range sharding via an example. Consider a simple table of users for which the user-id is the primary key.
CREATE TABLE user_range (
id int,
name VARCHAR,
PRIMARY KEY(id ASC)
);
Now, let's add 6 rows to the table.
INSERT INTO user_range VALUES (1, 'John Wick'), (2, 'Harry Potter'), (3, 'Bruce Lee'),
(4, 'India Jones'), (5, 'Vito Corleone'), (6, 'Jack Sparrow');
Now fetch all the rows from the table.
SELECT * FROM user_range;
You will see output like the following:
id | name
----+---------------
1 | John Wick
2 | Harry Potter
3 | Bruce Lee
4 | India Jones
5 | Vito Corleone
6 | Jack Sparrow
Note that the rows are ordered in ascending order of id. This is because id ASC was specified in the primary key definition.
Even though the cluster has 3 nodes, only 1 tablet would have been created by default. A range-sharded table cannot be split into multiple tablets automatically because the range of the primary key is not fixed (it is effectively unbounded), so it is inefficient for the system to decide up front which range of values should be stored in which tablet. The inserted rows are therefore all stored in this one tablet. Let's verify this.
Now execute the same query with EXPLAIN ANALYZE:
EXPLAIN (ANALYZE, DIST, COSTS OFF) SELECT * FROM user_range;
You will see an output like:
QUERY PLAN
------------------------------------------------------------------
Seq Scan on user_range (actual time=0.865..0.867 rows=6 loops=1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 0.729 ms
Storage Table Rows Scanned: 6
Planning Time: 0.050 ms
Execution Time: 0.915 ms
Storage Read Requests: 1
Storage Read Execution Time: 0.729 ms
Storage Rows Scanned: 6
Storage Write Requests: 0
Catalog Read Requests: 0
Catalog Write Requests: 0
Storage Flush Requests: 0
Storage Execution Time: 0.729 ms
Peak Memory Usage: 24 kB
The Storage Rows Scanned: 6 metric shows that 6 rows were read, and rows=6 on the first line shows that 6 rows were returned.
Now pay attention to Storage Table Read Requests: 1. This says there was just 1 read request to the storage subsystem, because there is only one tablet and all the rows are stored in it.
Remember
Range sharding is good for both range queries and point lookups. In a range-sharded table, data is stored in the sort order of the primary key.