Bucket-based indexes TECH PREVIEW YSQL

Use bucket-based distribution of indexes to avoid hotspots

Traditional leading range (ASC/DESC) indexes with monotonic inserts concentrate all writes on the "most recent" tablet, creating hot shards and uneven resource usage. By prepending a bucket column (often a hash code modulo of a key column) to the ASC/DESC key, writes can be evenly distributed across multiple tablets (buckets), achieving write scalability and balanced resource usage.

The goal is to have a globally ordered result (for example, the latest 1000 rows by timestamp) while avoiding write hot spots on a monotonically increasing column. Bucket-based indexes solve the hot spot issue by distributing data across multiple tablets (buckets). The system can return a globally ordered result for range queries and LIMIT clauses without a sort operation, even though the data is physically sharded by the bucket. This makes queries like top-N and keyset pagination very efficient.

TP In addition, YugabyteDB (v2025.2.1.0 and later) includes scan optimizations for bucket-based indexes that can produce a globally ordered result without query changes, even when additional ranges are introduced into the index structure.

Normally, to get a globally ordered result for the most recent 1000 rows by timestamp (for example) from data that is sharded across multiple tablets, the database would have to:

Query each tablet.
Collect all of the results into one place.
Perform a final global sort to ensure the total result set is correctly ordered, even with a LIMIT clause.

This is resource-intensive and slow.

Using a bucket-based index with bucket-based scan optimizations enabled, the database can "push down" the LIMIT request to each of the individual tablets (buckets).

For example, for a timestamp column that is the second column in the index (and ordered ASC), for each bucket, we can quickly find its top 1000 locally ordered rows. The database only has to scan 1000 rows per bucket instead of scanning potentially millions of rows that match the larger range condition.

The index then performs an efficient merge of the small, pre-sorted result sets from each bucket to produce the final globally ordered 1000 rows without a final global sort.

In short, it achieves the necessary write scalability and global ordering simultaneously, making OLTP top-N queries exceptionally fast.

When to use it

Use bucket-based scans for the following workloads:

Timestamp-ordered inserts.
Sequence-based IDs.
"Latest N items" queries (feeds, time-series, audit tables).

Syntax

CREATE INDEX index_name ON table_name(yb_hash_code(<key_columns>) % <buckets>) ASC, column_name ASC) 
SPLIT AT VALUES ((1), (2));

For unique indexes, columns in yb_hash_code must be a subset of the remaining columns in the key. Non-unique indexes can have anything in yb_hash_code.
The bucket column or index expression column should generally be the first column of the key.
What you set the number of buckets to depends on how much you want to spread the previously-hot-shard load. The simplest recommendation is to make it equal to the number of nodes.
You should take care that the load is spread to other nodes. For example, with 3 nodes and 9 tablets, if there are 3 buckets, each bucket having 3 tablets, it could be the case that the hottest tablet of each bucket is on the same node.

To reduce the chance of this happening, increase number of buckets and/or keep tablet-to-bucket ratio low (or increase number of nodes).
You can't change the number of buckets after creation (not an issue for secondary indexes, which can be recreated).
The SPLIT clause is optional but recommended to cleanly distribute each bucket to its own tablet. For example, with 3 buckets, if tablets were dynamically split to ((0, 99999), (1, 99999)), then the scan for buckets 0 and 1 may query two tablets. If tablets were presplit to ((1), (2)), all 3 buckets would query one tablet.

Parameters

You configure bucket-based scan optimizations in the query planner using the following configuration parameters:

yb_enable_derived_equalities: Set to true.
yb_enable_derived_saops: Set to true.
yb_max_saop_merge_streams: Maximum number of buckets to process in parallel. The recommended value is 64.

In addition, the cost-based optimizer (CBO) must be enabled (CBO is enabled by default when you deploy your universe using yugabyted, YugabyteDB Anywhere, or YugabyteDB Aeon).

Setup

Follow the setup instructions to start a local multi-node universe with a replication factor of 3, and connect to universe using ysqlsh.

You must be running v2025.2.1.0 or later.

Configure bucket-based indexing

Enable features that preserve global ordering across the buckets by setting the configuration parameters:

SET yb_max_saop_merge_streams=64;
SET yb_enable_derived_saops=true;
SET yb_enable_derived_equalities=true;
ALTER DATABASE yugabyte SET yb_max_saop_merge_streams=64;
ALTER DATABASE yugabyte SET yb_enable_derived_saops=true;
ALTER DATABASE yugabyte SET yb_enable_derived_equalities=true;

Point lookups

Single-row lookups by exact key work with bucket-based indexes because the planner automatically adds the bucket predicate to the index condition. As a result, the query can target the correct tablet instead of scanning all buckets.

Create a table with a generated column:

CREATE TABLE foo (
  r1 int,
  r2 int,
  v1 int,
  v2 int,
  bucket_id int GENERATED ALWAYS AS (yb_hash_code(r1, r2) % 3) STORED,
  PRIMARY KEY (bucket_id ASC, r1, r2))
SPLIT AT VALUES ((1), (2));

Using the primary key (r1, r2), the bucket_id clause is automatically added:

EXPLAIN
    SELECT *
    FROM foo
    WHERE r1 = 1
    AND r2 = 1;

 Index Scan using foo_pkey on foo  (cost=20.00..21.10 rows=1 width=20)
   Index Cond: ((bucket_id = (yb_hash_code(1, 1) % 3)) AND (r1 = 1) AND (r2 = 1))

Create a secondary index:

CREATE INDEX foo_bucket_idx_v1_v2 ON foo (
  (yb_hash_code(v1, v2) % 3) ASC,
  v1,
  v2)
SPLIT AT VALUES ((1), (2));

Run an EXPLAIN on a point lookup query:

EXPLAIN SELECT * FROM foo WHERE v1 = 1 AND v2 = 1;

 Index Scan using foo_bucket_idx_v1_v2 on foo  (cost=40.02..68.42 rows=1 width=20)
   Index Cond: (((yb_hash_code(v1, v2) % 3) = (yb_hash_code(1, 1) % 3)) AND (v1 = 1) AND (v2 = 1))

The query planner automatically calculates the bucket ID for the search keys and adds the bucket predicate to the index condition. This allows the database to go directly to a single, specific tablet (bucket).

Range scans

The following example walks through using a bucket-based index to avoid write hot spots on a timestamp column. You create a table, define an index that distributes writes across three buckets (tablets), insert sample rows with timestamps over a range, and enable the planner settings for bucket-based merge. Then you run queries with ORDER BY and LIMIT and confirm in the plan that there is no Sort node—ordering comes from merging streams from each bucket.

Create a table with a monotonic column:

CREATE TABLE te (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Create an index that always writes to the 3 nodes by adding a bucket column and SPLIT AT:

CREATE INDEX yb_nothotspot
    ON te ((yb_hash_code(timestamp) % 3) ASC, timestamp) INCLUDE (id)
    SPLIT AT VALUES ((1), (2));

SPLIT AT ensures each bucket (0, 1, 2) maps to its own tablet, so writes are spread across three tablets.

As a reminder, an ordinary "hot-spot" index would use only the timestamp column (no bucket column and no SPLIT AT).

ON te (timestamp ASC) INCLUDE (id);

Insert rows with increasing timestamps:

INSERT INTO te (timestamp)
SELECT
    '2026-01-01 00:00:00+00'::timestamptz
    + (gs * (('2024-01-01 00:00:00+00'::timestamptz - '2026-01-01 00:00:00+00'::timestamptz) / 100000))
FROM generate_series(0, 99999) AS gs;

Observe planner changes that provide global ordering with no SQL or application changes:

ANALYZE;
EXPLAIN (ANALYZE, COSTS off)
    SELECT timestamp
    FROM te
    WHERE timestamp >= '2020-01-01'
        AND timestamp <  '2030-01-01'
    ORDER BY timestamp;

 Index Only Scan using yb_nothotspot on te (actual time=17.731..105.882 rows=100000 loops=1)
   Index Cond: (("timestamp" >= '2020-01-01 00:00:00-05'::timestamp with time zone) AND ("timestamp" < '2030-01-01 00:00:00-05'::timestamp with time zone) AND (((yb_hash_code("timestamp") % 3)) = ANY ('{0,1,2}'::integer[])))
   Merge Sort Key: "timestamp"
   Merge Stream Key: (yb_hash_code("timestamp") % 3)
   Merge Streams: 3
   Heap Fetches: 0
 Planning Time: 5.969 ms
 Execution Time: 123.060 ms
 Peak Memory Usage: 144 kB

When the optimization is enabled, EXPLAIN output includes the following additional properties:

Merge Sort Key: columns that the index uses for the merge.
Merge Stream Key: columns involved in forming buckets.
Merge Streams: number of streams that the index merges, generally the cardinality of the cross product of all merge stream key values.

Notice that no sort is present and that the yb_enable_derived_saops feature passed the "bucket" column into the index condition. The index has then merged the streams and returned sorted data, eliminating the need for a sort.

Add a LIMIT

You really see how powerful this feature is when you use the LIMIT clause.

First turn off the optimization:

SET yb_max_saop_merge_streams = 0;

Run the following query:

EXPLAIN (ANALYZE, COSTS off, TIMING on)
    SELECT timestamp
    FROM te
    WHERE timestamp >= '2020-01-01'
    AND timestamp <  '2030-01-01'
    ORDER BY timestamp
    LIMIT 1000;

 Limit (actual time=71.180..71.485 rows=1000 loops=1)
   ->  Sort (actual time=71.178..71.262 rows=1000 loops=1)
         Sort Key: "timestamp"
         Sort Method: top-N heapsort  Memory: 49kB
         ->  Index Only Scan using yb_nothotspot on te (actual time=2.655..52.299 rows=100000 loops=1)
               Index Cond: (("timestamp" >= '2020-01-01 00:00:00-05'::timestamp with time zone) AND ("timestamp" < '2030-01-01 00:00:00-05'::timestamp with time zone))
               Heap Fetches: 0
 Planning Time: 0.144 ms
 Execution Time: 71.710 ms
 Peak Memory Usage: 129 kB

Without the optimization, the planner performs a sort as expected.

Turn the optimization back on:

SET yb_max_saop_merge_streams = 64;

Run the query again:

EXPLAIN (ANALYZE, COSTS off, TIMING on)
    SELECT timestamp
    FROM te
    WHERE timestamp >= '2020-01-01'
    AND timestamp <  '2030-01-01'
    ORDER BY timestamp
    LIMIT 1000;

 Limit (actual time=2.907..3.923 rows=1000 loops=1)
   ->  Index Only Scan using yb_nothotspot on te (actual time=2.904..3.663 rows=1000 loops=1)
         Index Cond: (("timestamp" >= '2020-01-01 00:00:00-05'::timestamp with time zone) AND ("timestamp" < '2030-01-01 00:00:00-05'::timestamp with time zone) AND (((yb_hash_code("timestamp") % 3)) = ANY ('{0,1,2}'::integer[])))
         Merge Sort Key: "timestamp"
         Merge Stream Key: (yb_hash_code("timestamp") % 3)
         Merge Streams: 3
         Heap Fetches: 0
 Planning Time: 0.164 ms
 Execution Time: 4.184 ms
 Peak Memory Usage: 8 kB

The SQL is asking for 1000 globally ordered rows, and the index returns it in a fraction of the time, without the sort, while scanning 1000 rows per bucket.

Keyset pagination

Bucket-based scans also work well for more complex OLTP top-N queries, such as keyset pagination.

Create a more complicated index and an extra predicate:

ALTER TABLE te ADD COLUMN key_id integer NOT NULL DEFAULT 123;
ANALYZE;
CREATE INDEX scalable_key_timestamp ON te (
    (yb_hash_code(timestamp) % 3) ASC,
    key_id,
    timestamp ASC,
    id
) SPLIT AT VALUES ((1), (2));