Designing secondary indexes YSQL

This page documents the preview version (v2.23). Preview includes features under active development and is for development and testing only. For production, use the stable version (v2024.1). To learn more, see Versioning.

The primary goal of an index is to enhance the performance of data retrieval operations on the data in the tables. Indexes are designed to quickly locate data without having to search every row in a database table and provide fast access for patterns other than that of the primary key of the table. In YugabyteDB, indexes are internally designed just like tables and operate as such. The main difference between a table and an index is that the primary key of the table has to be unique but it need not be unique for an index.

In YugabyteDB, indexes are global and are implemented just like tables. They are split into tablets and distributed across the different nodes in the cluster. The sharding of indexes is based on the primary key of the index and is independent of how the main table is sharded and distributed. Indexes are not colocated with the base table.

To illustrate secondary indexes, first create a sample census schema.

Set up a local cluster

If a local universe is currently running, first destroy it.

Start a local three-node universe with an RF of 3 by first creating a single node, as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east-2.us-east-2a

On macOS, the additional nodes need loopback addresses configured, as follows:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.

Start the second node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.2 \
                --base_dir=${HOME}/var/node2 \
                --cloud_location=aws.us-east-2.us-east-2b \
                --join=127.0.0.1

Start the third node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.3 \
                --base_dir=${HOME}/var/node3 \
                --cloud_location=aws.us-east-2.us-east-2c \
                --join=127.0.0.1

After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:

./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone

This command can be executed on any node where you already started YugabyteDB.

To check the status of a running multi-node universe, run the following command:

./bin/yugabyted status --base_dir=${HOME}/var/node1

Setup

To set up a universe, refer to Set up a YugabyteDB Anywhere universe.

Create a census table as follows:

CREATE TABLE census(
   id int,
   name varchar(255),
   age int,
   zipcode int,
   employed boolean,
   PRIMARY KEY(id ASC)
)
Add some data to the table as follows.
INSERT INTO public.census ( id,name,age,zipcode,employed ) VALUES
  (1,'Zachary',55,94085,True),    (2,'James',56,94085,False),    (3,'Kimberly',50,94084,False),
  (4,'Edward',56,94085,True),     (5,'Barry',56,94084,False),    (6,'Tyler',45,94084,False),
  (7,'Nancy',47,94085,False),     (8,'Sarah',52,94084,True),     (9,'Nancy',59,94084,False),
  (10,'Diane',51,94083,False),    (11,'Ashley',42,94083,False),  (12,'Jacqueline',58,94085,False),
  (13,'Benjamin',49,94084,False), (14,'James',48,94083,False),   (15,'Ann',43,94083,False),
  (16,'Aimee',47,94085,True),     (17,'Michael',49,94085,False), (18,'Rebecca',40,94085,False),
  (19,'Kevin',45,94085,True),     (20,'James',45,94084,False),   (21,'Sandra',60,94085,False),
  (22,'Kathleen',40,94085,True),  (23,'William',42,94084,False), (24,'James',42,94083,False),
  (25,'Tyler',50,94085,False),    (26,'James',49,94085,True),    (27,'Kathleen',55,94083,True),
  (28,'Zachary',55,94083,True),   (29,'Rebecca',41,94085,True),  (30,'Jacqueline',49,94085,False),
  (31,'Diane',48,94083,False),    (32,'Sarah',53,94085,True),    (33,'Rebecca',55,94083,True),
  (34,'William',47,94085,False),  (35,'William',60,94085,True),  (36,'Sarah',53,94085,False),
  (37,'Ashley',47,94084,True),    (38,'Ashley',54,94084,False),  (39,'Benjamin',42,94083,False),
  (40,'Tyler',47,94085,True),     (41,'Michael',42,94084,False), (42,'Diane',50,94084,False),
  (43,'Nancy',51,94085,False),    (44,'Rebecca',56,94085,False), (45,'Tyler',41,94085,True);

Basic index

Suppose you need to look up the data based on the zip codes of the people in the census. You can fetch details with a query similar to the following:

select id from census where zipcode=94085;

This required a sequential scan of all the rows in the table. This is because the primary key of the table is id, and looking up by zip code requires a full scan. To avoid the full scan, create an index on zipcode so that the executor can quickly fetch the matching rows by looking at the index.

create index idx_zip on census(zipcode ASC);

Now, for a query to get all the people in zip code 94085 as follows:

explain (analyze, dist, costs off) select id from census where zipcode=94085;

You will see an output like the following:

 Index Scan using idx_zip on public.census (actual time=3.273..3.295 rows=23 loops=1)
   Output: id
   Index Cond: (census.zipcode = 94085)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 1.401 ms
   Storage Table Rows Scanned: 23
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 1.529 ms
   Storage Index Rows Scanned: 23
...

The same 23 rows were fetched from the table, but much faster. This is because the planner uses the index to execute the query.

Covering index

In the prior example, to retrieve the rows the index was first looked up, and then more columns were fetched for the same rows from the table. This additional round trip to the table is needed because the columns are not present in the index. To avoid this, you can store the column along with the index as follows:

create index idx_zip2 on census(zipcode ASC) include(id);

Now, for a query to get all people in zip code 94085 as follows:

explain (analyze, dist, costs off) select id from census where zipcode=94085;

You will see an output like the following:

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Index Only Scan using idx_zip2 on census (actual time=1.930..1.942 rows=23 loops=1)
   Index Cond: (zipcode = 94085)
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 1.042 ms
   Storage Index Rows Scanned: 23
...

This is an index-only scan, which means that all the data required by the query has been fetched from the index. This is also why there was no entry for Table Read Requests.

When an index contains all the columns of the table, it is referred to as a Duplicate index. Duplicate indexes can be used in multi-region deployments to reduce read latencies.

See Duplicate indexes for more details.

Listing indexes

You can list the indexes associated with a table using the following methods.

\d+ meta command

The \d+ <table> meta command lists the indexes associated with a table along with the schema details.

\d+ census

The indexes are listed at the end of the output as follows:

  Column  |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
----------+------------------------+-----------+----------+---------+----------+--------------+-------------
 id       | integer                |           | not null |         | plain    |              |
 name     | character varying(255) |           |          |         | extended |              |
 age      | integer                |           |          |         | plain    |              |
 zipcode  | integer                |           |          |         | plain    |              |
 employed | boolean                |           |          |         | plain    |              |
Indexes:
    "census_pkey" PRIMARY KEY, lsm (id ASC)
    "idx_zip" lsm (zipcode ASC)

pg_indexes view

You can also fetch more information about indexes using the pg_indexes view.

SELECT * FROM pg_indexes WHERE tablename = 'census' ;

This gives an output similar to the following:

 schemaname | tablename |  indexname  | tablespace |                              indexdef
------------+-----------+-------------+------------+---------------------------------------------------------------------
 public     | census    | census_pkey | null       | CREATE UNIQUE INDEX census_pkey ON public.census USING lsm (id ASC)
 public     | census    | idx_zip     | null       | CREATE INDEX idx_zip ON public.census USING lsm (zipcode ASC)

Index usage

It's a good idea to keep track of how well indexes are used by your applications so that you can evaluate and improve your indexes, and drop indexes that are not used. To get the usage statistics of the indexes of a table, you can execute the following command:

SELECT * FROM pg_stat_user_indexes WHERE relname = 'census';

This should give an output similar to the following:

 relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+---------+--------------+----------+--------------+---------------
 17227 |      17230 | public     | census  | census_pkey  |        2 |           12 |             0
 17227 |      17237 | public     | census  | idx_zip      |        2 |           24 |             0

You can get an idea of how many times the index was scanned and how many tuples were read from the index using this statistic.

Conclusion

While primary keys are essential to ensure data uniqueness and facilitate efficient data distribution, secondary indexes provide the flexibility needed to optimize queries based on non-primary key columns. Using secondary indexes, applications can boost performance and provide a robust and scalable solution for managing large-scale, distributed datasets.

Learn more