CREATE INDEX

Synopsis

Use the CREATE INDEX statement to create a new index on a table. It defines the index name, index columns, and additional columns to include.

Syntax

Diagram

create_index

CREATEUNIQUEDEFERREDINDEXIFNOTEXISTSindex_nameONtable_name(partition_key_columns,clustering_key_columns)covering_columnsindex_propertiesWHEREindex_predicate

partition_key_columns

index_column(,index_column)

clustering_key_columns

,index_column

index_properties

WITHANDproperty_name=property_literalCLUSTERINGORDERBY(,index_columnASCDESC)

index_column

column_namejsonb_attribute

jsonb_attribute

column_name->'attribute_name'->>'attribute_name'

covering_columns

COVERINGINCLUDE(,column_name)

index_predicate

where_expression

Grammar

create_index ::= CREATE [ UNIQUE ] [ DEFERRED ] INDEX 
                 [ IF NOT EXISTS ] index_name ON  table_name ( 
                 partition_key_columns , [ clustering_key_columns ] )  
                 [ covering_columns ] [ index_properties ] 
                 [ WHERE index_predicate ]

partition_key_columns ::= index_column | ( index_column [ , ... ] )

clustering_key_columns ::= index_column [ , ... ]

index_properties ::= WITH 
                     { property_name = property_literal
                       | CLUSTERING ORDER BY ( 
                         { index_column [ ASC | DESC ] } [ , ... ] ) } 
                     [ AND ... ]

index_column ::= column_name | jsonb_attribute

jsonb_attribute ::= column_name [ -> 'attribute_name' [ ... ] ] ->> 'attribute_name'

covering_columns ::= { COVERING | INCLUDE } ( column_name [ , ... ] )

index_predicate ::= where_expression

Where

  • index_name, table_name, property_name, and column_name are identifiers.
  • table_name may be qualified with a keyspace name.
  • index_name cannot be qualified with a keyspace name because an index must be created in the table's keyspace.
  • property_literal is a literal of either boolean, text, or map data type.
  • index_column can be any data type except MAP, SET, LIST, JSONB, USER_DEFINED_TYPE.

Semantics

  • An error is raised if transactions have not be enabled using the WITH transactions = { 'enabled' : true } clause on the table to be indexed. This is because secondary indexes internally use distributed transactions to ensure ACID guarantees in the updates to the secondary index and the associated primary key. More details here.
  • An error is raised if index_name already exists in the associated keyspace unless the IF NOT EXISTS option is used.

Note

When an index is created on an existing table, YugabyteDB will automatically backfill existing data into the index in an online manner (that is, while continuing to serve other concurrent writes and traffic). For more details on how this is done, see Online Index Backfill.

User enforced consistency

Indexes require transactions to have been enabled on the table. For cases where the table was created without enabling transactions, consistency_level has to be set to user_enforced like,

CREATE TABLE orders (id int PRIMARY KEY, warehouse int);
CREATE INDEX ON orders (warehouse)
      WITH transactions = { 'enabled' : false, 'consistency_level' : 'user_enforced' };

When using an index without transactions enabled, it is the responsibility of the application to retry any insert/update/delete failures to make sure that the table and index are in sync.

PARTITION KEY

  • Partition key is required and defines a split of the index into partitions.

CLUSTERING KEY

  • Clustering key is optional and defines an ordering for index rows within a partition.
  • Default ordering is ascending (ASC) but can be set for each clustering column as ascending or descending using the CLUSTERING ORDER BY property.
  • Any primary key column of the table not indexed explicitly in index_columns is added as a clustering column to the index implicitly. This is necessary so that the whole primary key of the table is indexed.

index_properties

  • The CLUSTERING ORDER BY property can be used to set the ordering for each clustering column individually (default is ASC).
  • The TABLETS = <num> property specifies the number of tablets to be used for the specified YCQL index. Setting this property overrides the value from the --yb_num_shards_per_tserver option. For an example, see Create an index specifying the number of tablets.
  • Use the AND operator to use multiple index properties.
  • When setting a TTL on the index using default_time_to_live, please ensure that the TTL value is the same as that of the table's TTL. If they are different, it would lead to the index and the table being out of sync and would lead to unexpected behavior.

Caveat : Row level TTL cannot be set on a table with a secondary indexes during INSERTS/UPDATES. #10992

INCLUDED COLUMNS

  • Included columns are optional table columns whose values are copied into the index in addition to storing them in the table. When additional columns are included in the index, they can be used to respond to queries directly from the index without querying the table.

  • The following can't be added to an index's included columns: static columns of a table, expressions, and table columns with the following types: frozen, map, set, list, tuple, jsonb, and user defined.

UNIQUE INDEX

  • A unique index disallows duplicate values from being inserted into the indexed columns. It can be used to ensure uniqueness of index column values.

DEFERRED INDEX

Currently, an "index backfill" job is launched for each index that is created. For the case where you create a table and add multiple indexes, the main table needs to be scanned multiple times to populate each index. This is unnecessary, and can also cause issues with the single touch and multi touch block cache algorithm.

After creating a set of indexes with their backfill deferred, you can then trigger a backfill job for the entire batch of indexes (on the same table) in one of the following ways:

  • Create a new index that is not deferred:

    CREATE DEFERRED INDEX idx_1 on table_name(col_1);        // No backfill launched.
    CREATE DEFERRED INDEX idx_2 on table_name(col_2);        // No backfill launched.
    CREATE DEFERRED INDEX idx_9 on table_name(col_9);        // No backfill launched.
    
    
    // To launch backfill ...
    CREATE INDEX idx_10 on table_name(col_10);   // Will launch backfill for idx_10 and             
                                                        // all deferred indexes idx_1 .. idx_9 
                                                        // on the same table viz: table_name.
    
  • Use yb-admin to launch backfill for deferred indexes on the table.

    CREATE DEFERRED INDEX idx_1 on table_name(col_1);        // No backfill launched.
    CREATE DEFERRED INDEX idx_2 on table_name(col_2);        // No backfill launched.
        ...
    CREATE DEFERRED INDEX idx_9 on table_name(col_9);        // No backfill launched.
    CREATE DEFERRED INDEX idx_10 on table_name(col_10);      // No backfill launched.
    

    Launch a backfill job for backfilling all the deferred indexes using the backfill_indexes_for_table command as follows:

    bin/yb-admin -master_addresses <ip:port> backfill_indexes_for_table ycql.ybdemo table_name
    
  • Use the --defer_index_backfill YB-Master flag to force all indexes to be DEFERRED, and run yb-admin backfill_indexes_for_table to backfill indexes.

PARTIAL INDEX

  • If a WHERE clause is specified, only rows which satisfy the index_predicate are indexed.

  • An index_predicate can have sub-expressions on columns of these data types: TINYINT, SMALLINT, INT/INTEGER, BIGINT, VARINT, BOOLEAN and TEXT along with these operators (when applicable): =, !=, >, <, >=, <=.

  • Partial indexes can be UNIQUE. A UNIQUE partial index enforces the constraint that for each possible tuple of indexed columns, only one row that satisfies the index_predicate is allowed in the table.

  • SELECT queries can use a partial index for scanning if the SELECT statement's where_expression => (logically implies) index_predicate.

    Note

    • A partial index might not be chosen even if the implication holds in case there are better query plans.

    • The logical implication holds if all sub-expressions of the index_predicate are present as is in the where_expression. For example, assume where_expression = A AND B AND C, index_predicate_1 = A AND B, index_predicate_2 = A AND B AND D, index_predicate_3 = A AND B AND C AND D. Then where_expression only implies index_predicate_1

    • Currently, valid mathematical implications are not taken into account when checking for logical implication. For example, even if where_expression = x > 5 and index_predicate = x > 4, the SELECT query will not use the index for scanning. This is because the two sub-expressions x > 5 and x > 4 differ.

  • When using a prepared statement, the logical implication check (to decide if a partial index is usable), will only consider those sub-expressions of where_expression that don't have a bind variable. This is because the query plan is decided before execution (i.e., when a statement is prepared).

ycqlsh:example> CREATE TABLE orders (customer_id INT,
                                    order_date TIMESTAMP,
                                    product JSONB,
                                    warehouse_id INT,
                                    amount DOUBLE,
                                    PRIMARY KEY ((customer_id), order_date))
                WITH transactions = { 'enabled' : true };

ycqlsh:example> CREATE INDEX idx ON orders (warehouse_id)
                WHERE warehouse_id < 100;

ycqlsh:example> EXPLAIN SELECT product FROM orders
                WHERE warehouse_id < 100 AND order_date >= ?; // Idx can be used
 QUERY PLAN
------------------------------------------
 Index Scan using temp.idx on temp.orders
   Filter: (order_date >= :order_date)
ycqlsh:example> EXPLAIN SELECT product FROM orders
                WHERE warehouse_id < ? and order_date >= ?; // Idx cannot be used
 QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on temp.orders
   Filter: (warehouse_id < :warehouse_id) AND (order_date >= :order_date)
  • Without partial indexes, we do not allow many combinations of operators together on the same column in a SELECT's where expression e.g.: WHERE v1 != NULL and v1 = 5. But if there was a partial index that subsumes some clauses of the SELECT's where expression, two or more operators otherwise not supported together, might be supported.
ycqlsh:example> EXPLAIN SELECT product FROM orders
                WHERE warehouse_id != NULL AND warehouse_id = ?;
SyntaxException: Invalid CQL Statement. Illogical condition for where clause
EXPLAIN SELECT product from orders where warehouse_id != NULL and warehouse_id = ?;
                                                                  ^^^^^^^^^^^^
 (ql error -12)
ycqlsh:example> CREATE INDEX warehouse_idx ON orders (warehouse_id)
                WHERE warehouse_id != NULL;
ycqlsh:example> EXPLAIN SELECT product FROM orders
                WHERE warehouse_id != NULL AND warehouse_id = ?; // warehouse_idx can be used
 QUERY PLAN
----------------------------------------------------
 Index Scan using temp.warehouse_idx on temp.orders
   Key Conditions: (warehouse_id = :warehouse_id)

Examples

Create a table to be indexed

'customer_id' is the partitioning column and 'order_date' is the clustering column.

ycqlsh:example> CREATE TABLE orders (customer_id INT,
                                    order_date TIMESTAMP,
                                    product JSONB,
                                    warehouse_id INT,
                                    amount DOUBLE,
                                    PRIMARY KEY ((customer_id), order_date))
                WITH transactions = { 'enabled' : true };

Create an index for query by the order_date column

ycqlsh:example> CREATE INDEX orders_by_date ON orders (order_date) INCLUDE (amount);

Create an index for query by the JSONB attribute product->>'name'

ycqlsh:example> CREATE INDEX product_name
                ON orders (product->>'name') INCLUDE (amount);

Create an index for query by the warehouse_id column

ycqlsh:example> CREATE INDEX orders_by_warehouse
                ON orders (warehouse_id, order_date) INCLUDE (amount);

Insert some data

ycqlsh:example> INSERT INTO orders (customer_id, order_date, product, warehouse_id, amount)
                VALUES (1001, '2018-01-10', '{ "name":"desk" }', 107, 100.30);
ycqlsh:example> INSERT INTO orders (customer_id, order_date, product, warehouse_id, amount)
                VALUES (1002, '2018-01-11', '{ "name":"chair" }', 102, 50.45);
ycqlsh:example> INSERT INTO orders (customer_id, order_date, product, warehouse_id, amount)
                VALUES (1001, '2018-04-09', '{ "name":"pen" }', 102, 20.25);
ycqlsh:example> INSERT INTO orders (customer_id, order_date, product, warehouse_id, amount)
                VALUES (1003, '2018-04-09', '{ "name":"pencil" }', 108, 200.80);

Query by the partition column customer_id in the table

ycqlsh:example> SELECT SUM(amount) FROM orders
                WHERE customer_id = 1001 AND order_date >= '2018-01-01';
  sum(amount)
-------------
      120.55

Query by the partition column order_date in the index orders_by_date

ycqlsh:example> SELECT SUM(amount) FROM orders
                WHERE order_date = '2018-04-09';
 sum(amount)
-------------
      221.05

Query by the partition column product->>'name' in the index product_name

ycqlsh:example> SELECT SUM(amount) FROM orders
                WHERE product->>'name' = 'desk';
 sum(amount)
-------------
      100.30

Query by the partition column warehouse_id column in the index orders_by_warehouse

ycqlsh:example> SELECT SUM(amount) FROM orders
                WHERE warehouse_id = 102 AND order_date >= '2018-01-01';
 sum(amount)
-------------
        70.7

Create a table with a unique index

You can do this as follows:

ycqlsh:example> CREATE TABLE emp (enum INT primary key,
                                 lastname VARCHAR,
                                 firstname VARCHAR,
                                 userid VARCHAR)
               WITH transactions = { 'enabled' : true };
ycqlsh:example> CREATE UNIQUE INDEX emp_by_userid ON emp (userid);

Insert values into the table and verify no duplicate userid is inserted

ycqlsh:example> INSERT INTO emp (enum, lastname, firstname, userid)
                VALUES (1001, 'Smith', 'John', 'jsmith');
ycqlsh:example> INSERT INTO emp (enum, lastname, firstname, userid)
                VALUES (1002, 'Smith', 'Jason', 'jsmith');
InvalidRequest: Error from server: code=2200 [Invalid query] message="SQL error: Execution Error. Duplicate value disallowed by unique index emp_by_userid
INSERT INTO emp (enum, lastname, firstname, userid)
       ^^^^
VALUES (1002, 'Smith', 'Jason', 'jsmith');
 (error -300)"
ycqlsh:example> INSERT INTO emp (enum, lastname, firstname, userid)
                VALUES (1002, 'Smith', 'Jason', 'jasmith');
ycqlsh:example> SELECT * FROM emp;
 enum | lastname | firstname | userid
------+----------+-----------+---------
 1002 |    Smith |     Jason | jasmith
 1001 |    Smith |      John |  jsmith

Create an index specifying the number of tablets

You can use the CREATE INDEX statement with the WITH tablets = <num> clause to specify the number of tablets for an index. This is useful to scale the index up or down based on requirements. For example, for smaller or partial indexes, it may be wasteful to have a large number of shards (tablets). In that case, you can use this to reduce the number of tablets created for the index. Similarly, for a very large index, you can use this statement to presplit the index into a large number of shards to get improved performance.

Note that YugabyteDB, by default, presplits an index in yb_num_shards_per_tserver * num_of_tserver shards. This clause can be used to override that setting on per-index basis.

ycqlsh:example> CREATE TABLE tracking (id int PRIMARY KEY, a TEXT) WITH transactions = { 'enabled' : true };
ycqlsh:example> CREATE INDEX my_indx ON tracking(a) WITH tablets = 10;

See also