Best practices

Core features

Global secondary indexes

Indexes use multi-shard transactional capability of YugabyteDB and are global and strongly consistent (ACID). To add secondary indexes, you need to create tables with transactions enabled. They can also be used as materialized views by using the INCLUDE clause.

Unique indexes

YCQL supports unique indexes. A unique index disallows duplicate values from being inserted into the indexed columns.

Covered indexes

When querying by a secondary index, the original table is consulted to get the columns that aren't specified in the index. This can result in multiple random reads across the main table.

Sometimes, a better way is to include the other columns that you're querying that are not part of the index using the INCLUDE clause. When additional columns are included in the index, they can be used to respond to queries directly from the index without querying the table.

This turns a (possible) random read from the main table to just a filter on the index.

Atomic read modify write operations with UPDATE IF EXISTS

For operations like UPDATE ... IF EXISTS and INSERT ... IF NOT EXISTS that require an atomic read-modify-write, Apache Cassandra uses LWT which requires 4 round-trips between peers. These operations are supported in YugabyteDB a lot more efficiently, because of YugabyteDB's CP (in the CAP theorem) design based on strong consistency, and require only a single Raft-round trip between peers. Number and counter types work the same and don't need a separate "counters" table.

JSONB document data type

YugabyteDB supports the jsonb data type that makes it easy to model JSON data, which does not have a set schema and might change often. You can use JSONB to group less interesting and less accessed columns of a table. YCQL also supports JSONB expression indexes that can be used to speed up data retrieval that would otherwise require scanning the JSON entries.

Use JSONB columns only when necessary

jsonb columns are slower to read and write compared to normal columns. They also take more space because they need to store keys in strings and make keeping data consistency more difficult. A good schema design is to keep most columns as regular columns or collections, and use jsonb only for truly dynamic values. Don't create a data jsonb column where you store everything; instead, use a dynamic_data jsonb column with the others being primitive columns.

Incrementing numeric types

In YugabyteDB, YCQL extends Apache Cassandra to add increment and decrement operators for integer data types. Integers can be set, inserted, incremented, and decremented while COUNTER can only be incremented or decremented. YugabyteDB implements CAS(compare and swap) operations in one round trip, compared to four for Apache Cassandra.

Expire older records automatically with TTL

YCQL supports automatic expiration of data using the TTL feature. You can set a retention policy for data at table/row/column level and the older data is automatically purged from the database.

If configuring TTL for a time series dataset (or any dataset with a table-level TTL), it is recommended for CPU and space efficiency to expire older files directly by using TTL-specific configuration options. More details can be found in Efficient data expiration for TTL.

Note

TTL is not applicable to transactional tables and hence is not supported in that context.

Performance

Use YugabyteDB drivers

Use YugabyteDB-specific client drivers because they are cluster and partition aware and support jsonb columns.

Leverage connection pooling in the YCQL client

A single client (for example, a multi-threaded application) should ideally use a single cluster object. The single cluster object typically holds underneath the covers a configurable number of connections to yb-tservers. Typically 1 or 2 per TServer suffices to serve even 64-128 application threads. The same connection can be used for multiple outstanding requests, also known as multiplexing.

Use prepared statements

Use prepared statements whenever possible. This will ensure that YB partition aware drivers are able to route queries to the tablet leader, improve throughput and server doesn't have to parse the query on each operation.

Use batching for higher throughput

Use batching for writing a set of operations. This will send all operations in a single RPC call instead of using multiple RPC calls, one per operation. Each batch operation has higher latency compared to single rows operations but has higher throughput overall.

Column and row sizes

For consistent latency/performance, keep columns in the 2 MB range or less.

Big columns add up when selecting multiple columns or full rows. For consistent latency and performance, keep the size of individual rows in the 32 MB range or less.

Don't use big collections

Collections are designed for storing small sets of values that are not expected to grow to arbitrary size (such as phone numbers or addresses for a user rather than posts or messages). While collections of larger sizes are allowed, they may have a significant impact on performance for queries involving them. In particular, some list operations (insert at an index and remove elements) require a read-before-write.

Collections with many elements

Each element inside a collection ends up as a separate key value in DocDB adding per-element overhead.

If your collections are immutable, or you update the whole collection in full, consider using the JSONB data type. An alternative would also be to use ProtoBuf or FlatBuffers and store the serialized data in a BLOB column.

Use partition_hash for large table scans

partition_hash function can be handy for querying a subset of the data to get approximate row counts or to break down full-table operations into smaller sub-tasks that can be run in parallel. See example usage along with a working Python script.

Miscellaneous

Use TRUNCATE to empty tables instead of DELETE

TRUNCATE deletes the database files that store the table and is very fast. While DELETE inserts a delete marker for each row in transactions and they are removed from storage when a compaction runs.