Designing optimal primary keys YSQL

The Primary key is a column or a set of columns that uniquely identifies a row, such as a user ID or order number. You should choose the primary key based on the most common access pattern. Columns of data type string, number, serial, or UUID make good choices for primary keys.

Automatically generating the primary key

The best way to uniquely identify a record is to allow the database to assign a unique identifier to the row. YugabyteDB supports multiple schemes for generating identifiers that you can choose based on the needs of your application.

UUID

A UUID is a 128-bit number represented as a string of 36 characters, including hyphens. For example, 4b6aa2ff-53e6-44f5-8bd0-ef9de90a8095. YugabyteDB natively supports UUID generation as per RFC 4122 via the uuid-ossp extension. UUIDs have several advantages:

The likelihood of generating duplicate UUIDs is extremely low.
UUIDs can be independently generated on different nodes in the cluster without any coordination with other systems.
The randomness of UUIDs makes it hard to predict the next ID, providing an additional layer of security.

You can add a UUID to your schema as follows:

CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    name TEXT
);

The DEFAULT clause ensures that for every row inserted, a UUID is automatically generated and inserted along with the row.

Serial

Serial is a special data type in YugabyteDB that creates an auto-incrementing integer column starting with 1. It is essentially a shorthand for creating a sequence and using it as a default value for a column. You can choose between three types of serial data types depending on the needs of your application:

SMALLSERIAL - An integer column in the range of 1 to 32,767.
SERIAL - An integer column in the range of 1 to 2,147,483,647.
BIGSERIAL - An integer column in the range of 1 to 9,223,372,036,854,775,807.

Serial can be used directly in table definitions to simplify the creation of auto-incrementing columns.

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    id serial,
    name TEXT,
    PRIMARY KEY(id)
);

For each row inserted into the table, an auto-incremented id value is automatically inserted along with the row.

Sequence

A sequence is a database object that generates a sequence of unique numbers. Sequences are independent objects that can be associated with one or more tables or columns. Sequences offer more flexibility and control over auto-incrementing behavior. They can be created, managed, and used separately from table definitions. Sequences can be customized with different increment values, start values, minimum and maximum values, and cycle behavior.

CREATE SEQUENCE user_id_seq START 100 INCREMENT BY 100 CACHE 10000;

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    id INTEGER DEFAULT nextval('user_id_seq'),
    name TEXT,
    PRIMARY KEY(id)
);

For every row inserted, user IDs are automatically generated as 100, 200,300, and so on.

Use serial for basic use cases and opt for sequences when you need more control over the sequence behavior, need to share a sequence between multiple tables or columns, or require custom incrementing logic.

Existing columns as primary keys

To illustrate how to choose existing columns as primary keys, first create a sample census schema.

Local
YugabyteDB Anywhere

Set up a local cluster

If a local universe is currently running, first destroy it.

Start a local three-node universe with an RF of 3 by first creating a single node, as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east-2.us-east-2a

On macOS, the additional nodes need loopback addresses configured, as follows:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.

Start the second node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.2 \
                --base_dir=${HOME}/var/node2 \
                --cloud_location=aws.us-east-2.us-east-2b \
                --join=127.0.0.1

Start the third node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.3 \
                --base_dir=${HOME}/var/node3 \
                --cloud_location=aws.us-east-2.us-east-2c \
                --join=127.0.0.1

After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:

./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone

This command can be executed on any node where you already started YugabyteDB.

To check the status of a running multi-node universe, run the following command:

./bin/yugabyted status --base_dir=${HOME}/var/node1

Setup

To set up a universe, refer to Create a multi-zone universe.

Create a census table as follows:

CREATE TABLE census(
   id int,
   name varchar(255),
   age int,
   zipcode int,
   employed boolean,
   PRIMARY KEY(id ASC)
)

Add some data to the table as follows.

INSERT INTO public.census ( id,name,age,zipcode,employed ) VALUES
  (1,'Zachary',55,94085,True),    (2,'James',56,94085,False),    (3,'Kimberly',50,94084,False),
  (4,'Edward',56,94085,True),     (5,'Barry',56,94084,False),    (6,'Tyler',45,94084,False),
  (7,'Nancy',47,94085,False),     (8,'Sarah',52,94084,True),     (9,'Nancy',59,94084,False),
  (10,'Diane',51,94083,False),    (11,'Ashley',42,94083,False),  (12,'Jacqueline',58,94085,False),
  (13,'Benjamin',49,94084,False), (14,'James',48,94083,False),   (15,'Ann',43,94083,False),
  (16,'Aimee',47,94085,True),     (17,'Michael',49,94085,False), (18,'Rebecca',40,94085,False),
  (19,'Kevin',45,94085,True),     (20,'James',45,94084,False),   (21,'Sandra',60,94085,False),
  (22,'Kathleen',40,94085,True),  (23,'William',42,94084,False), (24,'James',42,94083,False),
  (25,'Tyler',50,94085,False),    (26,'James',49,94085,True),    (27,'Kathleen',55,94083,True),
  (28,'Zachary',55,94083,True),   (29,'Rebecca',41,94085,True),  (30,'Jacqueline',49,94085,False),
  (31,'Diane',48,94083,False),    (32,'Sarah',53,94085,True),    (33,'Rebecca',55,94083,True),
  (34,'William',47,94085,False),  (35,'William',60,94085,True),  (36,'Sarah',53,94085,False),
  (37,'Ashley',47,94084,True),    (38,'Ashley',54,94084,False),  (39,'Benjamin',42,94083,False),
  (40,'Tyler',47,94085,True),     (41,'Michael',42,94084,False), (42,'Diane',50,94084,False),
  (43,'Nancy',51,94085,False),    (44,'Rebecca',56,94085,False), (45,'Tyler',41,94085,True);

ID as the primary key

In the census table, the most likely way to look up a person is by their id, so the primary key has been set to id ASC. This means that the data is stored in ascending order of ID, ensuring contiguous IDs are mostly located in the same tablet. This works well for point lookups on ID and range scans on IDs. For example, to look up ID 9, you can do the following:

select * from census where id=9;

You will see output similar to the following:

 id | name  | age | zipcode | employed
----+-------+-----+---------+----------
  9 | Nancy |  59 |   94084 | f

One row matching ID 9 was quickly fetched with just one request. You can also do a quick range scan.

select * from census where id>=5 and id<=15;

You will see an output similar to the following:

 id |    name    | age | zipcode | employed
----+------------+-----+---------+----------
  5 | Barry      |  56 |   94084 | f
  6 | Tyler      |  45 |   94084 | f
  7 | Nancy      |  47 |   94085 | f
  8 | Sarah      |  52 |   94084 | t
  9 | Nancy      |  59 |   94084 | f
 10 | Diane      |  51 |   94083 | f
 11 | Ashley     |  42 |   94083 | f
 12 | Jacqueline |  58 |   94085 | f
 13 | Benjamin   |  49 |   94084 | f
 14 | James      |  48 |   94083 | f
 15 | Ann        |  43 |   94083 | f
(11 rows)

11 rows were quickly retrieved as the data is stored sorted on the id column. So range scans are also fast.

Name as the primary key

Suppose your most common lookup is based on the name. In this case you would make the name column part of the primary key. Because the name alone may not be unique enough to be the primary key (the primary key has to be unique), you can choose a primary key with both name and ID as follows:

CREATE TABLE census2(
   id int,
   name varchar(255),
   age int,
   zipcode int,
   employed boolean,
   PRIMARY KEY(name ASC, id ASC)
);

-- copy the same data into census2
INSERT INTO census2 SELECT * FROM census;

When specifying the primary key, the name column is specified first, and id second. This ensures that the data is stored sorted based on name first, and for all matching names, the id is stored sorted in ascending order, ensuring all people with the same name will be stored in the same tablet. This allows you to do a fast lookup on name even though (name, id) is the primary key.

Retrieve all the people with the name James as follows:

select * from census2 where name = 'James';

You will see output similar to the following:

 id | name  | age | zipcode | employed
----+-------+-----+---------+----------
  2 | James |  56 |   94085 | f
 14 | James |  48 |   94083 | f
 20 | James |  45 |   94084 | f
 24 | James |  42 |   94083 | f
 26 | James |  49 |   94085 | t
(5 rows)

There are 5 people named James, and all of them can be quickly looked up as the data has been sorted on name.

Ordering

The primary key was specified with ASC order. However, if the queries are going to retrieve data in descending order with ORDER BY name DESC, then it is better to match the same ordering in the primary key definition.