Names and identifiers
The terms of art name and identifier are often used interchangeably. But they have different meanings. A simple example makes the point. See the section Unquoted and quoted identifiers in action. Nevertheless, no matter in which order you read this page's two sections (Artifacts, their names, and identifiers for those names and Unquoted and quoted identifiers in action), you should make sure that you read both of them.
You should also read the section Lexical Structure in the PostgreSQL documentation.
Artifacts, their names, and identifiers for those names
The terms of art 'artifact' and 'object'.
The term of art artifact is used to mean anything that has a name and that might be listed in one of the catalog tables or views. Examples are roles, tablespaces, databases, schemas, tables, views, indexes, columns in tables, constraints, and so on. The term is needed because not all artifacts have an owner. The owning principal can only be a role—but roles don't have owners. Every artifact that isn't a role does have an explicit or implicit owner—and such an owned artifact is denoted by the term of art object.
- Roles, databases, and tablespaces exist at the level of the cluster as a whole. But every object of every other kind each exists within a particular database.
- Schemas exist at top-level within a database.
- Many other kinds of object, like tables and views, must each exist at top-level within a particular schema. Such objects are therefore denoted by the term of art schema object.
- In contrast, databases, tablespaces, and schemas are denoted by the term of art non-schema object.
- Some objects, like columns in tables or triggers, can exist only within the context of a schema object. These are denoted by the term of art secondary object. It isn't useful to consider the ownership of a secondary object—but you might like to say that a secondary object is owned, transitively, by the role that owns the schema object within whose context it exists.
Top-level SQL statements have items that are defined only within the statement text like common_table_expression and alias. Such items also have names; and here, too, the name is denoted by an identifier. Items in the text of a plpgsql_block_stmt, like formal_arg, label, variable, also have names; and here, too, the name is denoted by an identifier.
The catalog tables list the names of artifacts; SQL statements and PL/pgSQL source text are spelled using identifiers to denote the artifacts to which the statement refers.
-
If the name of an artifact satisfies certain rules (for example, the letters that it contains must be lower case ASCII 7 letters), then the identifier that denotes it is simply written as the name is written—except that the case with which the identifier is written is insignificant. The term of art unquoted identifier is used for such an identifier; and the term of art common name is used for the name that the unquoted identifier denotes.
-
If the name does not satisfy the rules that allow it to be a common name, then it is known by the term of art exotic name. The identifier for an exotic name must be written exactly as the name is written and then must be tightly enclosed with double quotes. The double quote characters are, by definition, part of the identifier. The term of art quoted identifier is used for such an identifier.
-
The name that is denoted by the identifier for the to-be-created artifact can be arbitrarily long in the create DDL statement for the artifact. (The same holds for other related DDLs like drop and alter for the artifact.)
-
But if the length of the name exceeds sixty-three characters, then it will be truncated to sixty-three characters with (according to the current setting for client_min_messages) a 42622 notice like this:
identifier "a123456789a123456789a123456789a123456789a123456789a123456789a123456789" will be truncated to "a123456789a123456789a123456789a123456789a123456789a123456789a12"
(Whitespace was added manually to improve the readability.) Try this demonstration:
\c yugabyte yugabyte set client_min_messages = error; do $body$ declare ten_ascii7_chars constant text not null := 'a123456789'; seventy_ascii7_chars constant text not null := ten_ascii7_chars|| ten_ascii7_chars|| ten_ascii7_chars|| ten_ascii7_chars|| ten_ascii7_chars|| ten_ascii7_chars|| ten_ascii7_chars; sixty_three_ascii7_chars constant text not null := substr(seventy_ascii7_chars, 1, 63); drop_role constant text not null := 'drop role if exists %I'; cr_role constant text not null := 'create role %I'; qry constant text not null := $$ select rolname from pg_roles where rolname ~'^%s'; $$; role_name text not null := ''; begin execute format(drop_role, seventy_ascii7_chars); execute format(cr_role, seventy_ascii7_chars); execute format(qry, ten_ascii7_chars) into strict role_name; assert length(sixty_three_ascii7_chars) = 63; assert role_name = sixty_three_ascii7_chars; execute format(drop_role, sixty_three_ascii7_chars); end; $body$;
It finishes silently. To see the truncation notice, change the set statement at the start to this:
set client_min_messages = notice;
A common name must satisfy the following criteria in order to have that status.
- (reprise) It must have at most sixty-three characters. (This holds for an exotic name, too.)
- (reprise) Every letter that it contains must be a lower case ASCII 7 letter (i.e. in a through z).
- It must start with a letter or an underscore.
- Each of its remaining characters must be a letter, a digit (i.e. in 0 though 9) or the punctuation character underscore.
The built-in function quote_ident() will tightly enclose its input with double quotes if the string does not qualify for common name status. Otherwise, it will return the input as is. Try this:
select
quote_ident('_123') as "Test 1",
quote_ident('1dog') as "Test 2",
quote_ident('dog$house') as "Test 3";
This is the result:
Test 1 | Test 2 | Test 3
--------+--------+-------------
_123 | "1dog" | "dog$house"
You can confirm this outcome with this test:
drop schema if exists s cascade;
create schema s;
create table s._123(k int primary key);
Notice that if you try create table s.1dog ... instead, it causes the 42601 syntax error—just as the quoted return from quote_ident() tells you to expect.
Surprisingly, then, this test:
create table s.dog$house(k int primary key);
also completes without error—in spite of the fact that quote_ident() tells you to expect that this would, without quoting the name, cause a syntax error. (Here, too, you can go on to use the table.) The explanation is given in the PostgreSQL documentation in the subsection Identifiers and Key Words within the Lexical Structure section. Look for this:
SQL identifiers and key words must begin with a letter (a-z, but also ...) or an underscore. Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable. The SQL standard will not define a key word that contains digits or starts or ends with an underscore, so identifiers of this form are safe against possible conflict with future extensions of the standard.
You can surmise from this that the quote_ident() function aims to implement the stricter rules of the SQL standard. Here is the text that was elided, above, following "must begin with a letter (a-z, but also"
[but also] letters with diacritical marks and non-Latin letters
Try this:
select
quote_ident('høyde') as "Norwegian",
quote_ident('école') as "French",
quote_ident('правда') as "Russian",
quote_ident('速度') as "Chinese";
This is the result:
Norwegian | French | Russian | Chinese
-----------+---------+----------+---------
"høyde" | "école" | "правда" | "速度"
This implies that the identifier that denotes each of those names must be double quoted. But try this test:
create table s.høyde (k int primary key);
create table s.hØyde (k int primary key);
create table s.école (k int primary key);
create table s.École (k int primary key);
create table s.правда (k int primary key);
create table s.ПРАВДА (k int primary key);
create table s.速度 (k int primary key);
Each create table completes without error. This outcome might surprise you. In an unquoted identifier:
- It's only the Latin letters in a through z that are taken as is—while the letters in A through Z are taken to mean their lower-case equivalents.
- In contrast, the upper and lower case forms of Latin characters with diacritical marks, like ø and Ø, é and É are all taken to be different.
- The upper and lower case forms of all non-Latin characters, in languages like Russian, like
р
andР
,а
andА
,д
andД
are taken to be different—without considering diacritical marks. - Pictograms, in languages like Chinese, like
速
and度
simply have no concept of case. - It gets even harder to understand the rules if identifiers are written using, say, Hebrew or Arabic script where the reading order is from right to left.
You can use the quote_ident() function to implement a simple boolean function to test if a name is a common name thus:
create function s.is_common(nam in text)
returns boolean
set search_path = pg_catalog, s
language sql
as $body$
select (nam = quote_ident(nam));
$body$;
select
s.is_common('_123')::text as "is_common('_123')",
s.is_common('1dog')::text as "is_common('1dog')";
This is the result:
is_common('_123') | is_common('1dog')
-------------------+-------------------
true | false
Yugabyte recommends that you use only common names for user-created artifacts and that you determine a candidate's "common" status with quote_ident(). In other words, you should not exploit the leniency of PostgreSQL's practical definition that empirical testing, by creating tables with names like høyde, école, and so on (like the create table attempts above show). The following do statement shows how to test a candidate name's "common" versus "exotic" status:
do $body$
declare
nam text not null := '';
common_names text[] not null := array[
'employees',
'pay_grades',
'_42'];
exotic_names text[] not null := array[
'Employees',
'pay grades',
'start$dates',
'emp#',
'42',
'høyde',
'hØyde',
'école',
'École',
'правда',
'ПРАВДА',
'速度'];
begin
foreach nam in array common_names loop
assert s.is_common(nam), 'Not common';
end loop;
foreach nam in array exotic_names loop
assert not s.is_common(nam), 'Not exotic: '||nam;
end loop;
end;
$body$;
The block finishes without error, demonstrating that the classification of all the tested names is correct. Developers world-wide, with a huge range of native languages and writing schemes, usually avoid the need for quoted identifiers by choosing only names that pass the is_common() test that the block uses.
If you are convinced that you need to go against the recommendation that this tip gives, then you should explain your reasoning in the design specification document for your application's database backend.
Notice that the paradigm for role names, like d42$api that's used in the ysql-case-studies goes against this tip's recommendations by using $ as the separator between the name of the database for which the role is local and the role's so-called nickname. This design choice is justified by reasoning that the names of the two components must each be able to be freely chosen given just that they respect the ordinary criteria for common name status. The $ separator therefore has this clear, unique purpose.
The section Case study: PL/pgSQL procedures for provisioning roles with privileges on only the current database shows how the name of the database can be isolated to a single point of definition within the code corpus that installs the application's database backend.
A purist treatment needs to distinguish between the platonic notions, name and identifier, and their partner notions, the text of a name and the text of an identifier, as these are found in code and in query output. However, this level of precision isn't needed to make the basic point that this section makes; and it's very rare indeed to see any ordinary documentation for a programming language make this distinction.
Finally, it must be acknowledged that the documentation for most SQL systems, including PostgreSQL and YSQL, typically blurs the distinction between name and identifier, though both terms of art are used: name is very frequently used where identifier is the proper choice; and this is most conspicuous in the Grammar Diagrams omnibus listing. Fortunately, the context always makes the intended meaning clear.
Unquoted and quoted identifiers in action
Authorize a session as a regular role that can create objects in the current database and do this:
drop schema if exists "My Schema" cascade;
create schema "My Schema";
create table "My Schema"."my_table"(K int primary key, "V" text not null);
insert into "My Schema".MY_table("k", "V") values (1, 17), (2, 42);
select K, "V" from "My Schema".MY_TABLE order by 1;
The identifier for the object with the common name my_table is written, in the create table, insert, and select statements above, variously and correctly as "my_table", MY_TABLE, and MY_table. If a common name is entirely lower-case characters and is fairly long, then there's a vast number of different ways in which an identifier that denotes an object with that name might be spelled.
Here's the query result:
k | V
---+----
1 | 17
2 | 42
Notice that the first column heading spells the column's name, correctly, as lower-case k; and that the second column heading spells the column's name, again correctly, as upper-case V; Query the catalog to list some facts about the just-created table:
with c("SCHEMA_NAME", "TABLE_NAME", "COLUMN_NAME") as (
select n.nspname, c.relname, a.attname
from
pg_namespace as n
inner join
pg_class as c
on n.oid = c.relnamespace
inner join
pg_attribute as a
on c.oid = a.attrelid
where n.nspname ~ '^My'::name
and c.relname ~ '^my'::name
and c.relkind = 'r'
and a.attnum > 0)
select * from c order by 1, 2, 3;
Notice that the selected columns in the three catalog tables that the query joins are all spelled, in conformance with the PostgreSQL convention for such tables, by starting with an abbreviation that identifies the catalog table and by ending with the string name. (In other words, they don't end with the string identifier.)
This is the output:
SCHEMA_NAME | TABLE_NAME | COLUMN_NAME
-------------+------------+-------------
My Schema | my_table | V
My Schema | my_table | k
Notice the presence of both lower case and upper case characters. The create table, insert, and select statements might have used many different combinations of upper and lower case in the identifiers for the common names that produce this same result.
Of course, this example is contrived. You'd never see statements spelled like this in ordinary humanly written real code:
- Double quotes are used, here, around some names that qualify as common and that don't need (but may use) a quoted identifier.
- Unquoted strings are variously rendered, here, in all lower case, all upper case, or mixed case. These therefore identify objects with common names and so the spellings of the names are coerced to lower case.
The example makes the point forcefully. When we use the terms of art precisely and accurately, we must say that the create table, insert and select statements don't use names. Rather, they denote the artifacts with the names of interest by using identifiers to denote the artifacts with the intended names.