Example data sets for window functions

The data sets used by the code examples

This page documents the preview version (v2.23). Preview includes features under active development and is for development and testing only. For production, use the stable version (v2024.1). To learn more, see Versioning.

These four pages contain scripts to create and populate tables with data sets that are useful for demonstrating window function semantics.

Each table uses a surrogate uuid primary key whose values are provided by the function gen_random_uuid(), which the pgcrypto extension brings. The procedure that populates table "t4" also uses the function normal_rand(), which the tablefunc extension brings. These extensions are described in the sections pgcrypto and tablefunc. Each is a pre-bundled extension. This means that installing each one needs no preparatory steps, as long as you do it as a superuser, like this:

create extension pgcrypto;
create extension tablefunc;

If you plan to run the code samples in this main "Window functions" section on your laptop, using a YugabyteDB cluster that you've created for your own personal use, then you've probably already adopted the habit of running any and all ad hoc tests as a superuser. If so, simply install the pgcrypto and tablefunc extensions just as you'd do anything else, and then create the tables "t1", "t2", "t3", and "t4".

Note 1: about the installation of extensions

You might have established the practice of creating different databases within your cluster for different purposes, and, within a database, a regime of different users that own different schemas, to model how a real-world application would be organized. If so, then you'll doubtless want to create the extensions in a dedicated central schema and ensure that this schema is on the search path for all ordinary users.
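As a concrete sketch of that regime (the schema name "extensions" and the database name "demo" are illustrative assumptions, not part of this section's scripts):

```sql
-- Run as a superuser. "extensions" and "demo" are hypothetical names.
create schema extensions;
create extension pgcrypto  with schema extensions;
create extension tablefunc with schema extensions;

-- Put the central schema on the search path for every session in the database
-- so that ordinary users can call gen_random_uuid() and normal_rand()
-- without schema-qualifying them.
alter database demo set search_path = "$user", public, extensions;
```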

Note 2: about the use of the gen_random_uuid() function

Yugabyte recommends that, when you want a self-populating surrogate primary key column, you use the approach shown here for the test tables, like this:

create table my_table(
  k uuid default gen_random_uuid() primary key, ...

This is preferred to using, for example, serial or bigserial, like this:

create table t4(
  k serial primary key, ...

(That approach is common, and works well, in PostgreSQL, a monolithic SQL database.) The reason is that serial and bigserial use a SEQUENCE to generate unique values, and this involves expensive coordination between the nodes in a YugabyteDB cluster. In contrast, any invocation of gen_random_uuid() on any node reliably produces a new globally unique value entirely algorithmically. This brings a noticeable performance benefit.

The tables "t1", "t2", and "t3" have only a handful of rows, so for them this performance benefit is well below the noise level. But table "t4" is populated by a purpose-written procedure that is parameterized with the number of rows to create. You get the most convincing demonstration effect with a large number of rows, like 100,000.

You can expect to see that populating table "t4" is about 20x faster when you use gen_random_uuid() than when you use a sequence.
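The actual procedure is defined on the "table t4" page. As a hedged sketch of the idea (the procedure name, its parameter, the column names, and the normal_rand() arguments here are all assumptions, not the real script):

```sql
-- A sketch only: names, columns, and constants are illustrative.
create or replace procedure populate_t4(no_of_rows in int)
  language plpgsql
as $body$
begin
  delete from t4;
  -- normal_rand(n, mean, stddev) returns n normally distributed values.
  insert into t4(k, dp_score)
  select gen_random_uuid(), r.v
  from normal_rand(no_of_rows, 50.0, 10.0) as r(v);
end;
$body$;

call populate_t4(100000);
```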

Each of the tables "t1", "t2", "t3", and "t4" is populated so that the values of interest for the demonstrations come back in random order (as you are taught to expect) when a query has no ORDER BY clause. This takes just a little programming effort for the tables "t1", "t2", and "t3". The effort is needed because, following a bulk insert into a newly-created table, queries with no ORDER BY clause tend to see the rows come back in the order in which they were inserted; and the INSERT statements for the tables "t1", "t2", and "t3" explicitly list the to-be-inserted values in an intuitive order where they increase monotonically. No effort is needed for table "t4" because the values of interest are generated by normal_rand(), which produces its values in a random order.
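One way to achieve this for a small table is to scramble the values before inserting them. This is a sketch only; the column names, the values, and the choice of a random sort before insertion are assumptions, not necessarily what the actual scripts do:

```sql
-- Scramble monotonically increasing values before inserting them so that
-- a subsequent query with no ORDER BY won't see them in a "natural" order.
insert into t1(class, v)
select u.class, u.v
from (values (1, 1), (1, 2), (2, 3), (2, 4)) as u(class, v)
order by random();
```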

When rows are inserted naïvely, as they usually are for functionality demonstrations, they tend to come back in a "natural" order, even with no ORDER BY. Deliberately subverting this tendency allows a vivid demonstration of the following fact: if the window_definition that's used to invoke any window function has no window ORDER BY clause, then the results are unpredictable and therefore meaningless, even if there is an ORDER BY clause at the overall query level. This is demonstrated in the section Showing the importance of the window ORDER BY clause. (There are cases where the order doesn't matter; for example, when the set of rows is the input to a conventionally invoked aggregate function.)
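A minimal illustration of the pitfall (the table and column names are assumptions):

```sql
-- "r" can differ from run to run because the window has no ORDER BY clause.
-- The query-level ORDER BY orders the final result, not the window's input.
select class, v, row_number() over (partition by class) as r
from t1
order by class, v;
```

Adding an ORDER BY clause inside the OVER clause, for example "over (partition by class order by v)", makes "r" deterministic.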

Save a script to (re)create all four test tables.

It's a good idea to save a script that you can use quickly and effortlessly to create the test tables should you lose, or change, them. This can happen easily with a YugabyteDB cluster on your laptop that you use for all sorts of ad hoc tests. For example, it takes only moments to destroy and re-create the cluster (or even to upgrade to a new version of YugabyteDB)—and this is common practice for certain kinds of test.

Of course, you must first visit each of those four pages and save each of the scripts that they present into the same directory where you save the "Master" installation script.

Save this script as, for example, install_all_tables.sql:

-- You can run this script time and again. It will always finish silently.

\i t1.sql
\echo 't1 done'

\i t2.sql
\echo 't2 done'

\i t3.sql
\echo 't3 done'

\i t4_1.sql
\i t4_2.sql
\echo 't4 done'

Then you can simply do this whenever you need to re-establish the state that the code examples rely on:

\i install_all_tables.sql

It takes only a few seconds to finish. Each of the scripts t1.sql, t2.sql, t3.sql, t4_1.sql, and t4_2.sql that it runs is designed to be rerunnable: each finishes silently on its first run and on all subsequent runs.
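The usual pattern that makes such a script rerunnable is to drop and re-create each object unconditionally (a sketch; the column list is an assumption):

```sql
-- Running this any number of times always leaves the same end state,
-- because "if exists" makes the drop silent when the table is absent.
drop table if exists t1 cascade;

create table t1(
  k     uuid default gen_random_uuid() primary key,
  class int  not null,
  v     int  not null);
```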