YSQL aggregate functions signature and purpose

Signature and purpose of each aggregate function

The aggregate functions are categorized into four classes:

General-purpose aggregate functions

The aggregate functions in this class can be invoked in one of two ways:

  • Either "ordinarily" on all the rows in a table or in connection with GROUP BY, when they return a single value for the set of rows.

    In this use, row ordering often doesn't matter. For example, avg() has this property. Sometimes, row ordering does matter. For example, the order of grouped values determines the mapping between array index and array value with array_agg().

  • Or as a window function with OVER.

    In this use, where the aggregate function is evaluated for each row in the window, ordering always matters.

Arguably, avg() might be better classified as a statistical aggregate function. (It is often used together with stddev().) But, because it is so very commonly used, and used as a canonical example of the "aggregate function" notion, it is classified here as a general-purpose aggregate function.

Function Description
array_agg() Returns an array whose elements are the individual values that are aggregated. This is described in full detail the array_agg() subsection in the main Array section.
avg() Computes the arithmetic mean of a set of summable values by adding them all together and dividing by the number of values. If the set contains nulls, then these are simply ignored—both when computing the sum and when counting the number of values.
bit_and() Returns a value that represents the outcome of the applying the two-by-two matrix AND rule to each alligned set of bits for the set of NOT NULL input values.
bit_or() Returns a value that represents the outcome of the applying the two-by-two matrix OR rule to each alligned set of bits for the set of NOT NULL input values.
bool_and() Returns a value that represents the outcome of the applying the two-by-two matrix AND rule to the set of NOT NULL input boolean values.
bool_or() Returns a value that represents the outcome of the applying the two-by-two matrix OR rule to the set of NOT NULL input boolean values.
count() Counts the number of non null values in a set. The data type of the values is of no consequence.
every() every() is a synonym for bool_and()
jsonb_agg() This, and json_agg() are described in detail the jsonb_agg() section in the main JSON section.
jsonb_object_agg() This and json_object_agg() are described in detail the jsonb_object_agg() section in the main JSON section.
max() Computes the greatest value among the values in the set using the rule that is used for the particular data type in the ORDER BY clause. nulls are removed before sorting the values.
min() Computes the least value among the values in the set using the rule that is used for the particular data type in the ORDER BY clause. nulls are removed before sorting the values.
string_agg() Returns a single value produced by concatenating the aggregated values (first argument) separated by a mandatory separator (second argument). The first overload has text inputs and returns text. The second overload has bytea inputs and returns bytea.
sum() Computes the sum of a set of summable values by adding them all together. If the set contains nulls, then these are simply ignored.
xmlagg() This is not supported through Version YB 2.2. See GitHub Issue #1043

Statistical aggregate functions

The aggregate functions in this class can be invoked in one of two ways:

  • Either "ordinarily" on all the rows in a table or in connection with GROUP BY, when they return a single value for the set of rows. In this use, row ordering doesn't matter.

  • Or as a window function with OVER.

    In this use, where the aggregate function is evaluated for each row in the window, ordering always matters.

Function Description
covar_pop() Returns the so-called covariance, taking the available values to be the entire population.
covar_samp() Returns the so-called covariance, taking the available values to be a sample of the population.
corr() Returns the so-called correlation coefficient. This measures the extent to which the "y" values are linearly related to the "x" values. A return value of 1.0 indicates perfect correlation.
regr_avgy() Returns the average of the first argument for those rows where both arguments are NOT NULL.
regr_avgx() Returns the average of the second argument for those rows where both arguments are NOT NULL.
regr_count() Returns the number of rows where both arguments are NOT NULL.
regr_slope() Returns the slope of the straight line that linear regression analysis has determined best fits the "(y, x)" pairs.
regr_intercept() Returns the point at which the straight line that linear regression analysis has determined best fits the "(y, x)" pairs intercepts the "y"-axis.
regr_r2() Returns the square of the correlation coefficient, corr().
regr_syy() Returns regr_count(y, x)*var_pop(y) for NOT NULL pairs.
regr_sxx() Returns regr_count(y, x)*var_pop(x) for NOT NULL pairs.
regr_sxy() Returns regr_count(y, x)*covar_pop(y, x) for NOT NULL pairs.
variance() The semantics of variance() and var_samp() are identical.
var_pop() Returns the variance of a set of values using the "population" variant of the formula that divides by N, the number of values.
var_samp() Returns the variance of a set of values using the "sample" variant of the formula that divides by (N - 1) where N is the number of values.
stddev() The semantics of stddev() and stddev_samp() are identical.
stddev_pop() Returns the standard deviation of a set of values using the naïve formula (i.e. the "population" variant) that divides by the number of values, N.
stddev_samp() Returns the standard deviation of a set of values using the "sample" variant of the formula that divides by (N - 1) where N is the number of values.

Within-group ordered-set aggregate functions

These functions are sometimes referred to as “inverse distribution” functions. They can be invoked only with the dedicated WITHIN GROUP (ORDER BY ...) syntax. They cannot be invoked as a window function with OVER.

Function Description
mode() Return the most frequent value of "sort expression". If there's more than one equally-frequent value, then one of these is silently chosen arbitrarily.
percentile_disc() Discrete variant. The scalar overload of percentile_disc() takes a percentile rank value as input and returns the value, within the specified window, that would produce this. The array overload takes an array of percentile rank values as input and returns the array of values that would produce these.
percentile_cont() Continuous variant. The scalar overload of percentile_cont() takes a percentile rank value as input and returns the value, within the specified window, that would produce this. The array overload takes an array of percentile rank values as input and returns the array of values that would produce these.

Within-group hypothetical-set aggregate functions

These functions as invoked, as within-group hypothetical-set aggregate functions, with the dedicated WITHIN GROUP (ORDER BY ...) syntax. Further, each of the functions listed in this class is associated with a window function of the same name. (But not every window function can be invoked in this way.) For each, the result is the value that the associated window function would have returned for the “hypothetical” row constructed, as the invocation syntax specifies, as if such a row had been added to the window.

Function Description
rank() Returns the integer ordinal rank of each row according to the emergent order that the window ORDER BY clause specifies. The series of values starts with 1 but, when the window contains ties, the series is not dense. See the account of rank() in the Window functions section for more information.
dense_rank() Returns the integer ordinal rank of the distinct value of each row according to what the window ORDER BY clause specifies. The series of values starts with 1 and, even when the window contains ties, the series is dense. See the account of dense_rank() in the Window functions section for more information.
percent_rank() Returns the percentile rank of each row within the window, with respect to the argument of the window_definition's window ORDER BY clause. See the account of percent_rank() in the Window functions section for more information.
cume_dist() Returns a value that represents the number of rows with values less than or equal to the current row’s value divided by the total number of rows—in other words, the relative position of a value in a set of values. See the account of cume_dist() in the Window functions section for more information.