variance(), var_pop(), var_samp(), stddev(), stddev_pop(), stddev_samp()

This section describes the variance(), var_pop(), var_samp(), stddev(), stddev_pop(), and stddev_samp() aggregate functions. They measure how widely a set of values is spread about its arithmetic mean, and so provide a confidence measure for that computed mean.

Each of these aggregate functions is invoked by using the same syntax:

either the simple syntax, select aggregate_fun(expr) from t
or the GROUP BY syntax
or the OVER syntax

Only the simple invocation is illustrated in this section. See, for example, the sections GROUP BY syntax and OVER syntax in the section avg(), count(), max(), min(), sum() for how to use these syntax patterns.

Background

The notions "variance" and "standard deviation" are trivially related: the latter is the square root of the former. The variance of a set of N values, v, is defined, naïvely, in terms of the arithmetic mean, a, of those values:

  variance = ( sum over all "v" of (v - a)^2 ) / N

Statisticians distinguish between the variance and standard deviation of an entire population and those of a sample of a population. The formulas for the "population" variants use the naïve definition of variance, dividing by N; the formulas for the "sample" variants divide by (N - 1) instead.
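Written out in full, the two variants look like this. The notation is only illustrative: the subscripted v_i for the individual values, and the use of the function names as the names of the quantities, are assumptions made for this sketch.

```latex
\begin{align*}
a &= \frac{1}{N} \sum_{i=1}^{N} v_i \\
\mathrm{var\_pop} &= \frac{1}{N} \sum_{i=1}^{N} (v_i - a)^2 \\
\mathrm{var\_samp} &= \frac{1}{N - 1} \sum_{i=1}^{N} (v_i - a)^2 \\
\mathrm{stddev\_pop} &= \sqrt{\mathrm{var\_pop}} \\
\mathrm{stddev\_samp} &= \sqrt{\mathrm{var\_samp}}
\end{align*}
```

The two variance formulas share the same sum of squared differences and differ only in the divisor.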

This example demonstrates that the built-in functions for the "population" and the "sample" variants of variance and standard deviation produce the same values as the text-book formulas that define them. First create a small set of values:

drop table if exists t cascade;
create table t(v numeric primary key);

insert into t(v)
select 100 + s.v*0.01
from generate_series (-5, 5) as s(v);

select to_char(v, '999.99') as v from t order by v;

The result is the eleven values from 99.95 through 100.05, in steps of 0.01.

Now create a function to test the equality between what the built-in functions produce and what the formulas that define them produce:

drop function if exists fmt(x in numeric) cascade;
drop function if exists f() cascade;

create function fmt(x in numeric)
  returns text
  language sql
as $body$
  select to_char(x, '0.99999999');
$body$;

create function f()
  returns table(t text)
  language plpgsql
as $body$
declare
  -- "n" is the number of values, N in the formulas above.
  n constant numeric not null := (
    select count(v)::numeric from t);

  -- The arithmetic mean of the values.
  avg constant numeric not null := (
    select avg(v) from t);

  -- The sum of the squared differences from the mean.
  s constant numeric not null := (
    select sum((avg - v)^2) from t);

  variance    numeric not null := 0;
  var_samp    numeric not null := 0;
  var_pop     numeric not null := 0;
  stddev      numeric not null := 0;
  stddev_samp numeric not null := 0;
  stddev_pop  numeric not null := 0;
begin
  select variance(v), var_samp(v), var_pop(v), stddev(v), stddev_samp(v), stddev_pop(v)
  into   variance,    var_samp,    var_pop,    stddev,    stddev_samp,    stddev_pop
  from t;

  -- The unqualified names behave the same as the "sample" variants.
  assert variance = var_samp,              'unexpected';
  assert stddev   = stddev_samp,           'unexpected';

  -- The built-in functions agree with the text-book formulas.
  assert var_samp = s/(n - 1),             'unexpected';
  assert var_pop  = s/n,                   'unexpected';

  assert stddev_samp = sqrt(s/(n - 1)),    'unexpected';
  assert stddev_pop = sqrt(s/n),           'unexpected';

  t = 'var_samp:               '||fmt(var_samp);                return next;
  t = 'var_pop:                '||fmt(var_pop);                 return next;
  t = 'stddev_samp:            '||fmt(stddev_samp);             return next;
  t = 'stddev_pop:             '||fmt(stddev_pop);              return next;
  t = 'stddev_samp/stddev_pop: '||fmt(stddev_samp/stddev_pop);  return next;
end;
$body$;

\t on
select t from f();
\t off

Notice that the semantics of variance() and var_samp() are identical; and that the semantics of stddev() and stddev_samp() are identical. Each of the assert statements succeeds and the function produces this result:

 var_samp:                0.00110000
 var_pop:                 0.00100000
 stddev_samp:             0.03316625
 stddev_pop:              0.03162278
 stddev_samp/stddev_pop:  1.04880885

This section assumes that you understand the distinction between the "population" and the "sample" variants and that you know which variant you need for your present purpose.

Signature:

Each one of the "confidence measure" aggregate functions has the same signature:

input value:       smallint, int, bigint, numeric, double precision, real

return value:      numeric, double precision

Notes: The lists of input and return data types show only the distinct data types. Because the output of each function is computed by division, the return data type is never one whose values are constrained to be whole numbers. Here are the specific mappings:

INPUT             OUTPUT
----------------  ----------------
smallint          numeric
int               numeric
bigint            numeric
numeric           numeric
double precision  double precision
real              double precision
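The mappings can be checked directly with pg_typeof(). This is a minimal sketch: the common table expression "x", its column "v", and the output column names are chosen just for this illustration, and var_samp() stands in for any of the six functions.

```sql
-- Typecast the same three values to each input data type in turn
-- and report the data type of the aggregate's result.
with x(v) as (values (1), (2), (3))
select
  pg_typeof(var_samp(v::smallint))::text         as from_smallint,
  pg_typeof(var_samp(v::int))::text              as from_int,
  pg_typeof(var_samp(v::bigint))::text           as from_bigint,
  pg_typeof(var_samp(v::numeric))::text          as from_numeric,
  pg_typeof(var_samp(v::double precision))::text as from_double_precision,
  pg_typeof(var_samp(v::real))::text             as from_real
from x;
```

The first four columns are expected to show numeric and the last two double precision, in line with the table above.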

variance()

Purpose: Returns the same value as var_samp(). The semantics of variance() and var_samp() are identical.

var_pop()

Purpose: Returns the variance of a set of values using the naïve formula (i.e. the "population" variant) that divides by the number of values, N, as explained in the Background section. In other words, it treats the set of values as the entire population of interest.

var_samp()

Purpose: Returns the variance of a set of values using the "sample" variant of the formula that divides by (N - 1) where N is the number of values, as explained in the Background section. In other words, it treats the set of values as just a sample of the entire population of interest. The value produced by var_samp() is bigger than that produced by var_pop() (as long as there are at least two values and they are not all the same), reflecting the fact that using only a sample is less reliable than using the entire population.
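Dividing by (N - 1) has a corner case that is worth checking on your own system: when the set has just one value, the divisor is zero. The sketch below assumes the usual PostgreSQL-style behavior, where var_samp() then returns NULL while var_pop() returns zero. The name "x", its column "v", and the value 42.0 are arbitrary choices for the illustration.

```sql
-- A set with exactly one value: N = 1, so (N - 1) = 0.
with x(v) as (values (42.0))
select
  var_pop(v)          as var_pop,          -- expected to be zero
  var_samp(v) is null as var_samp_is_null  -- expected to be true
from x;
```

The same reasoning applies to stddev_samp() and stddev_pop(), because each is the square root of the corresponding variance.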

stddev()

Purpose: Returns the same value as stddev_samp(). The semantics of stddev() and stddev_samp() are identical.

stddev_pop()

Purpose: Returns the standard deviation of a set of values using the naïve formula (i.e. the "population" variant) that divides by the number of values, N, as explained in the Background section. In other words, it treats the set of values as the entire population of interest.

stddev_samp()

Purpose: Returns the standard deviation of a set of values using the "sample" variant of the formula that divides by (N - 1) where N is the number of values, as explained in the Background section. In other words, it treats the set of values as just a sample of the entire population of interest. The value produced by stddev_samp() is bigger than that produced by stddev_pop() (as long as there are at least two values and they are not all the same), reflecting the fact that using only a sample is less reliable than using the entire population.

Example

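The example uses the function normal_rand(), provided by the tablefunc extension, to populate the test table:

-- normal_rand() is available only when the tablefunc extension is installed.
create extension if not exists tablefunc;

drop table if exists t cascade;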
create table t(v double precision primary key);

do $body$
declare
  no_of_rows constant int              := 100000;
  mean       constant double precision := 0.0;
  stddev     constant double precision := 50.0;
begin
  insert into t(v)
  select normal_rand(no_of_rows, mean, stddev);
end;
$body$;

The larger the value that you choose for "no_of_rows", the closer the values returned by the "sample" variants of the confidence measures will be to the values returned by the "population" variants.
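The size of the effect can be worked out without any data. Because the "sample" variance divides the same sum of squares by (N - 1) rather than by N, the ratio stddev_samp/stddev_pop is the square root of N/(N - 1) and depends only on N. The following sketch tabulates that ratio for a few row counts. The name "x", its column "n", and the middle value 1000 are chosen just for this illustration; 11 is the row count of the table in the Background section, and 100000 is the value of "no_of_rows" used here.

```sql
-- The ratio stddev_samp/stddev_pop as a function of the row count alone.
select
  n,
  to_char(sqrt(n::numeric/(n - 1)), '0.99999999') as "sqrt(N/(N - 1))"
from (values (11), (1000), (100000)) as x(n)
order by n;
```

The row for 11 should reproduce the value 1.04880885 shown for stddev_samp/stddev_pop in the Background section, and the ratio moves steadily closer to 1 as the row count grows.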

Because the demonstration (for convenience) uses a table with a single double precision column, "v", this must be the primary key. It's just possible that normal_rand() will create some duplicate values. However, this is so very rare that it was never seen while the script was repeated, many times, during the development of this code example. If insert into t(v) does fail because of this, just repeat the script by hand.

Now display the values for avg(v), stddev_samp(v), stddev_pop(v), and the value of stddev_samp(v)/stddev_pop(v).

with a as (
  select
    avg(v)         as avg,
    stddev_samp(v) as stddev_samp,
    stddev_pop(v)  as stddev_pop
  from t)
select
  to_char(avg,              '0.999') as avg,
  to_char(stddev_samp, '999.999999') as stddev_samp,
  to_char(stddev_pop,  '999.999999') as stddev_pop,

  to_char(stddev_samp/stddev_pop, '90.999999') as "stddev_samp/stddev_pop"
from a;

Because of the pseudorandom nature of normal_rand(), the values produced will change from run to run. Here are some typical values:

  avg   | stddev_samp | stddev_pop  | stddev_samp/stddev_pop
--------+-------------+-------------+------------------------
  0.138 |   49.880052 |   49.879802 |   1.000005