Case study: using aggregate functions to demonstrate the "68–95–99.7" rule
This case study shows the use of aggregate functions to demonstrate the so-called "68–95–99.7 rule"—described in this Wikipedia article. This case-study focuses on just one part of the rule:
- 68.27% of the values in a normal distribution lie within one standard deviation each side of the mean.
Populate the test table
The demonstration uses the function
normal_rand(), brought by the tablefunc extension, to populate the test table:
drop table if exists t cascade; create table t(v double precision primary key); do $body$ declare no_of_rows constant int := 1000000; mean constant double precision := 0.0; stddev constant double precision := 50.0; begin insert into t(v) select normal_rand(no_of_rows, mean, stddev); end; $body$;
If you choose a value of one million for "no_of_rows", then the two different estimates for the one sigma boundaries reliably produce results that typically differ from each other by less than about 0.1%. If you choose fewer rows, then the variability, and the typical results difference, will be bigger.
Because the demonstration (for convenience) uses a table with a single
double precision column, "v", this must be the primary key. It's just possible that
normal_rand() will create some duplicate values. However, this is so very rare that it was never seen while the script was repeated, many times, during the development of this code example. If
insert into t(v) does fail because of this, just repeat the script by hand.
Create, and execute, the table function "the_6827_rule()"
The table function "the_6827_rule()", below, uses the following approach:
Secondly, it determines these boundaries by invoking the
percentile_cont()aggregate function with putative
percent_rank()input values corresponding, respectively, to the fraction of the table's values that lie below "mean - one standard deviation" and below "mean + one standard deviation".
drop function if exists the_6827_rule() cascade; create function the_6827_rule() returns table (t text) language plpgsql as $body$ declare -- FIRST, determine the one sigma boundaries by using avg() and stddev_pop(). measured_avg constant double precision := ( select avg(v) from t); measured_sigma constant double precision := ( select stddev_pop(v) from t); one_sigma_boundaries_from_avg_and_stddev constant double precision := array[ measured_avg - measured_sigma, measured_avg + measured_sigma]; -- SECOND, determine the one sigma boundaries by using percentile_cont(). fraction_within_sigma_from_avg constant double precision := 0.6827; expected_fraction_below_avg_minus_sigma constant double precision := (1.0 - fraction_within_sigma_from_avg)/2.0; expected_fraction_above_avg_plus_sigma constant double precision := expected_fraction_below_avg_minus_sigma + fraction_within_sigma_from_avg; fractions constant double precision := array[ expected_fraction_below_avg_minus_sigma, expected_fraction_above_avg_plus_sigma]; one_sigma_boundaries_from_percentile_cont constant double precision := ( select percentile_cont(fractions) within group (order by v) from t); begin t := rpad(' ', 13)|| lpad('from avg() and stddev_pop()', 29, ' ')|| lpad('from percentile_cont()', 24, ' ')|| lpad('ratio', 10, ' '); return next; t := lpad(' ', 13)||rpad(' ', 29, '-')||rpad(' ', 24, '-')||rpad(' ', 10, '-'); return next; for j in 1..2 loop declare caption constant text not null := case j when 1 then 'mean - sigma:' when 2 then 'mean + sigma:' end; num constant double precision not null := one_sigma_boundaries_from_percentile_cont[j] - one_sigma_boundaries_from_avg_and_stddev[j]; denom constant double precision not null := (one_sigma_boundaries_from_percentile_cont[j] + one_sigma_boundaries_from_avg_and_stddev[j])/2.0; ratio constant double precision not null := abs((100.0::double precision*num)/denom); begin t := rpad(caption, 13)|| lpad(to_char(one_sigma_boundaries_from_avg_and_stddev[j], '990.99999'), 29)|| lpad(to_char(one_sigma_boundaries_from_percentile_cont[j], '990.99999'), 24)|| lpad(to_char(ratio, '990.999'), 9)||'%'; return next; end; end loop; end; $body$;
A table function is used because if
raise info is used in a procedure or anonymous block, then the output cannot be spooled to a file. Invoke it, for example, like this:
\t on \o report.txt select t from the_6827_rule(); \o \t off
Here is a typical result (for one million rows):
from avg() and stddev_pop() from percentile_cont() ratio --------------------------- ---------------------- -------- mean - sigma: -50.00899 -49.99646 0.025% mean + sigma: 49.97396 50.00483 0.062%