The bucket allocation scheme
It might seem that the built-in function
width_bucket() (here in the PostgreSQL documentation) is tailor-made for this task. But there's a snag.
The values in the column t4.dp_score" lie in this so-called closed-closed interval
[0, 100]—expressed like this in SQL:
dp_score is between 0 and 100
Note: This Wikipedia article explains the terminology used to characterize intervals. For example:
The open-closed interval (0,1] means greater than 0 and less than or equal to 1. And the closed-open interval [0,1) means greater than or equal to 0 and less than 1.
Here is the signature (using sensibly chosen names for the formal parameters) of the
double precision overload of
function width_bucket( val in double precision, -- the value to be allocated lb in double precision, -- the lower bound of the overall closed-open interval ub in double precision, -- the upper bound of the overall closed-open interval n_vals in int) -- the required number of buckets returns int
If you invoke
width_bucket() to allocate a value
lb = 0,
ub = 100, and
n_vals = 10, like this:
width_bucket($1, 0, 100, 10)
then the buckets are defined as these closed-open intervals:
[ 0, 10) -> bucket 1 [10, 20) -> bucket 2 [20, 30) -> bucket 3 ... [80, 90) -> bucket 9 [90, 100) -> bucket 10
Any value that is less than 0 is allocated to the lower overflow bucket 0. And any value that is greater than or equal to the value 100 is allocated to the upper overflow bucket 11. Here is an example:
do $body$ declare lb constant double precision := 0; ub constant double precision := 100; n_vals constant int := 10; begin assert width_bucket( -0.0000000001, lb, ub, n_vals) = 0, 'unexpected'; assert width_bucket( 0, lb, ub, n_vals) = 1, 'unexpected'; assert width_bucket( 99.9999999999, lb, ub, n_vals) = 10, 'unexpected'; assert width_bucket(100, lb, ub, n_vals) = 11, 'unexpected'; end; $body$;
This means that attention has to be paid to the special case when the value to be allocated is 100 in order that it is allocated to bucket 10, and not to the upper overflow bucket 11. However, this is not the main snag. Rather, it's this:
- Pen-and-paper analysis, and some empirical testing, show that
cume_dist()can produce the same results only when these open-closed intervals are used:
( 0, 10] -> bucket 1 (10, 20] -> bucket 2 (20, 30] -> bucket 3 ... (80, 90] -> bucket 9 (90, 100] -> bucket 10
This means that attention has to be paid to the special case at the other end of the closed-closed interval of possible input values: when the value to be allocated is 0, it must be allocated to bucket 1, and not to the lower overflow bucket 0.
The consequence is that
width_bucket() cannot be used as is. This leaves two choices. Each relies on implementing a new function
bucket() with the same signature as
width_bucket() but with a different implementation.
The first choice is to implement
bucketas a wrapper around
width_bucket()to change its behavior at the bucket boundaries, See cr_bucket_using_width_bucket.sql .
The second choice is to implement
bucketfrom scratch. See
See do_assert_bucket_ok.sql for the acceptance test that each must pass.