Functions for linear regression analysis
This parent section and its two child sections describe the aggregate functions for linear regression analysis, among them regr_slope(), regr_intercept(), regr_r2(), and regr_count().
See, for example, the Wikipedia article on Regression analysis. Briefly, linear regression analysis estimates the relationship between a dependent variable and an independent variable, aiming to find the line that most closely fits the data. This is why each of the functions described here has two formal input parameters. The dependent variable, the first formal parameter, is conventionally designated by "y"; and the independent variable, the second formal parameter, is conventionally designated by "x".
See, for example, the article "How To Interpret R-squared in Regression Analysis". It says this:
Linear regression identifies the equation that produces the smallest difference between all of the observed values and their fitted values. To be precise, linear regression finds the smallest sum of squared residuals that is possible for the dataset.
In terms of the high school equation for a straight line:
y = m*x + c
regr_slope(y, x) estimates the gradient, "m", of the straight line that best fits the set of coordinate pairs over which the aggregation is done; and the function regr_intercept(y, x) estimates its intercept with the y-axis, "c". The so-called "R-squared" measure, implemented by regr_r2(y, x), indicates the goodness of fit. It measures the proportion of the variance in the dependent variable that the independent variable explains; in other words, the strength of the relationship between your model and the dependent variable on a scale from 0 to 100%. For example, if regr_r2() returns a value of 0.7, it means that seventy percent of the variation in the putative dependent variable can be explained by a straight line with the gradient and intercept returned, respectively, by regr_slope() and regr_intercept(). The remaining thirty percent can be attributed to stochastic variation.
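To make the meanings of the three functions concrete, here is a minimal sketch. The three-row data set is invented for illustration; its points lie exactly on the line y = 2*x + 3, so the fit is perfect and "R-squared" is exactly one:

```sql
with data(x, y) as (
  values (1::double precision, 5::double precision),
         (2, 7),
         (3, 9))
select
  regr_slope(y, x)     as m,   -- 2: the gradient
  regr_intercept(y, x) as c,   -- 3: the y-axis intercept
  regr_r2(y, x)        as r2   -- 1: a perfect fit
from data;
```

With real, noisy data, regr_r2() returns a value strictly less than one.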
Each of these functions serves a rather specialized purpose, but the domain is very familiar to people who need to do linear regression. For this reason, the aim here is simply to explain enough for specialists to understand exactly what is available and how to invoke what they decide they need. Each function is illustrated with a simple example.
Each of these aggregate functions is invoked by using the same syntax:
- either the simple syntax, select aggregate_fn(expr, expr) from t
- or the GROUP BY syntax
- or the OVER windowing syntax
Each one of the aggregate functions for linear regression analysis, except for regr_count(), has the same signature:

```
input value:  double precision, double precision
return value: double precision
```

Because it returns a count, regr_count() has a different return type:

```
input value:  double precision, double precision
return value: bigint
```
In all cases, the first input parameter represents the values that you want to be taken as the dependent variable (conventionally denoted by "y") and the second input parameter represents the values that you want to be taken as the independent variable (conventionally denoted by "x").
About nullness

If, for a particular input row, either the expression for "y" or the expression for "x" evaluates to null, then that row is implicitly filtered out.
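This implicit filtering can be observed directly with regr_count(). Here is a minimal sketch on an invented four-row data set in which one row has a null "x" and another a null "y":

```sql
with data(x, y) as (
  values (1::double precision, 2::double precision),
         (2,    4),
         (null, 6),
         (3,    null))
select
  count(*)         as n_rows,  -- 4: all rows
  regr_count(y, x) as n_used   -- 2: rows where x or y is null are filtered out
from data;
```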
Create the test table
The same test table recipe serves for illustrating all of the functions for linear regression analysis. The design is straightforward. Noise is added to a pure linear function, thus:
y = slope*x + intercept + delta
where "delta" is picked, for each "x" value, from a pseudorandom normal distribution with a specified mean and standard deviation.
The procedure "populate_t()" lets you try different values for "slope", "intercept", and for the size and variability of "delta". It uses the function
normal_rand(), brought by the tablefunc extension.
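Because normal_rand() comes from the tablefunc extension, that extension must be installed once per database before the procedure below can run. A minimal sketch:

```sql
-- Install the extension that provides normal_rand().
create extension if not exists tablefunc;

-- Sanity check: draw five pseudorandom values from a normal
-- distribution with mean 0.0 and standard deviation 20.0.
-- (The output differs on every call.)
select r.v from normal_rand(5, 0.0, 20.0) as r(v);
```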
```sql
drop procedure if exists populate_t(
  int, double precision, double precision, double precision, double precision)
  cascade;
drop table if exists t cascade;

create table t(
  k      int primary key,
  x      double precision,
  y      double precision,
  delta  double precision);

create procedure populate_t(
  no_of_rows  in int,
  slope       in double precision,
  intercept   in double precision,
  mean        in double precision,
  stddev      in double precision)
  language plpgsql
as $body$
begin
  delete from t;

  with
    a1 as (
      select
        s.v                       as k,
        s.v                       as x,
        (s.v * slope) + intercept as y
      from generate_series(1, no_of_rows) as s(v)),

    a2 as (
      select
        (row_number() over()) as k,
        r.v                   as delta
      from normal_rand(no_of_rows, mean, stddev) as r(v))

  insert into t(k, x, y, delta)
  select k, x, a1.y, a2.delta
  from a1 inner join a2 using(k);

  insert into t(k, x, y, delta) values
    (no_of_rows + 1, 0,    null, null),
    (no_of_rows + 2, null, 0,    null);
end;
$body$;

\set no_of_rows 100
call populate_t(
  no_of_rows => :no_of_rows,
  mean       => 0.0,
  stddev     => 20.0,
  slope      => 5.0,
  intercept  => 3.0);

\pset null <null>
with a as (
  select k, x, y, delta from t where x between 1 and 5
  union all
  select k, x, y, delta from t where k between 96 and (:no_of_rows + 2))
select
  to_char(x,           '990.9')    as x,
  to_char(y,           '990.9')    as y,
  to_char((y + delta), '990.9999') as "y + delta"
from a
order by k;
```
Here is an impression of the result of invoking "populate_t()" with the values shown. The whitespace has been manually added.
```
   x    |   y    | y + delta
--------+--------+-----------
   1.0  |   8.0  |   -5.9595
   2.0  |  13.0  |  -14.8400
   3.0  |  18.0  |   40.4009
   4.0  |  23.0  |   27.8537
   5.0  |  28.0  |   68.7411
  96.0  | 483.0  |  483.9196
  97.0  | 488.0  |  464.3205
  98.0  | 493.0  |  528.2446
  99.0  | 498.0  |  514.0421
 100.0  | 503.0  |  549.7692
   0.0  | <null> |    <null>
 <null> |   0.0  |    <null>
```
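As a quick sanity check, the aggregate functions can be applied to the noisy values, "y + delta", in the table "t" created above. This sketch assumes that "populate_t()" has been called with the values shown; the estimates should come out close to the slope (5.0) and intercept (3.0) that were passed in, with an "R-squared" value somewhat less than one. The exact numbers differ on every run because "delta" is pseudorandom, so no expected output is shown:

```sql
select
  to_char(regr_slope((y + delta), x),     '990.99') as "estimated slope",
  to_char(regr_intercept((y + delta), x), '990.99') as "estimated intercept",
  to_char(regr_r2((y + delta), x),        '990.99') as "R-squared"
from t;
```

Notice that the two deliberately inserted rows with null values are implicitly filtered out, as the "About nullness" note explains.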
The individual functions are described in these two child sections.