 # Functions for linear regression analysis

This parent section and its two child sections describe these aggregate functions for linear regression analysis:

## Overview

See, for example, this Wikipedia article on Regression analysis. Briefly, linear regression analysis estimates the relationship between a dependent variable and an independent variable, aiming to find the line that most closely fits the data. This is why each of the functions described has two input formal parameters. The dependent variable, the first formal parameter, is conventionally designated by "y"; and the independent variable, the second formal parameter, is conventionally designated by "x".

See, for example, the article "How To Interpret R-squared in Regression Analysis". It says this:

Linear regression identifies the equation that produces the smallest difference between all of the observed values and their fitted values. To be precise, linear regression finds the smallest sum of squared residuals that is possible for the dataset.

In terms of the high school equation for a straight line:

``````y = m*x + c
``````

the function `regr_slope(y, x)` estimates the gradient, "m", of the straight line that best fits the set of coordinate pairs over which the aggregation is done; and the function `regr_intercept(y, x)` estimates its intercept with the y-axis, "c". The so-called "R-squared " measure, implemented by `regr_r2(y, x)`, indicates the goodness-of-fit. It measures the percentage of the variance in the dependent variable that the independent variables explain collectively—in other words, the strength of the relationship between your model and the dependent variable on a 0 – 100% scale. For example, if `regr_r2()` returns a value of 0.7, it means that seventy percent of the relationship between the putative dependent variable and the independent variable can be explained by a straight line with the gradient and intercept returned, respectively, by `regr_slope()` and `regr_intercept()`. The remaining thirty percent can be attributed to stochastic variation.

The purpose of each of the functions is rather specialized; but the domain is also very familiar to people who need to do linear regression. For this reason, the aim here is simply to explain enough for specialists to be able to understand exactly what is available, and how to invoke what they decide that they need. Each function is illustrated with a simple example.

Each of these aggregate functions is invoked by using the same syntax:

• either the simple syntax, `select aggregate_fn(expr, expr) from t`
• or the `GROUP BY` syntax
• or the `OVER` syntax

Only the simple invocation is illustrated. See, for example, the sections `GROUP BY` syntax and `OVER` syntax in the section `avg(), count(), max(), min(), sum()` for how to use these syntax patterns.

Signature:

Each one of the aggregate functions for linear regression analysis, except for `regr_count()`, has the same signature:

``````input value:       double precision, double precision

return value:      double precision
``````

Because it returns a count, `regr_count()` returns a `bigint`, thus:

``````input value:       double precision, double precision

return value:      bigint
``````

In all cases, the first input parameter represents the values that you want to be taken as the dependent variable (conventionally denoted by "y") and the second input parameter represents the values that you want to be taken as the independent variable (conventionally denoted by "x").

If, for a particular input row, either the expression for "y", or the expression for "x", evaluates to `null`, then that row is implicitly filtered out.

## Create the test table

The same test table recipe serves for illustrating all of the functions for linear regression analysis. The design is straightforward. Noise is added to a pure linear function, thus:

``````  y = slope*x + intercept + delta
``````

where "delta" is picked, for each "x" value from a pseudorandom normal distribution with specified mean and standard deviation.

The procedure "populate_t()" lets you try different values for "slope", "intercept", and for the size and variability of "delta". It uses the function `normal_rand()`, brought by the tablefunc extension.

``````drop procedure if exists populate_t(
int,  double precision,  double precision,  double precision,  double precision)
drop table if exists t cascade;

create table t(
k      int primary key,
x      double precision,
y      double precision,
delta  double precision);

create procedure populate_t(
no_of_rows  in int,
slope       in double precision,
intercept   in double precision,
mean        in double precision,
stddev      in double precision)
language plpgsql
as \$body\$
begin
delete from t;

with
a1 as (
select
s.v        as k,
s.v        as x,
(s.v * slope) + intercept as y
from generate_series(1, no_of_rows) as s(v)),

a2 as (
select (
row_number() over()) as k,
r.v as delta
from normal_rand(no_of_rows, mean, stddev) as r(v))

insert into t(k, x, y, delta)
select
k, x, a1.y, a2.delta
from a1 inner join a2 using(k);

insert into t(k, x, y, delta) values
(no_of_rows + 1,    0, null, null),
(no_of_rows + 2, null,    0, null);
end;
\$body\$;

\set no_of_rows 100
call populate_t(
no_of_rows  => :no_of_rows,

mean        =>  0.0,
stddev      => 20.0,
slope       =>  5.0,
intercept   =>  3.0);

\pset null <null>
with a as(
select k, x, y, delta from t where x between 1 and 5
union all
select k, x, y, delta from t where k between 96 and (:no_of_rows + 2))
select
to_char(x, '990.9')           as x,
to_char(y, '990.9')           as y,
to_char((y + delta), '990.9999') as "y + delta"
from a
order by k;
``````

Here is an impression of the result of invoking "populate_t()" with the values shown. The whitespace has been manually added.

``````   x    |   y    | y + delta
--------+--------+-----------
1.0 |    8.0 |   -5.9595
2.0 |   13.0 |  -14.8400
3.0 |   18.0 |   40.4009
4.0 |   23.0 |   27.8537
5.0 |   28.0 |   68.7411

96.0 |  483.0 |  483.9196
97.0 |  488.0 |  464.3205
98.0 |  493.0 |  528.2446
99.0 |  498.0 |  514.0421
100.0 |  503.0 |  549.7692
0.0 | <null> | <null>
<null> |    0.0 | <null>
``````

The individual functions are described in these two child-sections