Average COVID-like symptoms vs average mask-wearing by state scatter plot for 21-Oct-2020

Scatter-plot for 21-Oct-2020

The dots on this plot represent "mask_wearing_pct" on the x-axis with "symptoms_pct" on the y-axis from this query:

select
  round(mask_wearing_pct)  as "% wearing mask",
  round(symptoms_pct)      as "% with symptoms",
  state
from covidcast_fb_survey_results_v
where survey_date = to_date('2020-10-21', 'yyyy-mm-dd')
order by 1;

The plot would be too cluttered if each of the 51 points were labeled with its two-letter state abbreviation.

The plot was created simply by pasting a comma-separated list of "mask_wearing_pct"-"symptoms_pct" pairs into a spreadsheet and by using the app's built-in functionality to create a scatter plot from such pairs values. The values were produced with this query:

select
  round(mask_wearing_pct)::text||','||round(symptoms_pct)::text
from covidcast_fb_survey_results_v
where survey_date = to_date('2020-10-21', 'yyyy-mm-dd')
order by 1;

Then the plot was printed and the line was drawn in by hand using the slope and y-axis intercept from this query:

with a as (
  select
    max(survey_date)                               as survey_date,
    regr_slope    (symptoms_pct, mask_wearing_pct) as s,
    regr_intercept(symptoms_pct, mask_wearing_pct) as i
  from covidcast_fb_survey_results_v
  where survey_date = to_date('2020-10-21', 'yyyy-mm-dd'))
select
  to_char(survey_date,      'mm/dd')  as survey_date,
  to_char(s,  '90.9')                 as s,
  to_char(i,  '990.9')                as i
from a;

This is the result:

 survey_date |   s   |   i
-------------+-------+--------
 10/21       |  -1.2 |  131.4

And here is the plot:

Scatter-plot for 2020-10-21

Scatter-plot for synthetic data

For comparison, the same technique was used to create a scatter-plot and to draw in the best-fit straight line using synthetic data by running the procedure "populate_t()" described in the section Create the test table within the section that introduces the built-in aggregate functions for linear regression analysis.

create table t(
  k      int primary key,
  x      double precision,
  y      double precision,
  delta  double precision);

create procedure populate_t(
  no_of_rows  in int,
  slope       in double precision,
  intercept   in double precision,
  mean        in double precision,
  stddev      in double precision)
  language plpgsql
as $body$
  ...

This code, and the remaining code below, needed to make the scatter-plot for synthetic data is included in synthetic-data.sql.

It uses the function normal_rand(), brought by the tablefunc extension to add pseudorandomly generated normally distributed noise the y-axis values produced by the "y = m*x + c" formula for the straight line.

It was then invoked like this:

call populate_t(
  no_of_rows  => 100,
  mean        => 0.0,
  stddev      => 5.0,
  slope       => -1.2,
  intercept   => 131.4);

using the values for slope and intercept from the regression analysis of the COVIDcast data for 21-Oct-2020 and by choosing a value for the "stddev" actual argument arbitrarily.

The comma-separated pairs for the spreadsheet were produced by this query:

select
  round(x)::text||','||round(y + delta)::text
from t
where
  x > 60        and
  x < 95        and
  x is not null and
  y is not null
order by x;

And the values for the slope and y-axis intercept were produced by this query:

with a as (
  select
    regr_r2       ((y + delta), x) as r2,
    regr_slope    ((y + delta), x) as s,
    regr_intercept((y + delta), x) as i
  from t)
select
  to_char(r2,  '0.99')                as r2,
  to_char(s,  '90.9')                 as s,
  to_char(i,  '990.9')                as i
from a;

This is the result:

  r2   |   s   |   i
-------+-------+--------
  0.98 |  -1.2 |  130.8

The emergent values for the slope and intercept are very close to the values (-1.2 and 131.4) that were used for the invocation of "populate_t()".

Here is the resulting plot

Scatter-plot for synthetic data :

Note: The normal_rand() function produces a different set of pseudorandomly distributed values each time that synthetic-data.sql is run. But the overall shape of the scatter-plot will remain the same.