Average COVID-like symptoms vs average mask-wearing by state scatter plot for 21-Oct-2020
This page documents the preview version (v2.23). Preview includes features under active development and is for development and testing only. For production, use the stable version (v2024.1). To learn more, see Versioning.
Scatter-plot for 21-Oct-2020
The dots on this plot represent "mask_wearing_pct" on the x-axis with "symptoms_pct" on the y-axis from this query:
select
round(mask_wearing_pct) as "% wearing mask",
round(symptoms_pct) as "% with symptoms",
state
from covidcast_fb_survey_results_v
where survey_date = to_date('2020-10-21', 'yyyy-mm-dd')
order by 1;
The plot would be too cluttered if each of the 51 points were labeled with its two-letter state abbreviation.
The plot was created simply by pasting a comma-separated list of "mask_wearing_pct"-"symptoms_pct" pairs into a spreadsheet and by using the app's built-in functionality to create a scatter plot from such pairs values. The values were produced with this query:
select
round(mask_wearing_pct)::text||','||round(symptoms_pct)::text
from covidcast_fb_survey_results_v
where survey_date = to_date('2020-10-21', 'yyyy-mm-dd')
order by 1;
Then the plot was printed and the line was drawn in by hand using the slope and y-axis intercept from this query:
with a as (
select
max(survey_date) as survey_date,
regr_slope (symptoms_pct, mask_wearing_pct) as s,
regr_intercept(symptoms_pct, mask_wearing_pct) as i
from covidcast_fb_survey_results_v
where survey_date = to_date('2020-10-21', 'yyyy-mm-dd'))
select
to_char(survey_date, 'mm/dd') as survey_date,
to_char(s, '90.9') as s,
to_char(i, '990.9') as i
from a;
This is the result:
survey_date | s | i
-------------+-------+--------
10/21 | -1.2 | 131.4
And here is the plot:
Scatter-plot for synthetic data
For comparison, the same technique was used to create a scatter-plot and to draw in the best-fit straight line using synthetic data by running the procedure "populate_t()" described in the section Create the test table within the section that introduces the built-in aggregate functions for linear regression analysis.
create table t(
k int primary key,
x double precision,
y double precision,
delta double precision);
create procedure populate_t(
no_of_rows in int,
slope in double precision,
intercept in double precision,
mean in double precision,
stddev in double precision)
language plpgsql
as $body$
...
This code, and the remaining code below, needed to make the scatter-plot for synthetic data is included in synthetic-data.sql
.
It uses the function normal_rand()
, brought by the tablefunc extension to add pseudorandomly generated normally distributed noise the y-axis values produced by the "y = m*x + c" formula for the straight line.
It was then invoked like this:
call populate_t(
no_of_rows => 100,
mean => 0.0,
stddev => 5.0,
slope => -1.2,
intercept => 131.4);
using the values for slope and intercept from the regression analysis of the COVIDcast data for 21-Oct-2020 and by choosing a value for the "stddev" actual argument arbitrarily.
The comma-separated pairs for the spreadsheet were produced by this query:
select
round(x)::text||','||round(y + delta)::text
from t
where
x > 60 and
x < 95 and
x is not null and
y is not null
order by x;
And the values for the slope and y-axis intercept were produced by this query:
with a as (
select
regr_r2 ((y + delta), x) as r2,
regr_slope ((y + delta), x) as s,
regr_intercept((y + delta), x) as i
from t)
select
to_char(r2, '0.99') as r2,
to_char(s, '90.9') as s,
to_char(i, '990.9') as i
from a;
This is the result:
r2 | s | i
-------+-------+--------
0.98 | -1.2 | 130.8
The emergent values for the slope and intercept are very close to the values (-1.2 and 131.4) that were used for the invocation of "populate_t()".
Here is the resulting plot
:
Note: The normal_rand()
function produces a different set of pseudorandomly distributed values each time that synthetic-data.sql
is run. But the overall shape of the scatter-plot will remain the same.