Inspect the COVIDcast .csv files
What the columns in the .csv files mean
Open each of the
.csv files that you downloaded in a plain text editor—or, if you prefer, a spreadsheet app. You'll see that there are twelve columns and that the first row is the title row. The first column has no title but you'll see that the values are all integers. This column will be named "code" in each of the three staging tables into which the values from the
.csv files are loaded.
Here, using that naming choice for the first column, is the list of column names. The list shows a typical value for each column.
code: 0 geo_value: ak signal: smoothed_wearing_mask time_value: 2020-09-26 direction: <null> issue: 2020-10-05 lag: 9 value: 79.4036061 stderr: 1.5071256 sample_size: 720.0 geo_type: state data_source: fb-survey
The meanings of the columns are given, on the COVIDcast site, within the documentation for the R package. The section ILI and CLI Indicators explains how to interpret the values in the "value" column. Notice this on the page from which you downloaded the data:
Each geographic region is identified with a unique identifier, such as FIPS code. See the geographic coding documentation for details.
Some of the explanations given on these referenced pages are more than you need. Here, for your convenience, is a précis—at the level of detail that this case study needs.
"code" and "geo_value"
You can see immediately that _"geo_value"_ contains the 50 familiar two-letter abbreviations for the US States together with DC for the District of Columbia making 51 distinct values in all. And you can see, too, that "code"_ contains integer values in the range 0..50_. The procedure "assert_assumptions_ok()" will check that these two rules hold and that the values for "code" and "geo_value" bear a fixed one-to-one correspondence. In the general case, these two columns have a meaning that reflects the choice that you made for "Geographic Level" when you downloaded the
Here is the code from the procedure "assert_assumptions_ok()" that checks that there are 51 distinct values of "geo_value" in each staging table as a whole and that they correspond to a manually provided list of the actual states (including DC).
This code checks that there are 51 distinct values of "geo_value" for each distinct "time_value.
This code checks the one-to-one correspondence between "code" and "geo_value".
Every row in this column, for a particular
.csv file, has the same value, thus:
covidcast-fb-survey-smoothed_wearing_mask-2020-09-13-to-2020-11-01.csv — smoothed_wearing_mask covidcast-fb-survey-smoothed_cli-2020-09-13-to-2020-11-01.csv — smoothed_cli covidcast-fb-survey-smoothed_hh_cmnty_cli-2020-09-13-to-2020-11-01.csv — smoothed_hh_cmnty_cli
You'll see that this column, in each
.csv file, has the same set of distinct values: the 50 dates in the range 2020-09-13 - 2020-11-01. In other words, the value specifies the date of the Facebook survey.
This code checks that there are 50 distinct values of "time_value" (in a dense series) for each distinct "geo_value".
This column, in each
.csv file, is empty—and this will be reflected by the fact that the columns in the three tables into which the three files are ingested are
NULL for every row.
"issue" and "lag"
Briefly, these values specify the publication date for the data and the number of days that the publication date lagged the date of the observation. This case study doesn't use these values.
This is given as a percentage in each of the three
.csv files. The procedure "assert_assumptions_ok()" will check that the values lie in the range 0..100. The page from which you downloaded the data explains, when each of the signals that you downloaded was highlighted, the significance of that signal's value.
- People Wearing Masks: Percentage of people who wear a mask most or all of the time while in public, based on surveys of Facebook users.
- COVID-Like Symptoms: Percentage of people with COVID-like symptoms, based on surveys of Facebook users.
- COVID-Like Symptoms in Community: Percentage of people who know someone in their local community with COVID-like symptoms, based on surveys of Facebook users.
The percentages reflect the ratio of the number of respondents who answered "yes" to the number of respondents to the particular survey question.
Notice the use of the term of art "smoothed" in the names of the
.csv files and the values of "signal". The ILI and CLI Indicators page explains it thus:
The smoothed versions of the signals... are calculated using seven day pooling. For example, the estimate reported for June 7 in a specific geographical area... is formed by collecting all surveys completed between June 1 and 7 (inclusive)...
In other words, all values are seven-day trailing moving averages.
This gives a confidence measure for the moving averages produced by smoothing. It isn't used by this case study, but it is carried over to the "covidcast_fb_survey_results" table that represents the data that (apart from this one and "sample_size" for each signal) the case study uses. You might find these values interesting for your own ad hoc queries.
This specifies the number of respondents, on a particular day. It isn't used by this case study, but it is carried over to the "covidcast_fb_survey_results" table that represents the data that (apart from this one and "stderr" for each signal) the case study uses. You might find these values interesting for your own ad hoc queries.
"data_source" and "geo_type"
The values in each of these two columns are the same for every row in the three
.csv files. They reflect your choice to download “Facebook Survey Results” at the "States" geographic level. The values are, respectively, 'fb-survey' and 'state'.
The meanings of the columns inform the design of the "covidcast_fb_survey_results" table
The descriptions above, and the successful outcome of the checks that the procedure "assert_assumptions_ok()" implements, inform the design of a single table to hold all of the data for the linear regression analysis.
The information in the following four columns is redundant and needn't be included in the projection:
GEO_TYPE # always 'state' DATA_SOURCE # always 'fb-survey' CODE # one-to-one with GEO_VALUE SIGNAL # will determine the destination column for VALUE and SAMPLE_SIZE
Further, the information in the following three columns isn't relevant for the linear regression analysis that this case study performs and also, therefore, needn't be included in the projection:
DIRECTION # always NULL ISSUE # ISSUE and LAG are just metadata LAG
Only these five columns remain of interest:
TIME_VALUE # TIME_VALUE is the SURVEY_DATE. GEO_VALUE is the two-letter STATE code. GEO_VALUE # The (SURVEY_DATE, STATE) tuple is the primary key. VALUE # Will go to MASK_WEARING_PCT, SYMPTOMS_PCT, or CMNTY_SYMPTOMS_PCT. STDERR # Will go to MASK_WEARING_STDERR, SYMPTOMS_STDERR, or CMNTY_SYMPTOMS_STDERR. SAMPLE_SIZE # Will go to MASK_WEARING_SAMPLE_SIZE, SYMPTOMS_SAMPLE_SIZE, or CMNTY_SYMPTOMS_SAMPLE_SIZE.
This leads to the following design for the "covidcast_fb_survey_results" table:
survey_date date not null } primary state text not null } key mask_wearing_pct numeric not null mask_wearing_stderr numeric not null mask_wearing_sample_size integer not null symptoms_pct numeric not null symptoms_stderr numeric not null symptoms_sample_size integer not null cmnty_symptoms_pct numeric not null cmnty_symptoms_stderr numeric not null cmnty_symptoms_sample_size integer not null
The columns "mask_wearing_pct", "symptoms_pct", and "cmnty_symptoms_pct" will have check constraints to ensure that the values are plausible percentages. And the columns "mask_wearing_sample_size", "symptoms_sample_size", "cmnty_symptoms_sample_size", "mask_wearing_stderr", "symptoms_stderr", and "cmnty_symptoms_stderr" will have check constraints to ensure that the values are positive. (The "sample size" and "stderr" columns are not used in the regression analysis.)