Join us on YugabyteDB Community Slack
Star us on
Get Started
Slack
GitHub
Get Started
v2.15 (preview) v2.12 (stable) v2.8 (earlier version) v2.6 (earlier version) v2.4 (earlier version) Unsupported versions
  • YUGABYTEDB CORE
    • Quick start
      • 1. Install YugabyteDB
      • 2. Create a local cluster
      • 3. Explore distributed SQL
      • 4. Build an application
        • Java
        • Node.js
        • Go
        • Python
        • Ruby
        • C#
        • PHP
        • C++
        • C
        • Scala
        • Rust
    • Explore
      • SQL features
        • SQL Feature Support
        • PostgreSQL Compatibility
        • Foreign Data Wrappers
        • Schemas and Tables
        • Data Types
        • Data Manipulation
        • Queries and Joins
        • Expressions and Operators
        • Stored Procedures
        • Triggers
        • Advanced features
          • Cursors
          • Table Partitioning
          • Views
          • Savepoints
          • Collations
        • Going beyond SQL
          • Follower reads
          • Tablespaces
        • PostgreSQL extensions
      • Fault tolerance
      • Horizontal scalability
        • Scaling Transactions
        • Sharding Data
      • Transactions
        • Distributed Transactions
        • Isolation Levels
        • Explicit Locking
      • Indexes and Constraints
        • Primary keys
        • Foreign keys
        • Secondary indexes
        • Unique indexes
        • Partial indexes
        • Expression indexes
        • Covering indexes
        • GIN indexes
        • Other constraints
      • JSON support
      • Multi-region deployments
        • Sync replication (3+ regions)
        • Async Replication (2+ regions)
        • Row-Level Geo-Partitioning
        • Read replicas
      • Query tuning
        • Introduction
        • Get query statistics using pg_stat_statements
        • Viewing live queries with pg_stat_activity
        • Analyzing queries with EXPLAIN
        • Optimizing YSQL queries using pg_hint_plan
      • Cluster management
        • Point-in-time recovery
      • Change data capture (CDC)
        • CDCSDK Server
        • Debezium connector
        • Java CDC console
      • Security
      • Observability
        • Prometheus Integration
        • Grafana Dashboard
    • Drivers and ORMs
      • Java
        • Connect an app
        • Use an ORM
      • Go
        • Go drivers
        • Go ORMs
        • Supported versions
      • C#
        • Connect an app
        • Use an ORM
      • NodeJS
        • NodeJS drivers
        • NodeJS ORMs
        • Supported Versions
      • Python
        • Python drivers
        • Python ORMs
        • Supported versions
      • Rust
        • Diesel ORM
    • Develop
      • Learn app development
        • 1. SQL vs NoSQL
        • 2. Data modeling
        • 3. Data types
        • 4. ACID transactions
        • 5. Aggregations
        • 6. Batch operations
        • 7. Date and time
        • 8. Strings and text
        • 9. TTL for data expiration
      • Real-world examples
        • E-Commerce app
        • IoT fleet management
      • Explore sample apps
      • Best practices
      • Quality of Service
        • Rate limiting connections
        • Write-heavy workloads
        • Transaction priorities
      • Cloud-native development
        • Codespaces
        • Gitpod
    • Migrate
      • YugabyteDB Voyager
        • Install
        • Migration steps
        • Performance
        • yb-voyager CLI
      • Manual import
        • Convert a PostgreSQL schema
        • Migrate a PostgreSQL application
        • Export PostgreSQL data
        • Prepare a cluster
        • Import PostgreSQL data
        • Verify migration
    • Deploy
      • Deployment checklist
      • Manual deployment
        • 1. System configuration
        • 2. Install software
        • 3. Start YB-Masters
        • 4. Start YB-TServers
        • 5. Verify deployment
      • Kubernetes
        • Single-zone
          • Open Source
          • Amazon EKS
          • Google Kubernetes Engine
          • Azure Kubernetes Service
        • Multi-zone
          • Amazon EKS
          • Google Kubernetes Engine
        • Multi-cluster
          • Google Kubernetes Engine
        • Best practices
        • Connect Clients
      • Docker
      • Public clouds
        • Amazon Web Services
        • Google Cloud Platform
        • Microsoft Azure
      • Multi-DC deployments
        • Three+ data center (3DC)
        • Asynchronous replication
        • Read replica clusters
    • Benchmark
      • TPC-C
      • sysbench
      • YCSB
      • Key-value workload
      • Large datasets
      • Scalability
        • Scaling queries
      • Resilience
        • Jepsen testing
      • Performance Troubleshooting
    • Secure
      • Security checklist
      • Enable authentication
        • Enable users
        • Configure client authentication
      • Authentication methods
        • Password authentication
        • LDAP authentication
        • Host-based authentication
        • Trust authentication
      • Role-based access control
        • Overview
        • Manage users and roles
        • Grant privileges
        • Row-level security
        • Column-level security
      • Encryption in transit
        • Create server certificates
        • Enable server-to-server encryption
        • Enable client-to-server encryption
        • Connect to clusters
        • TLS and authentication
      • Encryption at rest
      • Column-level encryption
      • Audit logging
        • Configure audit logging
        • Session-Level Audit Logging
        • Object-Level Audit Logging
      • Vulnerability disclosure policy
    • Manage
      • Backup and restore
        • Export and import
        • Distributed snapshots
        • Point-in-time recovery
      • Migrate data
        • Bulk import
        • Bulk export
      • Change cluster configuration
      • Diagnostics reporting
      • Upgrade a deployment
      • Grow cluster
    • Troubleshoot
      • Troubleshooting
      • Cluster level issues
        • YCQL connection issues
        • YEDIS connection Issues
        • Recover tserver/master
        • Replace a failed YB-TServer
        • Replace a failed YB-Master
        • Manual remote bootstrap when a majority of peers fail
        • Recover YB-TServer from crash loop
      • Node level issues
        • Check servers
        • Inspect logs
        • System statistics
        • Disk failure
        • Common error messages
    • Contribute
      • Core database
        • Contribution checklist
        • Build the source
        • Configure a CLion project
        • Run the tests
        • Coding style
      • Documentation
        • Docs checklist
        • Docs layout
        • Build the docs
          • Editor setup
        • Edit the docs
          • Docs page structure
          • Widgets and shortcodes
          • Syntax diagrams
        • Style guide
  • YUGABYTEDB ANYWHERE
    • Overview
      • Install
      • Configure
    • Install
      • Prerequisites
      • Prepare the environment
      • Install software
      • Prepare nodes
      • Uninstall software
    • Configure
      • Create admin user
      • Configure cloud providers
      • Configure backup target
      • Configure alerts
    • Create deployments
      • Multi-zone universe
      • Multi-region universe
      • Multi-cloud universe
      • Read replica cluster
      • Asynchronous replication
    • Manage deployments
      • Start and stop processes
      • Eliminate an unresponsive node
      • Recover a node
      • Enable high availability
      • Edit configuration flags
      • Edit a universe
      • Delete a universe
      • Configure instance tags
      • Upgrade YugabyteDB
      • Migrate to Helm 3
    • Back up universes
      • Configure backup storage
      • Back up universe data
      • Restore universe data
      • Schedule data backups
    • Security
      • Security checklist
      • Configure ports
      • LDAP authentication
      • Authorization
      • Create a KMS configuration
      • Enable encryption at rest
      • Enable encryption in transit
      • Network security
    • Alerts and monitoring
      • Alerts
      • Live Queries dashboard
      • Slow Queries dashboard
    • Troubleshoot
      • Install and upgrade issues
      • Cloud provider configuration issues
      • Universe issues
    • Administer
      • Back up YugabyteDB Anywhere
      • Configure authentication
    • Upgrade
      • Upgrade Kubernetes installation
      • Upgrade using Replicated
  • YUGABYTEDB MANAGED
    • Overview
    • Quick start
      • Create a Sandbox cluster
      • Connect to the cluster
      • Explore distributed SQL
      • Build an application
        • Before you begin
        • Java
        • Go
        • Python
        • Node.js
        • C
        • C++
        • C#
        • Ruby
        • Rust
        • PHP
    • Deploy clusters
      • Plan your cluster
      • Choose a topology
      • VPC network
        • Overview
        • VPCs
        • Peering connections
        • Create a VPC Network
      • Create your cluster
        • Sandbox
        • Single region
        • Replicate across regions
    • Secure clusters
      • IP allow lists
      • Database authorization
      • Add database users
      • Encryption in transit
      • Audit account activity
    • Connect to clusters
      • Cloud Shell
      • Client shell
      • Connect applications
    • Alerts and monitoring
      • Alerts
      • Performance metrics
      • Live queries
      • Slow queries
      • Performance Advisor
      • Cluster activity
    • Manage clusters
      • Scale and configure clusters
      • Backup and restore
      • Maintenance windows
      • Create extensions
    • Administration and billing
      • Manage account access
      • Manage API keys
      • Manage billing
      • Cluster costs
    • Example applications
      • Connect a Spring application
      • Connect a YCQL Java application
      • Hasura Cloud
      • Deploy a GraphQL application
    • Security architecture
      • Security architecture
      • Shared responsibility model
    • Troubleshoot
    • YugabyteDB Managed FAQ
    • What's new
  • INTEGRATIONS
    • Apache Kafka
    • Apache Spark
    • Camunda
    • Debezium
    • Django REST framework
    • Entity Framework
    • Flyway
    • GORM
    • Hasura
      • Application Development
      • Benchmarking
    • JanusGraph
    • KairosDB
    • Liquibase
    • Metabase
    • Presto
    • Prisma
    • Sequelize
    • Spring Framework
      • Spring Data YugabyteDB
      • Spring Data JPA
      • Spring Data Cassandra
    • SQLAlchemy
    • WSO2 Identity Server
    • YSQL Loader
    • YugabyteDB JDBC driver
  • REFERENCE
    • Architecture
      • Design goals
      • Key concepts
        • Universe
        • YB-TServer Service
        • YB-Master Service
      • Core functions
        • Universe creation
        • Table creation
        • Write IO path
        • Read IO path
        • High availability
      • Layered architecture
      • Query layer
        • Overview
      • DocDB transactions layer
        • Transactions overview
        • Transaction isolation levels
        • Explicit locking
        • Read Committed
        • Single-row transactions
        • Distributed transactions
        • Transactional IO path
      • DocDB sharding layer
        • Hash & range sharding
        • Tablet splitting
        • Colocated tables
      • DocDB replication layer
        • Replication
        • xCluster replication
        • Read replicas
        • Change data capture (CDC)
      • DocDB storage layer
        • Persistence
        • Performance
    • APIs
      • YSQL
        • The SQL language
          • SQL statements
            • ABORT
            • ALTER DATABASE
            • ALTER DEFAULT PRIVILEGES
            • ALTER DOMAIN
            • ALTER FOREIGN DATA WRAPPER
            • ALTER FOREIGN TABLE
            • ALTER GROUP
            • ALTER POLICY
            • ALTER ROLE
            • ALTER SEQUENCE
            • ALTER SERVER
            • ALTER TABLE
            • ALTER USER
            • ANALYZE
            • BEGIN
            • CALL
            • COMMENT
            • COMMIT
            • COPY
            • CREATE AGGREGATE
            • CREATE CAST
            • CREATE DATABASE
            • CREATE DOMAIN
            • CREATE EXTENSION
            • CREATE FOREIGN DATA WRAPPER
            • CREATE FOREIGN TABLE
            • CREATE FUNCTION
            • CREATE GROUP
            • CREATE INDEX
            • CREATE MATERIALIZED VIEW
            • CREATE OPERATOR
            • CREATE OPERATOR CLASS
            • CREATE POLICY
            • CREATE PROCEDURE
            • CREATE ROLE
            • CREATE RULE
            • CREATE SCHEMA
            • CREATE SEQUENCE
            • CREATE SERVER
            • CREATE TABLE
            • CREATE TABLE AS
            • CREATE TRIGGER
            • CREATE TYPE
            • CREATE USER
            • CREATE USER MAPPING
            • CREATE VIEW
            • DEALLOCATE
            • DELETE
            • DO
            • DROP AGGREGATE
            • DROP CAST
            • DROP DATABASE
            • DROP DOMAIN
            • DROP EXTENSION
            • DROP FOREIGN DATA WRAPPER
            • DROP FOREIGN TABLE
            • DROP FUNCTION
            • DROP GROUP
            • DROP MATERIALIZED VIEW
            • DROP OPERATOR
            • DROP OPERATOR CLASS
            • DROP OWNED
            • DROP POLICY
            • DROP PROCEDURE
            • DROP ROLE
            • DROP RULE
            • DROP SEQUENCE
            • DROP SERVER
            • DROP TABLE
            • DROP TRIGGER
            • DROP TYPE
            • DROP USER
            • END
            • EXECUTE
            • EXPLAIN
            • GRANT
            • IMPORT FOREIGN SCHEMA
            • INSERT
            • LOCK
            • PREPARE
            • REASSIGN OWNED
            • REFRESH MATERIALIZED VIEW
            • RELEASE SAVEPOINT
            • RESET
            • REVOKE
            • ROLLBACK
            • ROLLBACK TO SAVEPOINT
            • SAVEPOINT
            • SELECT
            • SET
            • SET CONSTRAINTS
            • SET ROLE
            • SET SESSION AUTHORIZATION
            • SET TRANSACTION
            • SHOW
            • SHOW TRANSACTION
            • TRUNCATE
            • UPDATE
            • VALUES
          • WITH clause
            • WITH clause—SQL syntax and semantics
            • recursive CTE
            • case study—traversing an employee hierarchy
            • traversing general graphs
              • graph representation
              • common code
              • undirected cyclic graph
              • directed cyclic graph
              • directed acyclic graph
              • rooted tree
              • Unique containing paths
              • Stress testing find_paths()
            • case study—Bacon Numbers from IMDb
              • Bacon numbers for synthetic data
              • Bacon numbers for IMDb data
        • Data types
          • Array
            • array[] constructor
            • Literals
              • Text typecasting and literals
              • Array of primitive values
              • Row
              • Array of rows
            • FOREACH loop (PL/pgSQL)
            • array of DOMAINs
            • Functions and operators
              • ANY and ALL
              • Array comparison
              • Array slice operator
              • Array concatenation
              • Array properties
              • array_agg(), unnest(), generate_subscripts()
              • array_fill()
              • array_position(), array_positions()
              • array_remove()
              • array_replace() / set value
              • array_to_string()
              • string_to_array()
          • Binary
          • Boolean
          • Character
          • Date and time
            • Conceptual background
            • Timezones and UTC offsets
              • Catalog views
              • Extended_timezone_names
                • Unrestricted full projection
                • Real timezones with DST
                • Real timezones no DST
                • Synthetic timezones no DST
              • Offset/timezone-sensitive operations
                • Timestamptz to/from timestamp conversion
                • Pure 'day' interval arithmetic
              • Four ways to specify offset
                • Name-resolution rules
                  • 1 case-insensitive resolution
                  • 2 ~names.abbrev never searched
                  • 3 'set timezone' string not resolved in ~abbrevs.abbrev
                  • 4 ~abbrevs.abbrev before ~names.name
                  • Helper functions
              • Syntax contexts for offset
              • Recommended practice
            • Typecasting between date-time and text-values
            • Semantics of the date-time data types
              • Date data type
              • Time data type
              • Plain timestamp and timestamptz
              • Interval data type
                • Interval representation
                  • Ad hoc examples
                  • Representation model
                • Interval value limits
                • Declaring intervals
                • Justify() and extract(epoch...)
                • Interval arithmetic
                  • Interval-interval comparison
                  • Interval-interval addition and subtraction
                  • Interval-number multiplication
                  • Moment-moment overloads of "-"
                  • Moment-interval overloads of "+" and "-"
                • Custom interval domains
                • Interval utility functions
            • Typecasting between date-time datatypes
            • Operators
              • Test comparison overloads
              • Test addition overloads
              • Test subtraction overloads
              • Test multiplication overloads
              • Test division overloads
            • General-purpose functions
              • Creating date-time values
              • Manipulating date-time values
              • Current date-time moment
              • Delaying execution
              • Miscellaneous
                • Function age()
                • Function extract() | date_part()
                • Implementations that model the overlaps operator
            • Formatting functions
            • Case study—SQL stopwatch
            • Download & install the date-time utilities
            • ToC
          • JSON
            • JSON literals
            • Primitive and compound data types
            • Code example conventions
            • Indexes and check constraints
            • Functions & operators
              • ::jsonb, ::json, ::text (typecast)
              • ->, ->>, #>, #>> (JSON subvalues)
              • - and #- (remove)
              • || (concatenation)
              • = (equality)
              • @> and <@ (containment)
              • ? and ?| and ?& (key or value existence)
              • array_to_json()
              • jsonb_agg()
              • jsonb_array_elements()
              • jsonb_array_elements_text()
              • jsonb_array_length()
              • jsonb_build_object()
              • jsonb_build_array()
              • jsonb_each()
              • jsonb_each_text()
              • jsonb_extract_path()
              • jsonb_extract_path_text() and json_extract_path_text()
              • jsonb_object()
              • jsonb_object_agg()
              • jsonb_object_keys()
              • jsonb_populate_record()
              • jsonb_populate_recordset()
              • jsonb_pretty()
              • jsonb_set() and jsonb_insert()
              • jsonb_strip_nulls()
              • jsonb_to_record()
              • jsonb_to_recordset()
              • jsonb_typeof()
              • row_to_json()
              • to_jsonb()
          • Money
          • Numeric
          • Range
          • Serial
          • UUID
        • Functions and operators
          • Aggregate functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • grouping sets, rollup, cube
            • Per function signature and purpose
              • avg(), count(), max(), min(), sum()
              • array_agg(), string_agg(), jsonb_agg(), jsonb_object_agg()
              • bit_and(), bit_or(), bool_and(), bool_or()
              • variance(), var_pop(), var_samp(), stddev(), stddev_pop(), stddev_samp()
              • linear regression
                • covar_pop(), covar_samp(), corr()
                • regr_%()
              • mode(), percentile_disc(), percentile_cont()
              • rank(), dense_rank(), percent_rank(), cume_dist()
            • case study—percentile_cont() and the "68–95–99.7" rule
            • case study—linear regression on COVID data
              • Download the COVIDcast data
              • Ingest the COVIDcast data
                • Inspect the COVIDcast data
                • Copy the .csv files to staging tables
                • Check staged data conforms to the rules
                • Join the staged data into a single table
                • SQL scripts
                  • Create cr_staging_tables()
                  • Create cr_copy_from_scripts()
                  • Create assert_assumptions_ok()
                  • Create xform_to_covidcast_fb_survey_results()
                  • ingest-the-data.sql
              • Analyze the COVIDcast data
                • symptoms vs mask-wearing by day
                • Data for scatter-plot for 21-Oct-2020
                • Scatter-plot for 21-Oct-2020
                • SQL scripts
                  • analysis-queries.sql
                  • synthetic-data.sql
          • currval()
          • lastval()
          • nextval()
          • setval()
          • Window functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • Per function signature and purpose
              • row_number(), rank() and dense_rank()
              • percent_rank(), cume_dist() and ntile()
              • first_value(), nth_value(), last_value()
              • lag(), lead()
              • Tables for the code examples
                • table t1
                • table t2
                • table t3
                • table t4
            • case study—analyzing a normal distribution
              • Bucket allocation scheme
              • do_clean_start.sql
              • cr_show_t4.sql
              • cr_dp_views.sql
              • cr_int_views.sql
              • cr_pr_cd_equality_report.sql
              • cr_bucket_using_width_bucket.sql
              • cr_bucket_dedicated_code.sql
              • do_assert_bucket_ok
              • cr_histogram.sql
              • cr_do_ntile.sql
              • cr_do_percent_rank.sql
              • cr_do_cume_dist.sql
              • do_populate_results.sql
              • do_report_results.sql
              • do_compare_dp_results.sql
              • do_demo.sql
              • Reports
                • Histogram report
                • dp-results
                • compare-dp-results
                • int-results
          • yb_hash_code()
        • Keywords
        • Reserved names
      • YCQL
        • ALTER KEYSPACE
        • ALTER ROLE
        • ALTER TABLE
        • CREATE INDEX
        • CREATE KEYSPACE
        • CREATE ROLE
        • CREATE TABLE
        • CREATE TYPE
        • DROP INDEX
        • DROP KEYSPACE
        • DROP ROLE
        • DROP TABLE
        • DROP TYPE
        • GRANT PERMISSION
        • GRANT ROLE
        • REVOKE PERMISSION
        • REVOKE ROLE
        • USE
        • INSERT
        • SELECT
        • EXPLAIN
        • UPDATE
        • DELETE
        • TRANSACTION
        • TRUNCATE
        • Simple expressions
        • Subscripted expressions
        • Function call
        • Operators
        • BLOB
        • BOOLEAN
        • Collection
        • FROZEN
        • INET
        • Integer and counter
        • Non-integer
        • TEXT
        • DATE, TIME, and TIMESTAMP
        • UUID and TIMEUUID
        • JSONB
        • Date and time
        • BATCH
    • CLIs
      • yb-ctl
      • yb-docker-ctl
      • ysqlsh
      • ycqlsh
      • yb-admin
      • yb-ts-cli
      • ysql_dump
      • ysql_dumpall
    • Configuration
      • yb-tserver
      • yb-master
      • yugabyted
      • Default ports
    • Drivers and ORMs
      • JDBC Drivers
      • C# Drivers
      • Go Drivers
      • Python Drivers
      • Client drivers for YSQL
      • Client drivers for YCQL
    • Connectors
      • Kafka Connect YugabyteDB
    • Third party tools
      • pgAdmin
      • Apache Superset
      • Arctype
      • DBeaver
      • TablePlus
      • DbSchema
      • SQL Workbench/J
      • Cassandra Workbench
    • Sample datasets
      • Chinook
      • Northwind
      • PgExercises
      • SportsDB
      • Retail Analytics
  • RELEASES
    • Releases overview
      • v2.15 series (preview)
      • v2.13 series
      • v2.12 series (stable)
      • v2.11 series
      • v2.9 series
      • v2.8 series
      • v2.7 series
      • v2.6 series
      • v2.5 series
      • v2.4 series
      • v2.3 series
      • v2.2 series
      • v2.1 series
      • v2.0 series
      • v1.3 series
      • v1.2 series
    • Release versioning
  • FAQ
    • Comparisons
      • Amazon Aurora
      • Google Cloud Spanner
      • CockroachDB
      • TiDB
      • Vitess
      • MongoDB
      • FoundationDB
      • Amazon DynamoDB
      • Azure Cosmos DB
      • Apache Cassandra
      • PostgreSQL
      • Redis in-memory store
      • Apache HBase
    • General FAQ
    • Operations FAQ
    • API compatibility FAQ
    • YugabyteDB Anywhere FAQ
  • MISC
    • YEDIS
      • Quick start
      • Develop
        • Build an application
        • C#
        • C++
        • Go
        • Java
        • NodeJS
        • Python
      • API reference
        • APPEND
        • AUTH
        • CONFIG
        • CREATEDB
        • DELETEDB
        • LISTDB
        • SELECT
        • DEL
        • ECHO
        • EXISTS
        • EXPIRE
        • EXPIREAT
        • FLUSHALL
        • FLUSHDB
        • GET
        • GETRANGE
        • GETSET
        • HDEL
        • HEXISTS
        • HGET
        • HGETALL
        • HINCRBY
        • HKEYS
        • HLEN
        • HMGET
        • HMSET
        • HSET
        • HSTRLEN
        • HVALS
        • INCR
        • INCRBY
        • KEYS
        • MONITOR
        • PEXPIRE
        • PEXPIREAT
        • PTTL
        • ROLE
        • SADD
        • SCARD
        • RENAME
        • SET
        • SETEX
        • PSETEX
        • SETRANGE
        • SISMEMBER
        • SMEMBERS
        • SREM
        • STRLEN
        • ZRANGE
        • TSADD
        • TSCARD
        • TSGET
        • TSLASTN
        • TSRANGEBYTIME
        • TSREM
        • TSREVRANGEBYTIME
        • TTL
        • ZADD
        • ZCARD
        • ZRANGEBYSCORE
        • ZREM
        • ZREVRANGE
        • ZSCORE
        • PUBSUB
        • PUBLISH
        • SUBSCRIBE
        • UNSUBSCRIBE
        • PSUBSCRIBE
        • PUNSUBSCRIBE
    • Legal
      • Third party software
> APIs > YSQL > Functions and operators > Aggregate functions > case study—linear regression on COVID data > Ingest the COVIDcast data >

Copy each of the COVIDcast data .csv files to a dedicated staging table

Report a doc issue Suggest new content Contributor guide
  • First, create a dedicated ordinary user for the project
  • Create and populate the "covidcast_names" table
  • Create the three staging tables
  • Use the \copy metacommand to copy the data from each .csv file into its staging table

First, create a dedicated ordinary user for the project

It isn't essential to do this. It's very unlikely that the names of the objects that are created for your COVIDcast case study project will collide with those of any objects that you might already have (and might care about). But the scripts that create them follow usual practice and start by dropping them. So there is a risk—at least in principle. Anyway, it makes sense to create a dedicated ordinary user for this project for all the usual reasons. For example, when all the project objects are owned by a user that owns no other objects (and are all in the same schema), it's easy to list all of the objects (and only these).

Create and populate the "covidcast_names" table

Each of the .csv files that you downloaded has its unique name; each will be imported, using the \copy metacommand "as is" into a dedicated staging table whose name reflects the source .csv file; and each has its own value in the "signal" column. Because these names will be needed in various places within the overall set of scripts, they are defined in the "covidcast_names" table, thus:

drop table if exists covidcast_names cascade;
create table covidcast_names(csv_file text primary key, staging_table text not null, signal text not null);

-- Each of these files contains 50 days of observations.
insert into covidcast_names(csv_file, staging_table, signal) values
  ('csv-files/covidcast-fb-survey-smoothed_'|| 'wearing_mask' ||'-2020-09-13-to-2020-11-01.csv', 'mask_wearers',   'smoothed_wearing_mask'),
  ('csv-files/covidcast-fb-survey-smoothed_'|| 'cli'          ||'-2020-09-13-to-2020-11-01.csv', 'symptoms',       'smoothed_cli'),
  ('csv-files/covidcast-fb-survey-smoothed_'|| 'hh_cmnty_cli' ||'-2020-09-13-to-2020-11-01.csv', 'cmnty_symptoms', 'smoothed_hh_cmnty_cli');

create unique index covidcast_names_staging_table_unq on covidcast_names(staging_table);
create unique index covidcast_names_signal_unq on covidcast_names(signal);

This code is included directly in ingest-the-data.sql so you don't need to save it to a file.

Create the three staging tables

Because the three .csv files have the same format as each other (see the section Inspect the .csv files), each of the three staging tables has the same column names and data types. Dynamic SQL is used to create these tables to avoid repetition of code (i.e. to keep the code size as small as possible), and to bring optimal "single point of definition" maintainability.

This code is implemented by the procedure cr_staging_tables() created by the cr-cr-staging-tables.sql script.

After creating the procedure, run it like this:

call cr_staging_tables();

This invocation is included in ingest-the-data.sql.

Now take an inventory of your tables with the \d metacommand. You should see something like this, with the names of the user and schema that you created in place of "u1":

 Schema |      Name       | Type  | Owner
--------+-----------------+-------+-------
 u1     | cmnty_symptoms  | table | u1
 u1     | covidcast_names | table | u1
 u1     | mask_wearers    | table | u1
 u1     | symptoms        | table | u1

Use the \copy metacommand to copy the data from each .csv file into its staging table

The COPY SQL statement is designed to ingest data from a file "as is". However, its simple use requires that the to-be-read (or to-be-written) file resides server-side on the local filesystem of the YB-TServer that you connect to. If you specify stdin as the argument of COPY FROM, then these input and output channels are defined client-side in the environment of the client where you run ysqlsh. This sounds promising. But the snag is that you must include the text of the COPY FROM statement at the start of the file that contains the data that you intend to ingest. This is described in the stdin and stdout section of the documentation for the COPY statement.

The preferred option for the present case study, because ysqlsh is chosen for running all the SQL statements, is to use the \copy metacommand.

Because the three .csv files all have the same format, as do their three dedicated staging tables, the copy command will have the same form for each of its invocations, thus:

\copy <staging table> from <csv file> with (format 'csv', header true)

This is another case where a procedure that reads the names from the "covidcast_names" table avoids repetition of code and brings optimal "single point of definition" maintainability. However, there's particular design dilemma to confront. The \copy metacommand cannot be run from a stored procedure written in PL/pgSQL. Therefore a function is used that will return the text of the \copy metacommand. You execute the appropriately parameterized function, spool its output to a .sql script and start that script.

  • Create the function with the cr-cr-copy-from-csv-scripts.sql script.

  • Test it manually like this:

    \t on
    select cr_copy_from_scripts(1);
    \t off
    
  • This is the result:

    \copy symptoms from 'csv-files/covidcast-fb-survey-smoothed_cli-2020-09-13-to-2020-11-01.csv' with (format 'csv', header true);
    
  • Execute it for each to-be-ingested .csv file.

    \t on
    
    \o copy_from_csv.sql
    select cr_copy_from_scripts(1);
    \o
    \i copy_from_csv.sql
    
    \o copy_from_csv.sql
    select cr_copy_from_scripts(2);
    \o
    \i copy_from_csv.sql
    
    \o copy_from_csv.sql
    select cr_copy_from_scripts(3);
    \o
    \i copy_from_csv.sql
    
    \t off
    

    This code is included directly in ingest-the-data.sql so you don't need to save it to a file.

This pattern—using a PL/pgSQL function to create a sql script whose details reflect the present content of database tables and then executing it from ysqlsh is a useful generic technique.

Finally, check visually that each staging table has the same number of rows:

select
  (select count(*) from mask_wearers)  as "mask_wearers count",
  (select count(*) from symptoms)       as "symptoms count",
  (select count(*) from cmnty_symptoms) as "cmnty_symptoms count";

This is the result:

 mask_wearers count | symptoms count | cmnty_symptoms count
--------------------+----------------+----------------------
               2550 |           2550 |                 2550

The value 2,550 is the product of the 51 states (including DC) and the 50 days on which survey data were collected.

  • First, create a dedicated ordinary user for the project
  • Create and populate the "covidcast_names" table
  • Create the three staging tables
  • Use the \copy metacommand to copy the data from each .csv file into its staging table
Ask our community
  • Slack
  • Github
  • Forum
  • StackOverflow
Yugabyte
Contact Us
Copyright © 2017-2022 Yugabyte, Inc. All rights reserved.