pg_parquet extension TECH PREVIEW
Extension support
Support for the pg_parquet extension is in Tech Preview only. It may be replaced by another extension in the future, so it is not recommended for use in production.

The pg_parquet extension allows you to read from and write to Parquet files by extending the PostgreSQL COPY command. Files can be located in S3, Azure Blob Storage, Google Cloud Storage, HTTP(S) endpoints, or on a file system.
Enable pg_parquet
To enable the pg_parquet extension:
- Add pg_parquet to shared_preload_libraries in the PostgreSQL server configuration parameters using the YB-TServer --ysql_pg_conf_csv flag:

  --ysql_pg_conf_csv="shared_preload_libraries='pg_parquet'"

  Note that modifying shared_preload_libraries requires restarting the YB-TServer.

- Enable the extension:

  CREATE EXTENSION pg_parquet;
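To confirm that the extension is enabled in the current database, you can query the standard pg_extension catalog (a minimal check; the version reported depends on your installation):

-- List the installed pg_parquet extension and its version
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_parquet';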
Use pg_parquet
You can use pg_parquet to do the following:
- Export tables or queries to Parquet files.
- Ingest data from Parquet files to tables.
- Inspect the schema and metadata of Parquet files.
Copy to and from Parquet files
To read and write Parquet files, you use the COPY command. The following example shows how to write a YSQL table with complex types to a Parquet file, and then read the Parquet file content back into the same table. You can also export the result of a query, as shown after the steps.
- Create composite types that will be used in the table:

  CREATE TYPE product_item AS (id INT, name TEXT, price float4);
  CREATE TYPE product AS (id INT, name TEXT, items product_item[]);

- Create a table that uses the composite types, including arrays and timestamp columns:

  CREATE TABLE product_example (
      id int,
      product product,
      products product[],
      created_at TIMESTAMP,
      updated_at TIMESTAMPTZ
  );

- Insert some rows into the table:

  INSERT INTO product_example VALUES (
      1,
      ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
      ARRAY[ROW(1, NULL, NULL)::product, NULL],
      now(),
      '2022-05-01 12:00:00-04'
  );

- Export the table data to a Parquet file using COPY TO with gzip compression:

  COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

  -- show table
  SELECT * FROM product_example;

- Import the Parquet file data back into the table using COPY FROM to verify the round-trip export and import:

  COPY product_example FROM '/tmp/product_example.parquet';

  -- show table
  SELECT * FROM product_example;
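In addition to whole tables, you can export the result of a query. The following sketch assumes the product_example table created in the preceding steps; the output path is illustrative:

-- Export a filtered projection of the table rather than the whole table
COPY (SELECT id, created_at FROM product_example WHERE id = 1)
TO '/tmp/product_example_subset.parquet' (format 'parquet');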
Inspect Parquet schema
The following SELECT query outputs the schema of the Parquet file at a given URI:
SELECT * FROM parquet.schema('/tmp/product_example.parquet') LIMIT 10;
uri | name | type_name | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type
------------------------------+--------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------
/tmp/product_example.parquet | arrow_schema | | | | 5 | | | | |
/tmp/product_example.parquet | id | INT32 | | OPTIONAL | | | | | 0 |
/tmp/product_example.parquet | product | | | OPTIONAL | 3 | | | | 1 |
/tmp/product_example.parquet | id | INT32 | | OPTIONAL | | | | | 2 |
/tmp/product_example.parquet | name | BYTE_ARRAY | | OPTIONAL | | UTF8 | | | 3 | STRING
/tmp/product_example.parquet | items | | | OPTIONAL | 1 | LIST | | | 4 |
/tmp/product_example.parquet | list | | | REPEATED | 1 | | | | |
/tmp/product_example.parquet | element | | | OPTIONAL | 3 | | | | 5 |
/tmp/product_example.parquet | id | INT32 | | OPTIONAL | | | | | 6 |
/tmp/product_example.parquet | name | BYTE_ARRAY | | OPTIONAL | | UTF8 | | | 7 | STRING
(10 rows)
Inspect Parquet metadata
Use the following SELECT query to discover the detailed metadata of the Parquet file, such as column statistics, at a given URI:
SELECT uri, row_group_id, row_group_num_rows, row_group_num_columns, row_group_bytes, column_id, file_offset, num_values, path_in_schema, type_name FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;
uri | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type_name
------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-----------
/tmp/product_example.parquet | 0 | 1 | 13 | 842 | 0 | 0 | 1 | id | INT32
(1 row)
SELECT stats_null_count, stats_distinct_count, stats_min, stats_max, compression, encodings, index_page_offset, dictionary_page_offset, data_page_offset, total_compressed_size, total_uncompressed_size FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;
stats_null_count | stats_distinct_count | stats_min | stats_max | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size
------------------+----------------------+-----------+-----------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------
0 | | 1 | 1 | GZIP(GzipLevel(6)) | PLAIN,RLE,RLE_DICTIONARY | | 4 | 42 | 101 | 61
(1 row)
Use the following SELECT query to discover file level metadata of the Parquet file, such as format version, at a given URI:
SELECT * FROM parquet.file_metadata('/tmp/product_example.parquet');
uri | created_by | num_rows | num_row_groups | format_version
------------------------------+------------+----------+----------------+----------------
/tmp/product_example.parquet | pg_parquet | 1 | 1 | 1
(1 row)
Use the following SELECT query to get custom key-value metadata of the Parquet file at a given URI:
SELECT uri, encode(key, 'escape') as key, encode(value, 'escape') as value FROM parquet.kv_metadata('/tmp/product_example.parquet');
uri | key | value
------------------------------+--------------+---------------------
/tmp/product_example.parquet | ARROW:schema | /////5gIAAAQAAAA ...
(1 row)
Inspect Parquet column statistics
Use the following SELECT query to discover the column statistics of the Parquet file, such as min and max value for the column, at a given URI:
SELECT * FROM parquet.column_stats('/tmp/product_example.parquet');
column_id | field_id | stats_min | stats_max | stats_null_count | stats_distinct_count
-----------+----------+----------------------------+----------------------------+------------------+----------------------
4 | 7 | item 1 | item 2 | 1 |
6 | 11 | 1 | 1 | 1 |
7 | 12 | | | 2 |
10 | 17 | | | 2 |
0 | 0 | 1 | 1 | 0 |
11 | 18 | 2025-03-11 14:01:22.045739 | 2025-03-11 14:01:22.045739 | 0 |
3 | 6 | 1 | 2 | 1 |
12 | 19 | 2022-05-01 19:00:00+03 | 2022-05-01 19:00:00+03 | 0 |
8 | 15 | | | 2 |
5 | 8 | 1 | 2 | 1 |
9 | 16 | | | 2 |
1 | 2 | 1 | 1 | 0 |
2 | 3 | product 1 | product 1 | 0 |
(13 rows)
Object Store support
pg_parquet supports reading and writing Parquet files in S3, Azure Blob Storage, and Google Cloud Storage object stores, as well as over HTTP(S).
To write to an object store location, you need to grant the parquet_object_store_write role to your current user. Similarly, to read from an object store location, you need to grant the parquet_object_store_read role to your current user.
For information on how to grant roles, refer to GRANT.
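For example, assuming a role named app_user (a hypothetical name), you could grant both object store roles as follows:

-- Hypothetical role name; replace app_user with your own user
GRANT parquet_object_store_read TO app_user;
GRANT parquet_object_store_write TO app_user;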
S3 Storage
The basic way to configure object storage is by creating the standard ~/.aws/credentials and ~/.aws/config files:
$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
$ cat ~/.aws/config
[default]
region = eu-central-1
Alternatively, you can use environment variables when starting postgres to configure the S3 client as described in the following table:
| Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | Access key ID of the AWS account. |
| AWS_SECRET_ACCESS_KEY | Secret access key of the AWS account. |
| AWS_SESSION_TOKEN | Session token for the AWS account. |
| AWS_REGION | Default region of the AWS account. |
| AWS_ENDPOINT_URL | The endpoint. |
| AWS_SHARED_CREDENTIALS_FILE | An alternative location for the credentials file (only via environment variables). |
| AWS_CONFIG_FILE | An alternative location for the configuration file (only via environment variables). |
| AWS_PROFILE | Name of the profile from the credentials and configuration file (the default profile name is default) (only via environment variables). |
| AWS_ALLOW_HTTP | Allows HTTP endpoints (only via environment variables). |
Both configuration sources and authorization methods for the S3 client are applied in a defined priority order.
The supported S3 URI formats are as follows:
- s3://<bucket>/<path>
- https://<bucket>.s3.amazonaws.com/<path>
- https://s3.amazonaws.com/<bucket>/<path>
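For example, with S3 credentials configured as described above and the parquet_object_store_write role granted, a write to an S3 URI might look like the following (the bucket name is illustrative):

-- Export the example table to a Parquet file in S3
COPY product_example TO 's3://my-example-bucket/product_example.parquet' (format 'parquet');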
Azure Blob Storage
The basic way to configure object storage is by creating the standard ~/.azure/config file:
$ cat ~/.azure/config
[storage]
account = devstoreaccount1
key = Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
Alternatively, you can use environment variables when starting postgres to configure the Azure Blob Storage client as described in the following table:
| Variable | Description |
|---|---|
| AZURE_STORAGE_ACCOUNT | Storage account name of the Azure Blob. |
| AZURE_STORAGE_KEY | Storage key of the Azure Blob. |
| AZURE_STORAGE_CONNECTION_STRING | Connection string for the Azure Blob (overrides any other configuration). |
| AZURE_STORAGE_SAS_TOKEN | Storage SAS token for the Azure Blob. |
| AZURE_TENANT_ID | Tenant ID for client secret authentication (only via environment variables). |
| AZURE_CLIENT_ID | Client ID for client secret authentication (only via environment variables). |
| AZURE_CLIENT_SECRET | Client secret for client secret authentication (only via environment variables). |
| AZURE_STORAGE_ENDPOINT | The endpoint (only via environment variables). |
| AZURE_CONFIG_FILE | An alternative location for the configuration file (only via environment variables). |
| AZURE_ALLOW_HTTP | Allows HTTP endpoints (only via environment variables). |
Both configuration sources and authorization methods for the Azure Blob Storage client are applied in a defined priority order.
The supported Azure Blob Storage URI formats are as follows:
- az://<container>/<path>
- azure://<container>/<path>
- https://<account>.blob.core.windows.net/<container>
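For example, with Azure credentials configured as described above, a write to a container might look like the following (the container name is illustrative):

-- Export the example table to a Parquet file in Azure Blob Storage
COPY product_example TO 'az://my-example-container/product_example.parquet' (format 'parquet');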
HTTP(s) Storage
HTTPS URIs are supported by default. To allow HTTP URIs, set the ALLOW_HTTP environment variables described in the preceding sections.
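For example, a read from an HTTPS endpoint might look like the following (the URL is illustrative):

-- Import data from a Parquet file served over HTTPS
COPY product_example FROM 'https://example.com/exports/product_example.parquet' (format 'parquet');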
Google Cloud Storage
The basic way to configure object storage is by creating a JSON configuration file like ~/.config/gcloud/application_default_credentials.json (can be generated by gcloud auth application-default login):
$ cat ~/.config/gcloud/application_default_credentials.json
{
"gcs_base_url": "http://localhost:4443",
"disable_oauth": true,
"client_email": "",
"private_key_id": "",
"private_key": ""
}
Alternatively, you can use environment variables when starting postgres to configure the Google Cloud Storage client, as described in the following table:
| Variable | Description |
|---|---|
| GOOGLE_SERVICE_ACCOUNT_KEY | JSON serialized service account key (only via environment variables). |
| GOOGLE_SERVICE_ACCOUNT_PATH | An alternative location for the configuration file (only via environment variables). |
The supported Google Cloud Storage URI format is gs://<bucket>/<path>.
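For example, with Google Cloud Storage credentials configured as described above, a write might look like the following (the bucket name is illustrative):

-- Export the example table to a Parquet file in Google Cloud Storage
COPY product_example TO 'gs://my-example-bucket/product_example.parquet' (format 'parquet');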
Copy options
The COPY TO command options that pg_parquet supports are described in the following table:
| Option | Description |
|---|---|
| format parquet | Specify this option to read or write Parquet files that do not end with the .parquet[.<compression>] extension. |
| file_size_bytes <string> | Total file size per Parquet file. When set, Parquet files with the target size are created under a parent directory (named the same as the file name). By default, when not specified, a single file is generated without a parent folder. You can specify total bytes without a unit (for example, file_size_bytes 2000000), or with a unit (KB, MB, or GB, for example, file_size_bytes '1MB'). |
| field_ids <string> | Field IDs that are assigned to the fields in the Parquet file schema. By default, no field IDs are assigned. Pass auto to let pg_parquet generate field IDs. You can pass a JSON string to explicitly provide the field IDs. |
| row_group_size <int64> | Number of rows in each row group while writing Parquet files. Default is 122880. |
| row_group_size_bytes <int64> | Total byte size of rows in each row group while writing Parquet files. Default is row_group_size * 1024. |
| compression <string> | The compression format to use while writing Parquet files. Supported formats are uncompressed, snappy (default), gzip, brotli, lz4, lz4raw, and zstd. If not specified, the format is determined by the file extension. |
| compression_level <int> | The compression level to use while writing Parquet files. This is only supported for gzip, zstd, and brotli. The default is 6 for gzip (0-10), 1 for zstd (1-22), and 1 for brotli (0-11). |
| parquet_version <string> | The writer version of the Parquet file. By default, it is set to v1 for better interoperability. You can set it to v2 to unlock new encodings. |
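For example, the following sketch combines several of these options; the values are illustrative rather than recommendations:

-- Write with zstd compression, smaller row groups, and auto-generated field IDs
COPY product_example TO '/tmp/product_example_zstd.parquet'
    (format 'parquet', compression 'zstd', compression_level 3, row_group_size 10000, field_ids 'auto');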
The COPY FROM command options that pg_parquet supports are described in the following table:
| Option | Description |
|---|---|
| format parquet | Specify this option to read Parquet files that do not end with the .parquet[.<compression>] extension. |
| match_by <string> | Method to match Parquet file fields to PostgreSQL table columns. Available methods are position (default) and name. Set it to name to match columns by their name rather than by their position in the schema. Matching by name is useful when the field order differs between the Parquet file and the table, but the names match. |
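For example, to load a Parquet file whose fields are ordered differently from the table columns but share the same names:

-- Match Parquet fields to table columns by name instead of position
COPY product_example FROM '/tmp/product_example.parquet' (format 'parquet', match_by 'name');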