pgpro_duckdb is a Postgres Pro extension that embeds the DuckDB columnar-vectorized analytics engine and its features into Postgres Pro. pgpro_duckdb is the key component of the Built-in Analytical Platform, a Postgres Pro solution designed to handle OLAP workloads within a Postgres Pro instance.
The pgpro_duckdb extension is currently in an experimental phase.
The pgpro_duckdb extension is provided with Postgres Pro Enterprise
as a separate pre-built package pgpro-duckdb-ent-17
(for the detailed installation instructions, see Chapter 17).
To enable the pgpro_duckdb extension, create it
using the following query:
CREATE EXTENSION pgpro_duckdb;
By default, known extensions are allowed to be automatically installed and loaded when a DuckDB query depends on them. This behavior can be configured using the duckdb.autoinstall_known_extensions and duckdb.autoload_known_extensions settings.
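For example, a superuser can turn this behavior off for the current session (a minimal sketch; both settings are described in the settings reference below):
SET duckdb.autoinstall_known_extensions = false;
SET duckdb.autoload_known_extensions = false;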
It is also possible to manually install an extension. This can be useful when this autoinstall/autoload behavior is disabled, or when DuckDB fails to realize an extension is necessary to execute the query. Installing an extension requires superuser rights.
SELECT duckdb.install_extension('extname');
Installing an extension causes it to be loaded and installed
globally for any connection that uses DuckDB. The current list
of installed extensions is maintained in the
duckdb.extensions table. Superusers can use
this table to view, disable, or uninstall extensions, as
follows:
-- install an extension --
SELECT duckdb.install_extension('iceberg');
-- view currently installed extensions --
SELECT * FROM duckdb.extensions;
-- change an extension to stop it from being automatically loaded in new connections --
SELECT duckdb.autoload_extension('iceberg', false);
-- you can still load such extensions manually in a session --
SELECT duckdb.load_extension('iceberg');
-- you can also install community extensions --
SELECT duckdb.install_extension('prql', 'community');
The following extensions are supported by pgpro_duckdb.
azure
azure support allows reading files from
Azure Blob Storage by using
az://... filepaths.
iceberg
iceberg support adds functions to read Iceberg tables and metadata. For a list of iceberg functions, see Section G.3.9.
delta
delta support adds the ability to read Delta Lake files via delta_scan.
By default, only superusers can execute duckdb.install_extension
and duckdb.autoload_extension. This prevents other users from
installing extensions that can have security implications or interfere with
database operation.
This means that regular users can only use extensions that DuckDB
marks as auto-installable. If you want to restrict the use of these
extensions to a specific list of allowed extensions, you can do so by
setting duckdb.autoinstall_known_extensions
to false. This prevents users from automatically installing
any known extensions, so any extension you
want to allow must already be installed by a superuser.
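For example, a superuser could allow just one known extension (a sketch; httpfs is an arbitrary example):
-- preinstall the single extension users are allowed to use
SELECT duckdb.install_extension('httpfs');
-- prevent automatic installation of any other known extensions;
-- set this in postgresql.conf or via ALTER SYSTEM to make it persistent
ALTER SYSTEM SET duckdb.autoinstall_known_extensions = false;
SELECT pg_reload_conf();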
DuckDB secrets can be configured either using utility functions or with a foreign data wrapper for more advanced cases.
For example, with utility functions:
SELECT duckdb.create_simple_secret(
    type := 'S3',                -- one of: S3, GCS, R2
    key_id := 'access_key_id',
    secret := 'xxx',
    session_token := 'yyy',      -- optional
    region := 'us-east-1',       -- optional
    url_style := 'xxx',          -- optional
    provider := 'xxx',           -- optional
    endpoint := 'xxx'            -- optional
);
For Azure secrets, you can use:
SELECT duckdb.create_azure_secret('<connection_string>');
credential_chain Provider #
For more advanced use-cases, you can define secrets with SERVER
and USER MAPPING on the DuckDB
foreign data wrapper:
CREATE SERVER my_s3_secret TYPE 's3' FOREIGN DATA WRAPPER duckdb OPTIONS (PROVIDER 'credential_chain');
secret_access_key #
When your secret contains sensitive information, you need to create an
additional USER MAPPING like this:
CREATE SERVER my_s3_secret TYPE 's3' FOREIGN DATA WRAPPER duckdb;
CREATE USER MAPPING FOR CURRENT_USER SERVER my_s3_secret
    OPTIONS (KEY_ID 'my_secret_key', SECRET 'my_secret_value');
You can use any of the supported DuckDB secret types as long as the related extension is installed. For more information, refer to the official DuckDB documentation.
Secrets are stored in a combination of SERVER and
USER MAPPING on the DuckDB
foreign data wrapper. USER MAPPING
hosts sensitive elements like token, session_token,
and secret. Each time a DuckDB
instance is created by pgpro_duckdb, and when a
secret is modified, the secrets are loaded into the DuckDB
secrets manager as non-persistent secrets.
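Because secrets are stored as regular SERVER and USER MAPPING objects, they can also be removed with ordinary DDL. For example, for the my_s3_secret server created above:
DROP USER MAPPING FOR CURRENT_USER SERVER my_s3_secret;
DROP SERVER my_s3_secret;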
Multi-statement transactions are supported in pgpro_duckdb. There is one important restriction, though, which is currently necessary to ensure the expected ACID guarantees: you cannot write to both a Postgres Pro table and a DuckDB table in the same transaction.
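For example (a sketch, assuming pg_table is a regular Postgres Pro table and duckdb_table is a DuckDB table):
-- allowed: writes to only DuckDB tables in this transaction
BEGIN;
INSERT INTO duckdb_table VALUES (1);
INSERT INTO duckdb_table VALUES (2);
COMMIT;
-- not allowed: writes to tables of both engines in one transaction
BEGIN;
INSERT INTO pg_table VALUES (1);
INSERT INTO duckdb_table VALUES (1); -- raises an error
ROLLBACK;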
Similarly, you are allowed to do DDL (like
CREATE/DROP TABLE) on
DuckDB tables inside a transaction, but it is
not allowed to combine such statements with DDL involving
Postgres Pro objects.
Finally, it is possible to disable this restriction completely and
allow writes to both DuckDB and
Postgres Pro in the same transaction by
setting duckdb.unsafe_allow_mixed_transactions to
true but it isn't recommended to do so as it can result in the
transaction being committed only in DuckDB,
but not in Postgres Pro. This can lead to
inconsistencies and even data loss. For example, the following code
might result in deleting the duckdb_table, while
not copying its contents to pg_table:
BEGIN;
SET LOCAL duckdb.unsafe_allow_mixed_transactions TO true;
CREATE TABLE pg_table AS SELECT * FROM duckdb_table;
DROP TABLE duckdb_table;
COMMIT;
pgpro_duckdb can read many data types that exist in both Postgres Pro and DuckDB. The following data types are currently supported for use in queries:
Integer types (integer, bigint, etc.)
Floating point types (real, double precision)
numeric (might get converted to
double precision internally, see
Section G.3.6.1
limitations below for details)
text/varchar/bpchar
bit string types, both fixed-size (bit) and
variable-size (bit varying)
bytea/blob
timestamp/timestamptz/date/interval/timestamp_ns/timestamp_ms/timestamp_s
boolean
uuid
json/jsonb
domain
arrays for all of the above types, but see limitations
below about multi-dimensional arrays
The type support in pgpro_duckdb is not yet complete. The following are known issues that you might run into:
enum types are not supported.
The DuckDB decimal type
does not support the wide range of values that the
Postgres Pro numeric type
does. To avoid errors when converting between the two,
numeric is converted to double precision
internally if DuckDB does not support
the required precision. This might cause precision loss of the
values.
The DuckDB STRUCT type is not supported.
The DuckDB timestamp_ns type
gets truncated to microseconds when it is converted to the
Postgres Pro timestamp type,
which loses precision in the output. Operations on a
timestamp_ns value, such as
sorting/grouping/comparing, will use the full precision.
jsonb columns are converted to json columns
when reading from DuckDB.
This is because DuckDB does not have a
jsonb type.
Many Postgres Pro json and
jsonb functions and operators are not
implemented in DuckDB. Instead you can
use DuckDB json
functions and operators. See the
DuckDB
documentation for more information on these functions.
The DuckDB tinyint type
is converted to a char type in
Postgres Pro. This is because
Postgres Pro does not have a
tinyint type. This causes it to be displayed as a hex
code instead of a regular number.
Conversion between Postgres Pro
multi-dimensional arrays and DuckDB
nested LISTs can
run into various problems, because the two type systems do not
match exactly. Specifically,
Postgres Pro allows different
arrays in a column to have a different number of dimensions,
e.g. [1] and [[1], [2]] can
both occur in the same column. In DuckDB,
that is not allowed, i.e. the level of nesting must always be
the same. On the other hand, DuckDB
allows different lists at the same nest level
to contain a different number of elements,
e.g. [[1], [1, 2]]. This is not allowed
in Postgres Pro. So conversion between
these types is only possible when the values fall into the
subset that both systems support.
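For example, Postgres Pro itself accepts arrays of differing dimensionality in one column, which DuckDB cannot represent (a sketch; table s is also used in the example below):
CREATE TABLE s (a text[]);
INSERT INTO s VALUES (ARRAY['x']);        -- 1-dimensional
INSERT INTO s VALUES (ARRAY[ARRAY['y']]); -- 2-dimensional, also accepted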
Another possible problem that you can run into is that
pgpro_duckdb uses the
Postgres Pro column metadata to determine
the number of dimensions that an array has. Since
Postgres Pro does not
complain when you add arrays of different dimensions, it is
possible that the number of dimensions in the column
metadata does not match the actual number of dimensions. To
solve this you need to alter the column type:
-- configure the column to be a 3-dimensional array of text
ALTER TABLE s ALTER COLUMN a SET DATA TYPE text[][][];
For domain types, the domain
constraint check during an INSERT is
performed by Postgres Pro rather
than DuckDB. When a
SELECT queries a field whose
type is a domain, the value is
converted to the corresponding base type and then handled by
DuckDB.
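A minimal sketch of this behavior, using a hypothetical domain:
CREATE DOMAIN positive_int AS integer CHECK (VALUE > 0);
CREATE TABLE t (x positive_int);
INSERT INTO t VALUES (-1); -- rejected by the Postgres Pro domain check
-- when DuckDB executes the query, x is handled as its base type, integer
SELECT x + 1 FROM t;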
pgpro_duckdb introduces a few special Postgres Pro types. You should not create these types explicitly and normally you do not need to know about their existence, but they might show up in error messages from Postgres Pro. These are explained below:
duckdb.row #
The duckdb.row type is returned by
functions like read_parquet,
read_csv, scan_iceberg,
etc. Depending on the arguments of these functions, they can
return rows with different columns and types.
Postgres Pro does not
support such functions well at this point in time, so for now
a custom type is returned. Then to be able to get the
actual columns out of these rows, you have to use the
“square bracket indexing” syntax, similarly to
how you would extract a field from a JSON object:
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
Using SELECT * will result in the columns
of this row being expanded, so your query result will never
have a column that has duckdb.row as its
type:
SELECT * FROM read_parquet('file.parquet');
Due to Postgres Pro limitations, there
are some issues when using a function that returns a
duckdb.row in a CTE or subquery. The main
problem is that pgpro_duckdb cannot
automatically assign useful aliases to the columns selected from
the row. So while this query without a CTE/subquery returns the
r['company'] column as company:
SELECT r['company'] FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r;
--    company
-- ─────────────
--  DuckDB Labs
The same query in a subquery or CTE will return the column
simply as r:
WITH mycte AS (
    SELECT r['company']
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--       r
-- ─────────────
--  DuckDB Labs
This is easy to work around by adding an explicit alias to the column in the CTE/subquery:
WITH mycte AS (
    SELECT r['company'] AS company
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs
Another similarly confusing limitation is that
when you use SELECT * inside the
CTE/subquery and want to reference a specific column
outside the CTE/subquery, you still need to use the
r['colname'] syntax instead of simply
colname. So while this works as expected:
WITH mycte AS (
    SELECT *
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs
The following query will throw an error:
WITH mycte AS (
    SELECT *
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE company = 'DuckDB Labs';
-- ERROR:  42703: column "company" does not exist
-- LINE 5: SELECT * FROM mycte WHERE company = 'DuckDB Labs';
--                                   ^
-- HINT:  If you use DuckDB functions like read_parquet, you need to use the r['colname'] syntax to use columns. If you are already doing that, maybe you forgot to give the function the r alias.
This is easy to work around by using the
r['colname'] syntax like so:
WITH mycte AS (
    SELECT *
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE r['company'] = 'DuckDB Labs';
--    company
-- ─────────────
--  DuckDB Labs
duckdb.unresolved_type #
The duckdb.unresolved_type type is a type
that is used to make Postgres Pro understand
an expression for which the type is not known at query parse time.
This is the type of any of the columns extracted from a
duckdb.row using the
r['mycol'] syntax. Many operators and
aggregates will return a
duckdb.unresolved_type when one of the
sides of the operator is of the type
duckdb.unresolved_type, for instance
r['age'] + 10.
Once the query gets executed by DuckDB,
the actual type will be filled in by DuckDB.
So, a query result will never contain a column of type
duckdb.unresolved_type, and generally
you will not even notice that this type exists.
You might get errors that say that functions or operators
do not exist for the duckdb.unresolved_type, such as:
ERROR:  function some_func(duckdb.unresolved_type) does not exist
LINE 6:         some_func(r['somecol']) as somecol
In such cases, a simple workaround is often to add an explicit
cast to the type that the function accepts, such as
some_func(r['somecol']::text) as somecol.
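For example (a sketch; upper is an ordinary function that accepts text):
SELECT upper(r['name']::text) AS name
FROM read_parquet('file.parquet') r;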
duckdb.json #
The duckdb.json type is used as the argument
type of DuckDB JSON functions. This type exists
so that these functions can accept values of type json,
jsonb, and
duckdb.unresolved_type.
Many of the settings below map directly to specific DuckDB settings.
duckdb.force_execution #
Force queries to use DuckDB execution.
This is only necessary when a query accesses only
Postgres Pro tables. As soon as you
use a DuckDB-only feature,
DuckDB execution is used
automatically. DuckDB-only features
are reading from DuckDB
tables, using DuckDB functions (like
read_parquet), or COPY to
remote storage (s3://, etc).
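For example (a sketch, assuming heap_table is a regular Postgres Pro table):
SET duckdb.force_execution = true;
SELECT count(*) FROM heap_table; -- now executed by DuckDB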
Default: false
duckdb.postgres_role #
Which Postgres Pro role is allowed
to use DuckDB execution and
the configured DuckDB secrets.
Defaults to superusers only. If this is configured, but the
role does not exist when running
CREATE EXTENSION pgpro_duckdb, it will be
created automatically.
Default: ""
Access: Needs to be in the postgresql.conf
file and requires a restart
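A sketch of the corresponding postgresql.conf entry, with a hypothetical role name:
duckdb.postgres_role = 'analytics_users'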
duckdb.disabled_filesystems #
Disable specific file systems preventing access. This setting applies
to all users including superusers. For non-superusers that are not
members of both the pg_read_server_files
and pg_write_server_files roles,
LocalFileSystem is always disabled. If you add
LocalFileSystem to this setting, superusers also cannot
access LocalFileSystem through DuckDB.
Default: ""
Access: Superuser-only
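For example, to block local file access through DuckDB for all users, superusers included (a sketch):
ALTER SYSTEM SET duckdb.disabled_filesystems = 'LocalFileSystem';
SELECT pg_reload_conf();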
duckdb.autoinstall_known_extensions #Whether known extensions are allowed to be automatically installed when a DuckDB query depends on them.
Default: true
Access: Superuser-only
duckdb.autoload_known_extensions #Whether known extensions are allowed to be automatically loaded when a DuckDB query depends on them.
Default: true
duckdb.allow_community_extensions #Whether to allow installing community extensions.
Default: false
Access: Superuser-only
duckdb.enable_external_access #Allow DuckDB to access external storage (e.g., HTTP, S3, etc.). This setting is not yet well tested, and disabling it may unintentionally break pgpro_duckdb functionality.
Default: true
Access: Superuser-only
Since any connection that uses DuckDB will have its own DuckDB instance, these settings are per-connection. When using pgpro_duckdb in many concurrent connections it can be a good idea to set some of these more conservatively than their defaults.
duckdb.max_memory / duckdb.memory_limit
#
The maximum memory DuckDB can use within
a single Postgres Pro
connection. This is somewhat comparable to the
Postgres Pro
work_mem
setting. When set to an empty string,
DuckDB's normal default is used, which is 80% of RAM.
Default: "4GB"
Access: Superuser-only
duckdb.threads / duckdb.worker_threads #
Maximum number of DuckDB threads per
Postgres Pro connection. -1
means to use DuckDB's default, which
is the number of CPU cores on the machine.
Default: -1
Access: Superuser-only
duckdb.max_workers_per_postgres_scan #Maximum number of Postgres Pro parallel workers used for a single Postgres Pro scan. This is similar to the Postgres Pro max_parallel_workers_per_gather setting.
Default: 2
Access: General
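As a sketch, a more conservative per-connection configuration for instances with many concurrent DuckDB-using sessions might look like this (values are illustrative):
ALTER SYSTEM SET duckdb.max_memory = '1GB';
ALTER SYSTEM SET duckdb.threads = 4;
SELECT pg_reload_conf();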
By default, the functions listed below that are shown without a schema are installed
into public. You can choose to install these
functions to an alternate location by running
CREATE EXTENSION pgpro_duckdb WITH SCHEMA schema.
ALTER EXTENSION pgpro_duckdb WITH SCHEMA schema is
not currently supported.
read_parquet(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row
#Reads a Parquet file, either from a remote storage (via httpfs) or a local storage.
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM read_parquet('file.parquet');
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_parquet('file.parquet');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled) with the Parquet files to read. The path can be a glob or array of files to read. |
Optional parameters mirror
DuckDB
read_parquet function. To specify
optional parameters, use parameter := 'value'.
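For example, passing one of DuckDB's read_parquet options by name (a sketch; binary_as_string is a standard read_parquet parameter):
SELECT r['id'] FROM read_parquet('file.parquet', binary_as_string := true) r;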
read_csv(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row
#Reads a CSV file, either from a remote storage (via httpfs) or a local storage.
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM read_csv('file.csv');
SELECT r['id'], r['name'] FROM read_csv('file.csv') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_csv('file.csv');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled) with the CSV files to read. The path can be a glob or array of files to read. |
Optional parameters mirror
DuckDB
read_csv function. To specify optional
parameters, use parameter := 'value'.
Compatibility notes:
columns is not currently supported.
nullstr must be an array
(TEXT[]).
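For example, given the compatibility notes above (a sketch; header is a standard read_csv parameter):
SELECT * FROM read_csv('file.csv', header := true, nullstr := ARRAY['NULL', 'N/A']);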
read_json(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row
#Reads a JSON file, either from a remote storage (via httpfs) or a local storage.
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM read_json('file.json');
SELECT r['id'], r['name'] FROM read_json('file.json') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_json('file.json');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled) with the JSON files to read. The path can be a glob or array of files to read. |
Optional parameters mirror
DuckDB
read_json function. To specify optional
parameters, use parameter := 'value'.
Compatibility notes:
columns is not currently supported.
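For example (a sketch; format is a standard read_json parameter):
SELECT * FROM read_json('file.json', format := 'newline_delimited');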
iceberg_scan(path TEXT, optional_parameters) returns SETOF duckdb.row
#Reads an Iceberg table, either from a remote storage (via httpfs) or a local storage.
To use iceberg_scan, you must enable the
iceberg extension:
SELECT duckdb.install_extension('iceberg');
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM iceberg_scan('data/iceberg/table');
SELECT r['id'], r['name'] FROM iceberg_scan('data/iceberg/table') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM iceberg_scan('data/iceberg/table');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text | The path, either to a remote httpfs storage or a local storage (if enabled) with the Iceberg table to read. |
Optional parameters mirror DuckDB
iceberg_scan function. To specify optional
parameters, use parameter := 'value'.
| Name | Type | Default | Description |
|---|---|---|---|
| allow_moved_paths | boolean | false | Ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved. |
| mode | text | '' | |
| metadata_compression_codec | text | 'none' | |
| skip_schema_inference | boolean | false | |
| version | text | 'version-hint.text' | |
| version_name_format | text | 'v%s%s.metadata.json,%s%s.metadata.json' | |
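For example, scanning a table that has been moved (a sketch using the option from the table above):
SELECT COUNT(*) FROM iceberg_scan('data/iceberg/table', allow_moved_paths := true);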
iceberg_metadata(path TEXT, optional_parameters) returns SETOF iceberg_metadata_record
#
To use iceberg_metadata, you must enable
the iceberg extension:
SELECT duckdb.install_extension('iceberg');
Returns metadata about an Iceberg table. Data is returned as a
set of iceberg_metadata_record, which is
defined as follows:
CREATE TYPE duckdb.iceberg_metadata_record AS (
    manifest_path TEXT,
    manifest_sequence_number NUMERIC,
    manifest_content TEXT,
    status TEXT,
    content TEXT,
    file_path TEXT
);
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text | The path, either to a remote httpfs storage or a local storage (if enabled) with the Iceberg table to read. |
Optional parameters mirror DuckDB
iceberg_metadata function. To specify optional
parameters, use parameter := 'value'.
| Name | Type | Default | Description |
|---|---|---|---|
| allow_moved_paths | boolean | false | Ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved. |
| metadata_compression_codec | text | 'none' | |
| skip_schema_inference | boolean | false | |
| version | text | 'version-hint.text' | |
| version_name_format | text | 'v%s%s.metadata.json,%s%s.metadata.json' | |
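For example (a sketch):
SELECT * FROM iceberg_metadata('data/iceberg/table', allow_moved_paths := true);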
delta_scan(path TEXT) returns SETOF duckdb.row
#Reads a Delta dataset, either from a remote storage (via httpfs) or a local storage.
To use delta_scan, you must enable the
delta extension:
SELECT duckdb.install_extension('delta');
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM delta_scan('/path/to/delta/dataset');
SELECT r['id'], r['name'] FROM delta_scan('/path/to/delta/dataset') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM delta_scan('/path/to/delta/dataset');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text | The path, either to a remote httpfs storage or a local storage (if enabled) with the Delta dataset to read. |
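For example, reading a Delta dataset from S3 (a sketch; assumes a suitable S3 secret is configured and the bucket path is hypothetical):
SELECT COUNT(*) FROM delta_scan('s3://my-bucket/delta/dataset');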
All of the DuckDB JSON functions and aggregates are available. Postgres Pro JSON/JSONB functions are not supported.
The approx_count_distinct function gives the approximate
count of distinct elements using HyperLogLog. For more
information, refer to the
official DuckDB documentation.
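For example (a sketch over a hypothetical Parquet file):
SELECT approx_count_distinct(r['user_id']) FROM read_parquet('file.parquet') r;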
duckdb.install_extension(extension_name TEXT, repository TEXT DEFAULT 'core') returns bool
#Installs a DuckDB extension and configures it to be loaded automatically in every session that uses pgpro_duckdb.
SELECT duckdb.install_extension('iceberg');
SELECT duckdb.install_extension('avro', 'community');
Since this function can be used to install and download any
extensions, it can only be executed by a
superuser by default. To allow execution by some other admin
user, such as my_admin, you can grant
such a user the following permissions:
GRANT ALL ON FUNCTION duckdb.install_extension(TEXT, TEXT) TO my_admin;
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| extension_name | text | The name of the extension to install |
duckdb.query(query TEXT) returns SETOF duckdb.row
#
Executes the given SELECT query directly against
DuckDB. This can be useful if
DuckDB syntax makes the query easier to
write or if you want to use a function that is not exposed by
pgpro_duckdb yet. For example, the query below puts
FROM before
SELECT and uses a list comprehension. Neither
of those features is supported in Postgres Pro.
SELECT * FROM duckdb.query('FROM range(10) as a(a) SELECT [a for i in generate_series(0, a)] as arr');
duckdb.raw_query(query TEXT) returns void
#
Runs an arbitrary query directly against
DuckDB. Compared to
duckdb.query, this function can execute any
query, not just SELECT queries. The main downside
is that it does not return its result as rows, but instead sends the
query result to the logs. So the recommendation is to use
duckdb.query when possible, but if you need
to run e.g. some DDL you can use this function.
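For example, running DDL that duckdb.query cannot execute (a sketch; the table name is hypothetical and the result appears in the logs):
SELECT duckdb.raw_query('CREATE TABLE my_duckdb_table (i INTEGER)');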
duckdb.recycle_ddb() returns void
#
pgpro_duckdb keeps the
DuckDB instance open in between
transactions. This is done to preserve session-level state, such
as manually executed SET commands. If you want
to clear this session-level state for some reason, you can
close the currently open DuckDB instance
using the following query:
CALL duckdb.recycle_ddb();