pgpro_duckdb is a Postgres Pro extension that embeds the DuckDB columnar-vectorized analytics engine and its features into Postgres Pro. pgpro_duckdb is the key component of the Built-in Analytical Platform, a Postgres Pro solution designed to handle OLAP workloads within a Postgres Pro instance.
The pgpro_duckdb extension is currently in an experimental phase.
The pgpro_duckdb extension is provided with Postgres Pro Enterprise
as a separate pre-built package pgpro-duckdb-ent-17
(for the detailed installation instructions, see Chapter 17).
To enable the pgpro_duckdb extension, create it
using the following query:
CREATE EXTENSION pgpro_duckdb;
By default, known extensions are allowed to be automatically installed and loaded when a DuckDB query depends on them. This behavior can be configured using the duckdb.autoinstall_known_extensions and duckdb.autoload_known_extensions settings.
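For example, a superuser can turn this behavior off for the current session (a minimal sketch; both settings are described in the settings reference below):
SET duckdb.autoinstall_known_extensions = false;
SET duckdb.autoload_known_extensions = false;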
It is also possible to manually install an extension. This can be useful when this autoinstall/autoload behavior is disabled, or when DuckDB fails to realize an extension is necessary to execute the query. Installing an extension requires superuser rights.
SELECT duckdb.install_extension('extname');
Installing an extension causes it to be loaded and installed
globally for any connection that uses DuckDB. The current list
of installed extensions is maintained in the
duckdb.extensions table. Superusers can use
this table to view, disable, or uninstall extensions, as
follows:
-- install an extension --
SELECT duckdb.install_extension('iceberg');
-- view currently installed extensions --
SELECT * FROM duckdb.extensions;
-- change an extension to stop it from being automatically loaded in new connections --
SELECT duckdb.autoload_extension('iceberg', false);
-- you can still load such extensions manually in a session --
SELECT duckdb.load_extension('iceberg');
-- you can also install community extensions --
SELECT duckdb.install_extension('prql', 'community');
The following extensions are supported by pgpro_duckdb.
azure
azure support allows reading files from
Azure Blob Storage by using
az://... filepaths.
iceberg
iceberg support adds functions to read Iceberg tables and metadata. For a list of iceberg functions, see Section G.3.9.
delta
delta support adds the ability to read Delta Lake files via delta_scan.
By default, only superusers can execute duckdb.install_extension
and duckdb.autoload_extension. This prevents other users from
installing extensions that can have security implications or interfere with
database operation.
This means that regular users can only use extensions that DuckDB
marks as auto-installable. If you want to restrict the use of these
extensions to a specific list of allowed extensions, you can do so by
setting duckdb.autoinstall_known_extensions
to false. This prevents users from automatically installing
any known extensions, so any extension you
want to allow must already be installed by a superuser.
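For example, a superuser could allow just one known extension (a sketch; httpfs is an arbitrary example):
-- preinstall the single extension users are allowed to use
SELECT duckdb.install_extension('httpfs');
-- prevent automatic installation of any other known extensions;
-- set this in postgresql.conf or via ALTER SYSTEM to make it persistent
ALTER SYSTEM SET duckdb.autoinstall_known_extensions = false;
SELECT pg_reload_conf();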
DuckDB secrets can be configured either using utility functions or with a foreign data wrapper for more advanced cases.
For example, with utility functions:
SELECT duckdb.create_simple_secret(
    type := 'S3',                -- one of: S3, GCS, R2
    key_id := 'access_key_id',
    secret := 'xxx',
    session_token := 'yyy',      -- optional
    region := 'us-east-1',       -- optional
    url_style := 'xxx',          -- optional
    provider := 'xxx',           -- optional
    endpoint := 'xxx'            -- optional
);
For Azure secrets, you can use:
SELECT duckdb.create_azure_secret('<connection_string>');
credential_chain Provider #
For more advanced use-cases, you can define secrets with SERVER
and USER MAPPING on the DuckDB
foreign data wrapper:
CREATE SERVER my_s3_secret TYPE 's3' FOREIGN DATA WRAPPER duckdb OPTIONS (PROVIDER 'credential_chain');
secret_access_key #
When your secret contains sensitive information, you need to create an
additional USER MAPPING like this:
CREATE SERVER my_s3_secret TYPE 's3' FOREIGN DATA WRAPPER duckdb;
CREATE USER MAPPING FOR CURRENT_USER SERVER my_s3_secret
    OPTIONS (KEY_ID 'my_secret_key', SECRET 'my_secret_value');
You can use any of the supported DuckDB secret types as long as the related extension is installed. For more information, refer to the official DuckDB documentation.
Secrets are stored in a combination of SERVER and
USER MAPPING on the DuckDB
foreign data wrapper. USER MAPPING
hosts sensitive elements like token, session_token,
and secret. Each time a DuckDB
instance is created by pgpro_duckdb, and when a
secret is modified, the secrets are loaded into the DuckDB
secrets manager as non-persistent secrets.
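Because secrets are stored as regular SERVER and USER MAPPING objects, they can also be removed with ordinary DDL. For example, for the my_s3_secret server created above:
DROP USER MAPPING FOR CURRENT_USER SERVER my_s3_secret;
DROP SERVER my_s3_secret;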
Multi-statement transactions are supported in pgpro_duckdb. There is one important restriction, though, which is currently necessary to ensure the expected ACID guarantees: you cannot write to both a Postgres Pro table and a DuckDB table in the same transaction.
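For example (a sketch, assuming pg_table is a regular Postgres Pro table and duckdb_table is a DuckDB table):
-- allowed: writes to only DuckDB tables in this transaction
BEGIN;
INSERT INTO duckdb_table VALUES (1);
INSERT INTO duckdb_table VALUES (2);
COMMIT;
-- not allowed: writes to tables of both engines in one transaction
BEGIN;
INSERT INTO pg_table VALUES (1);
INSERT INTO duckdb_table VALUES (1); -- raises an error
ROLLBACK;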
Similarly, you are allowed to do DDL (like
CREATE/DROP TABLE) on
DuckDB tables inside a transaction, but it is
not allowed to combine such statements with DDL involving
Postgres Pro objects.
Finally, it is possible to disable this restriction completely and
allow writes to both DuckDB and
Postgres Pro in the same transaction by
setting duckdb.unsafe_allow_mixed_transactions to
true but it isn't recommended to do so as it can result in the
transaction being committed only in DuckDB,
but not in Postgres Pro. This can lead to
inconsistencies and even data loss. For example, the following code
might result in deleting the duckdb_table, while
not copying its contents to pg_table:
BEGIN;
SET LOCAL duckdb.unsafe_allow_mixed_transactions TO true;
CREATE TABLE pg_table AS SELECT * FROM duckdb_table;
DROP TABLE duckdb_table;
COMMIT;
pgpro_duckdb can read many data types that exist in both Postgres Pro and DuckDB. The following data types are currently supported for use in queries:
Integer types (integer, bigint, etc.)
Floating point types (real, double precision)
numeric (might get converted to
double precision internally, see
Section G.3.6.1
limitations below for details)
text/varchar/bpchar
bit string types, both fixed-size (bit) and
variable-size (bit varying)
bytea/blob
timestamp/timestamptz/date/interval/timestamp_ns/timestamp_ms/timestamp_s
boolean
uuid
json/jsonb
domain
arrays for all of the above types, but see limitations
below about multi-dimensional arrays
The type support in pgpro_duckdb is not yet complete. The following are known issues that you might run into:
enum types are not supported.
The DuckDB decimal type
does not support the wide range of values that the
Postgres Pro numeric type
does. To avoid errors when converting between the two,
numeric is converted to double precision
internally if DuckDB does not support
the required precision. This might cause precision loss of the
values.
The DuckDB STRUCT type is not supported.
The DuckDB timestamp_ns type
gets truncated to microseconds when it is converted to the
Postgres Pro timestamp type,
which loses precision in the output. Operations on a
timestamp_ns value, such as
sorting/grouping/comparing, will use the full precision.
jsonb columns are converted to json columns
when reading from DuckDB.
This is because DuckDB does not have a
jsonb type.
Many Postgres Pro json and
jsonb functions and operators are not
implemented in DuckDB. Instead you can
use DuckDB json
functions and operators. See the
DuckDB
documentation for more information on these functions.
The DuckDB tinyint type
is converted to a char type in
Postgres Pro. This is because
Postgres Pro does not have a
tinyint type. This causes it to be displayed as a hex
code instead of a regular number.
Conversion between Postgres Pro
multi-dimensional arrays and DuckDB
nested LISTs can
run into various problems, because the two type systems do not
match exactly. Specifically,
Postgres Pro allows different
arrays in a column to have a different number of dimensions,
e.g. [1] and [[1], [2]] can
both occur in the same column. In DuckDB,
that is not allowed, i.e. the level of nesting must always be
the same. On the other hand, DuckDB
allows different lists at the same nest level
to contain a different number of elements,
e.g. [[1], [1, 2]]. This is not allowed
in Postgres Pro. So conversion between
these types is only possible when the values fall into the
subset that both systems support.
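For example, Postgres Pro itself accepts arrays of differing dimensionality in one column, which DuckDB cannot represent (a sketch; table s is also used in the example below):
CREATE TABLE s (a text[]);
INSERT INTO s VALUES (ARRAY['x']);        -- 1-dimensional
INSERT INTO s VALUES (ARRAY[ARRAY['y']]); -- 2-dimensional, also accepted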
Another possible problem that you can run into is that
pgpro_duckdb uses the
Postgres Pro column metadata to determine
the number of dimensions that an array has. Since
Postgres Pro does not
complain when you add arrays of different dimensions, it is
possible that the number of dimensions in the column
metadata does not match the actual number of dimensions. To
solve this you need to alter the column type:
-- configure the column to be a 3-dimensional array of text
ALTER TABLE s ALTER COLUMN a SET DATA TYPE text[][][];
For domain types, the domain
constraint check during an INSERT is
performed by Postgres Pro rather
than DuckDB. When a
SELECT queries a field whose
type is a domain, the value is
converted to the corresponding base type and then handled by
DuckDB.
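A minimal sketch of this behavior, using a hypothetical domain:
CREATE DOMAIN positive_int AS integer CHECK (VALUE > 0);
CREATE TABLE t (x positive_int);
INSERT INTO t VALUES (-1); -- rejected by the Postgres Pro domain check
-- when DuckDB executes the query, x is handled as its base type, integer
SELECT x + 1 FROM t;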
pgpro_duckdb introduces a few special Postgres Pro types. You should not create these types explicitly and normally you do not need to know about their existence, but they might show up in error messages from Postgres Pro. These are explained below:
duckdb.row #
The duckdb.row type is returned by
functions like read_parquet,
read_csv, scan_iceberg,
etc. Depending on the arguments of these functions, they can
return rows with different columns and types.
Postgres Pro does not
support such functions well at this point in time, so for now
a custom type is returned. Then to be able to get the
actual columns out of these rows, you have to use the
“square bracket indexing” syntax, similarly to
how you would extract a field from a JSON object:
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
Using SELECT * will result in the columns
of this row being expanded, so your query result will never
have a column that has duckdb.row as its
type:
SELECT * FROM read_parquet('file.parquet');
Due to Postgres Pro limitations, there
are some issues when using a function that returns a
duckdb.row in a CTE or subquery. The main
problem is that pgpro_duckdb cannot
automatically assign useful aliases to the columns selected from
the row. So while this query without a CTE/subquery returns the
r['company'] column as company:
SELECT r['company'] FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r;
--    company
-- ─────────────
--  DuckDB Labs
The same query in a subquery or CTE will return the column
simply as r:
WITH mycte AS (
    SELECT r['company']
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--       r
-- ─────────────
--  DuckDB Labs
This is easy to work around by adding an explicit alias to the column in the CTE/subquery:
WITH mycte AS (
    SELECT r['company'] AS company
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs
Another similarly confusing limitation is that
when you use SELECT * inside the
CTE/subquery and want to reference a specific column
outside the CTE/subquery, you still need to use the
r['colname'] syntax instead of simply
colname. So while this works as expected:
WITH mycte AS (
    SELECT *
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte;
--    company
-- ─────────────
--  DuckDB Labs
The following query will throw an error:
WITH mycte AS (
    SELECT *
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE company = 'DuckDB Labs';
-- ERROR:  42703: column "company" does not exist
-- LINE 5: SELECT * FROM mycte WHERE company = 'DuckDB Labs';
--                                   ^
-- HINT:  If you use DuckDB functions like read_parquet, you need to use the r['colname'] syntax to use columns. If you are already doing that, maybe you forgot to give the function the r alias.
This is easy to work around by using the
r['colname'] syntax like so:
WITH mycte AS (
    SELECT *
    FROM duckdb.query($$ SELECT 'DuckDB Labs' company $$) r
)
SELECT * FROM mycte WHERE r['company'] = 'DuckDB Labs';
--    company
-- ─────────────
--  DuckDB Labs
duckdb.unresolved_type #
The duckdb.unresolved_type type is a type
that is used to make Postgres Pro understand
an expression for which the type is not known at query parse time.
This is the type of any of the columns extracted from a
duckdb.row using the
r['mycol'] syntax. Many operators and
aggregates will return a
duckdb.unresolved_type when one of the
sides of the operator is of the type
duckdb.unresolved_type, for instance
r['age'] + 10.
Once the query gets executed by DuckDB,
the actual type will be filled in by DuckDB.
So, a query result will never contain a column of type
duckdb.unresolved_type, and generally
you will not even notice that this type exists.
You might get errors that say that functions or operators
do not exist for the duckdb.unresolved_type, such as:
ERROR:  function some_func(duckdb.unresolved_type) does not exist
LINE 6:         some_func(r['somecol']) as somecol
In such cases, a simple workaround is often to add an explicit
cast to the type that the function accepts, such as
some_func(r['somecol']::text) as somecol.
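For example (a sketch; upper is an ordinary function that accepts text):
SELECT upper(r['name']::text) AS name
FROM read_parquet('file.parquet') r;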
duckdb.json #
The duckdb.json type is used as the argument
type of DuckDB JSON functions. This type exists
so that these functions can accept values of type json,
jsonb, and
duckdb.unresolved_type.
Many of the settings below map directly to specific DuckDB settings.
duckdb.force_execution #
Force queries to use DuckDB execution.
This is only necessary when a query accesses only
Postgres Pro tables. As soon as you
use a DuckDB-only feature,
DuckDB execution is used
automatically. DuckDB-only features
are reading from DuckDB
tables, using DuckDB functions (like
read_parquet), or COPY to
remote storage (s3://, etc).
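For example (a sketch, assuming heap_table is a regular Postgres Pro table):
SET duckdb.force_execution = true;
SELECT count(*) FROM heap_table; -- now executed by DuckDB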
Default: false
duckdb.postgres_role #
Which Postgres Pro role is allowed
to use DuckDB execution and
the configured DuckDB secrets.
Defaults to superusers only. If this is configured, but the
role does not exist when running
CREATE EXTENSION pgpro_duckdb, it will be
created automatically.
Default: ""
Access: Needs to be in the postgresql.conf
file and requires a restart
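A sketch of the corresponding postgresql.conf entry, with a hypothetical role name:
duckdb.postgres_role = 'analytics_users'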
duckdb.disabled_filesystems #
Disable specific file systems preventing access. This setting applies
to all users including superusers. For non-superusers that are not
members of both the pg_read_server_files
and pg_write_server_files roles,
LocalFileSystem is always disabled. If you add
LocalFileSystem to this setting, superusers also cannot
access LocalFileSystem through DuckDB.
Default: ""
Access: Superuser-only
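For example, to block local file access through DuckDB for all users, superusers included (a sketch):
ALTER SYSTEM SET duckdb.disabled_filesystems = 'LocalFileSystem';
SELECT pg_reload_conf();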
duckdb.autoinstall_known_extensions #Whether known extensions are allowed to be automatically installed when a DuckDB query depends on them.
Default: true
Access: Superuser-only
duckdb.autoload_known_extensions #Whether known extensions are allowed to be automatically loaded when a DuckDB query depends on them.
Default: true
duckdb.allow_community_extensions #Whether to allow installing community extensions.
Default: false
Access: Superuser-only
duckdb.enable_external_access #Allow DuckDB to access external storage (e.g., HTTP, S3, etc.). This setting is not yet well tested, and disabling it may unintentionally break pgpro_duckdb functionality.
Default: true
Access: Superuser-only
Since any connection that uses DuckDB will have its own DuckDB instance, these settings are per-connection. When using pgpro_duckdb in many concurrent connections it can be a good idea to set some of these more conservatively than their defaults.
duckdb.max_memory / duckdb.memory_limit
#
The maximum memory DuckDB can use within
a single Postgres Pro
connection. This is somewhat comparable to the
Postgres Pro
work_mem
setting. When set to an empty string,
DuckDB's normal default is used, which is 80% of RAM.
Default: "4GB"
Access: Superuser-only
duckdb.threads / duckdb.worker_threads #
Maximum number of DuckDB threads per
Postgres Pro connection. -1
means to use DuckDB's default, which
is the number of CPU cores on the machine.
Default: -1
Access: Superuser-only
duckdb.max_workers_per_postgres_scan #Maximum number of Postgres Pro parallel workers used for a single Postgres Pro scan. This is similar to the Postgres Pro max_parallel_workers_per_gather setting.
Default: 2
Access: General
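As a sketch, a more conservative per-connection configuration for instances with many concurrent DuckDB-using sessions might look like this (values are illustrative):
ALTER SYSTEM SET duckdb.max_memory = '1GB';
ALTER SYSTEM SET duckdb.threads = 4;
SELECT pg_reload_conf();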
By default, the functions listed below that are shown without a schema are installed
into public. You can choose to install these
functions to an alternate location by running
CREATE EXTENSION pgpro_duckdb WITH SCHEMA schema.
ALTER EXTENSION pgpro_duckdb WITH SCHEMA schema is
not currently supported.
read_parquet(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row
#Reads a Parquet file, either from a remote storage (via httpfs) or a local storage.
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM read_parquet('file.parquet');
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_parquet('file.parquet');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled) with the Parquet files to read. The path can be a glob or array of files to read. |
Optional parameters mirror
DuckDB
read_parquet function. To specify
optional parameters, use parameter := 'value'.
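For example, passing one of DuckDB's read_parquet options by name (a sketch; binary_as_string is a standard read_parquet parameter):
SELECT r['id'] FROM read_parquet('file.parquet', binary_as_string := true) r;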
read_csv(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row
#Reads a CSV file, either from a remote storage (via httpfs) or a local storage.
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM read_csv('file.csv');
SELECT r['id'], r['name'] FROM read_csv('file.csv') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_csv('file.csv');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled) with the CSV files to read. The path can be a glob or array of files to read. |
Optional parameters mirror
DuckDB
read_csv function. To specify optional
parameters, use parameter := 'value'.
Compatibility notes:
columns is not currently supported.
nullstr must be an array
(TEXT[]).
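For example, given the compatibility notes above (a sketch; header is a standard read_csv parameter):
SELECT * FROM read_csv('file.csv', header := true, nullstr := ARRAY['NULL', 'N/A']);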
read_json(path TEXT or TEXT[], optional_parameters) returns SETOF duckdb.row
#Reads a JSON file, either from a remote storage (via httpfs) or a local storage.
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM read_json('file.json');
SELECT r['id'], r['name'] FROM read_json('file.json') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_json('file.json');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text or text[] | The path, either to the remote httpfs storage or the local storage (if enabled) with the JSON files to read. The path can be a glob or array of files to read. |
Optional parameters mirror
DuckDB
read_json function. To specify optional
parameters, use parameter := 'value'.
Compatibility notes:
columns is not currently supported.
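For example (a sketch; format is a standard read_json parameter):
SELECT * FROM read_json('file.json', format := 'newline_delimited');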
iceberg_scan(path TEXT, optional_parameters) returns SETOF duckdb.row
#Reads an Iceberg table, either from a remote storage (via httpfs) or a local storage.
To use iceberg_scan, you must enable the
iceberg extension:
SELECT duckdb.install_extension('iceberg');
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM iceberg_scan('data/iceberg/table');
SELECT r['id'], r['name'] FROM iceberg_scan('data/iceberg/table') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM iceberg_scan('data/iceberg/table');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text | The path, either to a remote httpfs storage or a local storage (if enabled) with the Iceberg table to read. |
Optional parameters mirror DuckDB
iceberg_scan function. To specify optional
parameters, use parameter := 'value'.
| Name | Type | Default | Description |
|---|---|---|---|
| allow_moved_paths | boolean | false | Ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved. |
| mode | text | '' | |
| metadata_compression_codec | text | 'none' | |
| skip_schema_inference | boolean | false | |
| version | text | 'version-hint.text' | |
| version_name_format | text | 'v%s%s.metadata.json,%s%s.metadata.json' | |
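For example, scanning a table that has been moved (a sketch using the option from the table above):
SELECT COUNT(*) FROM iceberg_scan('data/iceberg/table', allow_moved_paths := true);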
iceberg_metadata(path TEXT, optional_parameters) returns SETOF iceberg_metadata_record
#
To use iceberg_metadata, you must enable
the iceberg extension:
SELECT duckdb.install_extension('iceberg');
Returns metadata about an Iceberg table. Data is returned as a
set of iceberg_metadata_record, which is
defined as follows:
CREATE TYPE duckdb.iceberg_metadata_record AS (
    manifest_path TEXT,
    manifest_sequence_number NUMERIC,
    manifest_content TEXT,
    status TEXT,
    content TEXT,
    file_path TEXT
);
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text | The path, either to a remote httpfs storage or a local storage (if enabled) with the Iceberg table to read. |
Optional parameters mirror DuckDB
iceberg_metadata function. To specify optional
parameters, use parameter := 'value'.
| Name | Type | Default | Description |
|---|---|---|---|
| allow_moved_paths | boolean | false | Ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved. |
| metadata_compression_codec | text | 'none' | |
| skip_schema_inference | boolean | false | |
| version | text | 'version-hint.text' | |
| version_name_format | text | 'v%s%s.metadata.json,%s%s.metadata.json' | |
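For example (a sketch):
SELECT * FROM iceberg_metadata('data/iceberg/table', allow_moved_paths := true);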
delta_scan(path TEXT) returns SETOF duckdb.row
#Reads a Delta dataset, either from a remote storage (via httpfs) or a local storage.
To use delta_scan, you must enable the
delta extension:
SELECT duckdb.install_extension('delta');
This returns DuckDB rows; you can expand
them using * or select specific columns
using the r['mycol'] syntax. If you want to
select specific columns, you should give the function call an
easy alias, like r. For example:
SELECT * FROM delta_scan('/path/to/delta/dataset');
SELECT r['id'], r['name'] FROM delta_scan('/path/to/delta/dataset') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM delta_scan('/path/to/delta/dataset');
Further information:
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| path | text | The path, either to a remote httpfs storage or a local storage (if enabled) with the Delta dataset to read. |
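For example, reading a Delta dataset from S3 (a sketch; assumes a suitable S3 secret is configured and the bucket path is hypothetical):
SELECT COUNT(*) FROM delta_scan('s3://my-bucket/delta/dataset');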
All of the DuckDB JSON functions and aggregates are available. Postgres Pro JSON/JSONB functions are not supported.
The approx_count_distinct function gives the approximate
count of distinct elements using HyperLogLog. For more
information, refer to the
official DuckDB documentation.
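For example (a sketch over a hypothetical Parquet file):
SELECT approx_count_distinct(r['user_id']) FROM read_parquet('file.parquet') r;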
duckdb.install_extension(extension_name TEXT, repository TEXT DEFAULT 'core') returns bool
#Installs a DuckDB extension and configures it to be loaded automatically in every session that uses pgpro_duckdb.
SELECT duckdb.install_extension('iceberg');
SELECT duckdb.install_extension('avro', 'community');
Since this function can be used to install and download any
extensions, it can only be executed by a
superuser by default. To allow execution by some other admin
user, such as my_admin, you can grant
such a user the following permissions:
GRANT ALL ON FUNCTION duckdb.install_extension(TEXT, TEXT) TO my_admin;
The following arguments are required:
| Name | Type | Description |
|---|---|---|
| extension_name | text | The name of the extension to install |
duckdb.query(query TEXT) returns SETOF duckdb.row
#
Executes the given SELECT query directly against
DuckDB. This can be useful if
DuckDB syntax makes the query easier to
write or if you want to use a function that is not exposed by
pgpro_duckdb yet. For example, the query below puts
FROM before
SELECT and uses a list comprehension. Neither
of those features is supported in Postgres Pro.
SELECT * FROM duckdb.query('FROM range(10) as a(a) SELECT [a for i in generate_series(0, a)] as arr');
duckdb.raw_query(query TEXT) returns void
#
Runs an arbitrary query directly against
DuckDB. Compared to
duckdb.query, this function can execute any
query, not just SELECT queries. The main downside
is that it does not return its result as rows, but instead sends the
query result to the logs. So the recommendation is to use
duckdb.query when possible, but if you need
to run e.g. some DDL you can use this function.
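For example, running DDL that duckdb.query cannot execute (a sketch; the table name is hypothetical and the result appears in the logs):
SELECT duckdb.raw_query('CREATE TABLE my_duckdb_table (i INTEGER)');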
duckdb.recycle_ddb() returns void
#
pgpro_duckdb keeps the
DuckDB instance open in between
transactions. This is done to preserve session-level state, such
as manually executed SET commands. If you want
to clear this session-level state for some reason, you can
close the currently open DuckDB instance
using the following query:
CALL duckdb.recycle_ddb();