To start working with the built-in analytical platform, perform the following steps:
Read the information about considerations and limitations and ensure that the pgpro_duckdb extension meets your organization's internal requirements and your analytical query workload profile.
Define your optimal deployment diagram. For more information, see Deployment Recommendations.
Install the pgpro_duckdb extension.
Allocate the required resources. For more information, see Capacity Planning.
Define the security model for accessing OLAP resources. For more information, see Configuring Security for Accessing OLAP Resources.
Limit resources for analytical queries considering OLTP workload. For more information, see Configuring Computational Resource Limits.
Provide pgpro_duckdb users with the required access rights.
Export the required OLAP data to the columnar format and move it to the corresponding storage. For more information, see Exporting OLAP Data to a Storage.
Create views for the exported Parquet files. For more information, see Creating a View for Accessing Parquet Files.
Ensure that analytical queries can be executed based on the created views.
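As a quick check for the last step, you can run a test query against a created view and confirm in the plan that it is handled by the analytical engine (the view name here is an assumption):

```sql
-- parquet_table is a hypothetical view created in the previous steps.
EXPLAIN SELECT count(*) FROM parquet_table;
-- The plan should contain a DuckDB scan node instead of a regular
-- Postgres Pro sequential scan over the underlying data.
```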
To configure security for accessing OLAP resources, set up role access and storage restrictions.
You can use the duckdb.postgres_role configuration parameter to specify the name of the role that can call pgpro_duckdb functions:
postgres=# show duckdb.postgres_role;
duckdb.postgres_role
----------------------
(1 row)
By default, the duckdb.postgres_role configuration
parameter is set to NULL, meaning only a superuser can call
pgpro_duckdb functions:
postgres=> SELECT usename, usesuper FROM pg_user WHERE usename = current_user;
usename | usesuper
---------+----------
user1 | f
(1 row)
postgres=> SELECT * FROM duckdb.query('SELECT * FROM a;');
ERROR: DuckDB execution is not allowed because you have not been granted the duckdb.postgres_role
To allow members of a user role to call pgpro_duckdb functions, specify the name of the required role, then restart the server:
postgres=# ALTER SYSTEM SET duckdb.postgres_role TO 'user1';
ALTER SYSTEM
-- after server restart --
postgres=# show duckdb.postgres_role ;
duckdb.postgres_role
----------------------
user1
(1 row)
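Membership in the configured role is then managed with the standard GRANT command. For example, to let a hypothetical analyst1 user call pgpro_duckdb functions through the user1 role:

```sql
-- analyst1 is an assumed user name; user1 is the role set in
-- duckdb.postgres_role above.
GRANT user1 TO analyst1;

-- To revoke access later:
REVOKE user1 FROM analyst1;
```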
You can use the duckdb.disabled_filesystems configuration parameter to specify the storages that will be disabled for the pgpro_duckdb extension.
Possible values:
LocalFileSystem: Local storage
S3FileSystem: S3 storage
HTTPFileSystem: Network storage
By default, the local storage is disabled:
postgres=> show duckdb.disabled_filesystems;
duckdb.disabled_filesystems
-----------------------------
LocalFileSystem
(1 row)
Before enabling a local storage or an S3 storage, see Considerations for Working with a Local Storage and Considerations for Working with an S3 Storage.
To specify multiple storages, separate their names with commas without spaces:
postgres=# ALTER SYSTEM SET duckdb.disabled_filesystems TO 'LocalFileSystem,HTTPFileSystem';
ALTER SYSTEM
-- after server restart --
postgres=# show duckdb.disabled_filesystems;
duckdb.disabled_filesystems
--------------------------------
LocalFileSystem,HTTPFileSystem
(1 row)
The local storage is disabled by default, preventing users from exporting files to it or reading files from it:
postgres=> show duckdb.disabled_filesystems;
duckdb.disabled_filesystems
-----------------------------
LocalFileSystem
(1 row)
postgres=> COPY test TO 'out.parquet';
ERROR: (PGDuckDB/DuckdbUtilityHook_Cpp) Permission Error: File system LocalFileSystem has been disabled by configuration
postgres=> SELECT * FROM read_parquet('test.parquet');
ERROR: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Permission Error: File system LocalFileSystem has been disabled by configuration'
If the local storage is enabled, members of the role specified in the
duckdb.postgres_role configuration
parameter have the same privileges within this storage as the postgres
user. This is required for reading and writing Parquet files:
postgres=> show duckdb.disabled_filesystems;
duckdb.disabled_filesystems
-----------------------------
(1 row)
postgres=> COPY test TO 'out.parquet';
COPY 7
postgres=> SELECT * FROM read_parquet('out.parquet');
a | b
---+---
1 | 2
...
(7 rows)
However, in this case, users can also access files in the
PGDATA directory:
postgres=> show duckdb.disabled_filesystems ;
duckdb.disabled_filesystems
-----------------------------
(1 row)
postgres=> SELECT pg_read_file('postgresql.auto.conf');
ERROR: permission denied for function pg_read_file
postgres=> SELECT * FROM duckdb.query('SELECT content FROM read_text(''postgresql.auto.conf'')');
content
-------------------------------------------------------
# Do not edit this file manually! +
# It will be overwritten by the ALTER SYSTEM command.+
If you disable the local storage, this access is no longer possible:
postgres=> show duckdb.disabled_filesystems ;
duckdb.disabled_filesystems
-----------------------------
LocalFileSystem
(1 row)
postgres=> SELECT * FROM duckdb.query('SELECT content FROM read_text(''postgresql.auto.conf'')');
ERROR: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Permission Error: File system LocalFileSystem has been disabled by configuration
Recommendations for using the pgpro_duckdb extension with a local storage:
Use a local storage only if you trust users to call pgpro_duckdb functions correctly.
Isolate Postgres Pro processes from all files except those in the PGDATA directory and the files that will be read or written.
To work with an S3 storage, specify the required connection information.
For example, if you have an instance of the MinIO
S3 storage deployed, perform the following steps under the
postgres user:
Create a server for the S3 storage:
postgres=# CREATE SERVER my_secret1
TYPE 's3'
FOREIGN DATA WRAPPER duckdb
OPTIONS (
endpoint '127.0.0.1:9000',
url_style 'path',
use_ssl 'FALSE'
);
CREATE SERVER
Create a user mapping for the user that will work with the S3 storage:
postgres=# CREATE USER MAPPING FOR user1 SERVER my_secret1
OPTIONS (KEY_ID 'minioadmin', SECRET 'minioadmin');
CREATE USER MAPPING
The specified user now has the required privileges for working with the S3 storage:
postgres=> SELECT * FROM read_parquet('s3://bucket1/results.parquet');
a | b
---+---
1 | 2
2 |
(2 rows)
postgres=> COPY a TO 's3://bucket1/a.parquet';
COPY 8
If you create a user mapping for the user role specified in the duckdb.postgres_role configuration parameter, all members of this role will be able to work with the S3 storage.
To revoke a user's access to the S3 storage, delete the corresponding user mapping:
postgres=# DROP USER MAPPING FOR user1 SERVER my_secret1;
DROP USER MAPPING
To revoke all user access to the S3 storage, disable it:
postgres=> show duckdb.disabled_filesystems ;
duckdb.disabled_filesystems
------------------------------
LocalFileSystem,S3FileSystem
(1 row)
postgres=> SELECT * FROM read_parquet('s3://bucket1/results.parquet');
ERROR: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Permission Error: File system S3FileSystem has been disabled by configuration
postgres=> COPY a TO 's3://bucket1/a.parquet';
ERROR: (PGDuckDB/DuckdbUtilityHook_Cpp) Permission Error: File system S3FileSystem has been disabled by configuration
Considerations for exporting OLAP data to a storage:
The zstd compression method with the default compression level provides the optimal ratio of file size to operation speed.
You can specify the file size using the FILE_SIZE_BYTES
parameter. The value must be in the range from 100 MB to 10 GB.
You can leave the ROW_GROUP_SIZE parameter unchanged.
OLAP data is exported using the COPY command. You can
export an entire Postgres Pro table:
COPY tbl TO 'output.parquet' (FORMAT parquet);
Or only the required OLAP data:
COPY (SELECT * FROM tbl) TO 'output.parquet' (FORMAT parquet);
You can follow the guidelines for directory structure when exporting to Parquet files:
COPY (SELECT * FROM tbl) TO '/rootpath/db_name/schema_name/table_name'
(FORMAT parquet, COMPRESSION zstd, FILE_SIZE_BYTES 1g);
To export OLAP data to an S3 storage, specify the required connection
information. For example, if you have an instance of the MinIO
S3 storage deployed, perform the following steps under the
postgres user:
Create a server for the S3 storage:
postgres=# CREATE SERVER my_secret1
TYPE 's3'
FOREIGN DATA WRAPPER duckdb
OPTIONS (
endpoint '127.0.0.1:9000',
url_style 'path',
use_ssl 'FALSE'
);
CREATE SERVER
Create a user mapping for the user that will work with the S3 storage:
postgres=# CREATE USER MAPPING FOR user1 SERVER my_secret1
OPTIONS (KEY_ID 'minioadmin', SECRET 'minioadmin');
CREATE USER MAPPING
Execute the COPY command under the required user:
COPY (SELECT * FROM tbl) TO 's3://bucket/db_name/schema_name/table_name' (FORMAT parquet, COMPRESSION zstd, FILE_SIZE_BYTES 1g);
To organize OLAP data by partition keys
within the Postgres Pro table directory using Hive
partitioning, specify the PARTITION_BY parameter when
exporting data:
COPY (SELECT * FROM tbl) TO 's3://bucket/db_name/schema_name/table_name' (FORMAT parquet, PARTITION_BY (year, month), COMPRESSION zstd);
Since writing a large number of Parquet files is a resource-intensive OLAP
operation, select partition keys so that the size of each resulting
partition is at least 100 MB. You cannot configure the size of Parquet files
together with the PARTITION_BY parameter. The number of
Parquet files created in each directory (partition) corresponds to the number of
pgpro_duckdb threads.
For more information about the COPY command parameters
and Parquet format, see the official DuckDB documentation
on statements
and Parquet files.
The COPY command is executed in a transaction, so
the snapshot of the exported Postgres Pro table is
consistent. You can use an explicit transaction to consistently export
multiple tables by executing the COPY command for each of them sequentially.
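A minimal sketch of such a multi-table export in one explicit transaction (table and target paths are assumptions):

```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Both tables are exported from the same snapshot,
-- so the resulting Parquet files are mutually consistent.
COPY orders TO 's3://bucket1/db_name/public/orders' (FORMAT parquet, COMPRESSION zstd);
COPY lineitem TO 's3://bucket1/db_name/public/lineitem' (FORMAT parquet, COMPRESSION zstd);
COMMIT;
```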
Open transactions delay vacuum, which can negatively impact performance and increase the size of tables. To avoid this, perform the following steps:
If you have available computational resources, execute the
COPY command for each table in a separate session.
In this case, connect to a single transaction snapshot in each
session using the SET TRANSACTION SNAPSHOT command.
Append-only tables usually have a column with monotonically increasing values. You can split the values into ranges and export them in different transactions.
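The separate-session approach can be sketched as follows, using the standard pg_export_snapshot function to share one snapshot across sessions (table names, file names, and the snapshot identifier are assumptions):

```sql
-- Session 1: open a transaction and export its snapshot.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();  -- returns an identifier, e.g. '00000003-0000001B-1'
COPY orders TO 'orders.parquet' (FORMAT parquet);

-- Session 2: attach to the same snapshot before exporting.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
COPY lineitem TO 'lineitem.parquet' (FORMAT parquet);
COMMIT;

-- Session 1: commit once all exports have started or finished.
COMMIT;
```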
Note that you can specify the number of postgres processes
for reading tables and the number of pgpro_duckdb
threads for exporting OLAP data to Parquet files. However, the conversion
between the columnar and row representations is currently performed in a single thread.
If an error occurs when exporting OLAP data to Parquet files, for example, due to insufficient disk space, partially exported files are not deleted. For such cases, configure automatic deletion of the partial files and resumption of the export.
The pgpro_duckdb
extension uses the read_parquet
function for reading Parquet files. This function returns a special data type —
duckdb.row that provides compatibility between
DuckDB and Postgres Pro.
The read_parquet function uses a unique syntax for
accessing columns:
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
After exporting OLAP data to Parquet files, you can work with them as you would with standard Postgres Pro tables. To do this, create a view for Parquet files:
CREATE VIEW parquet_table AS
SELECT
r['id']::BIGINT AS id,
r['age']::SMALLINT AS age,
r['name']::VARCHAR AS name
FROM read_parquet('file.parquet') AS r;
In this example, the data types used in the Parquet file are specified explicitly.
This is optional, but it allows you to cast columns to the
required DuckDB types; otherwise, metadata from the
Parquet files is used. If data types are not specified explicitly, the view
description will include a special duckdb.unresolved_type data
type.
Once the view is created, standard analytical queries can be executed:
SELECT id, name FROM parquet_table r WHERE age > 21;
The read_parquet function can take a list of Parquet
file names. It also allows using a glob pattern syntax and specifying lists of
glob patterns. The glob pattern syntax can be used for reading a directory with
Parquet files or multiple directories if Hive partitioning is used.
For more information, see the official DuckDB documentation on Parquet files.
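For example, the list and glob forms might look as follows (the paths are assumptions):

```sql
-- Explicit list of Parquet files:
SELECT count(*) FROM read_parquet(ARRAY['part1.parquet', 'part2.parquet']) r;

-- Glob pattern over a directory with Parquet files:
SELECT count(*) FROM read_parquet('s3://bucket1/orders/*.parquet') r;

-- Multiple Hive-partitioned directories:
SELECT count(*) FROM read_parquet('s3://bucket1/orders/year=*/month=*/*.parquet') r;
```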
The pgpro_duckdb extension includes the following performance optimizations for Parquet files:
Projection pushdown: Excludes columns that are not specified in analytical queries.
Filter pushdown: Uses Parquet file metadata to read only the required rows.
Hive partitioning: Reduces the amount of OLAP data being read. You can organize OLAP data by partition keys in a directory hierarchy and use partition keys in analytical queries. In this case, only the directories that correspond to the specified partition keys will be read.
For more information, see the official DuckDB documentation on partial reading and filter pushdown.
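As an illustration of Hive partitioning, a filter on the partition key columns lets only the matching directories be read (paths and column names are assumptions):

```sql
-- With data exported using PARTITION_BY (year, month), files are laid out
-- as .../year=2024/month=6/*.parquet. A query that filters on the partition
-- keys reads only the directories matching the filter:
SELECT count(*)
FROM read_parquet('s3://bucket/db_name/schema_name/table_name/*/*/*.parquet') r
WHERE r['year'] = 2024 AND r['month'] = 6;
```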
To avoid negative impact on OLTP workloads, it is important to limit the consumption of computational resources by analytical sessions.
The pgpro_duckdb extension allows limiting the number of threads and the memory consumption of analytical sessions using dedicated configuration parameters.
The real memory consumption by OS can be higher, for example, allocating space for query results does not take configured limits into account. For more information, see the official DuckDB documentation on resource management.
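A configuration sketch is shown below; the parameter names duckdb.max_memory and duckdb.threads are assumptions patterned on the upstream pg_duckdb extension, so check the parameter reference for your pgpro_duckdb version:

```sql
-- Assumed parameter names; verify against your version's documentation.
ALTER SYSTEM SET duckdb.max_memory = '4GB';  -- memory available to analytical sessions
ALTER SYSTEM SET duckdb.threads = 4;         -- threads used by analytical queries
SELECT pg_reload_conf();
```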
OLAP data is processed by the pgpro_duckdb extension in streams, i.e., part by part, avoiding full materialization when possible. If the pgpro_duckdb extension lacks RAM for executing an analytical query, the associated data is placed in temporary files, which significantly increases the execution time.
Temporary files are located in the PGDATA/.tmp
directory and can take up to 90% of disk space. Monitor the volume of
temporary files created by the pgpro_duckdb
extension daily and avoid peak values.
The CH-benCHmark methodology was used to evaluate performance. It includes measuring the average number of transactions per second (TPS) during a specific time period, as well as the number of analytical queries executed in parallel. In this case, the nature of analytical queries is not important, as they are executed in a random order, and the execution time of a single analytical query is much less than the duration of the test run.
OLTP workload was provided by the pgbench program, and analytical ClickBench queries were generated in the background by a script. In the CH-benCHmark file, analytical TPC-H queries were replaced with ClickBench queries. The citus-benchmark implementation was used as a starting point.
Test details:
Server with 64 vCPUs and 128 GB RAM
Data size: 14 GB each for pgbench and ClickBench datasets, processed in RAM
Test duration: 100 seconds
Metrics: TPS, number of analytical queries, and load average
For this test, a single ClickBench process with 4 threads was initialized. The number of pgbench -S (select-only) connections varied for each test run.
Table 34.2. Test 1 Results

| Number of Connections | TPS / Analytical Queries (Load Average) |
|---|---|
| 25 | 287/39 (23.8) |
| 50 | 487/39 (42) |
| 100 | 532/38 (68) |
| 200 | 610/34 (158) |
For this test, 200 pgbench -S (select-only) connections were initialized. The number of ClickBench processes with 2 threads each varied for each test run.
Table 34.3. Test 2 Results

| Number of ClickBench Processes | TPS / Analytical Queries (Load Average) |
|---|---|
| 2 | 574/49 (147) |
| 4 | 533/66 (156) |
| 8 | 522/127 (170) |
For this test, 200 pgbench TPC-B connections were initialized. The number of threads varied for each test run.
Table 34.4. Test 3 Results

| Number of Threads | TPS / Analytical Queries (Load Average) |
|---|---|
| 0 | 53 |
| 1 | 42/17 (66) |
| 4 | 42/39 (69) |
| Not limited | 42/108 (86) |
Recommendations for simultaneous OLAP and OLTP workload:
If OLTP sessions use a large number of CPU cores, you can initialize a limited number of OLAP threads without negatively impacting OLTP workload. This is ensured by the OS scheduler, which distributes CPU time evenly across threads.
Limit the number of threads and memory consumption by analytical sessions using pgpro_duckdb configuration parameters.
Store Parquet files and the temporary file directory on a separate physical storage to avoid I/O competition.
This section describes a basic work scenario for an analyst, assuming the following prerequisites are met:
Postgres Pro is deployed with the pgpro_duckdb extension installed.
An instance of the MinIO S3 storage is deployed and contains Parquet files.
The DBeaver application is installed and connected to the database.
The basic scenario includes the following steps:
Configuring the pgpro_duckdb extension with
the MinIO S3 storage. The administrator creates a
server for the S3 storage and a user mapping for the user that will work
with the S3 storage. The commands are executed under the
postgres user:
postgres=# CREATE SERVER my_secret1
TYPE 's3'
FOREIGN DATA WRAPPER duckdb
OPTIONS (
endpoint '127.0.0.1:9000',
url_style 'path',
use_ssl 'FALSE'
);
CREATE SERVER
postgres=# CREATE USER MAPPING FOR user1 SERVER my_secret1
OPTIONS (KEY_ID 'minioadmin', SECRET 'minioadmin');
CREATE USER MAPPING
Creating a view for accessing Parquet files:
postgres=# CREATE VIEW orders_view AS
SELECT
    r['o_orderkey']::bigint o_orderkey,
    r['o_custkey']::bigint o_custkey,
    r['o_orderstatus']::char o_orderstatus,
    r['o_totalprice']::double precision o_totalprice,
    r['o_orderdate']::date o_orderdate,
    r['o_orderpriority']::text o_orderpriority,
    r['o_clerk']::text o_clerk,
    r['o_shippriority']::bigint o_shippriority,
    r['o_comment']::text o_comment
FROM read_parquet('s3://bucket1/orders/orders.parquet') r;
CREATE VIEW
In this query, the required columns are selected from the Parquet file
using the r['column_name'] syntax, and data types are specified with the
r['column_name']::type syntax. Furthermore, column aliases
and the r alias for the read_parquet
function call are created for the convenience of working with data.
Now, the analyst can work with the created view:
test=# SELECT o_orderkey, o_totalprice FROM orders_view LIMIT 3;
o_orderkey | o_totalprice
------------+--------------
1 | 224560.83
2 | 75388.65
3 | 255287.36
(3 rows)
test=# SELECT AVG(o_totalprice) FROM orders_view ;
AVG
-------------------
151121.1229494574
(1 row)
Working with the created view using the DBeaver application. The analyst can connect to the database, access the created view, and execute analytical queries. The DBeaver application displays column types specified at view creation.