While evaluating performance of DML statements, it is important to understand how they are processed in Shardman.
First of all, the execution of INSERT significantly
differs from the execution of UPDATE and DELETE
statements. The behavior of INSERT is controlled by
batch_size foreign server option, which
can be set in FDWOptions section of
Shardman configuration
file.
If batch_size is greater than 0,
an INSERT in the same statement of several values
that fall into the same foreign partition leads to the values being grouped together
in batches of the specified size. Remote INSERT statements are prepared
with the necessary number of parameters and then are executed with the
given values. If the number of values does not match the number of prepared
arguments, the modified statement with the necessary number of parameters is prepared
again. A batch insert optimization can fail if a transaction inserts records one by one
or records routed to different foreign tables are intermixed
in one INSERT statement. A batch is formed for a single
foreign modify operation. It is sent to the remote server when the batch is filled
or when the modify operation is over. The modify operation is over when
we start routing tuples to another sharded table partition.
So, for bulk load, inserting multiple values in a single INSERT
command or using COPY is recommended (as COPY
is optimized in a similar way). Large batch_size
values allow issuing less INSERT
statements on remote side and so significantly reduce communication
costs. However, during construction of parameters for
prepared INSERT statements, all
inserted values should be copied to libpq-allocated
memory. This can lead to unrestricted memory usage on the query coordinator
side when several large text
or bytea objects are loaded.
UPDATE and DELETE statements
can be executed in a direct or indirect mode. A direct mode is used
when a statement can be directly sent to a foreign server.
In this mode, to modify a table on a remote server, a new statement
is created based on the original ModifyTable
plan node. Using a direct update is not always possible.
In particular, it is impossible when some conditions should be evaluated
locally. In this case, a much less efficient indirect modification
is used. An indirect modification includes several statements.
The first one is SELECT FOR UPDATE to lock remote
rows. The second one is an actual UPDATE or
DELETE, which is prepared once and then
executed with different parameters for each row of the
SELECT FOR UPDATE statement result
after local filters are applied to the result.
Evidently, direct modifications are much more efficient.
You can easily identify whether a DML statement is going to be executed in a direct or indirect mode looking at the query plan. A typical example of an indirect modification is:
EXPLAIN VERBOSE DELETE FROM pgbench_history
WHERE bid = 20 AND mtime > '2023-03-14 10:00:00'::timestamptz;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Delete on public.pgbench_history (cost=100.00..142.66 rows=0 width=0)
Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1
Remote SQL: DELETE FROM public.pgbench_history_17 WHERE ctid = $1
-> Foreign Scan on public.pgbench_history_17_fdw pgbench_history_1 (cost=100.00..142.66 rows=4 width=10)
Output: pgbench_history_1.tableoid, pgbench_history_1.ctid
Filter: (pgbench_history_1.mtime > '2023-03-14 10:00:00+03'::timestamp with time zone)
Remote SQL: SELECT mtime, ctid FROM public.pgbench_history_17 WHERE ((bid = 20)) FOR UPDATE
If we had chosen another type for the string constant, this would become a direct update.
EXPLAIN VERBOSE DELETE FROM pgbench_history
WHERE bid = 20 AND mtime > '2023-03-14 10:00:00'::timestamp;
explain verbose delete from pgbench_history where bid = 20 and mtime > '2023-03-14 10:00:00'::timestamp;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Delete on public.pgbench_history (cost=100.00..146.97 rows=0 width=0)
Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1
-> Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1 (cost=100.00..146.97 rows=4 width=10)
Remote SQL: DELETE FROM public.pgbench_history_17 WHERE ((mtime > '2023-03-14 10:00:00'::timestamp without time zone)) AND ((bid = 20))
We see that in a direct update mode, only one statement is executed on the remote server.