While evaluating performance of DML statements, it's important to understand how they are processed in Shardman.
First of all, the execution of INSERT significantly
differs from the execution of UPDATE and DELETE
statements. The behavior of INSERT is controlled by
batch_size foreign server option, which
can be set in FDWOptions section of
Shardman configuration file
.
If batch_size is more than 0,
INSERTing in the same statement several values
which falls into the same foreign partition leads to the values being grouped together
in batches of specified size. Remote INSERT statements are prepared
with the necessary number of parameters and then are executed with the
given values. If number of values doesn't match the number of prepared
arguments, the modified statement with the necessary number of parameters is prepared
again. The batch insert optimization can fail if transaction inserts records one-by-one
or records routed to different foreign tables are intermixed
in one INSERT statement. A batch is formed for a single
foreign modify operation. It is sent to the remote server when the batch is filled
or when the modify operation is over. The modify operation is over when
we start routing tuples to another sharded table partition.
So, for bulk load inserting multiple values in a single INSERT
command or using COPY is recommended (as COPY
is optimized in a similar way). Large batch_size
values allow to issue less INSERT
statements on remote side and so significantly reduce communication
costs. However, during construction of parameters for
prepared INSERT statements all
inserted values should be copied to libpq-allocated
memory. This can lead to unrestricted memory usage on the side
of a query coordinator when several large text
or bytea objects are loaded.
UPDATE and DELETE statements
can be executed in direct or indirect mode. A direct mode is used
when statement can be directly sent to a foreign server.
In this mode on the base of the original ModifyTable
plan node, a new statement is created to modify a table on the remote
server. Using direct update is not always possible.
In particular, it's impossible when some conditions should be evaluated
locally. In this case much less efficient indirect modification
is used. An indirect modification includes several statements -
the first one is SELECT FOR UPDATE to lock remote
rows. The second is a real UPDATE or
DELETE which is prepared once and then
executed with different parameters for each row of the
SELECT FOR UPDATE statement result
after local filters are applied to the result.
Evidently, direct modifications are much more efficient.
You can easily identify if DML statement is going to be executed in direct or indirect mode looking at the query plan. A typical example of indirect modification is:
EXPLAIN VERBOSE DELETE FROM pgbench_history
WHERE bid = 20 AND mtime > '2023-03-14 10:00:00'::timestamptz;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Delete on public.pgbench_history (cost=100.00..142.66 rows=0 width=0)
Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1
Remote SQL: DELETE FROM public.pgbench_history_17 WHERE ctid = $1
-> Foreign Scan on public.pgbench_history_17_fdw pgbench_history_1 (cost=100.00..142.66 rows=4 width=10)
Output: pgbench_history_1.tableoid, pgbench_history_1.ctid
Filter: (pgbench_history_1.mtime > '2023-03-14 10:00:00+03'::timestamp with time zone)
Remote SQL: SELECT mtime, ctid FROM public.pgbench_history_17 WHERE ((bid = 20)) FOR UPDATE
If we had chosen another type for string constant, this would become direct update.
EXPLAIN VERBOSE DELETE FROM pgbench_history
WHERE bid = 20 AND mtime > '2023-03-14 10:00:00'::timestamp;
explain verbose delete from pgbench_history where bid = 20 and mtime > '2023-03-14 10:00:00'::timestamp;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Delete on public.pgbench_history (cost=100.00..146.97 rows=0 width=0)
Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1
-> Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1 (cost=100.00..146.97 rows=4 width=10)
Remote SQL: DELETE FROM public.pgbench_history_17 WHERE ((mtime > '2023-03-14 10:00:00'::timestamp without time zone)) AND ((bid = 20))
In direct update mode we see that only one statement is executed on the remote server.