5.2. DML Optimizations

5.2. DML Optimizations
Prev	Up	Chapter 5. Performance Tuning	Home	Next

5.2. DML Optimizations #

5.2.1. DML Optimizations of Global Tables

While evaluating performance of DML statements, it is important to understand how they are processed in Shardman.

First of all, the execution of INSERT significantly differs from the execution of UPDATE and DELETE statements. The behavior of INSERT for sharded tables is controlled by the batch_size foreign server option, which can be set in FDWOptions section of Shardman configuration file. If batch_size is greater than 0, an INSERT in the same statement of several values that fall into the same foreign partition leads to the values being grouped together in batches of the specified size. Remote INSERT statements are prepared with the necessary number of parameters and then are executed with the given values. If the number of values does not match the number of prepared arguments, the modified statement with the necessary number of parameters is prepared again. A batch insert optimization can fail if a transaction inserts records one by one or records routed to different foreign tables are intermixed in one INSERT statement. A batch is formed for a single foreign modify operation. It is sent to the remote server when the batch is filled or when the modify operation is over. The modify operation is over when we start routing tuples to another sharded table partition. So, for bulk load, inserting multiple values in a single INSERT command or using COPY is recommended (as COPY is optimized in a similar way). Large batch_size values allow issuing less INSERT statements on remote side and so significantly reduce communication costs. However, during construction of parameters for prepared INSERT statements, all inserted values should be copied to libpq-allocated memory. This can lead to unrestricted memory usage on the query coordinator side when several large text or bytea objects are loaded.

UPDATE and DELETE statements can be executed in a direct or indirect mode. A direct mode is used when a statement can be directly sent to a foreign server. In this mode, to modify a table on a remote server, a new statement is created based on the original ModifyTable plan node. Using a direct update is not always possible. In particular, it is impossible when some conditions should be evaluated locally. In this case, a much less efficient indirect modification is used. An indirect modification includes several statements. The first one is SELECT FOR UPDATE to lock remote rows. The second one is an actual UPDATE or DELETE, which is prepared once and then executed with different parameters for each row of the SELECT FOR UPDATE statement result after local filters are applied to the result. Evidently, direct modifications are much more efficient.

You can easily identify whether a DML statement is going to be executed in a direct or indirect mode looking at the query plan. A typical example of an indirect modification is:

EXPLAIN VERBOSE DELETE FROM pgbench_history
WHERE bid = 20 AND mtime > '2023-03-14 10:00:00'::timestamptz;
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 Delete on public.pgbench_history  (cost=100.00..142.66 rows=0 width=0)
   Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1
     Remote SQL: DELETE FROM public.pgbench_history_17 WHERE ctid = $1
   ->  Foreign Scan on public.pgbench_history_17_fdw pgbench_history_1  (cost=100.00..142.66 rows=4 width=10)
         Output: pgbench_history_1.tableoid, pgbench_history_1.ctid
         Filter: (pgbench_history_1.mtime > '2023-03-14 10:00:00+03'::timestamp with time zone)
         Remote SQL: SELECT mtime, ctid FROM public.pgbench_history_17 WHERE ((bid = 20)) FOR UPDATE

If we had chosen another type for the string constant, this would become a direct update.

EXPLAIN VERBOSE DELETE FROM pgbench_history
WHERE bid = 20 AND mtime > '2023-03-14 10:00:00'::timestamp;
explain verbose delete from pgbench_history where bid = 20 and mtime > '2023-03-14 10:00:00'::timestamp;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on public.pgbench_history  (cost=100.00..146.97 rows=0 width=0)
   Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1
   ->  Foreign Delete on public.pgbench_history_17_fdw pgbench_history_1  (cost=100.00..146.97 rows=4 width=10)
         Remote SQL: DELETE FROM public.pgbench_history_17 WHERE ((mtime > '2023-03-14 10:00:00'::timestamp without time zone)) AND ((bid = 20))

We see that in a direct update mode, only one statement is executed on the remote server.

5.2.1. DML Optimizations of Global Tables #

The shardman.gt_batch_size configuration parameter, which you can tune, defines the size of an intermediate buffer used before sending data to a remote server.

INSERT uses the binary protocol and creates batches of the shardman.gt_batch_size size. Large values of the buffer size enable sending fewer network requests on the remote side and thus substantially reduce the connection costs. On the other hand, large values of this parameter can increase memory consumption on the query coordinator side. Therefore, when specifying the buffer size, it is important to achieve a compromise between the connection costs and the allocated memory size.

For UPDATE, a query for each column and each row is created on the coordinator and sent to remote nodes.

For DELETE, a query for a batch of data of the shardman.gt_batch_size size is created on the coordinator and sent to remote nodes.

Prev	Up	Next
5.1. Examining Plans	Home	Chapter 6. Shardman Reference