Shardman enhances the
EXPLAIN command so that it can provide
additional information about a query if it is distributed.
The work with the distributed tables is based on the plan
nodes with the ForeignScan type.
A query to each remote partition is determined by a single plan node of this type, with
Shardman submitting additional information to the
EXPLAIN blocks with the node description.
When executing a distributed query, the part of the plan (a subtree) that
relates to a specific remote partition is serialized into an SQL statement.
This process is known as deparsing. Then, this statement is sent to a remote server.
The result of this query is the output of a ForeignScan node.
It is used to gather the final results of the distributed query execution.
When the VERBOSE option of the EXPLAIN
command is set to on, the Remote SQL
field of the ForeignScan node block shows
the statement sent to the remote server. Also, the Server
field indicates the name of the server as it was specified during
the cluster configuration and as it is displayed in pg_foreign_server,
along with the transport method used to send this statement.
The transport field can take two values: silk
for the the enhanced interconnect Shardman mechanism,
or libpq for sending via the standard
PostgreSQL protocol.
To see the execution plan that will be used on the remote server under
the EXPLAIN block of the ForeignScan node,
use the postgres_fdw.foreign_explain configuration parameter.
The possible values are: none to exclude
the EXPLAIN output from the remote servers, full
to include the EXPLAIN output from the remote servers,
collapsed to include the EXPLAIN output
only for the first ForeignScan node under its
Append/MergeAppend.
In production, it is recommended to disable this parameter (set it
to none) or set it to collapsed,
because obtaining any EXPLAIN information
results in an additional implicit request to the server. Moreover, this
request is executed in a synchronous mode, meaning the overall
EXPLAIN output is built only once all the servers
are sequentially queried.
It can be a costly operationin case of a table with a large number of partitions.
Note that in case of the internal request for obtaining the EXPLAIN
blocks for a remote plan, certain parameters are forcibly disabled, regardless
of the parameters specified by a user when requesting EXPLAIN
from the coordinator: ANALYZE OFF, TIMING OFF,
SUMMARY OFF, SETTINGS OFF, NETWORK OFF.
In this case, the EXPLAIN block of a remote plan will lack
the corresponding metrics. Other EXPLAIN parameters
(FORMAT, VERBOSE, COSTS, BUFFERS,
WAL) are inherited from the coordinator.
If the subplan deparsing forms a statement that includes parameters
(in the statement using symbols $1,
$2, etc.), such a statement
generally cannot be sent to the remote server to obtain EXPLAIN
results. Therefore, the ForeignExplain blocks are not
formed for the SQL statements with parameters.
Setting the NETWORK option of the EXPLAIN
command to on shows the network operation metrics
for the plan nodes, including individual ForeignScan nodes
and general nodes Append or MergeAppend.
For each plan node, the FDW bytes, sent,
and received parameters are displayed for the outgoing
and incoming traffic when the node is executed (regardless of the transport type).
Note that these metrics are only output when the ANALYZE
option of the EXPLAIN
command is set to on.
When the track_fdw_wait_timing configuration parameter
is enabled, the wait_time metric is also output.
This metric summarizes all stages of the plan node execution,
starting from the time the request is sent to the remote server,
including the time spent on the execution itself
and all the time until the complete set of results for that plan node is received.
Note that the ForeignScan node can operate in both
synchronous and asynchronous modes. For the asynchronous execution,
the node's execution function sends a request to the remote server
and completes its execution without waiting for the result.
The result is considered and processed later, upon receipt.
In this scenario, the wait_time metric may not
accurately reflect the actual execution time.
For the Silk transport, there is an option
to output the extended debug information about tracing of a query passing
from the coordinator to the remote server and back, including
the results from the remote server. This information is only
available if the ANALYZE option of the EXPLAIN
command is set to on, and
the shardman.silk_tracepoints configuration parameter
is enabled.
When these parameters are enabled, each message transferred through
the Silk transport (sending the SQL query,
delivering it to the recipient, executing the query, and
returning the execution result) is accompanied by an array of
the timestamps measured at certain points in the pipeline. Once the query
is executed, this information is displayed in the
EXPLAIN block as rows starting with the word Trace.
Each metric represents the difference between the timestamps at
different points, in milliseconds:
Table 5.1. Query Tracing for Silk Transport Metrics
| Interval | Description |
|---|---|
| bk shm->mp1 (qry) | The time taken to transfer an SQL query from the coordinator to its multiplexer via the shared memory. |
| mp1 shm->net (qry) | The time between receiving a query within the multiplexer from the shared memory and transferring it over the network. |
| net (qry) | The time spent by an SQL query to transfer over the network between the multiplexers. |
| mp2 recv->shm (qry) | The time between receiving an SQL query from the network and placing it in the queue in the shared memory on a remote multiplexer. |
| wk exec (1st tup) | The time spent to execute a query in Silkworm
until the first row of the result is received. |
| wk exec (all tups) | The time spent to execute a query on Silkworm
until the complete result is received. |
| wk->shm (1st tup) | The time taken to place the first row of the result into the
Silkworm queue. |
| wk->shm (last tup) | The time taken to place the last row of the result
into the Silkworm queue. |
| mp2 shm->net (1st tup) | The time between reading the first row of the result from the queue by the remote multiplexer and transferring it over the network. |
| net (1st tup) | The time spent to transfer the first row of the result over the network between the multiplexers. |
| mp1 recv->shm (1st tup) | The time between receiving the first row of the result from the network and placing it in the queue by the local multiplexer. |
| mp1 shm->bk (1st tup) | The time spent to retrieve the first row of the result from the queue by the coordinator. |
| mp2 shm->net (last tup) | The time between reading of the last row of the result from the queue by the remote multiplexer and transferring it over the network. |
| net (last tup) | The time spent to transfer the last row of the result over the network between the multiplexers. |
| mp1 recv->shm (last tup) | The time between receiving the last row of the result from the network and placing it in the queue by the local multiplexer. |
| mp1 shm->bk (last tup) | The time taken by the coordinator to retrieve the last row o the result from the queue. |
| END-TO-END | The total time from sending the query to receiving the last
row of the result. This approximately corresponds to the wait_time. |
For the metrics net (qry), net (1st tup),
and net (last tup), the interval value is calculated
as the difference between timestamps on different servers.
Therefore, negative values may appear in these lines.
This difference includes both time spent on a message transfer and the clock drift
(positive or negative) between these servers.
Thus, even with a slight drift, the values will be negative if its absolute value exceeds the duration
of network transfer. Although it is not a bug, you should pay close
attention to whether the cluster clocks are synchronized. For more information,
see Section 5.3.