shardman-monitor — Shardman monitor
shardman-monitor [common_options] [--no-xact-resolver] [--no-deadlock-detector] [--deadlock-timeout timeout]
Here common_options are:
[--cluster-name cluster_name] [--log-level error | warn | info | debug] [--retries retries_number] [--session-timeout seconds] [--store-endpoints store_endpoints] [--store-ca-file store_ca_file] [--store-cert-file store_cert_file] [--store-key client_private_key] [--store-timeout duration] [--version] [-h | --help]
shardman-monitor is a Shardman monitoring daemon. A cluster usually
runs several monitor instances to provide fault tolerance and
load distribution.
The usual name of the shardman-monitor systemd service
is shardman-monitor@CLUSTER_NAME.service.
shardman-monitor performs the following tasks:
It makes sure that each replication group is aware of the location of
the current master of every other replication group and that
postgres_fdw settings for foreign servers match FDWOptions in the
shardman/cluster0/data/cluster etcd key. (See FDWOptions for details.)
Since shardman-monitor modifies foreign server settings, manual changes to the settings of foreign servers that correspond to replication groups are lost.
It resolves the prepared distributed (2PC) transactions according to the transaction status on its coordinator.
It resolves distributed deadlocks by aborting one of the transactions involved in the deadlock.
shardman-monitor accepts the following command-line options. Some of them are specific to the monitoring daemon, while the rest are common for Shardman utilities.
--deadlock-timeout timeout
Specifies the interval between deadlock checks. Accepted formats are the same as for
PostgreSQL GUC settings. The default unit is ms, as for the PostgreSQL
deadlock_timeout setting.
The default is 2s.
--no-deadlock-detector
Tells shardman-monitor not to run the deadlock detector.
--no-xact-resolver
Tells shardman-monitor not to run the resolver of prepared distributed transactions.
Common options are optional parameters that are not specific to
shardman-monitor. They specify etcd connection settings, the cluster
name, and a few other settings.
By default, shardman-monitor tries to connect to
the etcd store at 127.0.0.1:2379 and uses the cluster0
cluster name. The default log level is info.
-h, --help
Shows brief usage information.
--cluster-name cluster_name
Specifies the name for a cluster to operate on.
The default is cluster0.
--log-level level
Specifies the log verbosity. Possible values of
level are (from minimum to maximum):
error,
warn, info and
debug. The default is info.
--retries number
Specifies how many times shardman-monitor retries a failing etcd request. If an etcd request fails, most likely due to a connectivity issue, shardman-monitor retries it the specified number of times before reporting an error. The default is 5.
--session-timeout seconds
Specifies the session timeout for shardman-monitor locks. If there is no connectivity between shardman-monitor and the etcd store for the specified number of seconds, the lock is released. The default is 30.
--store-endpoints string
Specifies the etcd address in the format:
http[s]://address[:port](,http[s]://address[:port])*.
The default is http://127.0.0.1:2379.
--store-ca-file string
Specifies the CA bundle used to verify the certificate of the HTTPS-enabled etcd store server.
--store-cert-file string
Specifies the certificate file for client identification by the etcd store.
--store-key string
Specifies the private key file for client identification by the etcd store.
--store-timeout duration
Specifies the timeout for an etcd request. The default is 5 seconds.
--version
Shows shardman-utils version information.
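The --store-endpoints format described above can be checked before the monitor is started. The function below is a hypothetical helper, not part of Shardman; it is a simplified sketch that assumes host names or IPv4 addresses (no IPv6 literals) and only tests the shape of the string, not whether the endpoints are reachable.

```shell
# Hypothetical helper: returns 0 if the argument looks like
#   http[s]://address[:port](,http[s]://address[:port])*
# Simplification: addresses must not contain ':' or '/', so IPv6 literals are rejected.
valid_store_endpoints() {
    one='https?://[^,:/ ]+(:[0-9]+)?'
    printf '%s\n' "$1" | grep -Eq "^${one}(,${one})*\$"
}

# Example: check a value before passing it to --store-endpoints.
if valid_store_endpoints "http://127.0.0.1:2379"; then
    echo "endpoint list looks well-formed"
fi
```

A value without a scheme, such as 127.0.0.1:2379, is rejected by this check because the documented format requires http:// or https://.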
The following environment variables affect the behavior of shardman-monitor.
SDM_CLUSTER_NAME
An alternative to setting the --cluster-name
option
SDM_DEADLOCK_TIMEOUT
An alternative to setting the --deadlock-timeout
option
SDM_LOG_LEVEL
An alternative to setting the --log-level
option
SDM_NO_DEADLOCK_DETECTOR
An alternative to setting the
--no-deadlock-detector option
SDM_NO_XACT_RESOLVER
An alternative to setting the --no-xact-resolver
option
SDM_RETRIES
An alternative to setting the --retries
option
SDM_STORE_ENDPOINTS
An alternative to setting the --store-endpoints
option
SDM_STORE_CA_FILE
An alternative to setting the --store-ca-file
option
SDM_STORE_CERT_FILE
An alternative to setting the --store-cert-file
option
SDM_STORE_KEY
An alternative to setting the --store-key
option
SDM_STORE_TIMEOUT
An alternative to setting the --store-timeout
option
SDM_SESSION_TIMEOUT
An alternative to setting the --session-timeout
option
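As a rough illustration of the variables listed above, the following sketch exports a few of them and prints what a shardman-monitor process started from the same shell would inherit. The values are examples only: the cluster name and endpoint are the documented defaults, while debug and 5s are arbitrary choices.

```shell
# Illustrative values only; adjust them to your cluster.
export SDM_CLUSTER_NAME=cluster0                   # alternative to --cluster-name
export SDM_LOG_LEVEL=debug                         # alternative to --log-level
export SDM_DEADLOCK_TIMEOUT=5s                     # alternative to --deadlock-timeout
export SDM_STORE_ENDPOINTS=http://127.0.0.1:2379   # alternative to --store-endpoints

# List the SDM_* settings visible to child processes of this shell.
env | grep '^SDM_' | sort
```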
To view shardman-monitor logs, use the journalctl command:
$ journalctl -u shardman-monitor@cluster0.service
Assume that a sharded table was created like this:
postgres=# create table public.players(id int, username text, pass text) with (distributed_by='id');
postgres=# insert into players select id, 'user_' || id, 'pass_' || id from generate_series(1,1000) id;
Assume also that the records with IDs 3 and 18 belong to different shards. Now consider this sequence
of queries to n1 and n3:
n1=# begin;
n1=# update players set pass='somevalue' where id=18;
n3=# begin;
n3=# update players set pass='othervalue' where id=3;
n3=# update players set pass='othervalue' where id=18;
The transaction on n3 waits for a lock held by the transaction
on n1, and the transaction on n1 continues.
n1=# update players set pass='somevalue' where id=3;
ERROR: canceling statement due to user request
CONTEXT: while updating tuple (0,1) in relation "players_2"
remote SQL command: UPDATE public.players_2 SET pass = 'somevalue'::text WHERE ((id = 3))
Looking at shardman-monitor@cluster0 logs, on one of the monitors you can see something like this:
Feb 10 16:31:23 vg1 shardman-monitor[134814]: 2021-02-10T16:31:23.024+0300 INFO cmd/monitor.go:914 Deadlock discovered:
Feb 10 16:31:23 vg1 shardman-monitor[134814]: 3:185860->4:186922->4:185362->1:185862->1:187368->
Feb 10 16:31:23 vg1 shardman-monitor[134814]: Canceling pid 185362 at repgroup 4... {"goroutine": "deadlock detector"}
Feb 10 16:31:23 vg1 shardman-monitor[134814]: 2021-02-10T16:31:23.027+0300 INFO cmd/monitor.go:925 successfully canceled backend 185362 at repgroup 4 {"goroutine": "deadlock detector"}
shardman-monitor detected the deadlock and canceled one of the transactions involved in it.
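When reviewing many such entries, it can help to reduce the log to the backends that were canceled. The filter below is a minimal sketch; canceled_backends is a hypothetical helper name, and the pattern assumes the message wording matches the sample log lines shown above.

```shell
# Hypothetical helper: read monitor log lines on stdin and print
# "<pid> <repgroup>" for every successfully canceled backend.
canceled_backends() {
    sed -n 's/.*successfully canceled backend \([0-9][0-9]*\) at repgroup \([0-9][0-9]*\).*/\1 \2/p'
}

# Demonstration on a line copied from the sample log.
sample='Feb 10 16:31:23 vg1 shardman-monitor[134814]: 2021-02-10T16:31:23.027+0300 INFO cmd/monitor.go:925 successfully canceled backend 185362 at repgroup 4 {"goroutine": "deadlock detector"}'
printf '%s\n' "$sample" | canceled_backends   # prints: 185362 4
```

With access to the journal, the same function can be fed from journalctl -u shardman-monitor@cluster0.service through a pipe.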