shardman-monitor


shardman-monitor — Shardman monitor

Synopsis

shardman-monitor [common_options] [--no-xact-resolver] [--no-deadlock-detector] [ --deadlock-timeout timeout ]

Here common_options are:

[--cluster-name cluster_name] [--log-level error | warn | info | debug ] [--retries retries_number] [--session-timeout seconds] [--store-endpoints store_endpoints] [--store-ca-file store_ca_file] [--store-cert-file store_cert_file] [--store-key client_private_key] [--store-timeout duration] [--version] [ -h | --help ]

Description

shardman-monitor is a Shardman monitoring daemon. A cluster usually runs several monitor instances to provide fault tolerance and load distribution. The usual name of the shardman-monitor systemd service is shardman-monitor@CLUSTER_NAME.service.

shardman-monitor performs the following tasks:

It makes sure that each replication group knows the location of the current master of every other replication group and that postgres_fdw settings for foreign servers match FDWOptions in clusterdata (see FDWOptions for details).
Since shardman-monitor modifies foreign server settings, manual changes to the settings of foreign servers that correspond to replication groups are lost.
It resolves the prepared distributed (2PC) transactions according to the transaction status on its coordinator.
It resolves distributed deadlocks by aborting one of the transactions involved in the deadlock.

Options

shardman-monitor accepts the following command-line options. Some of them are specific to the monitoring daemon, while the rest are common for Shardman utilities.

Options Specific to shardman-monitor

--deadlock-timeout timeout

Specifies the interval between deadlock checks. Accepted formats are the same as in PostgreSQL GUC settings. The default unit is ms, as for the PostgreSQL deadlock_timeout parameter.

The default is 2s.

--no-deadlock-detector

Tells shardman-monitor not to run the deadlock detector.

--no-xact-resolver

Tells shardman-monitor not to run the resolver of prepared distributed transactions.
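
The following sketch illustrates how values for --deadlock-timeout can be read, assuming the PostgreSQL-style units ms, s, min, h and d and treating a bare number as milliseconds. The to_ms helper is hypothetical and is not part of Shardman; it only shows that 2s, 2000ms and 2000 denote the same interval.

```shell
#!/bin/sh
# Hypothetical helper: convert a PostgreSQL-style duration to milliseconds.
# A bare number is taken as milliseconds, matching the default unit.
to_ms() {
    value=$1
    case $value in
        *us)  echo "unsupported unit: $value" >&2; return 1 ;;
        *ms)  echo "${value%ms}" ;;
        *min) echo $(( ${value%min} * 60000 )) ;;
        *s)   echo $(( ${value%s} * 1000 )) ;;
        *h)   echo $(( ${value%h} * 3600000 )) ;;
        *d)   echo $(( ${value%d} * 86400000 )) ;;
        *)    echo "$value" ;;
    esac
}

to_ms 2s       # the documented default
to_ms 2000ms
to_ms 2000
```

All three calls print the same number of milliseconds, so any of these spellings could be passed to --deadlock-timeout under the stated assumption.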

Common Options

shardman-monitor common options are optional parameters that are not specific to the utility. They specify etcd connection settings, the cluster name and a few more settings. By default, shardman-monitor tries to connect to the etcd store at 127.0.0.1:2379 and uses the cluster0 cluster name. The default log level is info.

-h, --help: Show brief usage information
--cluster-name cluster_name: Specifies the name for a cluster to operate on. The default is cluster0.
--log-level level: Specifies the log verbosity. Possible values of level are (from minimum to maximum): error, warn, info and debug. The default is info.
--retries number: Specifies how many times shardman-monitor retries a failing etcd request. If an etcd request fails, most likely due to a connectivity issue, shardman-monitor retries it the specified number of times before reporting an error. The default is 5.
--session-timeout seconds: Specifies the session timeout for shardman-monitor locks. If there is no connectivity between shardman-monitor and the etcd store for the specified number of seconds, the lock is released. The default is 30.
--store-endpoints string: Specifies the etcd address in the format: http[s]://address[:port](,http[s]://address[:port])*. The default is http://127.0.0.1:2379.
--store-ca-file string: Specifies the CA bundle used to verify the certificate of the HTTPS-enabled etcd store server
--store-cert-file string: Specifies the certificate file for client identification by the etcd store
--store-key string: Specifies the private key file for client identification by the etcd store
--store-timeout duration: Specifies the timeout for an etcd request. The default is 5 seconds.
--version: Show shardman-utils version information
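
As an illustration of the --store-endpoints format, the sketch below builds a comma-separated endpoint list from host names. The join_endpoints helper and the etcd1, etcd2 and etcd3 host names are hypothetical and only serve the example.

```shell
#!/bin/sh
# Hypothetical helper: join hosts into a --store-endpoints value.
# Usage: join_endpoints SCHEME PORT HOST...
join_endpoints() {
    scheme=$1
    port=$2
    shift 2
    result=""
    for host in "$@"; do
        # Prepend a comma only when the list already has an entry.
        result="${result:+$result,}${scheme}://${host}:${port}"
    done
    printf '%s\n' "$result"
}

# Host names below are placeholders.
join_endpoints https 2379 etcd1 etcd2 etcd3
```

The resulting string could then be passed on the command line, for example as shardman-monitor --cluster-name cluster0 --store-endpoints "$(join_endpoints https 2379 etcd1 etcd2 etcd3)".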

Environment

The following environment variables affect the behavior of shardman-monitor.

SDM_CLUSTER_NAME: An alternative to setting the --cluster-name option
SDM_DEADLOCK_TIMEOUT: An alternative to setting the --deadlock-timeout option
SDM_LOG_LEVEL: An alternative to setting the --log-level option
SDM_NO_DEADLOCK_DETECTOR: An alternative to setting the --no-deadlock-detector option
SDM_NO_XACT_RESOLVER: An alternative to setting the --no-xact-resolver option
SDM_RETRIES: An alternative to setting the --retries option
SDM_STORE_ENDPOINTS: An alternative to setting the --store-endpoints option
SDM_STORE_CA_FILE: An alternative to setting the --store-ca-file option
SDM_STORE_CERT_FILE: An alternative to setting the --store-cert-file option
SDM_STORE_KEY: An alternative to setting the --store-key option
SDM_STORE_TIMEOUT: An alternative to setting the --store-timeout option
SDM_SESSION_TIMEOUT: An alternative to setting the --session-timeout option
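
A minimal sketch of configuring the monitor through the environment instead of command-line options. The values are examples only, and the listing at the end simply shows which SDM_ variables a process started from this shell would inherit.

```shell
#!/bin/sh
# Each variable stands in for the option of the same name; values are examples.
export SDM_CLUSTER_NAME=cluster0
export SDM_LOG_LEVEL=debug
export SDM_STORE_ENDPOINTS=http://127.0.0.1:2379
export SDM_DEADLOCK_TIMEOUT=5s

# List the SDM_* variables visible to child processes of this shell.
env | grep '^SDM_' | sort
```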

Examples

Showing shardman-monitor Logs

To look at shardman-monitor logs, you can use a journalctl command:

$ journalctl -u shardman-monitor@cluster0.service
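
For clusters with a name other than cluster0, the unit name can be derived from the cluster name. The sketch below assumes the usual shardman-monitor@CLUSTER_NAME.service naming; the monitor_unit and monitor_logs helpers are hypothetical wrappers around standard journalctl options.

```shell
#!/bin/sh
# Hypothetical helper: build the systemd unit name for a cluster.
monitor_unit() {
    printf 'shardman-monitor@%s.service\n' "${1:-cluster0}"
}

# Hypothetical helper: show the last hour of monitor logs for a cluster.
# Defined here but not called, since it needs a host with systemd-journald.
monitor_logs() {
    journalctl -u "$(monitor_unit "$1")" --no-pager --since "1 hour ago"
}

monitor_unit cluster0
```

Calling monitor_logs cluster0 on a node that runs the service would print the recent log entries; adding -f to the journalctl call follows new entries as they arrive.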

Detecting Deadlocks

Assume that a sharded table was created like this:

 postgres=#   create table public.players(id int, username text, pass text) with (distributed_by='id'); 
 postgres=#   insert into players select id, 'user_'|| id, 'pass_' || id from generate_series(1,1000) id;

Assume also that the records with IDs 3 and 18 belong to different shards. Now consider the following sequence of queries on nodes n1 and n3:

 n1=#   begin; 
 n1=#   update players set pass='somevalue' where id=18;
 n3=#   begin;
 n3=#   update players set pass='othervalue' where id=3;
 n3=#   update players set pass='othervalue' where id=18;

The transaction on n3 now waits for a lock held by the transaction on n1. Meanwhile, the transaction on n1 continues:

 n1=#   update players set pass='somevalue' where id=3;

ERROR:  canceling statement due to user request
CONTEXT:  while updating tuple (0,1) in relation "players_2"
remote SQL command: UPDATE public.players_2 SET pass = 'somevalue'::text WHERE ((id = 3))

In the shardman-monitor@cluster0 logs on one of the monitors, you can see something like this:


Feb 10 16:31:23 vg1 shardman-monitor[134814]: 2021-02-10T16:31:23.024+0300        INFO        cmd/monitor.go:914        Deadlock discovered:
Feb 10 16:31:23 vg1 shardman-monitor[134814]:   3:185860->4:186922->4:185362->1:185862->1:187368->
Feb 10 16:31:23 vg1 shardman-monitor[134814]:   Canceling pid 185362 at repgroup 4...        {"goroutine": "deadlock detector"}
Feb 10 16:31:23 vg1 shardman-monitor[134814]: 2021-02-10T16:31:23.027+0300        INFO        cmd/monitor.go:925        successfully canceled backend 185362 at repgroup 4        {"goroutine": "deadlock detector"}

shardman-monitor detected the deadlock and resolved it by canceling one of the transactions involved, here the backend with PID 185362 in replication group 4.
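
The wait-for chain in the log can be split into individual entries for easier reading. The sketch below assumes that each element has the form repgroup:pid, which is inferred from the "Canceling pid 185362 at repgroup 4" line matching the 4:185362 element and is not a documented format. The chain_to_pairs helper is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one "repgroup=N pid=M" line per chain element.
chain_to_pairs() {
    printf '%s\n' "$1" | awk -F'->' '{
        for (i = 1; i <= NF; i++) {
            if ($i == "") continue      # skip the empty field after the trailing "->"
            split($i, part, ":")
            printf "repgroup=%s pid=%s\n", part[1], part[2]
        }
    }'
}

# Chain copied from the log excerpt.
chain_to_pairs '3:185860->4:186922->4:185362->1:185862->1:187368->'
```

Under this reading, the third line of the output corresponds to the backend that the monitor canceled.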