shardman-monitor

shardman-monitor — Shardman monitor

Synopsis

shardman-monitor [common_options] [--no-xact-resolver] [--no-deadlock-detector] [ --deadlock-timeout timeout ]

Here common_options are:

[--cluster-name cluster_name] [--log-level error | warn | info | debug ] [--retries retries_number] [--session-timeout seconds] [--store-endpoints store_endpoints] [--store-ca-file store_ca_file] [--store-cert-file store_cert_file] [--store-key client_private_key] [--store-timeout duration] [--version] [ -h | --help ]

Description

shardman-monitor is a Shardman monitoring daemon. There are usually several monitor instances in the cluster. Several instances are necessary for fault tolerance and load distribution. The usual name of the shardman-monitor systemd service is shardman-monitor@CLUSTER_NAME.service.

shardman-monitor performs the following tasks:

  • It makes sure that each replication group is aware of the location of the current master of any other replication group and that postgres_fdw settings for foreign servers match FDWOptions in clusterdata. See FDWOptions for details.

    Since shardman-monitor modifies foreign server settings, manual changes to the settings of foreign servers corresponding to replication groups are lost.

  • It resolves the prepared distributed (2PC) transactions according to the transaction status on its coordinator.

  • It resolves distributed deadlocks by aborting one of the transactions involved in the deadlock.

Options

shardman-monitor accepts the following command-line options. Some of them are specific to the monitoring daemon, while the rest are common for Shardman utilities.

Options Specific to shardman-monitor

--deadlock-timeout timeout

Specifies the interval between deadlock checks. Accepted formats are the same as in PostgreSQL GUC settings. A value without a unit is interpreted as milliseconds, as for PostgreSQL deadlock_timeout.

The default is 2s.
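
As an illustration of the unit convention described above, the following sketch converts a PostgreSQL-style duration to milliseconds. The to_ms helper is hypothetical and is not part of Shardman. It only mirrors the documented rule that a bare number is read as milliseconds, and it assumes integer values with no whitespace between the number and the unit.

```shell
# Hypothetical helper (not part of Shardman): convert a PostgreSQL-style
# duration such as "2s" or "500" to milliseconds.
# Assumptions: integer values only, no space between number and unit,
# and only the ms, s, min, h and d units are handled.
to_ms() {
    num=${1%%[!0-9]*}   # leading digits
    unit=${1#"$num"}    # whatever follows the digits
    [ -n "$num" ] || { echo "not a number: $1" >&2; return 1; }
    case $unit in
        ''|ms) echo "$num" ;;                  # bare numbers are milliseconds
        s)     echo $((num * 1000)) ;;
        min)   echo $((num * 60000)) ;;
        h)     echo $((num * 3600000)) ;;
        d)     echo $((num * 86400000)) ;;
        *)     echo "unsupported unit: $unit" >&2; return 1 ;;
    esac
}

to_ms 2s     # prints 2000 (the documented default)
to_ms 500    # prints 500 (bare number, read as ms)
```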

--no-deadlock-detector

Disables the deadlock detector

--no-xact-resolver

Disables the resolver of prepared distributed transactions

Common Options

shardman-monitor common options are optional parameters that are not specific to the utility. They specify etcd connection settings, the cluster name, and a few more settings. By default, shardman-monitor tries to connect to the etcd store at 127.0.0.1:2379 and uses the cluster0 cluster name. The default log level is info.

-h, --help

Show brief usage information

--cluster-name cluster_name

Specifies the name for a cluster to operate on. The default is cluster0.

--log-level level

Specifies the log verbosity. Possible values of level are (from minimum to maximum): error, warn, info and debug. The default is info.

--retries number

Specifies how many times shardman-monitor retries a failing etcd request. If an etcd request fails, most likely due to a connectivity issue, shardman-monitor retries it the specified number of times before reporting an error. The default is 5.

--session-timeout seconds

Specifies the session timeout for shardman-monitor locks. If there is no connectivity between shardman-monitor and the etcd store for the specified number of seconds, the lock is released. The default is 30.

--store-endpoints string

Specifies the etcd address in the format: http[s]://address[:port](,http[s]://address[:port])*. The default is http://127.0.0.1:2379.
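
The endpoint list format above can be checked before starting the daemon. The sketch below uses a hypothetical valid_store_endpoints helper built on grep -E. It assumes that addresses are host names or IPv4 addresses (IPv6 literals are not handled), and it is not how shardman-monitor itself parses the value.

```shell
# Hypothetical helper: checks a value against the documented format
#   http[s]://address[:port](,http[s]://address[:port])*
# Assumption: addresses are host names or IPv4 addresses only.
valid_store_endpoints() {
    endpoint='https?://[A-Za-z0-9.-]+(:[0-9]+)?'
    printf '%s\n' "$1" | grep -Eq "^${endpoint}(,${endpoint})*\$"
}

# Example: accept a two-node list, reject a value without a scheme.
valid_store_endpoints 'https://etcd1:2379,https://etcd2:2379' && echo "list is well-formed"
valid_store_endpoints '127.0.0.1:2379' || echo "scheme is missing"
```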

--store-ca-file string

Verify the certificate of the HTTPS-enabled etcd store server using this CA bundle

--store-cert-file string

Specifies the certificate file for client identification by the etcd store

--store-key string

Specifies the private key file for client identification by the etcd store

--store-timeout duration

Specifies the timeout for an etcd request. The default is 5 seconds.

--version

Show shardman-utils version information

Environment

The following environment variables affect the behavior of shardman-monitor.

SDM_CLUSTER_NAME

An alternative to setting the --cluster-name option

SDM_DEADLOCK_TIMEOUT

An alternative to setting the --deadlock-timeout option

SDM_LOG_LEVEL

An alternative to setting the --log-level option

SDM_NO_DEADLOCK_DETECTOR

An alternative to setting the --no-deadlock-detector option

SDM_NO_XACT_RESOLVER

An alternative to setting the --no-xact-resolver option

SDM_RETRIES

An alternative to setting the --retries option

SDM_STORE_ENDPOINTS

An alternative to setting the --store-endpoints option

SDM_STORE_CA_FILE

An alternative to setting the --store-ca-file option

SDM_STORE_CERT_FILE

An alternative to setting the --store-cert-file option

SDM_STORE_KEY

An alternative to setting the --store-key option

SDM_STORE_TIMEOUT

An alternative to setting the --store-timeout option

SDM_SESSION_TIMEOUT

An alternative to setting the --session-timeout option
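
A minimal sketch of configuring the daemon through the environment instead of command-line options. The variable names come from the list above. The values are assumptions chosen for illustration, and the value format of SDM_DEADLOCK_TIMEOUT is assumed to match that of --deadlock-timeout.

```shell
# Illustrative values only; adjust them to your cluster.
export SDM_CLUSTER_NAME=cluster0
export SDM_LOG_LEVEL=debug
export SDM_STORE_ENDPOINTS=http://127.0.0.1:2379
export SDM_DEADLOCK_TIMEOUT=5s   # assumed to accept the same format as the option

# With these variables set, starting the daemon as
#   shardman-monitor
# is expected to correspond to
#   shardman-monitor --cluster-name cluster0 --log-level debug \
#       --store-endpoints http://127.0.0.1:2379 --deadlock-timeout 5s

# List the Shardman-related variables currently exported.
env | grep '^SDM_' | sort
```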

Examples

Showing shardman-monitor Logs

To look at shardman-monitor logs, you can use a journalctl command:

$ journalctl -u shardman-monitor@cluster0.service
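
The same command can be wrapped for other clusters or narrowed to recent events. The helper names below are hypothetical. Only the unit naming scheme and the "Deadlock discovered" message text come from this page, while -f and --since are standard journalctl options.

```shell
# Hypothetical helpers built around the unit name shardman-monitor@CLUSTER_NAME.service.
monitor_unit() {
    printf 'shardman-monitor@%s.service\n' "$1"
}

# Follow new log entries for the given cluster as they arrive.
follow_monitor_logs() {
    journalctl -u "$(monitor_unit "$1")" -f
}

# Show deadlock reports logged during the last hour.
recent_deadlocks() {
    journalctl -u "$(monitor_unit "$1")" --since "1 hour ago" | grep "Deadlock discovered"
}

monitor_unit cluster0   # prints shardman-monitor@cluster0.service
```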

Detecting Deadlocks

Assume that a sharded table was created like this:

 postgres=#   create table public.players(id int, username text, pass text) with (distributed_by='id'); 
 postgres=#   insert into players select id, 'user_'|| id, 'pass_' || id from generate_series(1,1000) id;

Assume also that the records with IDs 3 and 18 belong to different shards. Now, consider this sequence of queries to n1 and n3:

 n1=#   begin; 
 n1=#   update players set pass='somevalue' where id=18;
 n3=#   begin;
 n3=#   update players set pass='othervalue' where id=3;
 n3=#   update players set pass='othervalue' where id=18;

The transaction on n3 waits for a lock held by the transaction on n1, and the transaction on n1 continues.

 n1=#   update players set pass='somevalue' where id=3;

ERROR:  canceling statement due to user request
CONTEXT:  while updating tuple (0,1) in relation "players_2"
remote SQL command: UPDATE public.players_2 SET pass = 'somevalue'::text WHERE ((id = 3))

In the shardman-monitor@cluster0 logs on one of the monitors, you can see something like this:

Feb 10 16:31:23 vg1 shardman-monitor[134814]: 2021-02-10T16:31:23.024+0300        INFO        cmd/monitor.go:914        Deadlock discovered:
Feb 10 16:31:23 vg1 shardman-monitor[134814]:   3:185860->4:186922->4:185362->1:185862->1:187368->
Feb 10 16:31:23 vg1 shardman-monitor[134814]:   Canceling pid 185362 at repgroup 4...        {"goroutine": "deadlock detector"}
Feb 10 16:31:23 vg1 shardman-monitor[134814]: 2021-02-10T16:31:23.027+0300        INFO        cmd/monitor.go:925        successfully canceled backend 185362 at repgroup 4        {"goroutine": "deadlock detector"}

shardman-monitor detected a deadlock and canceled one of the transactions involved in it (here, the backend with PID 185362 in replication group 4).
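
The wait chain in the log excerpt appears to list repgroup:pid pairs separated by "->". This reading is inferred from the excerpt, where the canceled backend 185362 at repgroup 4 shows up as 4:185362, and is not a documented format. Under that assumption, a chain can be split into one pair per line with a hypothetical parse_chain helper:

```shell
# Sample chain copied from the log excerpt above.
chain='3:185860->4:186922->4:185362->1:185862->1:187368->'

# Hypothetical helper: split on "->" and print one pair per line.
# Assumes each element has the form repgroup:pid.
parse_chain() {
    printf '%s\n' "$1" \
        | sed 's/->/ /g' \
        | tr ' ' '\n' \
        | awk -F: 'NF == 2 { print "repgroup=" $1, "pid=" $2 }'
}

parse_chain "$chain"
# prints:
#   repgroup=3 pid=185860
#   repgroup=4 pid=186922
#   repgroup=4 pid=185362
#   repgroup=1 pid=185862
#   repgroup=1 pid=187368
```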

See Also

shardman-ladle, sdmspec.json