Multi-Master Cluster Administration
multimaster is a Postgres Pro Enterprise extension with a set
of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing
cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.
As compared to a standard PostgreSQL master-slave cluster, a cluster configured with the multimaster extension offers the following benefits:
Cluster-wide transaction isolation
Synchronous logical replication and DDL replication
Working with temporary tables on each cluster node
Fault tolerance and automatic node recovery
PostgreSQL online upgrades
The multimaster extension replicates your
database to all nodes of the cluster and allows write transactions
on each node. To ensure data consistency in the case of concurrent
updates, multimaster enforces transaction
isolation cluster-wide, using multiversion concurrency control
(MVCC) at the read committed or repeatable read isolation levels. Any write
transaction is synchronously replicated to all nodes, which
increases commit latency by the time required for
synchronization. Read-only transactions and queries are executed
locally, without any measurable overhead.
To ensure high availability and fault tolerance of the cluster,
multimaster uses three-phase commit protocol
and heartbeats for failure discovery. A multi-master cluster of N
nodes can continue working while the majority of the nodes are
alive and reachable by other nodes. To be configured with multimaster, the cluster must include at least two nodes. In most cases, three
cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
When a failed node
is reconnected to the cluster, multimaster can automatically
fast-forward the node to the actual state based on the
Write-Ahead Log (WAL) data in the corresponding replication slot. If WAL data is no longer available for the time when the node was excluded from the cluster, you can restore the node using pg_basebackup.
When using multimaster, make sure to take its replication restrictions into account. For details, see Section F.30.1.
To learn more about the multimaster internals, see
Section F.30.2.
The multimaster extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:
multimaster can only replicate one database
per cluster, which is specified in the multimaster.conn_strings variable. If you try to connect to a different database, multimaster will return a corresponding error message.
The replicated tables must have primary keys or replica identity. Otherwise,
multimaster will not allow replication
because of the logical replication restrictions. Unlogged tables are not replicated, as in the standard PostgreSQL.
You can enable replication
of tables without primary keys by setting the multimaster.ignore_tables_without_pk variable to false. However, take into account that
multimaster does not allow update operations on such tables.
Isolation level. The multimaster extension
supports read committed and repeatable read isolation levels. Serializable isolation level is currently not supported.
Using repeatable read isolation level increases
the probability of serialization failure at commit time. Unlike in standard PostgreSQL, the read committed level can also cause serialization failures on a multi-master cluster.
When performing a write transaction, multimaster blocks the affected objects only on the node on which the transaction is performed. However, since write transactions are allowed on all nodes, other transactions can try to change the same objects on the neighbor nodes at the same time. In this case, the replication of the first transaction can fail because the affected objects on the neighbor nodes are already blocked by another transaction. Similarly, the latter transaction cannot be replicated to the first node. In this case, a distributed deadlock occurs. As a result, one of the transactions is automatically rolled back and needs to be repeated. The application must be ready to retry transactions.
If your typical workload has too many rollbacks, it is recommended to use the read committed isolation level. However, read committed still does not guarantee the absence of deadlocks on a multi-master cluster. If using the read committed level does not help, you can try directing all the write transactions to a single node.
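Since the application must be prepared to retry aborted transactions, the retry logic can be sketched generically. The following is a minimal illustration in Python; SerializationFailure here stands in for a real database error with SQLSTATE 40001 (serialization_failure), and the transaction callable is a placeholder for your actual database work:

```python
class SerializationFailure(Exception):
    """Stands in for a database error with SQLSTATE 40001 (serialization_failure)."""

def run_with_retry(txn, max_attempts=5):
    """Run a transaction callable, retrying on serialization failures.

    On a multi-master cluster, a commit may be aborted because of a
    concurrent update on another node, so the application retries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up after the last attempt

# Simulated transaction that fails twice before succeeding,
# the way a transaction hitting concurrent updates might.
attempts = {"n": 0}
def transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"

print(run_with_retry(transfer))  # prints "committed" after two retries
```

In a real application, the except clause would match the driver's serialization-failure error class, and the transaction would be rolled back before each retry.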
Sequence generation. To avoid conflicts between unique identifiers on different nodes, multimaster modifies the default behavior of sequence generators. For each node, ID generation is started with the node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic.
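The allocation scheme can be pictured as a per-node sequence that starts at the node number and increments by the node count. The following is an illustration only; multimaster adjusts the real generators internally, and the sequence names are hypothetical:

```sql
-- Illustration: modeling ID allocation in a three-node cluster.
CREATE SEQUENCE node1_ids START WITH 1 INCREMENT BY 3;  -- yields 1, 4, 7, ...
CREATE SEQUENCE node2_ids START WITH 2 INCREMENT BY 3;  -- yields 2, 5, 8, ...
```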
DDL replication. While multimaster
replicates data on the logical level, DDL is replicated on the
statement level, which results in distributed commits of the same
statement on different nodes. As a result, complex DDL
scenarios, such as stored procedures and temporary tables, may
work differently as compared to the standard PostgreSQL.
Commit latency. In the current implementation of logical
replication, multimaster sends data to subscriber nodes only after the
local commit, so you have to wait for transaction processing twice: first on the local node,
and then on all the other nodes simultaneously. In the case of a heavy-write transaction, this may result in a noticeable delay.
If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
SELECT * FROM mtm.make_table_local('table_name');
Since each server in a multi-master cluster can accept writes, any server can abort a
transaction because of a concurrent update — in the same way as it
happens on a single server between different backends. To ensure
high availability and data consistency on all cluster nodes,
multimaster uses logical replication and the three-phase E3PC commit protocol.
When Postgres Pro Enterprise loads the multimaster shared
library, multimaster sets up a logical
replication producer and consumer for each node, and hooks into
the transaction commit pipeline. The typical data replication
workflow consists of the following phases:
PREPARE phase.
multimaster captures and implicitly
transforms each COMMIT statement to a
PREPARE statement. All the nodes that get
the transaction via the replication protocol (the
cohort nodes) send their vote for approving or
declining the transaction to the arbiter process on the
initiating node. This ensures that all the cohort nodes can accept
the transaction, and no write conflicts occur. For details on
PREPARE transactions support in PostgreSQL,
see the
PREPARE
TRANSACTION topic.
PRECOMMIT phase. If all the cohort nodes approve
the transaction, the arbiter process sends a
PRECOMMIT message to all the cohort nodes
to express an intention to commit the transaction. The cohort
nodes respond to the arbiter with the
PRECOMMITTED message. In case of a failure,
all the nodes can use this information to complete the
transaction using a quorum-based voting procedure.
COMMIT phase. If
PRECOMMIT is successful, the arbiter
commits the transaction to all nodes.
multimaster currently supports the read committed and repeatable read isolation levels only, which can cause unexpected serialization failures in your workload. For details, see Section F.30.1.
If a node crashes or gets disconnected from the cluster between
the PREPARE and COMMIT
phases, the PRECOMMIT phase ensures that the
surviving nodes have enough information to complete the prepared
transaction. The PRECOMMITTED messages help
avoid the situation when the crashed node has already committed
or aborted the transaction, but has not notified the other nodes
about the transaction status. In a two-phase commit (2PC), such a
transaction would block resources (hold locks) until the crashed
node recovers. Resolving it without this information could lead to data
inconsistencies once the failed node is recovered. For example,
the failed node might have committed the transaction while a
surviving node aborted it.
To complete the transaction, the arbiter must receive a response
from the majority of the nodes. For example, for a cluster of 2N+1 nodes, at least N+1 responses are required. Thus, multimaster ensures that
your cluster is available for reads and writes while the majority
of the nodes are connected, and no data inconsistencies occur in
case of a node or connection failure. For details on the failure
detection mechanism, see
Section F.30.2.2.
Since multimaster allows writes to each node,
it has to wait for responses about transaction acknowledgement
from all the other nodes. Without special actions in case of a
node failure, each commit would have to wait until the failed node
recovery. To deal with such situations,
multimaster periodically sends heartbeats to
check the node state and the connectivity between nodes. When several
heartbeats to the node are lost in a row, this node is kicked out
of the cluster to allow writes to the remaining alive nodes. You
can configure the heartbeat frequency and the response timeout in
the multimaster.heartbeat_send_timeout and
multimaster.heartbeat_recv_timeout parameters,
respectively.
An alive node cannot distinguish a failed
node that stopped serving requests from a network-partitioned node
that database users can still access, but that is unreachable for
other nodes. To avoid conflicting writes to nodes in different
network partitions, multimaster only allows
writes to the nodes that see the majority of other nodes.
For example, suppose a five-node multi-master cluster experienced
a network failure that split the network into two isolated
subnets, with two and three cluster nodes. Based on heartbeats
propagation information, multimaster will
continue accepting writes at each node in the bigger partition,
and deny all writes in the smaller one. Thus, a cluster consisting
of 2N+1 nodes can tolerate N node failures and stay alive if any
N+1 nodes are alive and connected to each other.
In case of a partial network split when different nodes have
different connectivity, multimaster finds a
fully connected subset of nodes and switches off other nodes. For
example, in a three-node cluster, if node A can access both B and
C, but node B cannot access node C, multimaster
isolates node C to ensure data consistency on nodes A and B.
Each node maintains a data structure that keeps the information about the state of all
nodes in relation to this node. You can get this data in the
mtm.get_nodes_state() view.
When a failed node connects back to the cluster,
multimaster starts automatic recovery:
The reconnected node selects a random cluster node and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL).
When the node gets synchronized up to the minimum recovery
lag, all the cluster nodes get locked for write transactions to allow the
recovery process to finish. By default, the minimum recovery
lag is 100KB. You can change this value in the
multimaster.min_recovery_lag variable.
When the recovery is complete, multimaster
promotes the reconnected node to the online state and
includes it into the replication scheme.
Automatic recovery is only possible if the failed node WAL lag
behind the working ones does not exceed the
multimaster.max_recovery_lag value. When the
WAL lag is bigger than
multimaster.max_recovery_lag, you can manually
restore the node from one of the working nodes using
pg_basebackup.
To use multimaster, you need to install
Postgres Pro Enterprise on all nodes of your cluster. Postgres Pro Enterprise includes all the required dependencies and
extensions.
After installing Postgres Pro Enterprise on all nodes, you need to
configure the cluster with multimaster.
Suppose
you are setting up a cluster of three nodes, with
node1, node2, and
node3 domain names. First, set up the database to be replicated, and make sure you have a user with superuser rights to perform replication:
If you are starting from scratch, initialize a cluster,
create an empty database mydb and a new
user myuser, on each node of the cluster. For details, see Section 17.2.
If you already have a database mydb
running on node1, initialize
new nodes from the working node using
pg_basebackup. On behalf of myuser, run the following command on each node you are going to add:
pg_basebackup -D datadir -h node1 -U myuser
where datadir is the directory containing the database cluster. This directory is specified at the cluster initialization stage, or set in the PGDATA environment variable.
For details on using pg_basebackup, see pg_basebackup.
Once the database is set up, complete the following steps on each cluster node:
Modify the postgresql.conf configuration
file, as follows:
Add multimaster to the shared_preload_libraries variable:
shared_preload_libraries = 'multimaster'
If the shared_preload_libraries variable is already defined in postgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 18.1.2.
Specify the transaction isolation level for your cluster. multimaster currently supports read committed and repeatable read isolation levels.
default_transaction_isolation = 'read committed'
Using repeatable read isolation level increases
the probability of serialization failure at commit time. If such cases are not handled by your application, you are recommended to use read committed isolation level.
Set up PostgreSQL parameters related to replication.
wal_level = logical
max_connections = 100
max_prepared_transactions = 300
max_wal_senders = 10        # at least the number of nodes
max_replication_slots = 10  # at least the number of nodes
You must change the replication level to
logical as
multimaster relies on logical
replication. For a cluster of N nodes, enable at least N
WAL sender processes and replication slots. Since
multimaster implicitly adds a
PREPARE phase to each
COMMIT transaction, make sure to set
the number of prepared transactions to N*max_connections.
Otherwise, prepared transactions may be queued.
Make sure you have enough background workers allocated for each node:
max_worker_processes = 250
For example, for a three-node cluster with
max_connections = 100,
multimaster may need up to 206
background workers at peak times: 200 workers for
connections from the neighbor nodes, two workers for
WAL sender processes, two workers for WAL receiver
processes, and two workers for the arbiter sender and
receiver processes. When setting this parameter, remember
that other modules may also use background workers at the
same time.
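For the three-node example above, the estimate can be written out as a postgresql.conf fragment (values are illustrative and assume max_connections = 100):

```
# Three-node cluster with max_connections = 100:
#   2 neighbor nodes * 100 connections = 200 workers
#   2 WAL senders + 2 WAL receivers    =   4 workers
#   arbiter sender + receiver          =   2 workers
#   peak total                         = 206 workers
max_worker_processes = 250   # headroom for other modules
```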
Add multimaster-specific options:
multimaster.max_nodes = 3   # cluster size
multimaster.node_id = 1     # the 1-based index of this node in the cluster
multimaster.conn_strings = 'dbname=mydb user=myuser host=node1, dbname=mydb user=myuser host=node2, dbname=mydb user=myuser host=node3'
                            # comma-separated list of connection strings
                            # to neighbor nodes
The multimaster.max_nodes variable defines the maximum cluster size. If you plan to add new nodes to your cluster, the multimaster.max_nodes value should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached.
In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
If you would like to change the default connection settings for a cluster node, you can add other connection parameters to the corresponding connection string in the multimaster.conn_strings variable. If you change the default port on which the arbiter process listens for connections, you must specify this port in the arbiter_port parameter. For details, see multimaster.arbiter_port and multimaster.conn_strings.
The
multimaster.node_id variable takes natural
numbers starting from 1, without any gaps in numbering.
For example, for a cluster of five nodes, set node IDs
to 1, 2, 3, 4, and 5. In the
multimaster.conn_strings variable, make sure to
list the nodes in the order of their IDs. The
multimaster.conn_strings variable must be the
same on all nodes.
Depending on your network environment and usage patterns, you
may want to tune other multimaster
parameters. For details, see
Section F.30.3.2.
Modify the pg_hba.conf file to allow replication to each cluster node on behalf of myuser.
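For example, assuming the cluster nodes share the hypothetical subnet 10.0.0.0/24, the pg_hba.conf entries could look as follows (the address range and authentication method are placeholders; adjust them for your environment):

```
host    replication    myuser    10.0.0.0/24    md5
host    mydb           myuser    10.0.0.0/24    md5
```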
Restart PostgreSQL:
pg_ctl -D datadir -l pg.log start
When Postgres Pro Enterprise is started on all nodes, connect to any node
and create the multimaster extension:
psql -h node1
CREATE EXTENSION multimaster;
The CREATE EXTENSION query is replicated to all the cluster nodes.
To ensure that multimaster is enabled, check
the mtm.get_cluster_state() view:
SELECT * FROM mtm.get_cluster_state();
If liveNodes is equal to
allNodes, your cluster is successfully
configured and ready to use.
While you can use multimaster in the default
configuration, you may want to tune several parameters for faster
failure detection or more reliable automatic recovery.
To check availability of the neighbor nodes,
multimaster periodically sends heartbeat
packets to all nodes. You can define the timeout for failure detection with the following variables:
The multimaster.heartbeat_send_timeout
variable defines the time interval between the
heartbeats. By default, this variable is set to 1000ms.
The multimaster.heartbeat_recv_timeout
variable sets the timeout for the response. If no heartbeats are
received during this time, the node is assumed to be
disconnected and is excluded from the cluster. By default,
this variable is set to 10000ms.
It's a good idea to set
multimaster.heartbeat_send_timeout based on
typical ping latencies between the nodes. A small recv/send ratio
decreases the time of failure detection, but increases the
probability of false-positive failure detection. When setting
this parameter, take into account the typical packet loss ratio
between your cluster nodes.
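For example, for a network with a typical inter-node ping latency of a few milliseconds, the defaults are a reasonable starting point (the values shown below are the defaults; adjust the recv/send ratio to trade detection speed against false positives):

```
multimaster.heartbeat_send_timeout = 1000    # ms between heartbeats
multimaster.heartbeat_recv_timeout = 10000   # ms of silence before the node is excluded
```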
If a cluster node fails, multimaster can
automatically restore it based on the WAL collected
on other cluster nodes. To control the recovery settings, use the
following variables:
multimaster.min_recovery_lag — sets the
minimal WAL lag between the node to be restored and the current cluster state.
By default, multimaster.min_recovery_lag is set to 100KB.
When the disconnected node is fast-forwarded up to the
multimaster.min_recovery_lag threshold,
multimaster stops all new commits to the
alive nodes until the node fully catches up with the
current state of the cluster. When the data is fully synchronized,
the disconnected node is promoted to the online state, and
the cluster resumes its work.
multimaster.max_recovery_lag — sets the
maximum size of WAL. Upon reaching the
multimaster.max_recovery_lag threshold,
WAL for the disconnected node is overwritten. At this
point, automatic recovery is no longer possible. In this case, you can restore the node manually by cloning the data from one of the alive nodes using pg_basebackup.
By default, multimaster.max_recovery_lag is
set to 100MB. Setting
multimaster.max_recovery_lag to a larger
value increases the timeframe for automatic recovery, but
requires more disk space for WAL collection.
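For example, to widen the automatic recovery window at the cost of extra disk space for WAL, you could raise the upper threshold (illustrative values; both units are bytes):

```
multimaster.min_recovery_lag = 100000        # 100KB, the default
multimaster.max_recovery_lag = 1000000000    # 1GB instead of the default 100MB
```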
multimaster provides several views to check the
current cluster state.
To check node-specific information, use mtm.get_nodes_state():
SELECT * FROM mtm.get_nodes_state();
To check the status of the whole cluster, use the
mtm.get_cluster_state() view:
SELECT * FROM mtm.get_cluster_state();
For details on all the returned information, see Section F.30.5.2.
With the multimaster extension, you can add or drop cluster nodes without
stopping the database service.
To add a new node, you need to change the cluster configuration on alive nodes, load all the data to the new node using pg_basebackup, and start the node.
Suppose we have a working cluster of three nodes, with
node1, node2, and
node3 domain names. To add
node4, follow these steps:
Check whether the current number of cluster nodes has reached the value specified in the
multimaster.max_nodes variable. If this value is reached, increase
the multimaster.max_nodes value on each node and restart all nodes.
You can restart the nodes one by one, without stopping the database.
If the maximum number of nodes is not reached, proceed to the next step.
Figure out the required connection string to
access the new node. For example, for the database
mydb, user myuser, and
the new node node4, the connection string
can be "dbname=mydb user=myuser host=node4".
For details, see multimaster.conn_strings.
In psql connected to any alive node, run:
SELECT mtm.add_node('dbname=mydb user=myuser host=node4');
This command changes the cluster configuration on all nodes and starts replication slots for the new node.
Connect to the new node and clone all the data from one of the alive nodes to the new node:
pg_basebackup -D datadir -h node1 -x
pg_basebackup copies the entire data
directory from node1, together with
configuration settings.
Update postgresql.conf settings on
node4:
multimaster.node_id = 4
multimaster.conn_strings = 'dbname=mydb user=myuser host=node1, dbname=mydb user=myuser host=node2,
dbname=mydb user=myuser host=node3, dbname=mydb user=myuser host=node4'
Start PostgreSQL on the new node:
pg_ctl -D datadir -l pg.log start
All the cluster nodes get locked for write transactions until the new node retrieves all the updates that happened after you started making a base backup.
When data recovery is complete, multimaster promotes the new node to the online state and includes it into the replication scheme.
To ensure that the new configuration is loaded in the case of PostgreSQL restart, update configuration settings on all the cluster nodes:
Change multimaster.conn_strings to include the new node.
Make sure the pg_hba.conf file allows
replication to the new node.
To exclude a node from the cluster, use the
mtm.stop_node() function. For example, to
exclude node 3, run the following command on any other node:
SELECT mtm.stop_node(3);
This disables replication slots for node 3 on all cluster nodes and stops replication to this node.
If you simply shut down a node, it is also excluded
from the cluster. However, all transactions in the cluster
are frozen until the other nodes detect that the node is offline.
This time interval is defined by the multimaster.heartbeat_recv_timeout parameter.
The multimaster extension can automatically restore a failed node if the WAL is still available for the time when the node was disconnected from the cluster. However, if the data updates on the alive nodes exceed the allowed WAL size specified in the multimaster.max_recovery_lag variable, automatic recovery becomes impossible. In this case, you can manually restore the failed node.
Suppose node2 got disconnected from your three-node cluster and needs to be manually restored. The typical workflow is as follows:
In psql connected to any alive node, create a new replication slot for the disconnected node with the following command:
SELECT mtm.recover_node(2);
where 2 is the ID of the disconnected node specified in the multimaster.node_id variable.
Connect to node2 and clone all the data from one of the alive nodes:
pg_basebackup -D datadir -h node1 -x
pg_basebackup copies the entire data
directory from node1, together with
configuration settings.
Start PostgreSQL on the restored node:
pg_ctl -D datadir -l pg.log start
All the cluster nodes get locked for write transactions until the restored node retrieves all the updates that happened after you started making a base backup.
When data recovery is complete, multimaster promotes the new node to the online state and includes it into the replication scheme.
multimaster.node_id
Node ID — a unique natural number identifying the node of a multi-master cluster. You must start node numbering from 1. There must be no gaps in numbering. For example, for a cluster of five nodes, set node IDs to 1, 2, 3, 4, and 5.
multimaster.conn_strings
Connection strings for each node of a multi-master cluster, separated by commas.
The multimaster.conn_strings parameter must be identical on all nodes.
Each connection string must include the name of the database to replicate
and the cluster node domain name. For example, 'dbname=mydb
host=node1, dbname=mydb host=node2, dbname=mydb host=node3'.
Optionally, you can add other connection parameters to change the default connection settings.
Connection strings must appear in the order of the node IDs
specified in the multimaster.node_id variable:
the connection string for the i-th node must be in the i-th position.
If you specify a custom port in the multimaster.arbiter_port variable,
you must provide this value in the arbiter_port
parameter in the connection string for the corresponding node.
multimaster.max_nodes
The maximum number of nodes allowed in the cluster. If you plan to add new nodes to your cluster, the multimaster.max_nodes value should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes. The maximum possible number of nodes is limited to 64.
Default: the number of nodes specified in the multimaster.conn_strings variable
multimaster.arbiter_port
Port for the arbiter process to listen on. If you change the default value, you must specify this value in the arbiter_port
parameter in the connection string for the corresponding node.
Default: 5433
multimaster.heartbeat_send_timeout
Time interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.
Default: 1000
multimaster.heartbeat_recv_timeout
Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.
Default: 10000
multimaster.min_recovery_lag
Minimal WAL lag between the node to be restored and the current cluster state, in bytes. When this threshold is reached during node recovery, the cluster is locked for write transactions until the recovery is complete.
Default: 100000
multimaster.max_recovery_lag
Maximal WAL lag
size, in bytes. When a node is disconnected from the cluster, other
nodes copy WALs for all new transactions into the replication slot of
this node. Upon reaching the
multimaster.max_recovery_lag value, the
replication slot for the disconnected node is dropped to avoid
overflow. At this point, automatic recovery of the node is no longer
possible. In this case, you can restore the node manually by cloning
the data from one of the alive nodes using
pg_basebackup or a similar tool. If you set this
variable to zero, the replication slot will not be dropped.
Default: 100000000
multimaster.ignore_tables_without_pk
Boolean.
This variable enables/disables replication of tables without primary
keys. By default, such replication is
disabled because of the logical replication restrictions. To enable
replication of tables without primary keys, you can set this variable to false. However, take into
account that multimaster does not allow update
operations on such tables.
Default: true
multimaster.cluster_name
Name of the cluster. If
you define this variable when setting up the cluster, multimaster checks that
the cluster name is the same for all the cluster nodes.
multimaster.break_connection
Break connection with clients connected to the node if this node disconnects
from the cluster. If this variable is set to false, the client stays
connected to the node but receives an error that the node is in minority.
Default: false
multimaster.major_node
Node with this flag continues working even if it cannot access the majority of other nodes.
This is needed to break the symmetry if there is an even number of alive nodes in the cluster.
For example, a cluster of three nodes continues working if one of the nodes crashes.
If the connection between the two remaining nodes is then lost, only the node with multimaster.major_node = true continues working.
This parameter should be used with caution. Only one node in the cluster
can have this parameter set to true. When set to true on several
nodes, this parameter can cause the split-brain problem.
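For example, to break the symmetry in a hypothetical cluster that can degrade to two alive nodes, you could designate node 1 as the major node. Set this on node 1 only; all other nodes keep the default:

```
multimaster.major_node = true
```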
multimaster.max_workers
The maximum number of walreceiver workers on this server.
This parameter should be used with caution. If the number of simultaneous transactions in the whole cluster is bigger than the provided value, it can lead to undetected deadlocks.
multimaster.trans_spill_threshold
The maximal size of a transaction, in MB. When this threshold is reached, the transaction is written to disk.
Default: 100
mtm.get_nodes_state()
Shows the status of all nodes in the cluster. Returns a tuple of the following values:
id, integer
Node ID.
enabled, boolean
Shows whether the node is excluded from the cluster. The node can only be disabled if responses to heartbeats are not received within the heartbeat_recv_timeout time interval. When the node starts responding to heartbeats, multimaster can automatically restore the node and switch it back to the enabled state.
Automatic recovery is only possible if the replication slot is still active. Otherwise, you can restore the node manually.
connected, boolean
Shows whether the node is connected to the WAL sender.
slot_active, boolean
Shows whether the node has an active replication slot. For a disabled node, the slot remains active until the max_recovery_lag value is reached.
stopped, boolean
Shows whether replication to this node was stopped by the mtm.stop_node() function. A stopped node acts as a disabled one, but cannot be automatically recovered. Call mtm.recover_node() to re-enable such a node.
catchUp, boolean
During the node recovery, shows whether the data is recovered up to the min_recovery_lag value.
slotLag, bigint
The size of WAL data that the replication slot holds for a disabled/stopped node. The slot is dropped when slotLag reaches the max_recovery_lag value.
avgTransDelay, bigint
An average commit delay caused by this node, in microseconds.
lastStatusChange, timestamp
Last time when the node changed its status (enabled/disabled).
oldestSnapshot, bigint
The oldest global snapshot existing on this node.
SenderPid, integer
Process ID of the WAL sender.
SenderStartTime, timestamp
WAL sender start time.
ReceiverPid, integer
Process ID of the WAL receiver.
ReceiverStartTime, timestamp
WAL receiver start time.
connStr, text
Connection string to this node.
connectivityMask, bigint
Bitmask representing connectivity to neighbor nodes. Each bit represents a connection to a node.
nHeartbeats, integer
The number of heartbeat responses received from this node.
mtm.collect_cluster_state()
Collects the data returned by the mtm.get_cluster_state() function from all available nodes. For this function to work, in addition to replication connections, pg_hba.conf must allow ordinary connections to the node with the specified connection string.
mtm.get_cluster_state()
Shows the status of the multimaster extension. Returns a tuple of the following values:
status, text
Node status. Possible values are: Initialization, Offline, Connected, Online, Recovery, Recovered, InMinor, OutOfService.
disabledNodeMask, bigint
Bitmask of disabled nodes.
disconnectedNodeMask, bigint
Bitmask of disconnected nodes.
catchUpNodeMask, bigint
Bitmask of nodes that completed the recovery.
liveNodes, integer
Number of enabled nodes.
allNodes, integer
Number of nodes in the cluster. The majority of alive nodes is calculated based on this parameter.
nActiveQueries, integer
Number of queries being currently processed on this node.
nPendingQueries, integer
Number of queries waiting for execution on this node.
queueSize, bigint
Size of the pending query queue, in bytes.
transCount, bigint
The total number of replicated transactions processed by this node.
timeShift, bigint
Global snapshot shift caused by unsynchronized clocks on nodes, in microseconds.
recoverySlot, integer
The node from which a failed node gets data updates during automatic recovery.
xidHashSize, bigint
Size of xid2state hash.
gidHashSize, bigint
Size of gid2state hash.
oldestXid, bigint
The oldest transaction ID on this node.
configChanges, integer
Number of state changes (enabled/disabled) since the last reboot.
stalledNodeMask, bigint
Bitmask of nodes for which replication slots were dropped.
stoppedNodeMask, bigint
Bitmask of nodes that were stopped by mtm.stop_node().
lastStatusChange, timestamp
Timestamp of the last state change.
mtm.add_node(conn_str text)
Adds a new node to the cluster.
Arguments:
conn_str — connection string for the
new node. For example, for the database
mydb, user myuser,
and the new node node4, the connection
string is
"dbname=mydb user=myuser host=node4".
Type: text
mtm.drop_node(node integer, drop_slot bool default false)
Excludes a node from the cluster.
Arguments:
node — ID of the node to be dropped
that you specified in the
multimaster.node_id variable.
Type:
integer
drop_slot — Optional. Defines whether
the replication slot should be dropped together with the
node. Set this option to true if you do not plan to
restore the node in the future.
Type: boolean
Default: false
mtm.recover_node(node integer)
Creates a replication slot for the node that was previously dropped together with its slot.
Arguments:
node — ID of the node to be restored.
mtm.make_table_local(relation regclass)
Stops replication for the specified table.
Arguments:
relation — The table you would like to
exclude from the replication scheme.
Type: regclass
The multimaster extension currently passes 162
of 166 PostgreSQL regression tests. We are currently working on
providing full compatibility with the standard PostgreSQL.
Postgres Professional, Moscow, Russia.
The replication mechanism is based on logical decoding and an
earlier version of the pglogical extension
provided for the community by the 2ndQuadrant team.
The three-phase E3PC commit protocol is based on the following works:
Idit Keidar, Danny Dolev. Increasing the Resilience of Distributed and Replicated Database Systems.
Tim Kempster, Colin Stirling, Peter Thanisch. A More Committed Quorum-Based Three Phase Commit Protocol.