multimaster is a Postgres Pro Enterprise extension with a set
of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing
cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.
As compared to a standard PostgreSQL master-standby cluster, a cluster configured with the multimaster extension offers the following benefits:
Fault tolerance and automatic node recovery
Synchronous logical replication and DDL replication
Read scalability
Working with temporary tables on each cluster node
Postgres Pro Enterprise online upgrades
The multimaster extension replicates your
database to all nodes of the cluster and allows write transactions
on each node. Write transactions are synchronously replicated to all nodes,
which increases commit latency. Read-only transactions and queries
are executed locally, without any measurable overhead.
To ensure high availability and fault tolerance of the cluster,
multimaster uses a three-phase commit protocol
and heartbeats for failure discovery. A multi-master cluster of N
nodes can continue working while the majority of the nodes are
alive and reachable by other nodes. To be configured with multimaster, the cluster must include at least two nodes. In most cases, three
cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
When a failed node
is reconnected to the cluster, multimaster can automatically
fast-forward the node to the actual state based on the
Write-Ahead Log (WAL) data in the corresponding replication slot.
If a node was excluded from the cluster, you can add it back using pg_basebackup.
When using multimaster, make sure to take its replication restrictions into account. For details, see Section F.30.1.
To learn more about the multimaster internals, see
Section F.30.2.
The multimaster extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:
Unlike in vanilla PostgreSQL, the read committed
isolation level can cause serialization failures on a multi-master cluster if there are
conflicting transactions from different nodes, so the application must be
ready to retry transactions.
The serializable isolation level works
only with respect to local transactions on the current node.
Sequence generation. To avoid conflicts between unique identifiers on different nodes, multimaster modifies the default behavior of sequence generators. By default, ID generation on each node starts with this node's number and is incremented by the number of nodes. For example, in a three-node cluster, IDs 1, 4, and 7 are allocated to the objects written onto the first node, while IDs 2, 5, and 8 are reserved for the second node. If you change the number of nodes in the cluster, the increment interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can set the multimaster.monotonic_sequences parameter to true (see the sketch after this list).
Accessing temporary and persistent tables in the same transaction is prohibited because Postgres Pro cannot prepare such transactions.
Commit latency. In the current implementation of logical
replication, multimaster sends data to subscriber nodes only after the
local commit, so you have to wait for transaction processing twice: first on the local node,
and then on all the other nodes simultaneously. In the case of a heavy-write transaction,
this may result in a noticeable delay.
Replicated non-conflicting transactions are applied on the receiving nodes in parallel, so such transactions may become visible on different nodes in different order.
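The following sketch illustrates the interleaved ID allocation described in the sequence generation restriction above; the table and column names are illustrative:

CREATE TABLE orders (id bigserial PRIMARY KEY, payload text);
-- With the default multimaster.monotonic_sequences = false, inserts on
-- node 1 draw IDs 1, 4, 7, ..., while inserts on node 2 draw IDs 2, 5, 8, ...
INSERT INTO orders (payload) VALUES ('first');
-- To make sequence values increase monotonically cluster-wide, set the
-- following in postgresql.conf on every node:
-- multimaster.monotonic_sequences = true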
Since each server in a multi-master cluster can accept writes, any server can abort a
transaction because of a concurrent update — in the same way as it
happens on a single server between different backends. To ensure
high availability and data consistency on all cluster nodes,
multimaster uses logical replication and the three-phase E3PC commit protocol.
When Postgres Pro Enterprise loads the multimaster shared
library, multimaster sets up a logical
replication producer and consumer for each node, and hooks into
the transaction commit pipeline. The typical data replication
workflow consists of the following phases:
PREPARE phase.
multimaster captures and implicitly
transforms each COMMIT statement to a
PREPARE statement. All the nodes that get
the transaction via the replication protocol (the
cohort nodes) send their vote for approving or
declining the transaction to the backend process on the
initiating node. This ensures that all the cohort nodes can accept
the transaction, and no write conflicts occur. For details on
PREPARE transactions support in PostgreSQL,
see the
PREPARE
TRANSACTION topic.
PRECOMMIT phase. If all the cohort nodes approve
the transaction, the backend process sends a
PRECOMMIT message to all the cohort nodes
to express an intention to commit the transaction. The cohort
nodes respond to the backend with the
PRECOMMITTED message. In case of a failure,
all the nodes can use this information to complete the
transaction using a quorum-based voting procedure.
COMMIT phase. If PRECOMMIT is successful, the transaction is committed on all nodes.
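Conceptually, this resembles an explicit two-phase commit with an additional precommit vote. The following SQL sketch approximates what multimaster performs implicitly; the transaction body and the GID are illustrative, and the PRECOMMIT exchange is an internal message with no SQL equivalent:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- The client's COMMIT is implicitly transformed into:
PREPARE TRANSACTION 'mtm-example-gid';  -- PREPARE phase: cohort nodes vote
-- (internal PRECOMMIT / PRECOMMITTED message exchange)
COMMIT PREPARED 'mtm-example-gid';      -- COMMIT phase on all nodes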
If a node crashes or gets disconnected from the cluster between the PREPARE and COMMIT phases, the PRECOMMIT phase ensures that the surviving nodes have enough information to complete the prepared transaction. The PRECOMMITTED messages help avoid the situation when the crashed node has already committed or aborted the transaction, but has not notified the other nodes about the transaction status. In a two-phase commit (2PC), such a transaction would block resources (hold locks) until the crashed node recovers, or data inconsistencies could appear in the database once the failed node is recovered, for example, if the failed node committed the transaction, but a surviving node aborted it.
To complete the transaction, the backend must receive a response
from the majority of the nodes. For example, for a cluster of 2N+1 nodes,
at least N+1 responses are required. Thus, multimaster ensures that
your cluster is available for reads and writes while the majority
of the nodes are connected, and no data inconsistencies occur in
case of a node or connection failure.
Since multimaster allows writes to each node,
it has to wait for responses about transaction acknowledgement
from all the other nodes. Without special actions in case of a
node failure, each commit would have to wait until the failed node
recovery. To deal with such situations,
multimaster periodically sends heartbeats to
check the node state and the connectivity between nodes. When several successive heartbeats to a node are lost, this node is excluded from the cluster to allow writes to the remaining alive nodes. You
can configure the heartbeat frequency and the response timeout in
the multimaster.heartbeat_send_timeout and
multimaster.heartbeat_recv_timeout parameters,
respectively.
For alive nodes, there is no way to distinguish between a failed
node that stopped serving requests and a network-partitioned node
that can be accessed by database users, but is unreachable for
other nodes. To avoid conflicting writes to nodes in different
network partitions, multimaster only allows
writes to the nodes that see the majority of other nodes.
For example, suppose a five-node multi-master cluster experienced
a network failure that split the network into two isolated
subnets, with two and three cluster nodes. Based on heartbeats
propagation information, multimaster will
continue accepting writes at each node in the bigger partition,
and deny all writes in the smaller one. Thus, a cluster consisting
of 2N+1 nodes can tolerate N node failures and stay alive if any
N+1 nodes are alive and connected to each other.
For clusters with an even number of nodes, you can set up a lightweight referee node that does not hold the data, but acts as a tie-breaker during symmetric node partitioning. For details, see Section F.30.3.3.
In case of a partial network split when different nodes have
different connectivity, multimaster finds a
fully connected subset of nodes and disconnects nodes outside of this subset. For
example, in a three-node cluster, if node A can access both B and
C, but node B cannot access node C, multimaster
isolates node C to ensure that both A and B can work.
If you try to access a disconnected node, multimaster returns an error
message indicating the current status of the node. To prevent stale reads,
read-only queries are also forbidden.
Each node maintains a data structure that keeps the information about the state of all
nodes in relation to this node. You can get this data by calling the
mtm.status() and the mtm.nodes() functions.
When a failed node connects back to the cluster,
multimaster starts automatic recovery:
The reconnected node selects a random cluster node and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL). All the cluster nodes get locked for write transactions to allow the recovery process to finish.
When the recovery is complete, multimaster
promotes the reconnected node to the online state and
includes it into the replication scheme.
Automatic recovery requires presence of all WAL files generated after node failure. If a node is down for a long time and WAL files are no longer available, you may have to exclude this node from the cluster and manually restore it from one of the working nodes using pg_basebackup. For details, see Section F.30.4.3.
mtm-monitor
Starts all other workers for a database managed with multimaster. This is the first worker loaded during multimaster boot. Each multimaster node has a single mtm-monitor worker. When a new node is added, mtm-monitor starts mtm-logrep-receiver and mtm-dmq-receiver workers to enable replication to this node. If a node is dropped, mtm-monitor stops the mtm-logrep-receiver and mtm-dmq-receiver workers that have been serving the dropped node. Each mtm-monitor controls workers on its own node only.
mtm-logrep-receiver
Receives the logical replication stream from a given peer node. During recovery, all received transactions are applied by mtm-logrep-receiver itself. During normal operation, mtm-logrep-receiver passes transactions to the pool of dynamic workers (see mtm-logrep-receiver-dynworker). The number of mtm-logrep-receiver workers on each node corresponds to the number of peer nodes available.
mtm-dmq-receiver
Receives acknowledgements for transactions sent to peers and checks for heartbeat timeouts. The number of mtm-dmq-receiver workers on each node corresponds to the number of peer nodes available.
mtm-dmq-sender
Collects acknowledgements for transactions applied on the current node and sends them to the corresponding mtm-dmq-receiver on the peer node. There is a single mtm-dmq-sender worker per Postgres Pro instance.
mtm-logrep-receiver-dynworker
Dynamic pool worker for a given mtm-logrep-receiver. Applies the replicated transactions received during normal operation. There are up to multimaster.max_workers such workers per peer node.
mtm-resolver
Asks all peers about the status of unfinished transactions to resolve them according to 3PC rules. This worker is only active during recovery.
To use multimaster, you need to install
Postgres Pro Enterprise on all nodes of your cluster.
Postgres Pro Enterprise includes all the required dependencies and
extensions.
Suppose you are setting up a cluster of three nodes, with
node1, node2, and
node3 host names. After installing Postgres Pro Enterprise on all nodes, you need to initialize the data directory on each node, as explained in Section 18.2.
If you would like to set up a multi-master cluster for an already existing mydb database,
you can load data from mydb to one of the nodes once the cluster is initialized,
or you can load data to all new nodes before cluster initialization using any convenient mechanism,
such as pg_basebackup or pg_dump.
Once the data directory is set up, complete the following steps on each cluster node:
Modify the postgresql.conf configuration
file, as follows:
Add multimaster to the shared_preload_libraries variable:
shared_preload_libraries = 'multimaster'
If the shared_preload_libraries variable is already defined in postgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2.
Note that in a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the node from which it was run.
Set up Postgres Pro parameters related to replication:
wal_level = logical
max_connections = 100
max_prepared_transactions = 300   # max_connections * N
max_wal_senders = 10              # at least N
max_replication_slots = 10        # at least 2N
wal_sender_timeout = 0
where N is the number of nodes in your cluster.
You must change the replication level to
logical as
multimaster relies on logical
replication. For a cluster of N nodes, enable at least N WAL sender processes and at least 2N replication slots. Since
multimaster implicitly adds a
PREPARE phase to each
COMMIT transaction, make sure to set
the number of prepared transactions to N * max_connections.
wal_sender_timeout should be disabled as multimaster uses
its custom logic for failure detection.
Make sure you have enough background workers allocated for each node:
max_worker_processes = 250 # (N - 1) * (max_connections + 3) + 3
For example, for a three-node cluster with max_connections = 100, multimaster may need up to 209 background workers at peak times: three always-on workers (monitor, resolver, dmq-sender), three workers per peer node (walsender, mtm-receiver, dmq-receiver), and up to 200 dynamic replication workers (that is, max_connections workers per peer node). When setting this parameter, remember that other modules may also use background workers at the same time.
Depending on your network environment and usage patterns, you
may want to tune other multimaster
parameters. For details, see
Section F.30.3.2.
Start Postgres Pro Enterprise on all nodes.
Create database mydb and user mtmuser
on each node:
CREATE USER mtmuser WITH SUPERUSER PASSWORD 'mtmuserpassword';
CREATE DATABASE mydb OWNER mtmuser;
If you are using password-based authentication, you may want to create a password file (see the sketch below).
You can omit this step if you already have a database you are going
to replicate, but you are recommended to create a separate superuser
for multi-master replication. The examples below assume that you are going to
replicate the mydb database on behalf of
mtmuser.
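If you use password-based authentication, a password file lets replication connections authenticate without interactive prompts. A minimal ~/.pgpass sketch for this example, assuming the default port 5432 (the file must be readable only by its owner, mode 0600; entries cover both the mydb and replication connections):

node1:5432:mydb:mtmuser:mtmuserpassword
node1:5432:replication:mtmuser:mtmuserpassword
node2:5432:mydb:mtmuser:mtmuserpassword
node2:5432:replication:mtmuser:mtmuserpassword
node3:5432:mydb:mtmuser:mtmuserpassword
node3:5432:replication:mtmuser:mtmuserpassword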
Allow replication of the mydb database
to each cluster node on behalf of mtmuser,
as explained in Section 20.1.
Make sure to use the
authentication method that
satisfies your security requirements. For example,
pg_hba.conf might have the following lines on node1:
host replication mtmuser node2 md5
host mydb        mtmuser node2 md5
host replication mtmuser node3 md5
host mydb        mtmuser node3 md5
Connect to any node on behalf of the mtmuser database user,
create the multimaster extension
in the mydb database and run mtm.init_cluster():
CREATE EXTENSION multimaster;
SELECT mtm.init_cluster('dbname=mydb user=mtmuser host=node1',
'{"dbname=mydb user=mtmuser host=node2", "dbname=mydb user=mtmuser host=node3"}');
where the first argument of mtm.init_cluster() is a
connection string
to the current node and the second argument is an array of
connection strings to all the other nodes.
To ensure that multimaster is enabled, you can run
the mtm.status() and mtm.nodes() functions:
SELECT * FROM mtm.status();
SELECT * FROM mtm.nodes();
If status is equal to online
and all nodes are present in the mtm.nodes output,
your cluster is successfully configured and ready to use.
If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
SELECT mtm.make_table_local('table_name');
While you can use multimaster in the default
configuration, you may want to tune several parameters for faster
failure detection or more reliable automatic recovery.
To check availability of the peer nodes,
multimaster periodically sends heartbeat
packets to all nodes. You can define the timeout for failure detection with the following variables:
The multimaster.heartbeat_send_timeout
variable defines the time interval between the
heartbeats. By default, this variable is set to 200ms.
The multimaster.heartbeat_recv_timeout
variable sets the timeout for the response. If no heartbeats are
received during this time, the node is assumed to be
disconnected and is excluded from the cluster. By default,
this variable is set to 1000ms.
It is a good idea to set multimaster.heartbeat_send_timeout based on typical ping latencies between the nodes. A small recv/send ratio decreases failure detection time, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.
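For example, the following postgresql.conf sketch restates the defaults, which suit a low-latency network; adjust the values to your measured ping times and packet loss:

multimaster.heartbeat_send_timeout = 200    # interval between heartbeats, in ms
multimaster.heartbeat_recv_timeout = 1000   # silence period after which the node is considered failed, in ms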
By default, multimaster uses a majority-based
algorithm to determine whether the cluster nodes have a quorum: a cluster
can only continue working if the majority of its nodes are alive and can
access each other. For clusters with an even number of nodes, this
approach is not optimal. For example, if a network failure splits the
cluster into equal parts, or one of the nodes fails in a two-node
cluster, all the nodes stop accepting queries, even though at least
half of the cluster nodes are running normally.
To enable a smooth failover for such cases, you can set up a standalone referee node to assign the quorum status to a subset of nodes that constitutes half of the cluster.
A referee is a voting node used to determine which subset of nodes has a quorum if the cluster is split into equal parts. The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.
To set up a referee for your cluster:
Install Postgres Pro Enterprise on the node you are
going to make a referee and create the referee
extension:
CREATE EXTENSION referee;
Make sure the pg_hba.conf file allows
access to the referee node.
On all your cluster nodes, specify the referee connection string
in the postgresql.conf file:
multimaster.referee_connstring = connstring
where connstring holds libpq options
required to access the referee.
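For example, if the referee is reachable at a host named referee (the host, database, and user names are illustrative):

multimaster.referee_connstring = 'dbname=postgres user=postgres host=referee'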
The first subset of nodes that gets connected to the referee wins the voting and continues working. The referee keeps the voting result until all the other cluster nodes get online again. Then the result is discarded, and a new winner can be chosen in case of another network failure.
To avoid split-brain problems, you can only have a single referee in your cluster.
multimaster provides several functions to check the
current cluster state.
To check node-specific information, use mtm.status():
SELECT * FROM mtm.status();
To get the list of all nodes in the cluster together with their status,
use mtm.nodes():
SELECT * FROM mtm.nodes();
For details on all the returned information, see Section F.30.5.2.
If a cluster node is disabled, any attempt to read or write data on
this node raises an error by default. If you need to access the data
on a disabled node, you can override this behavior at connection time by setting the
application_name parameter to
mtm_admin. In this case, you can run read and
write queries on this node without multimaster
supervision.
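For example, to connect to a disabled node with psql bypassing multimaster supervision (the host and database names follow the examples in this section):

psql "host=node2 dbname=mydb user=mtmuser application_name=mtm_admin"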
With the multimaster extension, you can add or drop cluster nodes without
stopping the database service. When adding a new node, you need to load all the data to this node using
pg_basebackup from any cluster node, and then start this node.
Suppose we have a working cluster of three nodes, with
node1, node2, and
node3 host names. To add
node4, follow these steps:
Figure out the required connection string to
access the new node. For example, for the database
mydb, user mtmuser, and
the new node node4, the connection string
can be "dbname=mydb user=mtmuser host=node4".
In psql connected to any alive node, run:
SELECT mtm.add_node('dbname=mydb user=mtmuser host=node4');
This command changes the cluster configuration on all nodes
and creates replication slots for the new node. It also returns
node_id of the new node, which will be required
to complete the setup.
Go to the new node and clone all the data from one of the alive nodes to this node:
pg_basebackup -D datadir -h node1 -U mtmuser -c fast -v
pg_basebackup copies the entire data
directory from node1, together with
configuration settings, and prints the last LSN replayed from WAL,
such as '0/12D357F0'.
This value will be required to complete the setup.
Start Postgres Pro on the new node.
In psql connected to the node used to take the base backup, run:
SELECT mtm.join_node(4, '0/12D357F0');
where 4 is the node_id returned
by the mtm.add_node() function call and '0/12D357F0'
is the LSN value returned by pg_basebackup.
To remove the node from the cluster:
Run the mtm.nodes() function to learn the ID of the node to be removed:
SELECT * FROM mtm.nodes();
Run the mtm.drop_node() function with
this node ID as a parameter:
SELECT mtm.drop_node(3);
This will delete replication slots for node 3 on all cluster nodes and stop replication to this node.
If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.30.4.3.
multimaster.heartbeat_recv_timeout
Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.
Default: 1000 ms
multimaster.heartbeat_send_timeout
Time interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.
Default: 200 ms
multimaster.max_workers
The maximum number of walreceiver workers per peer node.
Use this parameter with caution: if the number of simultaneous transactions in the whole cluster exceeds this value, undetected deadlocks may occur.
Default: 100
multimaster.monotonic_sequences
Defines the sequence generation mode for unique identifiers. This variable can take the following values:
false (default) — ID generation on each node starts with this node's number and is incremented by the number of nodes. For example, in a three-node cluster, IDs 1, 4, and 7 are allocated to the objects written onto the first node, while IDs 2, 5, and 8 are reserved for the second node. If you change the number of nodes in the cluster, the increment interval for new IDs is adjusted accordingly.
true — the generated sequence increases monotonically cluster-wide. ID generation on each node starts with this node's number and is incremented by the number of nodes, but values are skipped if they are smaller than the IDs already generated on another node. For example, in a three-node cluster, if IDs 1, 4, and 7 are already allocated to the objects on the first node, IDs 2 and 5 are skipped on the second node. In this case, the first ID on the second node is 8. Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.
Default: false
multimaster.referee_connstring
Connection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.
multimaster.remote_functions
Provides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.
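A postgresql.conf sketch; the function names below are purely illustrative and stand for functions whose effects should be recomputed on each node rather than replicated:

multimaster.remote_functions = 'refresh_local_cache,collect_local_stats'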
multimaster.trans_spill_threshold
The maximum size of a transaction, in kB. When this threshold is reached, the transaction is written to disk.
Default: 100MB
mtm.init_cluster(my_conninfo text,
peers_conninfo text[])
Initializes cluster configuration on all nodes. It connects the
current node to all nodes listed in peers_conninfo
and creates the multimaster extension,
replication slots, and replication origins on each node. Run this function
once all the nodes are running and can accept connections.
Arguments:
my_conninfo — connection string to the
node on which you are running this function. Peer nodes use this
string to connect back to this node.
peers_conninfo — an array of connection
strings to all the other nodes to be added to the cluster.
mtm.add_node(conn_str text)
Adds a new node to the cluster. This function should be called
before loading data to this node using pg_basebackup.
mtm.add_node creates the required replication slots for a new node,
so you can add a node while the cluster is under load.
Arguments:
conn_str — connection string for the
new node. For example, for the database
mydb, user mtmuser,
and the new node node4, the connection
string is
"dbname=mydb user=mtmuser host=node4".
mtm.join_node(node_id int, backup_end_lsn pg_lsn)
Completes the cluster setup after adding a new node. This function should be called after the added node has been started.
Arguments:
node_id — ID of the node to add to the cluster.
It corresponds to the value in the id column returned by mtm.nodes().
backup_end_lsn — the last LSN of the base backup
copied to the new node. This LSN will be used as the starting point for data
replication once the node joins the cluster.
mtm.drop_node(node_id integer)
Excludes a node from the cluster.
Arguments:
node_id — ID of the node being dropped.
It corresponds to the value in the id column returned by mtm.nodes().
mtm.alter_sequences()
Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.
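For example, after deploying all cluster nodes from the same base backup, you could run the following (a minimal usage sketch):

SELECT mtm.alter_sequences();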
mtm.status()
Shows the status of the multimaster extension on the current node. Returns a tuple of the following values:
node_id, integer — ID of this node.
status, text — status of the node. Possible values are: online, recovery, recovered, disabled.
n_nodes, integer — number of nodes in the cluster. The majority of alive nodes is calculated based on this parameter.
n_connected, integer — number of connected nodes.
n_enabled, integer — number of enabled nodes.
mtm.nodes()
Shows the information on all nodes in the cluster. Returns a tuple of the following values:
id, integer — node ID.
enabled, boolean — shows whether the node is fully recovered and is operating normally.
The node gets disabled if responses to heartbeats are not received within the heartbeat_recv_timeout time interval.
When the node starts responding to heartbeats, multimaster can automatically restore the node and switch it back to the enabled state.
connected, boolean — shows whether the node is connected to the current node.
sender_pid, integer — WAL sender process ID.
receiver_pid, integer — WAL receiver process ID.
receiver_status, text — status of the node.
Possible values are: recovery, recovered.
conninfo, text — connection string to this node.
mtm.make_table_local(relation regclass)
Stops replication for the specified table.
Arguments:
relation — the table you would like to
exclude from the replication scheme.
multimaster.break_connection
Breaks connections with clients connected to the node if this node disconnects from the cluster. If this variable is set to false, the client stays connected to the node but receives an error indicating that the node is disabled.
Default: false
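A postgresql.conf sketch enabling this behavior:

multimaster.break_connection = true   # drop client connections when the node is disabled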
The multimaster extension currently passes nearly all PostgreSQL regression tests, except for a few edge cases related to working with temporary tables and to updating enums, which are not always transactional in PostgreSQL.
By default, any DDL statement is executed on all cluster nodes, except the following statements, which act locally on a given node:
ALTER SYSTEM
CREATE DATABASE
DROP DATABASE
REINDEX
CHECKPOINT
CLUSTER
LOAD
LISTEN
NOTIFY
Postgres Professional, Moscow, Russia.
The replication mechanism is based on logical decoding and an
earlier version of the pglogical extension
provided to the community by the 2ndQuadrant team.
The three-phase E3PC commit protocol is described in:
Idit Keidar, Danny Dolev. Increasing the Resilience of Distributed and Replicated Database Systems.
The parallel replication and recovery mechanism is similar to the one described in:
Odorico M. Mendizabal, et al. Checkpointing in Parallel State-Machine Replication.