multimaster is a Postgres Pro Enterprise extension with a set
of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing
cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.
As compared to a standard PostgreSQL master-standby cluster, a cluster configured with the multimaster extension offers the following benefits:
Fault tolerance and automatic node recovery
Synchronous logical replication and DDL replication
Read scalability
Working with temporary tables on each cluster node
Postgres Pro Enterprise online upgrades
The multimaster extension replicates your
database to all nodes of the cluster and allows write transactions
on each node. Write transactions are synchronously replicated to all nodes,
which increases commit latency. Read-only transactions and queries
are executed locally, without any measurable overhead.
To ensure high availability and fault tolerance of the cluster,
multimaster uses a three-phase commit protocol
and heartbeats for failure discovery. A multi-master cluster of N
nodes can continue working while the majority of the nodes are
alive and reachable by other nodes. To be configured with multimaster, the cluster must include at least two nodes. In most cases, three
cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
When a failed node
is reconnected to the cluster, multimaster can automatically
fast-forward the node to the actual state based on the
Write-Ahead Log (WAL) data in the corresponding replication slot.
If a node was excluded from the cluster, you can add it back using pg_basebackup.
When using multimaster, make sure to take its replication restrictions into account. For details, see Section F.30.1.
To learn more about the multimaster internals, see
Section F.30.2.
The multimaster extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:
Unlike in vanilla PostgreSQL, the read committed
isolation level can cause serialization failures on a multi-master cluster if there are
conflicting transactions from different nodes, so the application must be
ready to retry transactions.
The serializable isolation level works
only with respect to local transactions on the current node.
Sequence generation. To avoid conflicts between unique identifiers on different nodes, multimaster modifies the default behavior of sequence generators. By default, ID generation on each node starts with this node's number and is incremented by the number of nodes. For example, in a three-node cluster, IDs 1, 4, and 7 are allocated to the objects written onto the first node, while IDs 2, 5, and 8 are reserved for the second node. If you change the number of nodes in the cluster, the increment interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can set the multimaster.monotonic_sequences parameter to true (see the sketch after this list).
Accessing temporary and persistent tables in the same transaction is prohibited because Postgres Pro cannot prepare such transactions.
Commit latency. In the current implementation of logical
replication, multimaster sends data to subscriber nodes only after the
local commit, so you have to wait for transaction processing twice: first on the local node,
and then on all the other nodes simultaneously. In the case of a heavy-write transaction,
this may result in a noticeable delay.
Replicated non-conflicting transactions are applied on the receiving nodes in parallel, so such transactions may become visible on different nodes in different order.
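The following sketch illustrates the interleaved ID allocation described in the sequence generation restriction above; the table and column names are illustrative:

CREATE TABLE orders (id bigserial PRIMARY KEY, payload text);
-- With the default multimaster.monotonic_sequences = false, inserts on
-- node 1 draw IDs 1, 4, 7, ..., while inserts on node 2 draw IDs 2, 5, 8, ...
INSERT INTO orders (payload) VALUES ('first');
-- To make sequence values increase monotonically cluster-wide, set the
-- following in postgresql.conf on every node:
-- multimaster.monotonic_sequences = true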
Since each server in a multi-master cluster can accept writes, any server can abort a
transaction because of a concurrent update — in the same way as it
happens on a single server between different backends. To ensure
high availability and data consistency on all cluster nodes,
multimaster uses logical replication and the three-phase E3PC commit protocol.
When Postgres Pro Enterprise loads the multimaster shared
library, multimaster sets up a logical
replication producer and consumer for each node, and hooks into
the transaction commit pipeline. The typical data replication
workflow consists of the following phases:
PREPARE phase.
multimaster captures and implicitly
transforms each COMMIT statement to a
PREPARE statement. All the nodes that get
the transaction via the replication protocol (the
cohort nodes) send their vote for approving or
declining the transaction to the backend process on the
initiating node. This ensures that all the cohort nodes can accept
the transaction, and no write conflicts occur. For details on
PREPARE transactions support in PostgreSQL,
see the
PREPARE
TRANSACTION topic.
PRECOMMIT phase. If all the cohort nodes approve
the transaction, the backend process sends a
PRECOMMIT message to all the cohort nodes
to express an intention to commit the transaction. The cohort
nodes respond to the backend with the
PRECOMMITTED message. In case of a failure,
all the nodes can use this information to complete the
transaction using a quorum-based voting procedure.
COMMIT phase. If PRECOMMIT is successful, the transaction is committed on all nodes.
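Conceptually, this resembles an explicit two-phase commit with an additional precommit vote. The following SQL sketch approximates what multimaster performs implicitly; the transaction body and the GID are illustrative, and the PRECOMMIT exchange is an internal message with no SQL equivalent:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- The client's COMMIT is implicitly transformed into:
PREPARE TRANSACTION 'mtm-example-gid';  -- PREPARE phase: cohort nodes vote
-- (internal PRECOMMIT / PRECOMMITTED message exchange)
COMMIT PREPARED 'mtm-example-gid';      -- COMMIT phase on all nodes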
If a node crashes or gets disconnected from the cluster between the PREPARE and COMMIT phases, the PRECOMMIT phase ensures that the surviving nodes have enough information to complete the prepared transaction. The PRECOMMITTED messages help avoid the situation when the crashed node has already committed or aborted the transaction, but has not notified the other nodes about the transaction status. In a two-phase commit (2PC), such a transaction would block resources (hold locks) until the crashed node recovers, or data inconsistencies could appear in the database once the failed node is recovered, for example, if the failed node committed the transaction, but a surviving node aborted it.
To complete the transaction, the backend must receive a response
from the majority of the nodes. For example, for a cluster of 2N+1 nodes,
at least N+1 responses are required. Thus, multimaster ensures that
your cluster is available for reads and writes while the majority
of the nodes are connected, and no data inconsistencies occur in
case of a node or connection failure.
Since multimaster allows writes to each node,
it has to wait for responses about transaction acknowledgement
from all the other nodes. Without special actions in case of a
node failure, each commit would have to wait until the failed node
recovery. To deal with such situations,
multimaster periodically sends heartbeats to
check the node state and the connectivity between nodes. When several successive heartbeats to a node are lost, this node is excluded from the cluster to allow writes to the remaining alive nodes. You
can configure the heartbeat frequency and the response timeout in
the multimaster.heartbeat_send_timeout and
multimaster.heartbeat_recv_timeout parameters,
respectively.
For alive nodes, there is no way to distinguish between a failed
node that stopped serving requests and a network-partitioned node
that can be accessed by database users, but is unreachable for
other nodes. To avoid conflicting writes to nodes in different
network partitions, multimaster only allows
writes to the nodes that see the majority of other nodes.
For example, suppose a five-node multi-master cluster experienced
a network failure that split the network into two isolated
subnets, with two and three cluster nodes. Based on heartbeats
propagation information, multimaster will
continue accepting writes at each node in the bigger partition,
and deny all writes in the smaller one. Thus, a cluster consisting
of 2N+1 nodes can tolerate N node failures and stay alive if any
N+1 nodes are alive and connected to each other.
For clusters with an even number of nodes, you can set up a lightweight referee node that does not hold the data, but acts as a tie-breaker during symmetric node partitioning. For details, see Section F.30.3.3.
In case of a partial network split when different nodes have
different connectivity, multimaster finds a
fully connected subset of nodes and disconnects nodes outside of this subset. For
example, in a three-node cluster, if node A can access both B and
C, but node B cannot access node C, multimaster
isolates node C to ensure that both A and B can work.
If you try to access a disconnected node, multimaster returns an error
message indicating the current status of the node. To prevent stale reads,
read-only queries are also forbidden.
Each node maintains a data structure that keeps the information about the state of all
nodes in relation to this node. You can get this data by calling the
mtm.status() and the mtm.nodes() functions.
When a failed node connects back to the cluster,
multimaster starts automatic recovery:
The reconnected node selects a random cluster node and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL). All the cluster nodes get locked for write transactions to allow the recovery process to finish.
When the recovery is complete, multimaster
promotes the reconnected node to the online state and
includes it into the replication scheme.
Automatic recovery requires presence of all WAL files generated after node failure. If a node is down for a long time and WAL files are no longer available, you may have to exclude this node from the cluster and manually restore it from one of the working nodes using pg_basebackup. For details, see Section F.30.4.3.
mtm-monitor
Starts all other workers for a database managed with multimaster. This is the first worker loaded during multimaster boot. Each multimaster node has a single mtm-monitor worker. When a new node is added, mtm-monitor starts mtm-logrep-receiver and mtm-dmq-receiver workers to enable replication to this node. If a node is dropped, mtm-monitor stops the mtm-logrep-receiver and mtm-dmq-receiver workers that have been serving the dropped node. Each mtm-monitor controls workers on its own node only.
mtm-logrep-receiver
Receives the logical replication stream from a given peer node. During recovery, all received transactions are applied by mtm-logrep-receiver itself. During normal operation, mtm-logrep-receiver passes transactions to the pool of dynamic workers (see mtm-logrep-receiver-dynworker). The number of mtm-logrep-receiver workers on each node corresponds to the number of peer nodes available.
mtm-dmq-receiver
Receives acknowledgements for transactions sent to peers and checks for heartbeat timeouts. The number of mtm-dmq-receiver workers on each node corresponds to the number of peer nodes available.
mtm-dmq-sender
Collects acknowledgements for transactions applied on the current node and sends them to the corresponding mtm-dmq-receiver on the peer node. There is a single mtm-dmq-sender worker per Postgres Pro instance.
mtm-logrep-receiver-dynworker
Dynamic pool worker for a given mtm-logrep-receiver. Applies the replicated transactions received during normal operation. There are up to multimaster.max_workers such workers per peer node.
mtm-resolver
Asks all peers about the status of unfinished transactions to resolve them according to 3PC rules. This worker is only active during recovery.
To use multimaster, you need to install
Postgres Pro Enterprise on all nodes of your cluster.
Postgres Pro Enterprise includes all the required dependencies and
extensions.
Suppose you are setting up a cluster of three nodes, with
node1, node2, and
node3 host names. After installing Postgres Pro Enterprise on all nodes, you need to initialize the data directory on each node, as explained in Section 18.2.
If you would like to set up a multi-master cluster for an already existing mydb database,
you can load data from mydb to one of the nodes once the cluster is initialized,
or you can load data to all new nodes before cluster initialization using any convenient mechanism,
such as pg_basebackup or pg_dump.
Once the data directory is set up, complete the following steps on each cluster node:
Modify the postgresql.conf configuration
file, as follows:
Add multimaster to the shared_preload_libraries variable:
shared_preload_libraries = 'multimaster'
If the shared_preload_libraries variable is already defined in postgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2.
Note that in a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the node from which it was run.
Set up Postgres Pro parameters related to replication:
wal_level = logical
max_connections = 100
max_prepared_transactions = 300   # max_connections * N
max_wal_senders = 10              # at least N
max_replication_slots = 10        # at least 2N
wal_sender_timeout = 0
where N is the number of nodes in your cluster.
You must change the replication level to
logical as
multimaster relies on logical
replication. For a cluster of N nodes, enable at least N WAL sender processes and at least 2N replication slots. Since
multimaster implicitly adds a
PREPARE phase to each
COMMIT transaction, make sure to set
the number of prepared transactions to N * max_connections.
wal_sender_timeout should be disabled as multimaster uses
its custom logic for failure detection.
Make sure you have enough background workers allocated for each node:
max_worker_processes = 250 # (N - 1) * (max_connections + 3) + 3
For example, for a three-node cluster with max_connections = 100, multimaster may need up to 209 background workers at peak times: three always-on workers (monitor, resolver, dmq-sender), three workers per peer node (walsender, mtm-receiver, dmq-receiver), and up to 200 dynamic replication workers (that is, max_connections workers per peer node). When setting this parameter, remember that other modules may also use background workers at the same time.
Depending on your network environment and usage patterns, you
may want to tune other multimaster
parameters. For details, see
Section F.30.3.2.
Start Postgres Pro Enterprise on all nodes.
Create database mydb and user mtmuser
on each node:
CREATE USER mtmuser WITH SUPERUSER PASSWORD 'mtmuserpassword';
CREATE DATABASE mydb OWNER mtmuser;
If you are using password-based authentication, you may want to create a password file (see the sketch below).
You can omit this step if you already have a database you are going
to replicate, but you are recommended to create a separate superuser
for multi-master replication. The examples below assume that you are going to
replicate the mydb database on behalf of
mtmuser.
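If you use password-based authentication, a password file lets replication connections authenticate without interactive prompts. A minimal ~/.pgpass sketch for this example, assuming the default port 5432 (the file must be readable only by its owner, mode 0600; entries cover both the mydb and replication connections):

node1:5432:mydb:mtmuser:mtmuserpassword
node1:5432:replication:mtmuser:mtmuserpassword
node2:5432:mydb:mtmuser:mtmuserpassword
node2:5432:replication:mtmuser:mtmuserpassword
node3:5432:mydb:mtmuser:mtmuserpassword
node3:5432:replication:mtmuser:mtmuserpassword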
Allow replication of the mydb database
to each cluster node on behalf of mtmuser,
as explained in Section 20.1.
Make sure to use the
authentication method that
satisfies your security requirements. For example,
pg_hba.conf might have the following lines on node1:
host replication mtmuser node2 md5
host mydb        mtmuser node2 md5
host replication mtmuser node3 md5
host mydb        mtmuser node3 md5
Connect to any node on behalf of the mtmuser database user,
create the multimaster extension
in the mydb database and run mtm.init_cluster():
CREATE EXTENSION multimaster;
SELECT mtm.init_cluster('dbname=mydb user=mtmuser host=node1',
'{"dbname=mydb user=mtmuser host=node2", "dbname=mydb user=mtmuser host=node3"}');
where the first argument of mtm.init_cluster() is a
connection string
to the current node and the second argument is an array of
connection strings to all the other nodes.
To ensure that multimaster is enabled, you can run
the mtm.status() and mtm.nodes() functions:
SELECT * FROM mtm.status();
SELECT * FROM mtm.nodes();
If status is equal to online
and all nodes are present in the mtm.nodes output,
your cluster is successfully configured and ready to use.
If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
SELECT mtm.make_table_local('table_name');
While you can use multimaster in the default
configuration, you may want to tune several parameters for faster
failure detection or more reliable automatic recovery.
To check availability of the peer nodes,
multimaster periodically sends heartbeat
packets to all nodes. You can define the timeout for failure detection with the following variables:
The multimaster.heartbeat_send_timeout
variable defines the time interval between the
heartbeats. By default, this variable is set to 200ms.
The multimaster.heartbeat_recv_timeout
variable sets the timeout for the response. If no heartbeats are
received during this time, the node is assumed to be
disconnected and is excluded from the cluster. By default,
this variable is set to 1000ms.
It is a good idea to set multimaster.heartbeat_send_timeout based on typical ping latencies between the nodes. A small recv/send ratio decreases failure detection time, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.
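For example, the following postgresql.conf sketch restates the defaults, which suit a low-latency network; adjust the values to your measured ping times and packet loss:

multimaster.heartbeat_send_timeout = 200    # interval between heartbeats, in ms
multimaster.heartbeat_recv_timeout = 1000   # silence period after which the node is considered failed, in ms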
By default, multimaster uses a majority-based
algorithm to determine whether the cluster nodes have a quorum: a cluster
can only continue working if the majority of its nodes are alive and can
access each other. For clusters with an even number of nodes, this
approach is not optimal. For example, if a network failure splits the
cluster into equal parts, or one of the nodes fails in a two-node
cluster, all the nodes stop accepting queries, even though at least
half of the cluster nodes are running normally.
To enable a smooth failover for such cases, you can set up a standalone referee node to assign the quorum status to a subset of nodes that constitutes half of the cluster.
A referee is a voting node used to determine which subset of nodes has a quorum if the cluster is split into equal parts. The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.
To set up a referee for your cluster:
Install Postgres Pro Enterprise on the node you are
going to make a referee and create the referee
extension:
CREATE EXTENSION referee;
Make sure the pg_hba.conf file allows
access to the referee node.
On all your cluster nodes, specify the referee connection string
in the postgresql.conf file:
multimaster.referee_connstring = connstring
where connstring holds libpq options
required to access the referee.
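For example, if the referee is reachable at a host named referee (the host, database, and user names are illustrative):

multimaster.referee_connstring = 'dbname=postgres user=postgres host=referee'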
The first subset of nodes that gets connected to the referee wins the voting and continues working. The referee keeps the voting result until all the other cluster nodes get online again. Then the result is discarded, and a new winner can be chosen in case of another network failure.
To avoid split-brain problems, you can only have a single referee in your cluster.
multimaster provides several functions to check the
current cluster state.
To check node-specific information, use mtm.status():
SELECT * FROM mtm.status();
To get the list of all nodes in the cluster together with their status,
use mtm.nodes():
SELECT * FROM mtm.nodes();
For details on all the returned information, see Section F.30.5.2.
If a cluster node is disabled, any attempt to read or write data on
this node raises an error by default. If you need to access the data
on a disabled node, you can override this behavior at connection time by setting the
application_name parameter to
mtm_admin. In this case, you can run read and
write queries on this node without multimaster
supervision.
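For example, to connect to a disabled node with psql bypassing multimaster supervision (the host and database names follow the examples in this section):

psql "host=node2 dbname=mydb user=mtmuser application_name=mtm_admin"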
With the multimaster extension, you can add or drop cluster nodes without
stopping the database service. When adding a new node, you need to load all the data to this node using
pg_basebackup from any cluster node, and then start this node.
Suppose we have a working cluster of three nodes, with
node1, node2, and
node3 host names. To add
node4, follow these steps:
Figure out the required connection string to
access the new node. For example, for the database
mydb, user mtmuser, and
the new node node4, the connection string
can be "dbname=mydb user=mtmuser host=node4".
In psql connected to any alive node, run:
SELECT mtm.add_node('dbname=mydb user=mtmuser host=node4');
This command changes the cluster configuration on all nodes
and creates replication slots for the new node. It also returns
node_id of the new node, which will be required
to complete the setup.
Go to the new node and clone all the data from one of the alive nodes to this node:
pg_basebackup -D datadir -h node1 -U mtmuser -c fast -v
pg_basebackup copies the entire data
directory from node1, together with
configuration settings, and prints the last LSN replayed from WAL,
such as '0/12D357F0'.
This value will be required to complete the setup.
Start Postgres Pro on the new node.
In psql connected to the node used to take the base backup, run:
SELECT mtm.join_node(4, '0/12D357F0');
where 4 is the node_id returned
by the mtm.add_node() function call and '0/12D357F0'
is the LSN value returned by pg_basebackup.
To remove the node from the cluster:
Run the mtm.nodes() function to learn the ID of the node to be removed:
SELECT * FROM mtm.nodes();
Run the mtm.drop_node() function with
this node ID as a parameter:
SELECT mtm.drop_node(3);
This will delete replication slots for node 3 on all cluster nodes and stop replication to this node.
If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.30.4.3.
multimaster.heartbeat_recv_timeout
Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.
Default: 1000 ms
multimaster.heartbeat_send_timeout
Time interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.
Default: 200 ms
multimaster.max_workers
The maximum number of walreceiver workers per peer node.
Use this parameter with caution: if the number of simultaneous transactions in the whole cluster exceeds this value, undetected deadlocks may occur.
Default: 100
multimaster.monotonic_sequences
Defines the sequence generation mode for unique identifiers. This variable can take the following values:
false (default) — ID generation on each node starts with this node's number and is incremented by the number of nodes. For example, in a three-node cluster, IDs 1, 4, and 7 are allocated to the objects written onto the first node, while IDs 2, 5, and 8 are reserved for the second node. If you change the number of nodes in the cluster, the increment interval for new IDs is adjusted accordingly.
true — the generated sequence increases monotonically cluster-wide. ID generation on each node starts with this node's number and is incremented by the number of nodes, but values are skipped if they are smaller than the IDs already generated on another node. For example, in a three-node cluster, if IDs 1, 4, and 7 are already allocated to the objects on the first node, IDs 2 and 5 are skipped on the second node. In this case, the first ID on the second node is 8. Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.
Default: false
multimaster.referee_connstring
Connection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.
multimaster.remote_functions
Provides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.
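A postgresql.conf sketch; the function names below are purely illustrative and stand for functions whose effects should be recomputed on each node rather than replicated:

multimaster.remote_functions = 'refresh_local_cache,collect_local_stats'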
multimaster.trans_spill_threshold
The maximum size of a transaction, in kB. When this threshold is reached, the transaction is written to disk.
Default: 100MB
mtm.init_cluster(my_conninfo text,
peers_conninfo text[])
Initializes cluster configuration on all nodes. It connects the
current node to all nodes listed in peers_conninfo
and creates the multimaster extension,
replication slots, and replication origins on each node. Run this function
once all the nodes are running and can accept connections.
Arguments:
my_conninfo — connection string to the
node on which you are running this function. Peer nodes use this
string to connect back to this node.
peers_conninfo — an array of connection
strings to all the other nodes to be added to the cluster.
mtm.add_node(conn_str text)
Adds a new node to the cluster. This function should be called
before loading data to this node using pg_basebackup.
mtm.add_node creates the required replication slots for a new node,
so you can add a node while the cluster is under load.
Arguments:
conn_str — connection string for the
new node. For example, for the database
mydb, user mtmuser,
and the new node node4, the connection
string is
"dbname=mydb user=mtmuser host=node4".
mtm.join_node(node_id int, backup_end_lsn pg_lsn)
Completes the cluster setup after adding a new node. This function should be called after the added node has been started.
Arguments:
node_id — ID of the node to add to the cluster.
It corresponds to the value in the id column returned by mtm.nodes().
backup_end_lsn — the last LSN of the base backup
copied to the new node. This LSN will be used as the starting point for data
replication once the node joins the cluster.
mtm.drop_node(node_id integer)
Excludes a node from the cluster.
Arguments:
node_id — ID of the node being dropped.
It corresponds to the value in the id column returned by mtm.nodes().
mtm.alter_sequences()
Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.
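For example, after deploying all cluster nodes from the same base backup, you could run the following (a minimal usage sketch):

SELECT mtm.alter_sequences();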
mtm.status()
Shows the status of the multimaster extension on the current node. Returns a tuple of the following values:
node_id, integer — ID of this node.
status, text — status of the node. Possible values are: online, recovery, recovered, disabled.
n_nodes, integer — number of nodes in the cluster. The majority of alive nodes is calculated based on this parameter.
n_connected, integer — number of connected nodes.
n_enabled, integer — number of enabled nodes.
mtm.nodes()
Shows the information on all nodes in the cluster. Returns a tuple of the following values:
id, integer — node ID.
enabled, boolean — shows whether the node is fully recovered and is operating normally.
The node gets disabled if responses to heartbeats are not received within the heartbeat_recv_timeout time interval.
When the node starts responding to heartbeats, multimaster can automatically restore the node and switch it back to the enabled state.
connected, boolean — shows whether the node is connected to the current node.
sender_pid, integer — WAL sender process ID.
receiver_pid, integer — WAL receiver process ID.
receiver_status, text — status of the node.
Possible values are: recovery, recovered.
conninfo, text — connection string to this node.
mtm.make_table_local(relation regclass)
Stops replication for the specified table.
Arguments:
relation — the table you would like to
exclude from the replication scheme.
multimaster.break_connection
Breaks connections with clients connected to the node if this node disconnects from the cluster. If this variable is set to false, the client stays connected to the node but receives an error indicating that the node is disabled.
Default: false
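A postgresql.conf sketch enabling this behavior:

multimaster.break_connection = true   # drop client connections when the node is disabled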
The multimaster extension currently passes nearly all PostgreSQL regression tests, except for a few edge cases related to working with temporary tables and to updating enums, which are not always transactional in PostgreSQL.
By default, any DDL statement is executed on all cluster nodes, except the following statements, which act locally on a given node:
ALTER SYSTEM
CREATE DATABASE
DROP DATABASE
REINDEX
CHECKPOINT
CLUSTER
LOAD
LISTEN
NOTIFY
Postgres Professional, Moscow, Russia.
The replication mechanism is based on logical decoding and an
earlier version of the pglogical extension
provided to the community by the 2ndQuadrant team.
The three-phase E3PC commit protocol is described in:
Idit Keidar, Danny Dolev. Increasing the Resilience of Distributed and Replicated Database Systems.
The parallel replication and recovery mechanism is similar to the one described in:
Odorico M. Mendizabal, et al. Checkpointing in Parallel State-Machine Replication.