multimaster is a Postgres Pro Enterprise extension, with a set of core patches, that turns Postgres Pro Enterprise into a synchronous shared-nothing cluster. It provides Online Transaction Processing (OLTP) read scalability and high availability with automatic disaster recovery.
As compared to a standard PostgreSQL master-standby cluster, a cluster configured with the multimaster extension offers the following benefits:
Fault tolerance and automatic node recovery
Synchronous logical replication and DDL replication
Read scalability
Working with temporary tables on each cluster node
Seamless upgrade of Postgres Pro Enterprise minor releases for multimaster cluster clients.
The multimaster extension is no longer supported
in Postgres Pro Enterprise 12. Please consider migrating to
the latest Postgres Pro Enterprise major version to ensure stable
performance of multimaster.
Before deploying multimaster on production
systems, make sure to take its replication restrictions into
account. For details, see Section F.30.1.
The multimaster extension replicates your
database to all nodes of the cluster and allows write transactions
on each node. Write transactions are synchronously replicated to all nodes,
which increases commit latency. Read-only transactions and queries
are executed locally, without any measurable overhead.
To ensure high availability and fault tolerance of the cluster,
multimaster uses a three-phase commit protocol
and heartbeats for failure detection. A multi-master cluster of N
nodes can continue working while the majority of the nodes are
alive and reachable by each other. To be configured with
multimaster, the cluster must include at least
two nodes. Since the data on all cluster nodes is the same, you do not
typically need more than five cluster nodes. Three cluster nodes are
enough to ensure high availability in most cases.
For best experience, it is recommended to set up two data nodes and
a referee, as explained in Section F.30.3.3.
When a failed node
is reconnected to the cluster, multimaster can automatically
fast-forward the node to the actual state based on the
Write-Ahead Log (WAL) data in the corresponding replication slot.
If a node was excluded from the cluster, you can add it back using pg_basebackup.
To learn more about the multimaster internals, see
Section F.30.2.
The multimaster extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:
Microsoft Windows operating system is not supported.
1C solutions are not supported.
multimaster can replicate only one database
in a cluster. If you need to replicate the contents of several
databases, you can either transfer all data into separate schemas
within a single database or create a separate cluster for each
database and set up multimaster for each cluster.
Large objects are not supported. Although creating large objects is allowed, multimaster cannot replicate such objects, and their OIDs may conflict on different nodes, so their use is not recommended.
Since multimaster is based on
logical replication
and three-phase E3PC commit protocol, its operation is
highly affected by network latency. It is not recommended to
set up a multimaster cluster with geographically
distributed nodes.
Using tables without primary keys can have a negative impact
on performance. In some cases, it can even make it impossible
to restore a cluster node, so you should avoid replicating such
tables with multimaster.
Unlike in vanilla PostgreSQL, the Read Committed
isolation level can cause serialization failures on a multi-master cluster (with SQLSTATE code '40001') if there are
conflicting transactions from different nodes, so the application must be
ready to retry transactions.
Serializable isolation level works
only with respect to local transactions on the current node.
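Application-side retry logic for such serialization failures can be sketched as follows. This is a minimal, driver-agnostic Python illustration: the SerializationFailure class here merely stands in for your driver's SQLSTATE 40001 error (for example, psycopg2's SerializationFailure), and run_with_retry is a hypothetical helper, not part of multimaster.

```python
import time

class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""

def run_with_retry(txn, max_attempts=5, base_delay=0.05):
    """Re-run a transaction function when it aborts with a serialization failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # Exponential backoff lowers the chance of hitting the same conflict again.
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulated transaction that conflicts twice before succeeding.
attempts = {"n": 0}
def transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure("40001")
    return "committed"

print(run_with_retry(transfer))  # committed
```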
Sequence generation. To avoid conflicts between unique identifiers on different nodes,
multimaster modifies the default behavior of sequence generators.
By default, ID generation on each node is started with this node number and is
incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7
IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved
for the second node. If you change the number of nodes in the cluster, the incrementation
interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not
monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can
set the multimaster.monotonic_sequences parameter
to true.
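The default allocation scheme can be illustrated with a short Python sketch (illustrative only; the actual behavior is implemented inside multimaster's sequence generators):

```python
def node_sequence(node_number, n_nodes, count):
    """IDs a given node allocates by default: start at the node number,
    step by the number of nodes in the cluster."""
    return [node_number + n_nodes * i for i in range(count)]

# Three-node cluster: each node draws from a disjoint arithmetic progression.
print(node_sequence(1, 3, 3))  # [1, 4, 7]
print(node_sequence(2, 3, 3))  # [2, 5, 8]
```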
Commit latency. In the current implementation of logical
replication, multimaster sends data to subscriber nodes only after the
local commit, so you have to wait for transaction processing twice: first on the local node,
and then on all the other nodes simultaneously. In the case of a heavy-write transaction,
this may result in a noticeable delay.
Logical replication does not guarantee that a system object
OID is the same on all cluster nodes, so OIDs for the same object may
differ between multimaster cluster nodes.
If your driver or application relies on OIDs, make sure that their
use is restricted to connections to one and the same node to avoid
errors. For example, the Npgsql driver may not
work correctly with multimaster if the
NpgsqlConnection.GlobalTypeMapper method tries
using OIDs in connections to different cluster nodes.
Replicated non-conflicting transactions are applied on the receiving nodes in parallel, so such transactions may become visible on different nodes in different order.
CREATE INDEX CONCURRENTLY and REINDEX
CONCURRENTLY are not supported.
The COMMIT AND CHAIN feature is not supported.
Since each server in a multi-master cluster can accept writes, any server can abort a
transaction because of a concurrent update — in the same way as it
happens on a single server between different backends. To ensure
high availability and data consistency on all cluster nodes,
multimaster uses logical replication and the three-phase E3PC commit protocol.
When Postgres Pro Enterprise loads the multimaster shared
library, multimaster sets up a logical
replication producer and consumer for each node, and hooks into
the transaction commit pipeline. The typical data replication
workflow consists of the following phases:
PREPARE phase.
multimaster captures and implicitly
transforms each COMMIT statement to a
PREPARE statement. All the nodes that get
the transaction via the replication protocol (the
cohort nodes) send their vote for approving or
declining the transaction to the backend process on the
initiating node. This ensures that all the cohort nodes can accept
the transaction, and no write conflicts occur. For details on
PREPARE transactions support in PostgreSQL,
see the
PREPARE
TRANSACTION topic.
PRECOMMIT phase. If all the cohort nodes approve
the transaction, the backend process sends a
PRECOMMIT message to all the cohort nodes
to express an intention to commit the transaction. The cohort
nodes respond to the backend with the
PRECOMMITTED message. In case of a failure,
all the nodes can use this information to complete the
transaction using a quorum-based voting procedure.
COMMIT phase. If
PRECOMMIT is successful, the transaction
is committed to all nodes.
If a node crashes or gets disconnected from the cluster between
the PREPARE and COMMIT
phases, the PRECOMMIT phase ensures that the
survived nodes have enough information to complete the prepared
transaction. The PRECOMMITTED messages help
avoid the situation when the crashed node has already committed
or aborted the transaction, but has not notified other nodes
about the transaction status. In a two-phase commit (2PC), such a
transaction would block resources (hold locks) until the recovery
of the crashed node. Otherwise, data inconsistencies can appear
in the database when the failed node is recovered, for example, if
the failed node committed the transaction, but the survived node
aborted it.
To complete the transaction, the backend must receive a response
from the majority of the nodes. For example, for a cluster of 2N+1 nodes,
at least N+1 responses are required. Thus, multimaster ensures that
your cluster is available for reads and writes while the majority
of the nodes are connected, and no data inconsistencies occur in
case of a node or connection failure.
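The majority requirement amounts to a simple arithmetic check, sketched here in Python (an illustration of the rule, not multimaster code):

```python
def has_quorum(alive, total):
    """A multimaster cluster accepts writes only while a majority
    of its nodes are alive and mutually connected."""
    return alive >= total // 2 + 1

# A cluster of 2N+1 = 5 nodes survives N = 2 failures:
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
```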
Since multimaster allows writes to each node,
it has to wait for responses about transaction acknowledgment
from all the other nodes. Without special actions in case of a
node failure, each commit would have to wait until the failed node
recovery. To deal with such situations,
multimaster periodically sends heartbeats to
check the node state and the connectivity between nodes. When several
consecutive heartbeats to a node are lost, this node is excluded
from the cluster to allow writes to the remaining alive nodes. You
can configure the heartbeat frequency and the response timeout in
the multimaster.heartbeat_send_timeout and
multimaster.heartbeat_recv_timeout parameters,
respectively.
For example, suppose a five-node multi-master cluster experienced
a network failure that split the network into two isolated
subnets, with two and three cluster nodes. Based on heartbeat
propagation information, multimaster will
continue accepting writes at each node in the bigger partition,
and deny all writes in the smaller one. Thus, a cluster consisting
of 2N+1 nodes can tolerate N node failures and stay alive if any
N+1 nodes are alive and connected to each other.
In case of a partial network split when different nodes have
different connectivity, multimaster finds a
fully connected subset of nodes and disconnects nodes outside of this subset. For
example, in a three-node cluster, if node A can access both B and
C, but node B cannot access node C, multimaster
isolates node C to ensure that both A and B can work.
For clusters with an even number of nodes, you can set up a lightweight referee node that does not hold the data, but acts as a tie-breaker during symmetric node partitioning. For details, see Section F.30.3.3.
For alive nodes, there is no way to distinguish between a failed
node that stopped serving requests and a network-partitioned node
that can be accessed by database users, but is unreachable for
other nodes. To avoid conflicting writes to nodes in different
network partitions, multimaster only allows
writes to the nodes that see the majority of other nodes. If
you try to access a disconnected node, multimaster
returns an error message indicating the current status of the node.
To prevent stale reads, read-only queries are also forbidden. Thus,
if you would like to continue using a disconnected node outside of
the cluster in the standalone mode, you have to uninstall the
multimaster extension on this node, as
explained in Section F.30.4.5.
Each node maintains a data structure that keeps the information about the state of all
nodes in relation to this node. You can get this data by calling the
mtm.status() and the mtm.nodes() functions.
When a failed node connects back to the cluster,
multimaster starts automatic recovery:
The reconnected node selects a random cluster node and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL). All the cluster nodes get locked for write transactions to allow the recovery process to finish.
When the recovery is complete, multimaster
promotes the reconnected node to the online state and
includes it into the replication scheme.
Automatic recovery requires presence of all WAL files generated after node failure. If a node is down for a long time and WAL files are no longer available, you may have to exclude this node from the cluster and manually restore it from one of the working nodes using pg_basebackup. For details, see Section F.30.4.3.
Starts all other workers for a database managed with multimaster.
This is the first worker loaded during multimaster boot.
Each multimaster node has a single mtm-monitor worker.
When a new node is added, mtm-monitor starts mtm-logrep-receiver and
mtm-dmq-receiver workers to enable replication to this node.
If a node is dropped, mtm-monitor stops mtm-logrep-receiver
and mtm-dmq-receiver workers that have been serving the dropped node.
Each mtm-monitor controls workers on its own node only.
Receives the logical replication stream from a given peer node. During recovery,
all received transactions are applied by mtm-logrep-receiver itself.
During normal operation, mtm-logrep-receiver passes transactions to
the pool of dynamic workers (see mtm-logrep-receiver-dynworker).
The number of mtm-logrep-receiver workers on each node corresponds
to the number of peer nodes available.
Receives acknowledgment for transactions sent to peers and
checks for heartbeat timeouts.
The number of mtm-dmq-receiver workers on each node corresponds
to the number of peer nodes available.
Collects acknowledgment for transactions applied on the current node and
sends them to the corresponding mtm-dmq-receiver on the peer node.
There is a single mtm-dmq-sender worker per Postgres Pro instance.
Dynamic pool worker for a given mtm-logrep-receiver. Applies
the replicated transaction received during normal operation. There are up to
multimaster.max_workers workers per each peer node.
Asks all peers about the status of unfinished transactions to resolve them according to 3PC rules. This worker is only active during recovery.
To use multimaster, you need to install
Postgres Pro Enterprise on all nodes of your cluster.
Postgres Pro Enterprise includes all the required dependencies and
extensions. For best experience, it is recommended to set up two data
nodes and a referee, as explained in Section F.30.3.3.
Suppose you are setting up a cluster of three nodes, with
node1, node2, and
node3 host names. After installing Postgres Pro Enterprise on all nodes, you need to
initialize the data directory on each node, as explained in Section 18.2.
If you would like to set up a multi-master cluster for an already existing mydb database,
you can load data from mydb to one of the nodes once the cluster is initialized,
or you can load data to all new nodes before cluster initialization using any convenient mechanism,
such as pg_basebackup or pg_dump.
Once the data directory is set up, complete the following steps on each cluster node:
Modify the postgresql.conf configuration
file, as follows:
Add multimaster to the shared_preload_libraries variable:
shared_preload_libraries = 'multimaster'
If the shared_preload_libraries variable is already defined in postgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2.
Note that in a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the node from which it was run.
Set up Postgres Pro parameters related to replication:
wal_level = logical
max_connections = 100
max_prepared_transactions = 300 # max_connections * N
max_wal_senders = 10            # at least N
max_replication_slots = 10      # at least 2N
wal_sender_timeout = 0
where N is the number of nodes in your cluster.
You must change the replication level to
logical as
multimaster relies on logical
replication. For a cluster of N nodes, enable at least N
WAL sender processes and replication slots. Since
multimaster implicitly adds a
PREPARE phase to each
COMMIT transaction, make sure to set
the number of prepared transactions to N * max_connections.
wal_sender_timeout should be disabled as multimaster uses
its custom logic for failure detection.
Make sure you have enough background workers allocated for each node:
max_worker_processes = 250 # (N - 1) * (max_connections + 3) + 3
For example, for a three-node cluster with
max_connections = 100,
multimaster may need up to 209
background workers at peak times: three always-on workers
(monitor, resolver, dmq-sender), three workers per each peer node
(walsender, mtm-receiver, dmq-receiver) and up to 200 replication
dynamic workers (that is, max_connections workers
per each peer node). When setting this parameter, remember
that other modules may also use background workers at the
same time.
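The worker budget arithmetic from this example can be double-checked with a short Python sketch; the breakdown into always-on, per-peer, and dynamic workers follows the description above:

```python
def max_worker_processes(n_nodes, max_connections):
    """Background workers multimaster may need at peak:
    (N - 1) * (max_connections + 3) + 3."""
    always_on = 3                               # monitor, resolver, dmq-sender
    per_peer = 3 * (n_nodes - 1)                # walsender, receiver, dmq-receiver per peer
    dynamic = max_connections * (n_nodes - 1)   # dynamic apply workers per peer
    return always_on + per_peer + dynamic

# Three-node cluster with max_connections = 100:
print(max_worker_processes(3, 100))  # 209
```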
Depending on your network environment and usage patterns, you
may want to tune other multimaster
parameters. For details, see
Section F.30.3.2.
Start Postgres Pro Enterprise on all nodes.
Create database mydb and user mtmuser
on each node:
CREATE USER mtmuser WITH SUPERUSER PASSWORD 'mtmuserpassword';
CREATE DATABASE mydb OWNER mtmuser;
If you are using password-based authentication, you may want to create a password file.
You can omit this step if you already have a database you are going
to replicate, but you are recommended to create a separate superuser
for multi-master replication. The examples below assume that you are going to
replicate the mydb database on behalf of
mtmuser.
Allow replication of the mydb database
to each cluster node on behalf of mtmuser,
as explained in Section 20.1.
Make sure to use the
authentication method that
satisfies your security requirements. For example,
pg_hba.conf might have the following lines on node1:
host replication mtmuser node2 md5
host mydb        mtmuser node2 md5
host replication mtmuser node3 md5
host mydb        mtmuser node3 md5
Connect to any node on behalf of the mtmuser database user,
create the multimaster extension
in the mydb database and run
mtm.init_cluster(), specifying the connection
string to the current node as the first argument and an array
of connection strings to the other nodes as the second argument.
For example, if you would like to connect to node1, run:
CREATE EXTENSION multimaster;
SELECT mtm.init_cluster('dbname=mydb user=mtmuser host=node1',
'{"dbname=mydb user=mtmuser host=node2", "dbname=mydb user=mtmuser host=node3"}');
To ensure that multimaster is enabled, you can run
the mtm.status() and mtm.nodes() functions:
SELECT * FROM mtm.status();
SELECT * FROM mtm.nodes();
If status is equal to online
and all nodes are present in the mtm.nodes output,
your cluster is successfully configured and ready to use.
If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
SELECT mtm.make_table_local('table_name');
While you can use multimaster in the default
configuration, you may want to tune several parameters for faster
failure detection or more reliable automatic recovery.
To check availability of the peer nodes,
multimaster periodically sends heartbeat
packets to all nodes. You can define the timeout for failure detection with the following variables:
The multimaster.heartbeat_send_timeout
variable defines the time interval between the
heartbeats. By default, this variable is set to 200ms.
The multimaster.heartbeat_recv_timeout
variable sets the timeout for the response. If no heartbeats are
received during this time, the node is assumed to be
disconnected and is excluded from the cluster. By default,
this variable is set to 2000ms.
It is recommended to set
multimaster.heartbeat_send_timeout based on
typical ping latencies between the nodes. A small recv/send ratio
decreases failure detection time, but increases the
probability of false-positive failure detection. When setting
this parameter, take into account the typical packet loss ratio
between your cluster nodes.
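For illustration only (suitable values depend on your network), a cluster with inter-node round-trip times of a few milliseconds could stay close to the defaults, keeping a recv/send ratio of about 10:

```
multimaster.heartbeat_send_timeout = 200    # heartbeat interval, ms
multimaster.heartbeat_recv_timeout = 2000   # ~10 missed heartbeats before exclusion
```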
By default, multimaster uses a majority-based
algorithm to determine whether the cluster nodes have a quorum: a cluster
can only continue working if the majority of its nodes are alive and can
access each other. For clusters with an even number of nodes, this
approach is not optimal. For example, if a network failure splits the
cluster into equal parts, or one of the nodes fails in a two-node
cluster, all the nodes stop accepting queries, even though at least
half of the cluster nodes are running normally.
To enable a smooth failover for such cases, you can set up a standalone referee node to assign the quorum status to a subset of nodes that constitutes half of the cluster.
A referee is a voting node used to determine which subset of nodes has a quorum if the cluster is split into equal parts. The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.
To avoid split-brain problems, you can only have a single referee in your cluster.
To set up a referee for your cluster:
Install Postgres Pro Enterprise on the node you are
going to make a referee and create the referee
extension:
CREATE EXTENSION referee;
Make sure the pg_hba.conf file allows
access to the referee node.
Set up the nodes that will hold cluster data following the instructions in Section F.30.3.1. For best experience, it is recommended to set up two data nodes.
On all data nodes, specify the referee connection string
in the postgresql.conf file:
multimaster.referee_connstring = connstring
where connstring holds libpq options
required to access the referee.
The first subset of nodes that gets connected to the referee wins the voting and starts working. The other nodes have to go through the recovery process to catch up with them and join the cluster. Under heavy load, the recovery can take unpredictably long, so it is recommended to wait for all data nodes going online before switching on the load when setting up a new cluster. Once all the nodes get online, the referee discards the voting result, and all data nodes start operating together.
In case of any failure, the voting mechanism is triggered again.
At this time, all nodes appear to be offline for a short period
of time to allow the referee to choose a new winner, so you can
see the following error message when trying to access the cluster:
[multimaster] node is not online: current status is "disabled"
multimaster provides several functions to check the
current cluster state.
To check node-specific information, use mtm.status():
SELECT * FROM mtm.status();
To get the list of all nodes in the cluster together with their status,
use mtm.nodes():
SELECT * FROM mtm.nodes();
For details on all the returned information, see Section F.30.5.2.
If a cluster node is disabled, any attempt to read or write data on
this node raises an error by default. If you need to access the data
on a disabled node, you can override this behavior at connection time by setting the
application_name parameter to
mtm_admin. In this case, you can run read and
write queries on this node without multimaster
supervision.
With the multimaster extension, you can add or drop cluster nodes without
stopping the database service. When adding a new node, you need to load all the data to this node using
pg_basebackup from any cluster node, and then start this node.
Suppose we have a working cluster of three nodes, with
node1, node2, and
node3 host names. To add
node4, follow these steps:
Figure out the required connection string to
access the new node. For example, for the database
mydb, user mtmuser, and
the new node node4, the connection string
can be "dbname=mydb user=mtmuser host=node4".
In psql connected to any alive node, run:
SELECT mtm.add_node('dbname=mydb user=mtmuser host=node4');
This command changes the cluster configuration on all nodes
and creates replication slots for the new node. It also returns
node_id of the new node, which will be required
to complete the setup.
Go to the new node and clone all the data from one of the alive nodes to this node:
pg_basebackup -D datadir -h node1 -U mtmuser -c fast -v
pg_basebackup copies the entire data
directory from node1, together with
configuration settings, and prints the last LSN replayed from WAL,
such as '0/12D357F0'.
This value will be required to complete the setup.
Start Postgres Pro on the new node.
In psql connected to the node used to take the base backup, run:
SELECT mtm.join_node(4, '0/12D357F0');
where 4 is the node_id returned
by the mtm.add_node() function call and '0/12D357F0'
is the LSN value returned by pg_basebackup.
To remove a node from the cluster:
Run the mtm.nodes() function to learn the ID of the node to be removed:
SELECT * FROM mtm.nodes();
Run the mtm.drop_node() function with
this node ID as a parameter:
SELECT mtm.drop_node(3);
This will delete replication slots for node 3 on all cluster nodes and stop replication to this node.
If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.30.4.3.
If you would like to continue using the node that has been removed
from the cluster in the standalone mode, you have to drop the
multimaster extension on this node and
clean up all multimaster-related subscriptions
and uncommitted transactions to ensure that the node is no longer
associated with the cluster.
Remove multimaster from
shared_preload_libraries and restart
Postgres Pro Enterprise.
Delete the multimaster extension and publication:
DROP EXTENSION multimaster;
DROP PUBLICATION multimaster;
Review the list of existing subscriptions using the
\dRs command and delete each subscription
that starts with the mtm_sub_ prefix:
\dRs
DROP SUBSCRIPTION mtm_sub_subscription_name;
Review the list of existing replication slots and delete
each slot that starts with the mtm_ prefix:
SELECT * FROM pg_replication_slots;
SELECT pg_drop_replication_slot('mtm_slot_name');
Review the list of existing replication origins and delete
each origin that starts with the mtm_ prefix:
SELECT * FROM pg_replication_origin;
SELECT pg_replication_origin_drop('mtm_origin_name');
Review the list of prepared transactions left, if any:
SELECT * FROM pg_prepared_xacts;
You have to commit or abort these transactions by running
COMMIT PREPARED transaction_id or
ROLLBACK PREPARED transaction_id, respectively.
Once all these steps are complete, you can start using the node in the standalone mode, if required.
You can check that the data is the same on all cluster nodes using the mtm.check_query(query_text) function.
As a parameter, this function takes the text of a query you would like to
run for data comparison. When you call this function, it takes a consistent
snapshot of data on each cluster node and runs this query against the
captured snapshots. The query results are compared between pairs of nodes.
If there are no differences, this function returns true.
Otherwise, it reports the first detected difference in a warning
and returns false.
To avoid false-positive results, always use the ORDER BY
clause in your test query. For example, suppose you would like to check
that the data in a my_table is the same on all
cluster nodes. Compare the results of the following queries:
postgres=# SELECT mtm.check_query('SELECT * FROM my_table ORDER BY id');
check_query
-------------
t
(1 row)
postgres=# SELECT mtm.check_query('SELECT * FROM my_table');
WARNING: mismatch in column 'b' of row 0: 256 on node0, 255 on node1
check_query
-------------
f
(1 row)
Even though the data is the same, the second query reports an issue because the order of the returned data differs between cluster nodes.
multimaster.heartbeat_recv_timeout
Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.
Default: 2000 ms
multimaster.heartbeat_send_timeout
Time interval between heartbeat messages, in milliseconds. A multimaster node broadcasts heartbeat messages to all peers to detect connection problems.
Default: 200 ms
multimaster.max_workers
The maximum number of walreceiver workers per peer node.
This parameter should be used with caution. If the number of simultaneous transactions in the whole cluster is bigger than the provided value, it can lead to undetected deadlocks.
Default: 100
multimaster.monotonic_sequences
Defines the sequence generation mode for unique identifiers. This variable can take the following values:
false (default) —
ID generation on each node is started with this node number and is incremented by
the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto
the first node, while 2, 5, and 8 IDs are reserved for the second node. If you
change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly.
true —
the generated sequence increases monotonically cluster-wide.
ID generation on each node is started with this node number and is incremented by
the number of nodes, but the values are omitted if they are smaller than the already generated IDs on another node.
For example, in a three-node cluster, if 1, 4 and 7 IDs are already allocated to the objects on
the first node, 2 and 5 IDs will be omitted on the second node. In this case, the first ID on the second node is 8.
Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.
Default: false
multimaster.referee_connstring
Connection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.
multimaster.remote_functions
Provides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.
multimaster.trans_spill_threshold
The maximum size of a transaction, in kB. When this threshold is reached, the transaction is written to disk.
Default: 100MB
multimaster.break_connection
Defines whether to break connections with clients connected to the node when this node is disconnected
from the cluster. If this variable is set to false, the client stays
connected to the node but receives an error that the node is disabled.
Default: false
mtm.init_cluster(my_conninfo text,
peers_conninfo text[])
Initializes cluster configuration on all nodes. It connects the
current node to all nodes listed in peers_conninfo
and creates the multimaster extension,
replication slots, and replication origins on each node. Run this function
once all the nodes are running and can accept connections.
Arguments:
my_conninfo — connection string to the
node on which you are running this function. Peer nodes use this
string to connect back to this node.
peers_conninfo — an array of connection
strings to all the other nodes to be added to the cluster.
mtm.add_node(connstr text)
Adds a new node to the cluster. This function should be called
before loading data to this node using pg_basebackup.
mtm.add_node creates the required replication slots for a new node,
so you can add a node while the cluster is under load.
Arguments:
connstr — connection string for the
new node. For example, for the database
mydb, user mtmuser,
and the new node node4, the connection
string is
"dbname=mydb user=mtmuser host=node4".
mtm.join_node(node_id int, backup_end_lsn pg_lsn)
Completes the cluster setup after adding a new node. This function should be called after the added node has been started.
Arguments:
node_id — ID of the node to add to the cluster.
It corresponds to the value in the id column returned by mtm.nodes().
backup_end_lsn — the last LSN of the base backup
copied to the new node. This LSN will be used as the starting point for data
replication once the node joins the cluster.
mtm.drop_node(node_id integer)
Excludes a node from the cluster.
If you would like to continue using this node outside of
the cluster in the standalone mode, you have to uninstall the
multimaster extension from this node,
as explained in Section F.30.4.5.
Arguments:
node_id — ID of the node being dropped.
It corresponds to the value in the id column returned by mtm.nodes().
mtm.alter_sequences()
Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.
mtm.status()
Shows the status of the multimaster extension on the current node. Returns a tuple of the following values:
node_id, integer — ID of this node.
status, text — status of the node. Possible values are: online, recovery, recovered, disabled.
n_nodes, integer — number of nodes in the cluster. The majority of alive nodes is calculated based on this parameter.
n_connected, integer — number of connected nodes.
n_enabled, integer — number of enabled nodes.
mtm.nodes()
Shows the information on all nodes in the cluster. Returns a tuple of the following values:
id, integer — node ID.
enabled, boolean — shows whether the node is fully recovered and is operating normally.
The node can only be disabled if responses to heartbeats are not received within the heartbeat_recv_timeout time interval.
When the node starts responding to heartbeats, multimaster can automatically restore the node and switch it back to the enabled state.
connected, boolean — shows whether the node is connected to our node.
sender_pid, integer — WAL sender process ID.
receiver_pid, integer — WAL receiver process ID.
receiver_status, text — status of the node.
Possible values are: recovery, recovered.
conninfo, text — connection string to this node.
mtm.make_table_local(relation regclass)
Stops replication for the specified table.
Arguments:
relation — the table you would like to
exclude from the replication scheme.
mtm.check_query(query_text text)
Checks data consistency across cluster nodes. This function takes a
snapshot of the current state of each node, runs the specified query
against these snapshots, and compares the results. If the results are
different between any two nodes, displays a warning with the first
found issue and returns false.
Otherwise, returns true.
Arguments:
query_text — the query you would like to
run on all nodes for data comparison. To avoid false-positive results,
always use the ORDER BY clause in the test query.
mtm.get_snapshots()
Takes a snapshot of data on each cluster node and returns the
snapshot ID. The snapshots remain available until the
mtm.free_snapshots() function is called, or the current
session is terminated. This function is used by
mtm.check_query(query_text); there is no need to call
it manually.
mtm.free_snapshots()
Removes data snapshots taken by the
mtm.get_snapshots() function.
This function is used by mtm.check_query(query_text);
there is no need to call it manually.
By default, any DDL statement is executed on all cluster nodes, except the following statements that can only act locally on a given node:
ALTER SYSTEM
CREATE DATABASE
DROP DATABASE
REINDEX
CHECKPOINT
CLUSTER
LOAD
LISTEN
NOTIFY
Postgres Professional, Moscow, Russia.
The replication mechanism is based on logical decoding and an
earlier version of the pglogical extension
provided to the community by the 2ndQuadrant team.
The three-phase E3PC commit protocol is described in:
Idit Keidar, Danny Dolev. Increasing the Resilience of Distributed and Replicated Database Systems.
The parallel replication and recovery mechanism is similar to the one described in:
Odorico M. Mendizabal, et al. Checkpointing in Parallel State-Machine Replication.