With built-in high-availability capabilities, Postgres Pro allows creating a cluster with one leader node and several follower nodes. The leader is the primary server of the BiHA cluster, while followers are its replicas.
The bihactl utility is used to initialize the cluster and create the leader, add followers, convert existing cluster nodes into the leader or the follower in the BiHA cluster as well as check the cluster node status. The leader is available for read and write transactions, while followers are read-only and replicate data from the leader in the synchronous or asynchronous mode.
Physical streaming replication implemented in BiHA ensures high availability by providing protection against server failures and data storage system failures. During physical replication, WAL files of the leader node are sent, synchronously of asynchronously, to the follower node and applied there. In case of synchronous replication, with each commit a user waits for the confirmation from the follower that the transaction is committed. The follower in the BiHA cluster can be used to:
Perform read transactions in the database.
Prepare reports.
Create in-memory tables open for write transactions.
Prepare a follower node backup.
Restore bad blocks of data on the leader node by receiving them from the follower node.
Check corrupt records in WAL files.
Physical streaming replication implemented in BiHA provides protection against several types of failures:
Leader node failure. In this case, a follower node is promoted and becomes the new leader of the cluster. The promotion can be done both manually using the biha.set_leader function or automatically by means of elections.
Follower node failure. If a follower node uses asynchronous replication, the failure by no means affects the leader node. If a follower node uses synchronous replication, this failure causes the transaction on the leader node to stop. This happens because the leader stops receiving transaction confirmations from the follower and the transaction fails to end. For details on how to set up synchronous replication in the BiHA cluster, see Section 27.3.7.
Network failure between the leader node and follower nodes. In this case, the leader node cannot send and follower nodes cannot receive any data. Note that you cannot allow write transactions on follower nodes if users are connected to the leader node. Any changes made on follower nodes will not be restored on the leader node. To avoid this, configure your network with redundant channels. It is best to provide each follower with its own communication channel to avoid single point of failure issues.
In case of an emergency, such as operating system or hardware failure,
you can reinstall Postgres Pro and remove
the biha extension from
shared_preload_libraries to go back to work as soon as
possible.
There are several variants of the cluster configuration.
Three and more nodes where one node is the leader and the rest are the followers.
Below are possible scenarios for the cases of the leader failure or network connection interruption:
When the current leader is down, the new leader is elected automatically. To become the leader, a follower must have the highest number of votes. The number of votes must be higher or equal to the value configured in biha.nquorum.
In case of network connection interruptions inside the BiHA cluster,
the cluster may split into several groups of nodes. In this case, the new leader node is
elected in all groups, where the number of nodes is higher or equal
to the biha.nquorum value. After the
connection is restored, the new leader will be chosen between the
old one and the newly elected one depending on the
term value. The node with the highest term
becomes the new leader. It is recommended to set the biha.minnodes
value equal to the biha.nquorum value.
Two-node cluster consisting of the leader and the follower.
Using two-node clusters is not recommended as such configurations can cause split-brain issues. To avoid such issues, you can add a referee node.
Below are possible scenarios for the leader or network failures:
When the leader is down, the follower node becomes the new leader
automatically if the biha.nquorum
configuration parameter is set to 1.
When network interruptions occur between the leader and the follower,
and both the biha.nquorum and
biha.minnodes
configuration parameters are set to 1,
the cluster may split into two leaders available for reads and writes.
The referee node helps avoiding
such issues.
Single-node configuration consisting of the leader only. A possible variant that can be used to wait until follower nodes are configured. Logically, the node cannot be replaced once down, since there are no follower nodes that can become the leader node.
Three-node cluster consisting of the leader, the follower, and the referee. The referee is a node used for voting in elections of the new leader, but it cannot become the leader. In case of faults, the cluster with the referee behaves the same way as the three-node cluster (the leader and two followers). To learn more about the referee, see Section 27.1.3.
You can set the leader manually with the biha.set_leader function.
The recommended value of biha.nquorum is higher or equal to the half of the cluster nodes.
When you add or remove nodes from your cluster, always revise the
biha.nquorum value considering
the highest number of nodes, but not less than set in nquorum.
Elections are a process conducted by the follower nodes to determine a
new leader node when the current leader is down.
As a result of the elections, the follower node with the most records
in the WAL becomes the cluster leader. To be elected, a node must have
the biha.can_be_leader and
biha.can_vote parameters set to true.
Elections are held based on the cluster quorum,
which is the minimum number of nodes that participate in the leader
election.
The quorum value is set in the biha.nquorum
parameter when initializing the cluster with the
bihactl init command.
Nodes with the biha.can_vote parameter set to
false are excluded from voting and are ignored by
nquorum.
For the elections to begin, the followers must miss the maximum
number of heartbeats from the leader set by the
biha.set_heartbeat_max_lost function. At this point
one of the followers proposes itself as a leader CANDIDATE,
and elections begin. If the leader does not receive the set number of
heartbeats from the follower in this case, the follower state changes
to UNKNOWN for the leader.
In a synchronous cluster, you can use the
biha.node_priority
parameter to prioritize the nodes.
If your cluster has only two nodes and you want to avoid potential
split-brain issues in case of elections, you can set up a
referee node that participates in the elections in
the same way as followers.
To learn more, see Section 27.1.3.
For example, if you have a cluster with three nodes where
and one follower node is
down, the cluster leader will continue to operate. If the leader is
down in such a cluster, two remaining followers start elections. After
the new leader node is elected, the node generation specified in the
term is incremented for all cluster nodes.
More specifically, the new leader and the remaining followers have
nquorum=2, while for the old leader
the value is left as term=2.
Therefore, when the old leader is back in the cluster, it goes through
demotion, i.e. turns into a follower.
term=1
After the new leader is set, followers of the cluster start receiving WAL files from this new cluster leader. Note that once the new leader is elected, the old leader is demoted and is not available for write transactions to avoid split-brain issues. You can promote the old leader manually using the biha.set_leader function. Both the cluster quorum and the term concepts are implemented in BiHA based on the Raft consensus algorithm.
The biha extension allows you to set up the
referee node that participates in the leader elections and helps to avoid
potential split-brain issues if your cluster has only two nodes, i.e. the
leader and one follower. In this case, use the referee node and set
both biha.nquorum and
biha.minnodes configuration
parameters to 2.
When you create the referee node,
only the biha_db database and system tables are copied to
the referee node from the leader.
The postgres database and user data are not copied.
The biha extension provides two referee operation modes:
The referee mode. In this mode, the node only takes
part in elections of the leader and does not participate in data
replication, and no replication slots are created on the leader and
follower nodes for the referee.
The referee_with_wal mode. In this case, the node
participates both in the leader elections, in the same way as in the
referee mode, and data replication and receives the
entire WAL from the leader node. If the referee node has the most WAL
records in the cluster when the elections begin, i.e. has the greatest
LSN, the follower node tries to get missing WAL files from the referee.
This process is also important for the referee node to avoid entering the
NODE_ERROR state, which may be the case if WALs
diverge. For the referee_with_wal,
apply lag is NULL and
apply ptr cannot be monitored, as the referee does not apply
user data.
Regardless of the mode set for the referee, it sends and receives heartbeats
over the control channel, including using SSL, participates in the elections
in the same way as follower nodes, supports
cluster monitoring functions,
and must be taken into account when setting the
biha.minnodes configuration
parameter. Note that the referee is the final state of the node and it cannot
be switched to the leader node using the biha.set_leader
function, nor can it become the follower node. If for some reason the follower
does not “see” the leader but the referee does, the referee does
not allow the follower to become the leader. If the leader node with greater
term connects to the referee node, the referee demotes
the leader with lower term and makes it the follower.