With built-in high-availability capabilities, Postgres Pro allows creating a cluster with one leader node and several follower nodes. The leader is the primary server of the BiHA cluster, while followers are its replicas.
The bihactl utility is used to initialize the cluster and create the leader, add followers, convert existing nodes into the leader or a follower of the BiHA cluster, as well as check the cluster node status. The leader is available for read and write transactions, while followers are read-only and replicate data from the leader in synchronous or asynchronous mode.
Physical streaming replication implemented in BiHA ensures high availability by providing protection against server failures and data storage system failures. During physical replication, WAL files of the leader node are sent, synchronously or asynchronously, to the follower node and applied there. In case of synchronous replication, each commit waits for confirmation from the follower that the transaction has been committed. The follower in the BiHA cluster can be used to:
Perform read transactions in the database.
Prepare reports.
Create in-memory tables open for write transactions.
Prepare a follower node backup.
Restore bad blocks of data on the leader node by receiving them from the follower node.
Check corrupt records in WAL files.
Physical streaming replication implemented in BiHA provides protection against several types of failures:
Leader node failure. In this case, a follower node is promoted and becomes the new leader of the cluster. The promotion can be done both manually using the biha.set_leader function or automatically by means of elections.
Follower node failure. If a follower node uses asynchronous replication, the failure does not affect the leader node in any way. If a follower node uses synchronous replication, the failure causes transactions on the leader node to hang: the leader stops receiving commit confirmations from the follower, so transactions cannot complete. For details on how to set up synchronous replication in the BiHA cluster, see Replication Configuration.
Network failure between the leader node and follower nodes. In this case, the leader node cannot send, and follower nodes cannot receive, any data. Note that you must not allow write transactions on follower nodes while users are connected to the leader node: any changes made on follower nodes will not be propagated to the leader node. To avoid this, configure your network with redundant channels. It is best to provide each follower with its own communication channel to avoid single-point-of-failure issues.
In case of an emergency, such as an operating system or hardware failure, you can reinstall Postgres Pro and remove the biha extension from shared_preload_libraries to get back to work as soon as possible.
For proper operation, BiHA sets some Postgres Pro configuration parameters and creates a number of auxiliary objects:
The bihactl utility adds biha to the shared_preload_libraries variable of the postgresql.conf file and, if applicable, of the postgresql.auto.conf file:

shared_preload_libraries = 'biha'

This parameter is required for operation of the BiHA cluster. If shared_preload_libraries already contains other preloaded libraries, biha is added to the end of the list.
The bihactl utility creates the following files:
pg_hba.biha.conf is added to the pg_hba.conf file by means of the include directive. The pg_hba.biha.conf file contains authentication rules for the biha_replication_user role on the BiHA cluster nodes:

host postgres    biha_replication_user all scram-sha-256
host biha_db     biha_replication_user all scram-sha-256
host replication biha_replication_user all scram-sha-256

The default authentication method is scram-sha-256.
However, if the password_encryption parameter has been
already set in postgresql.conf,
BiHA uses the existing value. If you use
SSL for user authentication,
the method changes to cert.
postgresql.biha.conf is added to the
postgresql.conf file
by means of the include directive.
The biha_db database, the biha
extension, and a number of BiHA-specific roles are created.
For more information, see Roles.
Replication slots with names in the biha_node_id format are created. These slots are managed automatically; there is no need to modify or delete them manually.
In the postgresql.biha.conf file, bihactl
sets the following Postgres Pro configuration parameters:
hot_standby is set to on (the default).
It is not recommended to modify this parameter.
wal_level is set to replica (the default).
If the value has already been set to logical,
BiHA uses the existing value.
It is not recommended to modify this parameter.
max_wal_senders is set based on the number of WAL senders required for proper operation of BiHA, which depends on the quorum set in biha.nquorum. If the biha.nquorum value is 3 or less, the max_wal_senders value is 10. Otherwise, the value is calculated as biha.nquorum * 2 + 3. Consider this when modifying the max_wal_senders value.
For more information about decreasing max_wal_senders
and some other Postgres Pro configuration parameters,
see Section 26.1.1.1.
max_replication_slots is set to max_wal_senders + 1. The minimum value is 11. Consider this when modifying the max_replication_slots value.
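The sizing rules for these two parameters can be sketched as follows (an illustrative Python rendering of the rules stated above; the function names are not part of BiHA):

```python
def max_wal_senders_for(nquorum: int) -> int:
    # biha.nquorum of 3 or less -> max_wal_senders = 10;
    # otherwise max_wal_senders = biha.nquorum * 2 + 3.
    return 10 if nquorum <= 3 else nquorum * 2 + 3

def max_replication_slots_for(nquorum: int) -> int:
    # max_replication_slots = max_wal_senders + 1, with a minimum of 11.
    return max(max_wal_senders_for(nquorum) + 1, 11)

print(max_wal_senders_for(3), max_replication_slots_for(3))  # 10 11
print(max_wal_senders_for(5), max_replication_slots_for(5))  # 13 14
```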
max_slot_wal_keep_size is set to 5GB.
If the value has already been set, BiHA
uses the existing value. You can modify the value if required.
Unlike standard primary-standby configuration, the BiHA cluster stores WAL files on all nodes to ensure that a lagging node can catch up. To achieve this, each node uses replication slots, identifies the node that is lagging the most, and retains as many WAL files as the lagging node might require.
When setting up the BiHA cluster, ensure that you select the optimal value for this parameter to avoid the following issues:
If the number of required WAL files is higher than the max_slot_wal_keep_size value, the old WAL files are deleted. As a result, the lagging node cannot receive the required data, changes its state to NODE_ERROR, and stops data replication.
If the max_slot_wal_keep_size value is set
to -1 (which means that WAL files
are never deleted) or if it exceeds the available disk size, this may
lead to disk storage overflow.
If the max_slot_wal_keep_size value is too small, there may not be enough space to keep WAL files required for the lagging node to catch up and continue operation.
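The three failure modes above can be summarized in a small decision sketch (illustrative Python, all sizes in bytes; this is not BiHA code, only a model of the trade-offs described in this section):

```python
GB = 1024 ** 3

def retention_outcome(follower_lag: int, keep_limit: int, free_disk: int) -> str:
    """Model of the max_slot_wal_keep_size trade-offs described above."""
    if keep_limit == -1:
        # WAL is never removed for the slot: safe for the lagging node,
        # but the disk can overflow if the lag keeps growing.
        return "disk overflow risk" if follower_lag > free_disk else "ok"
    if follower_lag > keep_limit:
        # Required WAL segments were removed: the lagging node
        # enters NODE_ERROR and stops replication.
        return "NODE_ERROR"
    return "ok"

print(retention_outcome(6 * GB, 5 * GB, 100 * GB))   # NODE_ERROR
print(retention_outcome(2 * GB, 5 * GB, 100 * GB))   # ok
print(retention_outcome(200 * GB, -1, 100 * GB))     # disk overflow risk
```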
wal_keep_size is set to 1GB.
If the value has already been set, BiHA
uses the existing value. The wal_keep_size
configuration parameter helps to keep WAL files for a potential run of
pg_rewind. You can modify the value depending on
the workload. The more WAL files are generated, the higher the
wal_keep_size value must be set.
application_name is set in the biha_node_id format. It is not recommended to modify this parameter.
listen_addresses is set to *.
It is not recommended to modify this parameter.
port is set to the default Postgres Pro value. If the default port has been changed, BiHA uses the existing value. It is not recommended to modify this parameter.
primary_conninfo, primary_slot_name, synchronous_standby_names are modified and managed by BiHA only.
When biha is loaded and configured, you cannot modify these parameters using ALTER SYSTEM.
These parameters are stored in the pg_biha/biha.conf
file, as well as in the shared memory of the
biha process.
When these parameters are modified, biha sends the SIGHUP signal so that other processes are informed about the changes. If you modify any other parameters at the same time and do not send a signal to reread the configuration, the parameters that you have changed may be unexpectedly reread.
Postgres Pro behaves as described above only
when biha is loaded and configured, i.e.
when the extension is present in the shared_preload_libraries
variable and the required biha.*
parameters are configured. Otherwise, Postgres Pro
operates normally.
During operation, BiHA creates the following service files in the database directory:
standby.signal is used to start nodes in standby mode. It is required to make the node read-only at Postgres Pro startup. This file is deleted from the leader node when its state changes to LEADER_RW.
biha.state and biha.conf
are files in the pg_biha directory required to save the
internal state and configuration of biha.
In a BiHA cluster, some Postgres Pro configuration parameter values can only be successfully decreased using the procedure described in this section.
Use this procedure if you need to decrease any of the following parameters:
Be careful when modifying these configuration parameters and ensure their values are the same on all BiHA cluster nodes. Otherwise, the leader change may lead to unexpected cluster behavior.
You can decrease the above-mentioned configuration parameters as follows:
On the leader, decrease the value of the required configuration parameter.
(Optional) To avoid elections, increase the biha.nquorum value to the total number of cluster nodes using the biha.set_nquorum function.
Stop and start the leader using pg_ctl.
During startup, the leader verifies that other nodes are operational and continue to recognize its leader role. If all nodes operate correctly, the leader applies the modified value and starts successfully. Otherwise, the leader shuts down and writes a corresponding message to the log.
Ensure that all nodes receive information that the configuration parameter has been modified on the leader.
You can use the pg_controldata utility or, alternatively, pause after the leader restarts so that the other nodes have enough time to receive the updates.
On other nodes, decrease the configuration parameter to the value you have just set on the leader.
Stop and start the nodes using pg_ctl.
There are several variants of cluster configuration.
Three or more nodes where one node is the leader and the rest are followers.
Below are possible scenarios for the cases of the leader failure or network connection interruption:
When the current leader is down, the new leader is elected automatically. To become the leader, a follower must receive the highest number of votes, and the number of votes must be greater than or equal to the value configured in biha.nquorum.
In case of network connection interruptions inside the BiHA cluster, the cluster may split into several groups of nodes. In this case, a new leader node is elected in every group where the number of nodes is greater than or equal to the biha.nquorum value. After the connection is restored, the new leader is chosen between the old one and the newly elected one depending on the term value: the node with the highest term becomes the new leader. It is recommended to set the biha.minnodes value equal to the biha.nquorum value.
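The split behavior described above can be modeled as follows (an illustrative Python sketch; the group sizes are hypothetical, and this is not BiHA code):

```python
def groups_that_elect(group_sizes: list[int], nquorum: int) -> list[int]:
    # After a network split, a new leader is elected in every group whose
    # node count is greater than or equal to biha.nquorum.
    return [i for i, size in enumerate(group_sizes) if size >= nquorum]

# A 5-node cluster with biha.nquorum = 3 splits into groups of 3 and 2:
print(groups_that_elect([3, 2], 3))  # [0] - only the 3-node group elects a leader

# With biha.nquorum = 2, both halves of a 2/2 split could elect a leader,
# which is why the term comparison on reconnect matters:
print(groups_that_elect([2, 2], 2))  # [0, 1]
```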
Two-node cluster consisting of the leader and the follower.
Using two-node clusters is not recommended as such configurations can cause split-brain issues. To avoid such issues, you can add a referee node.
Below are possible scenarios for the leader or network failures:
When the leader is down, the follower node becomes the new leader
automatically if the biha.nquorum
configuration parameter is set to 1.
When network interruptions occur between the leader and the follower,
and both the biha.nquorum and
biha.minnodes
configuration parameters are set to 1,
the cluster may split into two leaders available for reads and writes.
The referee node helps avoid such issues.
Single-node configuration consisting of the leader only. This is a possible variant that can be used while follower nodes are being configured. Naturally, the node cannot be replaced if it goes down, since there are no follower nodes that can become the leader.
Three-node cluster consisting of the leader, the follower, and the referee. The referee is a node used for voting in elections of the new leader, but it cannot become the leader. In case of faults, the cluster with the referee behaves the same way as the three-node cluster (the leader and two followers). To learn more about the referee, see The Referee Node in the BiHA Cluster.
Cascading BiHA cluster consisting of the leader and two followers where Follower 1 replicates data from the leader and Follower 2 replicates data from Follower 1. Using cascading replication, you can deploy your BiHA cluster in different data centers.
Multi-level geo-distributed and disaster-resilient BiHA (GDBiHA) cluster. The GDBiHA cluster consists of two or more segments — logical nodes that unite one or more physical BiHA cluster nodes located in one data center. For more information, see Section 26.1.5.3.
You can set the leader or front follower manually with the biha.set_leader function.
The recommended value of biha.nquorum is greater than or equal to half of the cluster nodes. When you add or remove nodes from your cluster, always revise the biha.nquorum value to reflect the changed number of nodes.
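One way to read this recommendation (an illustrative Python sketch, not BiHA code) is to take the smallest integer that is at least half of the node count:

```python
import math

def recommended_nquorum(total_nodes: int) -> int:
    # Recommended biha.nquorum: greater than or equal to half of the
    # cluster nodes; the smallest such integer is ceil(n / 2).
    return math.ceil(total_nodes / 2)

print(recommended_nquorum(3))  # 2
print(recommended_nquorum(5))  # 3
```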
Elections are a process conducted by the follower nodes to determine a
new leader node when the current leader is down.
As a result of the elections, the follower node with the most records
in the WAL becomes the cluster leader. To be elected, a node must have
the biha.can_be_leader and
biha.can_vote parameters set to true.
Elections are held based on the cluster quorum,
which is the minimum number of nodes that participate in the leader
election.
The quorum value is set in the biha.nquorum
parameter when initializing the cluster with the
bihactl cluster init command.
Nodes with the biha.can_vote parameter set to false are excluded from voting and are not counted toward biha.nquorum. Nodes in the NODE_ERROR state are unable to vote.
For elections to begin, the followers must miss the maximum number of heartbeats from the leader, which is set in biha.heartbeat_max_lost. At this point one of the followers proposes itself as a leader CANDIDATE, and elections begin. Similarly, if the leader does not receive the set number of heartbeats from a follower, the follower's state changes to UNKNOWN for the leader.
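The delay before elections begin can be estimated from these settings (a rough illustrative estimate; the values used below match the heartbeat_send_period and heartbeat_max_lost defaults shown in the biha.config output later in this chapter):

```python
def election_trigger_delay_ms(heartbeat_send_period_ms: int,
                              heartbeat_max_lost: int) -> int:
    # Followers start elections only after missing this many
    # consecutive heartbeats from the leader.
    return heartbeat_send_period_ms * heartbeat_max_lost

# With heartbeat_send_period = 1000 ms and heartbeat_max_lost = 10,
# roughly 10 seconds pass before elections begin.
print(election_trigger_delay_ms(1000, 10))  # 10000
```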
In a synchronous cluster, you can use the
biha.priority
parameter to prioritize the nodes.
If your cluster has only two nodes and you want to avoid potential
split-brain issues in case of elections, you can set up a
referee node that participates in the elections in
the same way as followers.
To learn more, see The Referee Node in the BiHA Cluster.
For example, if you have a cluster with three nodes where biha.nquorum is set to 2 and one follower node is down, the cluster leader continues to operate. If the leader is down in such a cluster, the two remaining followers start elections. After the new leader node is elected, the node generation specified in the term is incremented for the nodes that participated in the elections. More specifically, the new leader and the remaining follower have term=2, while for the old leader the value is left as term=1. Therefore, when the old leader is back in the cluster, it goes through demotion, i.e. turns into a follower.
After the new leader is set, followers of the cluster start receiving WAL files from this new cluster leader. Note that once the new leader is elected, the old leader is demoted and is not available for write transactions to avoid split-brain issues. You can promote the old leader manually using the biha.set_leader function. Both the cluster quorum and the term concepts are implemented in BiHA based on the Raft consensus algorithm.
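The term-based demotion described above can be sketched as follows (illustrative Python; the term values are taken from the three-node example, and this is not BiHA code):

```python
def surviving_leader(old_leader_term: int, new_leader_term: int) -> str:
    # The node with the higher term keeps (or takes) the leader role;
    # the other former leader is demoted to a follower.
    return "new leader" if new_leader_term > old_leader_term else "old leader"

# In the three-node example: the newly elected leader is at term=2,
# the returning old leader at term=1, so the old leader is demoted.
print(surviving_leader(1, 2))  # new leader
```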
The biha extension allows you to set up the
referee node that participates in elections
and helps to avoid potential split-brain issues if your cluster has only two nodes, i.e. the
leader and one follower. In this case, use the referee node and set
both biha.nquorum and
biha.minnodes configuration
parameters to 2.
By default, the postgres database and user data are not
present on the referee node. For more information, see
The postgres Database on the Referee.
The referee node requires much less disk space, CPU, and RAM than regular nodes. However, the referee is created via partial copy of the leader and, as a result, inherits its configuration parameters. To avoid unnecessary resource consumption, before you start the referee for the first time, ensure its configuration parameters are aligned with the actual hardware resources of the server the referee is running on. For example, set shared_buffers to the default 128 MB.
The biha extension provides the following referee operation modes:
The referee mode. In this mode, the node only takes
part in elections of the leader and does not participate in data
replication, and no replication slots are created on the leader and
follower nodes for the referee.
The referee_with_wal mode. In this case, the node
participates both in the leader elections, in the same way as in the
referee mode, and data replication and receives the
entire WAL from the leader node. If the referee node has the most WAL
records in the cluster when the elections begin, i.e. has the greatest
LSN, the follower node tries to get missing WAL files from the referee.
This process is also important for the referee node to avoid entering the NODE_ERROR state, which may happen if WALs diverge. For the referee_with_wal mode, apply lag is NULL and apply ptr cannot be monitored, as the referee does not apply user data.
Regardless of the mode set for the referee, it sends and receives heartbeats
over the control channel, including using SSL, participates in the elections
in the same way as follower nodes, supports
cluster monitoring functions,
and must be taken into account when setting the
biha.minnodes configuration
parameter. Note that the referee is the final state of the node and it cannot
be switched to the leader node using the biha.set_leader
function, nor can it become the follower node. If for some reason the follower
does not “see” the leader but the referee does, the referee does
not allow the follower to become the leader. If the leader node with greater
term connects to the referee node, the referee demotes
the leader with lower term and makes it the follower.
The postgres Database on the Referee
When adding the referee node,
the pg_basebackup utility makes a partial
backup of the leader. This means that, by default, only the biha_db database and system tables are copied to the referee node from the leader, while the postgres database and user data are not copied. This is intentional, to decrease resource consumption.
However, some utilities and monitoring systems connect to nodes
via the postgres database. If you need
the postgres database to be present on the referee node,
you can specify the --referee-with-postgres-db option
when adding a node in referee or referee_with_wal
modes. This option copies the postgres database with all
the objects to the referee node. For referee_with_wal,
WAL records related to the postgres database are also applied,
meaning that all new objects created in the postgres database
are also created on the referee in the referee_with_wal mode.
Note that this refers to the postgres database created
during instance initialization. If you delete the postgres
database from the leader, it is also deleted on the referee, and you cannot
recreate it.
BiHA provides the following geographical distribution and disaster resilience features that allow deploying BiHA clusters in different geographic locations to ensure availability during regional failures:
The basic BiHA functionality allows locating cluster nodes in different redundant data centers. By default, in such clusters, the leader is the replication source for all followers. However, you can configure cascading replication to reduce workloads on the leader.
For example, you can distribute nodes of your three-node BiHA cluster across three different data centers where each follower can quickly take over if the leader fails:
Figure 26.1. Basic Geographical Distribution across Three Data Centers
You can also locate the leader and two followers of your BiHA cluster in
Data Center 1 while keeping another additional Follower 3
in a geo-redundant Data Center 2. Follower 3
only operates as a standby and cannot participate in elections or become the
leader:
Figure 26.2. Basic Geographical Distribution with a Geo-Redundant Follower
Cascading replication allows decreasing network workloads in clusters distributed across different data centers, as well as decreasing workloads on the leader.
The BiHA solution provides a set of parameters, such as biha.max_replicas and biha.preferred_roles, designed to automatically configure cascading replication. Once you set these parameters, each node of your cluster independently selects its replication source. In case of cluster topology updates, for example, change of the leader or number of nodes, cascading replication is established automatically.
For more information on how to configure cascading replication in your BiHA cluster, refer to Configuring Cascading Replication.
The following diagram shows an example of a cascading BiHA cluster consisting of five nodes distributed across two data centers:
Figure 26.3. Replication in a Cascading BiHA Cluster
The BiHA cluster on the diagram above operates as follows:
Leader and two followers are located in Data Center 1, which is the primary data center.
Follower 1 and Follower 2 replicate
data directly from Leader.
All three nodes have the same values of biha.preferred_roles
and biha.max_replicas configuration parameters.
biha.preferred_roles = LF means that these nodes
would always prefer the leader as their replication source.
biha.max_replicas = 2 means
that these nodes can replicate data to no more than two followers at once.
Follower 3 and Follower 4 are located in Data Center 2, which is the standby data center. Follower 3 replicates data from Follower 2 and is the replication source for Follower 4.
Both nodes have the same values of biha.preferred_roles
and biha.max_replicas configuration parameters.
biha.preferred_roles = FL means that these nodes
would always prefer a follower as their replication source.
biha.max_replicas = 1 means
that these nodes can replicate data to no more than one follower at once.
Additionally, their biha.can_vote
and biha.can_be_leader configuration
parameters are set to false. This is to ensure
that nodes located in the standby data center cannot participate in
elections as either voters or candidates.
Assume that Leader fails:
Figure 26.4. Replication in a Cascading BiHA Cluster in Case of Leader Failure
Cascading replication automatically reestablishes as follows:
Follower 1 is elected as the new Leader.
Follower 2 selects the new Leader
as its replication source.
Leader now replicates data only to Follower 2.
Follower 3 and Follower 4 continue
replicating as before, because their replication sources remain the same.
The BiHA solution allows creating a multi-level geographically distributed and disaster-resilient (GDBiHA) cluster to provide efficient 24/7 operation at heavy workload. The GDBiHA cluster is divided into segments — logical nodes that unite one or more physical BiHA cluster nodes. Segments are located in different geographically redundant data centers and have their own election systems.
The following diagram shows the typical structure of the GDBiHA cluster distributed across two data centers:
Figure 26.5. Multi-Level Geo-Distributed and Disaster-Resilient BiHA Cluster Diagram
The GDBiHA cluster on the diagram above consists of the following components:
GDBiHA Cluster is a logical root node with ID 1111 that unites two segments. The root node is the parent node for the segments. This node has no parent of its own and cannot be deleted.
Leader Segment is a logical node that unites physical nodes located in Data Center 1. This segment is created by default when you initialize the cluster and always has ID 111. It owns the leader role because it contains the leader of the GDBiHA cluster (Leader Node) and supports write operations. If Leader Segment fails, you must perform a manual switchover to Follower Segment using biha.set_leader.
Follower Segment is a logical node that unites physical nodes located in Data Center 2. This segment is created manually and has a user-defined ID, in this case 222. This segment does not support write operations.
Leader Node is the physical leader
node of the GDBiHA Cluster. If Leader Node
fails, the new leader is elected from
the Leader Segment nodes by means of the standard
BiHA elections procedure.
Front Follower Node 3 is a physical follower
node that fulfills leader functions in Follower Segment.
If Front Follower Node 3 fails, the new
front follower is elected from the Follower Segment
nodes by means of the standard BiHA elections procedure.
Follower Node 1, Follower Node 2,
Follower Node 4, and Follower Node 5
are physical follower nodes of the GDBiHA cluster.
Nodes in the GDBiHA cluster operate at different levels. Physical nodes (IDs 1-6) operate at level 1 (the physical level). Segments (IDs 111 and 222) operate at level 2 (the segment level). The cluster (ID 1111) operates at level 3 (the cluster level). To view the level at which each node operates, as well as other configuration details of all nodes, call the biha.config function:
id | term | nquorum | minnodes | heartbeat_send_period | heartbeat_max_lost | no_wal_on_follower | sync_standbys_min | priority | can_be_leader | can_vote | mode | proxima_status | name | repl_pref_roles | max_replicas | config_version | level | parent_id
------+------+---------+----------+-----------------------+--------------------+--------------------+-------------------+----------+---------------+----------+---------+----------------+----------------+-----------------+--------------+----------------+-------+-----------
1 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_1 | L | 2147483647 | 17 | 1 | 111
2 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_2 | L | 2147483647 | 17 | 1 | 111
3 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_3 | L | 2147483647 | 17 | 1 | 111
4 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_4 | L | 2147483647 | 17 | 1 | 222
5 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_5 | L | 2147483647 | 17 | 1 | 222
6 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_6 | L | 2147483647 | 17 | 1 | 222
111 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_111 | L | 2147483647 | 17 | 2 | 1111
222 | 1 | 2 | 2 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_222 | L | 2147483647 | 17 | 2 | 1111
1111 | 1 | 1 | 1 | 1000 | 10 | 20000 | -2 | -1 | t | t | regular | 0 | biha_node_1111 | L | 2147483647 | 17 | 3 |
(9 rows)
You can configure nodes, segments, and the GDBiHA cluster by means of biha extension functions.
For more information about deploying the GDBiHA cluster, see Setting Up a Multi-Level Geo-Distributed and Disaster-Resilient BiHA Cluster.
By default, replication in the GDBiHA cluster is cascading and operates as follows:
Follower Node 1 and Follower Node 2
replicate data from Leader Node.
Front Follower 3 replicates data from Leader Node.
Follower Node 4 and Follower Node 5
replicate data from Front Follower 3.
For more information on how to configure cascading replication, see Configuring Cascading Replication.
The multi-level geographical distribution and disaster resilience functionality is currently experimental. It is recommended to test this functionality in a test environment before using it for your production cluster. To deploy your BiHA cluster in different data centers, you can also use cascading replication.
Automatic segment-level elections are not supported. You can change the leader segment manually using the biha.set_leader function.
Configuring multi-level geo-redundancy and disaster-resilience on the existing BiHA cluster is not supported.
The nodes of the GDBiHA cluster have no biha.minnodes and biha.nquorum configuration parameters of their own, as these values are inherited from the segment where the nodes are located. For example, to modify the biha.nquorum value for the nodes located in the segment with ID 111, execute the following function:
biha.set_nquorum(new_quorum_value, 111)
You can add or remove segments, add nodes to segments, or move nodes from one segment to another. For more information, see Section 26.3.1.