biha is a Postgres Pro
extension managed by the bihactl utility. Together with
a set of core patches, an SQL interface, and the biha-background-worker
process, which coordinates the cluster nodes, biha
turns Postgres Pro into a cluster with physical
replication and built-in failover, high availability, and automatic node
failure recovery.
Compared to existing cluster solutions, such as a standard PostgreSQL master-standby cluster or a cluster configured with multimaster, the biha cluster offers the following benefits:
Physical replication
Built-in failover
Dedicated leader node available for read and write transactions and read-only follower nodes
Synchronous and asynchronous node replication
Built-in capabilities for automatic node failure detection, response, and subsequent cluster reconfiguration, i.e. election of a new cluster leader and demotion of the old leader to a read-only follower
No additional external cluster software required
With built-in high-availability capabilities, Postgres Pro allows creating a cluster with one dedicated leader node and several follower nodes. The bihactl utility is used to initialize the cluster and create the leader, add followers, convert existing cluster nodes into the leader or a follower in the biha cluster, as well as check the cluster node status. The leader is available for read and write transactions, while followers are read-only and replicate data from the leader in the synchronous or asynchronous mode.
Physical streaming replication implemented in biha ensures high availability by providing protection against server failures and data storage system failures. During physical replication, WAL files of the leader node are sent, synchronously or asynchronously, to the follower node and applied there. With synchronous replication, each commit waits for confirmation from the follower that the transaction has been committed. The follower in the biha cluster can be used to:
Perform read transactions in the database
Prepare reports
Create in-memory tables open for write transactions
Prepare a follower node backup
Restore bad blocks of data on the leader node by receiving them from the follower node
Check corrupt records in WAL files
Physical streaming replication implemented in biha provides protection against several types of failures:
Leader node failure. In this case, a follower node is promoted and
becomes the new leader of the cluster. The promotion can be done both
automatically by means of elections or manually using the
biha.set_leader function. In case of
elections, the follower node with the most records in the WAL becomes
the cluster leader. The elections are held based on the cluster
quorum, which is a minimum number of nodes that
participate in the leader election. The quorum value is set in the
nquorum option when initializing the cluster with the
bihactl init command.
For example, if you have a cluster with three nodes where
nquorum=2 and one follower node is
down, the cluster leader will continue to operate. If the leader is
down in such a cluster, the two remaining followers start elections. After
the new leader node is elected, the term value
is incremented on the nodes that took part in the elections. More
specifically, the new leader and the remaining follower have
term=2, while for the old leader
the value is left as term=1.
Therefore, when the old leader is back in the cluster, it becomes a
follower. After the new leader is set, followers of the cluster start
receiving WAL files from this new cluster leader. Note that once the
new leader is elected, the old leader cannot be open for write
transactions, which avoids split-brain issues. Once the old leader node is
repaired, you should either recreate the cluster with this leader node
or synchronize it with the newly elected leader node by means of
pg_rewind. Both the cluster quorum and
the term concepts are implemented in biha
based on the Raft consensus algorithm.
Follower node failure. If a follower node uses asynchronous replication, the failure does not affect the leader node in any way. If a follower node uses synchronous replication, the failure causes transactions on the leader node to hang: the leader stops receiving transaction confirmations from the follower, so transactions cannot complete. For details on how to set up synchronous replication in the biha cluster, see Section F.8.3.5.
Network failure between the leader node and follower nodes. In this case, the leader node cannot send and follower nodes cannot receive any data. Note that you must not allow write transactions on follower nodes while users are connected to the leader node: any changes made on follower nodes will not be propagated to the leader node. To avoid this, configure your network with redundant channels. It is best to provide each follower with its own communication channel to avoid single-point-of-failure issues.
In case of an emergency, such as an operating system or hardware failure,
you can reinstall Postgres Pro and remove
the biha extension from
shared_preload_libraries to get back to work as soon as
possible.
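A minimal sketch of such a change, assuming biha is the only library listed in shared_preload_libraries and using the placeholder data directory from the examples below:
# In postgresql.conf, remove biha from shared_preload_libraries:
#   shared_preload_libraries = 'biha'   -->   shared_preload_libraries = ''
# Then restart the server:
pg_ctl restart -D leader_local_PGDATA_directory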
There are several variants of the cluster configuration.
Three or more nodes. Below are possible scenarios if the leader node or the network is down:
The new leader is elected as a result of the failover and elections. The leader is elected by a simple majority of votes.
In case of network interruptions, the cluster may split into
several groups of nodes. In this case, a new leader node is
elected in the group containing the majority of nodes. After the
connection is restored, the new leader is chosen between the
old one and the newly elected one depending on the
term value. In addition, if the
nquorum value equals the minnodes
value, the old leader becomes read-only.
Alternatively, the leader node can be set manually using the biha.set_leader function.
Two nodes with one leader node and one follower node. Below are possible scenarios if one node or the network is down:
The follower node becomes the new leader node. Network interruptions when the follower cannot “see” the leader may result in split-brain issues because two leaders will appear in the cluster.
Nothing happens in case of network interruptions. In this scenario, the follower node will not become the leader node.
Alternatively, the leader node can be set manually using the biha.set_leader function.
Single leader node. A possible variant that can be used while waiting until follower nodes are configured. Logically, the node cannot be replaced once it is down, since there are no follower nodes that could become the leader node.
The biha cluster is set up by means of the bihactl utility, and there are several ways to do so:
Add the leader node and follower nodes from scratch.
Convert an existing node to make it the leader node and add new follower nodes.
Convert an existing node to make it the leader node and convert an existing node to make it the follower node.
When the high-availability cluster is initialized, biha
modifies the Postgres Pro configuration in
postgresql.conf
and pg_hba.conf. The changes are
first included in biha service files
postgresql.biha.conf and
pg_hba.biha.conf and then processed by the server
after the following include directives are specified in
postgresql.conf
and pg_hba.conf, respectively:
include 'postgresql.biha.conf'
include 'pg_hba.biha.conf'
At the time of cluster initialization, biha configures Postgres Pro and creates its own service files inside the data directory. In the course of cluster operation, the Postgres Pro configuration and the service files are modified dynamically. The standard ALTER SYSTEM mechanism is used to this end: it modifies the postgresql.auto.conf file and rereads the configuration, similarly to calling the pg_reload_conf() SQL function. Two parameters are modified in this way, primary_conninfo and primary_slot_name, which are required for automated replication control inside the high-availability cluster. If you have modified any other parameters but have not yet reloaded the configuration, those parameters may be unexpectedly applied when biha triggers the reload.
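For illustration, after biha configures replication, a follower's postgresql.auto.conf may contain entries along these lines (the host and port values are hypothetical placeholders; the slot name follows the biha_node_id format described below):
# Hypothetical excerpt from postgresql.auto.conf on follower node 2
primary_conninfo = 'host=leader_host port=5432 user=biha_replication_user'
primary_slot_name = 'biha_node_2'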
To set up a built-in high-availability cluster from scratch, you first need to
install Postgres Pro on all nodes of your cluster.
Postgres Pro includes all the required dependencies
and extensions. After Postgres Pro is installed,
execute the following commands to set up your cluster:
bihactl init and
bihactl add. The cluster is
configured automatically when the above commands are executed and you only
need to specify some cluster-related options. Note that a password specified
in the password file is required to
connect the follower node to the leader node.
Perform the steps described below to set up the cluster:
Initialize the cluster and create the leader node.
Execute the bihactl init
command with the necessary options:
bihactl init \
    --biha-node-id=1 \
    --host=localhost \
    --port=5432 \
    --biha-port=5433 \
    --nquorum=2 \
    --minnodes=2 \
    --pgdata=leader_local_PGDATA_directory > /tmp/magic-file
In this case, the initdb utility is invoked, the postgresql.conf and pg_hba.conf files are modified, and a special string, called the magic string, is created, containing the data needed to connect follower nodes to the leader node at the next stage.
You might want to save the magic string for the next step of adding the follower node:
export MAGIC_STRING="$(cat /tmp/magic-file)"
Execute the following command to start the DBMS:
pg_ctl start -D leader_local_PGDATA_directory -l leader_log_file
Add the follower node.
Execute the bihactl add
command with the necessary options. Note that a password specified
in the password file is required
to connect the follower node to the leader node.
bihactl add \
    --biha-node-id=2 \
    --host=localhost \
    --port=5434 \
    --biha-port=5435 \
    --use-leader "host=leader_host port=leader_port biha-port=leader_biha_port" \
    --pgdata=follower_PGDATA_directory
The follower node can also be added using the magic string:
bihactl add \
    --biha-node-id=2 \
    --host=localhost \
    --port=5434 \
    --biha-port=5435 \
    --magic-string=magic_string \
    --pgdata=follower_PGDATA_directory
In this case, a backup of the leader node is created by means of
pg_basebackup or pg_probackup,
depending on the value set in the --backup-options
parameter. In addition, the
postgresql.conf
and pg_hba.conf files are
modified.
Execute the following command to start the DBMS:
pg_ctl start -D follower_PGDATA_directory -l follower_log_file
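Once both nodes are running, you can verify the cluster state by querying the biha.status_v view described below; the port and database name here follow the examples above:
psql -p 5432 -d biha_db -c "SELECT * FROM biha.status_v;"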
To set up a built-in high-availability cluster from existing nodes, you
first need to have the nodes created by means of the
initdb command. If you already have such nodes,
execute the following commands to set up your cluster:
bihactl init with the
--convert option and then
bihactl add if you have only
one existing node, want to make it the leader, and want to add new nodes;
or bihactl init with the
--convert option and then bihactl
add with the --convert-standby
option if you want to convert the existing nodes to make them the leader and
followers of this cluster. The cluster is configured automatically when the
above commands are executed, and you only need to specify some cluster-related
options. Note that a password specified in the
password file is required to connect the
follower node to the leader node.
Perform the steps described below to set up the cluster:
Convert an existing node to make it the leader node.
Execute the bihactl init
command with the --convert option:
bihactl init \
    --biha-node-id=1 \
    --host=localhost \
    --port=5432 \
    --biha-port=5433 \
    --nquorum=2 \
    --minnodes=2 \
    --convert \
    --pgdata=leader_local_PGDATA_directory > magic-file
In this case, the postgresql.conf and pg_hba.conf files are modified, and a magic string is created containing the data needed to connect follower nodes to the leader node at the next stage.
You might want to save the magic string for the next step of adding the follower node:
export MAGIC_STRING="$(cat magic-file)"
Execute the following command to start the DBMS:
pg_ctl start -D leader_local_PGDATA_directory -l leader_log_file
Note that before proceeding to the next step of converting an existing node to the follower node, it is necessary to stop this node:
pg_ctl stop -D follower_PGDATA_directory
Add the follower node.
You can add the follower node in two ways:
Execute the bihactl add
command with the --convert-standby option to
convert the existing node to the follower node.
bihactl add \
    --biha-node-id=2 \
    --host=localhost \
    --port=5434 \
    --biha-port=5435 \
    --use-leader "host=leader_host port=leader_port biha-port=leader_biha_port" \
    --pgdata=follower_PGDATA_directory \
    --convert-standby
Execute the bihactl add
command with additional options to add a new follower node from
scratch.
When converting an existing node to the follower node,
biha creates the
follower_PGDATA_directory/pg_biha/biha.state
and
follower_PGDATA_directory/pg_biha/biha.conf
files required for the node to be connected to the cluster and modifies
postgresql.conf
and pg_hba.conf.
Execute the following command to start the DBMS:
pg_ctl start -D follower_PGDATA_directory -l follower_log_file
You can change the cluster composition by adding or removing nodes.
To add a node, use the bihactl add
command with the relevant options. To remove a node, use the
biha.remove_node function. For more
information on how to set up a high-availability cluster, see
Section F.8.2.
In addition to the built-in failover capabilities, the high-availability cluster in Postgres Pro allows for manual switchover. The difference between failover and switchover is that the former is performed automatically when the leader node fails, while the latter is done manually by the system administrator. To switch over the leader node, use the biha.set_leader function. When you set the new leader, the following happens:
All attempts to perform elections are blocked and the timeout is set.
The current leader node becomes the follower node.
The newly selected node becomes the new leader.
If the switchover process does not end within the established timeout, the selected node becomes the follower and new elections are performed to choose the new cluster leader.
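For example, a sketch of switching the leader role over to the node with ID 2, with the database and port as in the earlier examples:
psql -p 5432 -d biha_db -c "SELECT biha.set_leader(2);"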
When you initialize the high-availability cluster, the biha_db
database is created, and the biha extension
is created in the biha schema of the
biha_db database. In addition, the following roles are created
and used:
BIHA_CLUSTER_MANAGEMENT_ROLE that is responsible
for the management of the biha cluster.
BIHA_REPLICATION_ROLE that is responsible for
the data replication in the biha cluster.
This role is used when running pg_rewind and
pg_probackup.
biha_replication_user that automatically receives
the right to connect using the replication protocol and becomes a
member of the BIHA_REPLICATION_ROLE and
BIHA_CLUSTER_MANAGEMENT_ROLE roles. The role is
used by the bihactl utility, as well as when the
follower node connects to the leader node. This role owns
the biha_db database.
The predefined pg_monitor
role that is used to monitor the state of the built-in
high-availability cluster.
The cluster initialization process also creates
replication slots
named in the biha_node_id
format, where id is the ID of the node. These slots are controlled
automatically without the need to modify or delete them manually.
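You can inspect these slots on the leader with a standard catalog query, for example:
psql -p 5432 -c "SELECT slot_name, active FROM pg_replication_slots;"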
NODE_ERROR State #Errors that occur in biha or in the server instance processes listed below may cause a node failure, i.e. the node will not be able to restart and its WAL will be damaged:
A rewind automatically performed by biha using pg_rewind if node timelines diverge.
The walreceiver process in case of timeline divergence. The follower node WAL may be partially rewritten by the WAL received from the leader node.
To restore the node from the NODE_ERROR state, take the
following steps:
Save the most recent files from the pg_wal
directory, since some of the files unique to this node will be
rewritten by pg_rewind.
Run pg_rewind with the --biha
option to save biha configuration files.
Start the node and call the biha.reset_node_error function.
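A minimal sketch of this recovery sequence, with illustrative placeholder paths and connection settings:
# Save WAL files unique to this node before they are rewritten:
cp -a follower_PGDATA_directory/pg_wal /some/safe/location/pg_wal.bak
# Rewind the node, keeping biha configuration files:
pg_rewind --target-pgdata=follower_PGDATA_directory --source-server="host=leader_host port=leader_port" --biha
# Start the node and reset the error state:
pg_ctl start -D follower_PGDATA_directory -l follower_log_file
psql -p 5434 -d biha_db -c "SELECT biha.reset_node_error();"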
Built-in high availability capabilities in Postgres Pro
allow you to create a cluster with quorum-based synchronous replication.
To add nodes with synchronous replication to your cluster, specify the
--sync-standbys option when executing
bihactl init. This modifies
the synchronous_standby_names parameter, specifying
the set of synchronous standby servers with the ANY keyword.
The synchronous_commit parameter is used with
its default value on. Note that synchronous replication
is set up only for the number of nodes specified in the --sync-standbys
option. Other nodes in the cluster are replicated asynchronously, since
streaming replication is asynchronous by default. For more information, see
Section 26.2.8.
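As a hypothetical illustration only, with --sync-standbys=1 in a three-node cluster the resulting setting could look as follows (the exact standby names biha generates may differ):
synchronous_standby_names = 'ANY 1 (biha_node_2, biha_node_3)'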
The biha extension provides several configuration parameters described below that are specific to the built-in high-availability cluster. In addition, some Postgres Pro configuration parameters are also used, more specifically listen_addresses, port, shared_preload_libraries, and wal_keep_size, which are set automatically at cluster set-up, as well as hot_standby, max_replication_slots, max_slot_wal_keep_size, max_wal_senders, and wal_level, which are used with the default values.
biha.autorewind
#
An optional parameter that controls the automatic rewind policy of a
node. The default value is false, meaning that
the automatic rewind is not performed. When the value is set to
true, the automatic rewind is performed after an
error that would otherwise cause the NODE_ERROR state of the
node. The automatic rewind is only performed if it can complete successfully,
meaning that a preliminary run of pg_rewind
with the --dry-run option succeeded. If the
automatic rewind fails, the node is transferred to the
NODE_ERROR state and the
rewind.signal file is created in the
PGDATA directory. Note that you may lose some records of
the node WAL because of the rewind.
biha.heartbeat_max_lost
#
Specifies the maximum number of heartbeats that can be missed before the
action is taken. This parameter can be set with the
biha.set_heartbeat_max_lost function. The default
value is 10.
biha.heartbeat_send_period
#
Specifies the heartbeat sending frequency, in milliseconds. This
parameter can be set with the biha.set_heartbeat_send_period
function. The default value is 1000.
biha.host
#Specifies the host of the high-availability cluster node. This parameter is unique for each node of the cluster. For the first node, it is set at cluster initialization; for other nodes, it is set when they are added to the cluster. It is not recommended to modify this parameter.
biha.id
#Specifies the ID of the high-availability cluster node. This parameter is unique for each node of the cluster. For the first node, it is set at cluster initialization; for other nodes, it is set when they are added to the cluster. It is not recommended to modify this parameter.
biha.minnodes
#Specifies the minimum number of operational nodes for the leader node to be open for write transactions.
biha.no_wal_on_follower
#
Specifies the maximum timeout during which followers can wait to receive
the WAL from the leader, in milliseconds. This parameter can be set with
the biha.set_no_wal_on_follower function. The default
value is 20000.
biha.nquorum
#Specifies the number of nodes required when electing the leader node in the cluster.
biha.port
#Specifies the port used to exchange service information between nodes. This parameter is required to establish a connection with the cluster. It is not recommended to modify this parameter.
biha.get_magic_string () returns string
#Generates a magic string for the cluster node.
biha.remove_node (id integer) returns boolean
#Removes the node from the cluster.
biha.set_leader (id integer) returns boolean
#Sets the leader node manually.
biha.config () returns setof record
#
Returns the cluster configuration values: id,
term, nquorum,
minnodes, heartbeat_send_period,
heartbeat_max_lost, no_wal_on_follower.
biha.set_heartbeat_max_lost (integer) returns boolean
#
Sets the maximum number of heartbeats that can be missed before
action is taken. For example, if the value is set to 5 and cluster
followers do not receive this number of heartbeats from the leader,
they start to propose themselves as a leader, i.e. CANDIDATE,
and elections begin. To start the elections, the followers also take
into account the timeout set with the
biha.set_no_wal_on_follower function. If the leader
does not receive 5 heartbeats from a follower in this case, the follower
state changes to UNKNOWN. This function can be called
only from the leader node.
biha.set_heartbeat_send_period (integer) returns boolean
#Sets the heartbeat sending frequency, in milliseconds. This function can be called only from the leader node.
biha.set_no_wal_on_follower (integer) returns boolean
#
Sets the maximum timeout during which followers can wait to receive
the WAL from the leader, in milliseconds. If followers do not receive
the WAL within this timeout, they start to propose themselves as a
leader, i.e. CANDIDATE, and elections begin. To start the
elections, the followers also take into account the maximum number of
missed heartbeats from the leader set with the
biha.set_heartbeat_max_lost function. This function
can be called only from the leader node.
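For example, failover detection could be tuned from the leader node as follows (the values are illustrative; the database and port follow the earlier examples):
psql -p 5432 -d biha_db \
  -c "SELECT biha.set_heartbeat_send_period(500);" \
  -c "SELECT biha.set_heartbeat_max_lost(5);" \
  -c "SELECT biha.set_no_wal_on_follower(10000);"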
biha.set_nquorum_and_minnodes (integer, integer) returns boolean
#
Sets the nquorum and minnodes values
for the cluster. This function can be called only from the leader node.
biha.nodes () returns setof record
#
Defines the biha.nodes_v view, which
is described in detail in the biha.nodes_v
section.
biha.status () returns setof record
#
Defines the biha.status_v view, which
is described in detail in the biha.status_v
section.
biha.error_details () returns setof record
#
Returns the description of why the node was transferred to the
NODE_ERROR state. The returned record contains the
type of the error, its details, and the place where it occurred, specifying
begin_lsn, end_lsn, the identifiers
of the current and the next timeline, as well as
replay_lsn.
biha.reset_node_error () returns void
#
Resets the NODE_ERROR state of the node.
For details on how to restore the node in this state, see
Section F.8.3.4.
biha.nodes_v #This view displays the connection status of nodes in the cluster.
Table F.6. The biha.nodes_v View
| Column Name | Description |
|---|---|
| id | The node ID. |
| host | The host of the node. |
| port | The port of the node. |
| state | The connection state of the node. This column may contain one of the following values: ACTIVE, CONNECTING, IDLE, or INIT. |
| since_conn_start | The time since the node connection. |
| conn_count | The number of times the node was connected since the start of the cluster. |
biha.status_v #This view displays the state of nodes in the cluster.
Table F.7. The biha.status_v View
| Column Name | Description |
|---|---|
| id | The node ID. |
| leader_id | The leader node ID. |
| term | The term of the node, used for the purposes of the leader election. |
| online | Shows whether the node is online. |
| state | The state of the node. This column may contain one of the following values: CANDIDATE, CSTATE_FORMING, FOLLOWER, FOLLOWER_OFFERED, FOLLOWER_VOTING, LEADER_RO, LEADER_RW, NODE_ERROR, NODE_ERROR_VOTING, STARTUP, or UNKNOWN. |
| last_known_state | The last known state of the node. |
| since_last_hb | The time since the last received heartbeat message. |