This section describes disaster recovery cluster specifics that provide more robust availability.
DB — Database.
DBMS — Database management system.
DC — Data center.
MDC — Main data center.
BDC — Backup data center.
HaC — High availability cluster.
DRC — Disaster recovery cluster.
The MDC hosts the primary cluster shards. Shards are high-availability clusters, each consisting of two nodes with Postgres Pro DBMS instances: one primary node and one synchronous standby. Every shard runs the shardmand service, which checks the Postgres Pro DBMS instances.
To ensure disaster recovery, the customer’s BDC must host an
identical cluster with the same configuration and set of components.
By default, the standby Shardman cluster
nodes are disabled. Continuous log delivery from the MDC to the BDC is
asynchronous and uses the physical replication mechanism.
It is based on the standard
pg_receivewal utility, which writes WALs to the
default instance directory $PGDATA/pg_wal.
This utility is managed by the cluster software.
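This kind of WAL delivery can be sketched with a plain pg_receivewal invocation. The host, user, and slot names below are hypothetical placeholders; in a real DRC setup the utility is started and supervised by the cluster software, not by hand:

```shell
# Hypothetical example: stream WAL from an MDC shard primary into the
# BDC instance's pg_wal directory. Host, user, and slot names are
# placeholders; the cluster software manages this process in practice.
pg_receivewal \
  --host=mdc-shard1-primary.example.com \
  --port=5432 \
  --username=replication_user \
  --slot=bdc_shard1_slot \
  --directory="$PGDATA/pg_wal"
```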
The BDC interacts with the MDC storage: it receives information
about MDC configuration updates and applies them. Syncpoints
are calculated by a heuristic algorithm that analyses all
WALs received by pg_receivewal, which makes it possible to determine
the possible distributed consistent points.
The DRC can be managed with the shardmanctl utility. See the following commands: shardmanctl cluster standby enable, shardmanctl cluster standby disable, shardmanctl cluster standby catchup, and shardmanctl config update --from-primary-cluster.
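The lifecycle of a standby cluster can be sketched with these commands. The sequence below is an illustration only, not a prescribed procedure; consult the Shardman documentation for the exact steps in your version:

```shell
# Illustrative sequence; the exact procedure depends on the Shardman
# version and your deployment.

# Enable the standby cluster nodes in the BDC:
shardmanctl cluster standby enable

# Pull configuration changes from the primary (MDC) cluster:
shardmanctl config update --from-primary-cluster

# If the standby cluster lags behind the MDC, catch up to a syncpoint:
shardmanctl cluster standby catchup

# Disable the standby cluster nodes again (the default state):
shardmanctl cluster standby disable
```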
Note that the
shardmanctl probackup restore
command can be used to deploy a standby cluster from a backup of the
primary cluster by specifying the backup path from the data center
with --backup-path. In this case,
--schema-only, metadata-only, and single-shard restore
are not supported. The secondary cluster
must be in standby mode at the moment
of the command execution. Once done, if the cluster lags behind the
MDC cluster, use the
shardmanctl cluster standby catchup
command.
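As a sketch, deploying a standby cluster from a primary-cluster backup and then catching up might look like this; the backup path is a placeholder:

```shell
# Hypothetical example; /mnt/backups/primary is a placeholder path.
# The secondary cluster must already be in standby mode.
shardmanctl probackup restore --backup-path /mnt/backups/primary

# If the restored cluster lags behind the MDC, catch up:
shardmanctl cluster standby catchup
```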
intervalWalSyncArbiter
The BDC keeper calculates possible syncpoints,
then the intervalWalSyncArbiter process
chooses one applicable to all instances and initiates a catchup to
the chosen syncpoint.
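The arbiter's choice can be illustrated with a toy example: if each shard reports the latest syncpoint candidate it has received, the only point applicable to all instances must not exceed any shard's position, i.e. it is the minimum. This illustrates the idea only, not Shardman internals; real LSN comparison is more involved than a lexicographic sort:

```shell
# Toy illustration: three shards report their latest received syncpoint
# candidates as LSNs. A syncpoint usable by every instance must not
# exceed any shard's position, so pick the minimum. A lexicographic
# sort only works here because the sample LSNs have equal length.
printf '0/5000A10\n0/4FF0000\n0/5000000\n' | sort | head -n1
# prints 0/4FF0000
```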
Streaming physical replication is provided:
From the primary to the standby Postgres Pro DBMS nodes within MDC shards (synchronous)
From the primary to the standby Postgres Pro DBMS nodes within BDC shards (synchronous)
From the MDC shard nodes to the BDC (asynchronous)
MDC and BDC hardware must have identical system resources and configuration for all the DRC components.
The DCs must be connected with a fiber-optic network with a capacity of at least 20 Gbit/s. A backup channel is also required.
To provide high-availability and disaster recovery clusters, Shardman uses the Postgres Pro built-in streaming physical replication mechanism; replication to the BDC is asynchronous.
Automatic recovery of a high-availability Shardman cluster is ensured by the cluster software.
DRC cluster recovery is only provided in a semi-automatic mode and must be initiated manually.
Shardman cluster monitoring and management is provided within one DC with the shardmanctl utility.
To see the status of a cluster in standby mode, use shardmanctl cluster status.
A secure channel between DCs is required.
Inter-node authentication and authorization is ensured by the built-in Postgres Pro DBMS tools.
Protection from unauthorized access to standby servers is provided by the operating system and network tools.
It is recommended to perform periodic switchovers.
Data integrity check after a failover is provided by the backup utility
shardmanctl probackup.
Should the MDC fail, the administrator must make sure it is,
indeed, unavailable and initiate promotion of the standby nodes.
The standby cluster upgrades its state from
standby to primary.
This process is initiated and managed only with the
shardmanctl utility; no other procedures are required.
To recover remote nodes to the MDC, create a backup of the
primary cluster and restore it on these nodes. The backup can be
created either as a cold backup or with the pg_probackup
repository. Both options require a backup recovery to the MDC.
Once the DB is restored from the backup, run pg_receivewal:
it connects to a special replication slot on a primary or standby shard
in the BDC, receives WAL segments asynchronously, and writes them
to the $PGDATA/pg_wal directory of the main node.
In the BDC cluster, a script creates a consistent point once per specified period of time. It is written to the BDC built-in storage and sent to the MDC storage. Once a syncpoint is there, the standby cluster nodes check whether a WAL with this record has been received. If it is received by all the standby cluster nodes, the cluster software starts the DBMS server in recovery mode, applying WAL up to the syncpoint. Once the syncpoint is reached, no more WALs are applied. If all nodes have successfully applied the WAL records, the DBMS server is stopped, followed by another cycle of receiving WAL, syncpoint checking, and recovery.
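In stock PostgreSQL terms, one such replay cycle corresponds to recovery with a target. The fragment below is a hedged sketch using standard PostgreSQL recovery parameters, not the exact configuration Shardman generates; the LSN value is a placeholder:

```
# Sketch of one recovery cycle expressed with standard PostgreSQL
# recovery parameters; Shardman's cluster software manages this
# internally. The LSN is a placeholder syncpoint.
recovery_target_lsn = '0/5000A10'       # placeholder syncpoint LSN
recovery_target_action = 'shutdown'     # stop once the syncpoint is reached
recovery_target_inclusive = true
```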
To switch back to the MDC, create a cluster backup in the BDC, transfer it to the MDC, and run the nodes in standby mode. Once the missing WALs are received, the BDC cluster nodes are stopped, and the MDC cluster nodes are promoted.
Within a GDS (geographically distributed system), the BDC cluster must have a storage for backups identical to that of the MDC. Regular syncing between the main and backup storage is also required.
The period of time the backups are stored is defined by the backup policy.