The underlying functionality is under development. For production usage, contact Support.
DB — Database.
DBMS — Database management system.
DC — Data center.
MDC — Main data center.
BDC — Backup data center.
HaC — High availability cluster.
DRC — Disaster recovery cluster.
The MDC hosts the main cluster shards and the etcd cluster. Shards are high-availability clusters that consist of two nodes with Postgres Pro DBMS instances: one primary node and one synchronous standby. Every shard runs the shardmand service, which checks the Postgres Pro DBMS instances and exchanges information with the etcd cluster, thus providing Shardman clustering. The etcd cluster consists of three nodes, which ensures a quorum.
To ensure disaster recovery, the customer’s BDC must host an
identical cluster with the identical configuration and set of components.
By default, the standby Shardman cluster
nodes are disabled. Continuous log delivery from the MDC to the BDC is
asynchronous and uses the physical replication mechanisms.
It is based on the standard Shardman utility
pg_receivewal, which writes WALs to the
default instance directory $PGDATA/pg_wal.
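As a rough illustration of what the cluster software does under the hood, a pg_receivewal invocation could look like the following sketch. The host, user, and slot names are placeholders, not actual Shardman defaults.

```shell
# Hedged sketch: host, user, and slot names are placeholder values.
# Stream WAL segments from an MDC shard node into the local pg_wal directory;
# the cluster software normally runs an equivalent command itself.
pg_receivewal \
    --host=shard1-primary.mdc.example \
    --port=5432 \
    --username=replication_user \
    --slot=bdc_wal_slot \
    --directory="$PGDATA/pg_wal"
```

Do not run this manually on a managed cluster; it is shown only to clarify the delivery mechanism.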
This utility is managed by the cluster software. When a syncpoint
is detected in the standby etcd cluster,
the standby Shardman cluster nodes are
started by shardmand. This results in WAL
replay up to the LSN received from the syncpoint. The
etcd clusters in different DCs are isolated; therefore,
to propagate the syncpoint information, a script
periodically copies it from the MDC etcd to the BDC etcd.
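The propagation step could be sketched with standard etcdctl commands. The syncpoint key name and etcd endpoints below are assumptions for illustration, not the actual Shardman key layout.

```shell
# Hedged sketch: key name and endpoints are assumed, not Shardman's real layout.
# Read the latest syncpoint from the MDC etcd...
SYNCPOINT=$(etcdctl --endpoints=https://etcd.mdc.example:2379 \
    get /shardman/syncpoint --print-value-only)
# ...and write it into the BDC etcd, where the standby nodes can see it.
etcdctl --endpoints=https://etcd.bdc.example:2379 \
    put /shardman/syncpoint "$SYNCPOINT"
```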
Streaming physical replication is provided:
From the Postgres Pro DBMS shard nodes within the MDC (synchronous)
From the Postgres Pro DBMS shard nodes within the BDC (synchronous)
From the MDC Postgres Pro DBMS shard nodes to the BDC (asynchronous)
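The replication topology above can be inspected on a shard primary through the standard pg_stat_replication view; the host name below is a placeholder.

```shell
# Placeholder host; run against a shard primary node.
# sync_state is "sync" for the in-DC synchronous standby and "async"
# for the consumer delivering WAL toward the BDC.
psql -h shard1-primary.mdc.example -U postgres \
     -c "SELECT application_name, client_addr, state, sync_state
         FROM pg_stat_replication;"
```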
MDC and BDC hardware must have identical system resources and configuration for all the DRC components.
DCs must be connected by a fiber optic network with a capacity of at least 20 Gbit/s. A backup channel is also required.
To provide high-availability and disaster recovery clusters, Shardman uses the Postgres Pro built-in streaming physical replication mechanism; replication to the BDC is asynchronous.
Automatic recovery of a high-availability Shardman cluster is ensured by the cluster software.
DRC cluster recovery is only provided in a manually initiated, semi-automatic mode.
Shardman cluster monitoring and management is provided within one DC with the shardmanctl utility.
A secure channel between DCs is required.
Inter-node authentication and authorization is ensured by the built-in Postgres Pro DBMS tools.
Protection from unauthorized access to standby servers is provided by the operating system and network tools.
It is recommended to perform periodic switchovers.
A data integrity check after a failover is provided by the backup utility
shardmanctl probackup.
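shardmanctl probackup wraps the pg_probackup tool, so an integrity check could look like the following sketch of the underlying pg_probackup call. The backup catalog path and instance name are placeholders.

```shell
# Hedged sketch: catalog path and instance name are placeholder values.
# Validate the integrity of all backups for the given instance.
pg_probackup validate -B /mnt/backups --instance shardman-node1
```

Consult the Shardman documentation for the exact shardmanctl probackup invocation on your version.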
Should the MDC fail, the administrator must make sure it is,
indeed, unavailable and initiate the promotion of the standby nodes.
The standby cluster upgrades its state from
standby to master.
This process is initiated and managed by the
shardmanctl utility only; no other procedures are required.
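The failover procedure might be sketched as below. The promotion subcommand name here is a hypothetical placeholder; verify the actual command against the shardmanctl reference for your Shardman version before use.

```shell
# Hedged sketch: verify subcommand names against the shardmanctl reference.
shardmanctl status    # first confirm the cluster state and MDC unavailability
shardmanctl promote   # hypothetical name for the standby promotion step
```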
To recover the remote nodes in the MDC, create a backup of the
main cluster and restore it on these nodes. The backup can be
created either as a cold backup or with the pg_probackup
repository. Both options require restoring the backup in the MDC.
Once the DB is restored from the backup, run pg_receivewal;
it connects to a dedicated replication slot on a primary or standby
shard node in the BDC, receives WAL segments asynchronously, and writes
them to the $PGDATA/pg_wal directory of the main node.
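When the pg_probackup repository is used, the restore step could be sketched as follows. The catalog path, instance name, and data directory are placeholders.

```shell
# Hedged sketch: catalog path, instance name, and data directory are
# placeholder values.
pg_probackup restore -B /mnt/backups --instance shardman-node1 -D "$PGDATA"
```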
In the BDC cluster, a script creates a syncpoint at a specified interval. It is written to the BDC etcd and sent to the MDC etcd. Once a syncpoint is in etcd, the MDC standby cluster nodes check whether a WAL containing this record has been received. If it has been received by all the MDC standby cluster nodes, the cluster software starts the DBMS server in WAL recovery mode up to the syncpoint. Once the syncpoint is reached, no more WALs are applied. If all nodes have successfully applied the WAL records, the DBMS server is stopped, followed by another cycle of WAL receiving, syncpoint checking, and recovery.
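The "recover up to the syncpoint, then stop" step corresponds to standard PostgreSQL targeted recovery parameters; a sketch of what the cluster software effectively configures is shown below. The LSN is an example value standing in for the one taken from a syncpoint record.

```shell
# Hedged sketch: the LSN is an example standing in for the syncpoint value.
# Replay WAL up to the target LSN, then shut the server down.
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
recovery_target_lsn = '0/5000060'
recovery_target_action = 'shutdown'
EOF
touch "$PGDATA/recovery.signal"
pg_ctl -D "$PGDATA" start
```

recovery_target_action = 'shutdown' makes the server stop once the target LSN is reached, matching the cycle described above.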
To switch back to the MDC, create a cluster backup in the BDC, transfer it to the MDC, and run the nodes in standby mode. Once the missing WALs are received, the BDC cluster nodes are stopped, and the MDC cluster nodes are promoted.
Within the GDS (geographically distributed system), the BDC cluster must have a storage for the backups identical to that of the MDC. Regular syncing between the main and backup storage is also required.
The period of time the backups are stored is defined by the backup policy.
For more information on disaster failover and normal switchover to the MDC, contact Postgres Pro Support.