This section describes disaster recovery cluster specifics that provide more robust availability.
DB — Database.
DBMS — Database management system.
DC — Data center.
MDC — Main data center.
BDC — Backup data center.
HaC — High availability cluster.
DRC — Disaster recovery cluster.
The MDC hosts the primary cluster shards. Shards are high-availability clusters, each consisting of two nodes with Postgres Pro DBMS instances: one primary node and one synchronous standby. Every shard runs the shardmand service, which checks the Postgres Pro DBMS instances.
To ensure disaster recovery, the customer’s BDC must host an
identical cluster with the same configuration and set of components.
By default, the standby Shardman cluster
nodes are disabled. Continuous log delivery from the MDC to the BDC is
asynchronous and uses the physical replication mechanism.
It is based on the standard
pg_receivewal utility, which writes WALs to the
default instance directory $PGDATA/pg_wal.
This utility is managed by the cluster software.
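This kind of WAL delivery can be sketched with a plain pg_receivewal invocation. The host, user, and slot names below are hypothetical placeholders; in a real DRC setup the utility is started and supervised by the cluster software, not by hand:

```shell
# Hypothetical example: stream WAL from an MDC shard primary into the
# BDC instance's pg_wal directory. Host, user, and slot names are
# placeholders; the cluster software manages this process in practice.
pg_receivewal \
  --host=mdc-shard1-primary.example.com \
  --port=5432 \
  --username=replication_user \
  --slot=bdc_shard1_slot \
  --directory="$PGDATA/pg_wal"
```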
The BDC interacts with the MDC storage: it receives information
about MDC configuration updates and applies them. Syncpoints
are calculated by a heuristic algorithm that analyses all
WALs received by pg_receivewal, which makes it possible to determine
the possible distributed consistent points.
The DRC can be managed with the shardmanctl utility. See the following commands: shardmanctl cluster standby enable, shardmanctl cluster standby disable, shardmanctl cluster standby catchup, and shardmanctl config update --from-primary-cluster.
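The lifecycle of a standby cluster can be sketched with these commands. The sequence below is an illustration only, not a prescribed procedure; consult the Shardman documentation for the exact steps in your version:

```shell
# Illustrative sequence; the exact procedure depends on the Shardman
# version and your deployment.

# Enable the standby cluster nodes in the BDC:
shardmanctl cluster standby enable

# Pull configuration changes from the primary (MDC) cluster:
shardmanctl config update --from-primary-cluster

# If the standby cluster lags behind the MDC, catch up to a syncpoint:
shardmanctl cluster standby catchup

# Disable the standby cluster nodes again (the default state):
shardmanctl cluster standby disable
```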
Note that the
shardmanctl probackup restore
command can be used to deploy a standby cluster from a backup of the
primary cluster by specifying the backup path from the data center
with --backup-path. In this case,
--schema-only, metadata-only, and single-shard restore
are not supported. The secondary cluster
must be in standby mode at the moment
of the command execution. Once done, if the cluster lags behind the
MDC cluster, use the
shardmanctl cluster standby catchup
command.
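As a sketch, deploying a standby cluster from a primary-cluster backup and then catching up might look like this; the backup path is a placeholder:

```shell
# Hypothetical example; /mnt/backups/primary is a placeholder path.
# The secondary cluster must already be in standby mode.
shardmanctl probackup restore --backup-path /mnt/backups/primary

# If the restored cluster lags behind the MDC, catch up:
shardmanctl cluster standby catchup
```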
intervalWalSyncArbiter
The BDC keeper calculates possible syncpoints,
then the intervalWalSyncArbiter process
chooses one applicable to all instances and initiates a catchup to
the chosen syncpoint.
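The arbiter's choice can be illustrated with a toy example: if each shard reports the latest syncpoint candidate it has received, the only point applicable to all instances must not exceed any shard's position, i.e. it is the minimum. This illustrates the idea only, not Shardman internals; real LSN comparison is more involved than a lexicographic sort:

```shell
# Toy illustration: three shards report their latest received syncpoint
# candidates as LSNs. A syncpoint usable by every instance must not
# exceed any shard's position, so pick the minimum. A lexicographic
# sort only works here because the sample LSNs have equal length.
printf '0/5000A10\n0/4FF0000\n0/5000000\n' | sort | head -n1
# prints 0/4FF0000
```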
Streaming physical replication is provided:
From the primary to the standby Postgres Pro DBMS nodes within MDC shards (synchronous)
From the primary to the standby Postgres Pro DBMS nodes within BDC shards (synchronous)
From the MDC shard nodes to the BDC (asynchronous)
MDC and BDC hardware must have identical system resources and configuration for all the DRC components.
The DCs must be connected with a fiber-optic network with a capacity of at least 20 Gbit/s. A backup channel is also required.
To provide high-availability and disaster recovery clusters, Shardman uses the Postgres Pro built-in streaming physical replication mechanism; replication to the BDC is asynchronous.
Automatic recovery of a high-availability Shardman cluster is ensured by the cluster software.
DRC cluster recovery is only provided in a semi-automatic mode and must be initiated manually.
Shardman cluster monitoring and management is provided within one DC with the shardmanctl utility.
To see the status of a cluster in standby mode, use shardmanctl cluster status.
A secure channel between DCs is required.
Inter-node authentication and authorization is ensured by the built-in Postgres Pro DBMS tools.
Protection from unauthorized access to standby servers is provided by the operating system and network tools.
It is recommended to perform periodic switchovers.
Data integrity check after a failover is provided by the backup utility
shardmanctl probackup.
Should the MDC fail, the administrator must make sure it is,
indeed, unavailable and initiate promotion of the standby nodes.
The standby cluster upgrades its state from
standby to primary.
This process is initiated and managed only with the
shardmanctl utility; no other procedures are required.
To recover remote nodes to the MDC, create a backup of the
primary cluster and restore it on these nodes. The backup can be
created either as a cold backup or with the pg_probackup
repository. Both options require a backup recovery to the MDC.
Once the DB is restored from the backup, run pg_receivewal:
it connects to a special replication slot on a primary or standby shard
in the BDC, receives WAL segments asynchronously, and writes them
to the $PGDATA/pg_wal directory of the main node.
In the BDC cluster, a script creates a consistent point once per specified period of time. It is written to the BDC built-in storage and sent to the MDC storage. Once a syncpoint is there, the standby cluster nodes check whether a WAL with this record has been received. If it is received by all the standby cluster nodes, the cluster software starts the DBMS server in recovery mode, applying WAL up to the syncpoint. Once the syncpoint is reached, no more WALs are applied. If all nodes have successfully applied the WAL records, the DBMS server is stopped, followed by another cycle of receiving WAL, syncpoint checking, and recovery.
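In stock PostgreSQL terms, one such replay cycle corresponds to recovery with a target. The fragment below is a hedged sketch using standard PostgreSQL recovery parameters, not the exact configuration Shardman generates; the LSN value is a placeholder:

```
# Sketch of one recovery cycle expressed with standard PostgreSQL
# recovery parameters; Shardman's cluster software manages this
# internally. The LSN is a placeholder syncpoint.
recovery_target_lsn = '0/5000A10'       # placeholder syncpoint LSN
recovery_target_action = 'shutdown'     # stop once the syncpoint is reached
recovery_target_inclusive = true
```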
To switch back to the MDC, create a cluster backup in the BDC, transfer it to the MDC, and run the nodes in standby mode. Once the missing WALs are received, the BDC cluster nodes are stopped, and the MDC cluster nodes are promoted.
Within a GDS (geographically distributed system), the BDC cluster must have a storage for backups identical to that of the MDC. Regular syncing between the main and backup storage is also required.
The period of time the backups are stored is defined by the backup policy.