2.5. Backup and Recovery

2.5.1. Cluster Backup Process
2.5.2. Recovery from Shardman Backups

This section describes the basics of backup and recovery in Shardman.

You can use the backup command of the shardman-ladle tool to perform a binary backup of a Shardman cluster to a directory on the local host, and the recover command to restore the cluster from this backup.

2.5.1. Cluster Backup Process

shardman-ladle conducts a backup task in several steps. The tool:

  1. Takes necessary locks in etcd to prevent concurrent cluster-wide operations.

  2. Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.

  3. Creates replication slots on each replication group to ensure that WAL records are not lost.

  4. Dumps Shardman metadata, stored in etcd, to a json file in the backup directory.

  5. Concurrently runs pg_basebackup against each replication group, using the replication slots created in step 3, to take a base backup of each group.

  6. Creates restore points and uses pg_receivewal to fetch the WAL files generated after the base backups finish, until the restore points are reached.

  7. Fixes partial WAL files generated by pg_receivewal and creates the backup description file.
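The concurrency in step 5 can be illustrated with a minimal Python sketch. It is not shardman-ladle code: the function backup_groups and the callable backup_one are hypothetical names introduced here, and the sketch assumes only that one backup task is started per replication group and that a failure in any group fails the whole backup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def backup_groups(groups: List[str],
                  backup_one: Callable[[str], str]) -> Dict[str, str]:
    """Run backup_one for every replication group concurrently.

    backup_one is a hypothetical callable that backs up a single
    replication group (for example, by running pg_basebackup) and
    returns the location of the resulting backup. An exception raised
    for any group propagates to the caller.
    """
    if not groups:
        return {}
    # One worker per replication group, so all groups are backed up in parallel.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        futures = {name: pool.submit(backup_one, name) for name in groups}
        # result() re-raises any exception raised inside backup_one.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Stand-in for a real base backup: it only computes a target path.
    print(backup_groups(["rg-1", "rg-2"], lambda group: f"/backup/{group}"))
```

Passing the per-group work as a callable keeps the sketch independent of how the base backup is actually invoked.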

2.5.2. Recovery from Shardman Backups

You can restore a backup on the same cluster or on a compatible one. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.
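The compatibility rule can be written down as a small Python check. This is a sketch only: the ClusterInfo type and its field names are hypothetical and do not correspond to any structure in the backup description file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInfo:
    # Hypothetical fields; they only mirror the two criteria stated above.
    shardman_version: str
    replication_groups: int


def is_compatible(backup: ClusterInfo, target: ClusterInfo) -> bool:
    """Return True if a backup can be restored on the target cluster."""
    return (backup.shardman_version == target.shardman_version
            and backup.replication_groups == target.replication_groups)


if __name__ == "__main__":
    source = ClusterInfo(shardman_version="X.Y", replication_groups=3)
    empty = ClusterInfo(shardman_version="X.Y", replication_groups=0)
    print(is_compatible(source, source))  # True
    print(is_compatible(source, empty))   # False
```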

shardman-ladle can perform either a full restore or a metadata-only restore. A metadata-only restore is useful if the etcd instance has issues but the DBMS data is not corrupted.

During a metadata-only restore, shardman-ladle restores etcd data from the dump created during the backup.

Important

Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform a metadata-only restore if the cluster was reconfigured after the backup, for example, by adding or removing nodes, even if the same nodes were added back again.

During a full restore, shardman-ladle checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore to an empty cluster: you must first add replication groups until their number matches that of the cluster from which the backup was taken.

shardman-ladle performs a full restore in several steps. The tool:

  1. Takes necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.

  2. Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.

  3. When the correct metadata is in place, runs stolon init in PITR initialization mode with RecoveryTargetName set to the value of RestorePoint from the backup info file. DataRestoreCommand and ArchiveRecoverySettings.RestoreCommand are also taken from the backup info file.

  4. Waits for each replication group to recover.

    Note

    After recovery, settings in the pg_foreign_server catalog can be incorrect (since the data from old replication groups was restored to the new ones), but shardman-monitor fixes this.

  5. Checks that shardman-monitor has fixed pg_foreign_server and only then reports that the recovery is successful.
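The group assignment in step 1, together with the replication group count check, can be sketched in Python as follows. The actual matching rule used by shardman-ladle is not described here, so pairing the groups in sorted name order is purely an assumption made for illustration; assign_groups and IncompatibleClusterError are hypothetical names.

```python
from typing import Dict, List


class IncompatibleClusterError(Exception):
    """Raised when backup groups cannot be assigned to target groups."""


def assign_groups(from_backup: List[str], in_target: List[str]) -> Dict[str, str]:
    """Map each replication group in the backup to one in the target cluster.

    Assumption: groups are paired in sorted name order. The only rule
    taken from the text is that the counts must match, otherwise the
    recovery fails.
    """
    if len(from_backup) != len(in_target):
        raise IncompatibleClusterError(
            f"backup has {len(from_backup)} replication groups, "
            f"target cluster has {len(in_target)}"
        )
    return dict(zip(sorted(from_backup), sorted(in_target)))


if __name__ == "__main__":
    print(assign_groups(["old-2", "old-1"], ["new-1", "new-2"]))
```

Because the data of old replication groups ends up in new ones under such a mapping, foreign server settings can be stale right after recovery, which is why the last step waits for shardman-monitor to fix pg_foreign_server.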