This section describes the basics of backup and recovery in Shardman.
You can use the backup
command of the shardman-ladle tool to perform a full binary consistent
backup of a Shardman cluster to a directory on the local host and
the recover command to perform
a recovery from this backup.
You can also use the probackup backup
command of the shardman-ladle tool to perform a full binary consistent
backup of a Shardman cluster to the backup repository on the local host and the
probackup restore
command to perform a recovery from any backup in the repository.
The pg_probackup utility, which creates consistent full and incremental backups for PostgreSQL, is integrated into shardman-utils. shardman-utils follows the pg_probackup approach of storing backups in a pre-created repository. In addition, the pg_probackup commands archive-push and archive-get are used to deliver WAL files to the backup repository and to fetch them from it. Backup and restore modes use a passwordless ssh connection between the cluster nodes and the backup node.
The Shardman cluster configuration parameter enable_csn_snapshot must be
set to on. This parameter is necessary for the cluster backup to be consistent.
If this option is disabled, a consistent backup is not possible.
For consistent visibility of distributed transactions, the technique of global snapshots based on physical clocks is used (Clock-SI). A consistent snapshot for backups can be obtained in a similar way, except that the time corresponding to the global snapshot must be mapped to an LSN on each node. Such a set of consistent LSNs in a cluster is called a Syncpoint. After getting a Syncpoint and taking the LSN for each node in the cluster from it, we can make a backup of each node that must contain that LSN. We can also recover to this LSN using the point-in-time recovery (PITR) mechanism.
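The relationship between a Syncpoint and per-node backups can be illustrated with a minimal Python sketch. The Syncpoint class, the group names, and the backup_covers_syncpoint helper are hypothetical and are not part of Shardman; only the LSN text format (two hexadecimal numbers separated by a slash, forming a 64-bit position) is standard PostgreSQL.

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass(frozen=True)
class Syncpoint:
    """Hypothetical model: one consistent LSN per replication group."""
    lsns: dict  # replication group name -> LSN string


def backup_covers_syncpoint(syncpoint: Syncpoint, backup_end_lsns: dict) -> bool:
    """True if every group's backup (base backup plus WAL) reaches its Syncpoint LSN."""
    return all(
        group in backup_end_lsns
        and parse_lsn(backup_end_lsns[group]) >= parse_lsn(target)
        for group, target in syncpoint.lsns.items()
    )


# Illustrative values only.
sp = Syncpoint({"rg-1": "0/3000060", "rg-2": "1/A0"})
print(backup_covers_syncpoint(sp, {"rg-1": "0/3000100", "rg-2": "1/A0"}))  # True
print(backup_covers_syncpoint(sp, {"rg-1": "0/2FFFFFF", "rg-2": "1/A0"}))  # False
```

In the second call the backup of rg-1 ends before its Syncpoint LSN, so a recovery to the Syncpoint would not be possible for that group.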
The backup and probackup commands use different mechanisms to create backups.
The backup command is based on the standard utilities pg_basebackup and pg_receivewal.
The probackup command uses the pg_probackup utility and its options to create a cluster backup.
shardman-ladle conducts a backup task in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations.
Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.
Creates replication slots on each replication group to ensure that WAL records are not lost.
Dumps Shardman metadata, stored in etcd, to a JSON file in the backup directory.
Concurrently runs pg_basebackup for each replication group, using the replication slots created earlier.
Creates a Syncpoint and uses pg_receivewal to fetch the WAL generated after the base backups finished, until the LSNs extracted from the Syncpoint data structure are reached.
Fixes partial WAL files generated by pg_receivewal and creates the backup description file.
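The text does not say how shardman-ladle fixes partial WAL files. As a rough illustration of the last step, the Python sketch below assumes that the fix amounts to stripping the .partial suffix that pg_receivewal gives to a segment it has not finished writing. The function name and the directory layout are hypothetical.

```python
import tempfile
from pathlib import Path


def finalize_partial_segments(wal_dir: Path) -> list:
    """Rename '<segment>.partial' files to '<segment>'; return the new names."""
    renamed = []
    for partial in sorted(wal_dir.glob("*.partial")):
        final = partial.with_suffix("")  # drop the trailing '.partial'
        if final.exists():
            # A complete copy of this segment is already present; keep it.
            continue
        partial.rename(final)
        renamed.append(final.name)
    return renamed


# Demonstration on a throwaway directory with illustrative file names.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "000000010000000000000002").touch()
    (demo_dir / "000000010000000000000003.partial").touch()
    demo_result = finalize_partial_segments(demo_dir)
    print(demo_result)  # ['000000010000000000000003']
```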
You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.
shardman-ladle can perform either full restore or metadata-only restore. Metadata-only restore is useful if issues are encountered with the etcd instance, but DBMS data is not corrupted.
During metadata-only restore, shardman-ladle restores etcd data from the dump created during the backup.
Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.
During a full restore, shardman-ladle checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore to an empty cluster: you must first add replication groups until their total number matches that of the cluster from which the backup was taken.
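The compatibility rule can be written down as a small check. This Python sketch only restates the two conditions given in the text (same Shardman version, same number of replication groups); the ClusterInfo type and the version strings are hypothetical, and the real check inside shardman-ladle may consider more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInfo:
    """Hypothetical summary of a cluster or of the cluster a backup was taken from."""
    shardman_version: str
    replication_groups: int


def restore_problems(backup: ClusterInfo, target: ClusterInfo) -> list:
    """Return the reasons why the target is incompatible (empty list if compatible)."""
    problems = []
    if backup.shardman_version != target.shardman_version:
        problems.append(
            f"version mismatch: backup {backup.shardman_version}, "
            f"target {target.shardman_version}"
        )
    if backup.replication_groups != target.replication_groups:
        problems.append(
            f"replication group count mismatch: backup {backup.replication_groups}, "
            f"target {target.replication_groups}"
        )
    return problems


# Illustrative values: an empty target cluster cannot receive a 3-group backup.
backup_info = ClusterInfo("14.x", 3)
print(restore_problems(backup_info, ClusterInfo("14.x", 0)))
print(restore_problems(backup_info, ClusterInfo("14.x", 3)))  # []
```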
shardman-ladle conducts full restore in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.
Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.
When the correct metadata is in place,
runs stolon init in PITR initialization mode
with RecoveryTargetName set to the value of the Syncpoint LSN
from the backup info file.
DataRestoreCommand and RestoreCommand
are also taken from the backup info file.
Waits for each replication group to recover.
After recovery, settings in the pg_foreign_server
catalog can be incorrect
(since the data from old replication groups was restored to the new ones), but
shardman-monitor fixes this.
Checks that shardman-monitor
has fixed pg_foreign_server before reporting that
the recovery is successful.
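The waiting steps above follow a common pattern: poll every replication group until all of them report success or a deadline passes. The Python sketch below shows that pattern only. How shardman-ladle actually detects that a group has recovered is not described here, so the readiness checks are passed in as hypothetical callables.

```python
import time
from typing import Callable, Dict


def wait_for_groups(is_recovered: Dict[str, Callable[[], bool]],
                    timeout: float,
                    interval: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll each group's check until all return True; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    pending = set(is_recovered)
    while True:
        # Re-check only the groups that have not recovered yet.
        pending = {name for name in pending if not is_recovered[name]()}
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("not recovered in time: " + ", ".join(sorted(pending)))
        sleep(interval)


# Illustrative use: rg-1 is ready at once, rg-2 becomes ready on the third poll.
calls = {"rg-2": 0}


def rg2_ready() -> bool:
    calls["rg-2"] += 1
    return calls["rg-2"] >= 3


wait_for_groups({"rg-1": lambda: True, "rg-2": rg2_ready},
                timeout=5, sleep=lambda _: None)
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real delays.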
This section describes the basics of backup and recovery in Shardman with the probackup command.
You can use the probackup backup
command of the shardman-ladle tool
to perform binary backups of a
Shardman cluster into the backup repository on the local (backup) host
and the
probackup restore
command to perform a recovery from the selected backup. Full and partial (delta) backups are supported.
To back up and restore a Shardman cluster, the following requirements must be met:
The Shardman cluster configuration parameter enable_csn_snapshot must be on.
This parameter is necessary for the cluster backup to be consistent.
If this option is disabled, a consistent backup is not possible;
Shardman utilities must be installed into /opt/pgpro/sdm-14/bin on the backup host;
pg_probackup must be installed into /opt/pgpro/sdm-14/bin on the backup host and on each cluster node;
The postgres Linux user and group must be created on the backup host;
Passwordless ssh between the backup host and each Shardman cluster node must be configured for the postgres Linux user;
The backup folder must be created;
The postgres Linux user must be granted access to the backup folder;
The shardman-ladle utility must be run as the postgres Linux user;
The init subcommand, which initializes the backup repository, must be successfully executed on the backup host;
The archive-command add subcommand, which enables archive_command for each replication group to stream WAL into the initialized repository, must be successfully executed on the backup host.
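Some of the requirements above can be verified on the backup host before a backup is started. The following Python sketch is a hypothetical preflight helper, not a Shardman tool: it checks only the local conditions (the current user, the presence of the shardman-ladle and pg_probackup binaries, and access to the backup folder) and does not test ssh connectivity or the repository state.

```python
import getpass
import os
from pathlib import Path
from typing import Optional

REQUIRED_TOOLS = ("shardman-ladle", "pg_probackup")


def preflight(bin_dir: Path, backup_dir: Path,
              required_user: str = "postgres",
              current_user: Optional[str] = None) -> list:
    """Return a list of unmet local requirements (empty if all are met)."""
    problems = []
    user = current_user if current_user is not None else getpass.getuser()
    if user != required_user:
        problems.append(f"must run as {required_user}, not {user}")
    for tool in REQUIRED_TOOLS:
        path = bin_dir / tool
        if not (path.is_file() and os.access(path, os.X_OK)):
            problems.append(f"{tool} is missing or not executable in {bin_dir}")
    if not backup_dir.is_dir():
        problems.append(f"backup folder {backup_dir} does not exist")
    elif not os.access(backup_dir, os.W_OK):
        problems.append(f"backup folder {backup_dir} is not writable")
    return problems


if __name__ == "__main__":
    # The backup folder path below is an assumption chosen for illustration.
    for problem in preflight(Path("/opt/pgpro/sdm-14/bin"), Path("/var/backup/shardman")):
        print("unmet requirement:", problem)
```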
shardman-ladle conducts a backup task in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations.
Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.
Dumps Shardman metadata, stored in etcd, to a JSON file in the backup directory.
Concurrently runs pg_probackup for each replication group, using the configured archive_command.
Creates a Syncpoint and gets the LSN for each replication group from the Syncpoint data structure.
Then uses the pg_probackup archive-push command to push
the WAL generated after the backup finished, including the WAL file that contains the Syncpoint LSN, for each replication group.
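To see which WAL file must reach the repository for a given Syncpoint LSN, the segment file name can be derived from the LSN. The Python sketch below uses the standard PostgreSQL naming scheme (timeline, then the segment number split into two 8-digit hexadecimal fields) and assumes the default 16 MB segment size and timeline 1; the function name and sample LSNs are illustrative.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # PostgreSQL default; can be changed at initdb time


def wal_file_for_lsn(lsn: str, timeline: int = 1,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file that contains the given LSN."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)  # 64-bit byte position in WAL
    segments_per_xlogid = 0x100000000 // segment_size
    segno = position // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


print(wal_file_for_lsn("0/3000060"))  # 000000010000000000000003
print(wal_file_for_lsn("1/A0"))       # 000000010000000100000000
```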
You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.
shardman-ladle can perform either full restore or metadata-only restore. Metadata-only restore is useful if issues are encountered with the etcd instance, but DBMS data is not corrupted.
During metadata-only restore, shardman-ladle restores etcd data from the dump created during the backup.
Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.
During a full restore, shardman-ladle checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore to an empty cluster: you must first add replication groups until their total number matches that of the cluster from which the backup was taken.
shardman-ladle conducts full restore in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.
Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.
When the correct metadata is in place,
runs stolon init in PITR initialization mode
with RecoveryTargetName set to the value of the Syncpoint LSN
from the backup info file.
DataRestoreCommand and RestoreCommand
are also taken from the backup info file. These commands are generated automatically during the backup phase,
so it is not recommended to edit the file containing the Shardman cluster backup description.
When a cluster is restored, the WAL files containing the final LSN to restore to are
requested automatically for each replication group from the backup repository on the remote backup node via the
pg_probackup archive-get command.
Waits for each replication group to recover.
After recovery, settings in the pg_foreign_server
catalog can be incorrect
(since the data from old replication groups was restored to the new ones), but
shardman-monitor fixes this.
Checks that shardman-monitor
has fixed pg_foreign_server before reporting that
the recovery is successful.
Finally, archive_command needs to be enabled again.