This section describes the basics of backup and recovery in Shardman.
You can use the backup command of the shardmanctl tool to perform a full,
binary-consistent backup of a Shardman cluster to a shared directory, or to
a local directory if --use-ssh is specified, and the recover command to
perform a recovery from this backup.
You can also use the probackup backup command of the shardmanctl tool to
perform a full, binary-consistent backup of a Shardman cluster to a backup
repository on the local host or in S3-compatible object storage, and the
probackup restore command to perform a recovery from any backup in the
repository.
The PostgreSQL pg_probackup utility for creating consistent full and
incremental backups is integrated into shardman-utils. shardman-utils
uses the pg_probackup approach to store backups in a pre-created
repository. In addition, the pg_probackup commands
archive-get and archive-push are used to deliver
WAL files to the backup repository. Backup and restore modes use a passwordless SSH
connection between the cluster nodes and the backup node.
The Shardman cluster configuration parameter enable_csn_snapshot must
be set to on. This parameter is required for the cluster backup to be
consistent; if it is disabled, a consistent backup is not possible.
For consistent visibility of distributed transactions, Shardman uses global snapshots based on physical clocks. The same technique yields a consistent snapshot for backups: the time corresponding to the global snapshot is mapped to a set of LSNs, one per node. Such a set of consistent LSNs in a cluster is called a syncpoint. By getting the syncpoint and taking from it the LSN for each node in the cluster, we can make a backup of each node that necessarily contains that LSN. We can then recover to this LSN using the point-in-time recovery (PITR) mechanism.
The backup and probackup commands use different
mechanisms to create backups. The backup command is based on the standard
pg_basebackup and pg_receivewal utilities, while
the probackup command uses the pg_probackup utility and its
options to create a cluster backup. In either case, when restoring,
the node names, defined by hostname or IP address, must match those that were in place at the time of the backup.
This section describes the basics of backup and recovery in Shardman
with the basebackup command.
To back up and restore a Shardman cluster via the basebackup command, the following
requirements must be met:
The Shardman cluster configuration parameter enable_csn_snapshot must be on. This
parameter is required for the cluster backup to be consistent; if it is
disabled, a consistent backup is not possible.
On each Shardman cluster node, Shardman
utilities must be installed into /opt/pgpro/sdm-14/bin.
On each Shardman cluster node, pg_basebackup
must be installed into /opt/pgpro/sdm-14/bin.
On each Shardman cluster node, the postgres Linux user
and group must be created.
A passwordless SSH connection between the Shardman cluster nodes
must be configured for the postgres Linux user.
If the --use-ssh flag is not specified, all Shardman cluster nodes must be connected to a shared
network storage, and a backup folder must be created on that shared network storage.
If the --use-ssh flag is specified, the backup directory can be created on local storage on the node
where recover will be called.
Access to the backup folder must be granted to the
postgres Linux user.
The shardmanctl utility must be run as the postgres
Linux user.
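With these requirements met, a backup can be taken with a single command. The following invocation is an illustrative sketch: the etcd endpoints and target directory are placeholders, and it assumes the backup destination is passed via a --datadir option:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 backup --datadir /mnt/backups/shardman
With --use-ssh, the same directory sketch would point to local storage on the node where recover will later be called.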
shardmanctl conducts a backup task in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations.
Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.
Creates replication slots on each replication group to ensure that WAL records are not lost.
Dumps Shardman metadata stored in etcd to a JSON file in the backup directory.
To get backups from each replication group, concurrently runs pg_basebackup using the created replication slots.
Creates the syncpoint and uses pg_receivewal to fetch the WAL generated after each base backup finishes, until the LSNs extracted from the syncpoint are reached.
Fixes the partial WAL files generated by pg_receivewal and creates the backup description file.
You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.
shardmanctl can perform a full, metadata-only, or schema-only restore. Metadata-only restore is useful if issues are encountered with the etcd instance, but DBMS data is not corrupted.
During metadata-only restore, shardmanctl restores etcd data from the dump created during the backup.
Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.
Schema-only restore restores only the schema information, without data. It can be useful when the data volume is large and only the schema is needed for testing or verification.
During a full restore, shardmanctl checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore on an empty cluster, but need to add as many replication groups as necessary for the total number of them to match that of the cluster from which the backup was taken.
shardmanctl probackup restore can restore
a working or partially working cluster from a backup that was
created on a working or partially working cluster.
You can also restore only a single shard using the --shard parameter.
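For illustration, hypothetical recover invocations for the restore modes described above (the etcd endpoints, the path to the backup description file, and the shard name are placeholders; the --info, --metadata-only, and --shard option names are assumptions about the recover syntax):
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 recover --info /mnt/backups/shardman/backup_info
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 recover --info /mnt/backups/shardman/backup_info --metadata-only
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 recover --info /mnt/backups/shardman/backup_info --shard shard-1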
shardmanctl conducts full restore in several steps. The tool:
Takes the necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.
Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.
When the correct metadata is in place, runs stolon
init in PITR initialization mode with RecoveryTargetName
set to the value of the syncpoint LSN from the backup info file.
DataRestoreCommand and RestoreCommand are also
taken from the backup info file.
Waits for each replication group to recover.
This section describes the basics of backup and recovery in Shardman
with the probackup command.
You can use the probackup
backup command of the
shardmanctl tool to perform binary backups of a Shardman
cluster into a backup repository on the local (backup) host, and the probackup
restore command to perform a recovery
from the selected backup. Full and partial (delta) backups are supported.
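As a sketch, a full backup followed by a delta backup might look as follows; the etcd endpoints and repository path are placeholders, and the --backup-mode option name is an assumption mirroring the pg_probackup FULL/DELTA modes:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup backup --backup-path backup_dir --backup-mode FULL
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup backup --backup-path backup_dir --backup-mode DELTA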
To back up and restore a Shardman cluster via the probackup command, the following
requirements must be met:
The Shardman cluster configuration parameter enable_csn_snapshot must be on. This
parameter is required for the cluster backup to be consistent; if it is
disabled, a consistent backup is not possible.
On the backup host, Shardman utilities must be installed into
/opt/pgpro/sdm-14/bin.
On the backup host and on each cluster node, pg_probackup
must be installed into /opt/pgpro/sdm-14/bin.
On the backup host, the postgres Linux user and group must
be created.
A passwordless SSH connection between the backup host and each Shardman cluster node
must be configured for the postgres Linux user. To do this, on each node,
the postgres user must create the .ssh subdirectory
in the /var/lib/postgresql directory and place there the keys required
for the passwordless SSH connection.
You can disable SSH for data copying by setting the
--storage-type option to the mount
or S3 value
(but SSH will still be required to execute remote commands). This value will
also be used automatically during the restore process.
A backup folder, or a bucket in S3-compatible object storage, must be created.
Access to the backup folder must be granted to the postgres
Linux user.
The shardmanctl utility must be run as the postgres
Linux user.
The init subcommand for initializing the backup repository
must be successfully executed on the backup host.
The archive-command add subcommand, which enables
archive_command for each replication group to stream WAL into
the initialized repository, must be successfully executed on the backup host.
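The last two requirements can be sketched as follows; the etcd endpoints and repository path are placeholders, and the exact option spelling should be checked against the shardmanctl reference:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup init --backup-path backup_dir
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup archive-command add --backup-path backup_dir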
shardmanctl conducts a backup task in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations.
Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.
Dumps Shardman metadata stored in etcd to a JSON file in the backup directory or in the bucket in S3-compatible object storage.
To get backups from each replication group, concurrently runs
pg_probackup using the configured archive_command.
Creates the syncpoint and gets the LSNs for each replication group
from the syncpoint data structure. Then uses the
pg_probackup archive-push command to push the WAL generated after
the backup finishes, as well as the WAL file containing the syncpoint LSN of each
replication group.
You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.
You can also restore other clusters from the same backup if these clusters have the same topology.
shardmanctl can perform a full, metadata-only, or schema-only restore. Metadata-only restore is useful if issues are encountered with the etcd instance, but DBMS data is not corrupted.
During metadata-only restore, shardmanctl restores etcd data from the dump created during the backup.
Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.
Schema-only restore restores only the schema information, without data. It can be useful when the data volume is large and only the schema is needed for testing or verification.
During a full restore, shardmanctl checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore on an empty cluster, but need to add as many replication groups as necessary for the total number of them to match that of the cluster from which the backup was taken.
You can also restore only a single shard using the --shard parameter.
You can also perform point-in-time recovery using the --recovery-target-time parameter. In this case, Shardman finds
the syncpoint closest to the specified timestamp and suggests restoring to the corresponding LSN.
You can also specify a --wal-limit option to limit the number of WAL segments to be processed.
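A hypothetical point-in-time restore combining these options might look as follows; the etcd endpoints, repository path, timestamp, and WAL limit are placeholders for illustration:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup restore --backup-path backup_dir --recovery-target-time "2024-05-01 12:00:00" --wal-limit 100
Shardman would then report the syncpoint closest to the given timestamp and restore to the corresponding LSN.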
shardmanctl conducts full restore in several steps. The tool:
Takes the necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.
Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.
When the correct metadata is in place, runs stolon
init in PITR initialization mode with RecoveryTargetName
set to the value of the syncpoint LSN from the backup info
file.
DataRestoreCommand and RestoreCommand are also
taken from the backup info file. These commands are generated automatically
during the backup phase, so it is not recommended to make any corrections to the
file containing the Shardman cluster backup description. When restoring a
cluster, for each replication group the WAL files containing the final LSN to
restore are requested automatically from the backup repository on the
remote backup node via the pg_probackup archive-get
command.
Waits for each replication group to recover.
Finally, enables archive_command again.
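Assuming archive_command is re-enabled with the same archive-command add subcommand used during setup (the etcd endpoints and repository path are placeholders), this final step might look like:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup archive-command add --backup-path backup_dir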
When performing a sequential restoration, be cautious of potential timeline conflicts within WAL (write-ahead log) segments. This issue commonly arises when restoring a database from a backup that was created at a certain point in time. If the database continues to operate and generate WAL segments after this backup, these new WAL segments are associated with a different timeline. During restoration, if the system tries to replay WAL segments from a different timeline, one that diverged from the point of backup, it can lead to inconsistencies and conflicts. Additionally, after completing a restoration, it is strongly advised not to restore the database onto the same timeline or onto any timeline that precedes the one from which the backup was made.
The more incremental backups are created, the bigger the total size of the backup catalog grows. To save disk space, you can merge incremental backups into their parent full backup by running the merge command, specifying the backup ID of the most recent incremental backup to merge:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup merge --backup-path backup_dir --backup-id backup_id
This command merges the backups that belong to a common incremental backup chain. If a full backup is specified, it is merged with its first incremental backup. If an incremental backup is specified, it is merged into its parent full backup, along with all the incremental backups between them. Once the merge is complete, the full backup covers all the merged data, and the incremental backups are removed as redundant. Thus, the merge operation is virtually equivalent to removing all the outdated backups from a full backup, but much faster, especially for large data volumes. It also saves I/O and network traffic when using pg_probackup in remote mode.
Before merging, pg_probackup
validates all the affected backups to ensure that they are valid.
The current backup status can be seen by running
the show
command:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup show --backup-path backup_dir
For more information, see reference.
To delete a backup that is no longer needed, run the following command:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id
This command deletes the backup with the specified
backup_id, along with all the
incremental backups that descend from it, if any.
This way you can delete some of the recent incremental backups
without affecting the underlying full backup and
other incremental backups that follow it.
To delete the obsolete WAL files that are not needed for recovery,
use the --delete-wal flag:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id --delete-wal
For more information, see reference.