The Shardman software comprises these main components: a PostgreSQL fork, a modified Stolon PostgreSQL manager, the Shardman extension, and management utilities. This section describes how they interact in a Shardman cluster.
A Shardman cluster is composed of several
PostgreSQL instances. Each instance can be
used to connect to the cluster and at the same time
holds parts of sharded tables or global tables. Sharded tables
are ordinary PostgreSQL partitioned tables in which a few partitions, making up a shard,
are regular local tables and the other partitions are foreign tables accessed on remote
servers via postgres_fdw. Global tables are available
to all nodes of the cluster. Currently, a global table is a regular table that is exposed as
a foreign table to the other servers. In the future, a global table is expected
to become a set of synchronized regular tables.
The number of partitions in a sharded table is defined when it is created and
cannot be changed afterwards. When new nodes are added to the cluster,
some partitions are moved from existing nodes to the new ones to balance
the load. So, to allow scaling of clusters, the initial number of partitions
should be high enough, but not too high since an extremely large number of
partitions significantly slows down query planning.
For example, if you expect the number of nodes in your cluster to grow
at most fourfold, create sharded tables with the number of partitions
equal to 4 * N, where N is the current number of nodes.
A cluster can no longer scale once the number of cluster nodes reaches
the number of partitions in the sharded table that has the fewest partitions.
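The sizing rule above can be illustrated with a small calculation. This is only a sketch of the arithmetic; the function names initial_partitions and max_cluster_size are hypothetical and are not part of Shardman.

```python
def initial_partitions(current_nodes: int, growth_factor: int) -> int:
    """Partition count to choose at table creation: growth_factor * N."""
    if current_nodes < 1 or growth_factor < 1:
        raise ValueError("both arguments must be positive integers")
    return current_nodes * growth_factor


def max_cluster_size(partition_counts: list[int]) -> int:
    """The cluster stops scaling once the node count reaches the
    partition count of the sharded table with the fewest partitions."""
    return min(partition_counts)


# A 3-node cluster expected to grow at most fourfold.
partitions = initial_partitions(current_nodes=3, growth_factor=4)

# With sharded tables of 12, 24 and 8 partitions, the smallest table
# sets the scaling limit for the whole cluster.
limit = max_cluster_size([12, 24, 8])
```

Here partitions evaluates to 12 and limit to 8: the table with the fewest partitions determines how far the cluster can grow.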
Clusters with a large number of nodes have a high probability of a single node failure. To protect data in case of a particular node failure, Shardman uses physical replication. To implement failover at the libpq connection level, each foreign server definition contains addresses of all servers in the appropriate replication group. A replication group is a Stolon cluster with one master and one or more replicas. Replication groups are organized in so-called clovers. A clover is a set of nodes where each node holds a PostgreSQL instance that is the master for one of the replication groups and PostgreSQL instances that are replicas for all the other replication groups. The total number of nodes in a clover is equal to the replication factor. Placing replicas on the same nodes as masters of other replication groups makes node resource utilization more efficient.
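The clover arrangement described above can be pictured as a simple mapping. The sketch below is illustrative only: the function clover_layout and the convention of labelling each replication group by the node that holds its master are assumptions made for the example, not Shardman data structures.

```python
def clover_layout(nodes: list[str]) -> dict[str, dict[str, object]]:
    """For each node of a clover, list the replication group it is the
    master for and the groups for which it holds a replica.

    Each replication group is labelled here by the node that holds its
    master (an assumption made for readability).
    """
    layout = {}
    for node in nodes:
        layout[node] = {
            "master_of": node,
            "replica_of": [other for other in nodes if other != node],
        }
    return layout


# A clover of three nodes corresponds to a replication factor of 3:
# every node runs one master and two replicas.
layout = clover_layout(["n1", "n2", "n3"])
instances_per_node = {
    node: 1 + len(roles["replica_of"]) for node, roles in layout.items()
}
```

In this example every node hosts three PostgreSQL instances, which matches the statement that the number of nodes in a clover equals the replication factor.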
Shardman cluster configuration is stored in
etcd. Shardman cluster services are organized
as systemd services. The Shardman configuration daemon
shardman-bowl monitors the cluster configuration
and manages Stolon clusters and
services. Each node has one bowl service, whose typical name
is shardman-bowl@CLUSTER_NAME.service.
Here CLUSTER_NAME is a Shardman cluster name,
cluster0 by default.
Each node has several Stolon services.
Each registered DBMS instance has an associated Stolon
keeper service that
directly manages this PostgreSQL instance. The name of the systemd service
for the Stolon keeper is
shardman-keeper@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME-KEEPER_ID.
Here CLOVER_ID is a unique clover
identifier (integer), NODE_NAME is the name of the node
with the primary DBMS instance for this replication group,
KEEPER_ID is the integer part of the keeper UID in the
Stolon cluster (full UID is
keeper_KEEPER_ID). The keeper starts, stops, initializes
and resyncs PostgreSQL instances according to
the desired Stolon cluster state.
Each registered DBMS instance has an associated Stolon
sentinel service.
For each replication group, Stolon sentinels
elect a leader among the existing sentinels. This leader
makes decisions about the desired cluster state (for example, which
keeper should become the new master when the existing one fails). When a
new master has to be selected in a replication group, the leader chooses the keeper
with the minimal lag. When all replicas are synchronous, the keeper with
the maximal priority is selected to become the new master even when the
master in the replication group is alive. Shardman
only uses synchronous replicas (otherwise, data could be lost when a node
fails). The keeper service for which the NODE_NAME part
of the systemd instance name matches the hostname of the node it is
running on always has higher priority (1)
than the other keepers (0). The name of the
systemd service for the Stolon sentinel is
shardman-sentinel@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME.
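The selection rules above can be summarized in a simplified sketch. It is not the Stolon sentinel implementation: the Keeper class, its fields, the lag unit, and the function choose_master are all hypothetical and only restate the rules given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keeper:
    uid: str                  # for example "keeper_0"
    node_name: str            # NODE_NAME part of the systemd instance name
    hostname: str             # host the keeper service runs on
    lag: int                  # replication lag (unit is an assumption)
    synchronous: bool = True

    @property
    def priority(self) -> int:
        # Priority is 1 when NODE_NAME matches the local hostname, else 0.
        return 1 if self.node_name == self.hostname else 0


def choose_master(candidates: list[Keeper]) -> Keeper:
    """Pick the keeper that should become the master."""
    if not candidates:
        raise ValueError("no candidate keepers")
    if all(k.synchronous for k in candidates):
        # All replicas are synchronous: the highest priority wins.
        return max(candidates, key=lambda k: k.priority)
    # Otherwise prefer the keeper with the minimal lag.
    return min(candidates, key=lambda k: k.lag)


home = Keeper("keeper_0", node_name="n1", hostname="n1", lag=0)
away = Keeper("keeper_1", node_name="n1", hostname="n2", lag=0)
chosen = choose_master([away, home])
```

With both keepers synchronous, chosen is the keeper running on n1, because its NODE_NAME matches its hostname and it therefore has priority 1.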
If the Shardman configuration parameter
UseProxy is set
to true,
each registered DBMS instance has an associated Stolon
proxy service.
It forwards queries to the current master in the replication group.
The name of the systemd service for the Stolon proxy
is shardman-proxy@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME.
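The naming patterns for the keeper, sentinel, and proxy services can be composed mechanically, which is handy when scripting around systemctl or journalctl. The helper functions below are hypothetical and only format the patterns given in this section.

```python
def keeper_unit(cluster: str, clover_id: int, node: str, keeper_id: int) -> str:
    """shardman-keeper@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME-KEEPER_ID"""
    return f"shardman-keeper@{cluster}-clover-{clover_id}-{node}-{keeper_id}"


def sentinel_unit(cluster: str, clover_id: int, node: str) -> str:
    """shardman-sentinel@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME"""
    return f"shardman-sentinel@{cluster}-clover-{clover_id}-{node}"


def proxy_unit(cluster: str, clover_id: int, node: str) -> str:
    """shardman-proxy@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME"""
    return f"shardman-proxy@{cluster}-clover-{clover_id}-{node}"


# Services of clover 1 whose primary DBMS instance is on node n1,
# in the default cluster cluster0.
units = [
    keeper_unit("cluster0", 1, "n1", 0),
    sentinel_unit("cluster0", 1, "n1"),
    proxy_unit("cluster0", 1, "n1"),
]
```

The first entry, shardman-keeper@cluster0-clover-1-n1-0, is the unit name of the keeper with the full Stolon UID keeper_0.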
The proxy service aborts current connections if the master changes
or if the connection to etcd is lost (in which case the information
about the cluster configuration is unreliable).
In addition to Stolon services and shardman-bowl, there are several shardman-monitor services per Shardman cluster. Several instances are necessary for fault tolerance and load distribution. shardman-monitor performs the following tasks:
Makes sure that each replication group is aware of the location of
the current master of each other replication group (when
Stolon proxy is not used)
and that postgres_fdw settings for foreign servers
can be updated on the running cluster according to FDWOptions
in clusterdata. See
sdmspec.json for the FDWOptions
format description.
Resolves the prepared distributed (2PC) transactions according to the transaction status on its coordinator.
Resolves distributed deadlocks by aborting one of the transactions involved in the deadlock.
Since Shardman services are organized as systemd
units, their logs are written to journald. You can use
journalctl to examine them. For example, to get all
logs since 2021-02-09 18:22 for the keeper service
clover-1-n1-0 on node n1 of cluster cluster0, you can
use the following command:
$ journalctl -u shardman-keeper@cluster0-clover-1-n1-0 --since '2021-02-09 18:22'
To control the log verbosity for all Shardman
services, set SDM_LOG_LEVEL in
the shardman-bowl configuration file.
The Shardman extension of
PostgreSQL provides DBMS functions
that are used internally by Shardman
services and that make up the user API.
For example, you can create a global table using the
make_table_global(relid)
function.
All extension functions are created in the shardman schema.
The Shardman extension also preprocesses all DBMS queries for coordinated modification of global cluster objects, for example, to provide SQL syntax for creating sharded tables or to deny unsafe operations on them.