2.2. Shardman Architecture

2.2.1. Cluster Structure
2.2.2. Cluster Services
2.2.3. Shardman Extension

The Shardman software comprises the following main components: a PostgreSQL fork, a modified Stolon PostgreSQL manager, the Shardman extension, and management utilities. This section describes how they interact in a Shardman cluster.

2.2.1. Cluster Structure

A Shardman cluster is composed of several PostgreSQL instances. Each instance can be used to connect to the cluster and at the same time holds parts of sharded tables or global tables. Sharded tables are ordinary PostgreSQL partitioned tables in which some partitions, making up the local shard, are regular local tables, while the remaining partitions are foreign tables accessed on remote servers via postgres_fdw. Global tables are available on all nodes of the cluster. Currently, a global table is a regular table exposed to the other servers as a foreign table; in the future, a global table is expected to become a set of synchronized regular tables.

The number of partitions in a sharded table is defined when it is created and cannot be changed afterwards. When new nodes are added to the cluster, some partitions are moved from existing nodes to the new ones to balance the load. To allow the cluster to scale, the initial number of partitions should therefore be high enough, but not excessive, since an extremely large number of partitions significantly slows down query planning. For example, if you expect the number of nodes in your cluster to grow at most fourfold, create sharded tables with 4 * N partitions, where N is the current number of nodes. A cluster can no longer scale once the number of cluster nodes reaches the smallest partition count among its sharded tables.
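The sizing rule above can be sketched as simple arithmetic; the node count and growth factor below are illustrative assumptions, not recommendations:

```shell
# Illustrative sizing of the initial partition count for a sharded table.
# Assumed values: 3 nodes today, cluster expected to grow at most 4x.
N=3          # current number of cluster nodes (assumption for this example)
GROWTH=4     # maximum expected growth factor (assumption for this example)
PARTITIONS=$((GROWTH * N))
echo "$PARTITIONS"   # prints 12
```

With 12 partitions, the cluster can grow to 12 nodes before it stops being able to rebalance this table.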

Clusters with many nodes have a high probability that some node fails. To protect data against the failure of a particular node, Shardman uses physical replication. To implement failover at the libpq connection level, each foreign server definition contains the addresses of all servers in the corresponding replication group. A replication group is a Stolon cluster with one master and one or more replicas. Replication groups are organized into so-called clovers. A clover is a set of nodes where each node runs a PostgreSQL instance that is the master of one replication group, plus PostgreSQL instances that are replicas for all the other replication groups of the clover. The total number of nodes in a clover equals the replication factor. Placing replicas on the same nodes as masters of other replication groups makes node resource utilization more efficient.
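The clover layout can be illustrated with a short sketch; the node names and the replication factor of 3 are assumptions made for this example only:

```shell
# A clover with replication factor 3: three nodes, each hosting the
# master of its own replication group (named rg-<node> here purely for
# illustration) and replicas of the other two groups.
NODES="n1 n2 n3"
for node in $NODES; do
  replicas=""
  for other in $NODES; do
    if [ "$other" != "$node" ]; then
      replicas="$replicas rg-$other"
    fi
  done
  echo "$node: master of rg-$node, replica of$replicas"
done
```

Each node thus carries exactly one master and replication-factor minus one replicas, so every node does comparable work.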

2.2.2. Cluster Services

The Shardman cluster configuration is stored in etcd. Shardman cluster services are organized as systemd services. The Shardman configuration daemon shardman-bowl monitors the cluster configuration and manages Stolon clusters and services. Each node runs one bowl service, typically named shardman-bowl@CLUSTER_NAME.service, where CLUSTER_NAME is the Shardman cluster name (cluster0 by default). In addition, each node runs several Stolon services, described below.

Each registered DBMS instance has an associated Stolon keeper service that directly manages this PostgreSQL instance. The name of the systemd service for the Stolon keeper is shardman-keeper@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME-KEEPER_ID. Here CLOVER_ID is a unique integer clover identifier, NODE_NAME is the name of the node running the primary DBMS instance of this replication group, and KEEPER_ID is the integer part of the keeper UID in the Stolon cluster (the full UID is keeper_KEEPER_ID). The keeper starts, stops, initializes, and resyncs PostgreSQL instances according to the desired Stolon cluster state.
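For example, the unit name for keeper 0 of clover 1 on node n1 in the default cluster can be assembled as follows (the component values are illustrative):

```shell
CLUSTER_NAME=cluster0   # Shardman cluster name (default)
CLOVER_ID=1             # unique integer clover identifier
NODE_NAME=n1            # node running the primary instance of this replication group
KEEPER_ID=0             # integer part of the keeper UID (full UID: keeper_0)
UNIT="shardman-keeper@${CLUSTER_NAME}-clover-${CLOVER_ID}-${NODE_NAME}-${KEEPER_ID}"
echo "$UNIT"   # prints shardman-keeper@cluster0-clover-1-n1-0
```

This is the unit name passed to systemctl or journalctl when managing or inspecting a specific keeper.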

Each registered DBMS instance has an associated Stolon sentinel service. For each replication group, the Stolon sentinels elect a leader among themselves. This leader makes decisions about the desired cluster state, for example, which keeper should become the new master when the current one fails. When a new master must be selected in a replication group, the leader chooses the keeper with the minimal replication lag. Shardman uses only synchronous replicas, since with asynchronous replication there is a chance of losing data when a node fails. Because all replicas are synchronous, the keeper with the maximal priority is selected to become the new master, even when the current master in the replication group is alive. The keeper service whose NODE_NAME part of the systemd instance name matches the hostname of the node it is running on always has a higher priority (1) than the other keepers (0). The name of the systemd service for the Stolon sentinel is shardman-sentinel@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME.

If the Shardman configuration parameter UseProxy is set to true, each registered DBMS instance also has an associated Stolon proxy service, which forwards queries to the current master of the replication group. The name of the systemd service for the Stolon proxy is shardman-proxy@CLUSTER_NAME-clover-CLOVER_ID-NODE_NAME. The proxy service aborts current connections if the master changes or if the connection to etcd is lost (in which case the information about the cluster configuration can no longer be trusted).

In addition to Stolon services and shardman-bowl, there are several shardman-monitor services per Shardman cluster. Several instances are necessary for fault tolerance and load distribution. shardman-monitor performs the following tasks:

  • Makes sure that each replication group knows the location of the current master of every other replication group (when the Stolon proxy is not used), and updates postgres_fdw settings for foreign servers on the running cluster according to FDWOptions in clusterdata. See sdmspec.json for the FDWOptions format description.

  • Resolves prepared distributed (2PC) transactions according to the transaction status on their coordinator.

  • Resolves distributed deadlocks by aborting one of the transactions involved in the deadlock.
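Although shardman-monitor resolves in-doubt transactions automatically, you can inspect them yourself through the standard PostgreSQL pg_prepared_xacts view; connection parameters are omitted here and depend on your setup:

```shell
# List prepared (2PC) transactions still awaiting commit or abort on the
# instance psql connects to; normally shardman-monitor resolves these.
psql -c "SELECT gid, prepared, owner, database FROM pg_prepared_xacts;"
```

An empty result means no distributed transactions are currently awaiting resolution on that instance.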

Since Shardman services are organized as systemd units, their logs are written to journald. You can use journalctl to examine them. For example, to get all logs since 2021-02-09 18:22 for the keeper service clover-1-n1-0 on node n1 of cluster cluster0, you can use the following command:

$ journalctl -u shardman-keeper@cluster0-clover-1-n1-0 --since '2021-02-09 18:22'

To control the log verbosity for all Shardman services, set SDM_LOG_LEVEL in the shardman-bowl configuration file.
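As a hypothetical sketch, assuming the shardman-bowl configuration uses an environment-file style format (the actual file location, format, and the accepted level names depend on your installation and must be checked against it), raising verbosity might look like:

```shell
# Hypothetical fragment of the shardman-bowl configuration file.
# The level name "debug" is an assumption for this example; consult
# your installation's documentation for the accepted values.
SDM_LOG_LEVEL=debug
```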

2.2.3. Shardman Extension

The Shardman extension for PostgreSQL provides DBMS functions that are used internally by Shardman services, as well as functions that make up the user API. For example, you can create a global table using the make_table_global(relid) function. All extension functions are created in the shardman schema.
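For instance, promoting an ordinary table to a global one might look like this; the table name is illustrative, and passing the name cast to regclass as the relid argument is an assumption about the function's expected input:

```shell
# Run against any cluster node; 'mytab' is a hypothetical table name.
psql -c "CREATE TABLE mytab (id integer PRIMARY KEY, payload text);"
psql -c "SELECT shardman.make_table_global('mytab'::regclass);"
```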

The Shardman extension also preprocesses all DBMS queries to coordinate modifications of global cluster objects, for example, to provide SQL syntax for creating sharded tables or to reject unsafe operations on them.