citus is an extension compatible with Postgres Pro that provides major capabilities, such as columnar data storage and a distributed OLAP database, which can be used either together or separately.
citus offers the following benefits:
Columnar storage with data compression.
The ability to scale your Postgres Pro installation to a distributed database cluster.
Row-based or schema-based sharding.
Parallelized DML operations across cluster nodes.
Reference tables, which can be accessed locally on each node.
The ability to execute DML queries on any node, which allows utilizing the full capacity of your cluster for distributed queries.
citus is incompatible with some Postgres Pro Enterprise features; take note of the following limitations when arranging your work with the extension:
citus cannot be used together with autonomous transactions.
With enable_self_join_removal set to on, query planning results in an error if the optimizer decides to make the query distributed. Otherwise, the optimizer may form an erroneous distributed query plan and produce incorrect query results. Therefore, it is recommended to set this parameter to off.
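For illustration, the recommended setting can be applied to the current session only or persisted cluster-wide; the following is a minimal sketch:

```sql
-- Disable self-join removal for the current session:
SET enable_self_join_removal TO off;

-- Or persist the setting cluster-wide and reload the configuration:
ALTER SYSTEM SET enable_self_join_removal TO off;
SELECT pg_reload_conf();
```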
Real-time query replanning and citus should not be used together. If they are, the EXPLAIN ANALYZE command may operate incorrectly.
citus cannot operate with standard_conforming_strings set to off. citus_columnar can, but to avoid errors, you must set this configuration parameter to on while executing the CREATE EXTENSION or ALTER EXTENSION UPDATE commands. After the installation or update is completed, you can change the parameter value back to off, if necessary; the extension will continue to operate correctly.
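As an illustration, a session that normally runs with standard_conforming_strings set to off could install citus_columnar as follows (a sketch, assuming the parameter may be changed at session level):

```sql
SET standard_conforming_strings TO on;   -- required during installation
CREATE EXTENSION citus_columnar;
SET standard_conforming_strings TO off;  -- restore the previous value if needed
```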
The citus extension is provided with Postgres Pro Enterprise as a separate pre-built package citus-ent-16. You can install either citus 12.1 or citus 13.0, depending on the repository you connect to. For detailed installation instructions, see Chapter 17. Once you have Postgres Pro Enterprise installed, follow the citus installation instructions below.
To enable citus on a single node, complete the following steps:
Add citus to the
shared_preload_libraries variable in the
postgresql.conf file:
shared_preload_libraries = 'citus'
If you want to use citus together with
other extensions, citus should be the first
on the list of shared_preload_libraries.
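For example, if you also preload pg_stat_statements (used here purely as an illustration), citus must come first:

```
shared_preload_libraries = 'citus,pg_stat_statements'
```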
Restart the database server for the changes to take effect, since shared_preload_libraries can only be set at server start. To verify that the citus library was installed correctly, you can run the following command:
SHOW shared_preload_libraries;
Create the citus extension using the following query:
CREATE EXTENSION citus;
The CREATE EXTENSION command in the procedure above
also installs the citus_columnar extension.
If you want to enable only
citus_columnar, complete the same steps but
specify citus_columnar instead.
To enable citus on multiple nodes, complete the following steps on all nodes:
Add citus to the
shared_preload_libraries variable in the
postgresql.conf file:
shared_preload_libraries = 'citus'
If you want to use citus together with
other extensions, citus should be the first
on the list of shared_preload_libraries.
Set up access permissions to the database server. By default, the
database server listens only to clients on localhost.
Set the
listen_addresses
configuration parameter to * to specify all
available IP interfaces.
Configure client authentication by editing the
pg_hba.conf
file.
Restart the database server for the changes to take effect, since shared_preload_libraries can only be set at server start. To verify that the citus library was installed correctly, you can run the following command:
SHOW shared_preload_libraries;
Create the citus extension using the following query:
CREATE EXTENSION citus;
When the above steps have been taken on all nodes, perform the actions below on the coordinator node for worker nodes to be able to connect to it:
Register the hostname that worker nodes use to connect to the coordinator node:
SELECT citus_set_coordinator_host('coordinator_name', coordinator_port);
Add each worker node:
SELECT * FROM citus_add_node('worker_name', worker_port);
Verify that worker nodes are set successfully:
SELECT * FROM citus_get_active_worker_nodes();
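Putting the coordinator-side steps together, a hypothetical two-worker cluster could be registered as follows (host names and ports are examples only):

```sql
SELECT citus_set_coordinator_host('coord.example.com', 5432);
SELECT * FROM citus_add_node('worker1.example.com', 5432);
SELECT * FROM citus_add_node('worker2.example.com', 5432);

-- Both workers should now appear as active:
SELECT * FROM citus_get_active_worker_nodes();
```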
To upgrade citus from version 12.1 to version 13.0, take the following steps:
Install the version 13.0 package.
Restart the Postgres Pro server.
Change the extension definition by executing
ALTER EXTENSION:
postgres=# ALTER EXTENSION citus UPDATE;
ALTER EXTENSION
Check the extension version after the upgrade:
postgres=# SELECT * from citus_version();
citus_version
------------------------------------------------------------------------------------------------------
Citus 13.0.3.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, 64-bit
(1 row)
Most B2B applications already have the notion of a tenant, customer, or account built into their data model. In this model, the database serves many tenants, each of whose data is separate from other tenants.
citus provides full SQL functionality for this workload and enables scaling out your relational database to more than 100,000 tenants. citus also adds new features for multi-tenancy. For example, citus supports tenant isolation to provide performance guarantees for large tenants, and has the concept of reference tables to reduce data duplication across tenants.
These capabilities allow you to scale out data of your tenants across many computers and add more CPU, memory, and disk resources. Further, sharing the same database schema across multiple tenants makes efficient use of hardware resources and simplifies database management.
citus offers the following advantages for multi-tenant applications:
Fast queries for all tenants.
Sharding logic in the database rather than the application.
Hold more data than is possible in single-node Postgres Pro.
Scale out while maintaining SQL functionality.
Maintain performance under high concurrency.
Fast metrics analysis across your customer base.
Scale to handle new customer sign-ups.
Isolate resource usage of large and small customers.
citus supports real-time queries over large datasets. Commonly these queries occur in rapidly growing event systems or systems with time series data. Example use cases include:
Analytic dashboards with sub-second response times.
Exploratory queries on unfolding events.
Large dataset archival and reporting.
Analyzing sessions with funnel, segmentation, and cohort queries.
citus parallelizes query execution and scales linearly with the number of worker databases in a cluster. Some advantages of citus for real-time applications are as follows:
Maintain sub-second responses as the dataset grows.
Analyze new events and new data in real time.
Parallelize SQL queries.
Scale out while maintaining SQL functionality.
Maintain performance under high concurrency.
Fast responses to dashboard queries.
Use one database rather than many on several nodes.
Rich Postgres Pro data types and extensions.
citus supports schema-based sharding, which allows distributing regular database schemas across many computers. This sharding model aligns well with a typical microservices architecture, where storage is fully owned by the service and hence cannot share a schema definition with other tenants.
Schema-based sharding is an easier model to adopt: create a new schema and set the search_path in your service.
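A minimal sketch of that model (schema and table names are illustrative only):

```sql
CREATE SCHEMA user_service;          -- one schema per service
SET search_path TO user_service;     -- the service now sees only its own tables
CREATE TABLE accounts (id bigint PRIMARY KEY, name text);
```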
Advantages of using citus for microservices:
Allows distributing horizontally scalable state across services.
Transfer strategic business data from microservices into common distributed tables for analytics.
Efficiently use hardware by balancing services on multiple computers.
Isolate noisy services to their own nodes.
Easy to understand sharding model.
Quick adoption.
citus extends Postgres Pro with distributed functionality, but it is not a drop-in replacement that scales out all workloads. A performant citus cluster involves thinking about the data model, tooling, and choice of SQL features used.
A good way to think about tools and SQL features is the following: if your workload aligns with use cases described here and you happen to run into an unsupported tool or query, then there is usually a good workaround.
Some workloads do not need a powerful distributed database, while others require a large flow of information between worker nodes. In the first case citus is unnecessary, and in the second it is generally not performant. Below are a few examples of when you do not need to use citus:
You do not expect your workload to ever grow beyond a single Postgres Pro Enterprise node.
Offline analytics, without the need for real-time data transfer or real-time queries.
Analytics apps that do not need to support a large number of concurrent users.
Queries that return data-heavy ETL results rather than summaries.
In this tutorial a sample ad analytics dataset is used to demonstrate how you can use citus to power your multi-tenant application.
This tutorial assumes that you already have citus installed and running. If not, consult the Installing citus on a Single Node section to set up the extension locally.
This section shows how to create a database for an ad analytics app, which can be used by companies to view, change, analyze, and manage their ads and campaigns (see an example app). Such an application has good characteristics of a typical multi-tenant system. Data from different tenants is stored in a central database, and each tenant has an isolated view of their own data.
Three Postgres Pro tables will be used to represent this data. To get started, download sample data for these tables:
curl https://examples.citusdata.com/tutorial/companies.csv > companies.csv
curl https://examples.citusdata.com/tutorial/campaigns.csv > campaigns.csv
curl https://examples.citusdata.com/tutorial/ads.csv > ads.csv
First connect to the citus coordinator using psql.
If you are using citus installed as
described in the
Installing citus on a Single Node
section, the coordinator node will be running on port
9700.
psql -p 9700
Create tables by using the standard
Postgres Pro CREATE TABLE
command:
CREATE TABLE companies (
id bigint NOT NULL,
name text NOT NULL,
image_url text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL
);
CREATE TABLE campaigns (
id bigint NOT NULL,
company_id bigint NOT NULL,
name text NOT NULL,
cost_model text NOT NULL,
state text NOT NULL,
monthly_budget bigint,
blacklisted_site_urls text[],
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL
);
CREATE TABLE ads (
id bigint NOT NULL,
company_id bigint NOT NULL,
campaign_id bigint NOT NULL,
name text NOT NULL,
image_url text,
target_url text,
impressions_count bigint DEFAULT 0,
clicks_count bigint DEFAULT 0,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL
);
Create primary key indexes on each of the tables just like you would do in Postgres Pro:
ALTER TABLE companies ADD PRIMARY KEY (id);
ALTER TABLE campaigns ADD PRIMARY KEY (id, company_id);
ALTER TABLE ads ADD PRIMARY KEY (id, company_id);
Now you can instruct citus to distribute the tables created above across the different nodes in the cluster. To do so, run the create_distributed_table function and specify the table you want to shard and the column you want to shard on. In the example below, the companies table is sharded on its id column, while the campaigns and ads tables are sharded on company_id, which references companies.id.
SELECT create_distributed_table('companies', 'id');
SELECT create_distributed_table('campaigns', 'company_id');
SELECT create_distributed_table('ads', 'company_id');
Sharding all tables on the company ID allows citus to co-locate the tables and to support features like primary keys, foreign keys, and complex joins across your cluster.
Then you can go ahead and load the downloaded data into the tables using
the standard psql \copy
command. Make sure that you specify the correct file path if you
downloaded the file to a different location.
\copy companies from 'companies.csv' with csv
\copy campaigns from 'campaigns.csv' with csv
\copy ads from 'ads.csv' with csv
After the data is loaded into the tables, you can run some queries.
citus supports standard
INSERT, UPDATE, and
DELETE commands for inserting and modifying rows in a
distributed table, which is the typical way of interaction for a
user-facing application.
For example, you can insert a new company by running:
INSERT INTO companies VALUES (5000, 'New Company', 'https://randomurl/image.png', now(), now());
If you want to double the budget for all campaigns of the company, run
the UPDATE command:
UPDATE campaigns SET monthly_budget = monthly_budget*2 WHERE company_id = 5;
Another example of such an operation is running transactions that span multiple tables. For example, you can delete a campaign and all its associated ads atomically by running:
BEGIN;
DELETE FROM campaigns WHERE id = 46 AND company_id = 5;
DELETE FROM ads WHERE campaign_id = 46 AND company_id = 5;
COMMIT;
Each statement in a transaction causes round-trips between the coordinator and workers in a multi-node citus cluster. For multi-tenant workloads, it is more efficient to run transactions in distributed functions. The efficiency gains become more apparent for larger transactions, but you can use the small transaction above as an example.
First create a function that does the deletions:
CREATE OR REPLACE FUNCTION delete_campaign(company_id int, campaign_id int)
RETURNS void LANGUAGE plpgsql AS $fn$
BEGIN
    DELETE FROM campaigns WHERE id = $2 AND campaigns.company_id = $1;
    DELETE FROM ads WHERE ads.campaign_id = $2 AND ads.company_id = $1;
END;
$fn$;
Next use the
create_distributed_function
function to instruct citus to call the
function directly on workers rather than on the coordinator (except on a
single-node citus installation, which runs
everything on the coordinator). It calls the function on whatever worker
holds the shards for the
ads and campaigns tables
corresponding to the company_id value.
SELECT create_distributed_function(
  'delete_campaign(int, int)', 'company_id',
  colocate_with := 'campaigns'
);

-- You can run the function as usual
SELECT delete_campaign(5, 46);
Besides transactional operations, you can also run analytics queries using standard SQL. One interesting query for a company to run is to see details about its campaigns with the largest budgets.
SELECT name, cost_model, state, monthly_budget
FROM campaigns
WHERE company_id = 5
ORDER BY monthly_budget DESC
LIMIT 10;
You can also run a join query across multiple tables to see information about running campaigns that receive the most clicks and impressions.
SELECT campaigns.id, campaigns.name, campaigns.monthly_budget,
sum(impressions_count) AS total_impressions, sum(clicks_count) AS total_clicks
FROM ads, campaigns
WHERE ads.company_id = campaigns.company_id
AND ads.campaign_id = campaigns.id
AND campaigns.company_id = 5
AND campaigns.state = 'running'
GROUP BY campaigns.id, campaigns.name, campaigns.monthly_budget
ORDER BY total_impressions, total_clicks;
The tutorial above shows how to use citus to power a simple multi-tenant application. As a next step, you can look at the Multi-Tenant Apps section to see how you can model your own data for multi-tenancy.
This tutorial demonstrates how to use citus to ingest events data and run analytical queries on that data in human real-time. A sample GitHub events dataset is used to this end in the example.
This tutorial assumes that you already have citus installed and running. If not, consult the Installing citus on a Single Node section to set up the extension locally.
This section shows how to create a database for a real-time analytics
application. This application will insert large volumes of events data
and enable analytical queries on that data with sub-second latencies. In
this example, the GitHub events dataset is used. This dataset includes
all public events on GitHub, such as commits,
forks, new issues, and
comments on these issues.
Two Postgres Pro tables are used to represent this data. To get started, download sample data for these tables:
curl https://examples.citusdata.com/tutorial/users.csv > users.csv
curl https://examples.citusdata.com/tutorial/events.csv > events.csv
To start, first connect to the citus coordinator using psql.
If you are using citus installed as
described in the
Installing citus on a Single Node
section, the coordinator node will be running on port
9700.
psql -p 9700
Then you can create the tables by using the standard
Postgres Pro CREATE TABLE
command:
CREATE TABLE github_events
(
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
user_id bigint,
org jsonb,
created_at timestamp
);
CREATE TABLE github_users
(
user_id bigint,
url text,
login text,
avatar_url text,
gravatar_id text,
display_login text
);
Next you can create indexes on events data just like you do in Postgres Pro. This example also shows how to create a GIN index to make querying on JSONB fields faster.
CREATE INDEX event_type_index ON github_events (event_type);
CREATE INDEX payload_index ON github_events USING GIN (payload jsonb_path_ops);
Now you can instruct citus to distribute the
tables created above across the nodes in the cluster. To do so, you can
call the
create_distributed_table
function and specify the table you want to shard and the column you want
to shard on. In the example below, all the tables are sharded on the
user_id column.
SELECT create_distributed_table('github_users', 'user_id');
SELECT create_distributed_table('github_events', 'user_id');
Sharding all tables on the user_id column allows
citus to
co-locate the tables
together and allows for efficient joins and distributed roll-ups.
Then you can go ahead and load the downloaded data into the tables using
the standard psql \copy
command. Make sure that you specify the correct file path if you
downloaded the file to a different location.
\copy github_users from 'users.csv' with csv
\copy github_events from 'events.csv' with csv
After the data is loaded into the tables, you can run some queries. First check how many users are contained in the distributed database.
SELECT count(*) FROM github_users;
Now analyze GitHub push events in the data. First
compute the number of commits per minute by using the
number of distinct commits in each
push event.
SELECT date_trunc('minute', created_at) AS minute,
sum((payload->>'distinct_size')::int) AS num_commits
FROM github_events
WHERE event_type = 'PushEvent'
GROUP BY minute
ORDER BY minute;
There is also a users table. You can join the users with events and find the top ten users who created the most repositories.
SELECT login, count(*)
FROM github_events ge
JOIN github_users gu
ON ge.user_id = gu.user_id
WHERE event_type = 'CreateEvent' AND payload @> '{"ref_type": "repository"}'
GROUP BY login
ORDER BY count(*) DESC LIMIT 10;
citus also supports standard
INSERT, UPDATE, and
DELETE commands for inserting and modifying data. For
example, you can update the user display login by running the following
command:
UPDATE github_users SET display_login = 'no1youknow' WHERE user_id = 24305673;
As a next step, you can look at the Real-Time Apps section to see how you can model your own data and power real-time analytical applications.
This tutorial shows how to use citus as the storage backend for multiple microservices and demonstrates a sample setup and basic operation of such a cluster.
This tutorial assumes that you already have citus installed and running. If not, consult the Installing citus on a Single Node section to set up the extension locally.
Distributed schemas are relocatable within a citus cluster. The system can rebalance them as a whole unit across the available nodes, which allows for efficient sharing of resources without manual allocation.
By design, microservices own their storage layer; we do not make any assumptions about the type of tables and data that they will create and store. We do, however, provide a schema for every service and assume that they use a distinct role to connect to the database. When a user connects, their role name is put at the beginning of the search_path, so if the role matches the schema name, you do not need any application changes to set the correct search_path.
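This works because the default search_path in Postgres Pro resolves "$user" first: the first element is the current role name, so a role named like its schema lands in that schema automatically. A quick way to confirm the active setting:

```sql
SHOW search_path;
-- typically: "$user", public
```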
Three services are used in the example:
user service
time service
ping service
To start, first connect to the citus coordinator using psql.
If you are using citus installed as
described in
the Installing citus on a Single Node
section, the coordinator node will be running on port
9700.
psql -p 9700
You can now create the database roles for every service:
CREATE USER user_service;
CREATE USER time_service;
CREATE USER ping_service;
There are two ways to distribute a schema in citus:
Manually, by calling the citus_schema_distribute('schema_name') function:
CREATE SCHEMA AUTHORIZATION user_service;
CREATE SCHEMA AUTHORIZATION time_service;
CREATE SCHEMA AUTHORIZATION ping_service;
SELECT citus_schema_distribute('user_service');
SELECT citus_schema_distribute('time_service');
SELECT citus_schema_distribute('ping_service');
This method also allows you to convert existing regular schemas into distributed schemas.
You can only distribute schemas that do not contain distributed and reference tables.
An alternative approach is to enable the citus.enable_schema_based_sharding configuration parameter:
SET citus.enable_schema_based_sharding TO ON;
CREATE SCHEMA AUTHORIZATION user_service;
CREATE SCHEMA AUTHORIZATION time_service;
CREATE SCHEMA AUTHORIZATION ping_service;
The parameter can be changed for the current session or permanently in the postgresql.conf file. With the parameter set to on, all created schemas are distributed by default.
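For instance, to persist the parameter without editing postgresql.conf by hand, you could use ALTER SYSTEM (a sketch; this writes the setting to postgresql.auto.conf):

```sql
ALTER SYSTEM SET citus.enable_schema_based_sharding TO on;
SELECT pg_reload_conf();
```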
You can list the currently distributed schemas:
SELECT * FROM citus_schemas;
 schema_name  | colocation_id | schema_size | schema_owner
--------------+---------------+-------------+--------------
 user_service |             5 | 0 bytes     | user_service
 time_service |             6 | 0 bytes     | time_service
 ping_service |             7 | 0 bytes     | ping_service
(3 rows)
You now need to connect to the citus coordinator as each microservice role and create its table. You can use the \c command to swap the user within an existing psql instance.
\c citus user_service
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL
);
\c citus time_service
CREATE TABLE query_details (
id SERIAL PRIMARY KEY,
ip_address INET NOT NULL,
query_time TIMESTAMP NOT NULL
);
\c citus ping_service
CREATE TABLE ping_results (
id SERIAL PRIMARY KEY,
host VARCHAR(255) NOT NULL,
result TEXT NOT NULL
);
For the purpose of this tutorial a very simple set of services is used. You can obtain them by cloning this public repository:
git clone https://github.com/citusdata/citus-example-microservices.git
The repository contains the ping,
time, and user services. All of
them have the app.py file, which we run.
$ tree
.
├── LICENSE
├── README.md
├── ping
│ ├── app.py
│ ├── ping.sql
│ └── requirements.txt
├── time
│ ├── app.py
│ ├── requirements.txt
│ └── time.sql
└── user
├── app.py
├── requirements.txt
└── user.sql
Before you run the services, however, edit the
user/app.py, ping/app.py,
and time/app.py files providing the
connection configuration
for your citus cluster:
# Database configuration
db_config = {
'host': 'localhost',
'database': 'citus',
'user': 'ping_service',
'port': 9700
}
After making the changes save all modified files and move on to the next step of running the services.
Change into each app directory and run it in its own Python environment.
cd user
pipenv install
pipenv shell
python app.py
Repeat the above for the time and ping services, after which you can use the API.
Create some users:
curl -X POST -H "Content-Type: application/json" -d '[
{"name": "John Doe", "email": "john@example.com"},
{"name": "Jane Smith", "email": "jane@example.com"},
{"name": "Mike Johnson", "email": "mike@example.com"},
{"name": "Emily Davis", "email": "emily@example.com"},
{"name": "David Wilson", "email": "david@example.com"},
{"name": "Sarah Thompson", "email": "sarah@example.com"},
{"name": "Alex Miller", "email": "alex@example.com"},
{"name": "Olivia Anderson", "email": "olivia@example.com"},
{"name": "Daniel Martin", "email": "daniel@example.com"},
{"name": "Sophia White", "email": "sophia@example.com"}
]' http://localhost:5000/users
List the created users:
curl http://localhost:5000/users
Get the current time:
curl http://localhost:5001/current_time
Run the ping against example.com:
curl -X POST -H "Content-Type: application/json" -d '{"host": "example.com"}' http://localhost:5002/ping
Now that we have called some API functions, data has been stored, and we can check whether the citus_schemas view reflects what we expect:
SELECT * FROM citus_schemas;
 schema_name  | colocation_id | schema_size | schema_owner
--------------+---------------+-------------+--------------
 user_service |             1 | 112 kB      | user_service
 time_service |             2 | 32 kB       | time_service
 ping_service |             3 | 32 kB       | ping_service
(3 rows)
When the schemas are created, you do not instruct citus on which computer to create them; this is done automatically. Execute the following query to see where each schema resides:
SELECT nodename, nodeport, table_name, pg_size_pretty(sum(shard_size))
FROM citus_shards
GROUP BY nodename, nodeport, table_name;
 nodename  | nodeport |         table_name         | pg_size_pretty
-----------+----------+----------------------------+----------------
 localhost |     9701 | time_service.query_details | 32 kB
 localhost |     9702 | user_service.users         | 112 kB
 localhost |     9702 | ping_service.ping_results  | 32 kB
We can see that the time service landed on node
localhost:9701, while the user and
ping services share space on the second worker
localhost:9702. This is only an example, and the data
sizes here can be ignored, but let us assume that we are annoyed by the
uneven storage space utilization between the nodes. It makes more sense
to have the two smaller time and
ping services reside on one computer, while the large
user service resides alone.
We can do this by instructing citus to rebalance the cluster by disk size:
SELECT citus_rebalance_start();
NOTICE: Scheduled 1 moves as job 1
DETAIL: Rebalance scheduled as background job
HINT: To monitor progress, run: SELECT * FROM citus_rebalance_status();
citus_rebalance_start
-----------------------
1
(1 row)
When done, check how the new layout looks:
SELECT nodename, nodeport, table_name, pg_size_pretty(sum(shard_size))
FROM citus_shards
GROUP BY nodename, nodeport, table_name;
 nodename  | nodeport |         table_name         | pg_size_pretty
-----------+----------+----------------------------+----------------
 localhost |     9701 | time_service.query_details | 32 kB
 localhost |     9701 | ping_service.ping_results  | 32 kB
 localhost |     9702 | user_service.users         | 112 kB
(3 rows)
We expect that the schemas have been moved and the cluster has become more balanced. This operation is transparent to the applications; there is no need for a restart, and they continue serving queries.
If you are building a Software-as-a-service (SaaS) application, you probably already have the notion of tenancy built into your data model. Typically, most information relates to tenants/customers/accounts and the database tables capture this natural relation.
For SaaS applications, each tenant's data can be stored together in a single database instance and kept isolated from and invisible to other tenants. This is efficient in three ways. First, application improvements apply to all clients. Second, sharing a database between tenants uses hardware efficiently. Last, it is much simpler to manage a single database for all tenants than a different database server for each tenant.
However, a single relational database instance has traditionally had trouble scaling to the volume of data needed for a large multi-tenant application. Developers were forced to relinquish the benefits of the relational model when data exceeded the capacity of a single database node.
The citus extension allows users to write multi-tenant applications as if they are connecting to a single Postgres Pro database, when in fact the database is a horizontally scalable cluster of computers. Client code requires minimal modifications and can continue to use full SQL capabilities.
This guide takes a sample multi-tenant application and describes how to model it for scalability with citus. Along the way typical challenges for multi-tenant applications are examined like isolating tenants from noisy neighbors, scaling hardware to accommodate more data, and storing data that differs across tenants. Postgres Pro and citus provide all the tools needed to handle these challenges, so let's get building.
We will build the back-end for an application that tracks online advertising performance and provides an analytics dashboard on top. It is a natural fit for a multi-tenant application because user requests for data concern one company (their own) at a time. Code for the full example application is available on GitHub.
Let's start by considering a simplified schema for this application. The application must keep track of multiple companies, each of which runs advertising campaigns. Campaigns have many ads, and each ad has associated records of its clicks and impressions.
Here is the example schema. We will make some minor changes later, which allow us to effectively distribute and isolate the data in a distributed environment.
CREATE TABLE companies (
  id bigserial PRIMARY KEY,
  name text NOT NULL,
  image_url text,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL
);

CREATE TABLE campaigns (
  id bigserial PRIMARY KEY,
  company_id bigint REFERENCES companies (id),
  name text NOT NULL,
  cost_model text NOT NULL,
  state text NOT NULL,
  monthly_budget bigint,
  blacklisted_site_urls text[],
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL
);

CREATE TABLE ads (
  id bigserial PRIMARY KEY,
  campaign_id bigint REFERENCES campaigns (id),
  name text NOT NULL,
  image_url text,
  target_url text,
  impressions_count bigint DEFAULT 0,
  clicks_count bigint DEFAULT 0,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL
);

CREATE TABLE clicks (
  id bigserial PRIMARY KEY,
  ad_id bigint REFERENCES ads (id),
  clicked_at timestamp without time zone NOT NULL,
  site_url text NOT NULL,
  cost_per_click_usd numeric(20,10),
  user_ip inet NOT NULL,
  user_data jsonb NOT NULL
);

CREATE TABLE impressions (
  id bigserial PRIMARY KEY,
  ad_id bigint REFERENCES ads (id),
  seen_at timestamp without time zone NOT NULL,
  site_url text NOT NULL,
  cost_per_impression_usd numeric(20,10),
  user_ip inet NOT NULL,
  user_data jsonb NOT NULL
);
There are modifications we can make to the schema that will give it a performance boost in a distributed environment like citus. To see how, we must become familiar with how the extension distributes data and executes queries.
The relational data model is great for applications. It protects data integrity, allows flexible queries, and accommodates changing data. Traditionally, the only problem was that relational databases were not considered capable of scaling to the workloads needed for big SaaS applications. Developers had to turn to NoSQL databases, or a collection of backend services, to reach that size.
With citus you can keep your data model and make it scale. The extension appears to applications as a single Postgres Pro database, but it internally routes queries to an adjustable number of physical servers (nodes), which can process requests in parallel.
Multi-tenant applications have a nice property that we can take advantage of: queries almost always request information for one tenant at a time, not a mix of tenants. For instance, when a salesperson is searching prospect information in a CRM, the search results are specific to their employer; other businesses' leads and notes are not included.
Because application queries are restricted to a single tenant, such as a store or company, one approach for making multi-tenant application queries fast is to store all data for a given tenant on the same node. This minimizes network overhead between the nodes and allows citus to support all your application's joins, key constraints and transactions efficiently. With this, you can scale across multiple nodes without having to totally re-write or re-architect your application. See the figure below to learn more.
Figure J.1. Multi-Tenant Ad Routing Diagram
This can be done in citus by making sure
every table in our schema has a column to clearly mark which tenant owns
which rows. In the ad analytics application the tenants are companies,
so we must ensure all tables have a company_id column.
We can tell citus to use this column to read
and write rows to the same node when the rows are marked for the same
company. In citus terminology
company_id is the
distribution column, which you can learn more about
in the
Choosing Distribution Column
section.
In the previous section we identified the correct distribution column
for our multi-tenant application: company_id. Even in
a single-computer database it can be useful to denormalize tables with
the addition of company_id, whether it be for
row-level security or for additional indexing. The extra benefit, as we
saw, is that including the extra column helps for multi-machine scaling
as well.
The schema we have created so far uses a separate id
column as primary key for each table. citus
requires that primary and foreign key constraints include the
distribution column. This requirement makes enforcing these constraints
much more efficient in a distributed environment as only a single node
has to be checked to guarantee them.
In SQL, this requirement translates to making primary and foreign keys
composite by including company_id. This is compatible
with the multi-tenant case because what we really need there is to ensure
uniqueness on a per-tenant basis.
Putting it all together, here are the changes that prepare the tables
for distribution by company_id.
CREATE TABLE companies (
id bigserial PRIMARY KEY,
name text NOT NULL,
image_url text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL
);
CREATE TABLE campaigns (
id bigserial, -- was: PRIMARY KEY
company_id bigint REFERENCES companies (id),
name text NOT NULL,
cost_model text NOT NULL,
state text NOT NULL,
monthly_budget bigint,
blacklisted_site_urls text[],
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
PRIMARY KEY (company_id, id) -- added
);
CREATE TABLE ads (
id bigserial, -- was: PRIMARY KEY
company_id bigint, -- added
campaign_id bigint, -- was: REFERENCES campaigns (id)
name text NOT NULL,
image_url text,
target_url text,
impressions_count bigint DEFAULT 0,
clicks_count bigint DEFAULT 0,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
PRIMARY KEY (company_id, id), -- added
FOREIGN KEY (company_id, campaign_id) -- added
REFERENCES campaigns (company_id, id)
);
CREATE TABLE clicks (
id bigserial, -- was: PRIMARY KEY
company_id bigint, -- added
ad_id bigint, -- was: REFERENCES ads (id)
clicked_at timestamp without time zone NOT NULL,
site_url text NOT NULL,
cost_per_click_usd numeric(20,10),
user_ip inet NOT NULL,
user_data jsonb NOT NULL,
PRIMARY KEY (company_id, id), -- added
FOREIGN KEY (company_id, ad_id) -- added
REFERENCES ads (company_id, id)
);
CREATE TABLE impressions (
id bigserial, -- was: PRIMARY KEY
company_id bigint, -- added
ad_id bigint, -- was: REFERENCES ads (id)
seen_at timestamp without time zone NOT NULL,
site_url text NOT NULL,
cost_per_impression_usd numeric(20,10),
user_ip inet NOT NULL,
user_data jsonb NOT NULL,
PRIMARY KEY (company_id, id), -- added
FOREIGN KEY (company_id, ad_id) -- added
REFERENCES ads (company_id, id)
);
You can learn more about migrating your own data model in the Identify Distribution Strategy section.
This guide is designed so you can follow along in your own citus database. This tutorial assumes that you already have the extension installed and running. If not, consult the Installing citus on a Single Node section to set up the extension locally.
At this point feel free to follow along in your own citus cluster by downloading and executing the SQL to create the schema. Once the schema is ready, we can tell citus to create shards on the workers. From the coordinator node run:
SELECT create_distributed_table('companies', 'id');
SELECT create_distributed_table('campaigns', 'company_id');
SELECT create_distributed_table('ads', 'company_id');
SELECT create_distributed_table('clicks', 'company_id');
SELECT create_distributed_table('impressions', 'company_id');
The create_distributed_table function informs citus that a table should be distributed among nodes and that future incoming queries to those tables should be planned for distributed execution. The function also creates shards for the table on worker nodes, which are low-level units of data storage citus uses to assign data to nodes.
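After distributing the tables, you can inspect the resulting shards and their placements from the coordinator. This is a sketch, assuming your citus version provides the citus_columnar metadata view citus_shards; shard IDs, sizes, and node names will differ in your cluster:

```sql
-- List the shards created for the companies table
SELECT table_name, shardid, nodename, nodeport
FROM citus_shards
WHERE table_name::text = 'companies'
ORDER BY shardid;
```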
The next step is loading sample data into the cluster from the command line:
# Download datasets from the shell
for dataset in companies campaigns ads clicks impressions geo_ips; do
curl -O https://examples.citusdata.com/mt_ref_arch/${dataset}.csv
done
Being an extension of Postgres Pro,
citus supports bulk loading with the
\copy command. Use it to ingest the data you
downloaded, making sure to specify the correct file path if
you downloaded the files to some other location. Back inside
psql run this:
\copy companies from 'companies.csv' with csv
\copy campaigns from 'campaigns.csv' with csv
\copy ads from 'ads.csv' with csv
\copy clicks from 'clicks.csv' with csv
\copy impressions from 'impressions.csv' with csv
Once you have made the slight schema modification outlined earlier, your application can scale with very little work. You will just connect the app to citus and let the database take care of keeping the queries fast and the data safe.
Any application queries or update statements that include a filter on
company_id will continue to work exactly as they are.
As mentioned earlier, this kind of filter is common in multi-tenant apps.
When using an Object-Relational Mapper (ORM) you can recognize these
queries by methods such as where or
filter.
ActiveRecord:
Impression.where(company_id: 5).count
Django:
Impression.objects.filter(company_id=5).count()
Basically when the resulting SQL executed in the database contains a
WHERE company_id = :value clause on every table
(including tables in JOIN queries), then
citus will recognize that the query should be
routed to a single node and execute it there as it is. This makes sure
that all SQL functionality is available. The node is an ordinary
Postgres Pro server after all.
Also, to make it even simpler, you can use our
activerecord-multi-tenant
library for Ruby on Rails, or
django-multitenant
for Django, which will automatically add these filters to all your queries,
even the complicated ones. Check out our migration guides for
Ruby on Rails
and
Django.
This guide is framework-agnostic, so we will point out some citus features using SQL. Use your imagination for how these statements would be expressed in your language of choice.
Here is a simple query and update operating on a single tenant.
-- Campaigns with highest budget
SELECT name, cost_model, state, monthly_budget
FROM campaigns
WHERE company_id = 5
ORDER BY monthly_budget DESC
LIMIT 10;

-- Double the budgets!
UPDATE campaigns
SET monthly_budget = monthly_budget*2
WHERE company_id = 5;
A common pain point for users scaling applications with NoSQL databases is the lack of transactions and joins. However, transactions work as you would expect them to in citus:
-- Transactionally reallocate campaign budget money
BEGIN;

UPDATE campaigns
SET monthly_budget = monthly_budget + 1000
WHERE company_id = 5 AND id = 40;

UPDATE campaigns
SET monthly_budget = monthly_budget - 1000
WHERE company_id = 5 AND id = 41;

COMMIT;
As a final demo of SQL support, we have a query that includes aggregates and window functions and it works the same in citus as it does in Postgres Pro. The query ranks the ads in each campaign by the count of their impressions.
SELECT a.campaign_id,
RANK() OVER (
PARTITION BY a.campaign_id
ORDER BY a.campaign_id, count(*) desc
), count(*) as n_impressions, a.id
FROM ads as a
JOIN impressions as i
ON i.company_id = a.company_id
AND i.ad_id = a.id
WHERE a.company_id = 5
GROUP BY a.campaign_id, a.id
ORDER BY a.campaign_id, n_impressions desc;
In short, when queries are scoped to a tenant then the
INSERT, UPDATE, DELETE,
complex SQL commands, and transactions all work as expected.
Up until now all tables have been distributed by
company_id, but sometimes there is data that can be
shared by all tenants and does not “belong” to any tenant in
particular. For instance, all companies using this example ad platform
might want to get geographical information for their audience based on IP
addresses. In a single computer database this could be accomplished by a
lookup table for geo-ip, like the following.
(A real table would probably use PostGIS, but
bear with the simplified example.)
CREATE TABLE geo_ips (
addrs cidr NOT NULL PRIMARY KEY,
latlon point NOT NULL
CHECK (-90 <= latlon[0] AND latlon[0] <= 90 AND
-180 <= latlon[1] AND latlon[1] <= 180)
);
CREATE INDEX ON geo_ips USING gist (addrs inet_ops);
To use this table efficiently in a distributed setup, we need to find a
way to co-locate the geo_ips table with clicks for
not just one but every company. That way, no network traffic need be
incurred at query time. This can be done in citus
by designating geo_ips as a
reference table.
-- Make synchronized copies of geo_ips on all workers
SELECT create_reference_table('geo_ips');
Reference tables are replicated across all worker nodes, and citus automatically keeps them in sync during modifications. Notice that we call the create_reference_table function rather than the create_distributed_table function.
Now that geo_ips is established as a reference table,
load it with example data:
\copy geo_ips from 'geo_ips.csv' with csv
Now joining clicks with this table can execute efficiently. We can ask,
for example, the locations of everyone who clicked on ad
290.
SELECT c.id, clicked_at, latlon
FROM geo_ips, clicks c
WHERE addrs >> c.user_ip
  AND c.company_id = 5
  AND c.ad_id = 290;
Another challenge with multi-tenant systems is keeping the schemas for all the tenants in sync. Any schema change needs to be consistently reflected across all the tenants. In citus, you can simply use standard Postgres Pro DDL commands to change the schema of your tables, and the extension will propagate them from the coordinator node to the workers using a two-phase commit protocol.
For example, the advertisements in this application could use a text caption. We can add a column to the table by issuing the standard SQL on the coordinator:
ALTER TABLE ads ADD COLUMN caption text;
This updates all the workers as well. Once this command finishes, the
citus cluster will accept queries that read
or write data in the new caption column.
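Index creation is propagated the same way. For instance, the new column could be indexed with a single statement on the coordinator (the index name here is illustrative, not part of the original schema):

```sql
-- citus propagates the index to every shard on the workers
CREATE INDEX ads_caption_idx ON ads (caption);
```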
For a fuller explanation of how DDL commands propagate through the cluster, see the Modifying Tables section.
Given that all tenants share a common schema and hardware infrastructure, how can we accommodate tenants that want to store information not needed by others? For example, one of the tenant applications using our advertising database may want to store tracking cookie information with clicks, whereas another tenant may care about browser agents. Traditionally, databases using a shared schema approach for multi-tenancy have resorted to creating a fixed number of pre-allocated “custom” columns, or having external “extension tables”. However, Postgres Pro provides a much easier way with its unstructured column types, notably JSONB.
Notice that our schema already has a JSONB field in clicks
called user_data. Each tenant can use it for flexible
storage.
Suppose company five includes information in the field to track whether the user is on a mobile device. The company can query to find who clicks more, mobile or traditional visitors:
SELECT user_data->>'is_mobile' AS is_mobile,
       count(*) AS count
FROM clicks
WHERE company_id = 5
GROUP BY user_data->>'is_mobile'
ORDER BY count DESC;
The database administrator can even create a
partial index to improve speed
for an individual tenant's query patterns. Here is one to improve filters
for clicks of the company with company_id = 5 from
users on mobile devices:
CREATE INDEX click_user_data_is_mobile ON clicks ((user_data->>'is_mobile')) WHERE company_id = 5;
Additionally, Postgres Pro supports
GIN indices on JSONB. Creating a GIN
index on a JSONB column will create an index on every key and value
within that JSON document. This speeds up a number of
JSONB operators such as
?, ?|, and ?&.
CREATE INDEX click_user_data ON clicks USING gin (user_data);

-- This speeds up queries like, "which clicks have
-- the is_mobile key present in user_data?"
SELECT id
FROM clicks
WHERE user_data ? 'is_mobile'
  AND company_id = 5;
Multi-tenant databases should be designed for future scale as business grows or tenants want to store more data. citus can scale out easily by adding new computers without having to make any changes or take application downtime.
Being able to rebalance data in the citus cluster allows you to grow your data size or number of customers and improve performance on demand. Adding new computers allows you to keep data in memory even when it is much larger than what a single computer can store.
Also, if data increases for only a few large tenants, then you can isolate those particular tenants to separate nodes for better performance.
To scale out your citus cluster, first add a new worker node to it with the citus_add_node function.
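For example, with a hypothetical hostname and the default Postgres Pro port:

```sql
-- 'new-worker-hostname' is a placeholder for your actual worker host
SELECT citus_add_node('new-worker-hostname', 5432);
```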
Once you add the node it is available in the system. However, at this point no tenants are stored on it and citus will not yet run any queries there. To move your existing data, you can ask citus to rebalance the data. This operation moves bundles of rows called shards between the currently active nodes to attempt to equalize the amount of data on each node.
SELECT citus_rebalance_start();
Applications do not need to undergo downtime during shard rebalancing. Read requests continue seamlessly, and writes are blocked only for those shards that are currently being moved.
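While a rebalance is running, you can monitor its progress from another session. This is a sketch, assuming your citus version provides the citus_rebalance_status function:

```sql
-- Shows the state of the current or most recent rebalance operation
SELECT * FROM citus_rebalance_status();
```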
The previous section describes a general-purpose way to scale a cluster as the number of tenants increases. However, users often have two questions. The first is what will happen to their largest tenant if it grows too big. The second is what are the performance implications of hosting a large tenant together with small ones on a single worker node.
Regarding the first question, investigating data from large SaaS sites reveals that as the number of tenants increases, the size of tenant data typically tends to follow a Zipfian distribution. See the figure below to learn more.
Figure J.2. Zipfian Distribution
For instance, in a database of 100 tenants, the largest is predicted to account for about 20% of the data. In a more realistic example for a large SaaS company, if there are 10,000 tenants, the largest will account for around 2% of the data. Even at 10TB of data, the largest tenant will require 200GB, which can pretty easily fit on a single node.
Another question is regarding performance when large and small tenants are on the same node. Standard shard rebalancing will improve overall performance but it may or may not improve the mixing of large and small tenants. The rebalancer simply distributes shards to equalize storage usage on nodes, without examining which tenants are allocated on each shard.
To improve resource allocation and guarantee tenant QoS, it is worthwhile to move large tenants to dedicated nodes. The citus extension provides the tools to do this.
In our case, let's imagine that the company with
company_id=5 is very large. We can isolate the data
for this tenant in two steps. We will present the commands here, and you
can consult the Tenant Isolation
section to learn more about them.
First isolate the tenant's data to a dedicated shard suitable to move.
The CASCADE option also applies this change to the
rest of our tables distributed by company_id.
SELECT isolate_tenant_to_new_shard( 'companies', 5, 'CASCADE' );
The output is the shard ID dedicated to hold company_id=5:
┌─────────────────────────────┐ │ isolate_tenant_to_new_shard │ ├─────────────────────────────┤ │ 102240 │ └─────────────────────────────┘
Next we move the data across the network to a new dedicated node. Create a new node as described in the previous section. Take note of its hostname.
-- Find the node currently holding the new shard
SELECT nodename, nodeport
FROM pg_dist_placement AS placement,
pg_dist_node AS node
WHERE placement.groupid = node.groupid
AND node.noderole = 'primary'
AND shardid = 102240;
-- Move the shard to your choice of worker (it will also move the
-- other shards created with the CASCADE option)
-- Note that you should set wal_level for all nodes to be >= logical
-- to use citus_move_shard_placement
-- You also need to restart your cluster after setting wal_level in
-- postgresql.conf files
SELECT citus_move_shard_placement(
102240,
'source_host', source_port,
'dest_host', dest_port);
You can confirm the shard movement by querying the pg_dist_placement table again.
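For example, re-running the earlier placement query should now report the destination node for shard 102240:

```sql
-- Verify where the moved shard now resides
SELECT nodename, nodeport
FROM pg_dist_placement AS placement,
     pg_dist_node AS node
WHERE placement.groupid = node.groupid
  AND node.noderole = 'primary'
  AND shardid = 102240;
```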
With this, you now know how to use citus to power your multi-tenant application for scalability. If you have an existing schema and want to migrate it for citus, see the Migrating an Existing App section.
To adapt a front-end application built with Ruby on Rails or Django, read the corresponding Ruby on Rails or Django migration guide.
citus provides real-time queries over large datasets. One workload we commonly see at citus involves powering real-time dashboards of event data.
For example, you could be a cloud services provider helping other businesses monitor their HTTP traffic. Every time one of your clients receives an HTTP request, your service receives a log record. You want to ingest all those records and create an HTTP analytics dashboard that gives your clients insights, such as the number of HTTP errors their sites served. It is important that this data shows up with as little latency as possible so your clients can fix problems with their sites. It is also important for the dashboard to show graphs of historical trends.
Alternatively, maybe you are building an advertising network and want to show clients clickthrough rates on their campaigns. In this example latency is also critical, raw data volume is also high, and both historical and live data are important.
In this section we will demonstrate how to build part of the first example, but this architecture would work equally well for the second and many other use cases.
The data we are dealing with is an immutable stream of log data. We will insert directly into citus but it is also common for this data to first be routed through something like Kafka. Doing so has the usual advantages, and makes it easier to pre-aggregate the data once data volumes become unmanageably high.
We will use a simple schema for ingesting HTTP event data. This schema serves as an example to demonstrate the overall architecture; a real system might use additional columns.
-- This is run on the coordinator
CREATE TABLE http_request (
site_id INT,
ingest_time TIMESTAMPTZ DEFAULT now(),
url TEXT,
request_country TEXT,
ip_address TEXT,
status_code INT,
response_time_msec INT
);
SELECT create_distributed_table('http_request', 'site_id');
When we call the
create_distributed_table
function we ask citus to hash-distribute
http_request using the site_id
column. That means all the data for a particular site will live in the
same shard.
The user-defined functions use the default configuration value for shard count. We recommend using 2-4x as many shards as CPU cores in your cluster. Using this many shards lets you rebalance data across your cluster after adding new worker nodes.
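If the default does not fit your cluster, the shard count can be adjusted through the citus.shard_count configuration parameter. It takes effect only for tables distributed afterwards, so it must be set before calling the create_distributed_table function. The value below is illustrative:

```sql
-- e.g. 32 shards for a cluster with 8-16 CPU cores;
-- set this before distributing the table
SET citus.shard_count = 32;
```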
With this, the system is ready to accept data and serve queries. Keep the following loop running in a psql console in the background while you continue with the other commands in this article. It generates fake data every second or two.
DO $$
BEGIN LOOP
INSERT INTO http_request (
site_id, ingest_time, url, request_country,
ip_address, status_code, response_time_msec
) VALUES (
trunc(random()*32), clock_timestamp(),
concat('http://example.com/', md5(random()::text)),
('{China,India,USA,Indonesia}'::text[])[ceil(random()*4)],
concat(
trunc(random()*250 + 2), '.',
trunc(random()*250 + 2), '.',
trunc(random()*250 + 2), '.',
trunc(random()*250 + 2)
)::inet,
('{200,404}'::int[])[ceil(random()*2)],
5+trunc(random()*150)
);
COMMIT;
PERFORM pg_sleep(random() * 0.25);
END LOOP;
END $$;
Once you are ingesting data, you can run dashboard queries such as:
SELECT
site_id,
date_trunc('minute', ingest_time) as minute,
COUNT(1) AS request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
FROM http_request
WHERE date_trunc('minute', ingest_time) > now() - '5 minutes'::interval
GROUP BY site_id, minute
ORDER BY minute ASC;
The setup described above works but has two drawbacks:
Your HTTP analytics dashboard must go over each row every time it needs to generate a graph. For example, if your clients are interested in trends over the past year, your queries will aggregate every row for the past year from scratch.
Your storage costs will grow proportionally with the ingest rate and the length of the queryable history. In practice, you may want to keep raw events for a shorter period of time (one month) and look at historical graphs over a longer time period (years).
You can overcome both drawbacks by rolling up the raw data into a pre-aggregated form. Here, we will aggregate the raw data into a table, which stores summaries of 1-minute intervals. In a production system, you would probably also want something like 1-hour and 1-day intervals; these each correspond to zoom levels in the dashboard. When the user wants request times for the last month, the dashboard can simply read and chart the values for each of the last 30 days.
CREATE TABLE http_request_1min (
site_id INT,
ingest_time TIMESTAMPTZ, -- which minute this row represents
error_count INT,
success_count INT,
request_count INT,
average_response_time_msec INT,
CHECK (request_count = error_count + success_count),
CHECK (ingest_time = date_trunc('minute', ingest_time))
);
SELECT create_distributed_table('http_request_1min', 'site_id');
CREATE INDEX http_request_1min_idx ON http_request_1min (site_id, ingest_time);
This looks a lot like the previous code block. Most importantly: It also
shards on site_id and uses the same default
configuration for shard count. Because all three of those match, there
is a 1-to-1 correspondence between http_request
shards and http_request_1min shards, and
citus will place matching shards on the same
worker. This is called co-location;
it makes queries such as joins faster and our rollups possible. See the
figure below to learn
more.
Figure J.3. Co-location Diagram
In order to populate http_request_1min we are going to
periodically run INSERT INTO SELECT. This is
possible because the tables are co-located. The following function wraps
the rollup query up for convenience.
-- Single-row table to store when we rolled up last
CREATE TABLE latest_rollup (
minute timestamptz PRIMARY KEY,
-- "minute" should be no more precise than a minute
CHECK (minute = date_trunc('minute', minute))
);
-- Initialize to a time long ago
INSERT INTO latest_rollup VALUES ('10-10-1901');
-- Function to do the rollup
CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
curr_rollup_time timestamptz := date_trunc('minute', now() - interval '1 minute');
last_rollup_time timestamptz := minute from latest_rollup; -- implicit "SELECT minute FROM latest_rollup"
BEGIN
INSERT INTO http_request_1min (
site_id, ingest_time, request_count,
success_count, error_count, average_response_time_msec
) SELECT
site_id,
date_trunc('minute', ingest_time),
COUNT(1) as request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
FROM http_request
-- Roll up only data new since last_rollup_time
WHERE ingest_time <@ tstzrange(last_rollup_time, curr_rollup_time, '(]')
GROUP BY 1, 2;
-- Update the value in latest_rollup so that next time we run the
-- rollup it will operate on data newer than curr_rollup_time
UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;
The above function should be called every minute. You could do this by
adding a crontab entry on the coordinator node:
* * * * * psql -c 'SELECT rollup_http_request();'
Alternatively, an extension such as pg_cron allows you to schedule recurring queries directly from the database.
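For example, assuming pg_cron is installed and enabled in the database, the rollup could be scheduled like this (a sketch; the job name and schedule are illustrative):

```sql
-- Run the rollup function once a minute
SELECT cron.schedule('* * * * *', 'SELECT rollup_http_request();');
```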
The dashboard query from earlier is now a lot nicer:
SELECT site_id, ingest_time as minute, request_count,
success_count, error_count, average_response_time_msec
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;
The rollups make queries faster, but we still need to expire old data to avoid unbounded storage costs. Simply decide how long you would like to keep data for each granularity and use standard queries to delete expired data. In the following example, we decided to keep raw data for one day, and per-minute aggregations for one month:
DELETE FROM http_request WHERE ingest_time < now() - interval '1 day';
DELETE FROM http_request_1min WHERE ingest_time < now() - interval '1 month';
In production you could wrap these queries in a function and call it
every minute in a cron job.
Data expiration can go even faster by using table range partitioning on top of citus hash distribution. See the Timeseries Data section for a detailed example.
Those are the basics. We provided an architecture that ingests HTTP events and then rolls up these events into their pre-aggregated form. This way you can both store raw events and also power your analytical dashboards with subsecond queries.
The next sections extend the basic architecture and show you how to resolve questions that often appear.
A common question in HTTP analytics deals with approximate distinct counts: How many unique visitors visited your site over the last month? Answering this question exactly requires storing the list of all previously seen visitors in the rollup tables, a prohibitively large amount of data. However, an approximate answer is much more manageable.
A datatype called HyperLogLog, or hll, can answer the query
approximately; it takes a surprisingly small amount of space to tell you
approximately how many unique elements are in a set. Its accuracy can be
adjusted. We will use the default settings, which, using only 1,280 bytes,
can count up to tens of billions of unique visitors with at most 2.2%
error.
An equivalent problem appears if you want to run a global query, such as
the number of unique IP addresses that visited any of your clients'
sites over the last month. Without hll this query involves
shipping lists of IP addresses from the workers to the coordinator for
it to deduplicate. That is both a lot of network traffic and a lot of
computation. By using hll you can greatly improve query
speed.
You can install the hll extension, whose instructions are available in the GitHub repository, and enable it as follows:
CREATE EXTENSION hll;
Now we are ready to track IP addresses in our rollup with hll. First add a column to the rollup table.
ALTER TABLE http_request_1min ADD COLUMN distinct_ip_addresses hll;
Next use our custom aggregation to populate the column. Just add it to the query in our rollup function:
@@ -1,10 +1,12 @@
INSERT INTO http_request_1min (
site_id, ingest_time, request_count,
success_count, error_count, average_response_time_msec
+ , distinct_ip_addresses
) SELECT
site_id,
date_trunc('minute', ingest_time),
COUNT(1) as request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
+ , hll_add_agg(hll_hash_text(ip_address)) AS distinct_ip_addresses
FROM http_request
Dashboard queries are a little more complicated: you have to read out
the distinct number of IP addresses by calling the
hll_cardinality function:
SELECT site_id, ingest_time as minute, request_count,
success_count, error_count, average_response_time_msec,
hll_cardinality(distinct_ip_addresses) AS distinct_ip_address_count
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - interval '5 minutes';
hll is not just faster, it lets you do things you could not
do previously. Say we did our rollups, but instead of using
hll we saved the exact unique counts. This works fine, but
you cannot answer queries such as
“how many distinct sessions were there during a one-week period
in the past whose raw data we have already thrown away?”.
With hll, this is easy. You can compute distinct IP counts
over a time period with the following query:
SELECT hll_cardinality(hll_union_agg(distinct_ip_addresses))
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;
You can find more information about the hll extension in the project's GitHub repository.
The citus extension works well with the Postgres Pro built-in support for unstructured data types. To demonstrate this, let's keep track of the number of visitors that came from each country. Using a semi-structured data type saves you from needing to add a column for every individual country and ending up with rows that have hundreds of sparsely filled columns. It is recommended to use the JSONB format; here we will demonstrate how to incorporate JSONB columns into your data model.
First, add the new column to our rollup table:
ALTER TABLE http_request_1min ADD COLUMN country_counters JSONB;
Next, include it in the rollups by modifying the rollup function:
@@ -1,14 +1,19 @@
INSERT INTO http_request_1min (
site_id, ingest_time, request_count,
success_count, error_count, average_response_time_msec
+ , country_counters
) SELECT
site_id,
date_trunc('minute', ingest_time),
COUNT(1) as request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
- FROM http_request
+ , jsonb_object_agg(request_country, country_count) AS country_counters
+ FROM (
+ SELECT *,
+ count(1) OVER (
+ PARTITION BY site_id, date_trunc('minute', ingest_time), request_country
+ ) AS country_count
+ FROM http_request
+ ) h
Now, if you want to get the number of requests that came from America in your dashboard, you can modify the dashboard query to look like this:
SELECT
request_count, success_count, error_count, average_response_time_msec,
COALESCE(country_counters->>'USA', '0')::int AS american_visitors
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;
In a timeseries workload, applications such as real-time dashboards query recent information while archiving old information.
To deal with this workload, a single-node Postgres Pro database would typically use table partitioning to break a big table of time-ordered data into multiple inherited tables with each containing different time ranges.
Storing data in multiple physical tables speeds up data expiration. In a single big table, deleting rows incurs the cost of scanning to find which to delete, and then vacuuming the emptied space. On the other hand, dropping a partition is a fast operation independent of data size. It is the equivalent of simply removing files on disk that contain the data. See the figure below to learn more.
Figure J.4. Delete vs. Drop Diagram
Partitioning a table also makes indices smaller and faster within each date range. Queries operating on recent data are likely to operate on “hot” indices that fit in memory. This speeds up reads. See the figure below to learn more.
Figure J.5. SELECT Across Multiple Indexes
Inserts also have smaller indices to update, so they run faster too. See the figure below to learn more.
Figure J.6. INSERT Across Multiple Indexes
Time-based partitioning makes most sense when:
Most queries access a very small subset of the most recent data.
Older data is periodically expired (deleted/dropped).
Keep in mind that, in the wrong situation, the overhead of reading all these partitions can outweigh the benefits. In the right situations, however, it is quite helpful: for example, when keeping a year of timeseries data and regularly querying only the most recent week.
We can mix the single-node table partitioning techniques with citus distributed sharding to make a scalable time-series database. It is the best of both worlds. It is especially elegant atop Postgres Pro declarative table partitioning. See the figure below to learn more.
Figure J.7. Timeseries Sharding and Partitioning
For example, let's distribute and partition a table holding the historical GitHub events data.
Each record in this GitHub data set represents an event created in GitHub, along with key information regarding the event such as event type, creation date, and the user who created the event.
The first step is to create and partition the table by time as we would in a single-node Postgres Pro database:
-- Declaratively partitioned table
CREATE TABLE github_events (
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    actor jsonb,
    org jsonb,
    created_at timestamp
) PARTITION BY RANGE (created_at);
Notice the PARTITION BY RANGE (created_at). This tells
Postgres Pro that the table will be partitioned
by the created_at column in ordered ranges. We have
not yet created any partitions for specific ranges, though.
Before creating specific partitions, let's distribute the table in
citus. We will shard by
repo_id, meaning the events will be clustered into
shards per repository.
SELECT create_distributed_table('github_events', 'repo_id');
At this point citus has created shards for
this table across worker nodes. Internally each shard is a table with
the name github_events_N
for each shard identifier N. Also,
citus propagated the partitioning information,
and each of these shards has Partition key: RANGE (created_at)
declared.
A partitioned table cannot directly contain data; it is more like a view across its partitions. Thus the shards are not yet ready to hold data. We need to create partitions and specify their time ranges, after which we can insert data that matches the ranges.
citus provides helper functions for partition management. We can create a batch of monthly partitions using the create_time_partitions function:
SELECT create_time_partitions(
    table_name         := 'github_events',
    partition_interval := '1 month',
    end_at             := now() + '12 months'
);
citus also includes the time_partitions view for an easy way to investigate the partitions it has created.
SELECT partition
FROM time_partitions
WHERE parent_table = 'github_events'::regclass;

┌────────────────────────┐
│ partition              │
├────────────────────────┤
│ github_events_p2021_10 │
│ github_events_p2021_11 │
│ github_events_p2021_12 │
│ github_events_p2022_01 │
│ github_events_p2022_02 │
│ github_events_p2022_03 │
│ github_events_p2022_04 │
│ github_events_p2022_05 │
│ github_events_p2022_06 │
│ github_events_p2022_07 │
│ github_events_p2022_08 │
│ github_events_p2022_09 │
│ github_events_p2022_10 │
└────────────────────────┘
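The partition names above follow a p&lt;year&gt;_&lt;month&gt; pattern. A simplified Python sketch shows how monthly names and bounds could be derived from an interval; this is an illustration only, not the extension's actual implementation:

```python
from datetime import date

def monthly_partitions(parent, start, end):
    """Return (partition_name, from_date, to_date) for each month in
    [start, end), using the p<year>_<month> suffix seen in the listing."""
    parts = []
    y, m = start.year, start.month
    while (y, m) < (end.year, end.month):
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        parts.append((f"{parent}_p{y}_{m:02d}", date(y, m, 1), date(ny, nm, 1)))
        y, m = ny, nm
    return parts

parts = monthly_partitions("github_events", date(2021, 10, 1), date(2022, 11, 1))
# parts[0] covers October 2021 and is named github_events_p2021_10
```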
As time progresses, you will need to do some maintenance to create new partitions and drop old ones. It is best to set up a periodic job to run the maintenance functions with an extension like pg_cron:
-- Set two monthly cron jobs:
-- 1. Ensure we have partitions for the next 12 months
SELECT cron.schedule('create-partitions', '0 0 1 * *', $$
SELECT create_time_partitions(
table_name := 'github_events',
partition_interval := '1 month',
end_at := now() + '12 months'
)
$$);
-- 2. (Optional) Ensure we never have more than one year of data
SELECT cron.schedule('drop-partitions', '0 0 1 * *', $$
CALL drop_old_time_partitions(
'github_events',
now() - interval '12 months' /* older_than */
);
$$);
Be aware that native partitioning in Postgres Pro is still quite new and has a few quirks. Maintenance operations on partitioned tables will acquire aggressive locks that can briefly stall queries.
Some applications have data that logically divides into a small updatable part and a larger part that is “frozen”. Examples include logs, clickstreams, or sales records. In this case we can combine partitioning with columnar table storage to compress historical partitions on disk. citus columnar tables are currently append-only, meaning they do not support updates or deletes, but we can use them for the immutable historical partitions.
A partitioned table may be made up of any combination of row and columnar partitions. When using range partitioning on a timestamp key, we can make the newest partition a row table, and periodically roll the newest partition into another historical columnar partition.
Let's see an example, using GitHub events again. We will create a new
table called github_columnar_events for
disambiguation from the earlier example. To focus entirely on the
columnar storage aspect, we will not distribute this table.
Next, download sample data:
wget http://examples.citusdata.com/github_archive/github_events-2015-01-01-{0..5}.csv.gz
gzip -c -d github_events-2015-01-01-*.gz >> github_events.csv
-- Our new table, same structure as the example in
-- the previous section
CREATE TABLE github_columnar_events ( LIKE github_events )
PARTITION BY RANGE (created_at);

-- Create partitions to hold two hours of data each
SELECT create_time_partitions(
    table_name         := 'github_columnar_events',
    partition_interval := '2 hours',
    start_from         := '2015-01-01 00:00:00',
    end_at             := '2015-01-01 08:00:00'
);

-- Fill with sample data
-- (note that this data requires the database to have UTF8 encoding)
\COPY github_columnar_events FROM 'github_events.csv' WITH (format CSV)

-- List the partitions, and confirm they are
-- using row-based storage (heap access method)
SELECT partition, access_method
FROM time_partitions
WHERE parent_table = 'github_columnar_events'::regclass;
┌─────────────────────────────────────────┬───────────────┐
│ partition                               │ access_method │
├─────────────────────────────────────────┼───────────────┤
│ github_columnar_events_p2015_01_01_0000 │ heap          │
│ github_columnar_events_p2015_01_01_0200 │ heap          │
│ github_columnar_events_p2015_01_01_0400 │ heap          │
│ github_columnar_events_p2015_01_01_0600 │ heap          │
└─────────────────────────────────────────┴───────────────┘
-- Convert older partitions to use columnar storage
CALL alter_old_partitions_set_access_method(
    'github_columnar_events',
    '2015-01-01 06:00:00' /* older_than */,
    'columnar'
);

-- The old partitions are now columnar, while the
-- latest uses row storage and can be updated
SELECT partition, access_method
FROM time_partitions
WHERE parent_table = 'github_columnar_events'::regclass;
┌─────────────────────────────────────────┬───────────────┐
│ partition                               │ access_method │
├─────────────────────────────────────────┼───────────────┤
│ github_columnar_events_p2015_01_01_0000 │ columnar      │
│ github_columnar_events_p2015_01_01_0200 │ columnar      │
│ github_columnar_events_p2015_01_01_0400 │ columnar      │
│ github_columnar_events_p2015_01_01_0600 │ heap          │
└─────────────────────────────────────────┴───────────────┘
To see the compression ratio for a columnar table, use
VACUUM VERBOSE. The compression ratio for our three
columnar partitions is pretty good:
VACUUM VERBOSE github_columnar_events;
INFO: statistics for "github_columnar_events_p2015_01_01_0000":
storage id: 10000000003
total file size: 4481024, total data size: 4444425
compression rate: 8.31x
total row count: 15129, stripe count: 1, average rows per stripe: 15129
chunk count: 18, containing data for dropped columns: 0, zstd compressed: 18

INFO: statistics for "github_columnar_events_p2015_01_01_0200":
storage id: 10000000004
total file size: 3579904, total data size: 3548221
compression rate: 8.26x
total row count: 12714, stripe count: 1, average rows per stripe: 12714
chunk count: 18, containing data for dropped columns: 0, zstd compressed: 18

INFO: statistics for "github_columnar_events_p2015_01_01_0400":
storage id: 10000000005
total file size: 2949120, total data size: 2917407
compression rate: 8.51x
total row count: 11756, stripe count: 1, average rows per stripe: 11756
chunk count: 18, containing data for dropped columns: 0, zstd compressed: 18
One power of the partitioned table
github_columnar_events is that it can be
queried in its entirety like a normal table.
SELECT COUNT(DISTINCT repo_id) FROM github_columnar_events;
┌───────┐
│ count │
├───────┤
│ 16001 │
└───────┘
Entries can be updated or deleted, as long as there is a
WHERE clause on the partition key, which filters
entirely into row table partitions.
When a row partition has filled its range, you can archive it to compressed columnar storage. We can automate this with pg_cron like so:
-- A monthly cron job
SELECT cron.schedule('compress-partitions', '0 0 1 * *', $$
CALL alter_old_partitions_set_access_method(
'github_columnar_events',
now() - interval '6 months' /* older_than */,
'columnar'
);
$$);
For more information, see the Columnar Storage section.
citus is a Postgres Pro extension that allows commodity database servers (called nodes) to coordinate with one another in a “shared-nothing” architecture. The nodes form a cluster that allows Postgres Pro to hold more data and use more CPU cores than would be possible on a single computer. This architecture also allows the database to scale by simply adding more nodes to the cluster.
Every cluster has one special node called the coordinator (the others are known as workers). Applications send their queries to the coordinator node, which relays them to the relevant workers and accumulates the results.
For each query, the coordinator either routes it to a single worker node or parallelizes it across several, depending on whether the required data lives on a single node or on multiple nodes. The coordinator knows how to do this by consulting its metadata tables. These citus-specific tables track the DNS names and health of worker nodes, and the distribution of data across nodes. For more information, see the citus Tables and Views section.
Sharding is a technique used in database systems and distributed computing to horizontally partition data across multiple servers or nodes. It involves breaking up a large database or dataset into smaller, more manageable parts called shards. Each shard contains a subset of the data, and together they form the complete dataset.
citus offers two types of data sharding: row-based and schema-based. Each option comes with its own sharding tradeoffs allowing you to choose the approach that best aligns with requirements of your application.
The traditional way in which citus shards tables is the single database, shared schema model, also known as row-based sharding, in which tenants co-exist as rows within the same table. The tenant is determined by defining the distribution column, which allows splitting up a table horizontally.
This is the most hardware efficient way of sharding. Tenants are densely packed and distributed among the nodes in the cluster. This approach, however, requires making sure that all tables in the schema have the distribution column and that all queries in the application filter by it. Row-based sharding shines in IoT workloads and when you want to get the most out of your hardware.
Benefits:
Best performance
Best tenant density per node
Drawbacks:
Requires schema modifications
Requires application query modifications
All tenants must share the same schema
Schema-based sharding is the shared database, separate schema model, in
which the schema becomes the logical shard within the database. Multi-tenant
apps can use a schema per tenant to easily shard along the tenant
dimension. Query changes are not required, and the application usually
only needs a small modification to set the proper
search_path when switching tenants. Schema-based
sharding is an ideal solution for microservices, and for ISVs deploying
applications that cannot undergo the changes required to onboard
row-based sharding.
Benefits:
Tenants can have heterogeneous schemas
No schema modifications required
No application query modifications required
Better SQL compatibility than row-based sharding
Drawbacks:
Fewer tenants per node compared to row-based sharding
| | Schema-Based Sharding | Row-Based Sharding |
|---|---|---|
| Multi-tenancy model | Separate schema per tenant | Shared tables with tenant ID columns |
| citus version | 12.0+ | All versions |
| Additional steps compared to Postgres Pro | None, only a config change | Use the create_distributed_table function on each table to distribute and co-locate tables by tenant_id |
| Number of tenants | 1-10k | 1-1M+ |
| Data modelling requirement | No foreign keys across distributed schemas | Need to include the tenant_id column (a distribution column, also known as a sharding key) in each table, and in primary keys, foreign keys |
| SQL requirement for single node queries | Use a single distributed schema per query | Joins and WHERE clauses should include the tenant_id column |
| Parallel cross-tenant queries | No | Yes |
| Custom table definitions per tenant | Yes | No |
| Access control | Schema permissions | Schema permissions |
| Data sharing across tenants | Yes, using reference tables (in a separate schema) | Yes, using reference tables |
| Tenant to shard isolation | Every tenant has its own shard group by definition | Can give specific tenant IDs their own shard group via the isolate_tenant_to_new_shard function |
There are several types of tables in a citus cluster, each used for different purposes.
Type 1: Distributed Tables.
The first type, and most common, is distributed tables. These appear to be normal tables to SQL statements, but are horizontally partitioned across worker nodes. See the figure below to learn more.
Figure J.8. Parallel SELECT Diagram
Here the rows of the table are stored in tables
table_1001, table_1002, etc. on
the workers. The component worker tables are called
shards.
citus runs not only SQL but DDL statements throughout a cluster, so changing the schema of a distributed table cascades to update all the table shards across workers.
To learn how to create a distributed table, see the Creating and Modifying Distributed Objects (DDL) section.
Distribution Column. citus uses algorithmic sharding to assign rows to shards. This means the assignment is made deterministically — in our case based on the value of a particular table column called the distribution column. The cluster administrator must designate this column when distributing a table. Making the right choice is important for performance and functionality, as described in the general topic of the Choosing Distribution Column section.
Type 2: Reference Tables.
A reference table is a type of distributed table whose entire contents are concentrated into a single shard, which is replicated on every worker. Thus queries on any worker can access the reference information locally, without the network overhead of requesting rows from another node. Reference tables have no distribution column because there is no need to distinguish separate shards per row.
Reference tables are typically small and are used to store data that is relevant to queries running on any worker node. For example, enumerated values like order statuses or product categories.
When interacting with a reference table, citus automatically performs two-phase commits on transactions. This means that citus makes sure your data is always in a consistent state, regardless of whether you are writing, modifying, or deleting it.
The Reference Tables section talks more about these tables and how to create them.
Type 3: Local Tables.
When you use citus, the coordinator node you connect to and interact with is a regular Postgres Pro database with the citus extension installed. Thus you can create ordinary tables and choose not to shard them. This is useful for small administrative tables that do not participate in join queries. An example would be a users table for application login and authentication.
Creating standard Postgres Pro tables is easy
because it is the default. It is what you get when you run
CREATE TABLE. In almost every
citus deployment we see standard
Postgres Pro tables co-existing with
distributed and reference tables. Indeed,
citus itself uses local tables to hold
cluster metadata, as mentioned earlier.
Type 4: Local Managed Tables.
When the citus.enable_local_reference_table_foreign_keys configuration parameter is enabled, citus may automatically add local tables to metadata if a foreign key reference exists between a local table and a reference table. Additionally, such tables can be created manually by calling the citus_add_local_table_to_metadata function on regular local tables. Tables present in metadata are considered managed tables and can be queried from any node: citus knows to route to the coordinator to obtain data from a local managed table. Such tables are displayed as local in the citus_tables view.
Type 5: Schema Tables.
When using schema-based sharding, distributed schemas are automatically associated with individual co-location groups such that the tables created in those schemas are automatically converted to co-located distributed tables without a shard key. Such tables are considered schema tables and are displayed as schema in the citus_tables view.
The previous section described a shard as containing a subset of the rows of a distributed table in a smaller table within a worker node. This section gets more into the technical details.
The pg_dist_shard
metadata table on the coordinator contains a row for each shard of each
distributed table in the system. The row matches a
shardid with a range of integers in a hash space
(shardminvalue, shardmaxvalue):
SELECT * FROM pg_dist_shard;

 logicalrelid  | shardid | shardstorage | shardminvalue | shardmaxvalue
---------------+---------+--------------+---------------+---------------
 github_events |  102026 | t            |     268435456 |     402653183
 github_events |  102027 | t            |     402653184 |     536870911
 github_events |  102028 | t            |     536870912 |     671088639
 github_events |  102029 | t            |     671088640 |     805306367
(4 rows)
If the coordinator node wants to determine which shard holds a row of
github_events, it hashes the value of the distribution
column in the row, and checks which shard's range contains the hashed
value. (The ranges are defined so that the image of the hash function is
their disjoint union.)
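The range lookup itself can be sketched in Python. The ranges below are the ones from the pg_dist_shard example; note that citus first hashes the distribution column value with the hash function for the column's type, and the sketch starts from that already-hashed value:

```python
# (shardid, shardminvalue, shardmaxvalue) taken from the pg_dist_shard example
shards = [
    (102026, 268435456, 402653183),
    (102027, 402653184, 536870911),
    (102028, 536870912, 671088639),
    (102029, 671088640, 805306367),
]

def shard_for(hashed_value):
    """Return the shardid whose [min, max] range contains the hashed
    distribution column value."""
    for shardid, lo, hi in shards:
        if lo <= hashed_value <= hi:
            return shardid
    # A real cluster's ranges cover the whole hash space; this example
    # lists only the four shards shown above.
    return None
```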
Suppose that shard 102027 is associated with the
row in question. This means the row should be read or written to a
table called github_events_102027 in one of the
workers. Which worker? That is determined entirely by the metadata
tables, and the mapping of shard to worker is known as the shard
placement.
Joining some metadata tables
gives us the answer. These are the types of lookups that the coordinator
does to route queries. It rewrites queries into fragments that refer to
the specific tables like github_events_102027, and
runs those fragments on the appropriate workers.
SELECT
shardid,
node.nodename,
node.nodeport
FROM pg_dist_placement placement
JOIN pg_dist_node node
ON placement.groupid = node.groupid
AND node.noderole = 'primary'::noderole
WHERE shardid = 102027;
┌─────────┬───────────┬──────────┐
│ shardid │ nodename  │ nodeport │
├─────────┼───────────┼──────────┤
│  102027 │ localhost │     5433 │
└─────────┴───────────┴──────────┘
In our example of github_events there were four
shards. The number of shards is configurable per table at the time of
its distribution across the cluster. The best choice of shard count
depends on your use case, see the
Shard Count
section.
Finally, note that citus allows shards to be replicated for protection against data loss, using Postgres Pro streaming replication to back up the entire database of each node to a follower database. This is transparent and does not require the involvement of citus metadata tables.
Since shards can be placed on nodes as desired, it makes sense to place shards containing related rows of related tables together on the same nodes. That way join queries between them can avoid sending as much information over the network, and can be performed inside a single citus node.
One example is a database with stores, products, and purchases. If all
three tables contain — and are distributed by — the
store_id column, then all queries restricted to a
single store can run efficiently on a single worker node. This is true
even when the queries involve any combination of these tables.
For the full explanation and examples of this concept, see the Table Co-Location section.
Spreading queries across multiple computers allows more queries to run at once, and allows processing speed to scale by adding new computers to the cluster. Additionally splitting a single query into fragments as described in the previous section boosts the processing power devoted to it. The latter situation achieves the greatest parallelism, meaning utilization of CPU cores.
Queries reading or affecting shards spread evenly across many nodes are able to run at “real-time” speed. Note that the results of the query still need to pass back through the coordinator node, so the speedup is most apparent when the final results are compact, such as aggregate functions like counting and descriptive statistics.
The Query Processing section explains more about how queries are broken into fragments and how their execution is managed.
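The reason aggregates such as COUNT parallelize so well can be sketched in Python. The nested lists stand in for shards held by different workers; this is an illustration of the idea, not the executor's actual protocol:

```python
def distributed_count(shards):
    """Each worker computes a partial COUNT over its local shards; the
    coordinator only sums the compact partial results."""
    partials = [len(rows) for rows in shards]  # computed on the workers
    return sum(partials)                       # combined on the coordinator

def distributed_avg(shards):
    """AVG cannot be computed by averaging per-shard averages; instead,
    workers return (sum, count) pairs that the coordinator combines."""
    partials = [(sum(rows), len(rows)) for rows in shards]
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    return total / n

shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # hypothetical shard contents
total_rows = distributed_count(shards)
```

Only the small partial results cross the network, which is why queries with compact final answers see the biggest speedup.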
When executing multi-shard queries, citus must balance the gains from parallelism with the overhead from database connections (network latency and worker node resource usage). To configure citus query execution for best results with your database workload, it helps to understand how citus manages and conserves database connections between the coordinator node and worker nodes.
citus transforms each incoming multi-shard
query session into per-shard queries called tasks. It queues the tasks,
and runs them once it is able to obtain connections to the relevant worker
nodes. For queries on distributed tables foo and
bar, see the
connection management diagram
below.
Figure J.9. Executor Overview
The coordinator node has a connection pool for each session. Each query
(such as SELECT * FROM foo in the diagram) is limited
to opening at most
citus.max_adaptive_executor_pool_size
simultaneous connections per worker for its tasks. The parameter is
configurable at the session level, for priority management.
It can be faster to execute short tasks sequentially over the same connection rather than establishing new connections for them in parallel. Long running tasks, on the other hand, benefit from more immediate parallelism.
To balance the needs of short and long tasks, citus
uses the
citus.executor_slow_start_interval
configuration parameter. It specifies a delay between connection attempts
for the tasks in a multi-shard query. When a query first queues tasks, the
tasks can acquire just one connection. At the end of each interval where
there are pending connections, citus increases
the number of simultaneous connections it will open. The slow start
behavior can be disabled entirely by setting the GUC to
0.
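The slow-start ramp can be illustrated with a small simulation. The doubling-per-interval growth below is an assumption made purely for illustration; the exact growth schedule is internal to the executor:

```python
def connections_over_time(intervals, pool_size):
    """Simulate slow start: begin with one connection and (as an assumed
    illustration) double the allowance each elapsed interval, capped by
    the per-worker limit (citus.max_adaptive_executor_pool_size)."""
    allowed, history = 1, []
    for _ in range(intervals):
        history.append(min(allowed, pool_size))
        allowed *= 2
    return history

ramp = connections_over_time(5, pool_size=16)
# Short queries finish before the ramp opens many connections;
# long-running queries reach full parallelism after a few intervals.
```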
When a task finishes using a connection, the session pool will hold the connection open for later. Caching the connection avoids the overhead of connection reestablishment between coordinator and worker. However, each pool holds open no more idle connections than the citus.max_cached_conns_per_worker configuration parameter allows, to limit idle connection resource usage on the worker.
Finally, the citus.max_shared_pool_size configuration parameter acts as a fail-safe. It limits the total connections per worker between all tasks.
For recommendations about tuning these parameters to match your workload, see the Connection Management section.
Running efficient queries on a citus cluster requires that data be properly distributed across computers. This varies by the type of application and its query patterns.
There are broadly two kinds of applications that work very well on citus. The first step in data modeling is to identify which of them more closely resembles your application.
| Multi-Tenant Applications | Real-Time Applications |
|---|---|
| Sometimes dozens or hundreds of tables in schema | Small number of tables |
| Queries relating to one tenant (company/store) at a time | Relatively simple analytics queries with aggregations |
| OLTP workloads for serving web clients | High ingest volume of mostly immutable data |
| OLAP workloads that serve per-tenant analytical queries | Often centering around a big table of events |
These are typically SaaS applications that serve other companies,
accounts, or organizations. Most SaaS applications are inherently
relational. They have a natural dimension on which to distribute data
across nodes: just shard by tenant_id.
citus enables you to scale out your database to millions of tenants without having to re-architect your application. You can keep the relational semantics you need, like joins, foreign key constraints, transactions, ACID, and consistency.
Examples: Websites, which host store-fronts for other businesses, such as a digital marketing solution, or a sales automation tool.
Characteristics: Queries relating to a single tenant rather than joining information across tenants. This includes OLTP workloads for serving web clients, and OLAP workloads that serve per-tenant analytical queries. Having dozens or hundreds of tables in your database schema is also an indicator for the multi-tenant data model.
Scaling a multi-tenant app with citus also requires minimal changes to application code. We have support for popular frameworks like Ruby on Rails and Django.
These are applications that need massive parallelism, coordinating hundreds of cores for fast results to numerical, statistical, or counting queries. By sharding and parallelizing SQL queries across multiple nodes, citus makes it possible to perform real-time queries across billions of records in under a second.
Examples: Customer-facing analytics dashboards requiring sub-second response times.
Characteristics: Few tables,
often centering around a big table of device-, site- or user-events
and requiring high ingest volume of mostly immutable data.
Relatively simple (but computationally intensive) analytics queries
involving several aggregations and GROUP BY
operations.
If your situation resembles either of the cases above, then the next step is to decide how to shard your data in the citus cluster. As explained in the Architecture Concepts section, citus assigns table rows to shards according to the hashed value of the table distribution column. The database administrator's choice of distribution columns needs to match the access patterns of typical queries to ensure performance.
citus uses the distribution column in distributed tables to assign table rows to shards. Choosing the distribution column for each table is one of the most important modeling decisions because it determines how data is spread across nodes.
If the distribution columns are chosen correctly, then related data will group together on the same physical nodes, making queries fast and adding support for all SQL features. If the columns are chosen incorrectly, the system will run needlessly slowly, and will not be able to support all SQL features across nodes.
This section gives distribution column tips for the two most common citus scenarios. It concludes by going in-depth on “co-location”, the desirable grouping of data on nodes.
The multi-tenant architecture uses a form of hierarchical database
modeling to distribute queries across nodes in the distributed cluster.
The top of the data hierarchy is known as the tenant_id,
and needs to be stored in a column on each table.
citus inspects queries to see which
tenant_id they involve and routes the query to a
single worker node for processing, specifically the node that holds the
data shard associated with the tenant_id. Running a
query with all relevant data placed on the same node is called
co-location.
The following diagram
illustrates co-location in the multi-tenant data model. It contains two
tables, Accounts and Campaigns, each distributed by
account_id. The shaded boxes represent shards, each
of whose color represents which worker node contains it. Green shards
are stored together on one worker node, and blue on another. Notice how
a join query between Accounts and Campaigns would have all the necessary
data together on one node when restricting both tables to the same
account_id.
Figure J.10. Multi-Tenant Co-Location
To apply this design in your own schema the first step is identifying
what constitutes a tenant in your application. Common instances include
company, account, organization, or customer. The column name will be
something like company_id or customer_id.
Examine each of your queries and ask yourself: would it work if it had
additional WHERE clauses to restrict all tables
involved to rows with the same tenant_id? Queries in
the multi-tenant model are usually scoped to a tenant, for instance,
queries on sales or inventory would be scoped within a certain store.
Best practices are as follows:
Partition distributed tables by the common tenant_id column.
For instance, in a SaaS application where tenants are
companies, the tenant_id will likely be
company_id.
Convert small cross-tenant tables to reference tables. When multiple tenants share a small table of information, distribute it as a reference table.
Filter all application queries by tenant_id.
Each query should request information for one tenant at a time.
Consult the Multi-Tenant Applications section for a detailed example of building this kind of application.
While the multi-tenant architecture introduces a hierarchical structure and uses data co-location to route queries per tenant, real-time architectures depend on specific distribution properties of their data to achieve highly parallel processing.
We use “entity ID” as a term for distribution columns in the real-time model, as opposed to tenant IDs in the multi-tenant model. Typical entities are users, hosts, or devices.
Real-time queries typically ask for numeric aggregates grouped by date or category. citus sends these queries to each shard for partial results and assembles the final answer on the coordinator node. Queries run fastest when as many nodes contribute as possible, and when no single node must do a disproportionate amount of work.
Best practices are as follows:
Choose a column with high cardinality as the distribution column. For comparison, a “status” field on an order table with values “new”, “paid”, and “shipped” is a poor choice of distribution column because it can take only those few values. The number of distinct values limits the number of shards that can hold the data, and the number of nodes that can process it. Among columns with high cardinality, it is additionally good to choose those that are frequently used in group-by clauses or as join keys.
Choose a column with even distribution. If you distribute a table on a column skewed to certain common values, then data in the table will tend to accumulate in certain shards. The nodes holding those shards will end up doing more work than other nodes.
Distribute fact and dimension tables on their common columns. Your fact table can have only one distribution key. Tables that join on another key will not be co-located with the fact table. Choose one dimension to co-locate based on how frequently it is joined and the size of the joining rows.
Change some dimension tables into reference tables. If a dimension table cannot be co-located with the fact table, you can improve query performance by distributing copies of the dimension table to all of the nodes in the form of a reference table.
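The distribution choices above can be sketched as follows. The table and column names (page_views as the fact table, sites and browsers as dimensions, site_id as the shared distribution column) are illustrative only:

```sql
-- Fact and dimension tables share the site_id distribution column,
-- so their shards are co-located and joins stay local to each node
SELECT create_distributed_table('page_views', 'site_id');
SELECT create_distributed_table('sites', 'site_id',
                                colocate_with => 'page_views');

-- A small dimension that lacks site_id is instead replicated
-- to every node as a reference table
SELECT create_reference_table('browsers');
```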
Consult the Real-Time Dashboards section for a detailed example of building this kind of application.
In a time-series workload, applications query recent information while archiving old information.
The most common mistake in modeling time-series data in citus is using the timestamp itself as the distribution column. A hash distribution based on time scatters timestamps seemingly at random across shards rather than keeping ranges of time together. However, queries involving time generally reference ranges of time (for example, the most recent data), so such a hash distribution leads to network overhead.
Best practices are as follows:
Do not choose a timestamp as the distribution column.
Choose a different distribution column. In a multi-tenant app, use the tenant_id; in a real-time app, use the entity_id.
Use Postgres Pro table partitioning for time instead. Use table partitioning to break a big table of time-ordered data into multiple inherited tables with each containing different time ranges. Distributing a Postgres Pro partitioned table in citus creates shards for the inherited tables.
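A minimal sketch of these practices, combining distribution on an entity column with native time partitioning; the events table and its columns are hypothetical:

```sql
CREATE TABLE events (
    device_id  bigint,
    event_time timestamptz,
    payload    jsonb
) PARTITION BY RANGE (event_time);

-- Distribute on the entity, never on the timestamp
SELECT create_distributed_table('events', 'device_id');

-- Each partition gets its own set of co-located shards
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```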
Consult the Timeseries Data section for a detailed example of building this kind of application.
Relational databases are the first choice of data store for many applications due to their enormous flexibility and reliability. Historically, one criticism of relational databases has been that they can run on only a single computer, which creates inherent limitations when data storage needs outpace server improvements. The solution to rapidly scaling databases is to distribute them, but this creates a performance problem of its own: relational operations such as joins then need to cross the network boundary. Co-location is the practice of dividing data tactically, keeping related information on the same computers to enable efficient relational operations, while taking advantage of horizontal scalability for the whole dataset.
The principle of data co-location is that all tables in the database have a common distribution column and are sharded across computers in the same way, such that rows with the same distribution column value are always on the same computer, even across different tables. As long as the distribution column provides a meaningful grouping of data, relational operations can be performed within the groups.
The citus extension for Postgres Pro is unique in being able to form a distributed database of databases. Every node in a citus cluster is a fully functional Postgres Pro database, and the extension adds the experience of a single homogeneous database on top. While it does not provide the full functionality of Postgres Pro in a distributed way, in many cases it can take full advantage of features offered by Postgres Pro on a single computer through co-location, including full SQL support, transactions, and foreign keys.
In citus a row is stored in a shard if the hash of the value in the distribution column falls within the shard hash range. To ensure co-location, shards with the same hash range are always placed on the same node even after rebalance operations, such that equal distribution column values are always on the same node across tables. See the figure below to learn more.
Figure J.11. Co-Location Shards
A distribution column that we have found to work well in practice is
tenant_id in multi-tenant applications. For example,
SaaS applications typically have many tenants, but every query they make
is specific to a particular tenant. While one option is providing a
database or schema for every tenant, it is often costly and impractical
as there can be many operations that span across users (data loading,
migrations, aggregations, analytics, schema changes, backups, etc). That
becomes harder to manage as the number of tenants grows.
Consider the following tables, which might be part of a multi-tenant web analytics SaaS:
CREATE TABLE event (
  tenant_id int,
  event_id bigint,
  page_id int,
  payload jsonb,
  PRIMARY KEY (tenant_id, event_id)
);

CREATE TABLE page (
  tenant_id int,
  page_id int,
  path text,
  PRIMARY KEY (tenant_id, page_id)
);
Now we want to answer queries that may be issued by a customer-facing
dashboard, such as: “Return the number of visits
in the past week for all pages starting with /blog in
tenant six”.
If our data was in a single Postgres Pro node, we could easily express our query using the rich set of relational operations offered by SQL:
SELECT page_id, count(event_id)
FROM page
LEFT JOIN (
  SELECT * FROM event
  WHERE (payload->>'time')::timestamptz >= now() - interval '1 week'
) recent USING (tenant_id, page_id)
WHERE tenant_id = 6 AND path LIKE '/blog%'
GROUP BY page_id;
As long as the working set for this query fits in memory, this is an appropriate solution for many applications since it offers maximum flexibility. However, even if you do not need to scale yet, it can be useful to consider the implications of scaling out on your data model.
As the number of tenants and the data stored for each tenant grows,
query times will typically go up as the working set no longer fits in
memory or CPU becomes a bottleneck. In this case, we can shard the data
across many nodes using citus. The first and
the most important choice we need to make when sharding is the
distribution column. Let's start with a naive choice of using
event_id for the event table and
page_id for the page table:
-- Naively use event_id and page_id as distribution columns
SELECT create_distributed_table('event', 'event_id');
SELECT create_distributed_table('page', 'page_id');
Given that the data is dispersed across different workers, we cannot simply perform a join as we would on a single Postgres Pro node. Instead, we will need to issue two queries:
Across all shards of the page table (Q1):
SELECT page_id FROM page WHERE path LIKE '/blog%' AND tenant_id = 6;
Across all shards of the event table (Q2):
SELECT page_id, count(*) AS count
FROM event
WHERE page_id IN (/*…page IDs from first query…*/)
  AND tenant_id = 6
  AND (payload->>'time')::date >= now() - interval '1 week'
GROUP BY page_id
ORDER BY count DESC
LIMIT 10;
Afterwards, the results from the two steps need to be combined by the application.
The data required to answer the query is scattered across the shards on the different nodes and each of those shards will need to be queried. See the figure below to learn more.
Figure J.12. Co-Location With Inefficient Queries
In this case the data distribution creates substantial drawbacks:
Overhead from querying each shard, running multiple queries.
Overhead of Q1 returning many rows to the client.
Q2 becomes very large.
The need to write queries in multiple steps and combine their results, which requires changes in the application.
A potential upside of the relevant data being dispersed is that the queries can be parallelized, which citus will do. However, this is only beneficial if the amount of work that the query does is substantially greater than the overhead of querying many shards. It is generally better to avoid doing such heavy lifting directly from the application, for example, by pre-aggregating the data.
Looking at our query again, we can see that all the rows that the query
needs have one dimension in common: tenant_id. The
dashboard will only ever query for a tenant's own data. That means that
if data for the same tenant is always co-located on a single
Postgres Pro node, our original query could be
answered in a single step by that node by performing a join on
tenant_id and page_id.
In citus, rows with the same distribution
column value are guaranteed to be on the same node. Each shard in a
distributed table effectively has a set of co-located shards from other
distributed tables that contain the same distribution column values
(data for the same tenant). Starting over, we can create our tables
with tenant_id as the distribution column.
-- Co-locate tables by using a common distribution column
SELECT create_distributed_table('event', 'tenant_id');
SELECT create_distributed_table('page', 'tenant_id', colocate_with => 'event');
In this case, citus can answer the same query that you would run on a single Postgres Pro node without modification:
SELECT page_id, count(event_id)
FROM page
LEFT JOIN (
  SELECT * FROM event
  WHERE (payload->>'time')::timestamptz >= now() - interval '1 week'
) recent USING (tenant_id, page_id)
WHERE tenant_id = 6 AND path LIKE '/blog%'
GROUP BY page_id;
Because of the tenant_id filter and join on
tenant_id, citus knows that
the entire query can be answered using the set of co-located shards that
contain the data for that particular tenant, and the
Postgres Pro node can answer the query in a
single step, which enables full SQL support. See the
figure
below to learn more.
Figure J.13. Co-Location With Better Queries
In some cases, queries and table schemas will require minor modifications
to ensure that the tenant_id is always included in
unique constraints and join conditions. However, this is usually a
straightforward change, and the extensive rewrite that would be required
without having co-location is avoided.
While the example above queries just one node because there is a specific
tenant_id = 6 filter, co-location also allows us to
efficiently perform distributed joins on tenant_id
across all nodes, albeit with some SQL limitations.
The full list of citus features unlocked by co-location is:
Full SQL support for queries on a single set of co-located shards.
Multi-statement transaction support for modifications on a single set of co-located shards.
Aggregation through INSERT...SELECT.
Foreign keys.
Distributed outer joins.
Pushdown CTEs.
Data co-location is a powerful technique for providing both horizontal scale and support for relational data models. The cost of migrating or building applications using a distributed database that enables relational operations through co-location is often substantially lower than moving to a restrictive data model (e.g., NoSQL), and, unlike a single-node database, it can scale out with the size of your business. For more information about migrating an existing database, see the Migrating an Existing App section.
citus parallelizes incoming queries by breaking them into multiple fragment queries (“tasks”), which run on the worker shards in parallel. This allows citus to utilize the processing power of all the nodes in the cluster, as well as of the individual cores on each node, for each query. Due to this parallelization, query performance scales with the combined computing power of all the cores in the cluster, leading to a dramatic decrease in query times compared to Postgres Pro on a single server.
citus employs a two-stage optimizer when planning SQL queries. The first phase involves converting the SQL queries into their commutative and associative form so that they can be pushed down and run on the workers in parallel. As discussed in previous sections, choosing the right distribution column and distribution method allows the distributed query planner to apply several optimizations to the queries. This can have a significant impact on query performance due to reduced network I/O.
The distributed executor of the citus extension then takes these individual query fragments and sends them to worker Postgres Pro instances. There are several aspects of both the distributed planner and the executor, which can be tuned in order to improve performance. When these individual query fragments are sent to the workers, the second phase of query optimization kicks in. The workers are simply running extended Postgres Pro servers and they apply Postgres Pro standard planning and execution logic to run these fragment SQL queries. Therefore, any optimization that helps Postgres Pro also helps citus. Postgres Pro by default comes with conservative resource settings; and therefore optimizing these configuration settings can improve query times significantly.
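As an illustration of worker-side tuning, the statements below adjust standard Postgres Pro resource settings that also benefit fragment queries. The values are placeholders to size against each worker's hardware, not recommendations:

```sql
-- Illustrative values only; size these to the worker's RAM and workload
ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET effective_cache_size = '24GB';
ALTER SYSTEM SET shared_buffers = '8GB';  -- takes effect only after a restart
SELECT pg_reload_conf();
```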
We discuss the relevant performance tuning steps in the Query Performance Tuning section.
Migrating an existing application to citus sometimes requires adjusting the schema and queries for optimal performance. citus extends Postgres Pro with distributed functionality, but row-based sharding is not a drop-in replacement that scales out all workloads. A performant citus cluster involves thinking about the data model, tooling, and choice of SQL features used.
There is another mode of operation in citus called schema-based sharding. While row-based sharding yields the best performance and hardware efficiency, consider schema-based sharding if you need a more drop-in approach.
The first steps are to optimize the existing database schema so that it can work efficiently across multiple computers.
Next, update application code and queries to deal with the schema changes.
After testing the changes in a development environment, the last step is to migrate production data to a citus cluster and switch over the production app. We have techniques to minimize downtime for this step.
The first step in migrating to citus is
identifying suitable distribution keys and planning table distribution
accordingly. In multi-tenant applications this will typically be an
internal identifier for tenants. We typically refer to it as the
tenant_id. The use cases may vary, so we advise being
thorough on this step.
For guidance, consult these sections:
Review your environment to be sure that the ideal distribution key is chosen. To do so, examine schema layouts, larger tables, long-running and/or problematic queries, standard use cases, and more.
Once a distribution key is identified, review the schema to identify how each table will be handled and whether any modifications to table layouts will be required.
Tables will generally fall into one of the following categories:
Ready for distribution. These tables already contain the distribution key, and are ready for distribution.
Needs backfill. These tables can be logically distributed by the chosen key but do not contain a column directly referencing it. The tables will be modified later to add the column.
Reference table. These tables are typically small, do not contain the distribution key, are commonly joined by distributed tables, and/or are shared across tenants. A copy of each of these tables will be maintained on all nodes. Common examples include country code lookups, product categories, and the like.
Local table. These are typically not joined to other tables, and do not contain the distribution key. They are maintained exclusively on the coordinator node. Common examples include admin user lookups and other utility tables.
Consider an example multi-tenant application similar to Etsy or Shopify where each tenant is a store. A simplified schema is presented in the diagram below. (Underlined items are primary keys, italicized items are foreign keys.)
Figure J.14. Simplified Schema Example
In this example stores are a natural tenant. The
tenant_id is in this case the
store_id. After distributing tables in the cluster,
we want rows relating to the same store to reside together on the same
nodes.
Once the scope of needed database changes is identified, the next major step is to modify the data structure for the application's existing database. First, tables requiring backfill are modified to add a column for the distribution key.
In our storefront example the stores and products tables have a
store_id and are ready for distribution. Being
normalized, the line_items table lacks
store_id. If we want to distribute by
store_id, the table needs this column.
-- Denormalize line_items by including store_id
ALTER TABLE line_items ADD COLUMN store_id uuid;
Be sure to check that the distribution column has the same type in all
tables, e.g. do not mix int and bigint.
The column types must match to ensure proper data co-location.
Once the schema is updated, backfill missing values for the
tenant_id column in tables where the column was added.
In our example line_items requires values for
store_id.
We backfill the table by obtaining the missing values from a join query with orders:
UPDATE line_items
SET store_id = orders.store_id
FROM orders
WHERE line_items.order_id = orders.order_id;
Doing the whole table at once may cause too much load on the database and disrupt other queries. The backfill can be done more slowly instead. One way to do that is to make a function that backfills small batches at a time, then call the function repeatedly with pg_cron.
-- The function to backfill up to one
-- thousand rows from line_items
CREATE FUNCTION backfill_batch()
RETURNS void LANGUAGE sql AS $$
  WITH batch AS (
    SELECT line_item_id, order_id
    FROM line_items
    WHERE store_id IS NULL
    LIMIT 1000
    FOR UPDATE
    SKIP LOCKED
  )
  UPDATE line_items AS li
  SET store_id = orders.store_id
  FROM batch, orders
  WHERE batch.line_item_id = li.line_item_id
    AND batch.order_id = orders.order_id;
$$;
-- Run the function every quarter hour
SELECT cron.schedule('*/15 * * * *', 'SELECT backfill_batch()');
-- Note the return value of cron.schedule
Once the backfill is caught up, the cron job can be disabled:
-- Assuming 42 is the job id returned
-- from cron.schedule
SELECT cron.unschedule(42);
When modifying the application to work with citus, you will need a database to test against. Follow the instructions in the Installing citus on a Single Node section to set up the extension.
Next dump a copy of the schema from your application's original database and restore the schema in the new development database.
# get schema from source db
pg_dump \
   --format=plain \
   --no-owner \
   --schema-only \
   --file=schema.sql \
   --schema=target_schema \
   postgres://user:pass@host:5432/db

# load schema into test db
psql postgres://user:pass@testhost:5432/db -f schema.sql
The schema should include a distribution key (tenant_id)
in all tables you wish to distribute. Before running
pg_dump
for the schema, be sure to
prepare source tables for migration.
citus
cannot enforce
uniqueness constraints unless a unique index or primary key
contains the distribution column. Thus we must modify primary and
foreign keys in our example to include store_id.
Some of the libraries listed in the next section are able to help migrate the database schema to include the distribution column in keys. However, here is an example of the underlying SQL commands to turn the simple keys composite in the development database:
BEGIN;

-- Drop simple primary keys (cascades to foreign keys)
ALTER TABLE products DROP CONSTRAINT products_pkey CASCADE;
ALTER TABLE orders DROP CONSTRAINT orders_pkey CASCADE;
ALTER TABLE line_items DROP CONSTRAINT line_items_pkey CASCADE;

-- Recreate primary keys to include would-be distribution column
ALTER TABLE products ADD PRIMARY KEY (store_id, product_id);
ALTER TABLE orders ADD PRIMARY KEY (store_id, order_id);
ALTER TABLE line_items ADD PRIMARY KEY (store_id, line_item_id);

-- Recreate foreign keys to include would-be distribution column
ALTER TABLE line_items ADD CONSTRAINT line_items_store_fkey
  FOREIGN KEY (store_id) REFERENCES stores (store_id);
ALTER TABLE line_items ADD CONSTRAINT line_items_product_fkey
  FOREIGN KEY (store_id, product_id)
  REFERENCES products (store_id, product_id);
ALTER TABLE line_items ADD CONSTRAINT line_items_order_fkey
  FOREIGN KEY (store_id, order_id)
  REFERENCES orders (store_id, order_id);

COMMIT;
Thus completed, our schema from the previous section will look like this (Underlined items are primary keys, italicized items are foreign keys.):
Figure J.15. Simplified Schema Example
Be sure to modify data flows to add keys to incoming data.
Once the distribution key is present on all appropriate tables, the application needs to include it in queries. Take the following steps using a copy of the application running in a development environment, and testing against a citus back-end. After the application is working with the extension we will see how to migrate production data from the source database into a real citus cluster.
Application code and any other ingestion processes that write to the tables should be updated to include the new columns.
Running the application test suite against the modified schema on citus is a good way to determine which areas of the code need to be modified.
It is a good idea to enable database logging. The logs can help uncover stray cross-shard queries in a multi-tenant app that should be converted to per-tenant queries.
Cross-shard queries are supported, but in a multi-tenant application
most queries should be targeted to a single node. For simple
SELECT, UPDATE, and
DELETE queries this means that the
WHERE clause should filter by
tenant_id. citus can then
run these queries efficiently on a single node.
There are helper libraries for a number of popular application
frameworks that make it easy to include tenant_id
in queries:
It is possible to use the libraries for database writes first (including data ingestion) and later for read queries. The activerecord-multi-tenant gem, for instance, has a write-only mode that modifies only the write queries.
If you are using a different ORM than those above or executing multi-tenant queries more directly in SQL, follow these general principles. We will use our earlier example of the e-commerce application.
Suppose we want to get the details for an order. Distributed queries
that filter on the tenant_id run most efficiently
in multi-tenant apps, so the change below makes the query faster
(while both queries return the same results):
-- Before
SELECT * FROM orders WHERE order_id = 123;

-- After
SELECT * FROM orders
WHERE order_id = 123
  AND store_id = 42; -- <== added
The tenant_id column is not just beneficial but
critical for INSERT statements. Inserts must
include a value for the tenant_id column or else
citus will be unable to route the data to
the correct shard and will raise an error.
Finally, when joining tables make sure to filter by
tenant_id too. For instance, here is how to inspect
how many “awesome wool pants” a given store has sold:
-- One way is to include store_id in the join and also
-- filter by it in one of the queries
SELECT sum(l.quantity)
FROM line_items l
INNER JOIN products p
ON l.product_id = p.product_id
AND l.store_id = p.store_id
WHERE p.name='Awesome Wool Pants'
AND l.store_id='8c69aa0d-3f13-4440-86ca-443566c1fc75'
-- Equivalently you omit store_id from the join condition
-- but filter both tables by it. This may be useful if
-- building the query in an ORM
SELECT sum(l.quantity)
FROM line_items l
INNER JOIN products p ON l.product_id = p.product_id
WHERE p.name='Awesome Wool Pants'
AND l.store_id='8c69aa0d-3f13-4440-86ca-443566c1fc75'
AND p.store_id='8c69aa0d-3f13-4440-86ca-443566c1fc75'
Clients should connect to citus with SSL to protect information and prevent man-in-the-middle attacks.
With large and complex application code-bases, certain queries
generated by the application can often be overlooked and thus will
not have the tenant_id filter on them.
The citus parallel executor will still execute
these queries successfully, so during testing they
remain hidden, since the application still works fine. However, if a
query does not contain the tenant_id filter,
citus executor will hit every shard in
parallel, but only one will return any data. This consumes resources
needlessly and may exhibit itself as a problem only when one moves to
a higher-throughput production environment.
To prevent encountering such issues only after launching in production, one can set a config value to log queries, which hit more than one shard. In a properly configured and migrated multi-tenant application, each query should only hit one shard at a time.
During testing, one can configure the following:
-- Adjust for your own database's name of course
ALTER DATABASE citus
SET citus.multi_task_query_log_level = 'error';
citus will then error out if it encounters queries that are going to hit more than one shard. Erroring out during testing allows the application developer to find and migrate such queries.
During a production launch, one can configure the same setting to log, instead of error out:
ALTER DATABASE citus SET citus.multi_task_query_log_level = 'log';
Visit the citus.multi_task_query_log_level section description to learn more about the supported values.
At this time, having updated the database schema and application queries to work with citus, you are ready for the final step. It is time to migrate data to the citus cluster and cut over the application to its new database. The data migration procedure is presented in the Database Migration section.
For smaller environments that can tolerate a little downtime, use a simple pg_dump/pg_restore process. Here are the steps:
Save the database structure from your development database:
pg_dump \
--format=plain \
--no-owner \
--schema-only \
--file=schema.sql \
--schema=target_schema \
postgres://user:pass@host:5432/db
Connect to the citus cluster using psql and create a schema:
\i schema.sql
Call the create_distributed_table and create_reference_table functions. If you get an error about foreign keys, it is generally due to the order of operations. Drop foreign keys before distributing tables and then re-add them.
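For instance, in the storefront schema used earlier, line_items references orders, so the referenced table must be distributed first; this ordering is a sketch, not a complete migration script:

```sql
-- Distribute the foreign key destination before its source
SELECT create_distributed_table('orders', 'store_id');
SELECT create_distributed_table('line_items', 'store_id',
                                colocate_with => 'orders');
```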
Put the application into maintenance mode and disable any other writes to the old database.
Save the data from the original production database to disk with pg_dump:
pg_dump \
--format=custom \
--no-owner \
--data-only \
--file=data.dump \
--schema=target_schema \
postgres://user:pass@host:5432/db
Import into citus using pg_restore:
# remember to use connection details for citus,
# not the source database
pg_restore \
   --host=host \
   --dbname=dbname \
   --username=username \
   data.dump

# it will prompt you for the connection password
Test application.
citus supports schema-based sharding, which allows a schema to be distributed. Distributed schemas are automatically associated with individual co-location groups such that the tables created in those schemas will be automatically converted to co-located distributed tables without a shard key.
There are two ways in which a schema can be distributed in citus:
Manually by calling the citus_schema_distribute function:
SELECT citus_schema_distribute('user_service');
This method also allows you to convert existing regular schemas into distributed schemas.
You can only distribute schemas that do not contain distributed or reference tables.
An alternative approach is to enable the citus.enable_schema_based_sharding configuration parameter:
SET citus.enable_schema_based_sharding TO ON;
CREATE SCHEMA AUTHORIZATION user_service;
The parameter can be changed for the current session or permanently
in postgresql.conf. With the parameter set to
ON, all created schemas will be distributed by
default.
The process of distributing a schema automatically assigns it to an existing node in the cluster and moves it there. The background shard rebalancer takes these schemas and all the tables within them into account when rebalancing the cluster, performing optimal moves and migrating the schemas between the nodes in the cluster.
To convert a schema back into a regular Postgres Pro schema, use the citus_schema_undistribute function:
SELECT citus_schema_undistribute('user_service');
The tables and data in the user_service schema
will be moved from the current node back to the coordinator node in the
cluster.
To create a distributed table, you need to first define the table schema.
To do so, you can define a table using the
CREATE TABLE
command in the same way as you would do with a regular
Postgres Pro table.
CREATE TABLE github_events
(
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
actor jsonb,
org jsonb,
created_at timestamp
);
Next, you can use the create_distributed_table function to specify the table distribution column and create the worker shards.
SELECT create_distributed_table('github_events', 'repo_id');
This function informs citus that the
github_events table should be distributed on the
repo_id column (by hashing the column value). The
function also creates shards on the worker nodes using the
citus.shard_count configuration
parameter.
This example would create a total of citus.shard_count
shards, where each shard owns a portion of the hash token space.
Once the shards are created, this function saves all distributed
metadata on the coordinator.
Each created shard is assigned a unique shard_id.
Each shard is represented on the worker node as a regular
Postgres Pro table with the
tablename_shardid name where tablename
is the name of the distributed table and shardid is
the unique ID assigned to that shard. You can connect to the worker
Postgres Pro instances to view or run commands
on individual shards.
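The shard metadata saved on the coordinator can be inspected directly; pg_dist_shard is one of the citus metadata tables:

```sql
-- List the shards of github_events with their hash token ranges
SELECT shardid, shardminvalue, shardmaxvalue
FROM pg_dist_shard
WHERE logicalrelid = 'github_events'::regclass
ORDER BY shardid;
```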
You are now ready to insert data into the distributed table and run queries on it. You can also learn more about the function used in this section in the citus Utility Functions section.
The above method distributes tables into multiple horizontal shards, but another possibility is distributing tables into a single shard and replicating the shard to every worker node. Tables distributed this way are called reference tables. They are used to store data that needs to be frequently accessed by multiple nodes in a cluster.
Common candidates for reference tables include:
Smaller tables that need to join with larger distributed tables.
Tables in multi-tenant apps that lack a tenant_id
column or which are not associated with a tenant.
(In some cases, to reduce migration effort, users might even
choose to make reference tables out of tables associated with a
tenant but which currently lack a tenant ID.)
Tables that need unique constraints across multiple columns and are small enough.
For instance, suppose a multi-tenant eCommerce site needs to calculate sales tax for transactions in any of its stores. Tax information is not specific to any tenant. It makes sense to consolidate it in a shared table. A US-centric reference table might look like this:
-- A reference table
CREATE TABLE states (
code char(2) PRIMARY KEY,
full_name text NOT NULL,
general_sales_tax numeric(4,3)
);
-- Distribute it to all workers
SELECT create_reference_table('states');
Now queries such as one calculating tax for a shopping cart can join
on the states table with no network overhead and
can add a foreign key to the state code for better validation.
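A sketch of both ideas follows; the orders table and its ship_state and subtotal columns are hypothetical:

```sql
-- Foreign keys from distributed tables to reference tables are allowed,
-- so state codes can be validated on insert
ALTER TABLE orders
    ADD COLUMN ship_state char(2) REFERENCES states (code);

-- The join against the replicated states table runs locally on each node
SELECT o.order_id,
       round(o.subtotal * (1 + s.general_sales_tax), 2) AS total
FROM orders o
JOIN states s ON s.code = o.ship_state;
```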
In addition to distributing a table as a single replicated shard, the create_reference_table function marks it as a reference table in the citus metadata tables. citus automatically performs two-phase commits for modifications to tables marked this way, which provides strong consistency guarantees.
If you have an existing distributed table, you can change it to a reference table by running:
SELECT undistribute_table('table_name');
SELECT create_reference_table('table_name');
For another example of using reference tables in a multi-tenant application, see the Sharing Data Between Tenants section.
If an existing Postgres Pro database is converted into the coordinator node for a citus cluster, the data in its tables can be distributed efficiently and with minimal interruption to an application.
The
create_distributed_table
function described earlier works on both empty and non-empty tables
and for the latter it automatically distributes table rows throughout
the cluster. You will know if it does this by the presence of the
following message:
NOTICE: Copying data from local table...
For example:
CREATE TABLE series AS SELECT i FROM generate_series(1,1000000) i;
SELECT create_distributed_table('series', 'i');
NOTICE: Copying data from local table...
NOTICE: copying the data has completed
DETAIL: The local data in the table is no longer visible, but is still on disk.
HINT: To remove the local data, run: SELECT truncate_local_data_after_distributing_table($$public.series$$)
create_distributed_table
--------------------------
(1 row)
Writes on the table are blocked while the data is migrated, and pending writes are handled as distributed queries once the function commits. (If the function fails, then the queries become local again.) Reads can continue as normal and will become distributed queries once the function commits.
When distributing tables A and B, where A has a foreign key to B, distribute the key destination table B first. Doing it in the wrong order will cause an error:
ERROR: cannot create foreign key constraint DETAIL: Referenced table must be a distributed table or a reference table.
If it is not possible to distribute in the correct order, then drop the foreign keys, distribute the tables, and recreate the foreign keys.
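That workaround might look like the following sketch; the table, column, and constraint names are illustrative:

```sql
-- Drop the foreign key that blocks distribution
ALTER TABLE A DROP CONSTRAINT a_b_fkey;

-- Distribute both tables (order no longer matters)
SELECT create_distributed_table('B', 'b_id');
SELECT create_distributed_table('A', 'b_id');

-- Recreate the foreign key
ALTER TABLE A ADD CONSTRAINT a_b_fkey
  FOREIGN KEY (b_id) REFERENCES B (b_id);
```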
After the tables are distributed, use the truncate_local_data_after_distributing_table function to remove local data. Leftover local data in distributed tables is inaccessible to citus queries and can cause irrelevant constraint violations on the coordinator.
Co-location is the practice of dividing data tactically, keeping related information on the same computers to enable efficient relational operations, while taking advantage of the horizontal scalability for the whole dataset. For more information and examples, see the Table Co-Location section.
Tables are co-located in groups. To manually control a table's
co-location group assignment use the optional
colocate_with parameter of the
create_distributed_table
function. If you do not care about a table's co-location, then omit this
parameter. It defaults to the value 'default', which
groups the table with any other default co-location table having the same
distribution column type and shard count. If you want to break or update
this implicit co-location, you can use the
update_distributed_table_colocation
function.
-- These tables are implicitly co-located by using the same
-- distribution column type and shard count with the default
-- co-location group
SELECT create_distributed_table('A', 'some_int_col');
SELECT create_distributed_table('B', 'other_int_col');
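If these tables later need to be separated or re-grouped, the update_distributed_table_colocation function mentioned above can be applied, for instance:

```sql
-- Break B out of its implicit co-location group
SELECT update_distributed_table_colocation('B', colocate_with => 'none');

-- Co-locate B with A again
SELECT update_distributed_table_colocation('B', colocate_with => 'A');
```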
When a new table is not related to others in its would-be implicit
co-location group, specify colocate_with => 'none'.
-- Not co-located with other tables
SELECT create_distributed_table('A', 'foo', colocate_with => 'none');
Splitting unrelated tables into their own co-location groups will improve shard rebalancing performance, because shards in the same group have to be moved together.
When tables are indeed related (for instance when they will be joined), it can make sense to explicitly co-locate them. The gains of appropriate co-location are more important than any rebalancing overhead.
To explicitly co-locate multiple tables, distribute one and then put the others into its co-location group. For example:
-- Distribute stores
SELECT create_distributed_table('stores', 'store_id');
-- Add to the same group as stores
SELECT create_distributed_table('orders', 'store_id', colocate_with => 'stores');
SELECT create_distributed_table('products', 'store_id', colocate_with => 'stores');
Information about co-location groups is stored in the pg_dist_colocation table, while the pg_dist_partition table reveals which tables are assigned to which groups.
You can use the standard Postgres Pro
DROP TABLE command to remove your distributed tables.
As with regular tables, DROP TABLE removes any
indexes, rules, triggers, and constraints that exist for the target
table. In addition, it also drops the shards on the worker nodes and
cleans up their metadata.
DROP TABLE github_events;
citus automatically propagates many kinds of DDL statements, which means that modifying a distributed table on the coordinator node will update shards on the workers too. Other DDL statements require manual propagation, and certain others are prohibited, such as those that would modify a distribution column. Attempting to run DDL that is ineligible for automatic propagation raises an error and leaves the tables on the coordinator node unchanged.
The following is a reference for the categories of DDL statements that propagate. Note that automatic propagation can be enabled or disabled with the citus.enable_ddl_propagation configuration parameter.
citus propagates most
ALTER TABLE
commands automatically. Adding columns or changing their default
values work as they would in a single-machine
Postgres Pro database:
-- Adding a column
ALTER TABLE products ADD COLUMN description text;

-- Changing default value
ALTER TABLE products ALTER COLUMN price SET DEFAULT 7.77;
Significant changes to an existing column like renaming it or changing its data type are fine too. However, the data type of the distribution column cannot be altered. This column determines how table data distributes through the citus cluster, and modifying its data type would require moving the data.
Attempting to do so causes an error:
-- Assuming store_id is the distribution column
-- for products and that it has type integer
ALTER TABLE products ALTER COLUMN store_id TYPE text;
/*
ERROR: cannot execute ALTER TABLE command involving partition column
*/
As a workaround, you can consider changing the distribution column using the alter_distributed_table function, updating it, and changing it back.
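A sketch of that workaround follows, using the alter_distributed_table function; the column other_col is a hypothetical stand-in for another column suitable for distribution:

```sql
-- Temporarily distribute by a different column (hypothetical other_col)
SELECT alter_distributed_table('products', distribution_column => 'other_col');

-- Alter the former distribution column
ALTER TABLE products ALTER COLUMN store_id TYPE text;

-- Restore store_id as the distribution column
SELECT alter_distributed_table('products', distribution_column => 'store_id');
```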
Using citus allows you to continue to enjoy the safety of a relational database, including database constraints. Due to the nature of distributed systems, citus will not cross-reference uniqueness constraints or referential integrity between worker nodes.
To set up a foreign key between co-located distributed tables, always include the distribution column in the key. This may involve making the key compound.
Foreign keys may be created in these situations:
between two local (non-distributed) tables,
between two reference tables,
between reference tables and local tables (by default enabled via the citus.enable_local_reference_table_foreign_keys configuration parameter),
between two co-located distributed tables when the key includes the distribution column, or
as a distributed table referencing a reference table.
Foreign keys from reference tables to distributed tables are not supported.
citus supports all
referential actions on
foreign keys from local to reference tables but does not support
ON DELETE/ON UPDATE CASCADE in
the reverse direction (reference to local).
Primary keys and uniqueness constraints must include the distribution column. Adding them to a non-distribution column generates the error: creating unique indexes on non-partition columns is currently unsupported.
This example shows how to create primary and foreign keys on distributed tables:
--
-- Adding a primary key
-- --------------------
-- We will distribute these tables on the account_id. The ads and clicks
-- tables must use compound keys that include account_id
ALTER TABLE accounts ADD PRIMARY KEY (id);
ALTER TABLE ads ADD PRIMARY KEY (account_id, id);
ALTER TABLE clicks ADD PRIMARY KEY (account_id, id);
-- Next distribute the tables
SELECT create_distributed_table('accounts', 'id');
SELECT create_distributed_table('ads', 'account_id');
SELECT create_distributed_table('clicks', 'account_id');
--
-- Adding foreign keys
-- -------------------
-- Note that this can happen before or after distribution, as long as
-- there exists a uniqueness constraint on the target column(s), which
-- can only be enforced before distribution
ALTER TABLE ads ADD CONSTRAINT ads_account_fk
FOREIGN KEY (account_id) REFERENCES accounts (id);
ALTER TABLE clicks ADD CONSTRAINT clicks_ad_fk
FOREIGN KEY (account_id, ad_id) REFERENCES ads (account_id, id);
Similarly, include the distribution column in uniqueness constraints:
-- Suppose we want every ad to use a unique image. Notice we can
-- enforce it only per account when we distribute by account_id
ALTER TABLE ads ADD CONSTRAINT ads_unique_image
  UNIQUE (account_id, image_url);
Not-null constraints can be applied to any column (distribution or not) because they require no lookups between workers.
ALTER TABLE ads ALTER COLUMN image_url SET NOT NULL;
NOT VALID Constraints #
In some situations it can be useful to enforce constraints for new
rows, while allowing existing non-conforming rows to remain unchanged.
citus supports this feature for the
CHECK constraints and foreign keys using the
Postgres Pro NOT VALID
constraint designation.
For example, consider an application that stores user profiles in a reference table.
-- We are using the "text" column type here, but a real application
-- might use "citext", which is available in the
-- Postgres Pro contrib module
CREATE TABLE users ( email text PRIMARY KEY );
SELECT create_reference_table('users');
In the course of time imagine that a few non-addresses get into the table.
INSERT INTO users VALUES
('foo@example.com'), ('hacker12@aol.com'), ('lol');
We would like to validate the addresses, but
Postgres Pro does not ordinarily allow us to
add the CHECK constraint that fails for existing
rows. However, it does allow a constraint marked
NOT VALID:
ALTER TABLE users
ADD CONSTRAINT syntactic_email
CHECK (email ~
'^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'
) NOT VALID;
This succeeds, and new rows are protected.
INSERT INTO users VALUES ('fake');
/*
ERROR: new row for relation "users_102010" violates
check constraint "syntactic_email_102010"
DETAIL: Failing row contains (fake).
*/
Later, during non-peak hours, a database administrator can attempt to fix the bad rows and re-validate the constraint.
-- Later, attempt to validate all rows
ALTER TABLE users VALIDATE CONSTRAINT syntactic_email;
The Postgres Pro documentation has more
information about NOT VALID and
VALIDATE CONSTRAINT in the
section about the
ALTER TABLE command.
citus supports adding and removing indexes:
-- Adding an index
CREATE INDEX clicked_at_idx ON clicks USING BRIN (clicked_at);

-- Removing an index
DROP INDEX clicked_at_idx;
Adding an index takes a write lock, which can be undesirable in a multi-tenant “system-of-record”. To minimize application downtime, create the index concurrently instead. This method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment.
-- Adding an index without locking table writes
CREATE INDEX CONCURRENTLY clicked_at_idx ON clicks USING BRIN (clicked_at);
Creating custom SQL types and user-defined functions propagates to worker nodes. However, creating such database objects in a transaction with distributed operations involves tradeoffs.
citus parallelizes operations such as create_distributed_table across shards using multiple connections per worker. In contrast, when creating a database object, citus propagates it to worker nodes over a single connection per worker. Combining the two operations in a single transaction may cause issues, because the parallel connections cannot see an object that was created over a single connection but not yet committed.
Consider a transaction block that creates a type, a table, loads data, and distributes the table:
BEGIN;
-- Type creation over a single connection:
CREATE TYPE coordinates AS (x int, y int);
CREATE TABLE positions (object_id text primary key, position coordinates);
-- Data loading thus goes over a single connection:
SELECT create_distributed_table('positions', 'object_id');
\COPY positions FROM 'positions.csv'
COMMIT;
The default citus behavior prioritizes schema
consistency between the coordinator and worker nodes. This behavior has a
downside: if object propagation happens after a
parallel command in the same transaction, then the transaction can
no longer be completed, as highlighted by the
ERROR in the code block below:
BEGIN;
CREATE TABLE items (key text, value text);
-- Parallel data loading:
SELECT create_distributed_table('items', 'key');
\COPY items FROM 'items.csv'
CREATE TYPE coordinates AS (x int, y int);
ERROR: cannot run type command because there was a parallel operation on a distributed table in the transaction
If you run into this issue, there is a simple workaround: set the
citus.multi_shard_modify_mode parameter to
sequential to disable per-node parallelism. Data loading
in the same transaction might then be slower.
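Applied to the failing transaction above, the workaround might be sketched as follows (items.csv is the same hypothetical data file):

```sql
BEGIN;
CREATE TABLE items (key text, value text);

-- Disable per-node parallelism for the rest of this transaction
SET LOCAL citus.multi_shard_modify_mode TO 'sequential';

SELECT create_distributed_table('items', 'key');
\COPY items FROM 'items.csv'

-- Object creation now succeeds, since no parallel
-- connections were opened earlier in the transaction
CREATE TYPE coordinates AS (x int, y int);
COMMIT;
```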
Most DDL commands are auto-propagated. For any others, you can propagate the changes manually. See the Manual Query Propagation section.
To insert data into distributed tables, you can use the standard
Postgres Pro
INSERT
command. As an example, we pick two rows randomly from the
GitHub Archive dataset.
/*
CREATE TABLE github_events
(
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
actor jsonb,
org jsonb,
created_at timestamp
);
*/
INSERT INTO github_events VALUES (2489373118,'PublicEvent','t',24509048,'{}','{"id": 24509048, "url": "https://api.github.com/repos/SabinaS/csee6868", "name": "SabinaS/csee6868"}','{"id": 2955009, "url": "https://api.github.com/users/SabinaS", "login": "SabinaS", "avatar_url": "https://avatars.githubusercontent.com/u/2955009?", "gravatar_id": ""}',NULL,'2015-01-01 00:09:13');
INSERT INTO github_events VALUES (2489368389,'WatchEvent','t',28229924,'{"action": "started"}','{"id": 28229924, "url": "https://api.github.com/repos/inf0rmer/blanket", "name": "inf0rmer/blanket"}','{"id": 1405427, "url": "https://api.github.com/users/tategakibunko", "login": "tategakibunko", "avatar_url": "https://avatars.githubusercontent.com/u/1405427?", "gravatar_id": ""}',NULL,'2015-01-01 00:00:24');
When inserting rows into distributed tables, the distribution column of
the row being inserted must be specified. Based on the distribution
column, citus determines the right shard to
which the insert should be routed. Then, the query is forwarded to
that shard, and the remote INSERT command is
executed on all the replicas of that shard.
Sometimes it is convenient to put multiple INSERT
statements together into a single INSERT of multiple
rows. It can also be more efficient than making repeated database
queries. For instance, the example from the previous section can be
loaded all at once like this:
INSERT INTO github_events VALUES
(
2489373118,'PublicEvent','t',24509048,'{}','{"id": 24509048, "url": "https://api.github.com/repos/SabinaS/csee6868", "name": "SabinaS/csee6868"}','{"id": 2955009, "url": "https://api.github.com/users/SabinaS", "login": "SabinaS", "avatar_url": "https://avatars.githubusercontent.com/u/2955009?", "gravatar_id": ""}',NULL,'2015-01-01 00:09:13'
), (
2489368389,'WatchEvent','t',28229924,'{"action": "started"}','{"id": 28229924, "url": "https://api.github.com/repos/inf0rmer/blanket", "name": "inf0rmer/blanket"}','{"id": 1405427, "url": "https://api.github.com/users/tategakibunko", "login": "tategakibunko", "avatar_url": "https://avatars.githubusercontent.com/u/1405427?", "gravatar_id": ""}',NULL,'2015-01-01 00:00:24'
);
citus also supports
INSERT … SELECT statements, which insert rows based
on the results of the SELECT query. This is a
convenient way to fill tables and also allows
UPSERTS with the
ON CONFLICT clause, the easiest way
to do distributed rollups.
In citus there are three ways that
inserting from a SELECT statement can happen:
The first is when the source tables and the destination table are
co-located and the
SELECT/INSERT statements
both include the distribution column. In this case,
citus can push the
INSERT … SELECT statement down for parallel
execution on all nodes.
The second way of executing the INSERT … SELECT
statement is by repartitioning the result set
into chunks and sending those chunks among workers to matching
destination table shards. Each worker node can then insert the values
into its local destination shards.
The repartitioning optimization can happen when the
SELECT query does not require a merge step on
the coordinator. It does not work with the following SQL
features, which require a merge step:
ORDER BY
LIMIT
OFFSET
GROUP BY when distribution column is not
part of the group key
Window functions when partitioning by a non-distribution column in the source table(s)
Joins between non-colocated tables (i.e. repartition joins)
When the source and destination tables are not co-located and
the repartition optimization cannot be applied, then
citus uses the third way of executing
INSERT … SELECT. It selects the results from
worker nodes and pulls the data up to the coordinator node. The
coordinator redirects rows back down to the appropriate shard.
Because all the data must pass through a single node, this
method is not as efficient.
When in doubt about which method citus is
using, use the EXPLAIN command, as described in the
Postgres Pro Tuning
section. When the target table has a very large shard count, it may be
wise to disable repartitioning, see the
citus.enable_repartitioned_insert_select
configuration parameter.
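For instance, a sketch of checking the chosen execution method with EXPLAIN; the table names here are illustrative:

```sql
-- Inspect which INSERT ... SELECT strategy citus chose
EXPLAIN INSERT INTO target_table
SELECT site_id, count(*) FROM source_table GROUP BY site_id;

-- If repartitioning is undesirable for a large shard count,
-- it can be disabled:
SET citus.enable_repartitioned_insert_select TO off;
```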
\copy Command (Bulk Load) #
To bulk load data from a file, you can directly use the
\copy
command.
First download our example github_events dataset by
running:
wget http://examples.citusdata.com/github_archive/github_events-2015-01-01-{0..5}.csv.gz
gzip -d github_events-2015-01-01-*.gz
Then, you can copy the data using psql. Note that this data requires the database to have UTF-8 encoding:
\COPY github_events FROM 'github_events-2015-01-01-0.csv' WITH (format CSV)
There is no notion of snapshot isolation across shards, which
means that a multi-shard SELECT that runs
concurrently with the \copy command might see it
committed on some shards but not on others. If you are storing
events data, you may occasionally observe small gaps in recent data.
It is up to applications to deal with this if it is a problem (e.g.
exclude the most recent data from queries or use some lock).
If \copy fails to open a connection for a shard
placement, then it behaves in the same way as INSERT,
namely to mark the placement(s) as inactive unless there are no more
active placements. If any other failure occurs after connecting, the
transaction is rolled back and thus no metadata changes are made.
Applications like event data pipelines and real-time dashboards require sub-second queries on large volumes of data. One way to make these queries fast is by calculating and saving aggregates ahead of time. This is called “rolling up” the data and it avoids the cost of processing raw data at run-time. As an extra benefit, rolling up timeseries data into hourly or daily statistics can also save space. Old data may be deleted when its full details are no longer needed and aggregates suffice.
For example, here is a distributed table for tracking page views by URL:
CREATE TABLE page_views (
site_id int,
url text,
host_ip inet,
view_time timestamp default now(),
PRIMARY KEY (site_id, url)
);
SELECT create_distributed_table('page_views', 'site_id');
Once the table is populated with data, we can run an aggregate query to count page views per URL per day, restricting to a given site and year.
-- How many views per url per day on site 5?
SELECT view_time::date AS day, site_id, url, count(*) AS view_count
FROM page_views
WHERE site_id = 5 AND
view_time >= date '2016-01-01' AND view_time < date '2017-01-01'
GROUP BY view_time::date, site_id, url;
The setup described above works but has two drawbacks. First, when you repeatedly execute the aggregate query, it must go over each related row and recompute the results for the entire data set. If you are using this query to render a dashboard, it is faster to save the aggregated results in a daily page views table and query that table. Second, storage costs will grow proportionally with data volumes and the length of queryable history. In practice, you may want to keep raw events for a short time period and look at historical graphs over a longer time window.
To receive those benefits, we can create the
daily_page_views table to store the daily statistics.
CREATE TABLE daily_page_views (
site_id int,
day date,
url text,
view_count bigint,
PRIMARY KEY (site_id, day, url)
);
SELECT create_distributed_table('daily_page_views', 'site_id');
In this example, we distributed both page_views
and daily_page_views on the
site_id column. This ensures that data corresponding
to a particular site will be
co-located on the same
node. Keeping the rows of the two tables together on each node minimizes
network traffic between nodes and enables highly parallel execution.
Once we create this new distributed table, we can then run
INSERT INTO ... SELECT to roll up raw page views into
the aggregated table. In the following, we aggregate page views each day.
citus users often wait for a certain time
period after the end of day to run a query like this, to accommodate
late arriving data.
-- Roll up yesterday's data
INSERT INTO daily_page_views (day, site_id, url, view_count)
SELECT view_time::date AS day, site_id, url, count(*) AS view_count
FROM page_views
WHERE view_time >= date '2017-01-01' AND view_time < date '2017-01-02'
GROUP BY view_time::date, site_id, url;
-- Now the results are available right out of the table
SELECT day, site_id, url, view_count
FROM daily_page_views
WHERE site_id = 5 AND
day >= date '2016-01-01' AND day < date '2017-01-01';
The rollup query above aggregates data from the previous day and inserts
it into the daily_page_views table. Running the query
once each day means that no rollup table rows need to be updated,
because the new day's data does not affect previous rows.
The situation changes when dealing with late arriving data, or running
the rollup query more than once per day. If any new rows match days
already in the rollup table, the matching counts should increase.
Postgres Pro can handle this situation with
ON CONFLICT, which is its technique for doing
UPSERTS. Here is an example.
-- Roll up from a given date onward,
-- updating daily page views when necessary
INSERT INTO daily_page_views (day, site_id, url, view_count)
SELECT view_time::date AS day, site_id, url, count(*) AS view_count
FROM page_views
WHERE view_time >= date '2017-01-01'
GROUP BY view_time::date, site_id, url
ON CONFLICT (day, url, site_id) DO UPDATE SET
view_count = daily_page_views.view_count + EXCLUDED.view_count;
You can update or delete rows from your distributed tables using
the standard Postgres Pro
UPDATE and
DELETE commands.
DELETE FROM github_events WHERE repo_id IN (24509048, 24509049);

UPDATE github_events SET event_public = TRUE WHERE (org->>'id')::int = 5430905;
When the UPDATE/DELETE operations
affect multiple shards as in the above example,
citus defaults to using a one-phase commit
protocol. For greater safety you can enable two-phase commits by
setting the citus.multi_shard_commit_protocol
configuration parameter:
SET citus.multi_shard_commit_protocol = '2pc';
If an UPDATE or DELETE operation
affects only a single shard, then it runs within a single worker node.
In this case enabling 2PC is unnecessary. This often happens when
updates or deletes filter by a table's distribution column:
-- Since github_events is distributed by repo_id,
-- this will execute in a single worker node
DELETE FROM github_events WHERE repo_id = 206084;
Furthermore, when dealing with a single shard,
citus supports
SELECT … FOR UPDATE. This is a technique sometimes
used by object-relational mappers (ORMs) to safely:
Load rows
Make a calculation in application code
Update the rows based on calculation
Selecting the rows for update puts a write lock on them to prevent other processes from causing the “lost update” anomaly.
BEGIN;

-- Select events for a repo, but
-- lock them for writing
SELECT *
FROM github_events
WHERE repo_id = 206084
FOR UPDATE;

-- Calculate a desired value event_public using
-- application logic that uses those rows

-- Now make the update
UPDATE github_events
SET event_public = :our_new_value
WHERE repo_id = 206084;

COMMIT;
This feature is supported for hash-distributed and reference tables only.
Both INSERT and
UPDATE/DELETE statements can be
scaled up to around 50,000 queries per second on large machines. However,
to achieve this rate, you will need to use many parallel, long-lived
connections and consider how to deal with locking. For more
information, you can consult the
Scaling Out Data Ingestion
section.
As discussed in the previous sections, citus
extends the latest Postgres Pro for distributed
execution. This means that you can use standard
Postgres Pro
SELECT
queries on the citus coordinator. The
extension will then parallelize the SELECT queries
involving complex selections, groupings and orderings, and
JOINs to speed up the query performance. At a high
level, citus partitions the
SELECT query into smaller query fragments, assigns
these query fragments to workers, oversees their execution, merges their
results (and orders them if needed), and returns the final result to the
user.
In the following sections, we discuss the different types of queries you can run using citus.
citus supports and parallelizes most aggregate functions supported by Postgres Pro, including custom user-defined aggregates. Aggregates execute using one of three methods, in this order of preference:
When the aggregate is grouped by a distribution column of a table, citus can push down execution of the entire query to each worker. All aggregates are supported in this situation and execute in parallel on the worker nodes. (Any custom aggregates being used must be installed on the workers.)
When the aggregate is not grouped by a
distribution column, citus can still
optimize on a case-by-case basis. citus
has internal rules for certain aggregates like
sum(), avg(), and
count(distinct) that allow it to rewrite
queries for partial aggregation on workers. For
instance, to calculate an average, citus
obtains a sum and a count from each worker, and then the coordinator
node computes the final average.
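As an illustration of this rewrite for avg, the two-step computation looks conceptually like the following; the shard name is illustrative, not the actual rewritten query:

```sql
-- Step 1: on each worker, a partial aggregate runs per shard
SELECT sum(price), count(price) FROM products_102008;

-- Step 2: the coordinator combines the partial results:
--   avg = (sum of all partial sums) / (sum of all partial counts)
```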
Full list of the special-case aggregates:
avg, min,
max, sum,
count, array_agg,
jsonb_agg,
jsonb_object_agg, json_agg,
json_object_agg, bit_and,
bit_or, bool_and,
bool_or, every,
hll_add_agg, hll_union_agg,
topn_add_agg, topn_union_agg,
any_value,
tdigest(double precision, int),
tdigest_percentile(double precision, int, double precision),
tdigest_percentile(double precision, int, double precision[]),
tdigest_percentile(tdigest, double precision),
tdigest_percentile(tdigest, double precision[]),
tdigest_percentile_of(double precision, int, double precision),
tdigest_percentile_of(double precision, int, double precision[]),
tdigest_percentile_of(tdigest, double precision),
tdigest_percentile_of(tdigest, double precision[])
Last resort: pull all rows from the workers and perform the aggregation on the coordinator node. When the aggregate is not grouped on a distribution column, and is not one of the predefined special cases, then citus falls back to this approach. It causes network overhead and can exhaust the coordinator resources if the data set to be aggregated is too large. (It is possible to disable this fallback, see below.)
Beware that small changes in a query can change execution modes causing
potentially surprising inefficiency. For example,
sum(x) grouped by a non-distribution column
could use distributed execution, while
sum(distinct x) has to pull up the entire set of
input records to the coordinator.
All it takes is one column to hurt the execution of a whole query.
In the example below, if sum(distinct value2)
has to be grouped on the coordinator, then so will
sum(value1) even if the latter was fine on its own.
SELECT sum(value1), sum(distinct value2) FROM distributed_table;
To avoid accidentally pulling data to the coordinator, you can set
the citus.coordinator_aggregation_strategy parameter:
SET citus.coordinator_aggregation_strategy TO 'disabled';
Note that disabling the coordinator aggregation strategy will prevent “type three” aggregate queries from working at all.
count(distinct) Aggregates #
citus supports
count(distinct) aggregates in several ways. If
the count(distinct) aggregate is on the
distribution column, citus can directly
push down the query to the workers. If not, citus
runs SELECT distinct statements on each worker and
returns the list to the coordinator where it obtains the final count.
Note that transferring this data becomes slower when workers have a
greater number of distinct items. This is especially true for queries
containing multiple count(distinct) aggregates,
e.g.:
-- Multiple distinct counts in one query tend to be slow
SELECT count(distinct a), count(distinct b), count(distinct c)
FROM table_abc;
For queries of this kind, the resulting SELECT
distinct statements on the workers essentially produce a cross-product
of rows to be transferred to the coordinator.
For increased performance you can choose to make an approximate count instead. Follow the steps below:
Download and install the hll extension on all Postgres Pro instances (the coordinator and all the workers).
You can visit the hll GitHub repository for specifics on obtaining the extension.
Create the hll extension on all the Postgres Pro instances by simply running the below command from the coordinator:
CREATE EXTENSION hll;
Enable count(distinct) approximations by
setting the
citus.count_distinct_error_rate
configuration parameter. Lower values for this configuration
setting are expected to give more accurate results but take more
time for computation. We recommend setting this to
0.005.
SET citus.count_distinct_error_rate TO 0.005;
After this step, count(distinct) aggregates
automatically switch to using hll with
no changes necessary to your queries. You should be able to run
approximate count(distinct) queries on any
column of the table.
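For example, reusing the github_events table from earlier (the column choice is illustrative):

```sql
SET citus.count_distinct_error_rate TO 0.005;

-- Approximate number of distinct repos per event type,
-- computed with hll under the hood
SELECT event_type, count(distinct repo_id)
FROM github_events
GROUP BY event_type;
```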
HyperLogLog Column.
Certain users already store their data as hll
columns. In such cases, they can dynamically roll up that data by
calling the hll_union_agg(hll_column) function.
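A minimal sketch, assuming a hypothetical table daily_uniques with an hll column named users; hll_cardinality is provided by the hll extension:

```sql
-- Merge per-day HLL sketches and estimate the number of
-- distinct users over the whole range
SELECT hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE day >= '2017-01-01' AND day < '2017-02-01';
```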
Calculating the first n elements in a set by
applying count, sort, and
limit is simple. However, as data sizes increase,
this method becomes slow and resource intensive. It is more efficient
to use an approximation.
The open source
topn
extension for Postgres Pro enables fast
approximate results to “top-n” queries. The extension
materializes the top values into a json data type.
The topn extension can incrementally update
these top values or merge them on-demand across different time
intervals.
Before seeing a realistic example of topn,
let's see how some of its primitive operations work. First
topn_add updates a JSON object with counts of how
many times a key has been seen:
-- Starting from nothing, record that we saw an "a"
SELECT topn_add('{}', 'a');
-- => {"a": 1}
-- Record the sighting of another "a"
SELECT topn_add(topn_add('{}', 'a'), 'a');
-- => {"a": 2}
The extension also provides aggregations to scan multiple values:
-- For normal_rand
CREATE EXTENSION tablefunc;
-- Count values from a normal distribution
SELECT topn_add_agg(floor(abs(i))::text)
FROM normal_rand(1000, 5, 0.7) i;
-- => {"2": 1, "3": 74, "4": 420, "5": 425, "6": 77, "7": 3}
If the number of distinct values crosses a threshold, the aggregation
drops information for those seen least frequently. This keeps space
usage under control. The threshold can be controlled by the
topn.number_of_counters configuration parameter.
Its default value is 1000.
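For example, to track more values per topn object at the cost of more space:

```sql
SET topn.number_of_counters TO 10000;
```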
Now onto a more realistic example of how topn works in practice. Let's ingest Amazon product reviews from the year 2000 and use topn to query it quickly. First download the dataset:
curl -L https://examples.citusdata.com/customer_reviews_2000.csv.gz | \
    gunzip > reviews.csv
Next, ingest it into a distributed table:
CREATE TABLE customer_reviews
(
customer_id TEXT,
review_date DATE,
review_rating INTEGER,
review_votes INTEGER,
review_helpful_votes INTEGER,
product_id CHAR(10),
product_title TEXT,
product_sales_rank BIGINT,
product_group TEXT,
product_category TEXT,
product_subcategory TEXT,
similar_product_ids CHAR(10)[]
);
SELECT create_distributed_table('customer_reviews', 'product_id');
\COPY customer_reviews FROM 'reviews.csv' WITH CSV
Next we will add the extension, create a destination table to
store the JSON data generated by topn, and
apply the topn_add_agg function we saw
previously.
-- Run the command below from the coordinator; it will be propagated to the worker nodes as well
CREATE EXTENSION topn;
-- A table to materialize the daily aggregate
CREATE TABLE reviews_by_day
(
review_date date unique,
agg_data jsonb
);
SELECT create_reference_table('reviews_by_day');
-- Materialize how many reviews each product got per day
INSERT INTO reviews_by_day
SELECT review_date, topn_add_agg(product_id)
FROM customer_reviews
GROUP BY review_date;
Now, rather than writing a complex window function on
customer_reviews, we can simply apply
topn to reviews_by_day.
For instance, the following query finds the most frequently reviewed
product for each of the first five days:
SELECT review_date, (topn(agg_data, 1)).*
FROM reviews_by_day
ORDER BY review_date
LIMIT 5;
┌─────────────┬────────────┬───────────┐
│ review_date │    item    │ frequency │
├─────────────┼────────────┼───────────┤
│ 2000-01-01  │ 0939173344 │        12 │
│ 2000-01-02  │ B000050XY8 │        11 │
│ 2000-01-03  │ 0375404368 │        12 │
│ 2000-01-04  │ 0375408738 │        14 │
│ 2000-01-05  │ B00000J7J4 │        17 │
└─────────────┴────────────┴───────────┘
The JSON fields created by topn can be
merged with topn_union and
topn_union_agg. We can use the latter to merge
the data for the entire first month and list the five most reviewed
products during that period.
SELECT (topn(topn_union_agg(agg_data), 5)).*
FROM reviews_by_day
WHERE review_date >= '2000-01-01' AND review_date < '2000-02-01'
ORDER BY 2 DESC;
┌────────────┬───────────┐
│    item    │ frequency │
├────────────┼───────────┤
│ 0375404368 │       217 │
│ 0345417623 │       217 │
│ 0375404376 │       217 │
│ 0375408738 │       217 │
│ 043936213X │       204 │
└────────────┴───────────┘
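As a sketch of the non-aggregate form, topn_union merges two topn JSON objects directly:

```sql
SELECT topn_union(
         topn_add(topn_add('{}', 'a'), 'b'),  -- {"a": 1, "b": 1}
         topn_add('{}', 'a')                  -- {"a": 1}
       );
-- keys present in both inputs have their counts summed: "a" ends up at 2
```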
For more details and examples, see the topn readme file.
Finding an exact percentile over a large number of rows can be prohibitively expensive, because all rows must be transferred to the coordinator for final sorting and processing. Finding an approximation, on the other hand, can be done in parallel on worker nodes using a so-called sketch algorithm. The coordinator node then combines compressed summaries into the final result rather than reading through the full rows.
A popular sketch algorithm for percentiles uses a compressed data structure called t-digest, and is available for Postgres Pro in the tdigest extension. citus has integrated support for this extension.
Here is how to use tdigest in citus:
Download and install the tdigest extension on all Postgres Pro nodes (the coordinator and all the workers). The tdigest extension GitHub repository has installation instructions.
Create the tdigest extension within the database. Run the following command on the coordinator:
CREATE EXTENSION tdigest;
The coordinator will propagate the command to the workers as well.
When any of the aggregates defined in the extension are used in queries, citus will rewrite the queries to push down partial tdigest computation to the workers where applicable.
tdigest accuracy can be controlled with the
compression argument passed into aggregates. The
trade-off is accuracy vs the amount of data shared between workers and
the coordinator. For a full explanation of how to use the aggregates in
tdigest, have a look at the documentation of
the extension.
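As a sketch, assuming a distributed table latencies with a double precision column response_time, a 99th-percentile estimate with compression 100 could look like this:

```sql
-- Higher compression gives better accuracy at the cost of larger
-- sketches exchanged between the workers and the coordinator
SELECT tdigest_percentile(response_time, 100, 0.99)
FROM latencies;
```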
citus also pushes down LIMIT clauses to the shards on the workers wherever possible to minimize the amount of data transferred across the network.
However, in some cases, SELECT queries with
LIMIT clauses may need to fetch all rows from each
shard to generate exact results. For example, if the query requires
ordering by the aggregate column, it would need results of that column
from all shards to determine the final aggregate value. This reduces
performance of the LIMIT clause due to the high volume of
network data transfer. In such cases, and where an approximation would
produce meaningful results, citus provides an
option for network efficient approximate LIMIT clauses.
LIMIT approximations are disabled by default and can
be enabled by setting the
citus.limit_clause_row_fetch_count
configuration parameter. On the basis of this configuration value,
citus will limit the number of rows returned
by each task for aggregation on the coordinator. Due to this limit, the
final results may be approximate. Increasing this limit will increase
the accuracy of the final results, while still providing an upper bound
on the number of rows pulled from the workers.
SET citus.limit_clause_row_fetch_count TO 10000;
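With the parameter set, a query such as the following (the page_views table and its columns are assumed for illustration) lets each worker task return at most 10000 rows, so the final top-10 list is approximate:

```sql
-- Ordering by the aggregate would normally require all rows from every
-- shard; with the row fetch count set, each task returns a bounded subset
SELECT page_id, count(*) AS views
FROM page_views
GROUP BY page_id
ORDER BY views DESC
LIMIT 10;
```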
citus supports all views on distributed
tables. To learn more about syntax and features of views, see the
section about the
CREATE VIEW
command.
Note that some views cause a less efficient query plan than others. For more information about detecting and improving poor view performance, see the Subquery/CTE Network Overhead section. (Views are treated inside the extension as subqueries.)
citus supports materialized views as well and stores them as local tables on the coordinator node.
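For illustration, assuming a distributed github_events table, both kinds of views can be created as usual:

```sql
-- A regular view: expanded into the distributed query at execution time
CREATE VIEW active_users AS
  SELECT user_id, count(*) AS event_count
  FROM github_events
  GROUP BY user_id;

-- A materialized view: stored as a local table on the coordinator
CREATE MATERIALIZED VIEW active_users_snapshot AS
  SELECT user_id, count(*) AS event_count
  FROM github_events
  GROUP BY user_id;
```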
citus supports equi-joins between any number of tables irrespective of their size and distribution method. The query planner chooses the optimal join method and join order based on how tables are distributed. It evaluates several possible join orders and creates a join plan that requires the minimum amount of data to be transferred across the network.
When two tables are co-located, they can be joined efficiently on their common distribution columns. A co-located join is the most efficient way to join two large distributed tables.
Internally, the citus coordinator knows which shards of the co-located tables might match with shards of the other table by looking at the distribution column metadata. This allows citus to prune away shard pairs that cannot produce matching join keys. The joins between the remaining shard pairs are executed in parallel on the workers, and the results are then returned to the coordinator.
Be sure that the tables are distributed into the same number of shards
and that the distribution columns of each table have exactly matching
types. Attempting to join on columns of slightly different types such
as int and bigint can cause problems.
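A minimal sketch of a co-located join, with hypothetical table names, where both tables are distributed on bigint columns of exactly the same type and use the same shard count:

```sql
CREATE TABLE customers (customer_id bigint PRIMARY KEY, name text);
CREATE TABLE orders    (customer_id bigint, order_id bigint, total numeric);
SELECT create_distributed_table('customers', 'customer_id');
SELECT create_distributed_table('orders', 'customer_id',
                                colocate_with => 'customers');

-- Joining on the common distribution column runs shard-by-shard in parallel
SELECT c.name, sum(o.total)
FROM customers c
JOIN orders o USING (customer_id)
GROUP BY c.name;
```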
Reference tables can be used as “dimension” tables to join efficiently with large “fact” tables. Because reference tables are replicated in full across all worker nodes, a reference join can be decomposed into local joins on each worker and performed in parallel. A reference join is like a more flexible version of a co-located join because reference tables are not distributed on any particular column and are free to join on any of their columns.
Reference tables can also join with tables local to the coordinator node.
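A sketch of a reference join, with hypothetical names: a small dimension table replicated to every worker joins locally with a large distributed fact table:

```sql
CREATE TABLE countries (code char(2) PRIMARY KEY, name text);
SELECT create_reference_table('countries');

-- Assumes a distributed fact table "events" with a country_code column;
-- the join decomposes into local joins on each worker, run in parallel
SELECT n.name, count(*)
FROM events e
JOIN countries n ON n.code = e.country_code
GROUP BY n.name;
```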
In some cases, you may need to join two tables on columns other than the distribution column. For such cases, citus also allows joining on non-distribution key columns by dynamically repartitioning the tables for the query.
In such cases, the tables to be repartitioned are determined by the query optimizer based on the distribution columns, the join keys, and the sizes of the tables. Repartitioning ensures that only relevant shard pairs are joined with each other, which drastically reduces the amount of data transferred across the network.
In general, co-located joins are more efficient than repartition joins as repartition joins require shuffling of data. So, you should try to distribute your tables by the common join keys whenever possible.
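A sketch of a repartition join, with hypothetical tables distributed on different columns; the citus.enable_repartition_joins configuration parameter must be enabled for the planner to choose this strategy:

```sql
SET citus.enable_repartition_joins TO on;

-- orders is distributed by customer_id, shipments by shipment_id;
-- joining on order_id forces dynamic repartitioning of the data
SELECT o.order_id, s.shipped_at
FROM orders o
JOIN shipments s ON o.order_id = s.order_id;
```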
A citus cluster consists of a coordinator instance and multiple worker instances. The data is sharded on the workers while the coordinator stores metadata about these shards. All queries issued to the cluster are executed via the coordinator. The coordinator partitions the query into smaller query fragments where each query fragment can be run independently on a shard. The coordinator then assigns the query fragments to workers, oversees their execution, merges their results, and returns the final result to the user. The query processing architecture can be described in brief by the diagram below.
Figure J.16. Query Processing Architecture
The citus query processing pipeline involves two components:
Distributed query planner and executor
Postgres Pro planner and executor
We discuss them in greater detail in the subsequent sections.
citus distributed query planner takes in a SQL query and plans it for distributed execution.
For SELECT queries, the planner first creates a plan
tree of the input query and transforms it into its commutative and
associative form so it can be parallelized. It also applies several
optimizations to ensure that the queries are executed in a scalable
manner, and that network I/O is minimized.
Next, the planner breaks the query into two parts: the coordinator query, which runs on the coordinator, and the worker query fragments, which run on individual shards on the workers. The planner then assigns these query fragments to the workers such that all their resources are used efficiently. After this step, the distributed query plan is passed on to the distributed executor for execution.
The planning process for key-value lookups on the distribution column or modification queries is slightly different, as they hit exactly one shard. Once the planner receives an incoming query, it needs to decide the correct shard to which the query should be routed. To do this, it extracts the distribution column value from the incoming query and looks up the metadata to determine the right shard. Then, the planner rewrites the SQL of that command to reference the shard table instead of the original table. This rewritten plan is then passed to the distributed executor.
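For instance, with a hypothetical table distributed by user_id, an equality filter on the distribution column lets the planner route the whole statement to a single shard:

```sql
-- The planner rewrites this to reference the one matching shard table
-- and sends it to the worker holding that shard
SELECT * FROM events WHERE user_id = 42;
```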
citus distributed executor runs distributed query plans and handles failures. The executor is well suited for getting fast responses to queries involving filters, aggregations, and co-located joins, as well as running single-tenant queries with full SQL coverage. It opens one connection per shard to the workers as needed and sends all fragment queries to them. It then fetches the results from each fragment query, merges them, and gives the final results back to the user.
If necessary, citus can gather results from subqueries and CTEs into the coordinator node and then push them back across workers for use by an outer query. This allows citus to support a greater variety of SQL constructs.
For example, a subquery in the WHERE clause
sometimes cannot be executed inline at the same time as the main query, but
must be run separately. Suppose a web analytics application maintains
a page_views table partitioned by
page_id. To query the number of visitor hosts on the
top twenty most visited pages, we can use a subquery to find the list of
pages, then an outer query to count the hosts.
SELECT page_id, count(distinct host_ip)
FROM page_views
WHERE page_id IN (
  SELECT page_id
  FROM page_views
  GROUP BY page_id
  ORDER BY count(*) DESC
  LIMIT 20
)
GROUP BY page_id;
The executor would like to run a fragment of this query against each
shard by page_id, counting distinct
host_ips, and combining the results on the
coordinator. However, the LIMIT in the subquery means
the subquery cannot be executed as part of the fragment. By recursively
planning the query citus can run the subquery
separately, push the results to all workers, run the main fragment query,
and pull the results back to the coordinator. The
“push-pull” design supports subqueries like the one above.
Let's see this in action by reviewing the
EXPLAIN output for
this query. It is fairly involved:
GroupAggregate (cost=0.00..0.00 rows=0 width=0)
Group Key: remote_scan.page_id
-> Sort (cost=0.00..0.00 rows=0 width=0)
Sort Key: remote_scan.page_id
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0)
-> Distributed Subplan 6_1
-> Limit (cost=0.00..0.00 rows=0 width=0)
-> Sort (cost=0.00..0.00 rows=0 width=0)
Sort Key: COALESCE((pg_catalog.sum((COALESCE((pg_catalog.sum(remote_scan.worker_column_2))::bigint, '0'::bigint))))::bigint, '0'::bigint) DESC
-> HashAggregate (cost=0.00..0.00 rows=0 width=0)
Group Key: remote_scan.page_id
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0)
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=localhost port=9701 dbname=postgres
-> HashAggregate (cost=54.70..56.70 rows=200 width=12)
Group Key: page_id
-> Seq Scan on page_views_102008 page_views (cost=0.00..43.47 rows=2247 width=4)
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=localhost port=9701 dbname=postgres
-> HashAggregate (cost=84.50..86.75 rows=225 width=36)
Group Key: page_views.page_id, page_views.host_ip
-> Hash Join (cost=17.00..78.88 rows=1124 width=36)
Hash Cond: (page_views.page_id = intermediate_result.page_id)
-> Seq Scan on page_views_102008 page_views (cost=0.00..43.47 rows=2247 width=36)
-> Hash (cost=14.50..14.50 rows=200 width=4)
-> HashAggregate (cost=12.50..14.50 rows=200 width=4)
Group Key: intermediate_result.page_id
-> Function Scan on read_intermediate_result intermediate_result (cost=0.00..10.00 rows=1000 width=4)
Let's break it apart and examine each piece.
GroupAggregate (cost=0.00..0.00 rows=0 width=0)
Group Key: remote_scan.page_id
-> Sort (cost=0.00..0.00 rows=0 width=0)
Sort Key: remote_scan.page_id
The root of the tree is what the coordinator node does with the results
from the workers. In this case, it is grouping them, and
GroupAggregate requires they be sorted first.
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0)
  -> Distributed Subplan 6_1
.
The custom scan has two large sub-trees, starting with a “distributed subplan”.
-> Limit (cost=0.00..0.00 rows=0 width=0)
-> Sort (cost=0.00..0.00 rows=0 width=0)
Sort Key: COALESCE((pg_catalog.sum((COALESCE((pg_catalog.sum(remote_scan.worker_column_2))::bigint, '0'::bigint))))::bigint, '0'::bigint) DESC
-> HashAggregate (cost=0.00..0.00 rows=0 width=0)
Group Key: remote_scan.page_id
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0)
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=localhost port=9701 dbname=postgres
-> HashAggregate (cost=54.70..56.70 rows=200 width=12)
Group Key: page_id
-> Seq Scan on page_views_102008 page_views (cost=0.00..43.47 rows=2247 width=4)
.
Worker nodes run the above for each of the thirty-two shards
(citus is choosing one representative for
display). We can recognize all the pieces of the IN (…)
subquery: the sorting, grouping and limiting. When all workers
have completed this query, they send their output back to the
coordinator which puts it together as “intermediate results”.
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=localhost port=9701 dbname=postgres
-> HashAggregate (cost=84.50..86.75 rows=225 width=36)
Group Key: page_views.page_id, page_views.host_ip
-> Hash Join (cost=17.00..78.88 rows=1124 width=36)
Hash Cond: (page_views.page_id = intermediate_result.page_id)
.
The citus extension starts another executor
job in this second subtree. It is going to count distinct hosts in
page_views. It uses a JOIN to
connect with the intermediate results. The intermediate results will
help it restrict to the top twenty pages.
-> Seq Scan on page_views_102008 page_views (cost=0.00..43.47 rows=2247 width=36)
-> Hash (cost=14.50..14.50 rows=200 width=4)
-> HashAggregate (cost=12.50..14.50 rows=200 width=4)
Group Key: intermediate_result.page_id
-> Function Scan on read_intermediate_result intermediate_result (cost=0.00..10.00 rows=1000 width=4)
.
The worker internally retrieves intermediate results using the
read_intermediate_result function, which loads data
from a file that was copied in from the coordinator node.
This example showed how citus executed the
query in multiple steps with a distributed subplan and how you can use
EXPLAIN to learn about distributed query execution.
Once the distributed executor sends the query fragments to the workers, they are processed like regular Postgres Pro queries. The Postgres Pro planner on that worker chooses the most optimal plan for executing that query locally on the corresponding shard table. The Postgres Pro executor then runs that query and returns the query results back to the distributed executor. Learn more about the Postgres Pro planner and executor. Finally, the distributed executor passes the results to the coordinator for final aggregation.
When the user issues a query, the citus coordinator partitions it into smaller query fragments where each query fragment can be run independently on a worker shard. This allows citus to distribute each query across the cluster.
However, the way queries are partitioned into fragments (and which queries are propagated at all) varies by the type of query. In some advanced situations it is useful to manually control this behavior. citus provides utility functions to propagate SQL to workers, shards, or co-located placements.
Manual query propagation bypasses coordinator logic, locking, and any other consistency checks. These functions are available as a last resort to allow statements which citus otherwise does not run natively. Use them carefully to avoid data inconsistency and deadlocks.
The least granular level of execution is broadcasting a statement for execution on all workers. This is useful for viewing properties of entire worker databases.
-- List the work_mem setting of each worker database
SELECT run_command_on_workers($cmd$ SHOW work_mem; $cmd$);
To run on all nodes, both workers and the coordinator, use the
run_command_on_all_nodes function.
This command should not be used to create database objects on the workers, as doing so will make it harder to add worker nodes in an automated fashion.
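For example, to inspect a setting on every node, including the coordinator:

```sql
-- Runs on the coordinator and every worker
SELECT run_command_on_all_nodes($cmd$ SHOW max_connections; $cmd$);
```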
The run_command_on_workers function and other
manual propagation commands in this section can run only queries
that return a single column and single row.
The next level of granularity is running a command across all shards of a particular distributed table. It can be useful, for instance, in reading the properties of a table directly on workers. Queries run locally on a worker node have full access to metadata such as table statistics.
The run_command_on_shards function applies an SQL
command to each shard, where the shard name is provided for interpolation
in the command. Here is an example of estimating the row count for a
distributed table by using the pg_class
table on each worker to estimate the number of rows for each shard.
Notice the %s, which will be replaced with each
shard name.
-- Get the estimated row count for a distributed table by summing the
-- estimated counts of rows for each shard
SELECT sum(result::bigint) AS estimated_count
FROM run_command_on_shards(
'my_distributed_table',
$cmd$
SELECT reltuples
FROM pg_class c
JOIN pg_catalog.pg_namespace n on n.oid=c.relnamespace
WHERE (n.nspname || '.' || relname)::regclass = '%s'::regclass
AND n.nspname NOT IN ('citus', 'pg_toast', 'pg_catalog')
$cmd$
);
A useful companion to run_command_on_shards is
the run_command_on_colocated_placements function.
It interpolates the names of two placements of
co-located distributed
tables into a query. The placement pairs are always chosen to be local
to the same worker where full SQL coverage is available. Thus we can use
advanced SQL features like triggers to relate the tables:
-- Suppose we have two distributed tables
CREATE TABLE little_vals (key int, val int);
CREATE TABLE big_vals (key int, val int);
SELECT create_distributed_table('little_vals', 'key');
SELECT create_distributed_table('big_vals', 'key');
-- We want to synchronize them so that every time little_vals
-- are created, big_vals appear with double the value
--
-- First we make a trigger function, which will
-- take the destination table placement as an argument
CREATE OR REPLACE FUNCTION embiggen() RETURNS TRIGGER AS $$
BEGIN
IF (TG_OP = 'INSERT') THEN
EXECUTE format(
'INSERT INTO %s (key, val) SELECT ($1).key, ($1).val*2;',
TG_ARGV[0]
) USING NEW;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
-- Next we relate the co-located tables by the trigger function
-- on each co-located placement
SELECT run_command_on_colocated_placements(
'little_vals',
'big_vals',
$cmd$
CREATE TRIGGER after_insert AFTER INSERT ON %s
FOR EACH ROW EXECUTE PROCEDURE embiggen(%L)
$cmd$
);
There are no safeguards against deadlock for multi-statement transactions.
There are no safeguards against mid-query failures and resulting inconsistencies.
Query results are cached in memory; these functions cannot deal with very big result sets.
The functions error out early if they cannot connect to a node.
As citus provides distributed functionality by extending Postgres Pro, it is compatible with Postgres Pro constructs. This means that users can use the tools and features that come with the rich and extensible Postgres Pro ecosystem for distributed tables created with citus.
citus has 100% SQL coverage for any queries it is able to execute on a single worker node. These kinds of queries are common in multi-tenant applications when accessing information about a single tenant.
Even cross-node queries (used for parallel computations) support most SQL features. However, some SQL features are not supported for queries that combine information from multiple nodes.
These limitations apply to all models of operation:
The rule system is not supported.
Subqueries within INSERT queries
are not supported.
Distributing multi-level partitioned tables is not supported.
Functions used in UPDATE queries on distributed
tables must not be VOLATILE.
STABLE functions used in UPDATE
queries cannot be called with column references.
Modifying views when the query contains citus tables is not supported.
citus encodes the node identifier in the sequence values generated on
every node, which allows each node to take inserts directly without
overlapping sequence values. However, this method does not work for
sequence types smaller than bigint, and inserts on
worker nodes may fail as a result. In that case, you need to drop the
column and add a bigint-based one, or route the
inserts via the coordinator.
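A sketch of the column replacement, assuming a distributed table items whose id column is backed by an int sequence:

```sql
-- Replace the int-backed serial column with a bigint-backed one
ALTER TABLE items DROP COLUMN id;
ALTER TABLE items ADD COLUMN id bigserial;
```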
SELECT … FOR UPDATE
works in single-shard queries only.
TABLESAMPLE works in single-shard queries only.
Correlated subqueries are supported only when the correlation is on the distribution column.
Outer joins between distributed tables are only supported on the distribution column.
Recursive CTEs work in single-shard queries only.
Grouping sets work in single-shard queries only.
Only regular, foreign or partitioned tables can be distributed.
The SQL MERGE
command is supported in the following combinations of
table types:
| Target      | Source      | Support | Comments                                 |
|-------------|-------------|---------|------------------------------------------|
| Local       | Local       | Yes     |                                          |
| Local       | Reference   | Yes     |                                          |
| Local       | Distributed | No      | Feature in development                   |
| Distributed | Local       | Yes     |                                          |
| Distributed | Distributed | Yes     | Including non co-located tables          |
| Distributed | Reference   | Yes     |                                          |
| Reference   | N/A         | No      | Reference table as target is not allowed |
For a detailed reference of the Postgres Pro SQL command dialect (which can be used as is by citus users), you can see the SQL Commands section.
When using schema-based sharding the following features are not available:
Foreign keys across distributed schemas are not supported.
Joins across distributed schemas are subject to cross-node SQL queries limitations.
Creating a distributed schema and tables in a single SQL statement is not supported.
Before attempting workarounds consider whether citus is appropriate for your situation. The citus extension works well for real-time analytics and multi-tenant use cases.
citus supports all SQL statements in the multi-tenant use case. Even in the real-time analytics use cases, with queries that span across nodes, citus supports the majority of statements. The few types of unsupported queries are listed in the Are there any Postgres Pro features not supported by citus? section. Many of the unsupported features have workarounds; below are a number of the most useful.
When a SQL query is unsupported, one way to work around it is using CTEs, which use what we call pull-push execution.
SELECT * FROM dist WHERE EXISTS (SELECT 1 FROM local WHERE local.a = dist.a);
/*
ERROR: direct joins between distributed and local tables are not supported
HINT: Use CTEs or subqueries to select from local tables and use them in joins
*/
To work around this limitation, you can turn the query into a router query by wrapping the distributed part in a CTE.
WITH cte AS (SELECT * FROM dist)
SELECT * FROM cte
WHERE EXISTS (SELECT 1 FROM local WHERE local.a = cte.a);
Remember that the coordinator will send the results of the CTE to all workers that require it for processing. Thus it is best to either add the most specific filters and limits possible to the inner query, or else aggregate the table. That reduces the network overhead such a query can cause. More about this in the Subquery/CTE Network Overhead section.
There are still a few queries that are unsupported even with the use of push-pull execution via subqueries. One of them is using grouping sets on a distributed table.
In our
real-time analytics tutorial
we created a table called github_events,
distributed by the column user_id. Let's query it and
find the earliest events for a preselected set of repos, grouped by
combinations of event type and event publicity. A convenient way to do
this is with grouping sets. However, as mentioned, this feature is not
yet supported in distributed queries:
-- This will not work
SELECT repo_id, event_type, event_public,
grouping(event_type, event_public),
min(created_at)
FROM github_events
WHERE repo_id IN (8514, 15435, 19438, 21692)
GROUP BY repo_id, ROLLUP(event_type, event_public);
ERROR: could not run distributed query with GROUPING
HINT: Consider using an equality filter on the distributed table's partition column.
There is a trick, though. We can pull the relevant information to the coordinator as a temporary table:
-- Grab the data, minus the aggregate, into a local table
CREATE TEMP TABLE results AS (
SELECT repo_id, event_type, event_public, created_at
FROM github_events
WHERE repo_id IN (8514, 15435, 19438, 21692)
);
-- Now run the aggregate locally
SELECT repo_id, event_type, event_public,
grouping(event_type, event_public),
min(created_at)
FROM results
GROUP BY repo_id, ROLLUP(event_type, event_public);
repo_id | event_type | event_public | grouping | min
---------+-------------------+--------------+----------+---------------------
8514 | PullRequestEvent | t | 0 | 2016-12-01 05:32:54
8514 | IssueCommentEvent | t | 0 | 2016-12-01 05:32:57
19438 | IssueCommentEvent | t | 0 | 2016-12-01 05:48:56
21692 | WatchEvent | t | 0 | 2016-12-01 06:01:23
15435 | WatchEvent | t | 0 | 2016-12-01 05:40:24
21692 | WatchEvent | | 1 | 2016-12-01 06:01:23
15435 | WatchEvent | | 1 | 2016-12-01 05:40:24
8514 | PullRequestEvent | | 1 | 2016-12-01 05:32:54
8514 | IssueCommentEvent | | 1 | 2016-12-01 05:32:57
19438 | IssueCommentEvent | | 1 | 2016-12-01 05:48:56
15435 | | | 3 | 2016-12-01 05:40:24
21692 | | | 3 | 2016-12-01 06:01:23
19438 | | | 3 | 2016-12-01 05:48:56
8514 | | | 3 | 2016-12-01 05:32:54
Creating a temporary table on the coordinator is a last resort. It is limited by the disk size and CPU of the node.
INSERT Queries
Try rewriting your queries with INSERT INTO ... SELECT
syntax.
The following SQL:
INSERT INTO a.widgets (map_id, widget_name)
VALUES (
(SELECT mt.map_id FROM a.map_tags mt WHERE mt.map_license = '12345'),
'Test'
);
Would become:
INSERT INTO a.widgets (map_id, widget_name)
SELECT mt.map_id, 'Test'
FROM a.map_tags mt
WHERE mt.map_license = '12345';
This section contains reference information for the user-defined functions provided by citus. These functions provide additional distributed functionality to citus beyond the standard SQL commands.
citus_schema_distribute (schemaname regnamespace) returns void
Converts an existing regular schema into a distributed schema, which is automatically associated with an individual co-location group such that tables created in the schema are automatically converted to co-located distributed tables without a shard key. The process of distributing the schema automatically assigns and moves it to an existing node in the cluster.
Arguments:
schemaname — the name of the schema to be distributed.
The example below shows how to distribute three schemas named
tenant_a, tenant_b, and
tenant_c. For more examples, see the
Microservices
section:
SELECT citus_schema_distribute('tenant_a');
SELECT citus_schema_distribute('tenant_b');
SELECT citus_schema_distribute('tenant_c');
citus_schema_undistribute (schemaname regnamespace) returns void
Converts an existing distributed schema back into a regular schema. The process results in the tables and data being moved from the current node back to the coordinator node in the cluster.
Arguments:
schemaname — the name of the schema to be undistributed.
The example below shows how to convert three different distributed schemas back into regular schemas. For more examples, see the Microservices section:
SELECT citus_schema_undistribute('tenant_a');
SELECT citus_schema_undistribute('tenant_b');
SELECT citus_schema_undistribute('tenant_c');
citus_schema_move (schema_id regnamespace, target_node_name text, target_node_port integer, shard_transfer_mode citus.shard_transfer_mode) returns void
Moves a distributed schema from one node to another.
There are two ways to move a distributed schema: blocking or non-blocking. The blocking approach means that during the move all modifications to the tables in the schema are paused. The second way, which avoids blocking writes, relies on Postgres Pro logical replication.
Arguments:
schema_id — the object ID of the
distributed schema to be moved. If you provide the name of the
schema as a string literal, it is automatically cast
to oid.
target_node_name — the DNS name of
the node on which the distributed schema will be moved
(“target” node).
target_node_port — the port on the
target worker node on which the database server is listening.
shard_transfer_mode — specify the
method of replication, whether to use
Postgres Pro logical replication or a
cross-worker COPY command. The allowed values
of this optional argument are:
auto — require replica identity
if logical replication is possible, otherwise use legacy
behaviour. This is the default value.
force_logical — use logical
replication even if the table does not have a replica
identity. Any concurrent update/delete statements to the
table will fail during replication.
block_writes — use
COPY (blocking writes) for tables
lacking primary key or replica identity.
The example below shows how to use the function:
SELECT citus_schema_move('schema-name', 'to_host', 5432);
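To select the transfer mode explicitly, pass the optional fourth argument, for example:

```sql
-- Use COPY and block writes during the move instead of logical replication
SELECT citus_schema_move('schema-name', 'to_host', 5432,
                         shard_transfer_mode => 'block_writes');
```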
create_distributed_table (table_name regclass, distribution_column text, distribution_type citus.distribution_type, colocate_with text, shard_count int) returns void
Defines a distributed table and creates its shards if it is a hash-distributed table. This function takes in a table name, the distribution column, and an optional distribution method, and inserts the appropriate metadata to mark the table as distributed. The function defaults to hash distribution if no distribution method is specified. If the table is hash-distributed, the function also creates worker shards based on the shard count configuration value. If the table contains any rows, they are automatically distributed to worker nodes.
Arguments:
table_name — the name of the table to be distributed.
distribution_column — the column on
which the table is to be distributed.
distribution_type — an optional
distribution method. The default value is hash.
colocate_with — include current
table in the co-location group of another table. This is an
optional argument. By default tables are co-located when they
are distributed by columns of the same type with the same shard
count. If you want to break this co-location later, you can use
the
update_distributed_table_colocation
function. Possible values for this argument are
default, which is the default value,
none to start a new co-location group, or the
name of another table to co-locate with the table. To learn more,
see the Co-Locating Tables
section.
Keep in mind that the default value of the
colocate_with argument does implicit
co-location. As explained in the
Table Co-Location
section, this can be a great thing when tables are related or
will be joined. However, when two tables are unrelated but happen
to use the same datatype for their distribution columns,
accidentally co-locating them can decrease performance during
shard rebalancing.
The table shards will be moved together unnecessarily in a
“cascade”. If you want to break this implicit
co-location, you can use the
update_distributed_table_colocation
function.
If a new distributed table is not related to other tables, it is
best to specify colocate_with => 'none'.
shard_count — the number of shards
to create for the new distributed table. This is an optional
argument. When specifying shard_count you
cannot specify a value of colocate_with
other than none. To change the shard count of
an existing table or co-location group, use the
alter_distributed_table
function.
Allowed values for the shard_count
argument are between 1 and
64000. For guidance on choosing the optimal
value, see the
Shard Count
section.
This example informs the database that the
github_events table should be distributed by hash
on the repo_id column. For more examples, see the
Creating and Modifying Distributed Objects (DDL)
section:
SELECT create_distributed_table('github_events', 'repo_id');
-- Alternatively, to be more explicit:
SELECT create_distributed_table('github_events', 'repo_id',
colocate_with => 'github_repo');
truncate_local_data_after_distributing_table (table_name regclass) returns void
#Truncates all local rows after distributing a table and prevents constraints from failing due to outdated local records. The truncation cascades to tables having a foreign key to the designated table. If the referring tables are not themselves distributed, truncation is forbidden until they are, in order to protect referential integrity:
ERROR: cannot truncate a table referenced in a foreign key constraint by a local table
Truncating local coordinator node table data is safe for distributed tables because their rows, if they have any, are copied to worker nodes during distribution.
Arguments:
table_name — the name of the
distributed table whose local counterpart on the coordinator
node should be truncated.
The example below shows how to use the function:
-- Requires that argument is a distributed table
SELECT truncate_local_data_after_distributing_table('public.github_events');
undistribute_table (table_name regclass, cascade_via_foreign_keys boolean) returns void
#Undoes the action of the create_distributed_table or create_reference_table functions. Undistributing moves all data from shards back into a local table on the coordinator node (assuming the data can fit), then deletes the shards.
citus will not undistribute tables that
have, or are referenced by, foreign keys, unless the
cascade_via_foreign_keys argument is set to
true. If this argument is false
(or omitted), then you must manually drop the offending foreign key
constraints before undistributing.
Arguments:
table_name — the name of the
distributed or reference table to undistribute.
cascade_via_foreign_keys — when
this optional argument is set to true, the
function also undistributes all tables that are related to
table_name through foreign keys. Use
caution with this argument because it can potentially affect
many tables. The default value is false.
The example below shows how to distribute the
github_events table and then undistribute it:
-- First distribute the table
SELECT create_distributed_table('github_events', 'repo_id');
-- Undo that and make it local again
SELECT undistribute_table('github_events');
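If foreign keys are involved, the cascading form described above can be used. The sketch below reuses the github_events table and assumes it has related tables that should also revert to local tables:
-- Undistribute github_events together with every table related
-- to it through foreign keys (use with caution)
SELECT undistribute_table('github_events',
                          cascade_via_foreign_keys => true);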
alter_distributed_table (table_name regclass, distribution_column text, shard_count int, colocate_with text, cascade_to_colocated boolean) returns void
#Changes the distribution column, shard count or co-location properties of a distributed table.
Arguments:
table_name — the name of the
distributed table, which will be altered.
distribution_column — the name of
the new distribution column. The default value of this optional
argument is NULL.
shard_count — the new shard count.
The default value of this optional argument is
NULL.
colocate_with — the table that the
current distributed table will be co-located with. Possible values
are default, none to start
a new co-location group, or the name of another table with which
to co-locate. The default value of this optional argument is
default.
cascade_to_colocated. When this argument
is set to true,
shard_count and
colocate_with changes will also be
applied to all of the tables that were previously co-located
with the table, and the co-location will be preserved. If it is
false, the current co-location of this table
will be broken. The default value of this optional argument is
false.
The example below shows how to use the function:
-- Change distribution column
SELECT alter_distributed_table('github_events', distribution_column:='event_id');
-- Change shard count of all tables in colocation group
SELECT alter_distributed_table('github_events', shard_count:=6, cascade_to_colocated:=true);
-- Change colocation
SELECT alter_distributed_table('github_events', colocate_with:='another_table');
alter_table_set_access_method (table_name regclass, access_method text) returns void
#
Changes the access method of a table (e.g., heap or
columnar).
Arguments:
table_name — the name of the table
whose access method will change.
access_method — the name of the new
access method.
The example below shows how to use the function:
SELECT alter_table_set_access_method('github_events', 'columnar');
remove_local_tables_from_metadata () returns void
#Removes from the metadata of the citus extension those local tables that no longer need to be there. (See the citus.enable_local_reference_table_foreign_keys configuration parameter.)
Usually if a local table is in citus
metadata, there is a reason, such as the existence of foreign keys
between the table and a reference table. However, if
citus.enable_local_reference_table_foreign_keys
is disabled, citus will no longer manage
metadata in that situation, and unnecessary metadata can persist
until manually cleaned.
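The sketch below shows one way the function might be used once the configuration parameter is disabled; the function itself takes no arguments:
-- Stop managing metadata for local tables that have foreign keys
-- to reference tables
SET citus.enable_local_reference_table_foreign_keys TO off;
-- Remove metadata entries that are no longer needed
SELECT remove_local_tables_from_metadata();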
create_reference_table (table_name regclass) returns void
#Defines a small reference or dimension table. This function takes in a table name and creates a distributed table with just one shard, replicated to every worker node.
Arguments:
table_name — the name of the small
dimension or reference table, which needs to be distributed.
The example below informs the database that the
nation table should be defined as a reference
table:
SELECT create_reference_table('nation');
citus_add_local_table_to_metadata (table_name regclass, cascade_via_foreign_keys boolean) returns void
#Adds a local Postgres Pro table into citus metadata. A major use case for this function is to make local tables on the coordinator accessible from any node in the cluster. This is mostly useful when running queries from other nodes. The data associated with the local table stays on the coordinator, only its schema and metadata are sent to the workers.
#Note that adding local tables to the metadata comes at a slight cost. When you add the table, citus must track it in the pg_dist_partition table. Local tables that are added to metadata inherit the same limitations as reference tables (see the Creating and Modifying Distributed Objects (DDL) and SQL Support and Workarounds sections).
If you use the undistribute_table function, citus will automatically remove the resulting local tables from metadata, which eliminates such limitations on those tables.
Arguments:
table_name — the name of the table
on the coordinator to be added to citus
metadata.
cascade_via_foreign_keys — when
this optional argument is set to true,
the function adds other tables that are in a foreign key
relationship with given table into metadata automatically. Use
caution with this argument, because it can potentially affect
many tables. The default value is false.
The example below informs the database that the
nation table should be defined as a
coordinator-local table, accessible from any node:
SELECT citus_add_local_table_to_metadata('nation');
update_distributed_table_colocation (table_name regclass, colocate_with text) returns void
#
Updates co-location of a distributed table. This function can also
be used to break co-location of a distributed table.
citus implicitly co-locates two tables
if their distribution columns have the same type, which can be
useful when the tables are related and will be joined. If tables
A and B are co-located and
table A gets rebalanced, table
B will also be rebalanced. If table
B does not have a replica identity, the rebalance
will fail. Therefore, this function can be useful for breaking the
implicit co-location in that case. Note that this function does not
physically move any data.
Arguments:
table_name — the name of the table
whose co-location will be updated.
colocate_with — the table with
which the table should be co-located.
If you want to break the co-location of a table, specify
colocate_with => 'none'.
The example below shows how to co-locate table A
with table B:
SELECT update_distributed_table_colocation('A', colocate_with => 'B');
Assume that table A and table B
are co-located (possibly implicitly). If you want to break the
co-location, do the following:
SELECT update_distributed_table_colocation('A', colocate_with => 'none');
Now, assume that tables A, B,
C, and D are co-located and
you want to co-locate table A with
B and table C with
table D:
SELECT update_distributed_table_colocation('C', colocate_with => 'none');
SELECT update_distributed_table_colocation('D', colocate_with => 'C');
If you have a hash-distributed table named none
and you want to update its co-location, you can do:
SELECT update_distributed_table_colocation('"none"', colocate_with => 'some_other_hash_distributed_table');
create_distributed_function (function_name regprocedure, distribution_arg_name text, colocate_with text, force_delegation bool) returns void
#
Propagates a function from the coordinator node to workers and marks
it for distributed execution. When a distributed function is called
on the coordinator, citus uses the value
of the distribution_arg_name argument to pick
a worker node to run the function. Calling the function on workers
increases parallelism and can bring the code closer to data in shards
for lower latency.
#Note that the Postgres Pro search path is not propagated from the coordinator to workers during distributed function execution, so distributed function code should fully qualify the names of database objects. Also, notices emitted by the functions will not be displayed to the user.
Arguments:
function_name — the name of the
function to be distributed. The name must include the function
parameter types in parentheses because multiple functions can
have the same name in Postgres Pro.
For instance, 'foo(int)' is different from
'foo(int, text)'.
distribution_arg_name — the argument
name by which to distribute. For convenience (or if the function
arguments do not have names), a positional placeholder is
allowed, such as '$1'. If this argument is
not specified, then the function named by
function_name is merely created on the
workers. If worker nodes are added in the future, the function
will automatically be created there too. This is an optional
argument.
colocate_with — when the distributed
function reads or writes to a distributed table (or, more
generally,
co-located tables),
be sure to name that table using this argument. This ensures
that each invocation of the function runs on the worker node
containing relevant shards. This is an optional argument.
force_delegation. The default value is
NULL.
The example below shows how to use the function:
-- An example function that updates a hypothetical
-- event_responses table, which itself is distributed by event_id
CREATE OR REPLACE FUNCTION
  register_for_event(p_event_id int, p_user_id int)
RETURNS void LANGUAGE plpgsql AS $fn$
BEGIN
  INSERT INTO event_responses VALUES ($1, $2, 'yes')
  ON CONFLICT (event_id, user_id)
  DO UPDATE SET response = EXCLUDED.response;
END;
$fn$;

-- Distribute the function to workers, using the p_event_id argument
-- to determine which shard each invocation affects, and explicitly
-- colocating with event_responses which the function updates
SELECT create_distributed_function(
  'register_for_event(int, int)',
  'p_event_id',
  colocate_with := 'event_responses'
);
alter_columnar_table_set (table_name regclass, chunk_group_row_limit int, stripe_row_limit int, compression name, compression_level int) returns void
#
Changes settings on a columnar table.
Calling this function on a non-columnar table gives an error. All
arguments except the table_name are optional.
To view current options for all columnar tables, consult this table:
SELECT * FROM columnar.options;
The default values for columnar settings for newly created tables can be overridden with these configuration parameters:
columnar.compression
columnar.compression_level
columnar.stripe_row_count
columnar.chunk_row_count
Arguments:
table_name — the name of the
columnar table.
chunk_group_row_limit — the maximum
number of rows per chunk for newly inserted data. Existing
chunks of data will not be changed and may have more rows than
this maximum value. The default value is 10000.
stripe_row_limit — the maximum
number of rows per stripe for newly inserted data. Existing
stripes of data will not be changed and may have more rows than
this maximum value. The default value is 150000.
compression — the compression type
for the newly inserted data. Existing data will not be
recompressed or decompressed. The default and generally suggested
value is zstd (if support has been compiled in).
Allowed values are none,
pglz, zstd,
lz4, and lz4hc.
compression_level. Allowed values are
from 1 to 19. If the compression method does not support the
level chosen, the closest level will be selected instead.
The example below shows how to use the function:
SELECT alter_columnar_table_set(
  'my_columnar_table',
  compression => 'none',
  stripe_row_limit => 10000);
create_time_partitions (table_name regclass, partition_interval interval, end_at timestamptz, start_from timestamptz) returns boolean
#
Creates partitions of a given interval to cover a given range of
time. Returns true if new partitions are created
and false if they already exist.
Arguments:
table_name — the table for which to
create new partitions. The table must be partitioned on one
column of type date, timestamp, or
timestamptz.
partition_interval — the interval
of time, such as '2 hours', or
'1 month', to use when setting ranges on new
partitions.
end_at — create partitions up to
this time. The last partition will contain the point
end_at and no later partitions will be
created.
start_from — pick the first
partition so that it contains the point
start_from. The default value is
now().
The example below shows how to use the function:
-- Create a year's worth of monthly partitions
-- in table foo, starting from the current time
SELECT create_time_partitions(
  table_name := 'foo',
  partition_interval := '1 month',
  end_at := now() + '12 months'
);
drop_old_time_partitions (table_name regclass, older_than timestamptz)
#Removes all partitions whose intervals fall before a given timestamp. In addition to using this function, you might consider the alter_old_partitions_set_access_method function to compress the old partitions with columnar storage.
Arguments:
table_name — the table for which to
remove partitions. The table must be partitioned on one column
of type date, timestamp, or
timestamptz.
older_than — drop partitions whose
upper limit is less than or equal to the
older_than value.
The example below shows how to use the procedure:
-- Drop partitions that are over a year old
CALL drop_old_time_partitions('foo', now() - interval '12 months');
alter_old_partitions_set_access_method (parent_table_name regclass, older_than timestamptz, new_access_method name)
#In the time-series data use case, tables are often partitioned by time, and old partitions are compressed into read-only columnar storage. This procedure changes the access method of all partitions of a table whose upper bound is at or before the given timestamp.
Arguments:
parent_table_name — the table for
which to change partitions. The table must be partitioned on one
column of type date, timestamp, or
timestamptz.
older_than — change partitions
whose upper limit is less than or equal to the
older_than value.
new_access_method. Allowed values are
heap for row-based storage or
columnar for columnar storage.
The example below shows how to use the procedure:
CALL alter_old_partitions_set_access_method(
  'foo', now() - interval '6 months', 'columnar'
);
citus_add_node (nodename text, nodeport integer, groupid integer, noderole noderole, nodecluster name) returns integer
#This function requires database superuser access to run.
Registers a new node addition in the cluster in the
citus metadata table
pg_dist_node. It
also copies reference tables to the new node. The function returns
the nodeid column from the newly inserted row in
pg_dist_node.
If you call the function on a single-node cluster, be sure to call the citus_set_coordinator_host function first.
Arguments:
nodename — the DNS name or IP
address of the new node to be added.
nodeport — the port on which
Postgres Pro is listening on the worker
node.
groupid — the group of one primary
server and its secondary servers, relevant only for streaming
replication. Be sure to set this argument to a value greater
than zero, since zero is reserved for the coordinator node. The
default value is -1.
noderole — the role of the node.
Allowed values are primary and
secondary. The default value is
primary.
nodecluster — the name of the
cluster. The default value is default.
The example below shows how to use the function:
SELECT * FROM citus_add_node('new-node', 12345);
citus_add_node
-----------------
7
(1 row)
citus_update_node (node_id int, new_node_name text, new_node_port int, force bool, lock_cooldown int) returns void
#This function requires database superuser access to run.
Changes the hostname and port for a node registered in the citus metadata table pg_dist_node.
Arguments:
node_id — the node ID from the
pg_dist_node table.
new_node_name — the updated DNS
name or IP address for the node.
new_node_port — the updated port on
which Postgres Pro is listening on the
worker node.
force. The default value is
false.
lock_cooldown. The default value is
10000.
The example below shows how to use the function:
SELECT * FROM citus_update_node(123, 'new-address', 5432);
citus_set_node_property (nodename text, nodeport integer, property text, value boolean) returns void
#
Changes properties in the citus metadata
table pg_dist_node.
Currently it can change only the shouldhaveshards
property.
Arguments:
nodename — the DNS name or IP
address for the node.
nodeport — the port on which
Postgres Pro is listening on the worker
node.
property — the column to change in
pg_dist_node; currently only the
shouldhaveshards property is supported.
value — the new value for the
column.
The example below shows how to use the function:
SELECT * FROM citus_set_node_property('localhost', 5433, 'shouldhaveshards', false);
citus_add_inactive_node (nodename text, nodeport integer, groupid integer, noderole noderole, nodecluster name) returns integer
#This function requires database superuser access to run.
Similarly to the citus_add_node
function, registers a new node in
pg_dist_node.
However, it marks the new node as inactive, meaning no shards will
be placed there. It also does not copy reference
tables to the new node. The function returns the
nodeid column from the newly inserted row in
pg_dist_node.
Arguments:
nodename — the DNS name or IP
address of the new node to be added.
nodeport — the port on which
Postgres Pro is listening on the worker
node.
groupid — the group of one primary
server and zero or more secondary servers, relevant only for
streaming replication. The default is -1.
noderole — the role of the node.
Allowed values are primary and
secondary. The default value is
primary.
nodecluster — the name of the cluster.
The default value is default.
The example below shows how to use the function:
SELECT * FROM citus_add_inactive_node('new-node', 12345);
citus_add_inactive_node
--------------------------
7
(1 row)
citus_activate_node (nodename text, nodeport integer) returns integer
#This function requires database superuser access to run.
Marks a node as active in the citus
metadata table
pg_dist_node and
copies reference tables to the node. Useful for nodes added via
citus_add_inactive_node.
The function returns the nodeid column from the
newly inserted row in pg_dist_node.
Arguments:
nodename — the DNS name or IP
address of the new node to be added.
nodeport — the port on which
Postgres Pro is listening on the worker
node.
The example below shows how to use the function:
SELECT * FROM citus_activate_node('new-node', 12345);
citus_activate_node
----------------------
7
(1 row)
citus_disable_node (nodename text, nodeport integer, synchronous bool) returns void
#This function requires database superuser access to run.
#This function is the opposite of citus_activate_node. It marks a node as inactive in the citus metadata table pg_dist_node, removing it from the cluster temporarily. The function also deletes all reference table placements from the disabled node. To reactivate the node, just call citus_activate_node again.
Arguments:
nodename — the DNS name or IP
address of the node to be disabled.
nodeport — the port on which
Postgres Pro is listening on the worker
node.
synchronous. The default value is
false.
The example below shows how to use the function:
SELECT * FROM citus_disable_node('new-node', 12345);
citus_add_secondary_node (nodename text, nodeport integer, primaryname text, primaryport integer, nodecluster name) returns integer
#This function requires database superuser access to run.
Registers a new secondary node in the cluster for an existing
primary node. The function updates the citus
pg_dist_node metadata
table. The function returns the nodeid column for
the secondary node from the inserted row in
pg_dist_node.
Arguments:
nodename — the DNS name or IP
address of the new node to be added.
nodeport — the port on which
Postgres Pro is listening on the worker
node.
primaryname — the DNS name or IP
address of the primary node for this secondary.
primaryport — the port on which
Postgres Pro is listening on the
primary node.
nodecluster — the name of the
cluster. The default value is default.
The example below shows how to use the function:
SELECT * FROM citus_add_secondary_node('new-node', 12345, 'primary-node', 12345);
citus_add_secondary_node
---------------------------
7
(1 row)
citus_remove_node (nodename text, nodeport integer) returns void
#This function requires database superuser access to run.
Removes the specified node from the pg_dist_node metadata table. This function will error out if there are existing shard placements on this node. Thus, before using this function, the shards will need to be moved off that node.
Arguments:
nodename — the DNS name of the node
to be removed.
nodeport — the port on which
Postgres Pro is listening on the worker
node.
The example below shows how to use the function:
SELECT citus_remove_node('new-node', 12345);
citus_remove_node
--------------------
(1 row)
citus_get_active_worker_nodes () returns setof record
#Returns active worker host names and port numbers as a list of tuples where each tuple contains the following information:
node_name — the DNS name of the
worker node.
node_port — the port on the worker
node on which the database server is listening.
The example below shows the output of the function:
SELECT * FROM citus_get_active_worker_nodes();
 node_name | node_port
-----------+-----------
 localhost |      9700
 localhost |      9702
 localhost |      9701
(3 rows)
citus_backend_gpid () returns bigint
#Returns the global process identifier (GPID) for the Postgres Pro backend serving the current session. The GPID value encodes both a node in the citus cluster and the operating system process ID of Postgres Pro on that node. The GPID is returned in the following form: (node ID * 10,000,000,000) + process ID.
citus extends the Postgres Pro
server signaling functions
pg_cancel_backend and pg_terminate_backend
so that they accept GPIDs. In citus,
calling these functions on one node can affect a backend running on
another node.
The example below shows the output of the function:
SELECT citus_backend_gpid();
citus_backend_gpid
--------------------
10000002055
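The sample value above decomposes according to the formula: 10000002055 = 1 × 10,000,000,000 + 2055, that is, node ID 1 and operating system process ID 2055. Because the signaling functions accept GPIDs, a backend on another node can be cancelled from the current one. The sketch below reuses the hypothetical GPID from the example:
-- Cancel the backend with GPID 10000002055
-- (node ID 1, OS process 2055), which may be running
-- on a different node than the one issuing the call
SELECT pg_cancel_backend(10000002055);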
citus_check_cluster_node_health () returns setof record
#Checks connectivity between all nodes. If there are N nodes, this function checks all N² connections between them. The function returns a list of tuples where each tuple contains the following information:
from_nodename — the DNS name of the
source worker node.
from_nodeport — the port on the
source worker node on which the database server is listening.
to_nodename — the DNS name of the
destination worker node.
to_nodeport — the port on the
destination worker node on which the database server is
listening.
result — whether a connection could
be established.
The example below shows the output of the function:
SELECT * FROM citus_check_cluster_node_health();
 from_nodename | from_nodeport | to_nodename | to_nodeport | result
---------------+---------------+-------------+-------------+--------
 localhost     |          1400 | localhost   |        1400 | t
 localhost     |          1400 | localhost   |        1401 | t
 localhost     |          1400 | localhost   |        1402 | t
 localhost     |          1401 | localhost   |        1400 | t
 localhost     |          1401 | localhost   |        1401 | t
 localhost     |          1401 | localhost   |        1402 | t
 localhost     |          1402 | localhost   |        1400 | t
 localhost     |          1402 | localhost   |        1401 | t
 localhost     |          1402 | localhost   |        1402 | t
(9 rows)
citus_set_coordinator_host (host text, port integer, node_role noderole, node_cluster name) returns void
#
This function is required when adding worker nodes to a
citus cluster that was initially created
as a single-node cluster.
When the coordinator registers a new worker, it takes the
coordinator hostname from the value of the
citus.local_hostname
configuration parameter, which is localhost
by default. The worker would then attempt to connect to
localhost to reach the coordinator, which is
obviously wrong.
Thus, the system administrator should call this function before calling the citus_add_node function in a single-node cluster.
Arguments:
host — the DNS name of the
coordinator node.
port — the port on which the
coordinator listens for Postgres Pro
connections. The default value of this optional argument is
current_setting('port').
node_role — the role of the node.
The default value of this optional argument is
primary.
node_cluster — the name of the
cluster. The default value of this optional argument is
default.
The example below shows how to use the function:
-- Assuming we are in a single-node cluster
-- First establish how workers should reach us
SELECT citus_set_coordinator_host('coord.example.com', 5432);
-- Then add a worker
SELECT * FROM citus_add_node('worker1.example.com', 5432);
get_shard_id_for_distribution_column (table_name regclass, distribution_value "any") returns bigint
#
citus assigns every row of a distributed
table to a shard based on the value of the row's distribution column
and the table's method of distribution. In most cases the precise
mapping is a low-level detail that the database administrator can
ignore. However, it can be useful to determine a row's shard either
for manual database maintenance tasks or just to satisfy curiosity.
The get_shard_id_for_distribution_column
function provides this info for hash-distributed tables as well as
reference tables and returns the shard ID that
citus associates with the distribution
column value for the given table.
Arguments:
table_name — the name of the
distributed table.
distribution_value — the value of
the distribution column. The default value is
NULL.
The example below shows how to use the function:
SELECT get_shard_id_for_distribution_column('my_table', 4);
get_shard_id_for_distribution_column
--------------------------------------
540007
(1 row)
column_to_column_name (table_name regclass, column_var_text text) returns text
#
Translates the partkey column of the
pg_dist_partition
table into a textual column name. This is useful to determine the
distribution column of a distributed table. The function returns the
distribution column name of the table_name
table. To learn more, see the
Finding the Distribution Column For a Table
section.
Arguments:
table_name — name of the
distributed table.
column_var_text — value of
partkey column in the
pg_dist_partition table.
The example below shows how to use the function:
-- Get distribution column name for products table
SELECT column_to_column_name(logicalrelid, partkey) AS dist_col_name
  FROM pg_dist_partition
 WHERE logicalrelid = 'products'::regclass;
 dist_col_name
---------------
 company_id
citus_relation_size (logicalrelid regclass) returns bigint
#Returns the disk space used by all the shards of the specified distributed table. This includes the size of the “main fork” but excludes the visibility map and free space map for the shards.
Arguments:
logicalrelid — the name of the
distributed table.
The example below shows how to use the function:
SELECT pg_size_pretty(citus_relation_size('github_events'));
 pg_size_pretty
----------------
 23 MB
citus_table_size (logicalrelid regclass) returns bigint
#Returns the disk space used by all the shards of the specified distributed table, excluding indexes (but including TOAST, free space map, and visibility map).
Arguments:
logicalrelid — the name of the
distributed table.
The example below shows how to use the function:
SELECT pg_size_pretty(citus_table_size('github_events'));
 pg_size_pretty
----------------
 37 MB
citus_total_relation_size (logicalrelid regclass, fail_on_error boolean) returns bigint
#Returns the total disk space used by all the shards of the specified distributed table, including all indexes and TOAST data.
Arguments:
logicalrelid — the name of the
distributed table.
fail_on_error. The default value is
true.
The example below shows how to use the function:
SELECT pg_size_pretty(citus_total_relation_size('github_events'));
 pg_size_pretty
----------------
 73 MB
citus_stat_statements_reset () returns void
#
Removes all rows from the
citus_stat_statements
table. Note that this works independently from the
pg_stat_statements_reset
function. To reset all stats, call both functions.
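For example, to clear both citus and Postgres Pro statement statistics in one step:
-- The two reset functions are independent; call both
SELECT citus_stat_statements_reset();
SELECT pg_stat_statements_reset();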
citus_move_shard_placement (shard_id bigint, source_node_name text, source_node_port integer, target_node_name text, target_node_port integer, shard_transfer_mode citus.shard_transfer_mode) returns void
#Moves a given shard (and shards co-located with it) from one node to another. It is typically used indirectly during shard rebalancing rather than being called directly by a database administrator.
There are two ways to move the data: blocking or non-blocking. The blocking approach means that during the move all modifications to the shard are paused. The second way, which avoids blocking shard writes, relies on Postgres Pro logical replication.
After a successful move operation, shards in the source node get deleted. If the move fails at any point, this function throws an error and leaves the source and target nodes unchanged.
Arguments:
shard_id — the ID of the shard to be
moved.
source_node_name — the DNS name of
the node on which the healthy shard placement is present
(“source” node).
source_node_port — the port on the
source worker node on which the database server is listening.
target_node_name — the DNS name of
the node on which the invalid shard placement is present
(“target” node).
target_node_port — the port on the
target worker node on which the database server is listening.
shard_transfer_mode — specify the
method of replication, whether to use
Postgres Pro logical replication or a
cross-worker COPY command. The allowed values
of this optional argument are:
auto — require replica identity
if logical replication is possible, otherwise use legacy
behaviour. This is the default value.
force_logical — use logical
replication even if the table does not have a replica
identity. Any concurrent update/delete statements to the
table will fail during replication.
block_writes — use
COPY (blocking writes) for tables
lacking primary key or replica identity.
The example below shows how to use the function:
SELECT citus_move_shard_placement(12345, 'from_host', 5432, 'to_host', 5432);
citus_rebalance_start (rebalance_strategy name, drain_only boolean, shard_transfer_mode citus.shard_transfer_mode) returns bigint
#Moves table shards to make them evenly distributed among the workers. It begins a background job to do the rebalancing and returns immediately.
The rebalancing process first calculates the list of moves it needs to make in order to ensure that the cluster is balanced within the given threshold. Then, it moves shard placements one by one from the source node to the destination node and updates the corresponding shard metadata to reflect the move.
Every shard is assigned a cost when determining whether shards are
“evenly distributed”. With the constant cost strategy,
called by_shard_count, each shard has the
same cost (a value of 1), so distributing to equalize the cost
across workers is the same as equalizing the number of shards on
each.
The by_shard_count strategy is appropriate under
these circumstances:
The shards are roughly the same size.
The shards get roughly the same amount of traffic.
Worker nodes are all the same size/type.
Shards have not been pinned to particular workers.
If any of these assumptions do not hold, then rebalancing using the
by_shard_count strategy can result in a bad plan.
The default rebalancing strategy is by_disk_size.
You can customize the strategy using the
rebalance_strategy parameter.
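For instance, to rebalance by shard count rather than by disk size:

```sql
-- Use the constant-cost strategy instead of the default
SELECT citus_rebalance_start(rebalance_strategy := 'by_shard_count');
```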
It is advisable to call the
get_rebalance_table_shards_plan
function before citus_rebalance_start to see
and verify the actions to be performed.
Arguments:
rebalance_strategy — name of a
strategy in the
pg_dist_rebalance_strategy
table. If this argument is omitted, the function chooses the
default strategy, as indicated in the table. The default value
of this optional argument is NULL.
drain_only. When true,
move shards off worker nodes that have
shouldhaveshards set to false
in the pg_dist_node
table, and move no other shards. The default value of this optional
argument is false.
shard_transfer_mode — specify the
method of replication, whether to use
Postgres Pro logical replication or a
cross-worker COPY command.
The allowed values of this optional argument are:
auto — require replica identity
if logical replication is possible, otherwise use legacy
behaviour. This is the default value.
force_logical — use logical
replication even if the table does not have a replica
identity. Any concurrent update/delete statements to the
table will fail during replication.
block_writes — use
COPY (blocking writes) for tables
lacking primary key or replica identity.
The example below will attempt to rebalance shards:
SELECT citus_rebalance_start();
NOTICE:  Scheduling...
NOTICE:  Scheduled as job 1337.
DETAIL:  Rebalance scheduled as background job 1337.
HINT:  To monitor progress, run: SELECT details FROM citus_rebalance_status();
citus_rebalance_status () returns table
#Allows you to monitor the progress of the rebalance. Returns immediately, while the rebalance continues as a background job.
To get general information about the rebalance, you can select all columns from the status. This shows the basic state of the job:
SELECT * FROM citus_rebalance_status();
job_id | state | job_type | description | started_at | finished_at | details
--------+----------+-----------+---------------------------------+-------------------------------+-------------------------------+-----------
4 | running | rebalance | Rebalance colocation group 1 | 2022-08-09 21:57:27.833055+02 | 2022-08-09 21:57:27.833055+02 | { ... }
Rebalancer specifics live in the details column,
in JSON format:
SELECT details FROM citus_rebalance_status();
{
"phase": "copy",
"phase_index": 1,
"phase_count": 3,
"last_change":"2022-08-09 21:57:27",
"colocations": {
"1": {
"shard_moves": 30,
"shard_moved": 29,
"last_move":"2022-08-09 21:57:27"
},
"1337": {
"shard_moves": 130,
"shard_moved": 0
}
}
}
citus_rebalance_stop () returns void
#Cancels the rebalance in progress, if any.
citus_rebalance_wait () returns void
#Blocks until a running rebalance is complete. If no rebalance is in progress when this function is called, then the function returns immediately.
The function can be useful for scripts or benchmarking.
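For example, a benchmark script might start a rebalance and then block until it finishes:

```sql
SELECT citus_rebalance_start();
-- Blocks until the background rebalance job completes;
-- returns immediately if no rebalance is running
SELECT citus_rebalance_wait();
```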
get_rebalance_table_shards_plan () returns table
#
Outputs the planned shard movements of the
citus_rebalance_start
function without performing them. While it is unlikely, this function
can output a slightly different plan than a
citus_rebalance_start call with the same
arguments would execute. This could happen because they are not executed
at the same time, so facts about the cluster, e.g. disk space, might
differ between the calls. The function returns tuples containing the
following columns:
table_name — the table whose shards
would move.
shardid — the shard in question.
shard_size — the size, in bytes.
sourcename — the hostname of the
source node.
sourceport — the port of the source
node.
targetname — the hostname of the
destination node.
targetport — the port of the
destination node.
Arguments:
A superset of the arguments for
the
citus_rebalance_start
function:
relation, threshold,
max_shard_moves,
excluded_shard_list, and
drain_only.
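For example, to preview the shard moves a subsequent rebalance with default arguments would perform:

```sql
-- Lists planned moves without executing them
SELECT * FROM get_rebalance_table_shards_plan();
```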
get_rebalance_progress () returns table
#Once the shard rebalance begins, this function lists the progress of every shard involved. It monitors the moves planned and executed by the citus_rebalance_start function. The function returns tuples containing the following columns:
sessionid —
the Postgres Pro PID of the rebalance
monitor.
table_name — the table whose shards
are moving.
shardid — the shard in question.
shard_size — the size of the shard,
in bytes.
sourcename — the hostname of the
source node.
sourceport — the port of the source
node.
targetname — the hostname of the
destination node.
targetport — the port of the destination
node.
progress. The following values may be
returned: 0 — waiting to be moved,
1 — moving, 2
— complete.
source_shard_size — the size of the
shard on the source node, in bytes.
target_shard_size — the size of the
shard on the target node, in bytes.
The example below shows how to use the function:
SELECT * FROM get_rebalance_progress();
┌───────────┬────────────┬─────────┬────────────┬───────────────┬────────────┬───────────────┬────────────┬──────────┬───────────────────┬───────────────────┐
│ sessionid │ table_name │ shardid │ shard_size │ sourcename    │ sourceport │ targetname    │ targetport │ progress │ source_shard_size │ target_shard_size │
├───────────┼────────────┼─────────┼────────────┼───────────────┼────────────┼───────────────┼────────────┼──────────┼───────────────────┼───────────────────┤
│      7083 │ foo        │  102008 │    1204224 │ n1.foobar.com │       5432 │ n4.foobar.com │       5432 │        0 │           1204224 │                 0 │
│      7083 │ foo        │  102009 │    1802240 │ n1.foobar.com │       5432 │ n4.foobar.com │       5432 │        0 │           1802240 │                 0 │
│      7083 │ foo        │  102018 │     614400 │ n2.foobar.com │       5432 │ n4.foobar.com │       5432 │        1 │            614400 │            354400 │
│      7083 │ foo        │  102019 │       8192 │ n3.foobar.com │       5432 │ n4.foobar.com │       5432 │        2 │                 0 │              8192 │
└───────────┴────────────┴─────────┴────────────┴───────────────┴────────────┴───────────────┴────────────┴──────────┴───────────────────┴───────────────────┘
citus_add_rebalance_strategy (name name, shard_cost_function regproc, node_capacity_function regproc, shard_allowed_on_node_function regproc, default_threshold float4, minimum_threshold float4, improvement_threshold float4) returns void
#Appends a row to the pg_dist_rebalance_strategy table.
Arguments:
name — the identifier for the new
strategy.
shard_cost_function — identifies
the function used to determine the “cost” of each
shard.
node_capacity_function — identifies
the function to measure node capacity.
shard_allowed_on_node_function —
identifies the function that determines which shards can be
placed on which nodes.
default_threshold — floating point
threshold that tunes how precisely the cumulative shard cost
should be balanced between nodes.
minimum_threshold — safeguard
column that holds the minimum value allowed for the threshold
argument of the
citus_rebalance_start
function. The default value is 0.
improvement_threshold — determines when
moving a shard is worth it during a rebalance. The default value
is 0.
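As a sketch, the call below registers a hypothetical strategy named custom_strategy that reuses the built-in helper functions shown later for the pg_dist_rebalance_strategy table:

```sql
SELECT citus_add_rebalance_strategy(
    'custom_strategy',                            -- hypothetical strategy name
    'citus_shard_cost_1'::regproc,                -- built-in constant cost function
    'citus_node_capacity_1'::regproc,             -- built-in constant capacity function
    'citus_shard_allowed_on_node_true'::regproc,  -- allow shards on any node
    0.1,    -- default_threshold
    0.01,   -- minimum_threshold
    0.5     -- improvement_threshold
);
```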
citus_set_default_rebalance_strategy (name text) returns void
#Updates the pg_dist_rebalance_strategy table, changing the strategy named by its argument to be the default chosen when rebalancing shards.
Arguments:
name — the name of the strategy in
the pg_dist_rebalance_strategy table.
The example below shows how to use the function:
SELECT citus_set_default_rebalance_strategy('by_disk_size');
citus_remote_connection_stats () returns setof record
#Shows the number of active connections to each remote node.
The example below shows how to use the function:
SELECT * FROM citus_remote_connection_stats();
hostname | port | database_name | connection_count_to_node
----------------+------+---------------+--------------------------
citus_worker_1 | 5432 | postgres | 3
(1 row)
citus_drain_node (nodename text, nodeport integer, shard_transfer_mode citus.shard_transfer_mode, rebalance_strategy name) returns void
#
Moves shards off the designated node and onto other nodes that have
shouldhaveshards set to true
in the pg_dist_node
table. This function is designed to be called prior to removing a
node from the cluster, i.e. turning the node's physical server off.
Arguments:
nodename — the DNS name of the
node to be drained.
nodeport — the port number of the
node to be drained.
shard_transfer_mode — specify the
method of replication, whether to use
Postgres Pro logical replication or a
cross-worker COPY command. The allowed values
of this optional argument are:
auto — require replica identity
if logical replication is possible, otherwise use legacy
behaviour. This is the default value.
force_logical — use logical
replication even if the table does not have a replica
identity. Any concurrent update/delete statements to the
table will fail during replication.
block_writes — use
COPY (blocking writes) for tables
lacking primary key or replica identity.
rebalance_strategy — the name of a
strategy in the
pg_dist_rebalance_strategy
table. If this argument is omitted, the function chooses the
default strategy, as indicated in the table. The default value
of this optional argument is NULL.
Here are the typical steps to remove a single node (for example '10.0.0.1' on a standard Postgres Pro port):
Drain the node.
SELECT * FROM citus_drain_node('10.0.0.1', 5432);
Wait until the command finishes.
Remove the node.
When draining multiple nodes it is recommended to use the citus_rebalance_start function instead. Doing so allows citus to plan ahead and move shards the minimum number of times.
Run this for each node that you want to remove:
SELECT * FROM citus_set_node_property(node_hostname, node_port, 'shouldhaveshards', false);
Drain them all at once with the citus_rebalance_start function:
SELECT * FROM citus_rebalance_start(drain_only := true);
Wait until the draining rebalance finishes.
Remove the nodes.
isolate_tenant_to_new_shard (table_name regclass, tenant_id "any", cascade_option text, shard_transfer_mode citus.shard_transfer_mode) returns bigint
#Creates a new shard to hold rows with a specific single value in the distribution column. It is especially handy for the multi-tenant citus use case, where a large tenant can be placed alone on its own shard and ultimately its own physical node. To learn more, see the Tenant Isolation section. The function returns the unique ID assigned to the newly created shard.
Arguments:
table_name — the name of the table
to get a new shard.
tenant_id — the value of the
distribution column which will be assigned to the new shard.
cascade_option. When set to
CASCADE, also isolates the tenant's rows
from all tables co-located with the current table.
shard_transfer_mode — specify the
method of replication, whether to use
Postgres Pro logical replication or a
cross-worker COPY command. The allowed values
of this optional argument are:
auto — require replica identity
if logical replication is possible, otherwise use legacy
behaviour. This is the default value.
force_logical — use logical
replication even if the table does not have a replica
identity. Any concurrent update/delete statements to the
table will fail during replication.
block_writes — use
COPY (blocking writes) for tables
lacking primary key or replica identity.
The example below shows how to create a new shard to hold the
lineitems for tenant 135:
SELECT isolate_tenant_to_new_shard('lineitem', 135);
┌─────────────────────────────┐
│ isolate_tenant_to_new_shard │
├─────────────────────────────┤
│                      102240 │
└─────────────────────────────┘
citus_create_restore_point (name text) returns pg_lsn
#
Temporarily blocks writes to the cluster, and creates a named
restore point on all nodes. This function is similar to
pg_create_restore_point,
but applies to all nodes and makes sure the restore point is
consistent across them. This function is well suited to doing
point-in-time recovery, and cluster forking. The function returns
the coordinator_lsn value, i.e. the log sequence
number of the restore point in the coordinator node WAL.
Arguments:
name — the name of the restore
point to create.
The example below shows how to use the function:
SELECT citus_create_restore_point('foo');
┌────────────────────────────┐
│ citus_create_restore_point │
├────────────────────────────┤
│ 0/1EA2808                  │
└────────────────────────────┘
citus divides each distributed table into multiple logical shards based on the distribution column. The coordinator then maintains metadata tables to track statistics and information about the health and location of these shards. In this section, we describe each of these metadata tables and their schema. You can view and query these tables using SQL after logging into the coordinator node.
pg_dist_partition Table #
The pg_dist_partition table stores metadata
about which tables in the database are distributed. For each
distributed table, it also stores information about the distribution
method and detailed information about the distribution column.
| Name | Type | Description |
|---|---|---|
| logicalrelid | regclass |
Distributed table to which this row corresponds. This value
references the relfilenode column in the
pg_class system
catalog table.
|
| partmethod | char |
The method used for partitioning / distribution. The values of
this column corresponding to different distribution methods
are: hash — h, reference table —
n.
|
| partkey | text | Detailed information about the distribution column including column number, type, and other relevant information. |
| colocationid | integer |
Co-location group to which this table belongs. Tables in the
same group allow co-located joins and distributed rollups
among other optimizations. This value references the
colocationid column in the
pg_dist_colocation
table.
|
| repmodel | char |
The method used for data replication. The values of this
column corresponding to different replication methods are:
Postgres Pro streaming replication
— s, two-phase commit (for
reference tables) — t.
|
SELECT * FROM pg_dist_partition;
logicalrelid | partmethod | partkey | colocationid | repmodel
---------------+------------+------------------------------------------------------------------------------------------------------------------------+--------------+----------
github_events | h | {VAR :varno 1 :varattno 4 :vartype 20 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 4 :location -1} | 2 | s
(1 row)
pg_dist_shard Table #
The pg_dist_shard table stores metadata about
individual shards of a table. This includes information about which
distributed table the shard belongs to and statistics about the
distribution column for that shard. For hash-distributed tables,
these are the hash token ranges assigned to that shard. These statistics
are used for pruning away unrelated shards during
SELECT queries.
| Name | Type | Description |
|---|---|---|
| logicalrelid | regclass |
Distributed table to which this shard belongs. This value
references the relfilenode column in the
pg_class system
catalog table.
|
| shardid | bigint | Globally unique identifier assigned to this shard. |
| shardstorage | char | Type of storage used for this shard. Different storage types are discussed in the table below. |
| shardminvalue | text | For hash distributed tables, minimum hash token value assigned to that shard (inclusive). |
| shardmaxvalue | text | For hash distributed tables, maximum hash token value assigned to that shard (inclusive). |
SELECT * FROM pg_dist_shard;
 logicalrelid  | shardid | shardstorage | shardminvalue | shardmaxvalue
---------------+---------+--------------+---------------+---------------
 github_events |  102026 | t            | 268435456     | 402653183
 github_events |  102027 | t            | 402653184     | 536870911
 github_events |  102028 | t            | 536870912     | 671088639
 github_events |  102029 | t            | 671088640     | 805306367
(4 rows)
The shardstorage column in
pg_dist_shard indicates the type of storage
used for the shard. A brief overview of different shard storage types
and their representation is below.
| Storage Type |
shardstorage value
| Description |
|---|---|---|
| TABLE |
t
| Indicates that shard stores data belonging to a regular distributed table. |
| COLUMNAR |
c
| Indicates that shard stores columnar data. (Used by distributed cstore_fdw tables). |
| FOREIGN |
f
| Indicates that shard stores foreign data. (Used by distributed file_fdw tables). |
citus_shards View #
In addition to the low-level shard metadata table described above,
citus provides the
citus_shards view to easily check:
Where each shard is (node and port),
What kind of table it belongs to, and
Its size.
This view helps you inspect shards to find, among other things, any size imbalances across nodes.
SELECT * FROM citus_shards;
 table_name | shardid | shard_name   | citus_table_type | colocation_id | nodename  | nodeport | shard_size
------------+---------+--------------+------------------+---------------+-----------+----------+------------
 dist       |  102170 | dist_102170  | distributed      |            34 | localhost |     9701 |   90677248
 dist       |  102171 | dist_102171  | distributed      |            34 | localhost |     9702 |   90619904
 dist       |  102172 | dist_102172  | distributed      |            34 | localhost |     9701 |   90701824
 dist       |  102173 | dist_102173  | distributed      |            34 | localhost |     9702 |   90693632
 ref        |  102174 | ref_102174   | reference        |             2 | localhost |     9701 |       8192
 ref        |  102174 | ref_102174   | reference        |             2 | localhost |     9702 |       8192
 dist2      |  102175 | dist2_102175 | distributed      |            34 | localhost |     9701 |     933888
 dist2      |  102176 | dist2_102176 | distributed      |            34 | localhost |     9702 |     950272
 dist2      |  102177 | dist2_102177 | distributed      |            34 | localhost |     9701 |     942080
 dist2      |  102178 | dist2_102178 | distributed      |            34 | localhost |     9702 |     933888
The colocation_id refers to the
colocation group.
For more info about citus_table_type, see the
Table Types section.
pg_dist_placement Table #
The pg_dist_placement table tracks the
location of shards on worker nodes. Each shard assigned to a specific
node is called a shard placement. This table stores information about
the health and location of each shard placement.
| Name | Type | Description |
|---|---|---|
| placementid | bigint | Unique auto-generated identifier for each individual placement. |
| shardid | bigint |
Shard identifier associated with this placement. This value
references the shardid column in the
pg_dist_shard
catalog table.
|
| shardstate | int | Describes the state of this placement. Different shard states are discussed in the section below. |
| shardlength | bigint | For hash distributed tables, zero. |
| groupid | int | Identifier used to denote a group of one primary server and zero or more secondary servers. |
SELECT * FROM pg_dist_placement;
placementid | shardid | shardstate | shardlength | groupid
-------------+---------+------------+-------------+---------
1 | 102008 | 1 | 0 | 1
2 | 102008 | 1 | 0 | 2
3 | 102009 | 1 | 0 | 2
4 | 102009 | 1 | 0 | 3
5 | 102010 | 1 | 0 | 3
6 | 102010 | 1 | 0 | 4
7 | 102011 | 1 | 0 | 4
pg_dist_node Table #
The pg_dist_node table contains information
about the worker nodes in the cluster.
| Name | Type | Description |
|---|---|---|
| nodeid | int | Auto-generated identifier for an individual node. |
| groupid | int |
Identifier used to denote a group of one primary server and
zero or more secondary servers. By default it is the same as
the nodeid.
|
| nodename | text | Host name or IP Address of the Postgres Pro worker node. |
| nodeport | int | Port number on which the Postgres Pro worker node is listening. |
| noderack | text | Rack placement information for the worker node. This is an optional column. |
| hasmetadata | boolean | Reserved for internal use. |
| isactive | boolean | Whether the node is active and accepting shard placements. |
| noderole | text | Whether the node is a primary or secondary. |
| nodecluster | text | The name of the cluster containing this node. |
| metadatasynced | boolean | Reserved for internal use. |
| shouldhaveshards | boolean | If false, shards will be moved off the node (drained) when rebalancing, and shards from new distributed tables will not be placed on the node, unless they are co-located with shards already there. |
SELECT * FROM pg_dist_node;
nodeid | groupid | nodename | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards
--------+---------+-----------+----------+----------+-------------+----------+----------+-------------+----------------+------------------
1 | 1 | localhost | 12345 | default | f | t | primary | default | f | t
2 | 2 | localhost | 12346 | default | f | t | primary | default | f | t
3 | 3 | localhost | 12347 | default | f | t | primary | default | f | t
(3 rows)
citus.pg_dist_object Table #
The citus.pg_dist_object table contains a
list of objects such as types and functions that have been created on
the coordinator node and propagated to worker nodes. When an
administrator adds new worker nodes to the cluster,
citus automatically creates copies of the
distributed objects on the new nodes (in the correct order to satisfy
object dependencies).
| Name | Type | Description |
|---|---|---|
| classid | oid | Class of the distributed object |
| objid | oid | Object ID of the distributed object |
| objsubid | integer |
Object sub-ID of the distributed object, e.g.
attnum
|
| type | text | Part of the stable address used during upgrades with pg_upgrade |
| object_names | text[] | Part of the stable address used during upgrades with pg_upgrade |
| object_args | text[] | Part of the stable address used during upgrades with pg_upgrade |
| distribution_argument_index | integer | Only valid for distributed functions/procedures |
| colocationid | integer | Only valid for distributed functions/procedures |
“Stable addresses” uniquely identify objects independently of a specific server. citus tracks objects during a Postgres Pro upgrade using stable addresses created with the pg_identify_object_as_address function.
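For example, you can inspect the stable address of each distributed object with pg_identify_object_as_address:

```sql
-- Resolve each distributed object to its stable (type, name, args) address
SELECT pg_identify_object_as_address(classid, objid, objsubid)
FROM citus.pg_dist_object;
```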
Here is an example of how the
create_distributed_function
function adds entries to the
citus.pg_dist_object table:
CREATE TYPE stoplight AS enum ('green', 'yellow', 'red');
CREATE OR REPLACE FUNCTION intersection()
RETURNS stoplight AS $$
DECLARE
color stoplight;
BEGIN
SELECT *
FROM unnest(enum_range(NULL::stoplight)) INTO color
ORDER BY random() LIMIT 1;
RETURN color;
END;
$$ LANGUAGE plpgsql VOLATILE;
SELECT create_distributed_function('intersection()');
-- Will have two rows, one for the TYPE and one for the FUNCTION
TABLE citus.pg_dist_object;
-[ RECORD 1 ]---------------+------
classid                     | 1247
objid                       | 16780
objsubid                    | 0
type                        |
object_names                |
object_args                 |
distribution_argument_index |
colocationid                |
-[ RECORD 2 ]---------------+------
classid                     | 1255
objid                       | 16788
objsubid                    | 0
type                        |
object_names                |
object_args                 |
distribution_argument_index |
colocationid                |
citus_schemas View #
citus supports
schema-based sharding
and provides the citus_schemas view that
shows which schemas have been distributed in the system. The view only
lists distributed schemas; local schemas are not displayed.
| Name | Type | Description |
|---|---|---|
| schema_name | regnamespace | Name of the distributed schema |
| colocation_id | integer | Co-location ID of the distributed schema |
| schema_size | text | Human-readable size summary of all objects within the schema |
| schema_owner | name | Role that owns the schema |
Here is an example:
SELECT * FROM citus_schemas;

 schema_name  | colocation_id | schema_size | schema_owner
--------------+---------------+-------------+--------------
 user_service |             1 | 0 bytes     | user_service
 time_service |             2 | 0 bytes     | time_service
 ping_service |             3 | 632 kB      | ping_service
citus_tables View #
The citus_tables view shows a summary of all
tables managed by citus (distributed and
reference tables). The view combines information from
citus metadata tables for an easy,
human-readable overview of these table properties:
Human-readable size
Shard count
Owner (database user)
Access method (heap or
columnar)
Here is an example:
SELECT * FROM citus_tables;
┌────────────┬──────────────────┬─────────────────────┬───────────────┬────────────┬─────────────┬─────────────┬───────────────┐
│ table_name │ citus_table_type │ distribution_column │ colocation_id │ table_size │ shard_count │ table_owner │ access_method │
├────────────┼──────────────────┼─────────────────────┼───────────────┼────────────┼─────────────┼─────────────┼───────────────┤
│ foo.test   │ distributed      │ test_column         │             1 │ 0 bytes    │          32 │ citus       │ heap          │
│ ref        │ reference        │ <none>              │             2 │ 24 GB      │           1 │ citus       │ heap          │
│ test       │ distributed      │ id                  │             1 │ 248 TB     │          32 │ citus       │ heap          │
└────────────┴──────────────────┴─────────────────────┴───────────────┴────────────┴─────────────┴─────────────┴───────────────┘
time_partitions View #
citus provides user defined functions to
manage partitions for the
timeseries
use case. It also maintains the time_partitions
view to inspect the partitions it manages.
The columns of this view are as follows:
parent_table — the table which is
partitioned.
partition_column — the column on which
the parent table is partitioned.
partition — the name of a partition.
from_value — lower bound in time for rows
in this partition.
to_value — upper bound in time for rows
in this partition.
access_method —
heap for row-based storage and
columnar for columnar storage.
SELECT * FROM time_partitions;
┌────────────────────────┬──────────────────┬─────────────────────────────────────────┬─────────────────────┬─────────────────────┬───────────────┐
│ parent_table           │ partition_column │ partition                               │ from_value          │ to_value            │ access_method │
├────────────────────────┼──────────────────┼─────────────────────────────────────────┼─────────────────────┼─────────────────────┼───────────────┤
│ github_columnar_events │ created_at       │ github_columnar_events_p2015_01_01_0000 │ 2015-01-01 00:00:00 │ 2015-01-01 02:00:00 │ columnar      │
│ github_columnar_events │ created_at       │ github_columnar_events_p2015_01_01_0200 │ 2015-01-01 02:00:00 │ 2015-01-01 04:00:00 │ columnar      │
│ github_columnar_events │ created_at       │ github_columnar_events_p2015_01_01_0400 │ 2015-01-01 04:00:00 │ 2015-01-01 06:00:00 │ columnar      │
│ github_columnar_events │ created_at       │ github_columnar_events_p2015_01_01_0600 │ 2015-01-01 06:00:00 │ 2015-01-01 08:00:00 │ heap          │
└────────────────────────┴──────────────────┴─────────────────────────────────────────┴─────────────────────┴─────────────────────┴───────────────┘
pg_dist_colocation Table #
The pg_dist_colocation table contains
information about which tables' shards should be placed together, or
co-located. When two
tables are in the same co-location group,
citus ensures shards with the same
partition values will be placed on the same worker nodes. This enables
join optimizations, certain distributed rollups, and foreign key
support. Shard co-location is inferred when the shard counts and
distribution column types match between two tables; however, a custom
co-location group may be specified when creating a distributed table,
if so desired.
| Name | Type | Description |
|---|---|---|
| colocationid | int | Unique identifier for the co-location group this row corresponds to |
| shardcount | int | Shard count for all tables in this co-location group |
| replicationfactor | int | Replication factor for all tables in this co-location group. (Deprecated) |
| distributioncolumntype | oid | The type of the distribution column for all tables in this co-location group |
| distributioncolumncollation | oid | The collation of the distribution column for all tables in this co-location group |
SELECT * FROM pg_dist_colocation;
colocationid | shardcount | replicationfactor | distributioncolumntype | distributioncolumncollation
--------------+------------+-------------------+------------------------+-----------------------------
2 | 32 | 1 | 20 | 0
(1 row)
pg_dist_rebalance_strategy Table #This table defines strategies that the citus_rebalance_start function can use to determine where to move shards.
| Name | Type | Description |
|---|---|---|
| name | name | Unique name for the strategy |
| default_strategy | boolean | Whether citus_rebalance_start should choose this strategy by default. Use citus_set_default_rebalance_strategy to update this column. |
| shard_cost_function | regproc |
Identifier for a cost function, which must take a
shardid as bigint and return
its notion of a cost, as type real.
|
| node_capacity_function | regproc |
Identifier for a capacity function, which must take a
nodeid as int and return its
notion of node capacity as type real.
|
| shard_allowed_on_node_function | regproc |
Identifier for a function that given
shardid bigint and
nodeidarg int, returns
boolean for whether the shard is allowed to be
stored on the node.
|
| default_threshold | float4 | Threshold for deeming a node too full or too empty, which determines when the citus_rebalance_start function should try to move shards. |
| minimum_threshold | float4 | A safeguard to prevent the threshold argument of citus_rebalance_start from being set too low. |
| improvement_threshold | float4 |
Determines when moving a shard is worth it during a rebalance.
The rebalancer will move a shard when the ratio of the
improvement with the shard move to the improvement without
crosses the threshold. This is most useful with the
by_disk_size strategy.
|
A citus installation ships with these strategies in the table:
SELECT * FROM pg_dist_rebalance_strategy;
-[ RECORD 1 ]------------------+---------------------------------
name                           | by_shard_count
default_strategy               | f
shard_cost_function            | citus_shard_cost_1
node_capacity_function         | citus_node_capacity_1
shard_allowed_on_node_function | citus_shard_allowed_on_node_true
default_threshold              | 0
minimum_threshold              | 0
improvement_threshold          | 0
-[ RECORD 2 ]------------------+---------------------------------
name                           | by_disk_size
default_strategy               | t
shard_cost_function            | citus_shard_cost_by_disk_size
node_capacity_function         | citus_node_capacity_1
shard_allowed_on_node_function | citus_shard_allowed_on_node_true
default_threshold              | 0.1
minimum_threshold              | 0.01
improvement_threshold          | 0.5
The by_shard_count strategy assigns every
shard the same cost. Its effect is to equalize the shard count
across nodes. The default strategy,
by_disk_size, assigns a cost to each shard
matching its disk size in bytes plus that of the shards that are
co-located with it. The disk size is calculated using
pg_total_relation_size,
so it includes indices. This strategy attempts to achieve the same
disk space on every node. Note the threshold of 0.1 — it prevents
unnecessary shard movement caused by insignificant differences in
disk space.
Here are examples of functions that can be used within new shard
rebalancer strategies, and registered in the
pg_dist_rebalance_strategy table with the
citus_add_rebalance_strategy
function.
Setting a node capacity exception by hostname pattern:
-- Example of node_capacity_function
CREATE FUNCTION v2_node_double_capacity(nodeidarg int)
RETURNS real AS $$
SELECT
(CASE WHEN nodename LIKE '%.v2.worker.citusdata.com' THEN 2.0::float4 ELSE 1.0::float4 END)
FROM pg_dist_node where nodeid = nodeidarg
$$ LANGUAGE sql;
Rebalancing by number of queries that go to a shard, as measured by the citus_stat_statements table:
-- Example of shard_cost_function
CREATE FUNCTION cost_of_shard_by_number_of_queries(shardid bigint)
RETURNS real AS $$
SELECT coalesce(sum(calls)::real, 0.001) as shard_total_queries
FROM citus_stat_statements
WHERE partition_key is not null
AND get_shard_id_for_distribution_column('tab', partition_key) = shardid;
$$ LANGUAGE sql;
Isolating a specific shard (10000) on a node (address '10.0.0.1'):
-- Example of shard_allowed_on_node_function
CREATE FUNCTION isolate_shard_10000_on_10_0_0_1(shardid bigint, nodeidarg int)
RETURNS boolean AS $$
SELECT
(CASE WHEN nodename = '10.0.0.1' THEN shardid = 10000 ELSE shardid != 10000 END)
FROM pg_dist_node where nodeid = nodeidarg
$$ LANGUAGE sql;
-- The next two definitions are recommended in combination with the above function.
-- This way the average utilization of nodes is not impacted by the isolated shard
CREATE FUNCTION no_capacity_for_10_0_0_1(nodeidarg int)
RETURNS real AS $$
SELECT
(CASE WHEN nodename = '10.0.0.1' THEN 0 ELSE 1 END)::real
FROM pg_dist_node where nodeid = nodeidarg
$$ LANGUAGE sql;
CREATE FUNCTION no_cost_for_10000(shardid bigint)
RETURNS real AS $$
SELECT
(CASE WHEN shardid = 10000 THEN 0 ELSE 1 END)::real
$$ LANGUAGE sql;
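Once defined, such functions can be registered as a strategy. The call below is a sketch that registers the isolation functions above; the strategy name and threshold values are illustrative, not required:
```sql
-- Register the isolation functions above as a rebalance strategy
-- (the name and thresholds here are only examples)
SELECT citus_add_rebalance_strategy(
    'isolate_shard_10000',                -- name
    'no_cost_for_10000',                  -- shard_cost_function
    'no_capacity_for_10_0_0_1',           -- node_capacity_function
    'isolate_shard_10000_on_10_0_0_1',    -- shard_allowed_on_node_function
    0,                                    -- default_threshold
    0                                     -- minimum_threshold
);

-- Rebalance using the new strategy
SELECT citus_rebalance_start(rebalance_strategy := 'isolate_shard_10000');
```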
citus_stat_statements Table #
citus provides the
citus_stat_statements table
for stats about how queries are being executed, and for whom. It is
analogous to (and can be joined with) the
pg_stat_statements
view in Postgres Pro, which tracks statistics
about query speed.
| Name | Type | Description |
|---|---|---|
| queryid | bigint |
Identifier (good for pg_stat_statements joins)
|
| userid | oid | User who ran the query |
| dbid | oid | Database instance of coordinator |
| query | text | Anonymized query string |
| executor | text |
citus
executor
used: adaptive, or INSERT-SELECT
|
| partition_key | text |
Value of distribution column in router-executed queries,
else NULL
|
| calls | bigint | Number of times the query was run |
-- Create and populate distributed table
CREATE TABLE foo ( id int );
SELECT create_distributed_table('foo', 'id');
INSERT INTO foo select generate_series(1,100);
-- Enable stats
-- pg_stat_statements must be in shared_preload_libraries
CREATE EXTENSION pg_stat_statements;
SELECT count(*) from foo;
SELECT * FROM foo where id = 42;
SELECT * FROM citus_stat_statements;
Results:
-[ RECORD 1 ]-+----------------------------------------------
queryid       | -909556869173432820
userid        | 10
dbid          | 13340
query         | insert into foo select generate_series($1,$2)
executor      | insert-select
partition_key |
calls         | 1
-[ RECORD 2 ]-+----------------------------------------------
queryid       | 3919808845681956665
userid        | 10
dbid          | 13340
query         | select count(*) from foo;
executor      | adaptive
partition_key |
calls         | 1
-[ RECORD 3 ]-+----------------------------------------------
queryid       | 5351346905785208738
userid        | 10
dbid          | 13340
query         | select * from foo where id = $1
executor      | adaptive
partition_key | 42
calls         | 1
Caveats:
The stats data is not replicated and will not survive database crashes or failover.
Tracks a limited number of queries set by the
pg_stat_statements.max
configuration parameter. The default value is 5000.
To truncate the table, use the citus_stat_statements_reset function.
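Since citus_stat_statements shares queryid values with pg_stat_statements, the two can be joined. The query below is a sketch combining citus executor information with timing data; column names follow the standard pg_stat_statements view:
```sql
-- Combine citus execution info with pg_stat_statements timings
SELECT cs.query, cs.executor, cs.partition_key,
       pss.calls, pss.total_exec_time
FROM citus_stat_statements cs
JOIN pg_stat_statements pss USING (queryid)
ORDER BY pss.total_exec_time DESC
LIMIT 10;
```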
citus_stat_tenants View #
The citus_stat_tenants view augments the
citus_stat_statements
table with information about how many queries each tenant is running.
Tracing queries to originating tenants helps, among other things, for
deciding when to do
tenant isolation.
This view counts recent single-tenant queries happening during a
configurable time period. The tally of read-only and total queries for
the period increases until the current period ends. After that, the
counts are moved to last period's statistics, which stays constant
until expiration. The period length can be set in seconds using
citus.stat_tenants_period, and is 60 seconds
by default.
The view displays up to citus.stat_tenants_limit
rows (by default 100). It counts only queries
filtered to a single tenant, ignoring queries that apply to multiple
tenants at once.
| Name | Type | Description |
|---|---|---|
| nodeid | int | Node ID from the pg_dist_node |
| colocation_id | int | ID of the co-location group |
| tenant_attribute | text | Value in the distribution column identifying tenant |
| read_count_in_this_period | int |
Number of read (SELECT) queries for tenant
in period
|
| read_count_in_last_period | int | Number of read queries one period of time ago |
| query_count_in_this_period | int | Number of read/write queries for tenant in time period |
| query_count_in_last_period | int | Number of read/write queries one period of time ago |
| cpu_usage_in_this_period | double | Seconds of CPU time spent for this tenant in period |
| cpu_usage_in_last_period | double | Seconds of CPU time spent for this tenant last period |
Tracking tenant level statistics adds overhead, and by default
is disabled. To enable it, set
citus.stat_tenants_track to
'all'.
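For example, tracking can be enabled cluster-wide and the configuration reloaded:
```sql
-- Enable tenant-level tracking (adds some overhead)
ALTER SYSTEM SET citus.stat_tenants_track = 'all';
SELECT pg_reload_conf();
```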
By way of example, suppose we have a distributed table called
dist_table, with distribution column
tenant_id. Then we make some queries:
INSERT INTO dist_table(tenant_id) VALUES (1);
INSERT INTO dist_table(tenant_id) VALUES (1);
INSERT INTO dist_table(tenant_id) VALUES (2);
SELECT count(*) FROM dist_table WHERE tenant_id = 1;
The tenant-level statistics will reflect the queries we just made:
SELECT tenant_attribute, read_count_in_this_period,
query_count_in_this_period, cpu_usage_in_this_period
FROM citus_stat_tenants;
 tenant_attribute | read_count_in_this_period | query_count_in_this_period | cpu_usage_in_this_period
------------------+---------------------------+----------------------------+--------------------------
 1                | 1                         | 3                          | 0.000883
 2                | 0                         | 1                          | 0.000144
In some situations, queries might get blocked on row-level locks on one of the shards on a worker node. If that happens, those queries do not show up in pg_locks on the citus coordinator node.
citus provides special views to watch queries and locks throughout the cluster, including shard-specific queries used internally to build results for distributed queries.
citus_stat_activity — shows the
distributed queries that are executing on all nodes. A superset of
pg_stat_activity
usable wherever the latter is.
citus_dist_stat_activity —
the same as citus_stat_activity but
restricted to distributed queries only, and excluding
citus fragments queries.
citus_lock_waits — blocked queries
throughout the cluster.
The first two views include all columns of
pg_stat_activity plus the global PID of the
worker that initiated the query.
For example, consider counting the rows in a distributed table:
-- Run in one session
-- (with a pg_sleep so we can see it)
SELECT count(*), pg_sleep(3) FROM users_table;
We can see the query appear in
citus_dist_stat_activity:
-- Run in another session
SELECT * FROM citus_dist_stat_activity;
-[ RECORD 1 ]----+-----------------------------------------------
global_pid       | 10000012199
nodeid           | 1
is_worker_query  | f
datid            | 13724
datname          | postgres
pid              | 12199
leader_pid       |
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2022-03-23 11:30:00.533991-05
xact_start       | 2022-03-23 19:35:28.095546-05
query_start      | 2022-03-23 19:35:28.095546-05
state_change     | 2022-03-23 19:35:28.09564-05
wait_event_type  | Timeout
wait_event       | PgSleep
state            | active
backend_xid      |
backend_xmin     | 777
query_id         |
query            | SELECT count(*), pg_sleep(3) FROM users_table;
backend_type     | client backend
The citus_dist_stat_activity view hides
internal citus fragment queries. To see
those, we can use the more detailed
citus_stat_activity view. For instance, the
previous count(*) query requires information from
all shards. Some of the information is in shard
users_table_102039, which is visible in the query
below.
SELECT * FROM citus_stat_activity;
-[ RECORD 1 ]----+-----------------------------------------------------------------------
global_pid       | 10000012199
nodeid           | 1
is_worker_query  | f
datid            | 13724
datname          | postgres
pid              | 12199
leader_pid       |
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2022-03-23 11:30:00.533991-05
xact_start       | 2022-03-23 19:32:18.260803-05
query_start      | 2022-03-23 19:32:18.260803-05
state_change     | 2022-03-23 19:32:18.260821-05
wait_event_type  | Timeout
wait_event       | PgSleep
state            | active
backend_xid      |
backend_xmin     | 777
query_id         |
query            | SELECT count(*), pg_sleep(3) FROM users_table;
backend_type     | client backend
-[ RECORD 2 ]----+-----------------------------------------------------------------------
global_pid       | 10000012199
nodeid           | 1
is_worker_query  | t
datid            | 13724
datname          | postgres
pid              | 12725
leader_pid       |
usesysid         | 10
usename          | postgres
application_name | citus_internal gpid=10000012199
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 44106
backend_start    | 2022-03-23 19:29:53.377573-05
xact_start       |
query_start      | 2022-03-23 19:32:18.278121-05
state_change     | 2022-03-23 19:32:18.278281-05
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query_id         |
query            | SELECT count(*) AS count FROM public.users_table_102039 users WHERE true
backend_type     | client backend
The query field shows rows being counted in shard
102039.
Here are examples of useful queries you can build using
citus_stat_activity:
-- Active queries' wait events
SELECT query, wait_event_type, wait_event
FROM citus_stat_activity
WHERE state = 'active';

-- Active queries' top wait events
SELECT wait_event, wait_event_type, count(*)
FROM citus_stat_activity
WHERE state = 'active'
GROUP BY wait_event, wait_event_type
ORDER BY count(*) DESC;

-- Total internal connections generated per node by citus
SELECT nodeid, count(*)
FROM citus_stat_activity
WHERE is_worker_query
GROUP BY nodeid;
The next view is citus_lock_waits. To see how
it works, we can generate a locking situation manually. First we will
set up a test table from the coordinator:
CREATE TABLE numbers AS
SELECT i, 0 AS j FROM generate_series(1,10) AS i;
SELECT create_distributed_table('numbers', 'i');
Then, using two sessions on the coordinator, we can run this sequence of statements:
-- Session 1 -- Session 2
------------------------------------- -------------------------------------
BEGIN;
UPDATE numbers SET j = 2 WHERE i = 1;
BEGIN;
UPDATE numbers SET j = 3 WHERE i = 1;
-- (this blocks)
The citus_lock_waits view shows the
situation.
SELECT * FROM citus_lock_waits;
-[ RECORD 1 ]-------------------------+--------------------------------------
waiting_gpid                          | 10000011981
blocking_gpid                         | 10000011979
blocked_statement                     | UPDATE numbers SET j = 3 WHERE i = 1;
current_statement_in_blocking_process | UPDATE numbers SET j = 2 WHERE i = 1;
waiting_nodeid                        | 1
blocking_nodeid                       | 1
In this example the queries originated on the coordinator, but the view can also list locks between queries originating on workers.
citus has other informational tables and views which are accessible on all nodes, not just the coordinator.
pg_dist_authinfo Table #
The pg_dist_authinfo table holds
authentication parameters used by citus
nodes to connect to one another.
| Name | Type | Description |
|---|---|---|
| nodeid | integer | Node ID from pg_dist_node, or 0, or -1 |
| rolename | name | Postgres Pro role |
| authinfo | text | Space-separated libpq connection parameters |
Upon beginning a connection, a node consults the table to see whether
a row with the destination nodeid and desired
rolename exists. If so, the node includes the
corresponding authinfo string in its
libpq connection. A common example is to
store a password, like 'password=abc123', but you
can review the
full list of possibilities.
The parameters in authinfo are space-separated, in
the form key=val. To write an empty value, or a
value containing spaces, surround it with single quotes, e.g.,
keyword='a value'. Single quotes and backslashes
within the value must be escaped with a backslash, i.e.,
\' and \\.
The nodeid column can also take the special values
0 and -1, which mean
all nodes or loopback connections,
respectively. If, for a given node, both specific and all-node rules
exist, the specific rule has precedence.
SELECT * FROM pg_dist_authinfo;
nodeid | rolename | authinfo
--------+----------+-----------------
123 | jdoe | password=abc123
(1 row)
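For instance, a single row with nodeid 0 can supply a password for connections to every node, while a node-specific row overrides it. This is a sketch; the role jdoe and passwords are placeholders:
```sql
-- Use the same password for connections to all nodes (nodeid = 0)
INSERT INTO pg_dist_authinfo (nodeid, rolename, authinfo)
VALUES (0, 'jdoe', 'password=abc123');

-- A node-specific row takes precedence over the all-nodes row
INSERT INTO pg_dist_authinfo (nodeid, rolename, authinfo)
VALUES (123, 'jdoe', 'password=specialpass');
```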
pg_dist_poolinfo Table #
If you want to use a connection pooler to connect to a node, you
can specify the pooler options using
pg_dist_poolinfo. This metadata table holds
the host, port and database name for citus
to use when connecting to a node through a pooler.
If pool information is present, citus will
try to use these values instead of setting up a direct connection. The
pg_dist_poolinfo information in this case
supersedes pg_dist_node.
| Name | Type | Description |
|---|---|---|
| nodeid | integer | Node ID from pg_dist_node |
| poolinfo | text |
Space-separated parameters: host,
port, or dbname
|
In some situations citus ignores the
settings in pg_dist_poolinfo. For instance
shard rebalancing
is not compatible with connection poolers such as
pgbouncer. In these scenarios
citus will use a direct connection.
-- How to connect to node 1 (as identified in pg_dist_node)
INSERT INTO pg_dist_poolinfo (nodeid, poolinfo)
VALUES (1, 'host=127.0.0.1 port=5433');
There are various configuration parameters that affect the behavior of citus. These include both standard Postgres Pro parameters and citus-specific parameters. To learn more about Postgres Pro configuration parameters, you can visit the Server Configuration chapter.
The rest of this reference discusses citus-specific
configuration parameters. These parameters can be set similarly
to Postgres Pro parameters by modifying
postgresql.conf or
by using the SET command.
As an example you can update a setting with:
ALTER DATABASE citus SET citus.multi_task_query_log_level = 'log';
citus.max_background_task_executors_per_node (integer)
#
Determines how many background tasks can be executed in parallel at
a given time. For instance, these tasks are for shard moves from/to
a node. When increasing the value of this parameter, you will often
also want to increase the value of the
citus.max_background_task_executors and
max_worker_processes
parameters. The minimum
value is 1, the maximum value is
128. The default value is 1.
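A sketch of raising this limit together with the related parameters (the values below are illustrative, not recommendations):
```sql
-- Allow more parallel background tasks per node
ALTER SYSTEM SET citus.max_background_task_executors_per_node = 2;
ALTER SYSTEM SET citus.max_background_task_executors = 8;
-- max_worker_processes takes effect only after a server restart
ALTER SYSTEM SET max_worker_processes = 16;
```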
citus.max_worker_nodes_tracked (integer)
#
citus tracks worker nodes' locations and
their membership in a shared hash table on the coordinator node.
This configuration parameter limits the size of the hash table and
consequently the number of worker nodes that can be tracked. The
default value is 2048. This parameter can only be
set at server start and is effective on the coordinator node.
citus.use_secondary_nodes (enum)
#
Sets the policy to use when choosing nodes for the
SELECT queries. If set to always,
the planner will query only nodes whose noderole
is marked as secondary in the
pg_dist_node table.
The allowed values are:
never — all reads happen on primary
nodes. This is the default value.
always — reads run against secondary
nodes instead and INSERT/UPDATE
statements are disabled.
citus.cluster_name (text)
#
Informs the coordinator node planner which cluster it coordinates.
Once cluster_name is set, the planner
will query worker nodes in that cluster alone.
citus.enable_version_checks (boolean)
#
Upgrading citus version requires a server
restart (to pick up the new shared library), as well as running the
ALTER EXTENSION UPDATE command. The failure to
execute both steps could potentially cause errors or crashes.
citus thus validates the version of the
code and that of the extension match, and errors out if they do not.
The default value is true, and the parameter is
effective on the coordinator. In rare cases, complex upgrade
processes may require setting this parameter to
false, thus disabling the check.
citus.log_distributed_deadlock_detection (boolean)
#
Specifies whether to log distributed deadlock detection related
processing in the server log. The default value is
false.
citus.distributed_deadlock_detection_factor (floating point)
#
Sets the time to wait before checking for distributed deadlocks. In
particular the time to wait will be this value multiplied by the
value set in the Postgres Pro
deadlock_timeout parameter.
The default value is 2. The value of
-1 disables distributed deadlock detection.
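For example, with the Postgres Pro default deadlock_timeout of 1 second, a factor of 2 means distributed deadlock checks start after 2 seconds of waiting. A sketch:
```sql
-- With deadlock_timeout = 1s (the default), a factor of 2
-- makes distributed deadlock checks run after 2 seconds
ALTER SYSTEM SET citus.distributed_deadlock_detection_factor = 2;
SELECT pg_reload_conf();
```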
citus.node_connection_timeout (integer)
#
Sets the maximum duration to wait for connection establishment, in
milliseconds. citus raises an error if
the timeout elapses before at least one worker connection is
established. This configuration parameter affects connections from
the coordinator to workers and workers to each other. The minimum
value is 10 milliseconds, the maximum value is
1 hour. The default value is 30
seconds.
The example below shows how to set this parameter:
-- Set to 60 seconds
ALTER DATABASE foo SET citus.node_connection_timeout = 60000;
citus.node_conninfo (text)
#Sets non-sensitive libpq connection parameters used for all inter-node connections.
The example below shows how to set this parameter:
-- key=value pairs separated by spaces.
-- For example, ssl options:
ALTER DATABASE foo SET citus.node_conninfo = 'sslrootcert=/path/to/citus.crt sslmode=verify-full';
citus supports only a specific subset of the allowed options, namely:
connect_timeout
gsslib (subject to the runtime presence of
optional Postgres Pro features)
host
keepalives
keepalives_count
keepalives_idle
keepalives_interval
krbsrvname (subject to the runtime presence of
optional Postgres Pro features)
sslcompression
sslcrl
sslmode (defaults to require)
sslnegotiation
sslrootcert
tcp_user_timeout
The citus.node_conninfo configuration parameter
takes effect only on newly opened connections. To force all
connections to use the new settings, make sure to reload the
Postgres Pro configuration:
SELECT pg_reload_conf();
citus.local_hostname (text)
#
citus nodes occasionally need to connect
to themselves for system operations. By default, they use the
localhost address to refer to themselves, but
this can cause problems. For instance, when a host requires
sslmode=verify-full for incoming connections,
adding localhost as an alternative hostname on
the SSL certificate is not always desirable or even feasible.
The citus.local_hostname configuration parameter
selects the hostname a node uses to connect to itself. The default
value is localhost.
The example below shows how to set this parameter:
ALTER SYSTEM SET citus.local_hostname TO 'mynode.example.com';
citus.show_shards_for_app_name_prefixes (text)
#By default, citus hides shards from the list of tables Postgres Pro gives to SQL clients. It does this because there are multiple shards per distributed table, and the shards can be distracting to the SQL client.
The citus.show_shards_for_app_name_prefixes
configuration parameter allows shards to be displayed for selected
clients that want to see them. The default value is
''.
The example below shows how to set this parameter:
-- Show shards to psql only (hide in other clients, like pgAdmin)
SET citus.show_shards_for_app_name_prefixes TO 'psql';

-- Also accepts a comma-separated list
SET citus.show_shards_for_app_name_prefixes TO 'psql,pg_dump';
citus.rebalancer_by_disk_size_base_cost (integer)
#
When using the by_disk_size rebalance strategy
each shard group will get this cost in bytes added to its actual
disk size. This is used to avoid creating a bad balance when there
is very little data in some of the shards. The assumption is that
even empty shards have some cost, because of parallelism and because
empty shard groups will likely grow in the future. The default value
is 100 MB.
citus.stat_statements_purge_interval (integer)
#
Sets the frequency at which the maintenance daemon removes records
from the
citus_stat_statements
table that are unmatched in the
pg_stat_statements
view. This configuration parameter sets the time interval between
purges in seconds, with the default value of 10.
The value of 0 disables the purges. This
parameter is effective on the coordinator and can be changed at
runtime.
The example below shows how to set this parameter:
SET citus.stat_statements_purge_interval TO 5;
citus.stat_statements_max (integer)
#
The maximum number of rows to store in the
citus_stat_statements
table. The default value is 50000 and may be
changed to any value in the range of 1000 -
10000000. Note that each row requires 140 bytes
of storage, so setting citus.stat_statements_max
to its maximum value of 10M would consume 1.4GB of memory.
Changing this configuration parameter will not take effect until Postgres Pro is restarted.
citus.stat_statements_track (enum)
#
Recording statistics for
citus_stat_statements
requires extra CPU resources. When the database is experiencing load,
the administrator may wish to disable statement tracking. The
citus.stat_statements_track configuration parameter
can turn tracking on and off. The allowed values are:
all — track all statements.
none — disable tracking. This is the
default value.
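For example, tracking can be toggled at runtime:
```sql
-- Enable statement tracking
SET citus.stat_statements_track = 'all';
-- Disable it again under heavy load
SET citus.stat_statements_track = 'none';
```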
citus.stat_tenants_untracked_sample_rate (floating point)
#
Sampling rate for new tenants in the
citus_stat_tenants
view. The rate can range between 0.0 and
1.0. The default value is 1.0,
meaning 100% of untracked tenant queries are sampled. Setting it to
a lower value means that already tracked tenants have 100% of their
queries sampled, while currently untracked tenants are sampled
only at the provided rate.
citus.shard_count (integer)
#
Sets the shard count for hash-partitioned tables and defaults to
32. This value is used by the
create_distributed_table
function when creating hash-partitioned tables. This parameter can
be set at runtime and is effective on the coordinator.
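The parameter takes effect for tables distributed after it is set. A sketch, using a hypothetical events table:
```sql
-- Use 64 shards for subsequently distributed tables
SET citus.shard_count = 64;

-- Hypothetical table, distributed by tenant_id
CREATE TABLE events (tenant_id int, payload jsonb);
SELECT create_distributed_table('events', 'tenant_id');
```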
citus.metadata_sync_mode (enum)
#This configuration parameter requires superuser access to change.
This configuration parameter determines how citus synchronizes metadata across nodes. By default, citus updates all metadata in a single transaction for consistency. However, Postgres Pro has a hard memory limit related to cache invalidations, and citus metadata syncing for a large cluster can fail from memory exhaustion.
As a workaround, citus provides an optional nontransactional sync mode, which uses a series of smaller transactions. While this mode works within limited memory, transactions may fail and leave metadata in an inconsistent state. To help with this potential problem, nontransactional metadata sync is designed as an idempotent action, so you can re-run it repeatedly if needed.
The allowed values for this configuration parameter are as follows:
transactional — synchronize all
metadata in a single transaction. This is the default value.
nontransactional — synchronize metadata
using multiple small transactions.
The example below shows how to set this parameter:
-- To add a new node and sync nontransactionally
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);

-- To manually (re)sync
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT start_metadata_sync_to_all_nodes();
We advise trying transactional mode first and switching to nontransactional only if a memory failure occurs.
citus.local_table_join_policy (enum)
#Determines how citus moves data when doing a join between local and distributed tables. Customizing the join policy can help reduce the amount of data sent between worker nodes.
citus will send either the local or distributed tables to nodes as necessary to support the join. Copying table data is referred to as a “conversion”. If a local table is converted, then it will be sent to any workers that need its data to perform the join. If a distributed table is converted, then it will be collected in the coordinator to support the join. The citus planner will send only the necessary rows when doing a conversion.
There are four modes available to express conversion preference:
auto — citus
will convert either all local or all distributed tables to
support local and distributed table joins.
citus decides which to convert using
a heuristic. It will convert distributed tables if they are
joined using a constant filter on a unique index (such as a
primary key). This ensures less data gets moved between workers.
This is the default value.
never — citus
will not allow joins between local and distributed tables.
prefer-local — citus
will prefer converting local tables to support local and
distributed table joins.
prefer-distributed —
citus will prefer converting distributed
tables to support local and distributed table joins. If the
distributed tables are huge, using this option might result in
moving lots of data between workers.
For example, assume citus_table is a
distributed table distributed by the column
x, and that postgres_table
is a local table:
CREATE TABLE citus_table(x int primary key, y int);
SELECT create_distributed_table('citus_table', 'x');
CREATE TABLE postgres_table(x int, y int);
-- Even though the join is on primary key, there isn't a constant filter
-- hence postgres_table will be sent to worker nodes to support the join
SELECT * FROM citus_table JOIN postgres_table USING (x);
-- There is a constant filter on a primary key, hence the filtered row
-- from the distributed table will be pulled to coordinator to support the join
SELECT * FROM citus_table JOIN postgres_table USING (x) WHERE citus_table.x = 10;
SET citus.local_table_join_policy to 'prefer-distributed';
-- Since we prefer distributed tables, citus_table will be pulled to coordinator
-- to support the join. Note that citus_table can be huge
SELECT * FROM citus_table JOIN postgres_table USING (x);
SET citus.local_table_join_policy to 'prefer-local';
-- Even though there is a constant filter on primary key for citus_table
-- postgres_table will be sent to necessary workers because we are using 'prefer-local'
SELECT * FROM citus_table JOIN postgres_table USING (x) WHERE citus_table.x = 10;
citus.limit_clause_row_fetch_count (integer)
#
Sets the number of rows to fetch per task for limit clause
optimization. In some cases, SELECT queries with
LIMIT clauses may need to fetch all rows from
each task to generate results. In those cases, and where an
approximation would produce meaningful results, this configuration
parameter sets the number of rows to fetch from each shard. Limit
approximations are disabled by default and this parameter is set to
-1. This value can be set at runtime and is
effective on the coordinator.
citus.count_distinct_error_rate (floating point)
#
citus can calculate
count(distinct) approximates using the
Postgres Pro hll
extension. This configuration parameter sets the desired error rate
when calculating count(distinct):
0.0, which is the default value, disables
approximations for count(distinct), and
1.0, which provides no guarantees about the
accuracy of results. We recommend setting this parameter to
0.005 for best results. This value can be set at
runtime and is effective on the coordinator.
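A sketch of enabling the approximation, assuming the hll extension is installed and a hypothetical distributed table events with a user_id column:
```sql
-- Allow roughly 0.5% error on count(distinct) in exchange for speed
SET citus.count_distinct_error_rate = 0.005;

-- Hypothetical query, now computed with HyperLogLog sketches
SELECT count(DISTINCT user_id) FROM events;
```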
citus.task_assignment_policy (enum)
#This configuration parameter is applicable for queries against reference tables.
Sets the policy to use when assigning tasks to workers. The coordinator assigns tasks to workers based on shard locations. This configuration parameter specifies the policy to use when making these assignments. Currently, there are three possible task assignment policies, which can be used:
greedy — aims at evenly distributing
tasks across workers. This is the default value.
round-robin — assigns tasks to workers
in a round-robin fashion alternating between
different replicas. This enables much better cluster utilization
when the shard count for a table is low compared to the number
of workers.
first-replica — assigns tasks on the
basis of the insertion order of placements (replicas) for the
shards. In other words, the fragment query for a shard is simply
assigned to the worker which has the first replica of that shard.
This method allows you to have strong guarantees about which
shards will be used on which nodes (i.e. stronger memory
residency guarantees).
This configuration parameter can be set at runtime and is effective on the coordinator.
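For example, round-robin assignment can be enabled for a session to spread reference-table queries across replicas:
```sql
-- Alternate between replicas for subsequent queries
SET citus.task_assignment_policy TO 'round-robin';
```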
citus.enable_non_colocated_router_query_pushdown (boolean)
#Enables router planner for the queries that reference non-colocated distributed tables.
Normally, router planner is only enabled for the queries that
reference co-located distributed tables because it is not guaranteed
to have the target shards always on the same node, e.g., after
rebalancing the shards. For this reason, while enabling this flag
allows some degree of optimization for the queries that reference
non-colocated distributed tables, it is not guaranteed that the same
query will work after rebalancing the shards or altering the shard
count of one of those distributed tables. The default value is
off.
citus.max_intermediate_result_size (integer)
#
The maximum size in KB of intermediate results for CTEs that are
unable to be pushed down to worker nodes for execution, and for
complex subqueries. The default is 1 GB and a
value of -1 means no limit. Queries exceeding the
limit will be canceled and produce an error message.
citus.enable_ddl_propagation (boolean)
#
Specifies whether to automatically propagate DDL changes from the
coordinator to all workers. The default value is
true. Because some schema changes require an
access exclusive lock on tables and because the automatic propagation
applies to all workers sequentially it can make a
citus cluster temporarily less responsive.
You may choose to disable this setting and propagate changes manually.
For a list of DDL propagation support, see the Modifying Tables section.
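When propagation is disabled, equivalent DDL must be applied on the workers separately. The sketch below uses a hypothetical table and index name; depending on your citus version, shard-level objects on workers may also need matching changes:
```sql
-- Disable automatic propagation for this session
SET citus.enable_ddl_propagation TO off;

-- This CREATE INDEX now affects only the coordinator;
-- equivalent commands must be run on the workers manually
CREATE INDEX my_idx ON my_dist_table (created_at);
```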
citus.enable_local_reference_table_foreign_keys (boolean)
#
Allows foreign keys to be created between reference and local tables.
For the feature to work, the coordinator node must be registered with
itself, using the
citus_add_node
function. The default value is true.
Note that foreign keys between reference tables and local tables come at a slight cost. When you create the foreign key, citus must add the plain table to its metadata and track it in the pg_dist_partition table. Local tables that are added to metadata inherit the same limitations as reference tables (see the Creating and Modifying Distributed Objects (DDL) and SQL Support and Workarounds sections).
If you drop the foreign keys, citus will automatically remove such local tables from metadata, which eliminates such limitations on those tables.
citus.enable_change_data_capture (boolean)
#
Causes citus to alter the
wal2json and
pgoutput logical decoders to work with
distributed tables. Specifically, it rewrites the names of shards
(e.g. foo_102027) in decoder output to the base
names of the distributed tables (e.g. foo). It
also avoids publishing duplicate events during tenant isolation and
shard split/move/rebalance operations. The default value is
false.
citus.enable_schema_based_sharding (boolean)
#
When this parameter is set to on, all created schemas
will be distributed by default. Distributed schemas are automatically
associated with individual co-location groups such that the tables
created in those schemas will be automatically converted to
co-located distributed tables without a shard key. This parameter
can be modified for individual sessions.
To learn how to use this configuration parameter, see the Microservices section.
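For instance, a sketch with a hypothetical tenant schema:

```sql
-- Distribute all schemas created in this session
SET citus.enable_schema_based_sharding TO on;

CREATE SCHEMA tenant_a;

-- Becomes a co-located distributed table without a shard key,
-- placed in tenant_a's own co-location group
CREATE TABLE tenant_a.orders (
  order_id bigint PRIMARY KEY,
  total    numeric
);
```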
citus.all_modifications_commutative (boolean)
#
citus enforces commutativity rules and
acquires appropriate locks for modify operations in order to
guarantee correctness of behavior. For example, it assumes that an
INSERT statement commutes with another
INSERT statement, but not with an
UPDATE or DELETE statement.
Similarly, it assumes that an UPDATE or
DELETE statement does not commute with another
UPDATE or DELETE statement.
This means that UPDATE and DELETE
statements require citus to acquire
stronger locks.
If you have UPDATE statements that are commutative
with your INSERTs or other UPDATEs,
then you can relax these commutativity assumptions by setting this
parameter to true. When this parameter is set to
true, all commands are considered commutative and
claim a shared lock, which can improve overall throughput. This
parameter can be set at runtime and is effective on the coordinator.
citus.multi_task_query_log_level (enum)
#
Sets a log-level for any query which generates more than one task
(i.e. which hits more than one shard). This is useful during a
multi-tenant application migration, as you can choose to error or
warn for such queries, to find them and add the
tenant_id filter to them. This parameter can be
set at runtime and is effective on the coordinator. The default
value for this parameter is off. The following
values are supported:
off — does not log queries that
generate multiple tasks (i.e. span multiple shards).
debug — logs statement at the
DEBUG severity level.
log — logs statement at the
LOG severity level. The log line will include
the SQL query that was run.
notice — logs statement at the
NOTICE severity level.
warning — logs statement at the
WARNING severity level.
error — logs statement at the
ERROR severity level.
Note that it may be useful to use error during
development testing and a lower log level like log
during actual production deployment. Choosing log
will cause multi-task queries to appear in the database logs with
the query itself shown after STATEMENT.
LOG:  multi-task query about to be executed
HINT:  Queries are split to multiple tasks if they have to be split into several queries on the workers.
STATEMENT:  SELECT * FROM foo;
citus.propagate_set_commands (enum)
#
Determines which SET commands are propagated from
the coordinator to workers. The default value is
none. The following values are supported:
none — no SET
commands are propagated.
local — only SET LOCAL
commands are propagated.
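As a sketch of the local mode (table name hypothetical), a SET LOCAL inside a transaction is forwarded to the workers that participate in it, so the setting also affects query fragments running on the shards:

```sql
SET citus.propagate_set_commands TO 'local';

BEGIN;
-- Propagated to participating workers for the
-- duration of this transaction
SET LOCAL enable_hashagg TO off;
SELECT status, count(*) FROM orders GROUP BY status;
COMMIT;
```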
citus.enable_repartition_joins (boolean)
#
Ordinarily, attempting to perform
repartition joins
with the adaptive executor will fail with an error message. However,
setting this configuration parameter to true
allows citus to perform the join. The
default value is false.
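For example, assuming two hypothetical tables distributed on different columns (orders on customer_id, products on product_id), a join on a non-distribution column requires repartitioning:

```sql
SET citus.enable_repartition_joins TO on;

-- Joins on o.product_id, which is not the distribution
-- column of orders, so citus repartitions the data
SELECT o.order_id, p.name
FROM orders o
JOIN products p ON o.product_id = p.product_id;
```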
citus.enable_repartitioned_insert_select (boolean)
#
By default, an INSERT INTO … SELECT statement
that cannot be pushed down will attempt to repartition rows from the
SELECT statement and transfer them between
workers for insertion. However, if the target table has too many
shards then repartitioning will probably not perform well. The
overhead of processing the shard intervals when determining how to
partition the results is too great. Repartitioning can be disabled
manually by setting this configuration parameter to
false.
citus.enable_binary_protocol (boolean)
#
Setting this parameter to true instructs the
coordinator node to use Postgres Pro binary
serialization format (when applicable) to transfer data with workers.
Some column types do not support binary serialization.
Enabling this parameter is mostly useful when the workers must return
large amounts of data. Examples are when a lot of rows are requested,
the rows have many columns, or they use big types such as
hll type from the hll
extension.
The default value is true. When set to
false, all results are encoded and transferred in
text format.
citus.max_shared_pool_size (integer)
#
Specifies the maximum number of connections that the coordinator node, across all simultaneous sessions, is allowed to make per worker node. Postgres Pro must allocate fixed resources for every connection, and this configuration parameter helps ease connection pressure on workers.
Without connection throttling, every multi-shard query creates
connections on each worker proportional to the number of shards it
accesses (in particular, up to #shards/#workers).
Running dozens of multi-shard queries at once can easily hit worker
nodes' max_connections
limit, causing queries to fail.
By default, the value is automatically set equal to the coordinator's
own max_connections, which is not guaranteed to
match that of the workers (see the note below). The value
-1 disables throttling.
There are certain operations that do not obey this parameter, most
importantly repartition joins. That is why it can be prudent to
set max_connections on the workers a bit higher
than max_connections on the coordinator.
This gives extra room for the connections required by repartition
queries on the workers.
citus.max_adaptive_executor_pool_size (integer)
#
Whereas
citus.max_shared_pool_size
limits worker connections across all sessions, the
citus.max_adaptive_executor_pool_size limits
worker connections from just the current
session. This parameter is useful for:
Preventing a single backend from getting all the worker resources.
Providing priority management: designate low priority sessions
with low citus.max_adaptive_executor_pool_size
value and high priority sessions with higher values.
The default value is 16.
citus.executor_slow_start_interval (integer)
#
Time to wait between opening connections to the same worker node, in milliseconds.
When the individual tasks of a multi-shard query take very little time, they can often be finished over a single (often already cached) connection. To avoid redundantly opening additional connections, the executor waits between connection attempts for the configured number of milliseconds. At the end of the interval, it increases the number of connections it is allowed to open next time.
For long queries (those taking >500 ms), slow
start might add latency, but for short queries it is faster. The
default value is 10 ms.
citus.max_cached_conns_per_worker (integer)
#
Each backend opens connections to the workers to query the shards. At the end of the transaction, the configured number of connections is kept open to speed up subsequent commands. Increasing this value will reduce the latency of multi-shard queries but will also increase overhead on the workers.
The default value is 1. A larger value such as
2 might be helpful for clusters that use a small
number of concurrent sessions, but it's not wise to go much further
(e.g. 16 would be too high).
citus.force_max_query_parallelization (boolean)
#
Simulates the deprecated and now nonexistent real-time executor. This is used to open as many connections as possible to maximize query parallelization.
When this configuration parameter is enabled, citus
will force the adaptive executor to use as many connections as
possible while executing a parallel distributed query. If not enabled,
the executor might choose to use fewer connections to optimize overall
query execution throughput. Internally, setting this parameter to
true will end up using one connection per task.
The default value is false.
One place where this is useful is in a transaction whose first query is lightweight and requires few connections, while a subsequent query would benefit from more connections. citus decides how many connections to use in a transaction based on the first statement, which can throttle other queries unless we use the configuration parameter to provide a hint.
The example below shows how to set this parameter:
BEGIN;
-- Add this hint
SET citus.force_max_query_parallelization TO ON;

-- A lightweight query that doesn't require many connections
SELECT count(*) FROM table WHERE filter = x;

-- A query that benefits from more connections, and can obtain
-- them since we forced max parallelization above
SELECT ... very .. complex .. SQL;
COMMIT;
citus.explain_all_tasks (boolean)
#
By default, citus shows the output of a
single arbitrary task when running the
EXPLAIN
command on a distributed query. In most cases, the
EXPLAIN output will be similar across tasks.
Occasionally, some of the tasks will be planned differently or have
much higher execution times. In those cases, it can be useful to
enable this parameter, after which the EXPLAIN
output will include all tasks. This may cause the
EXPLAIN to take longer.
citus.explain_analyze_sort_method (enum)
#
Determines the sort method of the tasks in the output of
EXPLAIN ANALYZE. The following values are
supported:
execution-time — sort by execution time.
taskId — sort by task ID.
In this section, we discuss how you can add or remove nodes from your citus cluster and how you can deal with node failures.
To make moving shards across nodes or re-replicating shards on failed nodes easier, citus supports fully online shard rebalancing. We discuss briefly the functions provided by the shard rebalancer when relevant in the sections below. You can learn more about these functions, their arguments, and usage in the Cluster Management And Repair Functions section.
This section explores configuration settings for running a cluster in production.
Choosing the shard count for each distributed table is a balance between the flexibility of having more shards and the overhead for query planning and execution across them. If you decide to change the shard count of a table after distributing, you can use the alter_distributed_table function.
The optimal choice varies depending on your access patterns for the data. For instance, in the multi-tenant SaaS database use case we recommend choosing between 32 and 128 shards. For smaller workloads, say <100GB, you could start with 32 shards and for larger workloads you could choose 64 or 128 shards. This means that you have the leeway to scale from 32 to 128 worker machines.
In the real-time analytics use case, shard count should be related to the total number of cores on the workers. To ensure maximum parallelism, you should create enough shards on each node such that there is at least one shard per CPU core. We typically recommend creating a high number of initial shards, e.g. 2x or 4x the number of current CPU cores. This allows for future scaling if you add more workers and CPU cores.
However, keep in mind that for each query citus
opens one database connection per shard, and these connections are
limited. Be careful to keep the shard count small enough that
distributed queries will not often have to wait for a connection. Put
another way, the connections needed,
(max concurrent queries * shard count), should
generally not exceed the total connections possible in the system,
(number of workers * max_connections per worker).
The size of a cluster, in terms of number of nodes and their hardware capacity, is easy to change. However, you still need to choose an initial size for a new cluster. Here are some tips for a reasonable initial cluster size.
For those migrating to citus from an existing single-node database instance, we recommend choosing a cluster where the number of worker cores and RAM in total equals that of the original instance. In such scenarios we have seen 2-3x performance improvements because sharding improves resource utilization, allowing smaller indices, etc.
The coordinator node needs less memory than workers, so you can choose a compute-optimized machine for running the coordinator. The number of cores required depends on your existing workload (write/read throughput).
Total cores: when working data fits in RAM, you can expect a linear performance improvement on citus proportional to the number of worker cores. To determine the right number of cores for your needs, consider the current latency for queries in your single-node database and the required latency in citus. Divide current latency by desired latency, and round the result.
Worker RAM: the best case would be
providing enough memory that the majority of the working set fits in
memory. The type of queries your application uses affects memory
requirements. You can run EXPLAIN ANALYZE on a query
to determine how much memory it requires.
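For example, assuming a hypothetical orders table, the per-node details of the output show whether operations fit in work_mem or spill to disk:

```sql
-- Look for lines such as "Sort Method: quicksort  Memory: ...kB"
-- or "Sort Method: external merge  Disk: ...kB" in the output
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(total)
FROM orders
GROUP BY customer_id;
```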
citus's logical-sharding-based architecture allows you to scale out your cluster without any downtime. This section describes how you can add more nodes to your citus cluster in order to improve query performance and scalability.
citus stores all the data for distributed tables on the worker nodes. Hence, if you want to scale out your cluster by adding more computing power, you can do so by adding a worker.
To add a new node to the cluster, you first need to add the DNS name or IP address of that node and port (on which Postgres Pro is running) in the pg_dist_node catalog table. You can do so using the citus_add_node function. Example:
SELECT * FROM citus_add_node('node-name', 5432);
The new node is available for shards of new distributed tables. Existing shards will stay where they are unless redistributed, so adding a new worker may not help performance without further steps.
Also, new nodes synchronize citus metadata upon creation. By default, the sync happens inside a single transaction for consistency. However, in a big cluster with large amounts of metadata, the transaction can run out of memory and fail. If you encounter this situation, you can choose a non-transactional metadata sync mode with the citus.metadata_sync_mode configuration parameter.
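As a sketch (hostname hypothetical), the non-transactional mode can be selected for just the session that adds the node:

```sql
-- Sync metadata outside a single transaction to avoid running
-- out of memory on clusters with very large metadata
SET citus.metadata_sync_mode TO 'nontransactional';
SELECT * FROM citus_add_node('new-worker', 5432);
RESET citus.metadata_sync_mode;
```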
If you want to move existing shards to a newly added worker, citus provides the citus_rebalance_start function to make it easier. This function will distribute shards evenly among the workers.
The function is configurable to rebalance shards according to a number of strategies, to best match your database workload. See the function reference to learn which strategy to choose. Here is an example of rebalancing shards using the default strategy:
SELECT citus_rebalance_start();
Many products like multi-tenant SaaS applications cannot tolerate downtime, and rebalancing is able to honor this requirement. This means reads and writes from the application can continue with minimal interruption while data is being moved.
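The rebalance runs as a background operation, so it can be observed and, if necessary, stopped while the application keeps running:

```sql
-- Check the progress of a running rebalance
SELECT * FROM citus_rebalance_status();

-- Stop the rebalance; moves already completed stay in place
SELECT citus_rebalance_stop();
```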
This operation carries out multiple shard moves sequentially by default. In some cases you may prefer to rebalance faster at the expense of using more resources, such as network bandwidth. In those situations, you can configure a rebalance operation to perform a number of shard moves in parallel.
The
citus.max_background_task_executors_per_node
configuration parameter allows tasks such as shard rebalancing to
operate in parallel. You can increase it from its default value of
1 as desired to boost parallelism.
ALTER SYSTEM SET citus.max_background_task_executors_per_node = 2;
SELECT pg_reload_conf();
SELECT citus_rebalance_start();
What are the typical use cases?
Scaling out faster when adding new nodes to the cluster.
Rebalancing the cluster faster to even out the utilization of nodes.
Corner Cases and Gotchas
The citus.max_background_task_executors_per_node configuration parameter limits the number of parallel task executors in general. Also, shards in the same colocation group will always move sequentially so parallelism may be limited by the number of colocation groups.
citus shard rebalancing uses Postgres Pro logical replication to move data from the old shard (called the “publisher” in replication terms) to the new (the “subscriber”). Logical replication allows application reads and writes to continue uninterrupted while copying shard data. citus puts a brief write-lock on a shard only during the time it takes to update metadata to promote the subscriber shard as active.
As the Postgres Pro documentation explains, the source needs a replica identity configured:
A published table must have a “replica identity”
configured in order to be able to replicate UPDATE
and DELETE operations, so that appropriate rows to
update or delete can be identified on the subscriber side. By default,
this is the primary key, if there is one. Another unique index (with
certain additional requirements) can also be set to be the replica
identity.
In other words, if your distributed table has a primary key defined then it is ready for shard rebalancing with no extra work. However, if it does not have a primary key or an explicitly defined replica identity, then attempting to rebalance it will cause an error. Here is how to fix it.
First, does the table have a unique index?
If the table to be replicated already has a unique index, which includes the distribution column, then choose that index as a replica identity:
-- Supposing my_table has unique index my_table_idx
-- which includes distribution column
ALTER TABLE my_table REPLICA IDENTITY
USING INDEX my_table_idx;
While REPLICA IDENTITY USING INDEX is fine, we
recommend against adding
REPLICA IDENTITY FULL to a table. With that setting,
each UPDATE or
DELETE performs a full table scan on the subscriber
side to find the matching tuple. In our testing this resulted in
worse performance than the other fixes described here.
Otherwise, can you add a primary key?
Add a primary key to the table. If the desired key happens to be the distribution column, then it's quite easy, just add the constraint. Otherwise, a primary key with a non-distribution column must be composite and contain the distribution column too.
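For instance, assuming a hypothetical events table distributed on tenant_id with no primary key, a composite key that includes the distribution column makes the table eligible for shard rebalancing:

```sql
-- The primary key must contain the distribution column (tenant_id)
ALTER TABLE events ADD PRIMARY KEY (tenant_id, event_id);
```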
The citus coordinator only stores metadata about the table shards and does not store any data. This means that all the computation is pushed down to the workers, and the coordinator performs only final aggregations on the workers' results. Therefore, it is not very likely that the coordinator becomes a bottleneck for read performance. It is also easy to scale up the coordinator by moving it to a more powerful machine.
However, in some write-heavy use cases where the coordinator becomes a performance bottleneck, you can add another node as described below and load balance the client connections.
SELECT * FROM citus_add_node(second_coordinator_hostname, second_coordinator_port);
SELECT * FROM citus_set_node_property(second_coordinator_hostname, second_coordinator_port,
                                      'shouldhaveshards', false);
DDL queries can only be run through the first coordinator node.
In this subsection, we discuss how you can deal with node failures without incurring any downtime on your citus cluster.
citus uses Postgres Pro streaming replication, allowing it to tolerate worker-node failures. This option replicates entire worker nodes by continuously streaming their WAL records to a standby. You can configure streaming replication on-premise yourself by consulting the Streaming Replication section.
The citus coordinator maintains metadata tables to track all of the cluster nodes and the locations of the database shards on those nodes. The metadata tables are small (typically a few MBs in size) and do not change very often. This means that they can be replicated and quickly restored if the node ever experiences a failure. There are several options on how users can deal with coordinator failures.
Use Postgres Pro streaming replication. You can use Postgres Pro streaming replication feature to create a hot standby of the coordinator. Then, if the primary coordinator node fails, the standby can be promoted to the primary automatically to serve queries to your cluster. For details on setting this up, please refer to the Streaming Replication section.
Use backup tools. Since the metadata tables are small, users can use EBS volumes, or Postgres Pro backup tools to backup the metadata. Then, they can easily copy over that metadata to new nodes to resume operation.
citus places table rows into worker shards based on the hashed value of the rows' distribution column. Multiple distribution column values often fall into the same shard. In the citus multi-tenant use case this means that tenants often share shards.
However, sharing shards can cause resource contention when tenants differ drastically in size. This is a common situation for systems with a large number of tenants: we have observed that the size of tenant data tends to follow a Zipfian distribution as the number of tenants increases. This means there are a few very large tenants and many smaller ones. To improve resource allocation and guarantee tenant QoS, it is worthwhile to move large tenants to dedicated nodes.
citus provides the tools to isolate a tenant on a specific node. This happens in two phases: firstly, isolating the tenant's data to a new dedicated shard, then moving the shard to the desired node. To understand the process, it helps to know precisely how rows of data are assigned to shards.
Every shard is marked in citus metadata with the range of hashed values it contains (more info in the reference for the pg_dist_shard table). The isolate_tenant_to_new_shard function moves a tenant into a dedicated shard in three steps:
Creates a new shard for table_name, which
includes rows whose distribution column has value
tenant_id and excludes all other rows.
Moves the relevant rows from their current shard to the new shard.
Splits the old shard into two with hash ranges that abut the excision above and below.
Furthermore, the function takes the CASCADE option,
which isolates the tenant rows of not just table_name
but of all tables co-located
with it. Here is an example:
-- This query creates an isolated shard for the given tenant_id and
-- returns the new shard id.
-- General form:
SELECT isolate_tenant_to_new_shard('table_name', tenant_id);
-- Specific example:
SELECT isolate_tenant_to_new_shard('lineitem', 135);
-- If the given table has co-located tables, the query above errors out and
-- advises to use the CASCADE option
SELECT isolate_tenant_to_new_shard('lineitem', 135, 'CASCADE');
Output:
┌─────────────────────────────┐
│ isolate_tenant_to_new_shard │
├─────────────────────────────┤
│                      102240 │
└─────────────────────────────┘
The new shard(s) are created on the same node as the shard(s) from which the tenant was removed. For true hardware isolation they can be moved to a separate node in the citus cluster. As mentioned, the isolate_tenant_to_new_shard function returns the newly created shard ID, and this ID can be used to move the shard:
In schema-based sharding, isolating a tenant is not required, as by definition each tenant already resides in its own schema. The only thing needed is to obtain a shard identifier for a schema in order to perform a move.
First find the colocation ID of the schema you want to move.
SELECT * FROM citus_schemas;
 schema_name  | colocation_id | schema_size | schema_owner
--------------+---------------+-------------+--------------
 user_service |             1 | 0 bytes     | user_service
 time_service |             2 | 0 bytes     | time_service
 ping_service |             3 | 0 bytes     | ping_service
 a            |             4 | 128 kB      | citus
 b            |             5 | 32 kB       | citus
 with_data    |            11 | 6408 kB     | citus
(6 rows)
The next step is to query citus_shards. We will use
co-location identifier 11 from the output above:
SELECT * FROM citus_shards where colocation_id = 11;
   table_name    | shardid |       shard_name       | citus_table_type | colocation_id | nodename  | nodeport | shard_size
-----------------+---------+------------------------+------------------+---------------+-----------+----------+------------
 with_data.test  |  102180 | with_data.test_102180  | schema           |            11 | localhost |     9702 |     647168
 with_data.test2 |  102183 | with_data.test2_102183 | schema           |            11 | localhost |     9702 |    5914624
(2 rows)
You can pick any shardid from the output, as the move
will also propagate to all co-located tables, which in the case of
schema-based sharding means moving all tables within the schema.
Knowing the shard ID that denotes the tenant, you can execute the move:
-- Find the node currently holding the new shard
SELECT nodename, nodeport
FROM citus_shards
WHERE shardid = 102240;

-- List the available worker nodes that could hold the shard
SELECT * FROM master_get_active_worker_nodes();

-- Move the shard to your choice of worker
-- (it will also move any shards created with the CASCADE option)
SELECT citus_move_shard_placement(
  102240,
  'source_host', source_port,
  'dest_host', dest_port);
Note that the citus_move_shard_placement function will also move any shards which are co-located with the specified one, to preserve their co-location.
When administering a citus cluster it is
useful to know what queries users are running, which nodes are involved,
and which execution method citus is using for
each query. The extension records query statistics in a metadata view
called citus_stat_statements,
named analogously to Postgres Pro
pg_stat_statements.
Whereas pg_stat_statements stores info about
query duration and I/O, citus_stat_statements
stores info about citus execution methods and
shard partition keys (when applicable).
citus requires the
pg_stat_statements extension to be installed
in order to track query statistics. On a self-hosted
Postgres Pro instance, load the extension in
postgresql.conf via
shared_preload_libraries, then create the extension
in SQL:
CREATE EXTENSION pg_stat_statements;
Let's see how this works. Assume we have a table called
foo that is hash-distributed by its
id column.
-- Create and populate distributed table
CREATE TABLE foo ( id int );
SELECT create_distributed_table('foo', 'id');
INSERT INTO foo SELECT generate_series(1,100);
We will run two more queries and
citus_stat_statements will show how
citus chooses to execute them.
-- Counting all rows executes on all nodes, and sums
-- the results on the coordinator
SELECT count(*) FROM foo;

-- Specifying a row by the distribution column routes
-- execution to an individual node
SELECT * FROM foo WHERE id = 42;
To find how these queries were executed, ask the stats table:
SELECT * FROM citus_stat_statements;
Results:
-[ RECORD 1 ]-+----------------------------------------------
queryid       | -6844578505338488014
userid        | 10
dbid          | 13340
query         | SELECT count(*) FROM foo;
executor      | adaptive
partition_key |
calls         | 1
-[ RECORD 2 ]-+----------------------------------------------
queryid       | 185453597994293667
userid        | 10
dbid          | 13340
query         | INSERT INTO foo SELECT generate_series($1,$2)
executor      | insert-select
partition_key |
calls         | 1
-[ RECORD 3 ]-+----------------------------------------------
queryid       | 1301170733886649828
userid        | 10
dbid          | 13340
query         | SELECT * FROM foo WHERE id = $1
executor      | adaptive
partition_key | 42
calls         | 1
We can see that citus uses the adaptive
executor most commonly to run queries. This executor fragments the query
into constituent queries to run on relevant nodes and combines the
results on the coordinator node. In the case of the second query
(filtering by the distribution column id = $1),
citus determined that it needed the data from
just one node. Lastly, we can see that the INSERT INTO foo SELECT…
statement ran with the insert-select executor, which
provides the flexibility to run these kinds of queries.
So far the information in this view does not give us anything we could
not already learn by running the EXPLAIN command for
a given query. However, in addition to getting information about
individual queries, the
citus_stat_statements
view allows us to answer questions such as
“what percentage of queries in the cluster are scoped to a single tenant?”
SELECT sum(calls),
partition_key IS NOT NULL AS single_tenant
FROM citus_stat_statements
GROUP BY 2;
 sum | single_tenant
-----+---------------
   2 | f
   1 | t
In a multi-tenant database, for instance, we would expect the vast majority of queries to be single tenant. Seeing too many multi-tenant queries may indicate that queries do not have the proper filters to match a tenant, and are using unnecessary resources.
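The view also makes it easy to see which partition keys account for the most routed queries, for example:

```sql
-- Tenants issuing the most single-tenant queries
SELECT partition_key AS tenant, sum(calls) AS total_calls
FROM citus_stat_statements
WHERE partition_key IS NOT NULL
GROUP BY partition_key
ORDER BY total_calls DESC
LIMIT 10;
```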
To investigate which tenants in particular are most active, you can use the citus_stat_tenants view.
The pg_stat_statements
view limits the number of statements it tracks and the duration of its
records. Because the
citus_stat_statements
table tracks a strict subset of the queries in
pg_stat_statements, a choice of equal limits
for the two views would cause a mismatch in their data retention.
Mismatched records can cause joins between the views to behave
unpredictably.
There are three ways to help synchronize the views, and all three can be used together.
Have the maintenance daemon periodically sync the
citus and
Postgres Pro statistics. The
citus.stat_statements_purge_interval
configuration parameter sets time in seconds for the sync. A value
of 0 disables periodic syncs.
Adjust the number of entries in
citus_stat_statements. The
citus.stat_statements_max
configuration parameter removes old entries when new ones cross the
threshold. The default value is 50000, and the
highest allowable value is 10000000. Note that
each entry costs about 140 bytes in shared memory so set the value
wisely.
Increase pg_stat_statements.max.
Its default value is 5000 and could be increased
to 10000, 20000 or
even 50000 without much overhead. This is most
beneficial when there is more local (i.e. coordinator) query workload.
Changing pg_stat_statements.max or
citus.stat_statements_max requires restarting
the Postgres Pro service. Changing
citus.stat_statements_purge_interval, on the
other hand, will come into effect with a call to
the pg_reload_conf
function.
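Putting the three settings together, a sketch of a combined configuration (values illustrative):

```sql
-- Sync the citus and Postgres Pro statistics every 10 seconds
ALTER SYSTEM SET citus.stat_statements_purge_interval = 10;

-- The two *max settings only take effect after a service restart
ALTER SYSTEM SET citus.stat_statements_max = 50000;
ALTER SYSTEM SET pg_stat_statements.max = 10000;

-- Applies the purge interval immediately
SELECT pg_reload_conf();
```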
Long running queries can hold locks, queue up WAL, or just consume a lot of system resources, so in a production environment it is good to prevent them from running too long. You can set the statement_timeout parameter on the coordinator and workers to cancel queries that run too long.
-- Limit queries to five minutes
ALTER DATABASE citus
SET statement_timeout TO 300000;
SELECT run_command_on_workers($cmd$
ALTER DATABASE citus
SET statement_timeout TO 300000;
$cmd$);
The timeout is specified in milliseconds.
To customize the timeout per query, use SET LOCAL in
a transaction:
BEGIN;
-- this limit applies to just the current transaction
SET LOCAL statement_timeout TO 300000;
-- ...
COMMIT;
The traffic between the different nodes in the cluster is encrypted for new installations, using TLS with self-signed certificates. Because the certificates are self-signed, this does not protect against man-in-the-middle attacks; it only protects against passive eavesdropping on the network.
Clusters originally created with citus do
not have any network encryption enabled between nodes (even if upgraded
later). To set up self-signed TLS on this type of installation follow
the steps in the
Creating Certificates
section together with the citus specific
settings described here, i.e. changing the
citus.node_conninfo
parameter to sslmode=require. This setup should be
done on the coordinator and workers.
When citus nodes communicate with one another they consult a table with connection credentials. This gives the database administrator flexibility to adjust parameters for security and efficiency.
To set non-sensitive libpq connection
parameters to be used for all node connections, update the
citus.node_conninfo configuration parameter:
-- key=value pairs separated by spaces.
-- For example, ssl options:
ALTER SYSTEM SET citus.node_conninfo =
  'sslrootcert=/path/to/citus-ca.crt sslcrl=/path/to/citus-ca.crl sslmode=verify-full';
There is a whitelist of options that the
citus.node_conninfo
configuration parameter accepts. The default value is
sslmode=require, which prevents unencrypted
communication between nodes. If your cluster was originally created with
citus, the value will be
sslmode=prefer. After setting up self-signed
certificates on all nodes it is recommended to change this setting to
sslmode=require.
After changing this setting it is important to reload the Postgres Pro configuration. Even though the changed setting might be visible in all sessions, the setting is only consulted by citus when new connections are established. When a reload signal is received, citus marks all existing connections to be closed which causes a reconnect after running transactions have been completed.
SELECT pg_reload_conf();
-- Only superusers can access this table
-- Add a password for user jdoe
INSERT INTO pg_dist_authinfo (nodeid, rolename, authinfo)
VALUES (123, 'jdoe', 'password=abc123');
After this INSERT, any query needing to connect to
node 123 as the user jdoe will use
the supplied password. To learn more, see the section about the
pg_dist_authinfo
table.
-- Update user jdoe to use certificate authentication
UPDATE pg_dist_authinfo
SET authinfo = 'sslcert=/path/to/user.crt sslkey=/path/to/user.key'
WHERE nodeid = 123 AND rolename = 'jdoe';
This changes the user from password authentication to certificate-based authentication when connecting to node 123. Make sure the user certificate is signed by a certificate that the worker you are connecting to trusts, and that the authentication settings on that worker allow certificate-based authentication. Full documentation on how to use client certificates can be found in the Client Certificates section.
Changing the pg_dist_authinfo table does not
force any existing connection to reconnect.
This section assumes you have a trusted Certificate Authority that can issue server certificates to you for all nodes in your cluster. It is recommended to work with the security department in your organization to prevent key material from being handled incorrectly. This guide covers only citus specific configuration that needs to be applied, not best practices for PKI management.
For all nodes in the cluster you need to get a valid certificate signed by the same Certificate Authority. The following machine-specific files are assumed to be available on every machine:
/path/to/server.key — Server Private Key
/path/to/server.crt — Server Certificate
or Certificate Chain for Server Key, signed by trusted Certificate
Authority
Next to these machine-specific files you need these cluster or Certificate Authority wide files available:
/path/to/ca.crt — Certificate of the
Certificate Authority
/path/to/ca.crl — Certificate Revocation
List of the Certificate Authority
The Certificate Revocation List is likely to change over time. Work with your security department to set up a mechanism that propagates updates of the revocation list to all nodes in the cluster in a timely manner. A reload of every node in the cluster is required after the revocation list has been updated.
Once all files are in place on the nodes, the following settings need to be configured in the Postgres configuration file:
# The following settings allow the postgres server to enable ssl, and
# configure the server to present the certificate to clients when
# connecting over tls/ssl
ssl = on
ssl_key_file = '/path/to/server.key'
ssl_cert_file = '/path/to/server.crt'

# This will tell citus to verify the certificate of the server it is connecting to
citus.node_conninfo = 'sslmode=verify-full sslrootcert=/path/to/ca.crt sslcrl=/path/to/ca.crl'
After changing these settings, reload the configuration to apply them.
Also, adjusting
citus.local_hostname may be
required for proper functioning with sslmode=verify-full.
Depending on the policy of the Certificate Authority used you might need
or want to change sslmode=verify-full in
citus.node_conninfo to
sslmode=verify-ca. For the difference between the two
settings, consult the
SSL Mode Descriptions
section.
Lastly, to prevent any user from connecting via an un-encrypted connection,
changes need to be made to pg_hba.conf. Many
Postgres Pro installations will have entries
allowing host connections which allow SSL/TLS
connections as well as plain TCP connections. By replacing all
host entries with hostssl entries,
only encrypted connections will be allowed to authenticate to
Postgres Pro. For full documentation on these
settings take a look at the section about the
pg_hba.conf
file.
When a trusted Certificate Authority is not available, you can create your own via a self-signed root certificate. This is non-trivial, and you should seek guidance from your security team when doing so.
To verify the connections from the coordinator to the workers are encrypted you can run the following query. It will show the SSL/TLS version used to encrypt the connection that the coordinator uses to talk to the worker:
SELECT run_command_on_workers($$
  SELECT version FROM pg_stat_ssl WHERE pid = pg_backend_pid()
$$);
┌────────────────────────────┐
│   run_command_on_workers   │
├────────────────────────────┤
│ (localhost,9701,t,TLSv1.2) │
│ (localhost,9702,t,TLSv1.2) │
└────────────────────────────┘
(2 rows)
To help you get started, our
multi-node installation instructions
direct you to set up the pg_hba.conf
on the workers with its
authentication method set to
trust for local network connections. However, you
might want more security.
To require that all connections supply a hashed password, update
the Postgres Pro pg_hba.conf
on every worker node with something like this:
# Require password access and a ssl/tls connection to nodes in the local
# network. The following ranges correspond to 24, 20, and 16-bit blocks
# in Private IPv4 address spaces.
hostssl    all    all    10.0.0.0/8      md5

# Require passwords and ssl/tls connections when the host connects to
# itself as well.
hostssl    all    all    127.0.0.1/32    md5
hostssl    all    all    ::1/128         md5
The coordinator node needs to know roles' passwords in order to communicate with the workers. In citus the authentication information has to be maintained in the .pgpass file. Edit the file in the Postgres Pro user home directory, with a line for each combination of worker address and role:
hostname:port:database:username:password
Sometimes workers need to connect to one another, such as during
repartition joins.
Thus each worker node requires a copy of the .pgpass
file as well.
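For illustration, hypothetical .pgpass entries might look like this (the hostnames, database, role, and passwords below are placeholders):

```
worker-101.example.com:5432:mydb:appuser:secretpw
worker-102.example.com:5432:mydb:appuser:secretpw
```

Note that the file must be readable only by its owner (permissions 0600), otherwise it will be ignored.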
Postgres Pro row-level security policies restrict, on a per-user basis, which rows can be returned by normal queries or inserted, updated, or deleted by data modification commands. This can be especially useful in a multi-tenant citus cluster because it allows individual tenants to have full SQL access to the database while hiding each tenant's information from other tenants.
We can implement the separation of tenant data by using a naming
convention for database roles that ties into table row-level security
policies. We will assign each tenant a database role in a numbered
sequence: tenant_1, tenant_2, etc.
Tenants will connect to citus using these
separate roles. Row-level security policies can compare the role name to
values in the tenant_id distribution column to decide
whether to allow access.
Here is how to apply the approach on a simplified events table distributed
by tenant_id. First create the roles
tenant_1 and tenant_2. Then run
the following as an administrator:
CREATE TABLE events(
tenant_id int,
id int,
type text
);
SELECT create_distributed_table('events','tenant_id');
INSERT INTO events VALUES (1,1,'foo'), (2,2,'bar');
-- Assumes that roles tenant_1 and tenant_2 exist
GRANT select, update, insert, delete
ON events TO tenant_1, tenant_2;
As it stands, anyone with SELECT permissions for this
table can see both rows. Users from either tenant can see and update the
row of the other tenant. We can solve this with row-level table security
policies.
Each policy consists of two clauses: USING and
WITH CHECK. When a user tries to read or write rows,
the database evaluates each row against these clauses. Existing table
rows are checked against the expression specified in
USING, while new rows that would be created via
INSERT or UPDATE are checked
against the expression specified in WITH CHECK.
-- First a policy for the system admin "citus" user
CREATE POLICY admin_all ON events
  TO citus           -- apply to this role
  USING (true)       -- read any existing row
  WITH CHECK (true); -- insert or update any row

-- Next a policy which allows role "tenant_<n>" to
-- access rows where tenant_id = <n>
CREATE POLICY user_mod ON events
  USING (current_user = 'tenant_' || tenant_id::text);
  -- Lack of CHECK means same condition as USING

-- Enforce the policies
ALTER TABLE events ENABLE ROW LEVEL SECURITY;
Now roles tenant_1 and tenant_2
get different results for their queries:
Connected as tenant_1:
SELECT * FROM events;
┌───────────┬────┬──────┐
│ tenant_id │ id │ type │
├───────────┼────┼──────┤
│         1 │  1 │ foo  │
└───────────┴────┴──────┘
Connected as tenant_2:
SELECT * FROM events;
┌───────────┬────┬──────┐
│ tenant_id │ id │ type │
├───────────┼────┼──────┤
│         2 │  2 │ bar  │
└───────────┴────┴──────┘
INSERT INTO events VALUES (3,3,'surprise');
/*
ERROR:  new row violates row-level security policy for table "events_102055"
*/
citus provides distributed functionality by
extending Postgres Pro using the hook and
extension APIs. This allows users to benefit from the features that come
with the rich Postgres Pro ecosystem. These
features include, but are not limited to, support for a wide range of
data types (including semi-structured
data types like jsonb and hstore),
operators and functions, full text
search, and other extensions such as
PostGIS and
HyperLogLog.
Further, proper use of the extension APIs enable compatibility with
standard Postgres Pro tools such as
pgAdmin and
pg_upgrade.
As citus is an extension which can be
installed on any Postgres Pro instance, you can
directly use other extensions such as hstore,
hll, or PostGIS
with citus. However, there is one thing to
keep in mind. While including other extensions in
shared_preload_libraries, you should make sure that
citus is the first extension.
There are several extensions that may be useful when working with citus:
cstore_fdw — columnar store for analytics. The columnar nature delivers performance by reading only relevant data from disk, and it may compress data 6x-10x to reduce space requirements for data archival.
pg_cron — run periodic jobs directly from the database.
topn — returns the top values in a database according to some criteria. Uses an approximation algorithm to provide fast results with modest compute and memory resources.
hll — HyperLogLog data structure as a native data type. It is a fixed-size, set-like structure used for distinct value counting with tunable precision.
Each Postgres Pro server can hold multiple databases. However, new databases do not inherit the extensions of any others; all desired extensions must be added afresh. To run citus on a new database, you will need to create the database on the coordinator and workers, create the citus extension within that database, and register the workers in the coordinator database.
Connect to each of the worker nodes and run:
-- On every worker node
CREATE DATABASE newbie;
\c newbie
CREATE EXTENSION citus;
Then, on the coordinator:
CREATE DATABASE newbie;
\c newbie
CREATE EXTENSION citus;
SELECT * FROM citus_add_node('node-name', 5432);
SELECT * FROM citus_add_node('node-name2', 5432);
-- ... for all of them
Now the new database will be operating as another citus cluster.
The usual way to find table sizes in Postgres Pro, pg_total_relation_size, drastically under-reports the size of distributed tables. All this function does on a citus cluster is reveal the size of tables on the coordinator node. In reality the data in distributed tables lives on the worker nodes (in shards), not on the coordinator. A true measure of distributed table size is obtained as a sum of shard sizes. citus provides helper functions to query this information.
| Function | Returns |
|---|---|
| citus_relation_size | Disk space used by the shards of the table, i.e. the "main fork" (analogous to pg_relation_size) |
| citus_table_size | Disk space used by the shards, excluding indexes but including TOAST, the free space map, and the visibility map (analogous to pg_table_size) |
| citus_total_relation_size | Total disk space used by the shards, including all indexes and TOAST data (analogous to pg_total_relation_size) |
These functions are analogous to three of the standard Postgres Pro object size functions, with the additional note that if they cannot connect to a node, they error out.
Here is an example of using one of the helper functions to list the sizes of all distributed tables:
SELECT logicalrelid AS name,
pg_size_pretty(citus_table_size(logicalrelid)) AS size
FROM pg_dist_partition;
Output:
┌───────────────┬───────┐
│     name      │ size  │
├───────────────┼───────┤
│ github_users  │ 39 MB │
│ github_events │ 37 MB │
└───────────────┴───────┘
In Postgres Pro (and other MVCC databases), an
UPDATE or DELETE of a row does not
immediately remove the old version of the row. The accumulation of
outdated rows is called bloat and must be cleaned to avoid decreased
query performance and unbounded growth of disk space requirements.
Postgres Pro runs a process called the
auto-vacuum daemon that periodically vacuums (removes) outdated rows.
It is not just user queries that scale in a distributed database; vacuuming does too. In Postgres Pro, big busy tables have great potential to bloat, both because they are less sensitive to the Postgres Pro vacuum scale factor parameter and, more generally, because of the extent of their row churn. Splitting a table into distributed shards means both that individual shards are smaller tables and that auto-vacuum workers can parallelize over different parts of the table on different machines. Ordinarily auto-vacuum can only run one worker per table.
Due to the above, auto-vacuum operations on a citus cluster are probably good enough for most cases. However, for tables with particular workloads, or companies with certain “safe” hours to schedule a vacuum, it might make more sense to manually vacuum a table rather than leaving all the work to auto-vacuum.
To vacuum a table, simply run this on the coordinator node:
VACUUM my_distributed_table;
Running VACUUM against a distributed table sends the
VACUUM command to every one of that table's placements
(one connection per placement). This is done in parallel. All
options are supported (including the
table_and_columns list) except for
VERBOSE. The VACUUM command also
runs on the coordinator, and does so before any worker nodes are
notified. Note that unqualified vacuum commands (i.e. those without a
table specified) do not propagate to worker nodes.
Postgres Pro ANALYZE command
collects statistics about the contents of tables in the database.
Subsequently, the query planner uses these statistics to help determine
the most efficient execution plans for queries.
The auto-vacuum daemon, discussed in the previous section, will
automatically issue ANALYZE commands whenever the
content of a table has changed sufficiently. The daemon schedules
ANALYZE strictly as a function of the number of rows
inserted or updated; it has no knowledge of whether that will lead to
meaningful statistical changes. Administrators might prefer to manually
schedule ANALYZE operations instead, to coincide with
statistically meaningful table changes.
To analyze a table, run this on the coordinator node:
ANALYZE my_distributed_table;
citus propagates the ANALYZE
command to all worker node placements.
citus provides append-only columnar table storage for analytic and data warehousing workloads. When columns (rather than rows) are stored contiguously on disk, data becomes more compressible, and queries can request a subset of columns more quickly.
To use columnar storage, specify USING columnar
when creating a table:
CREATE TABLE contestant (
handle TEXT,
birthdate DATE,
rating INT,
percentile FLOAT,
country CHAR(3),
achievements TEXT[]
) USING columnar;
You can also convert between row-based (heap) and
columnar storage.
-- Convert to row-based (heap) storage
SELECT alter_table_set_access_method('contestant', 'heap');
-- Convert to columnar storage (indexes will be dropped)
SELECT alter_table_set_access_method('contestant', 'columnar');
citus converts rows to columnar storage in “stripes” during insertion. Each stripe holds one transaction's worth of data, or 150000 rows, whichever is less. (The stripe size and other parameters of a columnar table can be changed with the alter_columnar_table_set function.)
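As a hedged sketch, stripe settings could be adjusted like this (the parameter names follow the alter_columnar_table_set signature; the values below are arbitrary examples, not recommendations):

```sql
-- Raise the stripe row limit and choose zstd compression
SELECT alter_columnar_table_set(
    'contestant',
    stripe_row_limit => 500000,
    compression => 'zstd');
```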
For example, the following statement puts all five rows into the same stripe, because all values are inserted in a single transaction:
-- Insert these values into a single columnar stripe
INSERT INTO contestant VALUES
('a','1990-01-10',2090,97.1,'XA','{a}'),
('b','1990-11-01',2203,98.1,'XA','{a,b}'),
('c','1988-11-01',2907,99.4,'XB','{w,y}'),
('d','1985-05-05',2314,98.3,'XB','{}'),
('e','1995-05-05',2236,98.2,'XC','{a}');
It is best to make large stripes when possible, because
citus compresses columnar data separately per
stripe. We can see facts about our columnar table like compression rate,
number of stripes, and average rows per stripe by using
VACUUM VERBOSE:
VACUUM VERBOSE contestant;
INFO:  statistics for "contestant":
storage id: 10000000000
total file size: 24576, total data size: 248
compression rate: 1.31x
total row count: 5, stripe count: 1, average rows per stripe: 5
chunk count: 6, containing data for dropped columns: 0, zstd compressed: 6
The output shows that citus used the
zstd compression algorithm to obtain 1.31x data
compression. The compression rate compares the size of inserted data as
it was staged in memory against the size of that data compressed in its
eventual stripe.
Because of how it is measured, the compression rate may or may not match the size difference between row and columnar storage for a table. The only way to truly find that difference is to construct a row and columnar table that contain the same data and compare.
Let's create a new example with more data to benchmark the compression savings.
-- First a wide table using row storage
CREATE TABLE perf_row(
  c00 int8, c01 int8, c02 int8, c03 int8, c04 int8, c05 int8, c06 int8, c07 int8, c08 int8, c09 int8,
  c10 int8, c11 int8, c12 int8, c13 int8, c14 int8, c15 int8, c16 int8, c17 int8, c18 int8, c19 int8,
  c20 int8, c21 int8, c22 int8, c23 int8, c24 int8, c25 int8, c26 int8, c27 int8, c28 int8, c29 int8,
  c30 int8, c31 int8, c32 int8, c33 int8, c34 int8, c35 int8, c36 int8, c37 int8, c38 int8, c39 int8,
  c40 int8, c41 int8, c42 int8, c43 int8, c44 int8, c45 int8, c46 int8, c47 int8, c48 int8, c49 int8,
  c50 int8, c51 int8, c52 int8, c53 int8, c54 int8, c55 int8, c56 int8, c57 int8, c58 int8, c59 int8,
  c60 int8, c61 int8, c62 int8, c63 int8, c64 int8, c65 int8, c66 int8, c67 int8, c68 int8, c69 int8,
  c70 int8, c71 int8, c72 int8, c73 int8, c74 int8, c75 int8, c76 int8, c77 int8, c78 int8, c79 int8,
  c80 int8, c81 int8, c82 int8, c83 int8, c84 int8, c85 int8, c86 int8, c87 int8, c88 int8, c89 int8,
  c90 int8, c91 int8, c92 int8, c93 int8, c94 int8, c95 int8, c96 int8, c97 int8, c98 int8, c99 int8
);

-- Next a table with identical columns using columnar storage
CREATE TABLE perf_columnar(LIKE perf_row) USING COLUMNAR;
Fill both tables with the same large dataset:
INSERT INTO perf_row
SELECT
g % 00500, g % 01000, g % 01500, g % 02000, g % 02500, g % 03000, g % 03500, g % 04000, g % 04500, g % 05000,
g % 05500, g % 06000, g % 06500, g % 07000, g % 07500, g % 08000, g % 08500, g % 09000, g % 09500, g % 10000,
g % 10500, g % 11000, g % 11500, g % 12000, g % 12500, g % 13000, g % 13500, g % 14000, g % 14500, g % 15000,
g % 15500, g % 16000, g % 16500, g % 17000, g % 17500, g % 18000, g % 18500, g % 19000, g % 19500, g % 20000,
g % 20500, g % 21000, g % 21500, g % 22000, g % 22500, g % 23000, g % 23500, g % 24000, g % 24500, g % 25000,
g % 25500, g % 26000, g % 26500, g % 27000, g % 27500, g % 28000, g % 28500, g % 29000, g % 29500, g % 30000,
g % 30500, g % 31000, g % 31500, g % 32000, g % 32500, g % 33000, g % 33500, g % 34000, g % 34500, g % 35000,
g % 35500, g % 36000, g % 36500, g % 37000, g % 37500, g % 38000, g % 38500, g % 39000, g % 39500, g % 40000,
g % 40500, g % 41000, g % 41500, g % 42000, g % 42500, g % 43000, g % 43500, g % 44000, g % 44500, g % 45000,
g % 45500, g % 46000, g % 46500, g % 47000, g % 47500, g % 48000, g % 48500, g % 49000, g % 49500, g % 50000
FROM generate_series(1,50000000) g;
INSERT INTO perf_columnar
SELECT
g % 00500, g % 01000, g % 01500, g % 02000, g % 02500, g % 03000, g % 03500, g % 04000, g % 04500, g % 05000,
g % 05500, g % 06000, g % 06500, g % 07000, g % 07500, g % 08000, g % 08500, g % 09000, g % 09500, g % 10000,
g % 10500, g % 11000, g % 11500, g % 12000, g % 12500, g % 13000, g % 13500, g % 14000, g % 14500, g % 15000,
g % 15500, g % 16000, g % 16500, g % 17000, g % 17500, g % 18000, g % 18500, g % 19000, g % 19500, g % 20000,
g % 20500, g % 21000, g % 21500, g % 22000, g % 22500, g % 23000, g % 23500, g % 24000, g % 24500, g % 25000,
g % 25500, g % 26000, g % 26500, g % 27000, g % 27500, g % 28000, g % 28500, g % 29000, g % 29500, g % 30000,
g % 30500, g % 31000, g % 31500, g % 32000, g % 32500, g % 33000, g % 33500, g % 34000, g % 34500, g % 35000,
g % 35500, g % 36000, g % 36500, g % 37000, g % 37500, g % 38000, g % 38500, g % 39000, g % 39500, g % 40000,
g % 40500, g % 41000, g % 41500, g % 42000, g % 42500, g % 43000, g % 43500, g % 44000, g % 44500, g % 45000,
g % 45500, g % 46000, g % 46500, g % 47000, g % 47500, g % 48000, g % 48500, g % 49000, g % 49500, g % 50000
FROM generate_series(1,50000000) g;
VACUUM (FREEZE, ANALYZE) perf_row;
VACUUM (FREEZE, ANALYZE) perf_columnar;
For this data, you can see a compression ratio of better than 8X in the columnar table.
SELECT pg_total_relation_size('perf_row')::numeric/
pg_total_relation_size('perf_columnar') AS compression_ratio;
 compression_ratio
--------------------
 8.0196135873627944
(1 row)
Columnar storage works well with table partitioning. For example, see the Archiving with Columnar Storage section.
Columnar storage compresses per stripe. Stripes are created per transaction, so inserting one row per transaction will put single rows into their own stripes. Compression and performance of single row stripes will be worse than a row table. Always insert in bulk to a columnar table.
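For example, loading with COPY inserts all rows in a single transaction, which produces large stripes (the file path below is a placeholder):

```sql
-- Bulk load: one transaction, hence full-sized stripes instead of
-- one tiny stripe per row
COPY contestant FROM '/path/to/data.csv' WITH (FORMAT csv);
```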
Even if you mess up and columnarize a bunch of tiny stripes, it is
possible to repair it. Simply run VACUUM (FULL)
on the table like so:
VACUUM (FULL) foo_table;
In some cases it might be more desirable to create a new table, move the data and drop the old one. You can do it like so:
BEGIN;
CREATE TABLE foo_compacted (LIKE foo) USING columnar;
INSERT INTO foo_compacted SELECT * FROM foo;
DROP TABLE foo;
ALTER TABLE foo_compacted RENAME TO foo;
COMMIT;
Fundamentally non-compressible data can be a problem, although it can still be useful to use columnar so that less is loaded into memory when selecting specific columns.
On a partitioned table with a mix of row and column partitions, updates must be carefully targeted or filtered to hit only the row partitions.
If the operation is targeted at a specific row partition (e.g.
UPDATE p2 SET i = i + 1), it will
succeed; if targeted at a specific columnar partition
(e.g. UPDATE p1 SET i = i + 1), it will
fail.
If the operation is targeted at the partitioned table and has
a WHERE clause that excludes all columnar
partitions (e.g.
UPDATE parent SET i = i + 1 WHERE timestamp = '2020-03-15'),
it will succeed.
If the operation is targeted at the partitioned table, but
does not exclude all columnar partitions, it will fail; even
if the actual data to be updated only affects row tables (e.g.
UPDATE parent SET i = i + 1 WHERE n = 300).
Future versions of citus will incrementally lift the current limitations:
Append-only (no UPDATE/DELETE
support)
No space reclamation (e.g. rolled-back transactions may still consume disk space)
No bitmap index scans
No TID scan
No sample scans
No TOAST support (large values supported inline)
No support for ON CONFLICT statements (except
DO NOTHING actions with no target specified)
No support for tuple locks (SELECT ... FOR SHARE,
SELECT ... FOR UPDATE)
No support for serializable isolation level
Support for Postgres Pro server versions 12+ only
No support for foreign keys
No support for logical decoding
No support for intra-node parallel scans
No support for AFTER ... FOR EACH ROW triggers
No UNLOGGED columnar tables
In this section, we describe how you can tune your citus cluster to get maximum performance. We begin by explaining how choosing the right distribution column affects performance. We then describe how you can first tune your database for high performance on one Postgres Pro server and then scale it out across all the CPUs in the cluster. In this section, we also discuss several performance related configuration parameters wherever relevant.
The first step in creating a distributed table is choosing the right distribution column. This helps citus push down several operations directly to the worker shards and prune away unrelated shards, leading to significant query speedups.
Typically, you should pick as the distribution column the column that
is the most commonly used join key or on which most queries have filters.
For filters, citus uses the distribution
column ranges to prune away unrelated shards, ensuring that the query
hits only those shards which overlap with the WHERE
clause ranges. For joins, if the join key is the same as the distribution
column, then citus executes the join only
between those shards that have matching or overlapping distribution
column ranges. All these shard joins can be executed in parallel on the
workers and hence are more efficient.
In addition, citus can push down several operations directly to the worker shards if they are based on the distribution column. This greatly reduces both the amount of computation on each node and the network bandwidth involved in transferring data across nodes.
Once you choose the right distribution column, you can then proceed to the next step, which is tuning worker node performance.
The citus coordinator partitions an incoming query into fragment queries and sends them to the workers for parallel processing. The workers are just extended Postgres Pro servers and they apply Postgres Pro standard planning and execution logic for these queries. So, the first step in tuning citus is tuning the Postgres Pro configuration parameters on the workers for high performance.
Tuning the parameters is a matter of experimentation and often takes several attempts to achieve acceptable performance. Thus it is best to load only a small portion of your data when tuning to make each iteration go faster.
To begin the tuning process create a citus
cluster and load data in it. From the coordinator node, run the
EXPLAIN command on representative queries to inspect
performance. citus extends the
EXPLAIN command to provide information about
distributed query execution. The EXPLAIN output shows
how each worker processes the query and also a little about how the
coordinator node combines their results.
Here is an example of explaining the plan for a particular example query.
We use the VERBOSE flag to see the actual queries,
which were sent to the worker nodes.
EXPLAIN VERBOSE
SELECT date_trunc('minute', created_at) AS minute,
sum((payload->>'distinct_size')::int) AS num_commits
FROM github_events
WHERE event_type = 'PushEvent'
GROUP BY minute
ORDER BY minute;
Sort (cost=0.00..0.00 rows=0 width=0)
Sort Key: remote_scan.minute
-> HashAggregate (cost=0.00..0.00 rows=0 width=0)
Group Key: remote_scan.minute
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0)
Task Count: 32
Tasks Shown: One of 32
-> Task
Query: SELECT date_trunc('minute'::text, created_at) AS minute, sum(((payload OPERATOR(pg_catalog.->>) 'distinct_size'::text))::integer) AS num_commits FROM github_events_102042 github_events WHERE (event_type OPERATOR(pg_catalog.=) 'PushEvent'::text) GROUP BY (date_trunc('minute'::text, created_at))
Node: host=localhost port=5433 dbname=postgres
-> HashAggregate (cost=93.42..98.36 rows=395 width=16)
Group Key: date_trunc('minute'::text, created_at)
-> Seq Scan on github_events_102042 github_events (cost=0.00..88.20 rows=418 width=503)
Filter: (event_type = 'PushEvent'::text)
(13 rows)
This tells you several things. To begin with there are 32 shards, and the planner chose the citus adaptive executor to execute this query:
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0)
     Task Count: 32
Next it picks one of the workers and shows you more about how the query behaves there. It indicates the host, port, database, and the query that was sent to the worker so you can connect to the worker directly and try the query if desired:
Tasks Shown: One of 32
-> Task
Query: SELECT date_trunc('minute'::text, created_at) AS minute, sum(((payload OPERATOR(pg_catalog.->>) 'distinct_size'::text))::integer) AS num_commits FROM github_events_102042 github_events WHERE (event_type OPERATOR(pg_catalog.=) 'PushEvent'::text) GROUP BY (date_trunc('minute'::text, created_at))
Node: host=localhost port=5433 dbname=postgres
Distributed EXPLAIN next shows the results of running
a normal Postgres Pro EXPLAIN
on that worker for the fragment query:
-> HashAggregate (cost=93.42..98.36 rows=395 width=16)
Group Key: date_trunc('minute'::text, created_at)
-> Seq Scan on github_events_102042 github_events (cost=0.00..88.20 rows=418 width=503)
Filter: (event_type = 'PushEvent'::text)
You can now connect to the worker at localhost, port
5433 and tune query performance for the shard
github_events_102042 using standard
Postgres Pro techniques. As you make changes run
EXPLAIN again from the coordinator or right on the
worker.
The first set of such optimizations relates to configuration settings. Postgres Pro by default comes with conservative resource settings, and among these, shared_buffers and work_mem are probably the most important for optimizing read performance. We discuss these parameters briefly below. Apart from them, several other configuration settings impact query performance. These settings are covered in more detail in the Server Configuration chapter.
The shared_buffers configuration parameter defines
the amount of memory allocated to the database for caching data and
defaults to 128MB. If you have a worker node with
1GB or more RAM, a reasonable starting value for
shared_buffers is 1/4 of the memory in your system.
There are some workloads where even larger settings for
shared_buffers are effective, but given the way
Postgres Pro also relies on the operating system
cache, it is unlikely you will find using more than 25% of RAM to work
better than a smaller amount.
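For instance, on a hypothetical worker with 8GB of RAM, the starting value described above could be applied as follows (the value is an assumption for that machine size; shared_buffers requires a server restart to take effect):

```sql
-- 1/4 of an assumed 8GB of RAM on this worker
ALTER SYSTEM SET shared_buffers = '2GB';
```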
If you do a lot of complex sorts, then increasing
work_mem allows Postgres Pro
to do larger in-memory sorts, which will be faster than disk-based
equivalents. If you see a lot of disk activity on your worker node in spite
of having a decent amount of memory, then increasing
work_mem can be useful. This will
help Postgres Pro choose more efficient
query plans and allow a greater number of operations to occur in memory.
Other than the above configuration settings, the
Postgres Pro query planner relies on statistical
information about the contents of tables to generate good plans. These
statistics are gathered when ANALYZE is run, which
the auto-vacuum daemon does automatically by default. You can learn more
about the Postgres Pro planner and the
ANALYZE command in the relevant section.
Lastly, you can create indexes on your tables to enhance database
performance. Indexes allow the database to find and retrieve specific
rows much faster than it could do without an index. To choose which
indexes give the best performance, you can run the query with the
EXPLAIN command to
view query plans and optimize the slower parts of the query. After an
index is created, the system has to keep it synchronized with the table
which adds overhead to data manipulation operations. Therefore, indexes
that are seldom or never used in queries should be removed.
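For instance, an index on the filter column from the earlier EXPLAIN example could be tried as follows; the index name is hypothetical:

```sql
-- Hypothetical index on the column filtered in the example query
CREATE INDEX event_type_idx ON github_events (event_type);
-- Re-check the plan to see whether the index is used
EXPLAIN SELECT count(*) FROM github_events WHERE event_type = 'PushEvent';
-- Drop it again if it never gets used
-- DROP INDEX event_type_idx;
```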
For write performance, you can use general
Postgres Pro configuration tuning to increase
INSERT rates. We commonly recommend increasing
checkpoint_timeout and
max_wal_size settings. Also,
depending on the reliability requirements of your application, you can
choose to change fsync or
synchronous_commit values.
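As a sketch, such settings could be adjusted like this; the values are illustrative starting points, not recommendations for every workload:

```sql
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET max_wal_size = '8GB';
-- Only if your application tolerates losing the most recent
-- transactions after a crash:
-- ALTER SYSTEM SET synchronous_commit = off;
SELECT pg_reload_conf();
```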
Once you have tuned a worker to your satisfaction, you will have to manually apply those changes to the other workers as well. To verify that they are all behaving properly, set this configuration variable on the coordinator:
SET citus.explain_all_tasks = 1;
This will cause EXPLAIN to show the query plan for
all tasks, not just one.
EXPLAIN
SELECT date_trunc('minute', created_at) AS minute,
sum((payload->>'distinct_size')::int) AS num_commits
FROM github_events
WHERE event_type = 'PushEvent'
GROUP BY minute
ORDER BY minute;
Sort (cost=0.00..0.00 rows=0 width=0)
Sort Key: remote_scan.minute
-> HashAggregate (cost=0.00..0.00 rows=0 width=0)
Group Key: remote_scan.minute
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0)
Task Count: 32
Tasks Shown: All
-> Task
Node: host=localhost port=5433 dbname=postgres
-> HashAggregate (cost=93.42..98.36 rows=395 width=16)
Group Key: date_trunc('minute'::text, created_at)
-> Seq Scan on github_events_102042 github_events (cost=0.00..88.20 rows=418 width=503)
Filter: (event_type = 'PushEvent'::text)
-> Task
Node: host=localhost port=5434 dbname=postgres
-> HashAggregate (cost=103.21..108.57 rows=429 width=16)
Group Key: date_trunc('minute'::text, created_at)
-> Seq Scan on github_events_102043 github_events (cost=0.00..97.47 rows=459 width=492)
Filter: (event_type = 'PushEvent'::text)
--
-- ... repeats for all 32 tasks
-- alternating between workers one and two
-- (running in this case locally on ports 5433, 5434)
--
(199 rows)
Differences in worker execution can be caused by tuning configuration
differences, uneven data distribution across shards, or hardware
differences between the machines. To get more information about the time
it takes the query to run on each shard you can use
EXPLAIN ANALYZE.
Note that when
citus.explain_all_tasks
is enabled, EXPLAIN plans are retrieved sequentially,
which may take a long time for EXPLAIN ANALYZE.
citus, by default, sorts tasks by execution
time in descending order. If
citus.explain_all_tasks
is disabled, then citus shows the single
longest-running task. Please note that this functionality can be used
only with EXPLAIN ANALYZE, since regular
EXPLAIN does not execute the queries, and therefore
does not know any execution times. To change the sort order, you can use
the
citus.explain_analyze_sort_method
configuration parameter.
As mentioned, once you have achieved the desired performance for a single shard you can set similar configuration parameters on all your workers. As citus runs all the fragment queries in parallel across the worker nodes, users can scale out query performance to the cumulative computing power of all the CPU cores in the cluster, assuming that the data fits in memory.
Users should try to fit as much of their working set in memory as possible to get best performance with citus. If fitting the entire working set in memory is not feasible, we recommend using SSDs over HDDs as a best practice. This is because HDDs are able to show decent performance when you have sequential reads over contiguous blocks of data, but have significantly lower random read / write performance. In cases where you have a high number of concurrent queries doing random reads and writes, using SSDs can improve query performance by several times as compared to HDDs. Also, if your queries are highly compute-intensive, it might be beneficial to choose machines with more powerful CPUs.
To measure the disk space usage of your database objects, you can log
into the worker nodes and use
Postgres Pro administration functions
for individual shards. The pg_total_relation_size
function can be used to get the total disk space used by a table. You
can also use other functions mentioned in the
Postgres Pro documentation to get more specific
size information. On the basis of these statistics for a shard and the
shard count, users can compute the hardware requirements for their
cluster.
Another factor that affects performance is the number of shards per worker node. citus partitions an incoming query into its fragment queries which run on individual worker shards. Hence, the degree of parallelism for each query is governed by the number of shards the query hits. To ensure maximum parallelism, you should create enough shards on each node such that there is at least one shard per CPU core. Another consideration to keep in mind is that citus will prune away unrelated shards if the query has filters on the distribution column. So, creating more shards than the number of cores might also be beneficial so that you can achieve greater parallelism even after shard pruning.
Once you have distributed your data across the cluster, with each worker optimized for best performance, you should be able to see high performance gains on your queries. After this, the final step is to tune a few distributed performance tuning parameters.
Before we discuss the specific configuration parameters, we recommend that you measure query times on your distributed cluster and compare them with the single-shard performance. This can be done by enabling \timing, running the query on the coordinator node, and running one of the fragment queries on the worker nodes. This helps in determining the amount of time spent on the worker nodes and the amount of time spent in fetching the data to the coordinator node. Then, you can figure out what the bottleneck is and optimize the database accordingly.
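As a sketch, in psql this comparison might look as follows; the shard-qualified table name is illustrative:

```sql
\timing on
-- On the coordinator: the full distributed query
SELECT count(*) FROM github_events WHERE event_type = 'PushEvent';
-- On a worker: the corresponding fragment query against a single shard
-- (the shard suffix 102042 is an example, not a fixed value)
SELECT count(*) FROM github_events_102042 WHERE event_type = 'PushEvent';
```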
In this section, we discuss the parameters that help optimize the distributed query planner and executor. There are several relevant parameters and we discuss them in two sections about general performance tuning and advanced performance tuning. The first section is sufficient for most use cases and covers all the common configuration parameters. The second covers parameters that may provide performance gains in specific use cases.
For higher INSERT performance, the factor that impacts
insert rates the most is the level of concurrency. You should try to run
several concurrent INSERT statements in parallel.
This way you can achieve very high insert rates if you have a powerful
coordinator node and are able to use all the CPU cores on that node
together.
In the best case citus can execute queries containing subqueries and CTEs in a single step. This is usually because both the main query and subquery filter by the distribution column of their tables in the same way and can be pushed down to worker nodes together. However, citus is sometimes forced to execute subqueries before executing the main query, copying the intermediate subquery results to other worker nodes for use by the main query. This technique is called subquery/CTE push-pull execution.
It is important to be aware when subqueries are executed in a separate
step and avoid sending too much data between worker nodes. The network
overhead will hurt performance. The EXPLAIN command
allows you to discover how queries will be executed, including whether
multiple steps are required. For a detailed example, see the
Subquery/CTE Push-Pull Execution
section.
Also you can defensively set a safeguard against large intermediate
results. Adjust the
citus.max_intermediate_result_size
limit in a new connection to the coordinator node. By default the max
intermediate result size is 1 GB, which is large
enough to allow some inefficient queries. Try turning it down and
running your queries:
-- Set a restrictive limit for intermediate results
SET citus.max_intermediate_result_size = '512kB';
-- Attempt to run queries
-- SELECT …
If the query has subqueries or CTEs that exceed this limit, the query will be canceled and you will see an error message:
ERROR:  the intermediate result size exceeds citus.max_intermediate_result_size (currently 512 kB)
DETAIL:  Citus restricts the size of intermediate results of complex subqueries and CTEs to avoid accidentally pulling large result sets into once place.
HINT:  To run the current query, set citus.max_intermediate_result_size to a higher value or -1 to disable.
The size of intermediate results and their destination is available in
EXPLAIN ANALYZE output:
EXPLAIN ANALYZE
WITH deleted_rows AS (
    DELETE FROM page_views WHERE tenant_id IN (3, 4) RETURNING *
), viewed_last_week AS (
    SELECT * FROM deleted_rows
    WHERE view_time > current_timestamp - interval '7 days'
)
SELECT count(*) FROM viewed_last_week;
Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0) (actual time=570.076..570.077 rows=1 loops=1)
-> Distributed Subplan 31_1
Subplan Duration: 6978.07 ms
Intermediate Data Size: 26 MB
Result destination: Write locally
-> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0) (actual time=364.121..364.122 rows=0 loops=1)
Task Count: 2
Tuple data received from nodes: 0 bytes
Tasks Shown: One of 2
-> Task
Tuple data received from node: 0 bytes
Node: host=localhost port=5433 dbname=postgres
-> Delete on page_views_102016 page_views (cost=5793.38..49272.28 rows=324712 width=6) (actual time=362.985..362.985 rows=0 loops=1)
-> Bitmap Heap Scan on page_views_102016 page_views (cost=5793.38..49272.28 rows=324712 width=6) (actual time=362.984..362.984 rows=0 loops=1)
Recheck Cond: (tenant_id = ANY ('{3,4}'::integer[]))
-> Bitmap Index Scan on view_tenant_idx_102016 (cost=0.00..5712.20 rows=324712 width=0) (actual time=19.193..19.193 rows=325733 loops=1)
Index Cond: (tenant_id = ANY ('{3,4}'::integer[]))
Planning Time: 0.050 ms
Execution Time: 363.426 ms
Planning Time: 0.000 ms
Execution Time: 364.241 ms
Task Count: 1
Tuple data received from nodes: 6 bytes
Tasks Shown: All
-> Task
Tuple data received from node: 6 bytes
Node: host=localhost port=5432 dbname=postgres
-> Aggregate (cost=33741.78..33741.79 rows=1 width=8) (actual time=565.008..565.008 rows=1 loops=1)
-> Function Scan on read_intermediate_result intermediate_result (cost=0.00..29941.56 rows=1520087 width=0) (actual time=326.645..539.158 rows=651466 loops=1)
Filter: (view_time > (CURRENT_TIMESTAMP - '7 days'::interval))
Planning Time: 0.047 ms
Execution Time: 569.026 ms
Planning Time: 1.522 ms
Execution Time: 7549.308 ms
In the above EXPLAIN ANALYZE output, you can see
the following information about the intermediate results:
Intermediate Data Size: 26 MB
Result destination: Write locally
This tells us how large the intermediate results were and where they
were written. In this case, they were written
to the node coordinating the query execution, as specified by
Write locally. For some other queries it can also
be of the following format:
Intermediate Data Size: 26 MB
Result destination: Send to 2 nodes
This means the intermediate result was pushed to 2 worker nodes, which involves more network traffic.
When using CTEs, or joins between CTEs and distributed tables, you can avoid push-pull execution by following these rules:
Tables should be co-located.
The CTE queries should not require any merge steps (e.g.,
LIMIT or GROUP BY on a
non-distribution key).
Tables and CTEs should be joined on distribution keys.
Also Postgres Pro allows
citus to take advantage of
CTE inlining to push CTEs down to workers in more
circumstances. The inlining behavior can be controlled with the
MATERIALIZED keyword. To learn more, see the
WITH Queries (Common Table Expressions)
section.
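For example, using the page_views table from the earlier EXPLAIN ANALYZE example (its schema is assumed here), the MATERIALIZED keyword controls whether the CTE is inlined and thus eligible for pushdown, or forced into a separate step:

```sql
-- NOT MATERIALIZED asks the planner to inline the CTE, letting citus
-- push it down to the workers together with the outer query
WITH recent AS NOT MATERIALIZED (
    SELECT * FROM page_views
    WHERE view_time > now() - interval '7 days'
)
SELECT tenant_id, count(*)
FROM recent
GROUP BY tenant_id;
```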
In this section, we discuss advanced performance tuning parameters. These parameters are applicable to specific use cases and may not be required for all deployments.
When executing multi-shard queries, citus must balance the gains from parallelism with the overhead from database connections. The Query Execution section explains the steps of turning queries into worker tasks and obtaining database connections to the workers.
Set the
citus.max_adaptive_executor_pool_size
configuration parameter to a low value like 1
or 2 for transactional workloads with short
queries (e.g. < 20ms of latency). For analytical workloads where
parallelism is critical, leave this setting at its default value
of 16.
Set the
citus.executor_slow_start_interval
configuration parameter to a high value like 100
ms for transactional workloads comprised of short queries that are
bound on network latency rather than parallelism. For analytical
workloads, leave this setting at its default value of
10 ms.
The default value of 1 for the
citus.max_cached_conns_per_worker
configuration parameter is reasonable. A larger value such as
2 might be helpful for clusters that use a
small number of concurrent sessions, but it is not wise to go much
further (e.g. 16 would be too high). If set too
high, sessions will hold idle connections and use worker resources
unnecessarily.
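The recommendations above could be applied for a transactional workload of short queries roughly as follows; the values mirror the guidance in the text and are illustrative:

```sql
-- Illustrative session settings for a short-query transactional workload
SET citus.max_adaptive_executor_pool_size = 2;
SET citus.executor_slow_start_interval = 100;  -- milliseconds
-- For analytical workloads, leave both at their defaults (16 and 10 ms)
```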
Set the citus.max_shared_pool_size configuration parameter to match the max_connections setting of your worker nodes. This setting is mainly a fail-safe.
The citus query planner assigns tasks to the worker nodes based on shard locations. The algorithm used while making these assignments can be chosen by setting the citus.task_assignment_policy configuration parameter. Users can alter this configuration parameter to choose the policy that works best for their use case.
The greedy policy aims to distribute tasks evenly
across the workers. This policy is the default and works well in most
cases. The round-robin policy assigns tasks
to workers in a round-robin fashion alternating between different
replicas. This enables much better cluster utilization when the shard
count for a table is low compared to the number of workers. The third
policy is the first-replica policy that assigns
tasks on the basis of the insertion order of placements (replicas) for
the shards. With this policy, users can be sure of which shards will
be accessed on each machine. This helps in providing stronger memory
residency guarantees by allowing you to keep your working set in memory
and use it for querying.
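A sketch of switching policies for a session:

```sql
-- Inspect the current policy
SHOW citus.task_assignment_policy;
-- E.g. switch to round-robin when the shard count is low
-- relative to the number of workers
SET citus.task_assignment_policy = 'round-robin';
```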
In some cases, a large part of query time is spent in sending query
results from workers to the coordinator. This mostly happens when
queries request many rows (such as SELECT * FROM table),
or when result columns use big types (like hll or
tdigest from the hll and
tdigest extensions).
In those cases it can be beneficial to set
citus.enable_binary_protocol
to true, which will change the encoding of the
results to binary, rather than using text encoding. Binary encoding
significantly reduces bandwidth for types that have a compact binary
representation, such as hll, tdigest,
timestamp and double precision. The default
value for this configuration parameter is already true.
So explicitly enabling it has no effect.
citus lets you scale out data ingestion to very high rates, but there are several trade-offs to consider in terms of application integration, throughput, and latency. In this section, we discuss different approaches to data ingestion, and provide guidelines for expected throughput and latency numbers.
On the citus coordinator, you can perform
INSERT, INSERT .. ON CONFLICT,
UPDATE, and DELETE commands
directly on distributed tables. When you issue one of these commands,
the changes are immediately visible to the user.
When you run the INSERT (or another ingest command),
citus first finds the right shard placements
based on the value in the distribution column. citus
then connects to the worker nodes storing the shard placements, and
performs an INSERT on each of them. From the
perspective of the user, the INSERT takes several
milliseconds to process because of the network latency to worker nodes.
The citus coordinator node, however, can
process concurrent INSERTs to reach high throughputs.
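As a sketch of this routing, consider a single-row INSERT; the table schema here is hypothetical, assuming page_views is distributed by tenant_id:

```sql
-- Assumed setup (illustrative):
-- CREATE TABLE page_views (tenant_id int, page text, view_time timestamptz);
-- SELECT create_distributed_table('page_views', 'tenant_id');
INSERT INTO page_views (tenant_id, page, view_time)
VALUES (3, '/home', now());
-- citus routes the row to the shard placement covering tenant_id = 3
```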
When loading data for temporary staging, consider using an unlogged table. These are tables which are not backed by the Postgres Pro write-ahead log. This makes them faster for inserting rows but not suitable for long term data storage. You can use an unlogged table as a place to load incoming data, prior to manipulating the data and moving it to permanent tables.
-- Example unlogged table
CREATE UNLOGGED TABLE unlogged_table (
key text,
value text
);
-- Its shards will be unlogged as well when
-- the table is distributed
SELECT create_distributed_table('unlogged_table', 'key');
-- Ready to load data
Distributed tables support the
COPY from the
citus coordinator for bulk ingestion, which
can achieve much higher ingestion rates than INSERT
statements.
COPY can be used to load data directly from an
application using COPY .. FROM STDIN, from a file on
the server, or from a program executed on the server.
COPY pgbench_history FROM STDIN WITH (FORMAT CSV);
In psql, the \copy command
can be used to load data from the local machine. The \copy
command actually sends a COPY .. FROM STDIN command
to the server before sending the local data, as would an application
that loads data directly.
psql -c "\COPY pgbench_history FROM 'pgbench_history-2016-03-04.csv' (FORMAT CSV)"
A powerful feature of COPY for distributed tables is
that it asynchronously copies data to the workers over many parallel
connections, one for each shard placement. This means that data can be
ingested using multiple workers and multiple cores in parallel. Especially
when there are expensive indexes such as a GIN, this can lead to major
performance boosts over ingesting into a regular
Postgres Pro table.
From a throughput standpoint, you can expect data ingest rates of
250K - 2M rows per second when using COPY.
Make sure your benchmarking setup is well configured so you can observe
optimal COPY performance. Follow these tips:
We recommend a large batch size (~ 50000-100000). You can benchmark with multiple files (1, 10, 1000, 10000, etc), each of that batch size.
Use parallel ingestion. Increase the number of threads/ingestors to 2, 4, 8, 16 and run benchmarks.
Use a compute-optimized coordinator. For the workers choose memory-optimized boxes with a decent number of vCPUs.
Go with a relatively small shard count, 32 should suffice, but you could benchmark with 64, too.
Ingest data for a suitable amount of time (say 2, 4, 8, 24 hrs). Longer tests are more representative of a production setup.
The rows of a distributed table are grouped into shards, and each shard
is placed on a worker node in the citus
cluster. In the multi-tenant citus use case
we can determine which worker node contains the rows for a specific
tenant by putting together two pieces of information: the
shard_id
associated with the tenant_id, and the shard
placements on workers. The two can be retrieved together in a single
query. Suppose our multi-tenant
application's tenants are stores, and we want to find which worker
node holds the data for gap.com (id=4, suppose).
To find the worker node holding the data for store id=4,
ask for the placement of rows whose distribution column has value 4:
SELECT shardid, shardstate, shardlength, nodename, nodeport, placementid
FROM pg_dist_placement AS placement,
pg_dist_node AS node
WHERE placement.groupid = node.groupid
AND node.noderole = 'primary'
AND shardid = (
SELECT get_shard_id_for_distribution_column('stores', 4)
);
The output contains the host and port of the worker database.
┌─────────┬────────────┬─────────────┬───────────┬──────────┬─────────────┐
│ shardid │ shardstate │ shardlength │ nodename  │ nodeport │ placementid │
├─────────┼────────────┼─────────────┼───────────┼──────────┼─────────────┤
│  102009 │          1 │           0 │ localhost │     5433 │           2 │
└─────────┴────────────┴─────────────┴───────────┴──────────┴─────────────┘
Distributed schemas are automatically associated with individual co-location groups such that the tables created in those schemas are converted to co-located distributed tables without a shard key. You can find where a distributed schema resides by joining the citus_shards view with the citus_schemas view:
SELECT schema_name, nodename, nodeport
FROM citus_shards
JOIN citus_schemas cs
ON cs.colocation_id = citus_shards.colocation_id
GROUP BY 1,2,3;
 schema_name | nodename  | nodeport
-------------+-----------+----------
 a           | localhost |     9701
 b           | localhost |     9702
 with_data   | localhost |     9702
You can also query citus_shards
directly, filtering down to the schema table type, to get a detailed
listing of all tables.
SELECT * FROM citus_shards WHERE citus_table_type = 'schema';
   table_name   | shardid |      shard_name       | citus_table_type | colocation_id | nodename  | nodeport | shard_size | schema_name | colocation_id | schema_size | schema_owner
----------------+---------+-----------------------+------------------+---------------+-----------+----------+------------+-------------+---------------+-------------+--------------
 a.cities       |  102080 | a.cities_102080       | schema           |             4 | localhost |     9701 |       8192 | a           |             4 | 128 kB      | citus
 a.map_tags     |  102145 | a.map_tags_102145     | schema           |             4 | localhost |     9701 |      32768 | a           |             4 | 128 kB      | citus
 a.measurement  |  102047 | a.measurement_102047  | schema           |             4 | localhost |     9701 |          0 | a           |             4 | 128 kB      | citus
 a.my_table     |  102179 | a.my_table_102179     | schema           |             4 | localhost |     9701 |      16384 | a           |             4 | 128 kB      | citus
 a.people       |  102013 | a.people_102013       | schema           |             4 | localhost |     9701 |      32768 | a           |             4 | 128 kB      | citus
 a.test         |  102008 | a.test_102008         | schema           |             4 | localhost |     9701 |       8192 | a           |             4 | 128 kB      | citus
 a.widgets      |  102146 | a.widgets_102146      | schema           |             4 | localhost |     9701 |      32768 | a           |             4 | 128 kB      | citus
 b.test         |  102009 | b.test_102009         | schema           |             5 | localhost |     9702 |       8192 | b           |             5 | 32 kB       | citus
 b.test_col     |  102012 | b.test_col_102012     | schema           |             5 | localhost |     9702 |      24576 | b           |             5 | 32 kB       | citus
 with_data.test |  102180 | with_data.test_102180 | schema           |            11 | localhost |     9702 |     647168 | with_data   |            11 | 632 kB      | citus
Each distributed table in citus has a
“distribution column”. For more information about what this
is and how it works, see the
Choosing Distribution Column
section. There are many situations where it is important to know which
column it is. Some operations require joining or filtering on the
distribution column, and you may encounter error messages with
hints like add a filter to the distribution column.
The pg_dist_* tables on the coordinator node
contain diverse metadata about the distributed database. In particular
the pg_dist_partition
table holds information about the distribution column (formerly called
partition column) for each table. You can use
a convenient utility function to look up the distribution column
name from the low-level details in the metadata. Here is an example
and its output:
-- Create example table
CREATE TABLE products (
store_id bigint,
product_id bigint,
name text,
price money,
CONSTRAINT products_pkey PRIMARY KEY (store_id, product_id)
);
-- Pick store_id as distribution column
SELECT create_distributed_table('products', 'store_id');
-- Get distribution column name for products table
SELECT column_to_column_name(logicalrelid, partkey) AS dist_col_name
FROM pg_dist_partition
WHERE logicalrelid='products'::regclass;
Example output:
┌───────────────┐
│ dist_col_name │
├───────────────┤
│ store_id      │
└───────────────┘
This query will run across all worker nodes and identify locks, how long they have been open, and the offending queries:
SELECT * FROM citus_lock_waits;
For more information, see the Distributed Query Activity section.
This query will provide you with the size of every shard of a given
distributed table, designated here with the placeholder
my_table:
SELECT shardid, table_name, shard_size
FROM citus_shards
WHERE table_name = 'my_table';
Example output:
 shardid | table_name | shard_size
---------+------------+------------
  102170 | my_table   |   90177536
  102171 | my_table   |   90177536
  102172 | my_table   |   91226112
  102173 | my_table   |   90177536
This query uses the citus_shards view.
This query gets a list of the sizes for each distributed table plus the size of their indices.
SELECT table_name, table_size FROM citus_tables;
Example output:
┌───────────────┬────────────┐
│  table_name   │ table_size │
├───────────────┼────────────┤
│ github_users  │ 39 MB      │
│ github_events │ 98 MB      │
└───────────────┴────────────┘
There are other ways to measure distributed table size as well. To learn more, see the Determining Table and Relation Size section.
This query will run across all worker nodes and identify any unused
indexes for a given distributed table, designated here with the
placeholder my_distributed_table:
SELECT *
FROM run_command_on_shards('my_distributed_table', $cmd$
SELECT array_agg(a) as infos
FROM (
SELECT (
schemaname || '.' || relname || '##' || indexrelname || '##'
|| pg_size_pretty(pg_relation_size(i.indexrelid))::text
|| '##' || idx_scan::text
) AS a
FROM pg_stat_user_indexes ui
JOIN pg_index i
ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique
AND idx_scan < 50
AND pg_relation_size(relid) > 5 * 8192
AND (schemaname || '.' || relname)::regclass = '%s'::regclass
ORDER BY
pg_relation_size(i.indexrelid) / NULLIF(idx_scan, 0) DESC nulls first,
pg_relation_size(i.indexrelid) DESC
) sub
$cmd$);
Example output:
┌─────────┬─────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ shardid │ success │ result │
├─────────┼─────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 102008 │ t │ │
│ 102009 │ t │ {"public.my_distributed_table_102009##stupid_index_102009##28 MB##0"} │
│ 102010 │ t │ │
│ 102011 │ t │ │
└─────────┴─────────┴─────────────────────────────────────────────────────────────────────────────────┘
This query will give you the count of connections, grouped by state, that are open on the coordinator:
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
Example output:
┌────────┬───────┐
│ state  │ count │
├────────┼───────┤
│ active │     3 │
│ ∅      │     1 │
└────────┴───────┘
The citus_stat_activity view shows which queries are currently executing. You can filter to find the actively executing ones, along with the process ID of their backend:
SELECT global_pid, query, state FROM citus_stat_activity WHERE state != 'idle';
We can also query to see the most common reasons that non-idle queries are waiting. For an explanation of the reasons, see the Wait Event Types table.
SELECT wait_event || ':' || wait_event_type AS type, count(*) AS number_of_occurences
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY wait_event, wait_event_type
ORDER BY number_of_occurences DESC;
Example output when executing the pg_sleep function in a separate query concurrently:
┌─────────────────┬──────────────────────┐
│      type       │ number_of_occurences │
├─────────────────┼──────────────────────┤
│ ∅               │                    1 │
│ PgSleep:Timeout │                    1 │
└─────────────────┴──────────────────────┘
This query will provide you with your index hit rate across all nodes. Index hit rate is useful in determining how often indices are used when querying:
-- On coordinator
SELECT 100 * (sum(idx_blks_hit) - sum(idx_blks_read)) / sum(idx_blks_hit) AS index_hit_rate
FROM pg_statio_user_indexes;
-- On workers
SELECT nodename, result as index_hit_rate
FROM run_command_on_workers($cmd$
SELECT 100 * (sum(idx_blks_hit) - sum(idx_blks_read)) / sum(idx_blks_hit) AS index_hit_rate
FROM pg_statio_user_indexes;
$cmd$);
Example output:
┌───────────┬────────────────┐
│ nodename  │ index_hit_rate │
├───────────┼────────────────┤
│ 10.0.0.16 │ 96.0           │
│ 10.0.0.20 │ 98.0           │
└───────────┴────────────────┘
Most applications typically access a small fraction of their total data at once. Postgres Pro keeps frequently accessed data in memory to avoid slow reads from disk. You can see statistics about it in the pg_statio_user_tables view.
An important measurement is what percentage of data comes from the memory cache vs the disk in your workload:
-- On coordinator
SELECT
sum(heap_blks_read) AS heap_read,
sum(heap_blks_hit) AS heap_hit,
100 * sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_rate
FROM
pg_statio_user_tables;
-- On workers
SELECT nodename, result as cache_hit_rate
FROM run_command_on_workers($cmd$
SELECT
100 * sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_rate
FROM
pg_statio_user_tables;
$cmd$);
Example output:
┌───────────┬──────────┬─────────────────────┐
│ heap_read │ heap_hit │   cache_hit_rate    │
├───────────┼──────────┼─────────────────────┤
│         1 │      132 │ 99.2481203007518796 │
└───────────┴──────────┴─────────────────────┘
If you find yourself with a ratio significantly lower than 99%, then you likely want to consider increasing the cache available to your database.
could not connect to server: Connection refused
Caused when the coordinator node is unable to connect to a worker.
SELECT 1 FROM companies WHERE id = 2928;
ERROR: connection to the remote node localhost:5432 failed with the following error: could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?
To fix, check that the worker is accepting connections, and that DNS is correctly resolving.
canceling the transaction since it was involved in a distributed deadlock
Deadlocks can happen not only in a single-node database, but also in a distributed database, caused by queries executing across multiple nodes. citus can recognize distributed deadlocks and defuse them by aborting one of the queries involved.
We can see this in action by distributing rows across worker nodes and then running two concurrent transactions with conflicting updates:
CREATE TABLE lockme (id int, x int);
SELECT create_distributed_table('lockme', 'id');
-- id=1 goes to one worker, and id=2 another
INSERT INTO lockme VALUES (1,1), (2,2);
--------------- TX 1 ---------------- --------------- TX 2 ----------------
BEGIN;
BEGIN;
UPDATE lockme SET x = 3 WHERE id = 1;
UPDATE lockme SET x = 4 WHERE id = 2;
UPDATE lockme SET x = 3 WHERE id = 2;
UPDATE lockme SET x = 4 WHERE id = 1;
ERROR: canceling the transaction since it was involved in a distributed deadlock
Detecting deadlocks and stopping them is part of normal distributed transaction handling. It allows an application to retry queries or take another course of action.
could not connect to server: Cannot assign requested address
WARNING: connection error: localhost:9703
DETAIL: could not connect to server: Cannot assign requested address
This occurs when there are no more sockets available by which the coordinator can respond to worker requests.
Configure the operating system to re-use TCP sockets. Execute this on the shell in the coordinator node:
sysctl -w net.ipv4.tcp_tw_reuse=1
This allows reusing sockets in TIME_WAIT state for
new connections when it is safe from a protocol viewpoint. Default
value is 0 (disabled).
SSL error: certificate verify failed
In citus, nodes are required to talk to one another using SSL by default. If SSL is not enabled on a Postgres Pro server when citus is first installed, the installation process will enable it, which includes creating and self-signing an SSL certificate.
However, if a root certificate authority file exists (typically in
~/.postgresql/root.crt), then the certificate will
be checked unsuccessfully against that Certificate Authority at
connection time.
Possible solutions are to sign the certificate, turn off SSL, or remove the root certificate. Also a node may have trouble connecting to itself without the help of the citus.local_hostname configuration parameter.
could not connect to any active placements
When all available worker connection slots are in use, further connections will fail.
WARNING: connection error: hostname:5432
ERROR: could not connect to any active placements
This error happens most often when copying data into
citus in parallel. The COPY
command opens up one connection per shard. If you run M concurrent copies
into a destination with N shards, that will result in M*N connections.
To solve the error, reduce the shard count of target distributed tables,
or run fewer \copy commands in parallel.
remaining connection slots are reserved for non-replication superuser connections #
This occurs when Postgres Pro runs out of available connections to serve concurrent client requests.
The max_connections
configuration parameter adjusts the limit, with a typical default of 100
connections. Note that each connection consumes resources, so adjust
sensibly. When increasing max_connections it is
usually a good idea to increase
memory limits too.
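For example, the limit can be raised with ALTER SYSTEM (the value shown is illustrative; the change takes effect only after a server restart):

```sql
-- illustrative value; requires a restart to take effect
ALTER SYSTEM SET max_connections = 300;
```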
Using pgbouncer can also help by queueing connection requests that exceed the connection limit.
pgbouncer cannot connect to server #
In a self-hosted citus cluster, this error indicates that the coordinator node is not responding to pgbouncer.
Try connecting directly to the server with psql to ensure it is running and accepting connections.
creating unique indexes on non-partition columns is currently unsupported #
As a distributed system, citus can guarantee uniqueness only if a unique index or primary key constraint includes the table's distribution column. That is because the shards are split so that each shard contains non-overlapping partition column values. The index on each worker node can then locally enforce its part of the constraint.
Trying to make a unique index on a non-distribution column will generate an error:
ERROR: creating unique indexes on non-partition columns is currently unsupported
Enforcing uniqueness on a non-distribution column would require
citus to check every shard on every
INSERT to validate, which defeats the goal of
scalability.
There are two ways to enforce uniqueness on a non-distribution column:
Create a composite unique index or primary key that includes the
desired column (C), but also includes the
distribution column (D). This is not quite as
strong a condition as uniqueness on C alone,
but will ensure that the values of C are unique
for each value of D. For instance if
distributing by company_id in a multi-tenant
system, this approach would make C unique
within each company.
Use a reference table rather than a hash-distributed table. This is only suitable for small tables, since the contents of the reference table will be duplicated on all nodes.
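The first approach might look as follows in a hypothetical multi-tenant schema distributed by company_id (table and column names are illustrative):

```sql
-- pages is distributed by company_id (D); url plays the role of C
CREATE TABLE pages (
    company_id bigint NOT NULL,
    page_id bigint NOT NULL,
    url text,
    PRIMARY KEY (company_id, page_id)
);
SELECT create_distributed_table('pages', 'company_id');

-- url values are now unique within each company, not globally
CREATE UNIQUE INDEX pages_company_url_idx ON pages (company_id, url);
```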
function create_distributed_table does not exist #
SELECT create_distributed_table('foo', 'id');
/*
ERROR: function create_distributed_table(unknown, unknown) does not exist
LINE 1: SELECT create_distributed_table('foo', 'id');
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
*/
When basic
utility functions
are not available, check whether the citus
extension is properly installed. Running \dx in
psql will list installed extensions.
One way to end up without extensions is by creating a new database in a Postgres Pro server, which requires extensions to be re-installed. See the Creating a New Database section to learn how to do this correctly.
Each Postgres Pro function has a
volatility classification,
which indicates whether the function can update the database and whether
the function's return value can vary over time given the same inputs. A
STABLE function is guaranteed to return the same
results given the same arguments for all rows within a single statement,
while an IMMUTABLE function is guaranteed to return
the same results given the same arguments forever.
Non-immutable functions can be inconvenient in distributed systems because they can introduce subtle changes when run at slightly different times across shards. Differences in database configuration across nodes can also interact harmfully with non-immutable functions.
One of the most common ways this can happen is using the
timestamp type in Postgres Pro,
which unlike timestamptz does not keep a record of time
zone. Interpreting a timestamp column depends on the database
timezone setting, which can be changed between queries; hence functions
operating on timestamps are not immutable.
citus forbids running distributed queries that filter results using stable functions on columns. For instance:
-- foo_timestamp is timestamp, not timestamptz
UPDATE foo SET ... WHERE foo_timestamp < now();
ERROR: STABLE functions used in UPDATE queries cannot be called with column references
In this case the comparison operator < between
timestamp and timestamptz is not immutable.
Avoid stable functions on columns in a distributed UPDATE
statement. In particular, whenever working with times use
timestamptz rather than timestamp. Having a
time zone in timestamptz makes calculations immutable.
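For instance, converting the hypothetical foo table from the example above to timestamptz makes the filter distributable (the processed column is invented for illustration):

```sql
-- store an absolute point in time instead of a zone-less timestamp
ALTER TABLE foo ALTER COLUMN foo_timestamp TYPE timestamptz;

-- the comparison with now() is now immutable, so the UPDATE can be distributed
UPDATE foo SET processed = true WHERE foo_timestamp < now();
```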
Currently citus enforces a primary key constraint only if the distribution column is part of the primary key. This ensures that the constraint needs to be checked on only one shard to guarantee uniqueness.
With citus, you can add nodes manually by calling the citus_add_node function with the hostname (or IP address) and port number of the new node.
After adding a node to an existing cluster, the new node will not contain any data (shards). citus will start assigning any newly created shards to this node. To rebalance existing shards from the older nodes to the new node, citus provides an open source shard rebalancer utility. You can find more information in the Rebalancing Shards Without Downtime section.
citus uses Postgres Pro streaming replication to replicate the entire worker-node as-is. It replicates worker nodes by continuously streaming their WAL records to a standby. You can configure streaming replication on-premise yourself by consulting the Streaming Replication section.
As the citus coordinator node is similar to a standard Postgres Pro server, regular Postgres Pro synchronous replication and failover can be used to provide higher availability of the coordinator node. To learn more about handling coordinator node failures, see the Coordinator Node Failures section.
Since citus provides distributed functionality by extending Postgres Pro, it uses the standard Postgres Pro SQL constructs. The vast majority of queries are supported, even when they combine data across the network from multiple database nodes. This includes transactional semantics across nodes. For an up-to-date list of SQL coverage, see the Limitations section.
What's more, citus has 100% SQL support for queries that access a single node in the database cluster. These queries are common, for instance, in multi-tenant applications where different nodes store different tenants. To learn more, see the When to Use citus section.
Remember that even with this extensive SQL coverage data modeling can have a significant impact on query performance. See the Query Processing section for details on how citus executes queries.
One of the choices when first distributing a table is its shard count. This setting can be set differently for each co-location group, and the optimal value depends on the use case. It is possible, but difficult, to change the count after cluster creation, so use these guidelines to choose the right size.
In the multi-tenant SaaS database use case we recommend choosing between 32 and 128 shards. For smaller workloads, say <100GB, you could start with 32 shards and for larger workloads you could choose 64 or 128. This means that you have the leeway to scale from 32 to 128 worker machines.
In the real-time analytics use case, shard count should be related to the total number of cores on the workers. To ensure maximum parallelism, you should create enough shards on each node such that there is at least one shard per CPU core. We typically recommend creating a high number of initial shards, e.g. 2x or 4x the number of current CPU cores. This allows for future scaling if you add more workers and CPU cores.
To choose a shard count for a table you wish to distribute, update the citus.shard_count configuration parameter. This affects subsequent calls to the create_distributed_table function. For example:
SET citus.shard_count = 64;
-- any tables distributed at this point will have
-- sixty-four shards
For more guidance on this topic, see the Choosing Cluster Size section.
citus has a function called alter_distributed_table that can change the shard count of a distributed table.
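For instance, assuming an existing distributed table named foo:

```sql
-- rewrites the table's data into the new number of shards
SELECT alter_distributed_table('foo', shard_count := 64);
```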
How does citus handle count(distinct) queries? #
citus can evaluate
count(distinct) aggregates both in and across worker
nodes. When aggregating on a table's distribution column,
citus can push the counting down inside worker
nodes and total the results. Otherwise it can pull distinct rows to the
coordinator and calculate there. If transferring data to the coordinator
is too expensive, fast approximate counts are also available. More details
in
The count(distinct) Aggregates
section.
citus is able to enforce a primary key or uniqueness constraint only when the constrained columns contain the distribution column. In particular this means that if a single column constitutes the primary key then it has to be the distribution column as well.
This restriction allows citus to localize a uniqueness check to a single shard and let Postgres Pro on the worker node do the check efficiently.
Certain commands, when run on the coordinator node, do not get propagated to the workers:
CREATE ROLE/USER
CREATE DATABASE
ALTER … SET SCHEMA
ALTER TABLE ALL IN TABLESPACE
CREATE TABLE (see the
Table Types section)
For the other types of objects above, create them explicitly on all nodes. citus provides a function to execute queries across all workers:
SELECT run_command_on_workers($cmd$
  /* the command to run */
  CREATE ROLE ...
$cmd$);
Learn more in the
Manual Query Propagation
section. Also note that even after manually propagating
CREATE DATABASE, citus must
still be installed there. See the
Creating a New Database
section.
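A concrete sketch with an illustrative role name, run on the coordinator:

```sql
-- create the role locally, then on every worker
CREATE ROLE app_reader LOGIN;
SELECT run_command_on_workers($cmd$ CREATE ROLE app_reader LOGIN $cmd$);
```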
In the future citus will automatically propagate more kinds of objects. The advantage of automatic propagation is that citus will automatically create a copy on any newly added worker nodes (see the citus.pg_dist_object table to learn more).
If the hostname or IP address of a worker changes, you need to let the coordinator know using the citus_update_node function:
-- Update worker node metadata on the coordinator
-- (remember to replace 'old-address' and 'new-address'
--  with the actual values for your situation)
SELECT citus_update_node(nodeid, 'new-address', nodeport)
FROM pg_dist_node
WHERE nodename = 'old-address';
Until you execute this update, the coordinator will not be able to communicate with that worker for queries.
citus provides utility functions and metadata tables to determine the mapping of a distribution column value to a particular shard, and the shard placement on a worker node. See the Finding Which Shard Contains Data For a Specific Tenant section for more details.
The citus coordinator node metadata tables contain this information. See the Finding the Distribution Column For a Table section.
No, you must choose a single column per table as the distribution column. A common scenario where people want to distribute by two columns is for timeseries data. However, for this case we recommend using a hash distribution on a non-time column, and combining this with Postgres Pro partitioning on the time column, as described in the Timeseries Data section.
Why does pg_relation_size report zero bytes for a distributed table? #
The data in distributed tables lives on the worker nodes (in shards), not on the coordinator. A true measure of distributed table size is obtained as the sum of shard sizes. citus provides helper functions to query this information. See the Determining Table and Relation Size section to learn more.
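For example, the helper functions below sum shard sizes across the workers (the table name foo is illustrative):

```sql
-- analogous to pg_table_size, summed over all shards
SELECT citus_table_size('foo');
-- analogous to pg_total_relation_size (includes indexes)
SELECT citus_total_relation_size('foo');
```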
What is citus.max_intermediate_result_size? #
citus has to use more than one step to run some queries that have subqueries or CTEs. Using subquery/CTE push-pull execution, it pushes subquery results to all worker nodes for use by the main query. If these results are too large, this might cause unacceptable network overhead, or even insufficient storage space on the coordinator node, which accumulates and distributes the results.
citus has a configurable setting citus.max_intermediate_result_size to specify a subquery result size threshold at which the query will be canceled. If you run into the error, it looks like:
ERROR: the intermediate result size exceeds citus.max_intermediate_result_size (currently 1 GB)
DETAIL: Citus restricts the size of intermediate results of complex subqueries and CTEs to avoid accidentally pulling large result sets into once place.
HINT: To run the current query, set citus.max_intermediate_result_size to a higher value or -1 to disable.
As the error message suggests, you can (cautiously) increase this limit by altering the variable:
SET citus.max_intermediate_result_size = '3GB';
Yes, schema-based sharding is available.
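A brief sketch, assuming citus 12.0 or later (the schema names are illustrative):

```sql
-- newly created schemas become distributed schemas
SET citus.enable_schema_based_sharding TO on;
CREATE SCHEMA tenant_one;

-- an existing schema can also be converted explicitly
SELECT citus_schema_distribute('tenant_two');
```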
The cstore_fdw extension is no longer needed on Postgres Pro 12 and above, because columnar storage is now implemented directly in citus. Unlike cstore_fdw, citus columnar tables support transactional semantics, replication, and pg_upgrade. The query parallelization, seamless sharding, and high-availability benefits of citus combine powerfully with the superior compression and I/O utilization of columnar storage for large dataset archival and reporting.
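For example, a columnar table is created with the USING clause (the table definition is illustrative):

```sql
-- append-mostly archival table stored in columnar format
CREATE TABLE archived_events (
    id bigint,
    payload jsonb,
    created_at timestamptz
) USING columnar;
```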