Data loading is an important part of migrating to Shardman.
You can use external utilities to load data, but we recommend the shardmanctl load
command, which speeds up data loading by running in multiple threads and by knowing where and how the data is stored.
With it, there are three ways to load data:
loading data from a file into a table;
loading data from another database table (or a view or function call result) into a Shardman table;
migrating a database schema along with its data to Shardman.
Let's look at each of these in turn.
To load data from a file into a Shardman cluster, run the following command:
$shardmanctl --store-endpoints http://etcdserver:2379 load --file=/var/load/data.tsv --table=mytable --source file --format text -j 8
In this example, data is loaded from the /var/load/data.tsv file
(tab-delimited) into the table mytable in 8
parallel threads. You can use schema.table as the table name.
Data can also be provided via stdin (using the --file=- option).
The shardmanctl load command supports the text and csv data formats.
gzip archives are detected automatically and unpacked before the data is loaded, so you can use files like somefile.tsv.gz as source data.
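As an illustration of why compressed sources work, the sketch below (a generic Python example, not part of shardmanctl) shows that a .tsv.gz file decompresses to exactly the same rows as the plain file:

```python
import gzip

# Sample tab-separated content, mirroring the text format expected by the loader.
rows = "1\tMike\n2\tAlexander\n"

# Plain source file.
with open("data.tsv", "w") as f:
    f.write(rows)

# gzip-compressed copy of the same data.
with gzip.open("data.tsv.gz", "wt") as f:
    f.write(rows)

# Decompressing yields exactly the same content as the plain file,
# which is why a .tsv.gz source can be used interchangeably.
with gzip.open("data.tsv.gz", "rt") as f:
    assert f.read() == rows
```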
To optimize loading from a file or table, use the --distributed-keys
command-line option; its value is used to calculate the destination node for each row.
The option value is a comma-separated list of pairs. Each pair consists of a field number
(starting from zero) and a type, separated by a colon.
The following types are supported: bool, char,
float4, float8, int2, int4, int8,
name, text, varchar, and uuid (for example, --distributed-keys=0:int8,1:varchar).
Destination table:
CREATE TABLE users (
    id uuid NOT NULL PRIMARY KEY,
    name VARCHAR(255)
) WITH (distributed_by='id', num_parts=16);
Source file:
f7007387-4a83-416f-a07f-4564736c53e3 Mike
32bac248-3376-4011-8bd1-1528cd67b9c3 Alexander
28601c80-4a9a-496c-8f4e-cde107aad55b Dmitry
928c4f9b-52d0-4bb4-bcc1-841ebcc4a59f Nikita
a9a09b3c-27c2-4ce7-951b-eefb55ff3e2b Pavel
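Conceptually, the distributed key tells the loader which partition (and therefore which node) each row belongs to, so rows can be sent directly to the right place. The sketch below is a simplified Python illustration of hash-based routing; Shardman's actual hash functions and placement logic are internal and will differ:

```python
import hashlib

NUM_PARTS = 16  # matches num_parts in the CREATE TABLE example above

def route(key: str, num_parts: int = NUM_PARTS) -> int:
    """Map a distributed-key value to a partition number (simplified model,
    not Shardman's real hashing)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_parts

# Each row from the source file is routed by its first field (the uuid key).
rows = [
    ("f7007387-4a83-416f-a07f-4564736c53e3", "Mike"),
    ("32bac248-3376-4011-8bd1-1528cd67b9c3", "Alexander"),
]

for key, name in rows:
    part = route(key)
    assert 0 <= part < NUM_PARTS   # always a valid partition
    assert part == route(key)      # the same key always routes the same way
```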
To load data from table to table, the source and destination tables must have an equal number of columns with compatible types. Attribute types are compatible if the strings generated by the type output function on the source are acceptable to the type input function on the destination.
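The compatibility rule can be pictured with a toy Python analogy (not Shardman code): the source type renders a value as text, and the destination type must be able to parse that text back.

```python
# Toy model of "output function on the source, input function on the
# destination": an int2 value rendered as text is acceptable input
# for an int4 column, so the types are compatible in this sense.
source_value = 42            # e.g. int2 on the source side
text = str(source_value)     # the type output function produces text
dest_value = int(text)       # the destination input function parses it
assert dest_value == source_value
```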
To load data into a Shardman cluster from a PostgreSQL table, run the following command:
$shardmanctl --store-endpoints http://etcdserver:2379 load -t desttable --source postgres --source-connstr "dbname=db host=srchost port=srcport user=login password=passwd" --source-table sourcetable -j 8
In this example, data is loaded from the table sourcetable (which can also be a view or a function call result)
into the desttable table in 8 parallel threads. You can use schema.table
as the table names.
The loader can also create the destination database schema and then load the selected data. In this case, you need to describe the structure of the migrated tables in a specification file (load_schema.yaml). This file has the following format:
version: "1.0"
migrate:
  connstr: "dbname=workdb host=workhost port=workport user=workuser password=workpassword"
  jobs: 8
  batch: 1000
  options:
    - create_schema
    - create_table
    - create_index
    - create_sequence
    - truncate_table
    - skip_no_pkey_tables
    - skip_create_index_error
    - skip_create_extension_error
    - skip_load_errors
  schemas:
    - name: public
      all: false
      tables:
        - name: table1
          type: sharded
          partitions: 6
          distributedby: id
        - name: table2
          type: global
        - name: table3
          type: sharded
          partitions: 6
          distributedby: field_id
          colocatewith: table1
The migrate.jobs parameter defines the number of parallel data
loader processes.
The migrate.batch parameter defines the number of rows in one batch
(the recommended value is 1000).
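The effect of migrate.batch can be pictured as grouping the source rows into fixed-size chunks before they are sent. The following is a hypothetical Python sketch of such chunking, not the loader's actual implementation:

```python
from typing import Iterable, Iterator, List

def batches(rows: Iterable, batch_size: int = 1000) -> Iterator[List]:
    """Group rows into batches of at most batch_size (a simplified model
    of the migrate.batch setting; the real loader batches internally)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch

# 2500 source rows with batch=1000 produce batches of 1000, 1000, and 500 rows.
sizes = [len(b) for b in batches(range(2500), 1000)]
assert sizes == [1000, 1000, 500]
```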
The migrate.schemas section defines an array of source database
schemas to process. All other schemas are skipped.
If the all parameter is set to true, all
tables from the given schema are migrated (with the global type
by default). If a table is listed in the migrate.schemas.tables
array, the target table type must be specified explicitly. Two types
of tables are currently supported: global and sharded.
For sharded tables, the following attributes must be set:
distributedby (the name of the column to use as the table
partitioning key) and partitions (the number of partitions that will be
created for this table). Optionally, a sharded table can also set the colocatewith attribute
(the name of a table to colocate with). Shardman will then try to place partitions of the created
table that have the same partition key on the same nodes as the partitions of the named table.
The migrate.options section supports the following options:
create_schema — create database schemas if they do not exist;
create_table — create tables if they do not exist;
create_index — create indexes after creating tables;
create_sequence — create sequences if they do not exist;
truncate_table — truncate tables before data load;
skip_no_pkey_tables — skip tables without primary keys;
skip_create_index_error — skip index creation errors;
skip_create_extension_error — skip extension creation errors;
skip_load_errors — continue loading if errors occur.
Finally, to load both the schema and the data into a Shardman cluster from PostgreSQL, run the following command:
$shardmanctl --store-endpoints http://etcdserver:2379 load --schema load_schema.yaml
During the loading process, the nodes on which the data should reside are determined automatically, and the data is loaded directly to them.