6.4. Data loading

6.4.1. Loading data from file
6.4.2. Loading data from PostgreSQL table, view or function
6.4.3. Migrating database schema

Data loading is an important part of migrating to Shardman. You can use external utilities to load data, but we recommend the shardmanctl load command, which speeds up loading by running in multiple threads and by knowing where and how the data is stored. It offers three ways to load data:

  • from a file;

  • from a PostgreSQL table, view or function;

  • by migrating the database schema together with selected data.

Let's look at each of these ways.

6.4.1. Loading data from file

To load data into a Shardman cluster, run the following command:

$ shardmanctl --store-endpoints http://etcdserver:2379 load --file=/var/load/data.tsv --table=mytable --source file --format text -j 8

In this example, data is loaded from the /var/load/data.tsv file (tab-delimited) into the mytable table in 8 parallel threads. You can use schema.table as the table name. Data can also be provided via stdin (using the --file=- option).

The shardmanctl load command supports text and csv data formats.

Gzip archives are detected automatically and unpacked before loading, so you can use files like somefile.tsv.gz as source data.
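Gzip detection of this kind typically works by checking the two-byte magic number at the start of the stream. The following Python sketch illustrates the idea; it is not Shardman's actual code, and the helper name open_maybe_gzip is hypothetical:

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_maybe_gzip(raw: bytes) -> io.TextIOBase:
    """Return a text stream, transparently unpacking gzip input."""
    if raw[:2] == GZIP_MAGIC:
        return io.TextIOWrapper(gzip.GzipFile(fileobj=io.BytesIO(raw)))
    return io.StringIO(raw.decode())

# A tab-separated row, readable both plain and gzip-compressed:
row = "f7007387-4a83-416f-a07f-4564736c53e3\tMike\n"
assert open_maybe_gzip(row.encode()).read() == row
assert open_maybe_gzip(gzip.compress(row.encode())).read() == row
```

Either way the caller receives a plain text stream, which is why the .gz suffix itself does not matter to the loader.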

To optimize loading from a file or table, use the --distributed-keys command-line option; its value is used to calculate the destination node for each row. The option value is a comma-separated list of pairs, each consisting of a field number (starting with zero) and a type, separated by a colon. The following types are supported: bool, char, float4, float8, int2, int4, int8, name, text, varchar and uuid (for example, --distributed-keys=0:int8,1:varchar).
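Conceptually, the loader hashes the declared key fields of each input row to pick a destination partition. The sketch below parses a --distributed-keys value and routes a row with a toy hash; Shardman itself uses PostgreSQL's hash functions and cluster metadata, so this is illustrative only:

```python
from typing import List, Tuple

def parse_distributed_keys(spec: str) -> List[Tuple[int, str]]:
    """Parse a --distributed-keys value like '0:int8,1:varchar'."""
    pairs = []
    for item in spec.split(","):
        field, typ = item.split(":")
        pairs.append((int(field), typ))
    return pairs

def route(row: List[str], keys: List[Tuple[int, str]], num_parts: int) -> int:
    """Pick a partition from the key fields (toy hash, for illustration)."""
    key = "|".join(row[field] for field, _ in keys)
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % 2**32
    return h % num_parts

keys = parse_distributed_keys("0:int8,1:varchar")
assert keys == [(0, "int8"), (1, "varchar")]
assert 0 <= route(["42", "Mike"], keys, 16) < 16
```

Because routing depends only on the key fields, rows with the same key always land on the same partition, which is what lets the loader send each row straight to the right node.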

6.4.1.1. Input data example

Destination table:

CREATE TABLE users (
    id uuid NOT NULL PRIMARY KEY,
    name VARCHAR(255)
) WITH (distributed_by='id', num_parts=16);

Source file:

f7007387-4a83-416f-a07f-4564736c53e3	Mike
32bac248-3376-4011-8bd1-1528cd67b9c3	Alexander
28601c80-4a9a-496c-8f4e-cde107aad55b	Dmitry
928c4f9b-52d0-4bb4-bcc1-841ebcc4a59f	Nikita
a9a09b3c-27c2-4ce7-951b-eefb55ff3e2b	Pavel

6.4.2. Loading data from PostgreSQL table, view or function

To load data from one table to another, the source and destination tables must have an equal number of columns with compatible types. By compatible column types we mean that strings generated by the type output function on the source are acceptable to the type input function on the destination.
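This text-level notion of compatibility can be illustrated in Python, with int and float standing in for PostgreSQL type input functions (a conceptual sketch, not how Shardman checks types):

```python
from typing import Callable, Iterable

def text_compatible(out_values: Iterable[str], parse_dest: Callable) -> bool:
    """True if every textual output value is accepted by the destination parser."""
    try:
        for v in out_values:
            parse_dest(v)
        return True
    except ValueError:
        return False

# int4 -> float8 works: every integer literal is a valid float literal
assert text_compatible(["42", "-7"], float)
# float8 -> int4 fails: '3.14' is not a valid integer literal
assert not text_compatible(["3.14"], int)
```

Note that compatibility in this sense is one-directional: it depends on which side is the source and which is the destination.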

To load data into a Shardman cluster from a PostgreSQL table, run the following command:

$ shardmanctl --store-endpoints http://etcdserver:2379 load -t desttable --source postgres --source-connstr "dbname=db host=srchost port=srcport user=login password=passwd" --source-table sourcetable -j 8

In this example, data is loaded from the sourcetable table (which can also be a view or a function call) into the desttable table in 8 parallel threads. You can use schema.table as table names.

6.4.3. Migrating database schema

The loader can also create the destination database schema and then load the selected data. In this case, you need to describe the structure of the migrated tables in a specification file (load_schema.yaml). This file has the following format:

    version: "1.0"
    migrate:
      connstr: "dbname=workdb host=workhost port=workport user=workuser password=workpassword"
      jobs: 8
      batch: 1000
      options:
        - create_schema
        - create_table
        - create_index
        - create_sequence
        - truncate_table
        - skip_no_pkey_tables
        - skip_create_index_error
        - skip_create_extension_error
        - skip_load_errors
      schemas:
        - name: public
          all: false
          tables:
            - name: table1
              type: sharded
              partitions: 6
              distributedby: id
            - name: table2
              type: global
            - name: table3
              type: sharded
              partitions: 6
              distributedby: field_id
              colocatewith: table1
                

The migrate.jobs parameter defines the number of parallel data loader processes.

The migrate.batch parameter is the number of rows in one batch (the recommended value is 1000).
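Batching simply groups input rows so that each insert round-trip carries up to migrate.batch rows instead of one. A minimal sketch of the grouping (illustrative, not the loader's actual code):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(rows: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# 2500 rows with batch=1000 become three round-trips: 1000 + 1000 + 500
rows = [f"row{i}" for i in range(2500)]
assert [len(b) for b in batched(rows, 1000)] == [1000, 1000, 500]
```

Larger batches mean fewer round-trips but more memory per batch, which is why a moderate value such as 1000 is recommended.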

The migrate.schemas section defines an array of source database schemas to work with. All other schemas are skipped.

If the all parameter is set to true, all tables from the given schema are migrated (with the global type by default). For each table listed in the migrate.schemas.tables array, the target table type must be specified explicitly. Two table types are currently supported: global and sharded.

For sharded tables, the following attributes must be set: distributedby (the name of the column used as the partitioning key) and partitions (the number of partitions to create for the table).

In addition to these two attributes, you can optionally set colocatewith (the name of the table to colocate with). Shardman will try to place partitions of the created table that have the same partition key on the same nodes as the partitions of the colocatewith table.

The schema supports the following options:

  • create_schema — create database schemas if they do not exist;

  • create_table — create tables if they do not exist;

  • create_index — create indexes after creating tables;

  • create_sequence — create sequences if they do not exist;

  • truncate_table — truncate tables before data load;

  • skip_no_pkey_tables — skip tables without primary keys;

  • skip_create_index_error — skip index creation errors;

  • skip_create_extension_error — skip extension creation errors;

  • skip_load_errors — continue loading if errors occur.

Now, to load the schema and data into a Shardman cluster from PostgreSQL, run the following command:

$ shardmanctl --store-endpoints http://etcdserver:2379 load --schema load_schema.yaml

During the loading process, the nodes where the data should reside are determined automatically, and the data is loaded directly to them.