6.4. Data loading

6.4.1. Loading data from file
6.4.2. Loading data from a PostgreSQL table, view or function
6.4.3. Migrating the database schema

The data loading process is an important part of migrating to Shardman. You can use external utilities to load data, but we recommend our shardmanctl load command, which speeds up data loading by running in multiple threads and by knowing where and how the data is stored. With it, there are three ways to load data: from a file; from a PostgreSQL table, view or function; or by migrating the database schema together with the data.

Now let's look at these three ways.

6.4.1. Loading data from file

To load data into a Shardman cluster from a file, run the following command:

$ shardmanctl --store-endpoints http://etcdserver:2379 load --file=/var/load/data.tsv --table=mytable --source file --format text -j 8

In this example, data is loaded from the tab-delimited /var/load/data.tsv file into the table mytable in 8 parallel threads. You can use schema.table as the table name. Data can also be provided via stdin (using the --file=- option).

The shardmanctl load command supports the text and csv data formats.

The system automatically detects gzip archives and unpacks them before loading the data, so you can use files such as somefile.tsv.gz as source data.

To optimize loading from a file or table, use the --distributed-keys command-line option; its value is used to calculate the destination node for each row. The option value is a comma-separated list of pairs, each consisting of a field number (starting with zero) and a type, separated by a colon. The following types are supported: bool, char, float4, float8, int2, int4, int8, name, text, varchar and uuid (for example, --distributed-keys=0:int8,1:varchar).

6.4.1.1. Input data example

Destination table:

CREATE TABLE users (
    id uuid NOT NULL PRIMARY KEY,
    name VARCHAR(255)
) WITH (distributed_by='id', num_parts=16);

Source file:

f7007387-4a83-416f-a07f-4564736c53e3	Mike
32bac248-3376-4011-8bd1-1528cd67b9c3	Alexander
28601c80-4a9a-496c-8f4e-cde107aad55b	Dmitry
928c4f9b-52d0-4bb4-bcc1-841ebcc4a59f	Nikita
a9a09b3c-27c2-4ce7-951b-eefb55ff3e2b	Pavel                    
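
Assuming this file is saved as /var/load/users.tsv (a hypothetical path, with the same placeholder etcd endpoint as above), it could be loaded with the uuid column as the distribution key:

```shell
# Field 0 (the uuid primary key) is declared as the distribution key,
# so the loader can route each row directly to its destination node.
$ shardmanctl --store-endpoints http://etcdserver:2379 load --file=/var/load/users.tsv --table=users --source file --format text --distributed-keys=0:uuid -j 8
```

Here the distribution key type (uuid) matches the type of the table's distributed_by column.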
                    

6.4.2. Loading data from a PostgreSQL table, view or function

To load data from table to table, the source and destination tables must have an equal number of columns with compatible types. By compatible types we mean that the strings generated by the type output function on the source are acceptable to the type input function on the destination.

To load data into a Shardman cluster from a PostgreSQL table, run the following command:

$ shardmanctl --store-endpoints http://etcdserver:2379 load -t desttable --source postgres --source-connstr "dbname=db host=srchost port=srcport user=login password=passwd" --source-table sourcetable -j 8

In this example, data is loaded from the table sourcetable (which can also be a view or a function call) into the desttable table in 8 parallel threads. You can use schema.table as table names.
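
For example, to load from a view rather than a base table, point --source-table at the view. The connection parameters and names below are placeholders:

```shell
# Load rows produced by a view on the source database into desttable.
# "myschema.my_view" is a hypothetical view name.
$ shardmanctl --store-endpoints http://etcdserver:2379 load -t desttable \
    --source postgres \
    --source-connstr "dbname=db host=srchost port=srcport user=login password=passwd" \
    --source-table "myschema.my_view" -j 8
```

The same form works for a function call as the source, e.g. --source-table "myschema.my_func(arg)".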

6.4.3. Migrating the database schema

The loader can create the destination database schema and load the selected data. In this case, you need to describe the structure of the migrated tables in a specification file (load_schema.yaml). This file has the following format:

    version: "1.0"
    migrate:
      connstr: "dbname=workdb host=workhost port=workport user=workuser password=workpassword"
      jobs: 8
      batch: 1000
      options:
        - create_schema
        - create_table
        - create_index
        - create_sequence
        - create_foreign_key
        - truncate_table
        - skip_no_pkey_tables
        - skip_create_index_error
        - skip_create_extension_error
        - skip_load_errors
        - skip_create_foreign_key_error
      schemas:
        - name: public
          all: false
          tables:
            - name: table1
              type: sharded
              partitions: 6
              distributedby: id
            - name: table2
              type: global
            - name: table3
              type: sharded
              partitions: 6
              distributedby: field_id
              colocatewith: table1
            - name: table4
              type: global
              source: schema.view
              source_pk: field_id
            - name: table5
              type: global
              source: schema.func(arg)
              source_pk: field_id
        - name: schema2
          all: false
          default_type: sharded
          default_partitions: 6
          tables:
            - name: table1
              distributedby: field_id
            - name: table2
              type: global
            - name: table3
              source: schema.view
              distributedby: field_id
            - name: table4
              distributedby: field_id
              source: schema.func(arg)
            - name: table5
              source: schema."complex.""table.name"
              distributedby: field_id
        - name: schema3
          all: true
          skip_tables: [table1, table2, table3]
                

The migrate.jobs parameter defines the number of parallel data loader processes.

The migrate.batch parameter is the number of rows in one batch (recommended value is 1000).

The migrate.schemas section defines an array of source database schemas to process. All other schemas will be skipped.

If the all parameter is set to true, all tables from the current schema will be migrated (with the global type by default). If a table is listed in the migrate.schemas.tables array, the target table type must be explicitly specified for it. Two table types are currently supported: global and sharded.

The migrate.schemas.skip_tables section defines an array of table names that will be skipped when the schema is loaded, even if the all parameter is set to true.

For sharded tables, the following attributes must be set: distributedby (the name of the column to use for table partitioning) and partitions (the number of partitions that will be created for this table). Optionally, a sharded table can also set the colocatewith attribute (the name of the table to colocate with). Shardman will try to place partitions of the created table that have the same partition key on the same nodes as the partitions of the colocatewith table.

You can specify the default_type option for a schema: global or sharded (default: global). For the sharded type, you can also specify the default_partitions option (default: 20). If you set default_type to sharded, you must specify the distributedby option for each table.

The source option for a table should include the schema and the source name: schema.source. The source can be a table, a view or a function, for example: public.table, public.view, public.func(arg). If you set a view or a function as the source of a global table, you must specify source_pk, which defines the primary key for this table. If source is not specified, or names a table, you can still specify source_pk to create a primary key or override an existing one.

The migrate.options section supports the following options:

  • create_schema — create database schemas if they do not exist;

  • create_table — create tables if they do not exist;

  • create_index — create indexes after creating tables;

  • create_sequence — create sequences if they do not exist;

  • create_foreign_key — create foreign keys after creating tables;

  • truncate_table — truncate tables before data load;

  • skip_no_pkey_tables — skip tables without primary keys;

  • skip_create_index_error — skip index creation errors;

  • skip_create_extension_error — skip extension creation errors;

  • skip_load_errors — continue loading if errors occur;

  • skip_create_foreign_key_error — skip foreign key creation errors.

Now, to load the schema and data into a Shardman cluster from PostgreSQL, run the following command:

$ shardmanctl --store-endpoints http://etcdserver:2379 load --schema load_schema.yaml

During the loading process, the nodes on which the data should be located are determined automatically, and the data is loaded directly to them.