procopy — a utility for fast data loading from a source database to a destination database
procopy config generate [option...]
procopy load [option...]
procopy is a utility for fast loading of data from a source database, such as Oracle, PostgreSQL, MySQL, etc., to a destination database, such as PostgreSQL, Postgres Pro Enterprise, etc. For the sources and destinations that are currently supported, see Table 4.1.
procopy is not responsible for:
Database schema transfer.
Replaying the changes that occurred during data loading.
Retaining data consistency. Therefore, it is recommended to remove existing constraints where possible and bring them back when procopy finishes.
Data verification after loading.
procopy is developed in Go for efficient resource utilization when migrating databases of 10 TB or larger. At a high level, procopy consists of a pool of Readers, a pool of Loaders, and a built-in bus used to exchange data between Readers and Loaders:

Because of the binary load format for COPY, addition
of new types requires the support of the corresponding encoders.
Loading of BLOB and BFILE objects
is slower than loading of other objects because at least two SQL commands
are needed for each row, which is much slower than using COPY alone.
procopy does not support views and materialized views as sources.
During procopy operation, tasks of different types
are performed. A task can be of the
Query, Schema, or Table
type. The tasks section in the
configuration file enables you to configure procopy
tasks.
Table.
Tasks of this type are used to load a specific table, with the ability to filter and map columns.
Schema.
Tasks of this type are used to load whole schemas, with the ability to filter and map tables.
Query.
In tasks of this type, you can specify a statement that defines the selection from the source database.
In procopy, tasks of the Query
type have some limitations. They cannot be continued after a restart.
To generate a configuration file, execute the config generate command. You can update the configuration values there later. The configuration file is described in Section 5.1 and Section 5.2.
To launch the data load from the source to the destination DB, execute the load command with the configuration file specified.
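A typical session combines the two commands above. This is a minimal sketch; the file name procopy.yaml is an arbitrary example, and the flags are described in the command reference below:

```shell
# Generate a configuration file skeleton in YAML format
procopy config generate --format yaml --output procopy.yaml

# Edit procopy.yaml (connections, tasks, procopy_options), then start the load
procopy load --file procopy.yaml
```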
On launching the data load, the configuration file is validated as follows:
In case of incorrect names of configuration parameters, procopy completes with an error.
Values of configuration parameters are checked (for example, correctness of IP addresses and host names).
If not all the required configuration parameters are specified, procopy completes with an error.
If incompatible configuration parameters are specified, procopy completes with an error.
procopy terminates automatically when all the tasks are completed. A task is considered complete when procopy reaches the end of the table.
To stop procopy in operation, press Ctrl+C.
Pressing Ctrl+C once launches the graceful shutdown. It
takes several minutes for the application to shut down correctly and save
all the information required for a restart. If Ctrl+C
is pressed twice, the application terminates quickly. In this case, multiple
“unique constraint violation” errors are likely to occur
after a restart. Despite the errors, procopy
continues its operation as usual, but this can take a long time.
If for some reason procopy cannot apply a row,
it saves the key and the error in a CSV file
<task_id>_index.csv in the
problem_rows.local.save_dir/<timestamp> directory. If the problem_rows.local.save_dir
configuration parameter is not specified, the
/tmp/<timestamp>/*.csv directory is used.
The default behavior is to skip problematic rows, but in practice, this is
not suitable in most cases. In such cases, it is possible to set up
procopy_options
to specify where to save batches of problematic rows using the following construct:
...
procopy_options:
  problem_rows:
    local:
      save_dir: path_to_save_dir
...
If an error occurs that causes procopy to crash,
the application saves the diagnostic information to the
Procopy_panic_data_<current_time>.tar.gz file.
If issues arose during a previous load and you managed to resolve them, you can try to apply these rows without full data reload. To do this, execute the command:
procopy load --file filename --reload-from TIMESTAMP
Where TIMESTAMP is the name of the directory that
stores indexes with problematic rows.
Rows from *_index.csv files will be reloaded. New rows
that occurred in tables will not be applied.
If issues occur again, new *_index.csv files will appear
under the new TIMESTAMP.
procopy can resume data loading from the point
where it stopped last time for tasks of the Schema type
and of the Table type for tables with keys.
If a task of the Query type was not completed, it cannot
be resumed. Perform truncate before reloading.
If a task of the Table type was not completed for a table
without keys (a heap), perform truncate before reloading,
otherwise, rows will be duplicated.
If procopy_options.truncate
is true or truncate
is true for a task, the table will be cleared and data
loading will start from scratch.
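For instance, truncate can be left off globally so that tasks resume where possible, while a keyless table is always reloaded from scratch. The task ID and table names below are illustrative:

```yaml
procopy_options:
  truncate: false         # global default: resume completed portions where possible
tasks:
  - id: heap_task
    table:
      source_table: heap_table
      destination_table: heap_table
      truncate: true      # this keyless table is cleared and reloaded from scratch
```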
Parallelism of the load process is achieved by:
Setting up the read and write processes
Configuring splitting tasks into subtasks
Processing partitioned tables
Set global limits on the number of threads for reading data from
the source and loading it to the destination using the procopy_options.readers
and procopy_options.loaders
configuration parameters, respectively. By default, both equal
NumCPU/2.
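For example, to pin both pools explicitly instead of relying on the NumCPU/2 default (the values are illustrative):

```yaml
procopy_options:
  readers: 8    # threads reading data from the source
  loaders: 8    # threads loading data to the destination
```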
You can split large tables into parts by the ranges of unique keys, so each
part will be processed in an individual task. To do this, set the
global procopy_options.sub_task_rows
parameter or the task.sub_task_rows parameter
at the task level.
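A sketch combining both levels; the row counts and task ID are illustrative:

```yaml
procopy_options:
  sub_task_rows: 10000000    # global: split tables into subtasks of ~10M rows
tasks:
  - id: big_table
    table:
      source_table: big_table
      destination_table: big_table
      sub_task_rows: 1000000 # override at the task level for this table
```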
For each partition of a partitioned table, an individual task is created.
Table Type
Configuration settings related to tasks of this type are specified in the
tasks.table section
of the configuration file. For example:
tasks:
  - id: tab1
    table:
      source_table: source_table
      destination_table: destination_table
      exclude_columns:
        - field4
      include_columns:
        - field1
        - field2
        - field3
        - field4
      column_mapping:
        field1: field2
        field2: field1
      where: (COL1 IS NULL OR COL1 > 10) AND COL2 <> 'value'
      batch_bytes: 976.6KiB
      truncate: false
      transform:
        field1:
          null_to_value: ""
        field2:
          values_to_null:
            - "NULL"
            - NULL_VAL
            - INVALID
        field3:
          null_char_replace: "\n"
      snapshot_id: null
To configure tasks of the Table type, choose the columns
that will ultimately be loaded:
Choose only specific columns to load using the include_columns
setting. By default, all columns are included.
Exclude columns from loading using the exclude_columns
setting.
Rename columns to be loaded using the column_mapping
setting.
Use the transform
setting to list rules to transform null values of the specified columns.
Use the where setting
to limit the number of rows to load.
Schema Type
Configuration settings related to tasks of this type are specified in the
tasks.schema section
of the configuration file. For example:
tasks:
  - id: schema_task
    schema:
      source_schema: schema1
      destination_schema: schema2
      exclude_tables:
        - TABLE1
      include_tables:
        - TABLE1
        - TABLE2
      table_mapping:
        TABLE2: main_table
      batch_bytes: 976.6KiB
      truncate: false
      snapshot_id: null
To configure tasks of the Schema type, choose the tables
that will ultimately be loaded:
Choose only specific tables to load using the include_tables
setting. By default, all tables are included.
Exclude tables from loading using the exclude_tables
setting.
Rename tables to be loaded using the table_mapping
setting.
Query Type
Tasks of the Query type load data to the destination DB from the
source DB that is obtained by running a query.
Configuration settings related to tasks of this type are specified in the
tasks.query section
of the configuration file. For example:
tasks:
  - id: tab2
    query:
      sql: SELECT id, name FROM users ORDER BY id
      destination_table: users_table
      destination_columns:
        - id
        - login
      batch_bytes: 976.6KiB
      truncate: false
      snapshot_id: null
When configuring tasks of the Query type,
list columns in the tasks.query.destination_columns
setting in the same order as in the query.
In tasks that load data from PostgreSQL/Postgres Pro, you can specify the snapshot to load data for. To do this, follow the steps below:
In the source DB, start the transaction with the REPEATABLE READ
isolation level.
Get the snapshot identifier with the query
SELECT pg_export_snapshot()
In the section of the configuration file related to this task, specify
the returned value for snapshot_id.
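The first two steps can be performed in psql as follows. Keep the transaction open until the load finishes, because an exported snapshot remains valid only while the exporting transaction is active:

```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();
--       pg_export_snapshot
-- -----------------------------
--  00000006-0000000000002007-1
```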
For example, the fragment of the configuration file for this task can look as follows:
tasks:
  - id: query_task
    query:
      sql: SELECT id, name FROM users ORDER BY id
      destination_table: users_table
      destination_columns:
        - id
        - login
      batch_bytes: 1000000
      truncate: false
      snapshot_id: 00000006-0000000000002007-1
As a result of this task, the data rows will be loaded as of the start of the transaction even if concurrent transactions change these rows.
For automatic creation of a snapshot on launching the data load, set
the procopy_options.enable_auto_snapshot
configuration parameter to true.
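For example:

```yaml
procopy_options:
  enable_auto_snapshot: true   # export a snapshot automatically when the load starts
```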
BFILE Objects
When migrating BFILE objects, only aliases are transferred.
The actual data must remain available from the source or be transferred by users.
This section describes procopy commands. Optional parameters are enclosed in square brackets.
procopy config generate [-f|--format json|yaml] [-o|--output filename]
Generates a procopy configuration file.
-f json|yaml, --format json|yaml
Specifies the format of the configuration file: JSON or YAML. The default is YAML.
-o filename, --output filename
Specifies the filename of the configuration file. By default, the configuration file will be output to stdout.
procopy load -f|--file filename [--log-level error|warn|info|debug] [-s|--queue-size count_of_batches] [-q|--quiet] [-r|--reload-from TIMESTAMP] [--dry-run|--read-only]
Launches data load between the databases.
-f filename, --file filename
Name of the configuration file. Required.
--log-level error|warn|info|debug
Logging level. The default is info.
Detailed logging can affect performance, so choose
debug only when you are investigating errors.
-s count_of_batches, --queue-size count_of_batches
Size of the queue between Readers and Loaders to be used by procopy. The default is 128.
A large value can excessively increase memory usage. It is usually
enough to set it to NumLoaders * 2. If
procopy starts to slow down when it
loads data to the destination, memory usage can grow by more than
count_of_batches *
batch_bytes
bytes.
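To illustrate the sizing math, the rough upper bound for queue memory with the defaults shown in this document (queue size 128, batch_bytes of 976.6KiB) works out as follows:

```python
# Rough upper bound on the Reader-to-Loader queue memory:
# count_of_batches * batch_bytes, using the defaults shown above.
KIB = 1024
batch_bytes = 976.6 * KIB   # default batch_bytes from the configuration examples
queue_size = 128            # default --queue-size

queue_memory_mib = queue_size * batch_bytes / (1024 ** 2)
print(f"~{queue_memory_mib:.1f} MiB")  # ~122.1 MiB
```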
-q, --quiet
Turns off output of the procopy progress.
-r TIMESTAMP, --reload-from TIMESTAMP
Launches a repeated load of problematic rows.
TIMESTAMP is the timestamp of one of the previous
launches of loading. If this option is specified,
procopy tries to only load rows that had issues
during loading with the TIMESTAMP. You can learn
this timestamp from logs or find it in the directory specified in the problem_rows.local.save_dir
configuration parameter.
--dry-run
Turns off writing to the destination. procopy will perform all the operations except writing. This flag is useful for checking the performance of writing.
--read-only
Turns off all the procopy logic except reading from the source. This flag is useful for checking the performance of reading.