F.69. vops

F.69.1. Installation and Setup
F.69.2. Motivation
F.69.3. Architecture
F.69.4. Using Vector Operations
F.69.5. Using Standard SQL Queries for Vector Operations
F.69.6. Reference
F.69.7. Authors

The vops extension provides support for vector operations (VOPS). It can speed up OLAP queries with filtering and aggregation more than 10 times, without changing anything in the query planner or executor. The vops extension does not set up any executor hooks or alter the query execution plan. Performance improvements are achieved by storing data in tiles (groups of column values). This reduces the overhead of the Postgres Pro Enterprise executor and the *_deform_tuple functions.

VOPS provides a set of standard operators for tile types, allowing you to write SQL queries in a way similar to normal SQL queries. Vector operators can be used inside predicates and aggregate expressions. Joins are not currently supported. Details of the VOPS architecture are described below.

VOPS can efficiently execute filter and aggregation queries. Other query types can be run on VOPS data through the unnest function or the VOPS foreign data wrapper, both described below.

F.69.1. Installation and Setup

The vops extension is included into Postgres Pro Enterprise. Once you have Postgres Pro Enterprise installed, complete the following steps to enable vops:

  1. Add the vops value to the shared_preload_libraries variable in the postgresql.conf file:

    shared_preload_libraries = 'vops'
    

    Note

    If the extension is not registered in the shared_preload_libraries list, it will be loaded on demand at the first invocation of a function from this extension. Thus, the first VOPS query you execute will not be transformed and may return wrong results.

  2. Restart the Postgres Pro Enterprise instance for the changes to take effect.

  3. Create the vops extension using the following query:

    CREATE EXTENSION vops;
    

Once vops is enabled, you can start creating VOPS table projections and using vector operations in your database.

F.69.2. Motivation

On OLTP workloads (executing a large number of simple queries), Postgres Pro Enterprise is very competitive with other mainstream databases. But on OLAP queries, which require processing larger volumes of data, database management systems oriented towards analytic queries can provide much better performance. Several main factors limit Postgres Pro Enterprise performance:

  • Unpacking tuple overhead (tuple_deform). To access column values, Postgres Pro Enterprise needs to deform the tuple. Values can be compressed, stored on some other page (TOAST), etc. Besides, since column sizes can vary, extracting the N-th column requires unpacking the preceding N-1 columns. Thus, deforming a tuple is quite an expensive operation, especially for tables with a large number of attributes. In queries like TPC-H Q6, tuple deforming takes about 40% of total query execution time.

  • Interpretation overhead. The Postgres Pro Enterprise compiler and optimizer build a tree representing the query execution plan. The query executor then recursively invokes evaluation functions for the nodes of this tree. Implementations of some nodes also contain switches used to select the requested action. So the query plan is interpreted by the Postgres Pro Enterprise query executor rather than directly executed. An interpreter is usually about 10 times slower than native code. This is why eliminating interpretation overhead can increase query speed several times, especially for queries with complex predicates where most time is spent in expression evaluation.

  • Abstraction penalty. Support for abstract (user-defined) types and operations is one of the key features of Postgres Pro Enterprise. Its executor can deal not only with the built-in set of scalar types (like integer or real), but also with any user-defined types, such as complex or point. But the price for such flexibility is that each operation requires a separate function call. Instead of adding two integers directly, Postgres Pro Enterprise executor invokes a function that performs addition of two integers. In this case, function call overhead is much higher than the cost of performing the operation itself. Function call overhead is also increased because of Postgres Pro Enterprise function call convention requiring passing parameter values through memory (not using register call convention).

  • Pull model overhead. The Postgres Pro Enterprise executor implements the pull-based query execution model, in which operand values are pulled by the operator. It simplifies the executor and operator implementations, but has a negative impact on performance, because leaf nodes (fetching a tuple from heap or index pages) have to do a lot of extra work saving and restoring their context.

  • MVCC overhead. Postgres Pro Enterprise provides multiversion concurrency control, which allows multiple transactions to work with the same record concurrently without blocking each other. It is good for frequently updated data (OLTP), but for read-only or append-only data in OLAP scenarios it only adds extra overhead: both space overhead (about 20 extra bytes per tuple) and CPU overhead (checking the visibility of each tuple).

Using vector operations addresses most of these issues without radical changes to the executor, as explained in the next section.

F.69.3. Architecture

F.69.3.1. Vertical Storage

A traditional query executor, such as the Postgres Pro Enterprise executor, deals with a single row of data at each moment in time. To evaluate the expression (x+y), it first fetches the value of "x", then the value of "y", performs the "+" operation, and returns the result value to the upper node. In contrast, a vectorized executor can process multiple values in one operation: "x" and "y" represent not a single scalar value but a vector of values, and the returned result is also a vector of values. In the vector execution model, interpretation and function call overhead is divided by the vector size. The price of performing a function call is the same, but as the function processes N values instead of one, this overhead becomes less critical. The larger the vector, the smaller the per-row overhead.

Thus, we can form a vector from all values of the corresponding table attribute. This is called the vertical data model, or columnar storage. Unlike the classical horizontal data model, where the storage unit is a row (tuple), here we have vertical columns.

Columnar storage has the following advantages:

  • Reduces the size of fetched data: only the columns used in the query need to be fetched.

  • Achieves better compression rates: storing all values of the same attribute together makes it possible to compress them much better and faster. For example, delta encoding can be used.

  • Minimizes interpretation overhead: each operation is performed not for a single value, but for a set of values.

  • Uses CPU vector instructions (SIMD) to process data.

However, performing an operation on the whole column is not a good idea. If the table is very large, the vector can also be very big and may not fit in memory. But even if it fits in memory, working with such large vectors prevents efficient utilization of the CPU cache levels (L1, L2, ...). So it is more efficient to split the column into relatively small chunks, or tiles, the units of processing of the vectorized executor. The chunk size is chosen so that all operands of vector operations fit in cache, even for complex expressions. A typical chunk size is 100 to 1000 elements.

F.69.4. Using Vector Operations

To use vector operations, you need to load data into a tile-based VOPS table, using one of the following options:

  • If the data to load is already stored in a database table, use the vops.populate function. This function creates a VOPS projection of the original table, using VOPS types instead of scalar types, and copies all data into this table.

  • If your data is not yet loaded into the database, you can import it into the VOPS table directly from a CSV file using vops.import function.

Once all data is in the VOPS format, you can run VOPS queries on it. VOPS provides a set of overloaded operators that allow you to write queries similar to standard SQL. Operators that cannot be overloaded (AND, OR, NOT, BETWEEN) are handled by the VOPS executor hook.

Let's examine the possible advantages of using vector operations for data kept in standard Postgres Pro Enterprise tables. An efficient vector executor is impossible without underlying support at the storage layer: the advantages of vector processing are lost if vectors are formed from attributes of rows extracted from an existing Postgres Pro Enterprise heap page.

The idea of the VOPS extension is to implement vector operations for tiles represented as special Postgres Pro Enterprise types. Tiles should be used as table column types instead of scalar types. For example, instead of "real" we should use "vops_float4", a tile representing up to 64 values of the corresponding column. There are several reasons for choosing 64:

  1. To provide efficient access to tiles, size_of_tile*size_of_attribute*number_of_attributes must be smaller than the page size. A typical record contains about 10 attributes, and the default size of a Postgres Pro Enterprise page is 8 kB.

  2. 64 is the number of bits in a large word. We need to maintain a bitmask to mark null values. It is certainly possible to store the bitmask in an array of arbitrary size, but manipulating a single 64-bit integer is more efficient.

F.69.4.1. Types

VOPS supports all basic Postgres Pro Enterprise numeric types: 1-, 2-, 4-, and 8-byte integers and 4- and 8-byte floats. It also supports the date and timestamp types, which use the same implementation as int4 and int8, respectively.

SQL type                 C type      VOPS tile type
bool                     bool        vops_bool
"char", char or char(1)  char        vops_char
int2                     int16       vops_int2
int4                     int32       vops_int4
int8                     int64       vops_int8
float4                   float4      vops_float4
float8                   float8      vops_float8
date                     DateADT     vops_date
timestamp                Timestamp   vops_timestamp

VOPS does not support strings (char or varchar types), except for a single character. If strings are used as identifiers, in most cases it is preferable to place them in a dictionary and use integer identifiers instead of the original strings.
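
A minimal sketch of this dictionary approach (the table and column names here are illustrative, not part of VOPS):

    -- Illustrative dictionary encoding: map repeated strings to integer ids
    create table shipmode_dict(
        id   int4 primary key,
        name varchar(10) unique
    );

    insert into shipmode_dict values (1, 'TRUCK'), (2, 'MAIL'), (3, 'SHIP');

    -- The fact table then stores the integer id (e.g. l_shipmode_id int4),
    -- and its VOPS projection can use vops_int4 for that column.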

F.69.4.2. Vector Operators

VOPS provides implementations of all built-in SQL arithmetic operations for numeric types: + - / *. It also implements all comparison operators: = <> > >= < <=. Operands of these operators can be either tiles or scalar constants: x=y or x=1.

Boolean operators AND, OR, NOT cannot be overloaded. Instead, VOPS provides the operators & | !. Please notice that the precedence of these operators differs from that of the AND, OR, NOT operators. So you cannot write a predicate as x=1 | x=2; it will cause a syntax error. To solve this problem, use parentheses: (x=1) | (x=2).

VOPS also provides an analog of the BETWEEN operator. In SQL, the expression (x BETWEEN a AND b) is equivalent to (x >= a AND x <= b). But since the AND operator cannot be overloaded, this substitution does not work for VOPS tiles. This is why VOPS provides a special function for range checks. Unfortunately, BETWEEN is a reserved keyword, so no function with this name can be defined; the synonym BETWIXT is used instead.

Postgres Pro Enterprise requires a predicate expression to have boolean type. But the result of vector boolean operators is vops_bool, not bool, so the compiler does not allow using it in a predicate. The problem is solved by introducing a special filter function. This function takes an arbitrary vector boolean expression and returns a normal boolean that is always true. So from the Postgres Pro Enterprise executor's point of view, the predicate value is always true. But the filter function sets filter_mask, which is actually used by subsequent operators to determine the selected records. So a query in VOPS looks something like this:

  select sum(price) from trades where filter(day >= '2017-01-01'::date);

Please notice one more difference from a normal query: we have to use an explicit cast of the string constant to the appropriate data type (the date type in this example). For the betwixt function it is not needed:

  select sum(price) from trades where filter(betwixt(day, '2017-01-01', '2017-02-01'));

For char, int2 and int4 types, VOPS provides the concatenation operator ||, which produces a doubled integer type: (char || char) -> int2, (int2 || int2) -> int4, (int4 || int4) -> int8. They can be used for grouping by several columns (see below).

Operator              Description
+                     Addition
-                     Binary subtraction or unary negation
*                     Multiplication
/                     Division
=                     Equals
<>                    Not equals
<                     Less than
<=                    Less than or equals
>                     Greater than
>=                    Greater than or equals
&                     Boolean AND
|                     Boolean OR
!                     Boolean NOT
betwixt(x,low,high)   Analog of BETWEEN
is_null(x)            Analog of IS NULL
is_not_null(x)        Analog of IS NOT NULL
ifnull(x,subst)       Analog of COALESCE

F.69.4.3. Vector Aggregates

OLAP queries usually perform some kind of aggregation of large volumes of data. These include grand aggregates, which are calculated for the whole table, and aggregates with group by, which are calculated for each group. VOPS implements all standard SQL aggregates: count, min, max, sum, avg, var_pop, var_samp, variance, stddev_pop, stddev_samp, stddev. They can be used exactly the same way as in normal SQL queries:

select sum(l_extendedprice*l_discount) as revenue
from vops_lineitem
where filter(betwixt(l_shipdate, '1996-01-01', '1997-01-01')
        & betwixt(l_discount, 0.08, 0.1)
        & (l_quantity < 24));

VOPS also provides the weighted average aggregate wavg, which can be used to calculate the volume-weighted average price (VWAP):

select wavg(l_extendedprice,l_quantity) from vops_lineitem;

Using aggregation with group by is more complex. VOPS provides two functions for it: map and reduce. The actual work is done by map(group_by_expression, aggregate_list, expr {, expr }). VOPS implements aggregation using a hash table whose entries collect aggregate states for all groups. The set-returning function reduce then just iterates through the hash table constructed by map. The reduce function is needed because the result of an aggregate in Postgres Pro Enterprise cannot be a set. So an aggregate query with group by looks something like this:

select reduce(map(l_returnflag||l_linestatus, 'sum,sum,sum,sum,avg,avg,avg',
    l_quantity,
    l_extendedprice,
    l_extendedprice*(1-l_discount),
    l_extendedprice*(1-l_discount)*(1+l_tax),
    l_quantity,
    l_extendedprice,
    l_discount)) from vops_lineitem where filter(l_shipdate <= '1998-12-01'::date);

Here we use the concatenation operator to perform grouping by two columns. Right now VOPS supports grouping only by integer types. Another serious restriction is that all aggregated expressions must have the same type, for example vops_float4. It is not possible to calculate aggregates for vops_float4 and vops_int8 columns in one call of the map function, because it accepts aggregation arguments as a variadic array, so all elements of this array must have the same type.

The aggregate string in the map function should contain the list of requested aggregate functions, separated by commas. Standard lowercase names should be used: count, sum, avg, min, max. Count is executed for the particular column: count(x). There is no need to explicitly specify count(*), because the number of records in each group is returned by the reduce function in any case.

The reduce function returns a set of vops_aggregates type. It contains three components: the value of the group by expression, the number of records in the group, and an array of floats with the aggregate values. Please notice that the values of all aggregates, including count and min/max, are returned as floats.

create type vops_aggregates as(group_by int8, count int8, aggs float8[]);
create function reduce(bigint) returns setof vops_aggregates;
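
As a hedged illustration (field access follows the vops_aggregates definition above; the particular column list is an assumption), the composite result of reduce can be unpacked like this:

    -- Unpack the vops_aggregates composite returned by reduce()
    select (r).group_by as grp,
           (r).count    as n,
           (r).aggs[1]  as sum_qty  -- aggregates appear in aggs[] in request order
    from (
        select reduce(map(l_returnflag||l_linestatus, 'sum', l_quantity)) as r
        from vops_lineitem
    ) s;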

But there is a much simpler and more straightforward way of performing grouped aggregates with VOPS: partition the table by the group by fields. In this case, the grouping keys are stored in the normal way and the other fields inside tiles. The Postgres Pro Enterprise executor will then execute VOPS aggregates for each group:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    countall(*) as count_order
from
    vops_lineitem_projection
where
    filter(l_shipdate <= '1998-12-01'::date)
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

In this example, the l_returnflag and l_linestatus fields of the vops_lineitem_projection table have "char" type, while all other fields used have tile types (l_shipdate has type vops_date, and the other fields vops_float4). The query above is executed even faster than the query with reduce(map(...)). The main problem with this approach is that you have to create a projection for each combination of group by keys you want to use in queries.

F.69.4.4. Window Functions

VOPS provides limited support for Postgres Pro Enterprise window functions. It implements the count, sum, min, max, avg and lag functions. But unfortunately, Postgres Pro Enterprise requires the moving (window) and plain implementations of an aggregate to have the same final type. This is why VOPS has to define these aggregates under different names: mcount, msum, mmin, mmax, mavg.

There are also two important restrictions:

  1. Filtering, grouping and sorting can be done only by scalar (non-tile) attributes.

  2. Only the rows between unbounded preceding and current row frame is supported (but there is a special version of msum that accepts an extra window size parameter).

Example of using window functions with VOPS:

select unnest(t.*) from (select mcount(*) over w,mcount(x) over w,msum(x) over w,mavg(x) over w,mmin(x) over w,mmax(x) over w,x - lag(x) over w 
from v window w as (rows between unbounded preceding and current row)) t;
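
A sketch of the msum variant with a window-size parameter; the exact signature is an assumption based on the description above, not confirmed by this documentation:

    -- Hypothetical: moving sum limited to the last 10 rows
    select msum(x, 10) over w
    from v window w as (rows between unbounded preceding and current row);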

F.69.4.5. Using Indexes

Analytic queries are usually performed on data for which no indexes are defined, and columnar-store vector operations are most efficient in this case. But it is still possible to use indexes with VOPS.

Since each VOPS tile represents multiple values, an index can be used only for preliminary, non-precise filtering of data. This is similar to BRIN indexes. VOPS provides four functions: first, last, high, low, which can be used to obtain the high/low boundary of the values stored in a tile. The first two functions, first and last, should be used for a sorted data set. In this case, the first value is the smallest value in the tile and the last value is the largest. If the data is not sorted, the low and high functions should be used; they are more expensive, because they need to inspect all tile values. Using these four functions, it is possible to construct functional indexes for a VOPS table. A BRIN index seems to be the best choice for a VOPS table:

create index low_boundary on trades using brin(first(day)); -- trades table is ordered by day
create index high_boundary on trades using brin(last(day)); -- trades table is ordered by day

Now it is possible to use these indexes in a query. Please notice that we have to recheck the precise condition, because the index gives only an approximate result:

select sum(price) from trades where first(day) >= '2015-01-01' and last(day) <= '2016-01-01'
                                               and filter(betwixt(day, '2015-01-01', '2016-01-01'));

F.69.4.6. Preparing Data for VOPS

The vops extension provides special functions to pre-process the data for VOPS queries, uniting attribute values of several rows inside one VOPS tile.

First of all, you need to create a table with columns of VOPS tile types. It can map all columns of the original table or just a subset of the most frequently used ones. This table can be treated as a projection of the original table. The projection should include columns that are most frequently used together in queries.

Original table from TPC-H benchmark:

create table lineitem(
   l_orderkey integer,
   l_partkey integer,
   l_suppkey integer,
   l_linenumber integer,
   l_quantity real,
   l_extendedprice real,
   l_discount real,
   l_tax real,
   l_returnflag char,
   l_linestatus char,
   l_shipdate date,
   l_commitdate date,
   l_receiptdate date,
   l_shipinstruct char(25),
   l_shipmode char(10),
   l_comment char(44));

VOPS projection of this table:

create table vops_lineitem(
   l_shipdate vops_date not null,
   l_quantity vops_float4 not null,
   l_extendedprice vops_float4 not null,
   l_discount vops_float4 not null,
   l_tax vops_float4 not null,
   l_returnflag vops_char not null,
   l_linestatus vops_char not null
);

The original table can be treated as write-optimized storage (WOS). If it has no indexes, Postgres Pro Enterprise can provide very high insertion speed, comparable with raw disk write speed. The projection in VOPS format can be treated as read-optimized storage (ROS), most efficient for execution of OLAP queries.

To transfer data from the original to the projected table, use the vops.populate function. For example:

select populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);

You can use the populated table in queries performing sequential scans. VOPS operators can speed up filtering of records and calculation of aggregates. Aggregation with group by requires the use of the reduce + map functions. But as mentioned above in the section describing aggregates, it is possible to populate the table in such a way that the standard Postgres Pro Enterprise grouping algorithm is used.

We need to choose the partitioning keys and sort the original table by these keys. The combination of partitioning keys is expected to be non-unique; otherwise tiles would only increase the used space and degrade performance. But if there are a lot of duplicates, then "collapsing" them and storing the other fields in tiles helps to reduce space and speed up queries. Let's create the following projection of the lineitem table:

create table vops_lineitem_projection(
   l_shipdate vops_date not null,
   l_quantity vops_float4 not null,
   l_extendedprice vops_float4 not null,
   l_discount vops_float4 not null,
   l_tax vops_float4 not null,
   l_returnflag "char" not null,
   l_linestatus "char" not null
);

As you can see, in this table the l_returnflag and l_linestatus fields are scalars, while the other fields are tiles. This projection can be populated using the following command:

select populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem_projection'::regclass, sort := 'l_returnflag,l_linestatus');

Now we can create a normal index on the partitioning keys, define standard predicates for them, and use them in group by and order by clauses.
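
For example, an ordinary B-tree index on the scalar grouping keys (the index name is illustrative):

    -- Normal index on the scalar (non-tile) partitioning keys
    create index vops_lineitem_projection_keys
        on vops_lineitem_projection(l_returnflag, l_linestatus);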

Sometimes it is not possible or not desirable to store two copies of the same dataset. VOPS allows loading data directly from a CSV file into a VOPS table with tiles, bypassing the creation of a normal (plain) table. It can be done using the import function:

select import(destination := 'vops_lineitem'::regclass, csv_path := '/mnt/data/lineitem.csv', separator := '|');

The import function is defined in this way:

create function import(destination regclass, 
                       csv_path cstring, 
                       separator cstring default ',', 
                       skip integer default 0) returns bigint;

It accepts the name of the target VOPS table, the path to the CSV file, an optional separator (default is ','), and the number of lines in the CSV header (no header by default). The function returns the number of imported rows.

A query from a VOPS projection returns a set of tiles. The output function of a tile type can print the content of the tile. But in some cases it is preferable to transfer the result to the normal (horizontal) format, where each tuple represents one record. It can be done using the unnest function:

postgres=# select unnest(l.*) from vops_lineitem l where filter(l_shipdate <= '1998-12-01'::date) limit 3;
                unnest                 
---------------------------------------
 (1996-03-13,17,33078.9,0.04,0.02,N,O)
 (1996-04-12,36,38306.2,0.09,0.06,N,O)
 (1996-01-29,8,15479.7,0.1,0.02,N,O)
(3 rows)

As mentioned in the previous section, the unnest function can scatter records with VOPS types into normal records with scalar types, so it is possible to use these records in arbitrary SQL queries. But there are two problems with the unnest function:

  1. It is not convenient to use. This function has no static knowledge of the format of the output record, so the programmer has to specify it manually to decompose the record.

  2. The Postgres Pro Enterprise optimizer has no knowledge of the result of the transformation performed by the unnest() function. This is why it cannot choose an optimal query execution plan for data retrieved from a VOPS table.

Fortunately, Postgres Pro Enterprise provides a solution for both of these problems: foreign data wrappers (FDW). In our case the data is not really "foreign": it is stored inside our own database, but in an alternative (VOPS) format. The VOPS FDW allows "hiding" the specifics of the VOPS format and running normal SQL queries on VOPS tables. The FDW allows the following:

  1. Extract data from a VOPS table in the normal (horizontal) format so that it can be processed by upper nodes in the query execution plan.

  2. Push down to VOPS the operations that can be efficiently executed using vectorized operations on VOPS types: filtering and aggregation.

  3. Provide statistics for the underlying table, which can be used by the query optimizer.

So, by placing a VOPS projection under the FDW, we can efficiently perform sequential scan and aggregation queries as if they were written explicitly for the VOPS table, and at the same time execute any other queries on this data, including joins, CTEs, etc. Queries can be written in standard SQL without using any VOPS-specific functions.

Below is an example of creating VOPS FDW and running some queries on it:

create foreign table lineitem_fdw  (
   l_suppkey int4 not null,
   l_orderkey int4 not null,
   l_partkey int4 not null,
   l_shipdate date not null,
   l_quantity float4 not null,
   l_extendedprice float4 not null,
   l_discount float4 not null,
   l_tax      float4 not null,
   l_returnflag "char" not null,
   l_linestatus "char" not null
) server vops_server options (table_name 'vops_lineitem');

explain select
   sum(l_extendedprice*l_discount) as revenue
from
   lineitem_fdw
where
   l_shipdate between '1996-01-01' and '1997-01-01'
   and l_discount between 0.08 and 0.1
   and l_quantity < 24;
                       QUERY PLAN                        
---------------------------------------------------------
 Foreign Scan  (cost=1903.26..1664020.23 rows=1 width=4)
(1 row)

-- Filter was pushed down to FDW

explain select
    n_name,
    count(*),
    sum(l_extendedprice * (1-l_discount)) as revenue
from
    customer_fdw join orders_fdw on c_custkey = o_custkey
    join lineitem_fdw on l_orderkey = o_orderkey
    join supplier_fdw on l_suppkey = s_suppkey
    join nation on c_nationkey = n_nationkey
    join region on n_regionkey = r_regionkey
where
    c_nationkey = s_nationkey
    and r_name = 'ASIA'
    and o_orderdate >= '1996-01-01'
    and o_orderdate < '1997-01-01'
group by
    n_name
order by
    revenue desc;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2337312.28..2337312.78 rows=200 width=48)
   Sort Key: (sum((lineitem_fdw.l_extendedprice * ('1'::double precision - lineitem_fdw.l_discount)))) DESC
   ->  GroupAggregate  (cost=2336881.54..2337304.64 rows=200 width=48)
         Group Key: nation.n_name
         ->  Sort  (cost=2336881.54..2336951.73 rows=28073 width=40)
               Sort Key: nation.n_name
               ->  Hash Join  (cost=396050.65..2334807.39 rows=28073 width=40)
                     Hash Cond: ((orders_fdw.o_custkey = customer_fdw.c_custkey) AND (nation.n_nationkey = customer_fdw.c_nationkey))
                     ->  Hash Join  (cost=335084.53..2247223.46 rows=701672 width=52)
                           Hash Cond: (lineitem_fdw.l_orderkey = orders_fdw.o_orderkey)
                           ->  Hash Join  (cost=2887.07..1786058.18 rows=4607421 width=52)
                                 Hash Cond: (lineitem_fdw.l_suppkey = supplier_fdw.s_suppkey)
                                 ->  Foreign Scan on lineitem_fdw  (cost=0.00..1512151.52 rows=59986176 width=16)
                                 ->  Hash  (cost=2790.80..2790.80 rows=7702 width=44)
                                       ->  Hash Join  (cost=40.97..2790.80 rows=7702 width=44)
                                             Hash Cond: (supplier_fdw.s_nationkey = nation.n_nationkey)
                                             ->  Foreign Scan on supplier_fdw  (cost=0.00..2174.64 rows=100032 width=8)
                                             ->  Hash  (cost=40.79..40.79 rows=15 width=36)
                                                   ->  Hash Join  (cost=20.05..40.79 rows=15 width=36)
                                                         Hash Cond: (nation.n_regionkey = region.r_regionkey)
                                                         ->  Seq Scan on nation  (cost=0.00..17.70 rows=770 width=40)
                                                         ->  Hash  (cost=20.00..20.00 rows=4 width=4)
                                                               ->  Seq Scan on region  (cost=0.00..20.00 rows=4 width=4)
                                                                     Filter: ((r_name)::text = 'ASIA'::text)
                           ->  Hash  (cost=294718.76..294718.76 rows=2284376 width=8)
                                 ->  Foreign Scan on orders_fdw  (cost=0.00..294718.76 rows=2284376 width=8)
                     ->  Hash  (cost=32605.64..32605.64 rows=1500032 width=8)
                           ->  Foreign Scan on customer_fdw  (cost=0.00..32605.64 rows=1500032 width=8)

-- filter on orders range is pushed to FDW

F.69.5. Using Standard SQL Queries for Vector Operations

You can use regular SQL when working with vector operations. You only have to create proper projections of the original table. These projections need to use tile types for some attributes (vops_float4, ...). Then you can query the table using standard SQL, and the query will be executed using vector operations!

This is implemented as follows:

  1. User defined types

  2. User defined operator

  3. User defined implicit type casts

  4. Post parse analyze hook which performs query transformation

So VOPS defines tile types and standard SQL operators for these types. Then it defines an implicit type cast from vops_bool (the result of a boolean operation with tiles) to the boolean type. Now programmers do not have to wrap vectorized boolean operations in a filter() function call. And the final transformation is done by the post parse analyze hook defined by the VOPS extension. It replaces scalar boolean operations with vector boolean operations:

Original Expression          Transformed Expression
NOT filter(o1)               filter(vops_bool_not(o1))
filter(o1) AND filter(o2)    filter(vops_bool_and(o1, o2))
filter(o1) OR filter(o2)     filter(vops_bool_or(o1, o2))

Now there is no need to use the VOPS-specific BETWIXT function: the standard SQL BETWEEN operator will work (but using BETWIXT is still slightly more efficient, because it performs both comparisons in one function). Also, there are no problems with operator precedence, and extra parentheses are not needed. If the query includes vectorized aggregates, count(*) is transformed to countall(*).

There is only one difference left between standard SQL and its vectorized extension: you still have to perform an explicit type cast when using a string literal. For example, l_shipdate <= '1998-12-01' will not work for an l_shipdate column of tile type. Postgres Pro Enterprise has two overloaded versions of the <= operator which could be applied here:

  1. vops_date <= vops_date

  2. vops_date <= date

Without an explicit cast, the ambiguity between them cannot be resolved in favor of converting the string to the tile type vops_date. In principle, it is possible to provide such a conversion operator, but it is not a good idea: we would have to generate a dummy tile with all components equal to the specified constant and perform a (vector OP vector) operation instead of the more efficient (vector OP scalar).
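
Putting it together, the earlier VOPS-specific filter and aggregation example can be written in plain SQL against the projection; only the explicit ::date casts of the string literals remain:

    -- Standard SQL on a VOPS projection; the post parse analyze hook
    -- vectorizes the predicate automatically
    select sum(l_extendedprice*l_discount) as revenue
    from vops_lineitem
    where l_shipdate between '1996-01-01'::date and '1997-01-01'::date
      and l_discount between 0.08 and 0.1
      and l_quantity < 24;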

F.69.6. Reference

F.69.6.1. Functions

vops.populate(destination regclass, source regclass, predicate cstring default null, sort cstring default null)

Copies data from an existing table to its vectorized projection and returns the number of imported rows (bigint).

Arguments:

  • destination — target table to copy the data into.

    Type: regclass

  • source — source table to copy the data from.

    Type: regclass

  • predicate (optional) — restricts the amount of imported data. Using this argument, you can upload only the most recent records.

    Type: cstring

  • sort (optional) — sorts the source data to define the order in which the data needs to be loaded.

    Type: cstring

Examples:

select populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);
select populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem_projection'::regclass, sort := 'l_returnflag,l_linestatus');
vops.import(destination regclass, csv_path cstring, separator cstring default ',', skip integer default 0)

Loads data into a VOPS table with tiles directly from a CSV file, bypassing the creation of a regular table, and returns the number of imported rows (bigint). Use this function to avoid storing two copies of the same dataset.

Arguments:

  • destination — target table to copy the data into.

    Type: regclass

  • csv_path — defines the path to the CSV file to copy the data from.

    Type: cstring

  • separator (optional) — specifies the field separator used in the CSV file. Use this parameter if the source CSV file uses any field separator other than the comma.

    Type: cstring

  • skip (optional) — specifies the number of rows in the CSV header.

    Default: 0

    Type: integer

Examples:

select import(destination := 'vops_lineitem'::regclass, csv_path := '/mnt/data/lineitem.csv', separator := '|');

F.69.7. Authors

Postgres Professional, Moscow, Russia