pgpro_scout — detect sensitive information in a database
pgpro_scout config [option...]
pgpro_scout inspect [option...]
pgpro_scout is a command-line utility to detect sensitive (confidential) information in a database. pgpro_scout looks for sensitive information in database column names and field values. The search requires a specialized dictionary that the utility allows you to create.
pgpro_scout does not aim to find all the sensitive information, but only performs data intelligence to determine whether such information exists and show sensitive data samples.
pgpro_scout is provided with Postgres Pro Enterprise
as a separate pre-built package pgpro_scout
(for the detailed installation instructions, see Chapter 17).
The dictionary is a text file whose lines have the key-value-comment format. Where:
key is a string with the conventional name of a search parameter.
value defines a condition that a database field must meet to be considered sensitive. Can be a regular expression or an array of string values.
comment is a text description of the search condition to be visible in search results.
pgpro_scout generates a report file with information on sample occurrences of the sensitive information found. Refer to the section called “pgpro_scout Report” for details.
To perform the database inspection for occurrences of sensitive information, you need to create the dictionary and launch the search passing a list of database objects to look through.
To create a base dictionary, run the config command. You can update this dictionary and extend it with new search conditions if needed. Here is an example of the dictionary:
scout:
column_names:
matchers:
- key: phones
match_values: ["phones", "phone"]
comment: "Phone number fields"
column_values:
disabled: true # optional
case_sensitive: false # optional
matchers:
- key: surnames
disabled: false # optional
case_sensitive: true # optional
match_values: ["Holmes", "Watson"]
- key: phones_data
pattern: ^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$
comment: "Phone numbers"
exclude:
schemas: ['pgpro_sfile_data', 'dbms_lob', 'profile', 'information_schema', 'pg_catalog', 'pg_toast']
column_types: ['timestamp with time zone']
The dictionary consists of the following sections:
scout — contains the description of search conditions.
exclude — lists database schemas and field types
to be excluded from the search.
These types of search conditions can be specified:
column_names — field names.
column_values — field values.
A matchers element contains a list of search conditions.
Each matchers element may contain the following parameters:
key — the search key.
match_values — the list of strings to look
for in the database fields. Do not specify together with pattern.
pattern — a regular expression for search in
the database fields. Do not specify together with match_values.
case_sensitive — a boolean value to define
whether the search is case-sensitive if match_values is specified.
disabled — a boolean value to define whether
to exclude the condition from the search.
comment — the comment to be shown in the
report file.
To specify the objects to look through, use the --search-path option
when launching the search. --search-path takes a string that
contains comma-separated substrings with database objects to restrict
the search to:
To restrict the search to a database schema, specify a substring with the schema name.
To restrict the search to an individual table or tables, specify a
substring like this: SCHEMA.TABLE or
SCHEMA.{TABLE1, TABLE2}.
To restrict the search to an individual column or columns of a certain
table, specify a substring like this: SCHEMA.TABLE.COLUMN
or SCHEMA.TABLE.{COLUMN1, COLUMN2}.
Table I.1. Examples of Specifying Search Objects
--search-path Value | Search Range |
|---|---|
public | All tables (with all columns) in the public
schema. |
public.{products, sales} | products and sales tables
(with all columns) in the public
schema. |
public.products.{id, email} | id and email columns of
the products table in the public
schema. |
public.products | Only the public.products table. |
public.products,events,holds.card.id | The public.products table, all tables in the
events schema, and the id column
of the holds.card table. |
To launch the search of sensitive information with the available dictionary,
run the inspect command specifying
--search-path as explained in
the section called “Specifying Database Objects to Look through”.
Take into account the following limitations when using pgpro_scout:
When looking for data of the json/jsonb type,
pattern conditions must be used rather than
match_values.
pgpro_scout does not currently support schemas, tables, and fields with case-sensitive names.
Search for values in large-volume fields can greatly increase the search
time, so that this can even look like the application hanging. If you
are aware of fields containing considerable volume of data, exclude
these fields from the search by editing --search-path.
When looking for data in a table with a composite primary key, avoid excluding single fields contained in the primary key from search.
pgpro_scout supports tables with primary keys of the following types:
Numeric: smallint, integer, bigint,
decimal, numeric, real,
double, smallserial, serial,
bigserial.
Monetary: money.
Character: text, bpchar, character,
character(n).
Date and time: timestamp, timestampz,
date, time, time with time zone,
interval.
Describing network addresses: inet, cidr,
macaddr, macaddr8.
Boolean.
UUID.
Range: int4range, int8range,
daterange.
Array: arrays of types are supported whose element types are supported.
Use of primary keys of types different from those listed above can result in errors.
This section describes pgpro_scout commands. Optional parameters are enclosed in square brackets.
pgpro_scout config [--out|-o string] [--override|-O] [--help|-h]
Generates a base pgpro_scout dictionary. The meaning of the options is as follows:
--h--helpOutput the command reference information.
-O--overrideOverride the dictionary file if it exists.
-o filepath--out filepathPath to the dictionary file.
pgpro_scout inspect [--batch-size|-buint] [--config-file|-fstring] [--dbname|-dstring] [--disable-comment] [--hidden-sample] [--host|-Hstring] [--order|-Ostring] [--output|-ostring] [--password|-Wstring] [--port|-puint16] [--report-bufferuint16] [--rows-number|-nuint16] [--sample-sizeuint16] [--search-path|-sstring] [--threads|-jint] [--username|-Ustring] [--help|-h]
Launches the search of sensitive information. The meaning of the options is as follows:
-b size--batch-size size #Number of database rows to be processed by one query for each task.
Default: 1000.
-f filepath--config-file filepath #Path to the dictionary file.
--d name--dbname name #Database name to connect to.
--disable-comment #Create the report file without comments.
--h string--help string #Output the command reference information.
--hidden-sample #
Mask found samples in the report. For example: Br**te
will be output instead of Bronte.
--H name--host name #Database host name or IP address.
Default: localhost.
-O search_order--order search_order #The order of looking through rows. Possible values:
tail — get last rows
head — get first rows
random — get random rows
Default: head.
-o filepath--output filepath #Path to the report file.
--W string--password string #Password to connect to the database.
--p name--port name #Database port.
Default: 5432.
--report-buffer size #Size of the report buffer.
Default: 64
-n number--rows-number number #Number of rows to look through in each table.
Default: 10000.
--sample-size number #Number of found samples to be included in the report for each table. If set to 0, samples of found rows will not be included in the report.
Default: 5.
-s filepath--search-path filepath #List of schemas, tables, and fields to look through. See the section called “Specifying Database Objects to Look through” for details.
Default: 5.
-j filepath--threads filepath #Number of threads used to run the utility.
Default: 14.
--U name--username name #Username to connect to the database.
A csv file is created as a result of pgpro_scout search. This file provides the following information:
Date and time of the search completion.
Report field headings.
Search results.
The search results contain the following fields:
Table I.2. Report Fields
| Name | Description |
|---|---|
location | Full name of a table, that is <schema name>.<table_name>, with the occurrence found. |
column | Column of the table. |
key | Key of the search condition from the dictionary. |
comment | Comment of the search condition from the dictionary. |
sample | Sample occurrence found. If the search is run with
--sample-size
= 0, this field is not included in the report. To mask the contents of
this field, enable --hidden-sample
when launching the search. |
The report will contain not more than --sample-size
lines for each table.
The following is an example of the report file:
#report time: 2025-06-16T11:10:58+03:00 location,column,key,comment, sample public.expected,phone,phones,Phone numbers, events.holds,user_surname,surname,Last names, Holmes events.holds,user_surname,surname,Last names, Watson