pgpro_scout — detect sensitive information in a database
pgpro_scout config [option...]
pgpro_scout inspect [option...]
pgpro_scout is a command-line utility to detect sensitive (confidential) information in a database. pgpro_scout looks for sensitive information in database column names and field values. The search requires a specialized dictionary that the utility allows you to create.
pgpro_scout does not aim to find all the sensitive information, but only performs data intelligence to determine whether such information exists and show sensitive data samples.
pgpro_scout is provided with Postgres Pro Standard
as a separate pre-built package pgpro_scout
(for the detailed installation instructions, see Chapter 16).
The dictionary is a text file whose lines have the key-value-comment format. Where:
key is a string with the conventional name of a search parameter.
value defines a condition that a database field must meet to be considered sensitive. Can be a regular expression or an array of string values.
comment is a text description of the search condition to be visible in search results.
pgpro_scout generates a report file with information on sample occurrences of the sensitive information found. Refer to the section called “pgpro_scout Report” for details.
To perform the database inspection for occurrences of sensitive information, you need to create the dictionary and launch the search passing a list of database objects to look through.
To create a base dictionary, run the config command. You can update this dictionary and extend it with new search conditions if needed. Here is an example of the dictionary:
scout:
column_names:
matchers:
- key: phones
match_values: ["phones", "phone"]
comment: "Phone number fields"
column_values:
disabled: true # optional
case_sensitive: false # optional
matchers:
- key: surnames
disabled: false # optional
case_sensitive: true # optional
match_values: ["Holmes", "Watson"]
- key: phones_data
pattern: ^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$
comment: "Phone numbers"
exclude:
schemas: ['pgpro_sfile_data', 'dbms_lob', 'profile', 'information_schema', 'pg_catalog', 'pg_toast']
column_types: ['timestamp with time zone']
The dictionary consists of the following sections:
scout — contains the description of search conditions.
exclude — lists database schemas and field types
to be excluded from the search.
These types of search conditions can be specified:
column_names — field names.
column_values — field values.
A matchers element contains a list of search conditions.
Each matchers element may contain the following parameters:
key — the search key.
match_values — the list of strings to look
for in the database fields. Do not specify together with pattern.
pattern — a regular expression for search in
the database fields. Do not specify together with match_values.
case_sensitive — a boolean value to define
whether the search is case-sensitive if match_values is specified.
disabled — a boolean value to define whether
to exclude the condition from the search.
comment — the comment to be shown in the
report file.
To specify the objects to look through, use the --search-path option
when launching the search. --search-path takes a string that
contains comma-separated substrings with database objects to restrict
the search to:
To restrict the search to a database schema, specify a substring with the schema name.
To restrict the search to an individual table or tables, specify a
substring like this: SCHEMA.TABLE or
SCHEMA.{TABLE1, TABLE2}.
To restrict the search to an individual column or columns of a certain
table, specify a substring like this: SCHEMA.TABLE.COLUMN
or SCHEMA.TABLE.{COLUMN1, COLUMN2}.
Table I.1. Examples of Specifying Search Objects
--search-path Value | Search Range |
|---|---|
public | All tables (with all columns) in the public
schema. |
public.{products, sales} | products and sales tables
(with all columns) in the public
schema. |
public.products.{id, email} | id and email columns of
the products table in the public
schema. |
public.products | Only the public.products table. |
public.products,events,holds.card.id | The public.products table, all tables in the
events schema, and the id column
of the holds.card table. |
To launch the search of sensitive information with the available dictionary,
run the inspect command specifying
--search path as explained in
the section called “Specifying Database Objects to Look through”.
This section describes pgpro_scout commands. Optional parameters are enclosed in square brackets.
pgpro_scout config [--out|-o string] [--override|-O] [--help|-h]
Generates a base pgpro_scout dictionary. The meaning of the options is as follows:
--h--helpOutput the command reference information.
-O--overrideOverride the dictionary file if it exists.
-o filepath--out filepathPath to the dictionary file.
pgpro_scout inspect [--batch-size|-buint] [--config-file|-fstring] [--dbname|-dstring] [--disable-comment] [--hidden-sample] [--host|-Hstring] [--order|-Ostring] [--output|-ostring] [--password|-Wstring] [--port|-puint16] [--report-bufferuint16] [--rows-number|-nuint16] [--sample-sizeuint16] [--search-path|-sstring] [--threads|-jint] [--username|-Ustring] [--help|-h]
Launches the search of sensitive information. The meaning of the options is as follows:
-b size--batch-size size #Number of database rows to be processed by one query for each task.
Default: 1000.
-f filepath--config-file filepath #Path to the dictionary file.
--d name--dbname name #Database name to connect to.
--disable-comment #Create the report file without comments.
--h string--help string #Output the command reference information.
--hidden-sample #
Mask found samples in the report. For example: Br**te
will be output instead of Bronte.
--H name--host name #Database host name or IP address.
Default: localhost.
-O search_order--order search_order #The order of looking through rows. Possible values:
tail — get last rows
head — get first rows
random — get random rows
Default: head.
-o filepath--output filepath #Path to the report file.
--W string--password string #Password to connect to the database.
--p name--port name #Database port.
Default: 5432.
--report-buffer size #Size of the report buffer.
Default: 64
-n number--rows-number number #Number of rows to look through in each table.
Default: 10000.
--sample-size number #Number of found samples to be included in the report for each table. If set to 0, samples of found rows will not be included in the report.
Default: 5.
-s filepath--search-path filepath #List of schemas, tables, and fields to look through. See the section called “Specifying Database Objects to Look through” for details.
Default: 5.
-j filepath--threads filepath #Number of threads used to run the utility.
Default: 14.
--U name--username name #Username to connect to the database.
A csv file is created as a result of pgpro_scout search. This file provides the following information:
Date and time of the search completion.
Report field headings.
Search results.
The search results contain the following fields:
Table I.2. Report Fields
| Name | Description |
|---|---|
location | Full name of a table, that is <schema name>.<table_name>, with the occurrence found. |
column | Column of the table. |
key | Key of the search condition from the dictionary. |
comment | Comment of the search condition from the dictionary. |
sample | Sample occurrence found. If the search is run with
--sample-size
= 0, this field is not included in the report. To mask the contents of
this field, enable --hidden-sample
when launching the search. |
The report will contain not more than --sample-size
lines for each table.
The following is an example of the report file:
#report time: 2025-06-16T11:10:58+03:00 location,column,key,comment, sample public.expected,phone,phones,Phone numbers, events.holds,user_surname,surname,Last names, Holmes events.holds,user_surname,surname,Last names, Watson