PPEM supports sending alerts to users and user groups via email to monitor agents, hosts, and instances. To send alerts, you must configure an SMTP server.
For an alert to be sent, an alert trigger must fire. An alert trigger fires when the value of a metric is below, equal to, or higher than the specified threshold. Metrics are stored in data sources, for example, in the repository database.
To determine the alert trigger threshold, alert trigger rules are used. An
alert trigger rule is a set of one or multiple
conditions containing logical operators and values. For example, an alert
trigger rule condition can include the >
logical operator and the 0 value. With such a rule, the
alert trigger fires if the value of the specified metric exceeds
0.
Relations between multiple alert trigger rule conditions are determined by the
AND and OR logical connectives.
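The way an alert trigger rule combines conditions can be illustrated with a minimal Python sketch. It is not PPEM code: the trigger_fires function and the (operator, threshold) tuple layout are hypothetical, and only the operator names (eq, gt, gte, lt, lte, neq) and the AND/OR connectives are taken from this section.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Operator names as listed for alert trigger rule conditions:
# = (eq), > (gt), >= (gte), < (lt), <= (lte), != (neq).
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "neq": operator.ne,
}

# Hypothetical representation of one condition: (operator name, threshold value).
Condition = Tuple[str, float]


def trigger_fires(value: float, conditions: List[Condition], connective: str = "AND") -> bool:
    """Return True if the metric value satisfies the rule (illustrative helper)."""
    if not conditions:
        raise ValueError("a rule needs at least one condition")
    results = [OPERATORS[name](value, threshold) for name, threshold in conditions]
    if connective == "AND":
        return all(results)
    if connective == "OR":
        return any(results)
    raise ValueError(f"unknown connective: {connective}")


# A single "> 0" condition fires for any positive metric value.
print(trigger_fires(5, [("gt", 0)]))
# Two conditions joined by AND and by OR give different results for the same value.
print(trigger_fires(50, [("gt", 10), ("lt", 40)], "AND"))
print(trigger_fires(50, [("gt", 10), ("lt", 40)], "OR"))
```

With a single condition the connective has no effect, which matches the fact that the connective is only selectable when multiple conditions are added.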
For alerts to work properly, you must first install and configure logging and monitoring tools.
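This section explains how to manage alerts. It includes the following instructions:
Pre-Configuring Alerts
Creating an Alert
Viewing Alerts
Disabling and Enabling Alerts
Editing Alert Recipients
Deleting an Alert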
The alerts functionality is in the beta phase. Currently, only pgpro-otel-collector metrics can be used for alert triggers. Other existing limitations are specified in the corresponding instructions of this section.
Pre-Configuring Alerts
To work with alerts, you must first pre-configure them in the
ppem-manager.yml manager configuration file.
You can specify the following parameters:
alerts:
metrics:
request_chunk_size: number_of_instance_IDs
scheduler:
interval: interval_for_checking_new_alerts
initial_delay: delay_for_starting_alert_scheduler
timeout: timeout_for_updating_alert_trigger_rules
notifier:
num_workers: number_of_concurrent_workers
worker_batch_size: number_of_alerts_in_one_batch
worker_interval: interval_for_checking_new_alerts
backoff_base: exponential_backoff_calculation_duration
max_retries: maximum_number_of_alert_attempts
notification_timeout: alert_timeout
janitor_interval: janitor_worker_polling_interval
stale_processing_timeout: stale_alert_processing_timeout
email:
is_enabled: true or false
smtp:
host: SMTP_server_hostname_or_IP
port: SMTP_server_port
username: username_for_SMTP_server_authentication
password: password_for_SMTP_server_authentication
from: alert_sender_email
timeout: SMTP_server_connection_timeout
use_starttls: true or false
use_ssl: true or false
tls:
insecure_skip_verify: true or false
root_ca_path: path_to_root_CA
Where:
metrics: The parameters of sending requests to the
metrics plugin.
request_chunk_size: The maximum number of
instance IDs within one request.
Default value: 100.
scheduler: The parameters of the scheduler that
updates alerts in the manager memory.
interval: The time interval for
the scheduler to check for new alerts to process.
Default value: 50s.
initial_delay: The delay before
starting the scheduler for the first time after the start of
PPEM.
Default value: 10s.
timeout: The scheduler timeout for updating
alert trigger rules.
Default value: 10m.
notifier: The parameters of the notifier that sends
alerts.
num_workers: The number of concurrent
workers that will send alerts.
Default value: 5.
worker_batch_size: The number of alerts
processed by workers in one batch.
Default value: 20.
worker_interval: The polling interval for
workers to check for new alerts in the
repository database.
Default value: 30s.
backoff_base: The base duration for the
exponential backoff calculation when resending a failed alert.
The delay for resending the alert is calculated as:
backoff_base × (2^number_of_retry_attempts).
Default value: 10s.
max_retries: The maximum number of attempts
to resend a failed alert.
Default value: 3.
notification_timeout: The maximum amount of
time for the notifier to wait for an alert to be sent
before considering it failed.
Default value: 20s.
janitor_interval: The polling interval for the
janitor worker that cleans alerts stuck in the processing state.
Default value: 1m.
stale_processing_timeout: The amount of
time after which alerts stuck in the processing state are considered
stale and must be reset by the janitor worker.
Default value: 10m.
email: The parameters of sending alerts via email.
is_enabled: Specifies whether alerts are sent
via email.
Possible values:
true
false
If false is specified, alerts are logged instead of
being sent via email.
Default value: false.
smtp: The parameters of the SMTP server used
for sending alerts.
host: The hostname or IP address of the
SMTP server.
Default value: localhost.
port: The port number of the SMTP server.
Default value: 25.
username: The username for authenticating
with the SMTP server.
Default value: "".
password: The password for authenticating
with the SMTP server.
Default value: "".
from: The email address of the alert sender.
Default value: admin@localdomain.local.
timeout: The SMTP server connection timeout.
Default value: 10s.
use_starttls: Specifies whether the
STARTTLS extension is used for
securing the SMTP server connection.
Possible values:
true
false
Default value: false.
use_ssl: Specifies whether the SSL/TLS
protocol is used for the SMTP server connection.
Possible values:
true
false
Default value: false.
tls: The TLS protocol parameters.
insecure_skip_verify: Specifies whether
the client skips the verification of the certificate chain
and hostname of the SMTP server.
Possible values:
true
false
Default value: false.
Setting this parameter to true poses
a security risk. Do this only for testing purposes or in
trusted networks.
root_ca_path: The path to the CA certificate
used for verifying the certificate of the SMTP server.
Default value: "".
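The resend delay formula given for backoff_base can be worked through with a short Python sketch. The resend_delays function is hypothetical and is not part of PPEM. The source does not state whether the retry attempt counter starts at 0 or 1, so the sketch exposes this as the first_attempt argument, which is an assumption.

```python
from typing import List


def resend_delays(backoff_base_s: float = 10.0, max_retries: int = 3,
                  first_attempt: int = 0) -> List[float]:
    """Delays in seconds before each resend: backoff_base * 2^attempt.

    The defaults mirror the documented defaults (backoff_base: 10s,
    max_retries: 3). first_attempt is an assumption of this sketch.
    """
    attempts = range(first_attempt, first_attempt + max_retries)
    return [backoff_base_s * 2 ** n for n in attempts]


# Counter starting at 0: 10 s, 20 s, 40 s.
print(resend_delays())
# Counter starting at 1: 20 s, 40 s, 80 s.
print(resend_delays(first_attempt=1))
```

In both readings the delay doubles with every failed attempt, so raising max_retries lengthens the total retry window much faster than raising backoff_base.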
Creating an Alert
In the navigation panel, go to Monitoring → Alerts.
In the top-right corner of the page, click Create alert.
Enter parameters of the new alert (parameters marked with an asterisk are required):
Name.
Datasource type: The type of the metrics that will be used for the alert trigger.
Currently, you can only use pgpro-otel-collector metrics.
Datasource.
State: The state of the alert after creation.
Possible values:
Disabled
Enabled
Check interval, sec.: The time interval in seconds for verifying the data source of the alert trigger.
Minimum value: 60.
Flap check, count: The number of repeated triggers required for stopping the alert.
0 means that this limitation is disabled.
Notify after, sec.: The time in seconds during which the trigger must continually fire for the alert to be sent.
Cooldown period, sec.: The time in seconds during which the alert will not be sent after the last firing trigger.
0 means that this limitation is disabled.
Currently, you cannot change the values of Flap check, count, Notify after, sec., and Cooldown period, sec.
Click Next, and then specify additional parameters (parameters marked with an asterisk are required):
Metric name: The name of the metric without any additional characters that will be used for the alert trigger.
You can use the following pgpro-otel-collector
metrics from the monitoring.metrics table
of the repository database:
postgresql.archiver.archived_count
postgresql.archiver.failed_count
postgresql.bgwriter.buffers_checkpoint
postgresql.bgwriter.buffers_clean
postgresql.bgwriter.buffers_backend
postgresql.bgwriter.buffers_allocated
postgresql.bgwriter.maxwritten_clean
postgresql.bgwriter.buffers_backend_fsync
postgresql.bgwriter.checkpoints_requested
postgresql.bgwriter.checkpoints_timed
postgresql.bgwriter.checkpoint_sync_time_milliseconds
postgresql.bgwriter.checkpoint_write_time_milliseconds
postgresql.databases.blocks_hit
postgresql.databases.blocks_read
postgresql.databases.conflicts
postgresql.databases.deadlocks
postgresql.databases.checksum_failures
postgresql.databases.tuples_fetched
postgresql.databases.tuples_returned
postgresql.databases.tuples_inserted
postgresql.databases.tuples_updated
postgresql.databases.tuples_deleted
postgresql.databases.temp_bytes
postgresql.databases.temp_files
postgresql.wal.bytes
postgresql.databases.rollbacks
system.cpu.utilization
system.memory.usage
system.paging.usage
postgresql.wal.records
postgresql.databases.commits
Operator • Threshold value: The alert trigger rule condition containing a logical operator and value.
Possible logical operators:
= (eq)
> (gt)
>= (gte)
< (lt)
<= (lte)
!= (neq)
For example, if you select > and specify
0, the alert is sent when the value of the
specified metric exceeds 0.
You can add multiple alert trigger rule conditions by clicking Add.
Rule condition: The logical connective that joins the specified alert trigger rule conditions.
Possible values:
AND
OR
This parameter is available only if you added multiple alert trigger rule conditions.
Instances to check.
Possible values:
Check all.
Select instances.
For this value, from Instances, select the instances.
Notify users: The users that will receive alerts.
Notify groups: The user groups that will receive alerts.
Alert template: The template of the alert text.
You can use the following variables in the alert text:
{{.Title}}: The name of the metric used for
the alert trigger.
{{.Timestamp}}: The date and time when the
alert trigger fired.
{{.Status}}: The status of the alert trigger.
Notify resolved: Specifies whether the alert is sent once the trigger is resolved.
Possible values:
Enabled.
For this value, in Resolved template, enter the template of the alert text.
You can use the same variables in this alert text as in Alert template.
Disabled.
Click Save.
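To see how the {{.Title}}, {{.Timestamp}}, and {{.Status}} variables expand in an alert template, consider the following Python sketch. The variable syntax resembles Go template fields. The sketch is not the PPEM template engine and only mimics plain field substitution. The render function, the sample template text, and the "firing" status value are hypothetical.

```python
import re

# Matches placeholders of the form {{.Name}}, allowing optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template: str, fields: dict) -> str:
    """Replace each {{.Name}} placeholder with the matching field value."""
    def substitute(match: "re.Match") -> str:
        name = match.group(1)
        if name not in fields:
            # Fail loudly on a misspelled variable instead of sending a broken alert.
            raise KeyError(f"unknown template variable: {name}")
        return str(fields[name])

    return PLACEHOLDER.sub(substitute, template)


text = render(
    "[{{.Status}}] {{.Title}} at {{.Timestamp}}",
    {
        "Title": "postgresql.databases.deadlocks",
        "Timestamp": "2025-01-01T00:00:00Z",  # sample value
        "Status": "firing",                   # hypothetical status value
    },
)
print(text)
```

The same substitution applies to Resolved template, since it accepts the same variables.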
Viewing Alerts
In the navigation panel, go to Monitoring → Alerts.
The table of alerts with the following columns will be displayed:
Name.
State.
Possible values:
Enabled
Disabled
Source name: The data source of the alert trigger.
This column includes additional information:
Type: The type of the metrics used for the alert trigger.
Possible values:
Repositories: System metrics.
Metrics: pgpro-otel-collector metrics.
Logs: pgpro-otel-collector logs.
This type of metrics is temporarily not supported.
Interval, sec.: The time interval in seconds for verifying the data source of the alert trigger.
Flap check, count: The number of repeated triggers required for stopping the alert.
0 means that this limitation is disabled.
Notify after, sec.: The time in seconds during which the trigger must continually fire for the alert to be sent.
Cooldown period, sec.: The time in seconds during which the alert is not sent after the last trigger.
0 means that this limitation is disabled.
Recipients: The users that receive alerts.
Group recipients: The user groups that receive alerts.
Rule: The alert trigger rule conditions.
For example, if an alert trigger rule condition is postgresql.up > 0,
the alert is sent when the value of the postgresql.up
metric exceeds 0.
Actions.
For more information about available actions, refer to other instructions in this section.
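The interaction between Notify after, sec. and Cooldown period, sec. can be sketched in Python as follows. This is an illustration under stated assumptions, not the PPEM scheduler logic: the should_send function and its arguments are hypothetical, times are plain seconds, and the cooldown is assumed to be counted from the previously sent alert.

```python
from typing import Optional


def should_send(now: float, firing_since: Optional[float], last_sent: Optional[float],
                notify_after: float, cooldown: float) -> bool:
    """Decide whether an alert should be sent at time `now` (all values in seconds)."""
    if firing_since is None:
        return False  # the trigger is not firing
    if now - firing_since < notify_after:
        return False  # the trigger has not fired continually for long enough
    # A cooldown of 0 disables the limitation, as described for this column.
    # Assumption: the cooldown starts when the previous alert was sent.
    if cooldown > 0 and last_sent is not None and now - last_sent < cooldown:
        return False
    return True


# Trigger firing for 70 s with a 60 s "notify after" and no cooldown: send.
print(should_send(now=120, firing_since=50, last_sent=None, notify_after=60, cooldown=0))
# Same trigger, but an alert was sent 20 s ago and the cooldown is 300 s: hold back.
print(should_send(now=120, firing_since=50, last_sent=100, notify_after=60, cooldown=300))
```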
Disabling and Enabling Alerts
In the navigation panel, go to Monitoring → Alerts.
Click the disable or enable icon next to the alert.
Editing Alert Recipients
In the navigation panel, go to Monitoring → Alerts.
Click the edit icon next to the alert.
Edit users and user groups that receive alerts.
Click Save.
Deleting an Alert
System alerts cannot be deleted.
Deleted alerts cannot be restored.
To delete an alert:
In the navigation panel, go to Monitoring → Alerts.
Click the delete icon next to the alert.
Confirm the operation and click Delete.