Alarm core is driven by a collection of rules, which are defined in config/alarm-settings.yml.
There are three parts in alarm rule definition.
Defines the relation between scope and entity name.
There are two types of rules: individual rules and composite rules. A composite rule is a combination of individual rules.
An alarm rule is made up of the following elements:
The rule name must be unique and must end with _rule.
Searchable alarm tag keys are configured in core/default/searchableAlarmTags, or through the system environment variable SW_SEARCHABLE_ALARM_TAG_KEYS. The key level is supported by default.
Label settings are required by the meter-system. They are used to store metrics from the label-system platform, such as Prometheus, Micrometer, etc.
The four label settings mentioned above must implement LabeledValueHolder.
For multiple-value metrics, such as percentile, the threshold is a list of values: value1, value2, value3, value4, value5. Set a value to - if you do not wish to trigger the alarm by one or more of the values. For example, value1 is the threshold of P50, and -, -, value3, value4, value5 means that there is no threshold for P50 and P75 in the percentile alarm rule.
The supported operators (OP) are >, >=, <, <=, and ==. We welcome contributions of all OPs.
If the number of times the metrics match the condition reaches count, then an alarm will be sent.
NOTE: Composite rules are only applicable to alarm rules targeting the same entity level, such as service-level alarm rules (service_percent_rule && service_resp_time_percentile_rule). Do not compose alarm rules of different entity levels, such as an alarm rule of the service metrics with another rule of the endpoint metrics.
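The threshold and OP semantics described above can be illustrated with a small sketch. This is not SkyWalking's implementation: the function names parse_thresholds and matches are hypothetical, and the rule that a check matches when any value with a threshold satisfies the operator is an assumption made for illustration.

```python
import operator

# The operators listed in the text, mapped to Python comparisons.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def parse_thresholds(spec):
    """Parse 'v1, v2, ...' into numbers; '-' becomes None (no threshold)."""
    return [None if part.strip() == "-" else float(part)
            for part in spec.split(",")]


def matches(values, spec, op):
    """Assumed semantics: match if any value that has a threshold satisfies OP."""
    compare = OPS[op]
    return any(t is not None and compare(v, t)
               for v, t in zip(values, parse_thresholds(spec)))


# P50 and P75 have no threshold; only P90, P95 and P99 are compared.
print(matches([1200, 1100, 800, 900, 950], "-, -, 1000, 1000, 1000", ">"))  # False
print(matches([200, 300, 800, 900, 1500], "-, -, 1000, 1000, 1000", ">"))   # True
```

In the first call the two values above 1000 are ignored because their thresholds are set to -, so nothing matches.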
A composite rule is made up of the following elements:
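The rule name must be unique and must end with _rule.
The expression specifies how to compose rules, and supports &&, ||, and ().

rules: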
  # Unique rule name, must end with `_rule`.
  endpoint_percent_rule:
    # The metrics value needs to be long, double or int
    metrics-name: endpoint_percent
    threshold: 75
    op: "<"
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before the alarm is triggered
    count: 3
    # How many checks the alarm keeps silent for after being triggered; defaults to the same value as period.
    silence-period: 10
    # Specify whether the rule can send notifications or only serves as a condition of a composite rule
    only-as-condition: false
    tags:
      level: WARNING
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] By default, match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    # Single value metrics threshold.
    threshold: 85
    op: "<"
    period: 10
    count: 4
    only-as-condition: false
  service_resp_time_percentile_rule:
    # The metrics value needs to be long, double or int
    metrics-name: service_percentile
    op: ">"
    # Multiple value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    only-as-condition: false
  meter_service_status_code_rule:
    metrics-name: meter_status_code
    exclude-labels:
      - "200"
    op: ">"
    threshold: 10
    period: 10
    count: 3
    silence-period: 5
    message: The request number of entity {name} non-200 status is more than expected.
    only-as-condition: false
composite-rules:
  comp_rule:
    # Both the percent rule and the response time rule must be satisfied
    expression: service_percent_rule && service_resp_time_percentile_rule
    message: Service {name} successful rate is less than 85% and P50 of response time is over 1000ms
    tags:
      level: CRITICAL
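To make the composite expression syntax more concrete, here is a minimal sketch of how an expression built from rule names, &&, || and () could be evaluated against the triggered state of each individual rule. It is an illustration only, not the engine SkyWalking uses: the functions tokenize and evaluate are hypothetical, and the precedence (&& binds tighter than ||) is an assumption borrowed from common programming languages.

```python
import re

# A token is an operator, a parenthesis, or a rule name.
TOKEN = re.compile(r"\s*(&&|\|\||\(|\)|[A-Za-z_][A-Za-z0-9_]*)")


def tokenize(expression):
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression, triggered):
    """Evaluate the expression; `triggered` maps rule name -> bool."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "||":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return value
        if token in ("&&", "||", ")"):
            raise ValueError(f"unexpected token {token}")
        return bool(triggered[token])  # a rule name

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


state = {"service_percent_rule": True, "service_resp_time_percentile_rule": False}
print(evaluate("service_percent_rule && service_resp_time_percentile_rule", state))  # False
print(evaluate("service_percent_rule || service_resp_time_percentile_rule", state))  # True
```

With this state, comp_rule from the example would not fire, because only one of its two conditions is met.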
For convenience’s sake, we have provided a default alarm-settings.yml in our release. It includes the following rules:
The metrics names are defined in the official OAL scripts and MAL scripts. Event names can also serve as metrics names; all possible event names can be found in the Event doc.
Currently, metrics from the Service, Service Instance, Endpoint, Service Relation, Service Instance Relation, and Endpoint Relation scopes can be used in Alarm, and the Database access scope is treated the same as Service.
Submit an issue or a pull request if you want Alarm to support any other scopes.
The Webhook requires the peer to be a web container. The alarm message is sent as an HTTP POST request with the application/json content type. The JSON body is a List of alarm messages with the following key information:
All scopes are defined in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine.
The rule name and the tags are those configured in alarm-settings.yml.
See the following example:
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceA",
  "id0": "12",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000,
  "tags": [{
    "key": "level",
    "value": "WARNING"
  }]
}, {
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceB",
  "id0": "23",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage yyy",
  "startTime": 1560524171000,
  "tags": [{
    "key": "level",
    "value": "CRITICAL"
  }]
}]
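On the receiving side, a webhook endpoint only needs to parse this JSON list. The following sketch shows one way to turn the body into readable summary lines. The function summarize_alarms is hypothetical, and treating startTime as epoch milliseconds is an assumption based on the magnitude of the example value.

```python
import json
from datetime import datetime, timezone


def summarize_alarms(body):
    """Parse the webhook JSON body and return one summary line per alarm."""
    lines = []
    for alarm in json.loads(body):
        # Flatten the list of key/value tag objects into a dict.
        tags = {tag["key"]: tag["value"] for tag in alarm.get("tags", [])}
        # Assumption: startTime is in milliseconds since the Unix epoch.
        started = datetime.fromtimestamp(alarm["startTime"] / 1000, tz=timezone.utc)
        level = tags.get("level", "UNKNOWN")
        lines.append(
            f"[{level}] {alarm['scope']} {alarm['name']}: {alarm['alarmMessage']} "
            f"(rule {alarm['ruleName']}, since {started.isoformat()})"
        )
    return lines


body = """[{
  "scopeId": 1, "scope": "SERVICE", "name": "serviceA", "id0": "12", "id1": "",
  "ruleName": "service_resp_time_rule", "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000, "tags": [{"key": "level", "value": "WARNING"}]
}]"""
for line in summarize_alarms(body):
    print(line)
```

A real receiver would call such a function from its HTTP POST handler and return a 2xx status once the alarms have been processed.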
The alarm message will be sent through a remote gRPC method with the Protobuf content type.
The message contains key information, which is defined in oap-server/server-alarm-plugin/src/main/proto/alarm-hook.proto.
Part of the protocol looks like this:
message AlarmMessage {
    int64 scopeId = 1;
    string scope = 2;
    string name = 3;
    string id0 = 4;
    string id1 = 5;
    string ruleName = 6;
    string alarmMessage = 7;
    int64 startTime = 8;
    AlarmTags tags = 9;
}

message AlarmTags {
    // String key, String value pair.
    repeated KeyStringValuePair data = 1;
}

message KeyStringValuePair {
    string key = 1;
    string value = 2;
}
Follow the Getting Started with Incoming Webhooks guide and create new Webhooks.
The alarm message will be sent as an HTTP POST request with the application/json content type if you have configured Slack Incoming Webhooks as follows:
slackHooks:
  textTemplate: |-
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": ":alarm_clock: *Apache Skywalking Alarm* \n **%s**."
      }
    }
  webhooks:
    - https://hooks.slack.com/services/x/y/z
Note that only the WeChat Company Edition (WeCom) supports WebHooks. To use the WeChat WebHook, follow the WeChat Webhooks guide.
The alarm message will be sent as an HTTP POST request with the application/json content type after you have set up WeChat Webhooks as follows:
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=dummy_key
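The textTemplate values in these hook settings contain a %s placeholder for the alarm message. As a rough sketch of how such a template could be rendered into a JSON payload, consider the snippet below. The function render is hypothetical, and the JSON-escaping of the message is an added precaution of this sketch rather than documented SkyWalking behaviour.

```python
import json

# The WeChat template from the example; "\\n" keeps the backslash-n
# sequence literal, exactly as it appears in the YAML block scalar.
WECHAT_TEMPLATE = """{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking Alarm: \\n %s."
  }
}"""


def render(template, alarm_message):
    """Fill the first %s placeholder with a JSON-escaped alarm message."""
    # json.dumps escapes quotes and control characters; [1:-1] drops the outer quotes.
    escaped = json.dumps(alarm_message)[1:-1]
    return template.replace("%s", escaped, 1)


payload = render(WECHAT_TEMPLATE, 'Service "serviceA" is slow')
# The rendered payload is valid JSON even though the message contains quotes.
print(json.loads(payload)["text"]["content"])
```

Escaping matters because alarm messages are free text; an unescaped double quote would otherwise break the JSON body.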
Follow the Dingtalk Webhooks guide and create new Webhooks.
For security purposes, you can configure an optional secret for an individual webhook URL.
The alarm message will be sent as an HTTP POST request with the application/json content type if you have configured Dingtalk Webhooks as follows:
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=dummy_token
      secret: dummysecret
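For context on what the optional secret is used for, the sketch below follows the signing scheme described in DingTalk's custom robot documentation: an HMAC-SHA256 over the timestamp and secret, Base64-encoded and appended to the webhook URL. The function sign_dingtalk_url is hypothetical, and this is an illustration of the scheme rather than SkyWalking's own code.

```python
import base64
import hashlib
import hmac
import urllib.parse


def sign_dingtalk_url(url, secret, timestamp_ms):
    """Append timestamp and sign parameters to a DingTalk webhook URL."""
    # The string to sign is "<timestamp in ms>\n<secret>", keyed by the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    # Base64-encode the digest, then URL-encode it for use in the query string.
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{url}&timestamp={timestamp_ms}&sign={sign}"


signed = sign_dingtalk_url(
    "https://oapi.dingtalk.com/robot/send?access_token=dummy_token",
    "dummysecret",
    1560524171000,
)
print(signed)
```

In practice the timestamp would be the current time in milliseconds, for example int(time.time() * 1000), since stale timestamps are rejected by the receiving side.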
Follow the Feishu Webhooks guide and create new Webhooks.
For security purposes, you can configure an optional secret for an individual webhook URL.
If you would like to direct a text to specific users, you can configure ats, which holds the Feishu user_id values separated by ",".
The alarm message will be sent as an HTTP POST request with the application/json content type if you have configured Feishu Webhooks as follows:
feishuHooks:
  textTemplate: |-
    {
      "msg_type": "text",
      "content": {
        "text": "Apache SkyWalking Alarm: \n %s."
      },
      "ats":"feishu_user_id_1,feishu_user_id_2"
    }
  webhooks:
    - url: https://open.feishu.cn/open-apis/bot/v2/hook/dummy_token
      secret: dummysecret
Follow the WeLink Webhooks guide and create new Webhooks.
The alarm message will be sent as an HTTP POST request with the application/json content type if you have configured WeLink Webhooks as follows:
welinkHooks:
  textTemplate: "Apache SkyWalking Alarm: \n %s."
  webhooks:
    # You can find your own client_id and client_secret in your app; the values below are dummies and must be changed.
    - client_id: "dummy_client_id"
      client_secret: dummy_secret_key
      access_token_url: https://open.welink.huaweicloud.com/api/auth/v2/tickets
      message_url: https://open.welink.huaweicloud.com/api/welinkim/v1/im-service/chat/group-chat
      # To send to multiple groups at a time, separate group_ids with commas, e.g. "123xx","456xx"
      group_ids: "dummy_group_id"
      # Choose any name for the robot; it is displayed in the group
      robot_name: robot
Since 6.5.0, the alarm settings can be updated dynamically at runtime by Dynamic Configuration, which will override the settings in alarm-settings.yml.
In order to determine whether an alarm rule is triggered or not, SkyWalking needs to cache the metrics of a time window for each alarm rule. If any attribute (metrics-name, op, threshold, period, count, etc.) of a rule is changed, the sliding window will be destroyed and re-created, causing the evaluation of this specific rule to start over.
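The effect of changing a rule attribute can be illustrated with a simplified model of a per-rule window. This is a sketch under stated assumptions, not SkyWalking's data structure: the class RuleWindow and the function window_for are hypothetical, each check is reduced to a boolean, and the silence period is not modelled.

```python
from collections import deque


class RuleWindow:
    """Keeps the last `period` check results for one rule (illustrative only)."""

    def __init__(self, period, count):
        self.period = period
        self.count = count
        self.results = deque(maxlen=period)  # old results fall out automatically

    def add(self, matched):
        """Record one check; return True once matches in the window reach count."""
        self.results.append(bool(matched))
        return sum(self.results) >= self.count


windows = {}  # rule name -> (attribute snapshot, window)


def window_for(rule_name, rule):
    """Return the cached window, re-creating it if any rule attribute changed."""
    snapshot = tuple(sorted(rule.items()))
    cached = windows.get(rule_name)
    if cached is None or cached[0] != snapshot:
        cached = (snapshot, RuleWindow(rule["period"], rule["count"]))
        windows[rule_name] = cached
    return cached[1]


rule = {"metrics-name": "endpoint_percent", "op": "<", "threshold": 75,
        "period": 10, "count": 3}
window = window_for("endpoint_percent_rule", rule)
print([window.add(True) for _ in range(3)])  # [False, False, True]

# Changing any attribute yields a fresh, empty window for the rule.
changed = dict(rule, threshold=80)
print(window_for("endpoint_percent_rule", changed) is window)  # False
```

Because the new window starts empty, the matches accumulated before the change no longer count towards the alarm.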