承接上章k8s(七)、Prometheus部署篇,在上章的基础上,本章介绍Prometheus告警相关配置。
在了解告警规则之前,首先得了解Prometheus的数据查询表达式,来获取metric数据是否到达告警阈(ps:这个字儿念yu,第四声,不念第二声的fa)值。
Overview
Prometheus提供了一种功能性表达式语言,能够让用户实时的选择和聚合时间序列的数据。表达式返回的结果可以被显示为曲线图,也可以在prometheus浏览器中显示为表格,或者通过HTTP API经由外部系统处理。
表达式语言类型
Prometheus表达式或子表达式可以评估为一下四种类型之一:
即时向量(Instant vector) - 包含每个时间序列单个样品的一组时间序列,共享相同的时间戳
范围向量(Range vector) - 包含一个范围内数据点的一组时间序列
标量(Scalar) - 一个简单的数字浮点值
字符串(String) - 一个简单的字符串值;当前未使用
根据使用情况(例如画图或者显示表达式的输出),只有某些类型是合法的,例如,即时向量表达式是可以画图的唯一类型。
时间序列选择器
即时向量选择
即时向量选择器允许选择一组时间序列,或者某个给定的时间戳的样本数据。下面这个例子选择了具有时间序列的http_requests_total metric对象:
http_requests_total
你可以通过附加一组标签,并用{}括起来,来进一步筛选这些时间序列。下面这个例子只选择有http_requests_total名称的、有prometheus工作标签的、有canary组标签的时间序列:
http_requests_total{job="prometheus",group="canary"}
另外,也可以也可以将标签值反向匹配,或者对正则表达式匹配标签值。下面列举匹配操作符:
=:选择正好相等的字符串标签
!=:选择不相等的字符串标签
=~:选择匹配正则表达式的标签(或子标签)
!=:选择不匹配正则表达式的标签(或子标签)
例如,选择staging、testing、development环境下的,GET之外的HTTP方法的http_requests_total的时间序列:
http_requests_total{environment=~"staging|testing|development",method!="GET"}
范围向量选择
范围向量表达式正如即时向量表达式一样运行,但是前者返回从当前时刻开始的一定时间范围的时间序列集合回来。语法是,在一个向量表达式之后添加[]来表示时间范围,持续时间用数字表示,后接下面单元之一:
s:seconds
m:minutes
h:hours
d:days
w:weeks
y:years
在下面这个例子中,我们选择最后5分钟的记录,metric名称为http_requests_total、作业标签为prometheus的时间序列的所有值:
http_requests_total{job="prometheus"}[5m]
操作符
Prometheus支持多种二元和聚合的操作符。
+ (addition)
- (subtraction)
* (multiplication)
/ (division)
% (modulo)
^ (power/exponentiation)
== (equal)
!= (not-equal)
> (greater-than)
< (less-than)
>= (greater-or-equal)
<= (less-or-equal)
and (intersection)
or (union)
unless (complement)
函数
Prometheus支持多种函数,来对数据进行操作,参考官网:https://prometheus.io/docs/prometheus/latest/querying/functions/
了解了metric查询表达式的基础上,下面开始部署告警器组件。
prometheus支持多种类型的告警媒介,国内可以用的例如mail/webchat,也可以使用webhook自定义接口,公司一般使用的钉钉,prometheus没有定制的钉钉接口,因此在这里使用自定义的接口来中转,将消息发送至钉钉接口,接收钉钉告警消息。
首先看下告警配置,各项参数已添加中文注释,直接看注释:
alertmanager-config.yaml:
kind: ConfigMap
apiVersion: v1
metadata:
name: alertmanager
namespace: kube-system
data:
config.yml: |-
global:
resolve_timeout: 5m
templates:
- '/etc/alertmanager-templates/*.tmpl'
route:
group_by: ['alertname', 'cluster', 'service']
#根据['alertname', 'cluster', 'service'],在初始告警发送前,等待30秒,将这段时间的多个告警进行分组
# When a new group of alerts is created by an incoming alert, wait at
# least 'group_wait' to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 30s
# When the first notification was sent, wait 'group_interval' to send a batch
# of new alerts that started firing for that group.
#按上面分组的消息,同一组消息,间隔5m才发送下一个。这个是为了尽量避免由一个问题带来的批量告警重复发送。
group_interval: 5m
# If an alert has successfully been sent, wait 'repeat_interval' to
# resend them.
#完全相同的一个告警消息,间隔repeat_interval时间才下一次发送,这里为了测试效果设置1m,一般这个值设置时间都较长
#repeat_interval: 1m
repeat_interval: 1m
# A default receiver
# If an alert isn't caught by a route, send it to default.
#默认接收器为webhook_alert
receiver: webhook_alert
# All the above attributes are inherited by all child routes and can
# overwritten on each.
# The child route trees.
routes:
#给webhook_alert接收器定义匹配标签规则
# Send severity=slack alerts to slack.
- match:
severity: info
receiver: webhook_alert
- match:
severity: warning
receiver: webhook_alert
# 接收器定义,这里是webhook类型,webhook_configs写上自定义api接口的url。
# 注意,这里的webhook钩子接口是我自己写的接收prometheus告警的接口,api接收到prometheus的消息后会提取信息,通过钉钉的接口发给相应的人员。
# 也可以使用wechat/mail的接口,配置方式参考官网:https://prometheus.io/docs/alerting/configuration/
receivers:
- name: webhook_alert
webhook_configs:
- url: 'http://192.168.88.26:8080/api/1.0/utils/alert/?token=ADW82115yn7YEWXCEW88WEW6'
send_resolved: true
prometheus通过webhook发送的告警消息样例:
{
"receiver":"webhook_alert",
"status":"firing",
"alerts":[{
"status":"firing",
"labels":{
"DEVDB":"","alertname":"NodeMemoryUsage-low",
"beta_kubernetes_io_arch":"amd64",
"beta_kubernetes_io_os":"linux",
"instance":"hv001238",
"job":"kubernetes-node-exporter","kubernetes_io_hostname":"hv001238","node_role_kubernetes_io_master":"","severity":"info","team":"node"
},
"annotations":{
"description":"hv001238: Memory usage is above 80% (current value is: 52.18385396227273",
"summary":"hv001238: High Memory usage detected"
},
"startsAt":"2018-06-23T03:47:10.408935848Z",
"endsAt":"0001-01-01T00:00:00Z",
"generatorURL":"http://prometheus-785fc5bbf4-ztlrd:9090/graph?g0.expr=%28node_memory_MemTotal_bytes+-+%28node_memory_MemFree_bytes+%2B+node_memory_Buffers_bytes+%2B+node_memory_Cached_bytes%29%29+%2F+node_memory_MemTotal_bytes+%2A+100+%3E+1\\u0026g0.tab=1"
},],
"groupLabels":{"alertname":"NodeMemoryUsage-low"},
"commonLabels":{
"DEVDB":"",
"alertname":"NodeMemoryUsage-low",
"beta_kubernetes_io_arch":"amd64",
"beta_kubernetes_io_os":"linux",
"job":"kubernetes-node-exporter",
"node_role_kubernetes_io_master":"",
"severity":"info","team":"node"},
"commonAnnotations":{},
"externalURL":"http://alertmanager-7cffc68878-xb2m6:9093",
"version":"4",
"groupKey":"{}/{severity=\\"info\\"}:{alertname=\\"NodeMemoryUsage-low\\"}"
}
消息太多,不够精简,我webhook端在接收之后将它精简提取为了如下格式:
{
'description': 'hv001238: Memory usage is above 80% (current value is: 39.42384016274741',
'title': 'hv001238: High Memory usage detected',
'notice_user': ['xxx', 'xxx', 'xxx'],
'start_time': '2018-06-23T03:47:10.408935848Z',
'end_time': '0001-01-01T00:00:00Z',
'status': 'firing',
'graph_link': 'http://http://prometheusv19.xxx.com/graph?g0.expr=%28node_memory_MemTotal_bytes+-+%28node_memory_MemFree_bytes+%2B+node_memory_Buffers_bytes+%2B+node_memory_Cached_bytes%29%29+%2F+node_memory_MemTotal_bytes+%2A+100+%3E+1&g0.tab=1'
}
中转api接口python(django + rest-framework框架)代码样例如下,钉钉消息的接口可以自己去官网查:钉钉官网
from rest_framework.views import APIView
from rest_framework.response import Response
import collections
from account.tasks.message import send_dingding_message
class AlertView(APIView):
def post(self,request):
token = request.GET.get('token')
message = collections.OrderedDict()
if token == 'ADW82115yn7YEWXCEW88WEW6xxwetrcX':
# ops = ['xxx', 'xxx', 'xxx']
ops = ['xxx']
content = request.data
message['title'] = content['alerts'][0]['annotations']['summary']
message['status'] = content['status']
message['description'] = content['alerts'][0]['annotations']['description']
message['start_time'] = content['alerts'][0]['startsAt']
message['end_time'] = content['alerts'][0]['endsAt']
#截图有效的url段,并将hostname从podname替换成自己真实环境的域名,这样接收到的消息点击链接可以直接查看告警数据及图形
valid_link = content['alerts'][0]['generatorURL'].split("\\")[0]
uri = valid_link.split(':9090')[1]
host = "http://prometheus.xxx.com"
message['graph_link'] = host + uri
messages = ""
for k,v in message.items():
messages = messages + '[' + k + ']:' + v + "\n"
print(messages)
res = send_dingding_message(user=ops,message=messages)
return Response(res)
def send_dingding_message(user,message)
#自己去获取钉钉的api接口,调用接口
return True
Alertmanager相关部署yaml文件:
alertmanager-templates.yaml #各类常见告警模板
configmap.yaml #配置文件
deployment.yaml #部署文件
service.yaml #服务文件
alertmanager-templates.yaml:
apiVersion: v1
data:
default.tmpl: |
{{ define "__alertmanager" }}AlertManager{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__description" }}{{ end }}
{{ define "__text_alert_list" }}{{ range . }}Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Annotations:
{{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Source: {{ .GeneratorURL }}
{{ end }}{{ end }}
{{ define "slack.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "slack.default.username" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "slack.default.fallback" }}{{ template "slack.default.title" . }} | {{ template "slack.default.titlelink" . }}{{ end }}
{{ define "slack.default.pretext" }}{{ end }}
{{ define "slack.default.titlelink" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "slack.default.iconemoji" }}{{ end }}
{{ define "slack.default.iconurl" }}{{ end }}
{{ define "slack.default.text" }}{{ end }}
{{ define "hipchat.default.from" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "hipchat.default.message" }}{{ template "__subject" . }}{{ end }}
{{ define "pagerduty.default.description" }}{{ template "__subject" . }}{{ end }}
{{ define "pagerduty.default.client" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "pagerduty.default.clientURL" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "pagerduty.default.instances" }}{{ template "__text_alert_list" . }}{{ end }}
{{ define "opsgenie.default.message" }}{{ template "__subject" . }}{{ end }}
{{ define "opsgenie.default.description" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "opsgenie.default.source" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "victorops.default.message" }}{{ template "__subject" . }} | {{ template "__alertmanagerURL" . }}{{ end }}
{{ define "victorops.default.from" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "email.default.subject" }}{{ template "__subject" . }}{{ end }}
{{ define "email.default.html" }}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
Style and HTML derived from https://github.com/mailgun/transactional-email-templates
The MIT License (MIT)
Copyright (c) 2014 Mailgun
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->
{{ template "__subject" . }}
>
://schema.org/EmailMessage" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; -webkit-font-smoothing: antialiased; -webkit-text-size-adjust: none; height: 100%; line-height: 1.6em; width: 100% !important; background-color: #f6f6f6; margin: 0; padding: 0;" bgcolor="#f6f6f6">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; background-color: #f6f6f6; margin: 0;" bgcolor="#f6f6f6">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0;" valign="top"> >
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; display: block !important; max-width: 600px !important; clear: both !important; width: 100% !important; margin: 0 auto; padding: 0;" valign="top">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; max-width: 600px; display: block; margin: 0 auto; padding: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; border-radius: 3px; background-color: #fff; margin: 0; border: 1px solid #e9e9e9;" bgcolor="#fff">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 16px; vertical-align: top; color: #fff; font-weight: 500; text-align: center; border-radius: 3px 3px 0 0; background-color: #E6522C; margin: 0; padding: 20px;" align="center" bgcolor="#E6522C" valign="top">
{{ .Alerts | len }} alert{{ if gt (len .Alerts) 1 }}s{{ end }} for {{ range .GroupLabels.SortedPairs }}
{{ .Name }}={{ .Value }}
{{ end }}
>
>
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 10px;" valign="top">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
{{ template "__alertmanagerURL" . }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #FFF; text-decoration: none; line-height: 2em; font-weight: bold; text-align: center; cursor: pointer; display: inline-block; border-radius: 5px; text-transform: capitalize; background-color: #348eda; margin: 0; border-color: #348eda; border-style: solid; border-width: 10px 20px;">View in {{ template "__alertmanager" . }}
>
>
{{ if gt (len .Alerts.Firing) 0 }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">[{{ .Alerts.Firing | len }}] Firing>
>
>
{{ end }}
{{ range .Alerts.Firing }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Labels>
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
{{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ if gt (len .Annotations) 0 }}-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Annotations>
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ .GeneratorURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #348eda; text-decoration: underline; margin: 0;">Source
>
>
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
{{ if gt (len .Alerts.Firing) 0 }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
>
>
{{ end }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">[{{ .Alerts.Resolved | len }}] Resolved>
>
>
{{ end }}
{{ range .Alerts.Resolved }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Labels>
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
{{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ if gt (len .Annotations) 0 }}-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Annotations>
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ .GeneratorURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #348eda; text-decoration: underline; margin: 0;">Source
>
>
{{ end }}
>
>
>
>
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; clear: both; color: #999; margin: 0; padding: 20px;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 12px; vertical-align: top; text-align: center; color: #999; margin: 0; padding: 0 0 20px;" align="center" valign="top">Sent by {{ template "__alertmanager" . }}
>
>
>>
>
-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0;" valign="top"> >
>
>
>
>
{{ end }}
{{ define "pushover.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "pushover.default.message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 }}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "pushover.default.url" }}{{ template "__alertmanagerURL" . }}{{ end }}
slack.tmpl: |
{{ define "slack.devops.text" }}
{{range .Alerts}}{{.Annotations.DESCRIPTION}}
{{end}}
{{ end }}
kind: ConfigMap
metadata:
creationTimestamp: null
name: alertmanager-templates
namespace: kube-system
configmap.yaml:
kind: ConfigMap
apiVersion: v1
metadata:
name: alertmanager
namespace: kube-system
data:
config.yml: |-
global:
# ResolveTimeout is the time after which an alert is declared resolved
# if it has not been updated.
resolve_timeout: 5m
# The smarthost and SMTP sender used for mail notifications.
# The API URL to use for Slack notifications.
# # The directory from which notification templates are read.
templates:
- '/etc/alertmanager-templates/*.tmpl'
# The root route on which each incoming alert enters.
route:
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
group_by: ['alertname', 'cluster', 'service']
# When a new group of alerts is created by an incoming alert, wait at
# least 'group_wait' to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 30s
# When the first notification was sent, wait 'group_interval' to send a batch
# of new alerts that started firing for that group.
group_interval: 5m
# If an alert has successfully been sent, wait 'repeat_interval' to
# resend them.
#repeat_interval: 1m
repeat_interval: 1m
# A default receiver
# If an alert isn't caught by a route, send it to default.
receiver: webhook_alert
# All the above attributes are inherited by all child routes and can
# overwritten on each.
# The child route trees.
routes:
# Send severity=slack alerts to slack.
- match:
severity: info
receiver: webhook_alert
- match:
severity: warning
receiver: webhook_alert
receivers:
- name: webhook_alert
webhook_configs:
- url: 'http://192.168.88.26:8080/api/1.0/utils/alert/?token=ADW82115yn7YEWXCEW88WEW6'
send_resolved: true
deployment.yaml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: alertmanager
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
name: alertmanager
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: quay.io/prometheus/alertmanager:v0.15.0
args:
- '--config.file=/etc/alertmanager/config.yml'
- '--storage.path=/alertmanager'
ports:
- name: alertmanager
containerPort: 9093
volumeMounts:
- name: config-volume
mountPath: /etc/alertmanager
- name: templates-volume
mountPath: /etc/alertmanager-templates
- name: alertmanager
mountPath: /alertmanager
serviceAccountName: prometheus
volumes:
- name: config-volume
configMap:
name: alertmanager
- name: templates-volume
configMap:
name: alertmanager-templates
- name: alertmanager
emptyDir: {}
service.yaml:
apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/scrape: 'true'
prometheus.io/path: '/metrics'
labels:
name: alertmanager
name: alertmanager
namespace: kube-system
spec:
selector:
app: alertmanager
type: NodePort
ports:
- name: alertmanager
protocol: TCP
port: 9093
targetPort: 9093
依次部署上方几个yaml文件。
在配置完alertmanager之后,需要给prometheus主程序配置文件添加alertmanager相关的配置,即修改prometheus的configmap,添加data.rules.yml,增加一些达到触发告警阈值的规则。
rules.yml: |
groups:
- name: test-rule
rules:
- alert: NodeFilesystemUsage-high
expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
for: 2m
labels:
team: node
severity: warning
annotations:
summary: "{{$labels.instance}}: High Filesystem usage detected"
description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}"
- alert: NodeMemoryUsage
expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High Memory usage detected"
description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
- alert: NodeMemoryUsage-low
expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 1
for: 1m
labels:
team: node
severity: info
annotations:
summary: "{{$labels.instance}}: High Memory usage detected"
description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
- alert: NodeCPUUsage
expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High CPU usage detected"
description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }}"
- alert: PodMemUsage
expr: container_memory_working_set_bytes{pod_name!="",namespace="default"}/container_spec_memory_limit_bytes{pod_name!="",namespace="default"} *100 != +Inf > 10
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: Pod High Mem usage detected"
description: "{{$labels.instance}}: Pod CPU Mem is above 80% (current value is: {{ $value }}"
更新之后的prometheus的configmap完整yaml文件如下:
kind: ConfigMap
metadata:
name: prometheus-config
namespace: kube-system
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- /etc/prometheus/rules.yml
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter.example.com:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
- job_name: 'kubernetes-ingresses'
kubernetes_sd_configs:
- role: ingress
relabel_configs:
- source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
action: keep
regex: true
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'kubernetes-node-exporter'
scheme: http
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_role]
action: replace
target_label: kubernetes_role
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:31672'
target_label: __address__
rules.yml: |
groups:
- name: test-rule
rules:
- alert: NodeFilesystemUsage-high
expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
for: 2m
labels:
team: node
severity: warning
annotations:
summary: "{{$labels.instance}}: High Filesystem usage detected"
description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}"
- alert: NodeMemoryUsage
expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High Memory usage detected"
description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
- alert: NodeMemoryUsage-low
expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 1
for: 1m
labels:
team: node
severity: info
annotations:
summary: "{{$labels.instance}}: High Memory usage detected"
description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
- alert: NodeCPUUsage
expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High CPU usage detected"
description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }}"
- alert: PodMemUsage
expr: container_memory_working_set_bytes{pod_name!="",namespace="default"}/container_spec_memory_limit_bytes{pod_name!="",namespace="default"} *100 != +Inf > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: Pod High Mem usage detected"
description: "{{$labels.instance}}: Pod CPU Mem is above 80% (current value is: {{ $value }}"
为了测试使用,增加了一项node内存使用率超过1%则触发告警,这个在使用时可以删掉。
kubectl apply -f config.map更新prometheus配置文件,更新配置文件后,prometheus不会自动重载配置,需重启pod,为了优雅地重载,可以采取更新deployment注解的方式来触发滚动升级,保证服务不中断。使用如下命令:
kubectl patch deployment prometheus --patch '{"spec": {"template": {"metadata": {"annotations": {"update-time": "2018-06-25 17:50" }}}}}' -n kube-system
一般用于容器计算内存使用量的,有两个常用指标:
其中container_memory_working_set_bytes
是容器真实使用的内存量,也是limit限制时的 oom 判断依据,kubectl top 命令采集的也是这个metric,因此,若你存在容器实际使用内存情况计算不准的情况,可以检查一下是不是混用了这两个metric,建议使用以container_memory_working_set_bytes
来计算容器的内存使用情况。
告警处于firing激活状态时,只有有效的start_time,endtime为无效值,状态为resolved时会有准确的end-time。
配置告警消息并不难,难点在于各项键控值的定义,metrics类型有很多,同时支持的各类应用也有很多,还有很多可摸索的空间。后期还要不断的扩充、调整监控类型,调整优化通知策略,prometheus非常地灵活,可以基于此不断发掘,打造最适合自己公司环境的容器云监控体系。