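SkyWalking was originally open-sourced by a Huawei developer and is now an Apache top-level project.
To quote the official site: it is an open-source observability platform for collecting, analyzing, aggregating, and visualizing data from services and cloud-native infrastructure. It is a modern application performance management (APM) tool designed for cloud-native, container-based, distributed systems.
We use it mainly for trace monitoring in a microservice architecture. SkyWalking does not intrude on application code: it monitors Java programs through an agent (probe), then collects, stores, and analyzes the monitoring data and raises alerts.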
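Overall, this open-source middleware has four parts: probes, the platform backend, storage, and the UI.
Probes: collect data, reformat it to meet SkyWalking's requirements, and send it to the collector.
Platform backend: aggregates and analyzes the data, and drives the flow from the probes to the UI.
Storage: currently supports Elasticsearch, MySQL, TiDB, H2, and other databases.
UI: dashboards can be customized to match what the backend provides, which makes it very powerful.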
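Download: http://skywalking.apache.org/downloads/
After the download finishes, cd into the bin directory and run the startup script to start the project.
When the startup output indicates success, visit localhost:8080 to open the UI.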
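In the application's startup parameters, the first line attaches the agent.
The fourth line is the Netty configuration.
The lines in between are the SkyWalking parameters.
The configuration parameters below are available for reference; use the ones you need.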
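Parameter | Meaning |
---|---|
agent.namespace | Namespace applied to the headers used for cross-process propagation; services with different namespaces cannot continue each other's traces |
agent.service_name | Unique identifier of a service (project); it determines the service name shown in the SkyWalking UI. Prefer English names |
agent.sample_n_per_3_secs | Number of traces sampled every 3 seconds. The default is negative, meaning all traces are collected as long as the in-memory buffer is not exceeded |
agent.authentication | Authentication token for communicating with the collector; must match the collector's configuration |
agent.span_limit_per_segment | Maximum number of spans in a single segment |
agent.ignore_suffix | Requests ending with one of these suffixes are not traced |
agent.is_open_debugging_class | Agent debugging switch; if true, the agent writes every class whose bytecode it instruments to the /debugging directory |
collector.backend_service | IP and port of the collector that the agent sends data to |
logging.max_file_size | Maximum log file size in bytes, 300 MB by default; a new file is created once it is exceeded |
logging.level | Log level; the default is DEBUG |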
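One way to apply these parameters without editing agent.config is to pass them as JVM system properties prefixed with `skywalking.`, next to the `-javaagent` flag. The helper below is only a sketch of how such an argument list could be assembled. The class name `AgentArgsSketch`, the jar path, the service name, and the backend address are all assumptions made for illustration, not values from this setup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: build JVM arguments that attach the agent and override agent.config keys. */
class AgentArgsSketch {

    /**
     * @param agentJar  path to skywalking-agent.jar (hypothetical, supplied by the caller)
     * @param overrides agent.config keys such as "agent.service_name" mapped to their values
     * @return the -javaagent flag followed by one -Dskywalking.<key>=<value> per override
     */
    static List<String> jvmArgs(String agentJar, Map<String, String> overrides) {
        List<String> args = new ArrayList<>();
        args.add("-javaagent:" + agentJar);
        for (Map.Entry<String, String> e : overrides.entrySet()) {
            // A system property named "skywalking." + key overrides the same key in agent.config.
            args.add("-Dskywalking." + e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] argv) {
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put("agent.service_name", "order-service");          // assumed service name
        overrides.put("collector.backend_service", "127.0.0.1:11800"); // assumed backend address
        // Assumed install location of the agent jar.
        List<String> args = jvmArgs("/opt/skywalking/agent/skywalking-agent.jar", overrides);
        System.out.println(String.join(" ", args));
    }
}
```

The printed string can be placed between `java` and `-jar your-app.jar` when launching the service.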
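Once the project has started, send it a few requests and then follow the traces in the UI.
The global dashboard shows a number of metrics with English names. Here is what they mean.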
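Metric | Meaning |
---|---|
Global | Global view |
Service | A service, for example a Tomcat container |
Endpoint | An endpoint, that is, a request URL |
Instance | An instance, that is, one running process of a service |
Global Heatmap | Heatmap of response time distribution across the system |
Global Brief | System overview |
Global TopThroughPut | Global throughput ranking |
Global Response Percentile | Global response time percentile chart |
Global Top Slow endpoint | Ranking of the slowest endpoints globally |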
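Service AVG SLA | Average service SLA (success rate) over time |
Service TopThroughPut | Service throughput ranking |
Service Response Percentile | Service response time percentile chart |
JVM CPU% | CPU load over time |
JVM Non-Heap | JVM non-heap (method area) usage |
JVM Heap | JVM heap usage |
JVM GC | Time spent in garbage collection |
JVM GC Count | Number of garbage collections |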
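Reading the percentile charts
SkyWalking reports statistics such as P50, P90, and P95. These are percentiles.
Definition: in a sample data set, if a given percentage of the data falls below some sample value, that value is the percentile for that percentage. For example, a P90 of 1000 ms means that 90% of requests finished in less than 1000 ms.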
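To make the definition concrete, here is a minimal sketch that computes percentiles with the nearest-rank method over a small array of response times. It only illustrates the concept: SkyWalking's own calculation may differ, and the class name `PercentileSketch` and the sample values are made up for this example.

```java
import java.util.Arrays;

/** Sketch: nearest-rank percentile over raw samples. */
class PercentileSketch {

    /**
     * Returns the smallest sample such that at least p percent of all
     * samples are less than or equal to it (nearest-rank method).
     */
    static long percentile(long[] samples, int p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(p / 100 * n), computed with integer arithmetic.
        int rank = (p * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Ten made-up response times in milliseconds.
        long[] responseTimes = {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000};
        System.out.println("P50 = " + percentile(responseTimes, 50)); // prints "P50 = 500"
        System.out.println("P90 = " + percentile(responseTimes, 90)); // prints "P90 = 900"
        System.out.println("P95 = " + percentile(responseTimes, 95)); // prints "P95 = 1000"
    }
}
```

Nearest-rank counts samples at or below the returned value, so on small data sets it can differ slightly from a strict "less than" reading of the definition.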
Alarms are configured in alarm-settings.yml, located in the config folder.
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 10
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_avg_rule:
  #   metrics-name: endpoint_avg
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
  - http://127.0.0.1:8098//alarm/pushData
#  - http://127.0.0.1/go-wechat/
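Parameter reference
Parameter | Meaning |
---|---|
service_resp_time_rule | Name of the alarm rule; must be unique and end with `_rule` |
metrics-name | Name of the metric being checked |
threshold | Threshold for the metric |
op | Comparison operator applied against the threshold |
period | Time window over which the metric is evaluated (the sample messages read 10 as "the last 10 minutes") |
count | Number of times the condition must match within the window before an alarm is sent |
silence-period | Number of checks during which the same alarm is suppressed after it fires |
message | Text of the alarm |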
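The way period, count, and silence-period interact is easier to see in code. The following is a deliberately simplified model of a single rule that receives one metric value per minute. It is not SkyWalking's actual implementation, and the class name `AlarmRuleSketch` and its methods are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of one alarm rule; one metric value arrives per check. */
class AlarmRuleSketch {
    private final String op;
    private final long threshold;
    private final int period;        // window length, in checks (minutes)
    private final int count;         // matches needed inside the window
    private final int silencePeriod; // checks to stay silent after firing
    private final Deque<Long> window = new ArrayDeque<>();
    private int silenceLeft = 0;

    AlarmRuleSketch(String op, long threshold, int period, int count, int silencePeriod) {
        this.op = op;
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    /** Feeds one metric value; returns true when an alarm would be sent. */
    boolean check(long value) {
        window.addLast(value);
        if (window.size() > period) {
            window.removeFirst(); // keep only the last `period` values
        }
        if (silenceLeft > 0) {
            silenceLeft--;        // still silenced after a previous alarm
            return false;
        }
        long hits = window.stream().filter(this::matches).count();
        if (hits >= count) {
            silenceLeft = silencePeriod;
            return true;
        }
        return false;
    }

    private boolean matches(long v) {
        switch (op) {
            case ">": return v > threshold;
            case "<": return v < threshold;
            case "=": return v == threshold;
            default: throw new IllegalArgumentException("unsupported op: " + op);
        }
    }
}
```

With the settings of `service_resp_time_rule` (`>`, 1000, period 10, count 3, silence-period 10), feeding 1200, 1500, 900, 1100 fires on the fourth value, because that is the third value above 1000 inside the window. The next ten checks then stay silent.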
Configure the DingTalk robot according to the DingTalk developer documentation.
import lombok.Data;

/**
 * Payload of one alarm message posted by SkyWalking to the webhook.
 *
 * @author lizhaojie
 * @date 2020/1/2 17:48
 */
@Data
public class AlarmMessageDto {
    /**
     * Scope id. All scopes are defined in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine
     */
    private int scopeId;
    /**
     * Scope
     */
    private String scope;
    /**
     * Name of the target scope entity
     */
    private String name;
    /**
     * The rule name you configured in alarm-settings.yml
     */
    private String ruleName;
    /**
     * ID of the scope entity, matching the name
     */
    private int id0;
    /**
     * Not used
     */
    private int id1;
    /**
     * Alarm text message
     */
    private String alarmMessage;
    /**
     * Alarm time in milliseconds, measured between the current time and midnight, January 1, 1970 UTC
     */
    private long startTime;
}
// `secret`, `webhook`, and `log` are fields of the enclosing controller, which is not shown here.
@RequestMapping(value = "/pushData", method = RequestMethod.POST)
public void alarm(@RequestBody List<AlarmMessageDto> alarmMessageList) {
    log.info("alarmMessage:{}", alarmMessageList.toString());
    alarmMessageList.forEach(info -> {
        try {
            // DingTalk signature: HmacSHA256 over "timestamp\nsecret", keyed with the secret.
            Long timestamp = System.currentTimeMillis();
            String stringToSign = timestamp + "\n" + secret;
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes("UTF-8"), "HmacSHA256"));
            byte[] signData = mac.doFinal(stringToSign.getBytes("UTF-8"));
            // Base64-encode, then URL-encode, and append to the webhook URL as query parameters.
            String sign = "&timestamp=" + timestamp + "&sign=" + URLEncoder.encode(new String(Base64.encodeBase64(signData)), "UTF-8");
            DingTalkClient client = new DefaultDingTalkClient(webhook + sign);
            OapiRobotSendRequest request = new OapiRobotSendRequest();
            request.setMsgtype("text");
            OapiRobotSendRequest.Text text = new OapiRobotSendRequest.Text();
            // "业务告警" means "business alarm".
            text.setContent("业务告警:\n" + info.getAlarmMessage());
            request.setText(text);
            OapiRobotSendRequest.At at = new OapiRobotSendRequest.At();
            // "所有人" means "everyone"; atMobiles normally takes phone numbers.
            at.setAtMobiles(Arrays.asList("所有人"));
            request.setAt(at);
            OapiRobotSendResponse response = client.execute(request);
            log.info("execute:{}", response.toString());
        } catch (Exception e) {
            log.error("Failed to push alarm to DingTalk", e);
        }
    });
}
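The signing step in the handler above can be tried on its own, without Spring or the DingTalk SDK. The sketch below reproduces the same computation using only the JDK, with `java.util.Base64` in place of commons-codec. The class name `DingTalkSignSketch` and the secret in `main` are placeholders invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Sketch: the webhook signature from the handler above, using only the JDK. */
class DingTalkSignSketch {

    /** Base64 of HmacSHA256("timestamp\nsecret"), keyed with the secret. */
    static String sign(long timestamp, String secret) {
        try {
            String stringToSign = timestamp + "\n" + secret;
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] signData = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(signData);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 is not available", e);
        }
    }

    /** Query-string suffix to append to a webhook URL that already has a query part. */
    static String signedQuery(long timestamp, String secret) {
        try {
            return "&timestamp=" + timestamp + "&sign=" + URLEncoder.encode(sign(timestamp, secret), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder secret; a real one comes from the robot's security settings.
        System.out.println(signedQuery(System.currentTimeMillis(), "SEC-placeholder"));
    }
}
```

Keeping the signature in a small pure function like this also makes it easy to unit test separately from the HTTP call.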