In real production, with dozens or even hundreds of microservice instances, the loss is enormous if an instance goes down and the problem cannot be located and alerted on quickly. So we need to monitor the microservices, raise alerts, and trace the call chain across services so that problems can be pinpointed fast.
brew update
brew search elasticsearch
brew install elasticsearch
brew services start elasticsearch-full
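Before checking in the browser, the same probe can be run from the terminal (a minimal sketch; -s just hides curl's progress output):
# hit the Elasticsearch HTTP endpoint; a JSON banner means the node is up
curl -s http://localhost:9200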
Visit localhost:9200 and check whether something like the following is returned:
{
  "name" : "******",
  "cluster_name" : "elasticsearch_name",
  "cluster_uuid" : "rp73VaY8RRCgQrl4M5uR9A",
  "version" : {
    "number" : "7.7.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "ad56dce891c901a492bb1ee393f12dfff473a423",
    "build_date" : "2020-05-28T16:30:01.040088Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
If nothing comes back, go to the log directory and check the errors in the log. If you don't know where the logs live, look at the value of the path.logs setting in elasticsearch.yml; that is where the logs are stored.
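For a Homebrew install, the config and logs usually sit at the paths below (typical Homebrew locations; adjust if yours differ, and note the main log file is named after cluster_name):
# path.logs in elasticsearch.yml points at the log directory
grep "path.logs" /usr/local/etc/elasticsearch/elasticsearch.yml
# with the cluster_name shown above, the main log file would be elasticsearch_name.log
tail -n 100 /usr/local/var/log/elasticsearch/elasticsearch_name.log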
Download the matching version from the SkyWalking website. Here we download the Binary Distribution for ElasticSearch 7, because the Elasticsearch installed above is version 7.
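If you prefer the terminal, the same artifact can be fetched from the Apache archive (URL assumed from the archive's naming convention):
# download the ES7 binary distribution of SkyWalking 7.0.0
curl -LO https://archive.apache.org/dist/skywalking/7.0.0/apache-skywalking-apm-es7-7.0.0.tar.gz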
Extract and edit the configuration
2.1 Extract the archive
tar -zxvf apache-skywalking-apm-es7-7.0.0.tar.gz
2.2 Enter the extracted folder and edit the /config/application.yml configuration
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Cluster settings. There is no cluster locally, so the default standalone mode is used and the cluster entries were removed
cluster:
  selector: ${SW_CLUSTER:standalone}
  standalone:
core:
  selector: ${SW_CORE:default}
  default:
    # Mixed: Receive agent data, Level 1 aggregate, Level 2 aggregate
    # Receiver: Receive agent data, Level 1 aggregate
    # Aggregator: Level 2 aggregate
    role: ${SW_CORE_ROLE:Mixed} # Mixed/Receiver/Aggregator
    restHost: ${SW_CORE_REST_HOST:0.0.0.0}
    restPort: ${SW_CORE_REST_PORT:12800}
    restContextPath: ${SW_CORE_REST_CONTEXT_PATH:/}
    gRPCHost: ${SW_CORE_GRPC_HOST:0.0.0.0}
    gRPCPort: ${SW_CORE_GRPC_PORT:11800}
    gRPCSslEnabled: ${SW_CORE_GRPC_SSL_ENABLED:false}
    gRPCSslKeyPath: ${SW_CORE_GRPC_SSL_KEY_PATH:""}
    gRPCSslCertChainPath: ${SW_CORE_GRPC_SSL_CERT_CHAIN_PATH:""}
    gRPCSslTrustedCAPath: ${SW_CORE_GRPC_SSL_TRUSTED_CA_PATH:""}
    downsampling:
      - Hour
      - Day
      - Month
    # Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # Turning this off disables the automatic deletion of metrics data.
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs periodically, unit is minute
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:90} # Unit is minute
    minuteMetricsDataTTL: ${SW_CORE_MINUTE_METRIC_DATA_TTL:90} # Unit is minute
    hourMetricsDataTTL: ${SW_CORE_HOUR_METRIC_DATA_TTL:36} # Unit is hour
    dayMetricsDataTTL: ${SW_CORE_DAY_METRIC_DATA_TTL:45} # Unit is day
    monthMetricsDataTTL: ${SW_CORE_MONTH_METRIC_DATA_TTL:18} # Unit is month
    # Cache metric data for 1 minute to reduce database queries, and if the OAP cluster changes within that minute,
    # the metrics may not be accurate within that minute.
    enableDatabaseSession: ${SW_CORE_ENABLE_DATABASE_SESSION:true}
    topNReportPeriod: ${SW_CORE_TOPN_REPORT_PERIOD:10} # top_n record worker report cycle, unit is minute
    # Extra model columns are the columns defined in the code. These columns are not logically required in aggregation or further query,
    # and activating them adds load to memory, OAP networking, and storage.
    # But once activated, users can see the names in the storage entities, which makes it easier to query the data with 3rd-party tools, such as Kibana->ES.
    activeExtraModelColumns: ${SW_CORE_ACTIVE_EXTRA_MODEL_COLUMNS:false}
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:"elasticsearch7"}
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    # trustStorePath: ${SW_SW_STORAGE_ES_SSL_JKS_PATH:"../es_keystore.jks"}
    # trustStorePass: ${SW_SW_STORAGE_ES_SSL_JKS_PASS:""}
    enablePackedDownsampling: ${SW_STORAGE_ENABLE_PACKED_DOWNSAMPLING:true} # Hour and Day metrics will be merged into minute index.
    dayStep: ${SW_STORAGE_DAY_STEP:1} # Represent the number of days in the one minute/hour/day index.
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:2}
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:0}
    # Those data TTL settings will override the same settings in core module.
    recordDataTTL: ${SW_STORAGE_ES_RECORD_DATA_TTL:7} # Unit is day
    otherMetricsDataTTL: ${SW_STORAGE_ES_OTHER_METRIC_DATA_TTL:45} # Unit is day
    monthMetricsDataTTL: ${SW_STORAGE_ES_MONTH_METRIC_DATA_TTL:18} # Unit is month
    # Batch process setting, refer to https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.5/java-docs-bulk-processor.html
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the bulk every 1000 requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
receiver-sharing-server:
  selector: ${SW_RECEIVER_SHARING_SERVER:default}
  default:
    authentication: ${SW_AUTHENTICATION:""}
receiver-register:
  selector: ${SW_RECEIVER_REGISTER:default}
  default:
receiver-trace:
  selector: ${SW_RECEIVER_TRACE:default}
  default:
    bufferPath: ${SW_RECEIVER_BUFFER_PATH:../trace-buffer/} # Path to trace buffer files, suggest to use absolute path
    bufferOffsetMaxFileSize: ${SW_RECEIVER_BUFFER_OFFSET_MAX_FILE_SIZE:100} # Unit is MB
    bufferDataMaxFileSize: ${SW_RECEIVER_BUFFER_DATA_MAX_FILE_SIZE:500} # Unit is MB
    bufferFileCleanWhenRestart: ${SW_RECEIVER_BUFFER_FILE_CLEAN_WHEN_RESTART:false}
    sampleRate: ${SW_TRACE_SAMPLE_RATE:10000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
    slowDBAccessThreshold: ${SW_SLOW_DB_THRESHOLD:default:200,mongodb:100} # The slow database access thresholds. Unit ms.
receiver-jvm:
  selector: ${SW_RECEIVER_JVM:default}
  default:
receiver-clr:
  selector: ${SW_RECEIVER_CLR:default}
  default:
receiver-profile:
  selector: ${SW_RECEIVER_PROFILE:default}
  default:
service-mesh:
  selector: ${SW_SERVICE_MESH:default}
  default:
    bufferPath: ${SW_SERVICE_MESH_BUFFER_PATH:../mesh-buffer/} # Path to trace buffer files, suggest to use absolute path
    bufferOffsetMaxFileSize: ${SW_SERVICE_MESH_OFFSET_MAX_FILE_SIZE:100} # Unit is MB
    bufferDataMaxFileSize: ${SW_SERVICE_MESH_BUFFER_DATA_MAX_FILE_SIZE:500} # Unit is MB
    bufferFileCleanWhenRestart: ${SW_SERVICE_MESH_BUFFER_FILE_CLEAN_WHEN_RESTART:false}
istio-telemetry:
  selector: ${SW_ISTIO_TELEMETRY:default}
  default:
envoy-metric:
  selector: ${SW_ENVOY_METRIC:default}
  default:
    alsHTTPAnalysis: ${SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS:""}
receiver_zipkin:
  selector: ${SW_RECEIVER_ZIPKIN:-}
  default:
    host: ${SW_RECEIVER_ZIPKIN_HOST:0.0.0.0}
    port: ${SW_RECEIVER_ZIPKIN_PORT:9411}
    contextPath: ${SW_RECEIVER_ZIPKIN_CONTEXT_PATH:/}
receiver_jaeger:
  selector: ${SW_RECEIVER_JAEGER:-}
  default:
    gRPCHost: ${SW_RECEIVER_JAEGER_HOST:0.0.0.0}
    gRPCPort: ${SW_RECEIVER_JAEGER_PORT:14250}
query:
  selector: ${SW_QUERY:graphql}
  graphql:
    path: ${SW_QUERY_GRAPHQL_PATH:/graphql}
alarm:
  selector: ${SW_ALARM:default}
  default:
telemetry:
  selector: ${SW_TELEMETRY:none}
  none:
  prometheus:
    host: ${SW_TELEMETRY_PROMETHEUS_HOST:0.0.0.0}
    port: ${SW_TELEMETRY_PROMETHEUS_PORT:1234}
  so11y:
    prometheusExporterEnabled: ${SW_TELEMETRY_SO11Y_PROMETHEUS_ENABLED:true}
    prometheusExporterHost: ${SW_TELEMETRY_PROMETHEUS_HOST:0.0.0.0}
    prometheusExporterPort: ${SW_TELEMETRY_PROMETHEUS_PORT:1234}
receiver-so11y:
  selector: ${SW_RECEIVER_SO11Y:-}
  default:
# Configuration center. Not used here, so the extra entries were removed
configuration:
  selector: ${SW_CONFIGURATION:none}
  none:
exporter:
  selector: ${SW_EXPORTER:-}
  grpc:
    targetHost: ${SW_EXPORTER_GRPC_HOST:127.0.0.1}
    targetPort: ${SW_EXPORTER_GRPC_PORT:9870}
Configuration notes:
cluster: several registries are available; the default is standalone (single node)
core: default settings
storage: the backing store for trace data, used for later query and display. H2, MySQL, ES, and InfluxDB are supported; recordDataTTL controls how long records are kept, 7 days by default
elasticsearch7 is chosen here and the other storage options are commented out; accordingly, the selector under storage is set to elasticsearch7
All other unused entries were removed. To set up a cluster or a configuration center, or to use H2, MySQL, etc. for storage, add the relevant entries back following the official configuration
Start the OAP server and the SkyWalking UI
cd bin
# start the OAP server
./oapService.sh
# start the SkyWalking UI
./webappService.sh
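Before opening the UI, it is worth confirming that both processes actually came up; a quick sketch using the ports configured above (the log file name is the default for this version, so adjust if yours differs):
# the OAP should be listening on its gRPC and REST ports
lsof -i:11800
lsof -i:12800
# the UI should be listening on 8080
lsof -i:8080
# and the OAP log should be free of Elasticsearch connection errors
tail -n 50 ../logs/skywalking-oap-server.log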
Visit localhost:8080. **What if port 8080 is already taken?** Open webapp.yml under the webapp folder and change the port. Once everything is up, it looks like the figure below.
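For example, to move the UI to 18080 without opening an editor (macOS sed syntax; assumes the key reads "port: 8080" as in this distribution's webapp.yml):
# run from the distribution root to switch the UI port from 8080 to 18080
sed -i '' 's/port: 8080/port: 18080/' webapp/webapp.yml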
Remember to turn on the Auto toggle and set the start time of the filter below it; SkyWalking will then refresh the collected data every 6 seconds.
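Once trace data starts flowing in (after the agents below are configured), SkyWalking also creates its indices in Elasticsearch under the nameSpace prefix set earlier, so listing them is a quick end-to-end check:
# the elasticsearch7 prefix comes from the nameSpace setting in application.yml
curl -s "http://localhost:9200/_cat/indices?v" | grep elasticsearch7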
Taking the Eureka registry as an example, here is how to configure the agent.
1.1 In IDEA, add the following under VM options in Run/Debug Configurations
# the path after -javaagent is the location of skywalking-agent.jar
-javaagent:/Users/jacksparrow414/skywalking/apache-skywalking-apm-bin-es7/agent/skywalking-agent.jar
1.2 Add two entries under Environment variables
SW_AGENT_NAME=eureka;SW_AGENT_COLLECTOR_BACKEND_SERVICES=127.0.0.1:11800
SW_AGENT_NAME is the service name. It can be anything, but different instances of the same service must use the same name.
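Outside the IDE, the same settings translate into a plain java launch; a sketch, where the application jar name is hypothetical:
# these exports replace the IDEA "Environment variables" entries
export SW_AGENT_NAME=eureka
export SW_AGENT_COLLECTOR_BACKEND_SERVICES=127.0.0.1:11800
# -javaagent points at skywalking-agent.jar, exactly as in the VM options
java -javaagent:/Users/jacksparrow414/skywalking/apache-skywalking-apm-bin-es7/agent/skywalking-agent.jar \
     -jar eureka-server.jar   # jar name is hypothetical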
1.3 Configure the gateway, one-service (service 1, with two instances, DemoClientone and DemoClientthree), and two-service (service 2) in the same way. Service 1 and service 2 both connect to the local database on port 3306. That makes five applications in total, started in this order:
registry -> gateway -> service 1 (two instances) -> service 2
Then start nginx.
Why nginx?
The real access path is:
browser/app/H5 mini program -> sends a request -> nginx reverse proxy -> forwards the request to the gateway -> the gateway routes by API path -> to the matching microservice instance
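To confirm the whole chain gets stitched into a single trace, send the request through nginx rather than hitting the gateway directly (the route below is a placeholder for whatever your nginx exposes):
# a request via the reverse proxy should produce a trace starting at the gateway span
curl -i http://localhost/your-gateway-route   # route is hypothetical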
1.4 To check whether the agent has taken effect: when each application starts, the console first prints
As you can see, there are 4 applications and 1 database, which is correct.
After a short wait, the following shows up in the console.
As you can see, the whole call chain is gateway -> two-service -> one-service.
1.5 Clicking Mysql/JDBC/PrepareStatement/execute in the figure above shows the details below. What if you also want to see the parameters of the executed SQL?
Set the parameter below to true in agent.config:
plugin.mysql.trace_sql_parameters=${SW_MYSQL_TRACE_SQL_PARAMETERS:true}
After changing the configuration, restart, call the interface again, and the SQL call in the trace looks like this:
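The ${SW_MYSQL_TRACE_SQL_PARAMETERS:...} placeholder also means this switch can be flipped with an environment variable instead of editing agent.config:
# overrides plugin.mysql.trace_sql_parameters without touching agent.config
export SW_MYSQL_TRACE_SQL_PARAMETERS=true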
Since the project doesn't ship a shutdown script (see issue 4698 for details), the processes have to be killed manually.
1. Stop the OAP server
# we configured port 11800, so find the process listening on it
lsof -i:11800
# then kill that process (34956 is the PID reported by lsof above)
kill -9 34956
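A one-line variant, relying on lsof's -t flag, which prints bare PIDs:
# kill whatever process owns the OAP gRPC port in a single step
kill -9 $(lsof -t -i:11800)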
2. Stop the SkyWalking UI
# the UI listens on port 8080 by default; kill -9 takes a PID, not a port, so find the PID first
lsof -i:8080
# then kill the PID reported by lsof (shown here as <PID>)
kill -9 <PID>
brew services stop elasticsearch-full