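Introduction
The Nagios plugin check_logfiles can monitor logs, but neither its timeliness nor its monitoring results are satisfactory. This article therefore shows how to combine Nagios passive checks via NSCA with Logstash to monitor logs in real time. This approach suits relatively small log volumes; for larger volumes, Logstash should be paired with a buffer such as Redis or Kafka.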
Requirements
Nagios monitors Java logs in real time and sends an alert notification whenever an ERROR-level entry appears in the log.
IP            hostname        Components            Notes
192.168.1.1   nagios server   nsca+nagios           Nagios server
192.168.1.2   nagios client   send_nsca+logstash    Java logs
Implementation
I. Nagios server configuration
Since the Nagios server is already configured, we simply add the following host and service definitions:
define host{
use linux-server
host_name nagios-client
alias passive-2
address 192.168.1.2
}
define service{
use passive_service
host_name nagios-client
service_description java service
check_command check_dummy!0
notifications_enabled 1
}
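Before NSCA is involved, it can help to confirm that the passive service accepts results at all. The following is a minimal sketch that formats a Nagios PROCESS_SERVICE_CHECK_RESULT external command for the service defined above and writes it to the external command file. The helper names are hypothetical, and the command file path is an assumption based on a source install under /usr/local/nagios; the real location is whatever command_file is set to in nagios.cfg.

```python
import time


def passive_result_command(host_name, service_description, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line.

    return_code follows the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    timestamp = int(time.time() if now is None else now)
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
        timestamp, host_name, service_description, return_code, output
    )


def submit_to_command_file(command_line,
                           command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write one command to the Nagios external command file.

    The default path is an assumption; check command_file in nagios.cfg.
    """
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

Run on the Nagios server by a user allowed to write to the command file, submit_to_command_file(passive_result_command("nagios-client", "java service", 2, "manual test")) should turn the service CRITICAL if passive checks are accepted for it.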
II. Nagios client configuration
1. Configure Logstash
input {
log4j {
type => "log4j-java"
port => 4560
}
}
output {
# For easier debugging, Logstash can print events to the console or write them to a file
stdout {
# codec => "json"
codec => "rubydebug"
}
# file {
# path => "/logs/out.log"
# }
}
# Start Logstash
/usr/local/logstash/bin/logstash agent -f /usr/local/logstash/etc/logstash.conf -l /usr/local/logstash/logs/stdout.log
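2. Configure log4j output for the Java application
Java has several logging frameworks, and Logstash supports log4j, so the Java application's logging needs to be switched to log4j.
vim log4j.properties
# Add the logstash appender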
log4j.rootLogger=INFO, stdout, logstash
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
#logstash
log4j.appender.logstash=org.apache.log4j.net.SocketAppender
log4j.appender.logstash.Port=4560
log4j.appender.logstash.RemoteHost=192.168.1.2
log4j.appender.logstash.ReconnectionDelay=60000
log4j.appender.logstash.LocationInfo=true
log4j.appender.logstash.Threshold = INFO
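# A custom log output format can also be defined. Note that SocketAppender ships serialized
# LoggingEvent objects and does not apply a layout, so this pattern does not change what Logstash receives.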
log4j.appender.logstash.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss}[%p] [%t] [%c] [%F:%L]-%m%n
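After the Java application is restarted, log4j keeps trying to connect to the configured Logstash ip:port; once the connection is established, it starts sending log data.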
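If no events arrive, a quick first check is whether the Logstash log4j input is reachable from the application host. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the host and port to pass in are the ones from this article (192.168.1.2 and 4560).

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, is_port_open("192.168.1.2", 4560) returning False suggests that Logstash is not running, is listening on a different port, or is blocked by a firewall. A successful connection only proves that the port accepts connections, not that events are being parsed.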
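Note: INFO-level logging produces a large volume of output. If log4j cannot ship it to Logstash fast enough, this can put heavy I/O pressure on the server hosting the Java application or slow the application down, and eventually cause it to malfunction. It is better to send only ERROR-level logs, configured as follows: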
log4j.appender.logstash.Threshold = ERROR
With this setting, stdout still prints INFO-level logs, while only ERROR-level logs are sent to Logstash.
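The per-appender threshold idea can be illustrated outside log4j. The sketch below is only an analogy using Python's standard logging module, not log4j itself: the logger accepts INFO and above, one handler plays the role of the stdout appender, and a second handler with a higher level plays the role of the logstash appender. The function and logger names are hypothetical.

```python
import io
import logging


def build_logger(console_stream, remote_stream):
    """Create a logger with two handlers that filter at different levels."""
    logger = logging.getLogger("threshold-demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)            # like log4j.rootLogger=INFO

    console = logging.StreamHandler(console_stream)
    console.setLevel(logging.INFO)           # like the stdout appender

    remote = logging.StreamHandler(remote_stream)
    remote.setLevel(logging.ERROR)           # like logstash.Threshold = ERROR

    logger.addHandler(console)
    logger.addHandler(remote)
    return logger
```

Logging one INFO and one ERROR message through this logger writes both to the first stream and only the ERROR message to the second, which mirrors the split between stdout and Logstash described above.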
3. Testing
After a request is issued to the Java application, the Logstash console prints the log event to the screen:
{
"message" => "jdbc:mysql://192.168.1.1::3306;characterEncoding=utf8",
"@version" => "1",
"@timestamp" => "2017-03-20T01:24:31.477Z",
"timestamp" => 1489973070924,
"path" => "com.atomikos.jdbc.AtomikosXAConnectionFactory",
"priority" => "WARN",
"logger_name" => "com.atomikos.jdbc.AtomikosXAConnectionFactory",
"thread" => "Atomikos:3",
"class" => "com.atomikos.logging.Slf4jLogger",
"file" => "Slf4jLogger.java:12",
"method" => "logWarning",
"host" => "192.168.1.2:28337",
"type" => "log4j-java"
}
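From the structured event that Logstash outputs, we can drive Nagios alerting from the "priority" field: when "priority" is INFO, the state is normal; when "priority" is ERROR, an alert is sent. To locate problems quickly, the alert should also carry "@timestamp" and "thread", that is, message_format => "%{@timestamp} %{thread}". The Logstash configuration therefore becomes: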
input {
log4j {
type => "log4j-java"
port => 4560
}
}
output {
# stdout {
# codec => "json"
# codec => "rubydebug"
# }
# file {
# path => "/logs/out.log"
# }
if [priority] == "ERROR" {
nagios_nsca {
host => "192.168.1.1"
port => "5667"
message_format => "%{@timestamp} %{thread}"
send_nsca_bin => "/usr/local/nagios/bin/send_nsca"
send_nsca_config => "/usr/local/nagios/etc/send_nsca.cfg"
nagios_host => "192.168.1.2"
nagios_service => "java service"
nagios_status => 2
}
}
if [type] == "log4j-jetty" {
nagios_nsca {
host => "192.168.1.1"
port => "5667"
message_format => "OK"
send_nsca_bin => "/usr/local/nagios/bin/send_nsca"
send_nsca_config => "/usr/local/nagios/etc/send_nsca.cfg"
nagios_host => "192.168.1.2"
nagios_service => "java service"
nagios_status => 0
}
}
}
Here, when "priority" is ERROR, nagios_status => 2, so send_nsca submits a passive check result with status 2 (CRITICAL) to the nsca daemon on the Nagios server, which triggers the alert.
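To make the hand-off to send_nsca concrete, the sketch below reproduces in Python roughly what happens for one event: the message_format placeholders are filled from the event, the fields are joined with the "~" delimiter, and the line is piped to send_nsca. The delimiter and command-line flags (-H, -p, -d, -c) match the command visible in the Logstash log in the troubleshooting step; the helper names are hypothetical, and replacing newlines with spaces is a simplification of this sketch, not necessarily what the plugin does.

```python
import re
import subprocess


def format_message(template, event):
    """Expand %{field} placeholders from the event; unknown fields stay as-is."""
    def replace(match):
        key = match.group(1)
        return str(event[key]) if key in event else match.group(0)
    return re.sub(r"%\{([^}]+)\}", replace, template)


def nsca_line(host, service, status, message, delimiter="~"):
    """Build the line send_nsca reads on stdin: host, service, status, output."""
    # Simplification: keep the result on one line.
    fields = [host, service, str(status), message.replace("\n", " ")]
    return delimiter.join(fields) + "\n"


def send_nsca_argv(server, port, config,
                   binary="/usr/local/nagios/bin/send_nsca", delimiter="~"):
    """Build the send_nsca command line used to deliver the result."""
    return [binary, "-H", server, "-p", str(port), "-d", delimiter, "-c", config]


def submit(argv, line):
    """Pipe one result line to send_nsca and return its exit code."""
    return subprocess.run(argv, input=line, text=True).returncode
```

Calling submit(send_nsca_argv("192.168.1.1", 5667, "/usr/local/nagios/etc/send_nsca.cfg"), nsca_line("192.168.1.2", "java service", 2, "manual test")) from the client is also a way to test the NSCA path without Logstash in the loop.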
4. Troubleshooting
Although the steps above look straightforward, the setup did hit an error along the way. The Logstash log /usr/local/logstash/logs/stdout.log showed the following:
{:timestamp=>"2017-03-17T09:21:00.097000+0800", :message=>"192.168.1.1~CheckDummy~2~ERROR", :error=>#>, :nagios_nsca_command=>"/usr/local/nagios/bin/send_nsca -H 192.168.1.1 -p 5667 -d ~ -c /usr/local/nagios/etc/send_nsca.cfg", :missed_event=>#<:event:0x1ad6acb6>, @cancelled=false......}
The error reported is "error=NameError: undefined local variable or method `message' for #>". After investigating and following https://github.com/logstash-plugins/logstash-output-nagios/issues/3, I made the following change to the Logstash plugin:
vim vendor/bundle/jruby/1.9/gems/logstash-output-nagios_nsca-2.0.2/lib/logstash/outputs/nagios_nsca.rb
Line 114: send_to_nagios(cmd)
change to: send_to_nagios(cmd, message)
Line 131: def send_to_nagios(cmd)
change to: def send_to_nagios(cmd, message)
After this change, Logstash alerts through Nagios as expected.