Original article: http://linux.52zhe.info/read.php/274.htm
The original is formatted more nicely, so reading it there is recommended.
Troubleshooting heartbeat
OS:SLES 11 SP2 64bit
HA:SLE-HAE 11 SP2
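I: resource agent
II: stonith
III: log analysis
This article does not cover how to configure and use heartbeat. It focuses on analyzing and fixing problems that come up while configuring it. The problems fall into two categories: resource agents and stonith.
Below is the core of a heartbeat configuration that is known to work (run crm configure show to see the current configuration):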
node node1
node node2
primitive p_apache ocf:heartbeat:apache \
    op monitor interval="10" \
    op start interval="0" timeout="45" \
    op stop interval="0" timeout="65"
primitive p_vip ocf:heartbeat:IPaddr2 \
    params ip="10.10.10.250" \
    op monitor interval="15" timeout="30"
primitive p_ping ocf:pacemaker:ping \
    params multiplier="100" dampen="5" host_list="10.10.10.1" \
    op monitor interval="15" timeout="60" \
    op start interval="0" timeout="60"
primitive st_ilo_1 stonith:external/ipmi \
    params hostname="node1" ipaddr="192.168.203.1" userid="Administrator" passwd="123456" interface="lanplus"
primitive st_ilo_2 stonith:external/ipmi \
    params hostname="node2" ipaddr="192.168.203.2" userid="Administrator" passwd="123456" interface="lanplus"
group g_apache p_vip p_apache
clone cl_ping p_ping
location l_conn g_apache \
    rule $id="l_conn-rule" -inf: not_defined pingd or pingd lte 0
location l_apache g_apache +inf: node1
location l_node1 st_ilo_1 -inf: node1
location l_node2 st_ilo_2 -inf: node2
property $id="cib-bootstrap-options" \
    dc-version="1.1.6-b988976485d15cb702c9307df55512d323831a5e" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    no-quorum-policy="ignore"
I: When working with a Resource Agent, the following methods are the main ones:
1: List the available resource types:
crm(live)# ra help
This level contains commands which show various information about
the installed resource agents. It is available both at the top
level and at the `configure` level.
Available commands:
    classes        list classes and providers
    list           list RA for a class (and provider)
    meta           show meta data for a RA
    providers      show providers for a RA and a class
    quit           exit the program
    help           show help
    end            go back one level
crm(live)# ra classes
heartbeat
lsb
ocf / heartbeat linbit lvm2 ocfs2 pacemaker
stonith
crm(live)# ra list ocf
AoEtarget AudibleAlarm CTDB ClusterMon Delay
Dummy EvmsSCC Evmsd Filesystem HealthCPU
HealthSMART ICP IPaddr IPaddr2 IPsrcaddr
IPv6addr LVM LinuxSCSI MailTo ManageRAID
ManageVE NodeUtilization Pure-FTPd Raid1 Route
SAPDatabase SAPInstance SendArp ServeRAID SphinxSearchDaemon
Squid Stateful SysInfo SystemHealth VIPArip
VirtualDomain WAS WAS6 WinPopup Xen
Xinetd anything apache asterisk clvmd
cmirrord conntrackd controld db2 drbd
eDir88 ethmonitor exportfs fio iSCSILogicalUnit
iSCSITarget ids iscsi jboss ldirectord
lxc mysql mysql-proxy named nfsserver
nginx o2cb oracle oralsnr pgsql
ping pingd portblock postfix proftpd
rsyncd rsyslog scsi2reservation sfex slapd
symlink syslog-ng tomcat varnish vmware
crm(live)# ra list stonith
apcmaster apcmastersnmp apcsmart baytech
bladehpi cyclades drac3 external/drac5
external/dracmc-telnet external/hetzner external/hmchttp external/ibmrsa
external/ibmrsa-telnet external/ipmi external/ippower9258 external/kdumpcheck
external/libvirt external/nut external/rackpdu external/riloe
external/sbd external/vcenter external/vmware external/xen0
external/xen0-ha ibmhmc ipmilan meatware
nw_rpc100s rcd_serial rps10 suicide
wti_mpc wti_nps
2: View the detailed usage of a Resource Agent
A: Via crm
crm(live)# ra meta ocf:heartbeat:apache
Manages an Apache web server instance (ocf:heartbeat:apache)
This is the resource agent for the Apache web server.
This resource agent operates both version 1.x and version 2.x Apache
servers.
The start operation ends with a loop in which monitor is
repeatedly called to make sure that the server started and that
it is operational. Hence, if the monitor operation does not
succeed within the start operation timeout, the apache resource
will end with an error status.
The monitor operation by default loads the server status page
which depends on the mod_status module and the corresponding
configuration file (usually /etc/apache2/mod_status.conf).
Make sure that the server status page works and that the access
is allowed *only* from localhost (address 127.0.0.1).
See the statusurl and testregex attributes for more details.
See also http://httpd.apache.org/
Parameters (* denotes required, [] the default):
configfile (string, [/etc/apache2/httpd.conf]): configuration file path
The full pathname of the Apache configuration file.
This file is parsed to provide defaults for various other
resource agent parameters.
httpd (string, [/usr/sbin/httpd]): httpd binary path
The full pathname of the httpd binary (optional).
.............
.............
B: Directly from the shell
Run the agent without an action to get its help text, which lists the actions it accepts:
sles11264:~ # OCF_ROOT=/usr/lib/ocf/ /usr/lib/ocf/resource.d/heartbeat/apache
usage: /usr/lib/ocf/resource.d/heartbeat/apache action
action:
start start the web server
stop stop the web server
status return the status of web server, run or down
monitor return TRUE if the web server appears to be working.
For this to be supported you must configure mod_status
and give it a server-status URL. You have to have
installed either curl or wget for this to work.
meta-data show meta data message
validate-all validate the instance parameters
C: Open the file directly and read the definition of the OCF resource
sles11264:~ # less /usr/lib/ocf/resource.d/heartbeat/apache
#!/bin/sh
#
# High-Availability Apache/IBMhttp control script
#
# apache (aka IBMhttpd)
#
# Description: starts/stops apache web servers.
#
# Author: Alan Robertson
# Sun Jiang Dong
#
# Support: [email protected]
#
# License: GNU General Public License (GPL)
#
# Copyright: (C) 2002-2005 International Business Machines
#
#
# An example usage in /etc/ha.d/haresources:
# node1 10.0.0.170 apache::/opt/IBMHTTPServer/conf/httpd.conf
# node1 10.0.0.170 IBMhttpd
#
# Our parsing of the Apache config files is very rudimentary.
# It'll work with lots of different configurations - but not every
# possible configuration.
#
# Patches are being accepted ;-)
#
# OCF parameters:
# OCF_RESKEY_configfile
# OCF_RESKEY_httpd
...................
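3: How to troubleshoot. Before heartbeat manages a resource of some type, run the resource agent by hand on the system as shown below, passing the same parameters that you would write in the configuration shown by crm configure show. If the manual run behaves correctly with those parameters, the configuration is correct.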
sles11264:~ # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/apache2/httpd.conf OCF_RESKEY_httpd=/usr/sbin/httpd2 /usr/lib/ocf/resource.d/heartbeat/apache start; echo $?
0
sles11264:~ # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/apache2/httpd.conf OCF_RESKEY_httpd=/usr/sbin/httpd2 /usr/lib/ocf/resource.d/heartbeat/apache status; echo $?
INFO: apache is running (pid 8778).
0
sles11264:~ # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/apache2/httpd.conf OCF_RESKEY_httpd=/usr/sbin/httpd2 /usr/lib/ocf/resource.d/heartbeat/apache stop; echo $?
/usr/lib/ocf/resource.d/heartbeat/apache: line 210: kill: (8778) - No such process
INFO: Killing apache PID 8778
INFO: apache stopped.
0
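Run the commands in order as above and check the return value of start, status, and stop. All three must return 0. Only then will the resource behave correctly in production: failures are detected accurately, the service is stopped, and it fails over to the peer node.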
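The numeric return values above follow the OCF resource agent convention, in which 0 means success and 7 means "not running". As a minimal sketch (the helper names ocf_rc_name and run_and_report are made up for this example and are not part of heartbeat), a small wrapper can print the code together with its symbolic name:

```shell
# Map an OCF resource agent exit code to its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "UNKNOWN($1)" ;;
    esac
}

# Run a command, then print its exit code with the OCF name.
# The command's own output is passed through unchanged.
run_and_report() {
    "$@"
    rc=$?
    echo "rc=$rc ($(ocf_rc_name "$rc"))"
    return "$rc"
}
```

For example: run_and_report env OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/apache2/httpd.conf OCF_RESKEY_httpd=/usr/sbin/httpd2 /usr/lib/ocf/resource.d/heartbeat/apache monitor. Passing the variables through env keeps them visible to the agent even though it is started from a shell function.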
Explanation of the command:
OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/apache2/httpd.conf OCF_RESKEY_httpd=/usr/sbin/httpd2 /usr/lib/ocf/resource.d/heartbeat/apache
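1: OCF_ROOT=/usr/lib/ocf
The root directory under which the resource agents live. This is the default location and normally does not change.
2: OCF_RESKEY_configfile, OCF_RESKEY_httpd
Resource parameters are passed to the agent as environment variables named OCF_RESKEY_<parameter>. The parameter names are lowercase here (configfile, httpd) and are exactly the parameters listed by
crm ra info ocf:heartbeat:apache
which usually also states which parameters are required and which are optional.
In a real deployment you have to try out these parameters yourself before writing them into the configuration. Choosing and tuning the parameters is the main part of debugging an HA setup.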
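The mapping from crm parameters to environment variables is mechanical, so it can be scripted. A minimal sketch, assuming the agent is called by its full path as in the commands above; run_ra is a hypothetical helper name, not an existing tool:

```shell
# run_ra AGENT ACTION [name=value ...]
# Prefix every name=value pair with OCF_RESKEY_ and run the agent with
# those variables in its environment.
run_ra() {
    agent=$1
    action=$2
    shift 2
    n=$#
    # Rotate through the arguments, re-appending each one with the prefix.
    while [ "$n" -gt 0 ]; do
        set -- "$@" "OCF_RESKEY_$1"
        shift
        n=$((n - 1))
    done
    # OCF_ROOT defaults to /usr/lib/ocf unless it is already set.
    env OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}" "$@" "$agent" "$action"
}
```

With this, the params line of a primitive can be pasted almost as is, for example: run_ra /usr/lib/ocf/resource.d/heartbeat/apache validate-all configfile=/etc/apache2/httpd.conf httpd=/usr/sbin/httpd2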
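Note: a service is controlled through start and stop, and its state is monitored through status or monitor. In some scenarios you need to measure how long the service really takes to start and stop, and set the op timeouts accordingly. Prefix the commands above with time to measure this and choose suitable values.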
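For a one-off measurement, prefixing the command with time is enough. When several actions need to be compared, a small wrapper can help. The sketch below rests on assumptions: time_action is a hypothetical name, it only has whole-second resolution, and the factor of 3 for headroom is an arbitrary choice rather than a heartbeat recommendation:

```shell
# Run a command, print how many whole seconds it took, and suggest a
# timeout with headroom (factor 3 is an arbitrary assumption).
time_action() {
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    echo "rc=$rc elapsed=${elapsed}s suggested_timeout=$(( (elapsed + 1) * 3 ))s"
    return "$rc"
}
```

For example: time_action env OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/apache2/httpd.conf OCF_RESKEY_httpd=/usr/sbin/httpd2 /usr/lib/ocf/resource.d/heartbeat/apache start. Measure on a loaded system as well, since start and stop times under load are what the timeouts have to cover.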
II: Troubleshooting stonith
1: List the available stonith device types
Via crm
crm(live)# ra list stonith
apcmaster apcmastersnmp apcsmart baytech
bladehpi cyclades drac3 external/drac5
external/dracmc-telnet external/hetzner external/hmchttp external/ibmrsa
external/ibmrsa-telnet external/ipmi external/ippower9258 external/kdumpcheck
external/libvirt external/nut external/rackpdu external/riloe
external/sbd external/vcenter external/vmware external/xen0
external/xen0-ha ibmhmc ipmilan meatware
nw_rpc100s rcd_serial rps10 suicide
wti_mpc wti_nps
Via the system command stonith
sles11264:~ # stonith -L
apcmaster
apcmastersnmp
apcsmart
baytech
bladehpi
cyclades
drac3
external/drac5
external/dracmc-telnet
external/hetzner
external/hmchttp
external/ibmrsa
external/ibmrsa-telnet
external/ipmi
external/ippower9258
external/kdumpcheck
external/libvirt
external/nut
external/rackpdu
external/riloe
external/sbd
external/vcenter
external/vmware
external/xen0
external/xen0-ha
ibmhmc
ipmilan
meatware
nw_rpc100s
rcd_serial
rps10
suicide
wti_mpc
wti_nps
2: View the parameters a stonith device supports
A: In crm
sles11264:~ # crm ra info stonith:ipmilan
IPMI Over LAN (stonith:ipmilan)
IPMI LAN STONITH device
Parameters (* denotes required, [] the default):
hostname* (string):
The hostname of the STONITH device
ipaddr* (string): IP Address
The IP address of the STONITH device
port* (string):
The port number to where the IPMI message is sent
auth* (string):
The authorization type of the IPMI session ("none", "straight", "md2", or "md5")
priv* (string):
The privilege level of the user ("operator" or "admin")
login* (string): Login
The username used for logging in to the STONITH device
password* (string): Password
The password used for logging in to the STONITH device
stonith-timeout (time, [60s]): How long to wait for the STONITH action to complete.
Overrides the stonith-timeout cluster property
...........
...........
B: View the parameter usage on the system
sles11264:~ # stonith -t ipmilan -h
STONITH Device: ipmilan - IPMI LAN STONITH device
For more information see http://www.intel.com/design/servers/ipmi/
List of valid parameter names for ipmilan STONITH device:
hostname
ipaddr
port
auth
priv
login
password
reset_method
For Config info [-p] syntax, give each of the above parameters in order as
the -p value.
Arguments are separated by white space.
Config file [-F] syntax is the same as -p, except # at the start of a line
denotes a comment
sles11264:~ # stonith -t external/ipmi -h
STONITH Device: external/ipmi - ipmitool based power management. Apparently, the power off
method of ipmitool is intercepted by ACPI which then makes
a regular shutdown. If case of a split brain on a two-node
it may happen that no node survives. For two-node clusters
use only the reset method.
For more information see http://ipmitool.sf.net/
List of valid parameter names for external/ipmi STONITH device:
hostname
ipaddr
userid
passwd
interface
For Config info [-p] syntax, give each of the above parameters in order as
the -p value.
Arguments are separated by white space.
Config file [-F] syntax is the same as -p, except # at the start of a line
denotes a comment
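3: Debugging stonith works the same way as for ordinary resources: run the device from the command line with the same parameters, to determine the parameters heartbeat will actually use. For example:
A: System command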
ipmitool -H 192.168.203.1 -I lanplus -U Administrator chassis power cycle
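After pressing Enter you are prompted for the password. Once it is entered correctly, the peer host is power-cycled.
B: Debug command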
stonith -t external/ipmi hostname="node1" ipaddr=192.168.203.1 userid="Administrator" passwd="123456" interface=lanplus -T reset node1
This command uses the parameters that heartbeat really passes. Compare it with the configuration:
primitive st_ilo_1 stonith:external/ipmi \
params hostname="node1" ipaddr="192.168.203.1" userid="Administrator" passwd="123456" interface="lanplus"
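So to be sure stonith is invoked correctly, debug with the stonith command and the right parameters. Command A is not equivalent. Think of it this way: heartbeat ships its own agent, which in turn calls a tool such as ipmitool, so B is built on top of A. A working does not mean that B works. Only when B works can heartbeat invoke the device correctly.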
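Because a reset really power-cycles the node, it can be useful to try a non-destructive query first. To my understanding the stonith command also accepts -S (device status) and -l (list the hosts the device controls); confirm this with stonith -h on your system before relying on it. The sketch below only prints the command line built from the params of a primitive, so it can be reviewed before it is run; stonith_cmd is a hypothetical helper name:

```shell
# stonith_cmd TYPE ACTION NODE [name=value ...]
# Print (do not run) the stonith command line for the given device type.
#   status -> -S, list -> -l, reset -> -T reset NODE
stonith_cmd() {
    type=$1
    action=$2
    node=$3
    shift 3
    case "$action" in
        status) tail="-S" ;;
        list)   tail="-l" ;;
        reset)  tail="-T reset $node" ;;
        *) echo "unknown action: $action" >&2; return 2 ;;
    esac
    echo "stonith -t $type $* $tail"
}

# Example with the parameters of st_ilo_1 from the configuration above.
stonith_cmd external/ipmi status node1 \
    hostname=node1 ipaddr=192.168.203.1 userid=Administrator passwd=123456 interface=lanplus
```

Starting with status and list, and only then moving on to reset, separates "the parameters are wrong" from "the power action fails".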
Note on IPMI: on current-generation servers, HP iLO 3, Dell DRAC 6, and IBM RSA all offer IPMI-based support for controlling and managing the host. For stonith, this mainly means power control.
Almost all servers today have a BMC built-in as a default system component, so they all offer basic IPMI based monitoring and control. However, it is the more advanced service processors which now enable system administrators to fully monitor and troubleshoot their servers. Most servers shipped in the last few years have service processors embedded within them, like IBM's RSA, Dell's DRAC or HP's iLO, and these provide extended power control, KVM and console access and monitoring and alerts.
sles11264:~ # rpm -qi ipmitool
Name : ipmitool Relocations: (not relocatable)
Version : 1.8.11 Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
Release : 0.14.1 Build Date: Fri Feb 3 23:09:52 2012
Install Date: Wed Jun 13 10:17:47 2012 Build Host: rinck
Group : System/Management Source RPM: ipmitool-1.8.11-0.14.1.src.rpm
Size : 1002790 License: BSD 3-Clause
Signature : RSA/8, Fri Feb 3 23:09:56 2012, Key ID e3a5c360307e3d54
Packager : http://bugs.opensuse.org
URL : http://ipmitool.sourceforge.net/
Summary : Utility for IPMI Control
Description :
This package contains a utility for interfacing with devices that
support the Intelligent Platform Management Interface specification.
IPMI is an open standard for machine health, inventory, and remote
power control.
This utility can communicate with IPMI-enabled devices through either a
kernel driver such as OpenIPMI or over the RMCP LAN protocol defined in
the IPMI specification. IPMIv2 adds support for encrypted LAN
communications and remote Serial-over-LAN functionality.
It provides commands for reading the Sensor Data Repository (SDR) and
displaying sensor values, displaying the contents of the System Event
Log (SEL), printing Field Replaceable Unit (FRU) information, reading
and setting LAN configuration, and chassis power control.
Distribution: SUSE Linux Enterprise 11
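Problem: in actual testing, ipmilan and external/ipmi were not equivalent. external/ipmi lets you set interface="lanplus", while ipmilan has no such parameter, so at first ipmilan could not be made to work. For the difference between lan and lanplus, see the ipmitool man page.
Addendum: after further debugging, ipmilan also worked, with the following command: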
stonith -t ipmilan hostname=node1 ipaddr=192.168.1.1 port=623 auth=md5 priv=admin login=Administrator password="123456" -T reset node1
sles11264:~ # man ipmitool
...........
LAN INTERFACE
The ipmitool lan interface communicates with the BMC over an Ethernet LAN connection using UDP under
IPv4. UDP datagrams are formatted to contain IPMI request/response messages with a IPMI session headers
and RMCP headers.
IPMI-over-LAN uses version 1 of the Remote Management Control Protocol (RMCP) to support pre-OS and
OS-absent management. RMCP is a request-response protocol delivered using UDP datagrams to port 623.
The LAN interface is an authenticated multi-session connection; messages delivered to the BMC can (and
should) be authenticated with a challenge/response protocol with either straight password/key or MD5
message-digest algorithm. ipmitool will attempt to connect with administrator privilege level as this
is required to perform chassis power functions.
You can tell ipmitool to use the lan interface with the -I lan option:
ipmitool -I lan -H <hostname> [-U <username>] [-P <password>] <command>
...........
LANPLUS INTERFACE
Like the lan interface, the lanplus interface communicates with the BMC over an Ethernet LAN connection
using UDP under IPv4. The difference is that the lanplus interface uses the RMCP+ protocol as described
in the IPMI v2.0 specification. RMCP+ allows for improved authentication and data integrity checks, as
well as encryption and the ability to carry multiple types of payloads. Generic Serial Over LAN support
requires RMCP+, so the ipmitool sol activate command requires the use of the lanplus interface.
RMCP+ session establishment uses a symmetric challenge-response protocol called RAKP (Remote Authenticated
Key-Exchange Protocol) which allows the negotiation of many options. ipmitool does not yet allow
the user to specify the value of every option, defaulting to the most obvious settings marked as
required in the v2.0 specification. Authentication and integrity HMACS are produced with SHA1, and
encryption is performed with AES-CBC-128. Role-level logins are not yet supported.
ipmitool must be linked with the OpenSSL library in order to perform the encryption functions and support
the lanplus interface. If the required packages are not found it will not be compiled in and supported.
You can tell ipmitool to use the lanplus interface with the -I lanplus option:
ipmitool -I lanplus -H <hostname> [-U <username>] [-P <password>] <command>
......
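III: log analysis and some troubleshooting commands
By default all log output goes to the messages log (/var/log/messages on SLES). It is advisable to move it into a separate file, as follows.
Edit the logging settings in: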
/etc/corosync/corosync.conf
to_logfile: no
to_syslog: yes
Change them to:
to_logfile: yes
to_syslog: no
logfile: /var/log/heartbeat.log
# rcsyslog restart
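This does separate the log, but the file cannot be rotated on its own schedule, because rotation requires restarting the service and in practice the service cannot be restarted. The log may therefore keep growing. Handle this as appropriate for your environment.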
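One possible way around the restart problem is logrotate's copytruncate directive, which copies the log and then truncates the original in place, so the daemon keeps writing to the same open file. This is a sketch, not a tested SLES configuration: the rotation schedule is an arbitrary assumption, write_logrotate_snippet is a hypothetical helper, and a few lines written between the copy and the truncate can be lost.

```shell
# Write a logrotate snippet for the separate log file to the given path.
# copytruncate avoids having to restart or signal the logging daemon.
write_logrotate_snippet() {
    target=$1
    cat > "$target" <<'EOF'
/var/log/heartbeat.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Generate into a scratch file first, review it, then copy it into
# /etc/logrotate.d/ under a name of your choice.
write_logrotate_snippet "${TMPDIR:-/tmp}/heartbeat-logrotate.conf"
```

The snippet can be checked without rotating anything by running logrotate in debug mode (logrotate -d) against the generated file.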
With the resources stopped, view the operation history and the return value of each operation:
crm_mon -o -r -1
============
Last updated: Wed Jun 20 21:26:31 2012
Stack: openais
Current DC: server1 - partition with quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ server2 server1 ]
Full list of resources:
Clone Set: SBD
Stopped: [ r_sbd:0 r_sbd:1 ]
Resource Group: Web
WebIP (ocf::heartbeat:IPaddr2): Stopped
WebSite (ocf::heartbeat:apache): Stopped
Operations:
* Node server1:
r_sbd:1: migration-threshold=1000000
+ (5) start: rc=0 (ok)
+ (6) monitor: interval=15000ms rc=0 (ok)
+ (7) stop: rc=0 (ok)
* Node server2:
r_sbd:0: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=15000ms rc=0 (ok)
+ (13) stop: rc=0 (ok)
WebIP: migration-threshold=1000000
+ (5) start: rc=0 (ok)
+ (8) monitor: interval=15000ms rc=0 (ok)
+ (12) stop: rc=0 (ok)
WebSite: migration-threshold=1000000
+ (9) start: rc=0 (ok)
+ (10) monitor: interval=10000ms rc=0 (ok)
+ (11) stop: rc=0 (ok)
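In a long history it is easy to miss a failed operation. A minimal sketch that filters the crm_mon output for operations whose rc is not 0; failed_ops is a hypothetical helper, and it assumes the rc=N format shown in the output above:

```shell
# Read `crm_mon -o -r -1` output on stdin and print only the operation
# lines whose return code is non-zero.
failed_ops() {
    awk '{
        if (match($0, /rc=[0-9]+/)) {
            # Skip the 3 characters of "rc=" and force a numeric compare.
            rc = substr($0, RSTART + 3, RLENGTH - 3) + 0
            if (rc != 0) print
        }
    }'
}
```

Usage: crm_mon -o -r -1 | failed_ops. For the history shown above it prints nothing, because every operation returned 0. The filter drops the node and resource header lines, so look the matching operation up in the full output to see where it ran.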
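The pssh command is useful for checking the state of every host when there are several of them. It requires SSH trust (key-based login) between the hosts. Run it against the hosts of the cluster group: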
sles11264:~/bin # pssh -i -h list "ifconfig eth0"
[1] 16:05:26 [SUCCESS] 192.168.203.20
eth0 Link encap:Ethernet HWaddr 00:0C:29:0E:39:A9
inet addr:192.168.203.20 Bcast:192.168.203.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe0e:39a9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:643 errors:0 dropped:0 overruns:0 frame:0
TX packets:93 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:179278 (175.0 Kb) TX bytes:10288 (10.0 Kb)
[2] 16:05:26 [SUCCESS] 192.168.203.10
eth0 Link encap:Ethernet HWaddr 00:0C:29:DF:B4:A6
inet addr:192.168.203.10 Bcast:192.168.203.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fedf:b4a6/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7296 errors:0 dropped:0 overruns:0 frame:0
TX packets:5638 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:802061 (783.2 Kb) TX bytes:1269019 (1.2 Mb)
[3] 16:05:26 [SUCCESS] 192.168.203.50
eth0 Link encap:Ethernet HWaddr 00:0C:29:43:1D:23
inet addr:192.168.203.50 Bcast:192.168.203.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe43:1d23/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:702 errors:0 dropped:0 overruns:0 frame:0
TX packets:55 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:188362 (183.9 Kb) TX bytes:8740 (8.5 Kb)
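pssh fails on any host where key-based login is not set up. The sketch below checks each host in the same list file first. check_hosts is a hypothetical helper, it assumes one plain host name or address per line, and the SSH variable exists only so the function can be exercised without a network. BatchMode=yes makes ssh fail instead of prompting for a password:

```shell
# Command used to reach a host; override SSH for dry runs.
SSH=${SSH:-ssh}

# check_hosts FILE
# Print "OK" or "FAIL" for every host listed in FILE, one host per line.
check_hosts() {
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        # </dev/null keeps ssh from swallowing the rest of the host list.
        if $SSH -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done < "$1"
}
```

Usage: check_hosts list, run in the directory that holds the list file used with pssh -h.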
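Author: kook (to discuss technical questions raised on this blog, contact me via the MSN or Gmail address below)
Contact: (MSN: kook#live.com) (Google Talk: kookliu)
No copyright: GNU. When reposting, please note that the "reposter" owes me a meal, to be redeemed when we meet! Thanks for your cooperation!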