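After I first finished the configuration, disconnecting the heartbeat NIC did not make the application fail over, and for a while I assumed my configuration was wrong. Once I switched vnet3 to bridge directly to a physical NIC, the problem went away. Most likely there was a problem with packet delivery between the two nodes over vnet3.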
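- Prerequisites:
- 1. Environment configuration
- 2. Hostnames, yum, ssh
- 1. Install heartbeat.
- # yum install -y heartbeat* # Run this twice; otherwise some of the packages turn out not to be installed.
- # rpm -qa | grep heartbeat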
- heartbeat-gui-2.1.3-3.el5.centos
- heartbeat-2.1.3-3.el5.centos
- heartbeat-stonith-2.1.3-3.el5.centos
- heartbeat-devel-2.1.3-3.el5.centos
- heartbeat-ldirectord-2.1.3-3.el5.centos
- heartbeat-pils-2.1.3-3.el5.centos
- Copy the relevant configuration files:
- # cp /usr/share/doc/heartbeat-2.1.3/ha.cf /etc/ha.d/ # ha.cf: the main HA configuration file
- # cp /usr/share/doc/heartbeat-2.1.3/haresources /etc/ha.d/ # haresources: the resource file
- # cp /usr/share/doc/heartbeat-2.1.3/authkeys /etc/ha.d/ # authkeys: the authentication file used between HA nodes
- # yum install -y httpd
- # vim /etc/ha.d/ha.cf
- debugfile /var/log/ha-debug
- logfile /var/log/ha-log
- logfacility local0
- keepalive 2
- deadtime 30
- warntime 10
- initdead 120
- udpport 694
- ucast eth1 1.1.1.2 # heartbeat link: unicast to the peer's address over eth1
- auto_failback on
- node ha1
- node ha2
- ping 172.16.1.1 172.16.1.11 # the gateway and the other node's IP
- respawn hacluster /usr/lib/heartbeat/ipfail
- deadping 30
- apiauth ipfail uid=hacluster
- use_logd yes
- conn_logd_time 60
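The timers in this ha.cf are easier to reason about as multiples of the heartbeat interval. The following is a minimal sketch that only does the arithmetic on the values listed above; the variable names are mine, and the "at least twice deadtime" check reflects the advice given in the comments of the sample ha.cf rather than anything heartbeat enforces.

```shell
# Values copied from the ha.cf above, all in seconds.
keepalive=2
deadtime=30
warntime=10
initdead=120

# Roughly how many consecutive heartbeats must go missing
# before each threshold is reached.
missed_before_warn=$((warntime / keepalive))
missed_before_dead=$((deadtime / keepalive))

echo "late-heartbeat warning after about $missed_before_warn missed heartbeats"
echo "peer declared dead after about $missed_before_dead missed heartbeats"

# The sample ha.cf suggests initdead should be at least twice deadtime,
# because the network can take a while to come up after a reboot.
if [ "$initdead" -ge $((2 * deadtime)) ]; then
    initdead_ok=yes
else
    initdead_ok=no
fi
echo "initdead >= 2 * deadtime: $initdead_ok"
```

With these settings a node has to stay silent for 30 seconds before its peer takes over, so a failover test needs at least that much patience.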
- # cat authkeys # defines the authentication keys
- auth 1
- 1 crc
- ================
- heartbeat[8404]: 2011/07/26_05:02:48 ERROR: Bad permissions on keyfile [/etc/ha.d/authkeys], 600 recommended.
- heartbeat[8404]: 2011/07/26_05:02:48 ERROR: Authentication configuration error.
- heartbeat[8404]: 2011/07/26_05:02:48 ERROR: Configuration error, heartbeat not started.
- # chmod 600 /etc/ha.d/authkeys
- =================
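The error above comes purely from the file mode, so it can be avoided by creating authkeys with restrictive permissions from the start. Below is a hedged sketch of one way to do that. It makes two assumptions that are not in the original setup: it uses the sha1 method with a random shared secret instead of crc (crc only detects corrupted packets and involves no secret), and it writes into a scratch directory instead of /etc/ha.d so it is safe to try out. The same file must end up on both nodes.

```shell
# Scratch location for the demo; on a real node the target
# would be /etc/ha.d/authkeys.
workdir=$(mktemp -d)
keyfile="$workdir/authkeys"

# Derive a random shared secret (hypothetical choice of generator).
secret=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | md5sum | cut -d' ' -f1)

# umask 077 makes sure the file is never group/world readable,
# even for the short moment before chmod runs.
( umask 077; printf 'auth 1\n1 sha1 %s\n' "$secret" > "$keyfile" )
chmod 600 "$keyfile"

# Show the resulting mode; heartbeat expects 600.
stat -c '%a %n' "$keyfile"
```

On an isolated, dedicated heartbeat link the crc method from the original file is usually considered acceptable; sha1 matters more when heartbeats cross a shared network.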
- # cat /etc/ha.d/haresources # HA resource configuration
- ha1 IPaddr::172.16.1.100/24/eth0:0 httpd
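The haresources line packs several things into one row: the preferred node, the virtual IP with its prefix length and interface alias, and the init script to start afterwards. As an illustration only, this sketch splits the line above into its parts with plain shell parameter expansion; the variable names are mine.

```shell
# The resource line from haresources above.
line='ha1 IPaddr::172.16.1.100/24/eth0:0 httpd'

# Split on whitespace: preferred node, IP resource, service script.
set -- $line
node=$1
ipres=$2
service=$3

# Strip the "IPaddr::" resource-script prefix, then split on "/".
spec=${ipres#IPaddr::}      # 172.16.1.100/24/eth0:0
vip=${spec%%/*}             # 172.16.1.100
rest=${spec#*/}             # 24/eth0:0
prefix=${rest%%/*}          # 24
iface=${rest#*/}            # eth0:0

echo "preferred node: $node"
echo "virtual IP:     $vip/$prefix on $iface"
echo "service:        $service"
```

Resources on a line are started left to right and stopped in reverse, which matches the log later in this post where httpd is stopped before the IPaddr resource is taken down.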
- # /etc/init.d/heartbeat start
- logd is already running
- Starting High-Availability services:
- 2011/07/26_05:05:15 INFO: Resource is stopped
- [ OK ]
- # The ha1 and ha2 configurations differ only in the ucast value and the IPs being pinged.
- #++++++++++++++++++++++++++++++++++++++++++++++++++++++
- #
- #++++++++++++++++++++++++++++++++++++++++++++++++++++++
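- The log below shows what happens when the heartbeat link is disconnected and then plugged back in:
- # The heartbeat link is disconnected on one side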
heartbeat[7043]: 2011/07/26_13:53:40 WARN: node ha2.example.com: is dead
heartbeat[7043]: 2011/07/26_13:53:40 info: Dead node ha2.example.com gave up resources.
heartbeat[7043]: 2011/07/26_13:53:40 info: Link ha2.example.com:eth1 dead.
ipfail[7069]: 2011/07/26_13:53:40 info: Status update: Node ha2.example.com now has status dead
ipfail[7069]: 2011/07/26_13:53:42 info: NS: We are still alive!
ipfail[7069]: 2011/07/26_13:53:42 info: Link Status update: Link ha2.example.com/eth1 now has status dead
ipfail[7069]: 2011/07/26_13:53:44 info: Asking other side for ping node count.
ipfail[7069]: 2011/07/26_13:53:44 info: Checking remote count of ping nodes.
- At this point, check the IP addresses on both nodes with ip addr: the VIP shows up on both machines. Split brain!
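The manual check described here (look at ip addr on both nodes and see whether the VIP appears twice) can be sketched as a small script. This is only an illustration: the two sample strings below are made-up `ip -o -4 addr show` output, and 172.16.1.10 as ha1's own address is an assumption, since the post does not state it.

```shell
vip=172.16.1.100

# Succeeds if the `ip -o -4 addr show` output on stdin contains the VIP.
has_vip() {
    grep -qF "inet $vip/"
}

# Hypothetical captured output from each node during the partition.
ha1_sample='2: eth0    inet 172.16.1.10/24 brd 172.16.1.255 scope global eth0
2: eth0    inet 172.16.1.100/24 brd 172.16.1.255 scope global secondary eth0:0'
ha2_sample='2: eth0    inet 172.16.1.11/24 brd 172.16.1.255 scope global eth0
2: eth0    inet 172.16.1.100/24 brd 172.16.1.255 scope global secondary eth0:0'

# Count how many nodes currently hold the VIP.
holders=0
for sample in "$ha1_sample" "$ha2_sample"; do
    if printf '%s\n' "$sample" | has_vip; then
        holders=$((holders + 1))
    fi
done

if [ "$holders" -gt 1 ]; then
    echo "split brain: VIP $vip is active on $holders nodes"
fi
```

Against live machines the samples would be replaced by the real command output, for example by piping `ssh ha1 ip -o -4 addr show` into has_vip.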
# The second node comes back
heartbeat[7043]: 2011/07/26_13:56:09 CRIT: Cluster node ha2.example.com returning after partition.
heartbeat[7043]: 2011/07/26_13:56:09 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[7043]: 2011/07/26_13:56:09 WARN: Deadtime value may be too small.
heartbeat[7043]: 2011/07/26_13:56:09 info: See FAQ for information on tuning deadtime.
heartbeat[7043]: 2011/07/26_13:56:09 info: URL: http://linux-ha.org/FAQ#heavy_load
heartbeat[7043]: 2011/07/26_13:56:09 info: Link ha2.example.com:eth1 up.
heartbeat[7043]: 2011/07/26_13:56:09 WARN: Late heartbeat: Node ha2.example.com: interval 104930 ms
ipfail[7069]: 2011/07/26_13:56:09 info: Link Status update: Link ha2.example.com/eth1 now has status up
heartbeat[7043]: 2011/07/26_13:56:09 info: Status update for node ha2.example.com: status active
ipfail[7069]: 2011/07/26_13:56:09 info: Status update: Node ha2.example.com now has status active
harc[7916]: 2011/07/26_13:56:09 info: Running /etc/ha.d/rc.d/status status
heartbeat[7043]: 2011/07/26_13:56:12 info: Heartbeat shutdown in progress. (7043)
# Node 2's heartbeat NIC is seen to be alive again, so heartbeat restarts.
heartbeat[7932]: 2011/07/26_13:56:13 info: Giving up all HA resources.
ResourceManager[7945]: 2011/07/26_13:56:13 info: Releasing resource group: ha1.example.com IPaddr::172.16.1.100/24/eth0:0 httpd
ResourceManager[7945]: 2011/07/26_13:56:13 info: Running /etc/init.d/httpd stop
# The resource manager has stopped the application that was running
ResourceManager[7945]: 2011/07/26_13:56:13 info: Running /etc/ha.d/resource.d/IPaddr 172.16.1.100/24/eth0:0 stop
IPaddr[8037]: 2011/07/26_13:56:13 INFO: ifconfig eth0:0 down
IPaddr[8008]: 2011/07/26_13:56:13 INFO: Success
# The corresponding VIP has been taken down as well
ResourceManager[8067]: 2011/07/26_13:56:13 info: Releasing resource group: ha2.example.com IPaddr::172.16.1.101/24/eth0:1 vsftpd
# Release the ftp service that originally belonged to ha2.example.com
ResourceManager[8067]: 2011/07/26_13:56:13 info: Running /etc/init.d/vsftpd stop
ResourceManager[8067]: 2011/07/26_13:56:14 info: Running /etc/ha.d/resource.d/IPaddr 172.16.1.101/24/eth0:1 stop
IPaddr[8161]: 2011/07/26_13:56:14 INFO: ifconfig eth0:1 down
# The service is stopped, then the interface alias is taken down.
IPaddr[8132]: 2011/07/26_13:56:14 INFO: Success
heartbeat[7932]: 2011/07/26_13:56:14 info: All HA resources relinquished.
heartbeat[7043]: 2011/07/26_13:56:16 info: killing /usr/lib/heartbeat/ipfail process group 7069 with signal 15
heartbeat[7043]: 2011/07/26_13:56:17 info: Received shutdown notice from 'ha2.example.com'.
heartbeat[7043]: 2011/07/26_13:56:17 info: Resource takeover cancelled - shutdown in progress.
heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBFIFO process 7045 with signal 15
heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBWRITE process 7046 with signal 15
heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBREAD process 7047 with signal 15
heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBWRITE process 7048 with signal 15
heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBREAD process 7049 with signal 15
heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7049 exited. 5 remaining
heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7047 exited. 4 remaining
heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7046 exited. 3 remaining
heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7048 exited. 2 remaining
heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7045 exited. 1 remaining
heartbeat[7043]: 2011/07/26_13:56:19 info: ha1.example.com Heartbeat shutdown complete.
# The heartbeat service has been shut down
heartbeat[7043]: 2011/07/26_13:56:19 info: Heartbeat restart triggered.
heartbeat[7043]: 2011/07/26_13:56:19 info: Restarting heartbeat.
heartbeat[7043]: 2011/07/26_13:56:19 info: Performing heartbeat restart exec.
heartbeat[7043]: 2011/07/26_13:56:30 info: Version 2 support: false
heartbeat[7043]: 2011/07/26_13:56:30 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[7043]: 2011/07/26_13:56:30 info: **************************
heartbeat[7043]: 2011/07/26_13:56:30 info: Configuration validated. Starting heartbeat 2.1.3
heartbeat[8191]: 2011/07/26_13:56:30 info: heartbeat: version 2.1.3
heartbeat[8191]: 2011/07/26_13:56:30 info: Heartbeat generation: 1311635912
heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: bound send socket to device: eth1
heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: bound receive socket to device: eth1
heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: started on port 694 interface eth1 to 10.1.1.2
heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ping group heartbeat started.
heartbeat[8191]: 2011/07/26_13:56:30 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[8191]: 2011/07/26_13:56:30 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[8191]: 2011/07/26_13:56:30 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[8191]: 2011/07/26_13:56:30 info: Local status now set to: 'up'
heartbeat[8191]: 2011/07/26_13:56:32 info: Link group1:group1 up.
heartbeat[8191]: 2011/07/26_13:56:32 info: Status update for node group1: status ping
heartbeat[8191]: 2011/07/26_13:56:33 info: Link ha2.example.com:eth1 up.
heartbeat[8191]: 2011/07/26_13:56:33 info: Status update for node ha2.example.com: status up
harc[8199]: 2011/07/26_13:56:33 info: Running /etc/ha.d/rc.d/status status
heartbeat[8191]: 2011/07/26_13:56:33 info: Comm_now_up(): updating status to active
heartbeat[8191]: 2011/07/26_13:56:33 info: Local status now set to: 'active'
heartbeat[8191]: 2011/07/26_13:56:33 info: Starting child client "/usr/lib/heartbeat/ipfail" (498,496)
heartbeat[8216]: 2011/07/26_13:56:33 info: Starting "/usr/lib/heartbeat/ipfail" as uid 498 gid 496 (pid 8216)
heartbeat[8191]: 2011/07/26_13:56:34 info: Status update for node ha2.example.com: status active
harc[8219]: 2011/07/26_13:56:34 info: Running /etc/ha.d/rc.d/status status
ipfail[8216]: 2011/07/26_13:56:40 info: Status update: Node ha2.example.com now has status active
# Check the status of the other node
ipfail[8216]: 2011/07/26_13:56:43 info: Asking other side for ping node count.
ipfail[8216]: 2011/07/26_13:56:46 info: No giveup timer to abort.
heartbeat[8191]: 2011/07/26_13:56:50 info: local resource transition completed.
heartbeat[8191]: 2011/07/26_13:56:50 info: Initial resource acquisition complete (T_RESOURCES(us))
heartbeat[8191]: 2011/07/26_13:56:50 info: remote resource transition completed.
IPaddr[8271]: 2011/07/26_13:56:51 INFO: Resource is stopped
heartbeat[8235]: 2011/07/26_13:56:51 info: Local Resource acquisition completed.
harc[8324]: 2011/07/26_13:56:51 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[8324]: 2011/07/26_13:56:51 received ip-request-resp IPaddr::172.16.1.100/24/eth0:0 OK yes
ResourceManager[8345]: 2011/07/26_13:56:51 info: Acquiring resource group: ha1.example.com IPaddr::172.16.1.100/24/eth0:0 httpd
IPaddr[8372]: 2011/07/26_13:56:52 INFO: Resource is stopped
# Resource information has been obtained
ResourceManager[8345]: 2011/07/26_13:56:53 info: Running /etc/ha.d/resource.d/IPaddr 172.16.1.100/24/eth0:0 start
IPaddr[8470]: 2011/07/26_13:56:54 INFO: Using calculated netmask for 172.16.1.100: 255.255.255.0
IPaddr[8470]: 2011/07/26_13:56:54 INFO: eval ifconfig eth0:0 172.16.1.100 netmask 255.255.255.0 broadcast 172.16.1.255
IPaddr[8441]: 2011/07/26_13:56:54 INFO: Success
# The VIP and its IP address have been acquired
ResourceManager[8345]: 2011/07/26_13:56:54 info: Running /etc/init.d/httpd start
- The service is back to normal! This is the complete log.
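One line in this log is worth a closer look: the "Late heartbeat" warning reports the gap since the last heartbeat from ha2, and comparing it with deadtime shows why ha2 had already been declared dead. The sketch below pulls the interval out of that one log line with sed and compares it against the deadtime of 30 seconds from ha.cf; the variable names are mine.

```shell
# The warning line copied from the log above.
log_line='heartbeat[7043]: 2011/07/26_13:56:09 WARN: Late heartbeat: Node ha2.example.com: interval 104930 ms'

# Extract the interval in milliseconds.
interval_ms=$(printf '%s\n' "$log_line" |
    sed -n 's/.*Late heartbeat: .*interval \([0-9][0-9]*\) ms.*/\1/p')

# deadtime from ha.cf is 30 seconds.
deadtime_ms=$((30 * 1000))

if [ "$interval_ms" -gt "$deadtime_ms" ]; then
    echo "gap of ${interval_ms} ms exceeds deadtime of ${deadtime_ms} ms"
fi
```

A gap this far beyond deadtime points to a real partition. The "Deadtime value may be too small" hint that heartbeat prints next to it is aimed mainly at slow or heavily loaded nodes.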
Related post, a summary of dual heartbeat links and my own understanding of HA: http://myhat.blog.51cto.com/391263/623546