[Image 1: An HA test]

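After I finished the configuration, disconnecting the heartbeat NIC did not make the application fail over, and for a while I thought my configuration was wrong. Then I found that switching vnet3 to bridge directly to the physical NIC made the problem go away. Most likely, packets sent between the two nodes over vnet3 were not being delivered properly.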

   
   
   
   
Prerequisites:
1. Environment configuration
2. Hostnames, yum, ssh

1. Install heartbeat.
# yum install -y heartbeat*     # Run this twice; otherwise some of the packages turn out not to be installed.

# rpm -qa | grep heartbeat
heartbeat-gui-2.1.3-3.el5.centos
heartbeat-2.1.3-3.el5.centos
heartbeat-stonith-2.1.3-3.el5.centos
heartbeat-devel-2.1.3-3.el5.centos
heartbeat-ldirectord-2.1.3-3.el5.centos
heartbeat-pils-2.1.3-3.el5.centos

Copy the relevant configuration files:
# cp /usr/share/doc/heartbeat-2.1.3/ha.cf /etc/ha.d/     # ha.cf: the main HA configuration file
# cp /usr/share/doc/heartbeat-2.1.3/haresources /etc/ha.d/  # haresources: the resource file
# cp /usr/share/doc/heartbeat-2.1.3/authkeys /etc/ha.d/   # authkeys: authentication between the HA nodes

# yum install -y httpd

# vim /etc/ha.d/ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport 694
ucast eth1 1.1.1.2         # heartbeat link
auto_failback on
node    ha1
node    ha2
ping 172.16.1.1  172.16.1.11 # the gateway and the other node's IP
respawn hacluster /usr/lib/heartbeat/ipfail
deadping 30
apiauth ipfail uid=hacluster
use_logd yes
conn_logd_time 60

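The timing directives above depend on each other: keepalive is the heartbeat interval, warntime is when a late heartbeat gets logged, deadtime is when the peer is declared dead, and the sample ha.cf suggests initdead be at least twice deadtime. A minimal sketch of a sanity check follows. `check_ha_timing` is a hypothetical helper of my own, not something heartbeat ships, and it assumes all values are plain seconds with no `ms` suffix.

```shell
#!/bin/sh
# Sketch only: check_ha_timing is a hypothetical helper, not part of heartbeat.
# Assumes every timing value in the file is given in whole seconds.
check_ha_timing() {
    awk '
        # Record the value of each timing directive (commented-out lines do not match).
        $1 ~ /^(keepalive|warntime|deadtime|initdead)$/ { v[$1] = $2 + 0 }
        END {
            rc = 0
            if (!(v["keepalive"] < v["warntime"])) { print "keepalive should be below warntime"; rc = 1 }
            if (!(v["warntime"] < v["deadtime"]))  { print "warntime should be below deadtime"; rc = 1 }
            if (v["initdead"] < 2 * v["deadtime"]) { print "initdead should be at least 2 x deadtime"; rc = 1 }
            exit rc
        }
    ' "$1"
}
```

Running `check_ha_timing /etc/ha.d/ha.cf` on either node prints nothing and returns 0 for the values used here (2, 10, 30, 120).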
# cat authkeys         # defines the authentication keys
auth 1
1 crc
================
heartbeat[8404]: 2011/07/26_05:02:48 ERROR: Bad permissions on keyfile [/etc/ha.d/authkeys], 600 recommended.
heartbeat[8404]: 2011/07/26_05:02:48 ERROR: Authentication configuration error.
heartbeat[8404]: 2011/07/26_05:02:48 ERROR: Configuration error, heartbeat not started.

# chmod 600 /etc/ha.d/authkeys
=================
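Since heartbeat refuses to start when the key file is too permissive, it can be worth checking the mode before starting the service. A small sketch, assuming GNU `stat` (as on CentOS); `authkeys_mode_ok` is a hypothetical helper name of mine:

```shell
#!/bin/sh
# Sketch only: authkeys_mode_ok is a hypothetical helper, not part of heartbeat.
authkeys_mode_ok() {
    # stat -c %a prints the permission bits in octal (GNU coreutils).
    [ "$(stat -c %a "$1")" = "600" ]
}
```

For example, `authkeys_mode_ok /etc/ha.d/authkeys || chmod 600 /etc/ha.d/authkeys` fixes the mode only when it is wrong.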
# cat /etc/ha.d/haresources       # configure the HA resources
ha1     IPaddr::172.16.1.100/24/eth0:0 httpd

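In the haresources line, `::` separates the resource script name (IPaddr) from its argument, and the argument itself reads as address/prefix length/interface alias. To make that layout explicit, here is a sketch that splits such an entry with plain shell parameter expansion. `parse_ipaddr_resource` is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Sketch only: parse_ipaddr_resource is a hypothetical helper, not a heartbeat tool.
parse_ipaddr_resource() {
    spec=${1#IPaddr::}      # drop the "IPaddr::" script-name prefix
    ip=${spec%%/*}          # everything before the first "/"
    rest=${spec#*/}         # everything after the first "/"
    prefix=${rest%%/*}      # prefix length (netmask bits)
    iface=${rest#*/}        # interface alias the VIP is bound to
    printf 'ip=%s prefix=%s iface=%s\n' "$ip" "$prefix" "$iface"
}
```

Calling `parse_ipaddr_resource IPaddr::172.16.1.100/24/eth0:0` prints `ip=172.16.1.100 prefix=24 iface=eth0:0`.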
# /etc/init.d/heartbeat start
logd is already running
Starting High-Availability services:
2011/07/26_05:05:15 INFO:  Resource is stopped
[  OK  ]

# The only differences between the ha1 and ha2 configurations are the ucast value and the IPs being pinged.
#++++++++++++++++++++++++++++++++++++++++++++++++++++++
#
#++++++++++++++++++++++++++++++++++++++++++++++++++++++
Below is the log from unplugging the heartbeat link and then plugging it back in:
# Disconnect the heartbeat link on one side
    heartbeat[7043]: 2011/07/26_13:53:40 WARN: node ha2.example.com: is dead
    heartbeat[7043]: 2011/07/26_13:53:40 info: Dead node ha2.example.com gave up resources.
    heartbeat[7043]: 2011/07/26_13:53:40 info: Link ha2.example.com:eth1 dead.  
    ipfail[7069]: 2011/07/26_13:53:40 info: Status update: Node ha2.example.com now has status dead
    ipfail[7069]: 2011/07/26_13:53:42 info: NS: We are still alive!
    ipfail[7069]: 2011/07/26_13:53:42 info: Link Status update: Link ha2.example.com/eth1 now has status dead
    ipfail[7069]: 2011/07/26_13:53:44 info: Asking other side for ping node count.
    ipfail[7069]: 2011/07/26_13:53:44 info: Checking remote count of ping nodes.
At this point, run ip addr on both nodes and compare the addresses: the VIP shows up on both machines. Split-brain!
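The manual check can be scripted. The sketch below reads the one-line-per-address output of `ip -o -4 addr show` on stdin and reports whether a given VIP is configured. `has_vip` is a hypothetical helper name, and the VIP is simply passed in as an argument.

```shell
#!/bin/sh
# Sketch only: has_vip is a hypothetical helper.
# Reads `ip -o -4 addr show` output on stdin; $1 is the VIP to look for.
has_vip() {
    # Fixed-string match on "inet <vip>/" so that 172.16.1.10 does not match 172.16.1.100.
    grep -qF "inet $1/"
}
```

Run `ip -o -4 addr show | has_vip 172.16.1.100 && echo "VIP is up here"` on ha1 and ha2; if both nodes print the message, the cluster is in split-brain.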

# The second node comes back
    heartbeat[7043]: 2011/07/26_13:56:09 CRIT: Cluster node ha2.example.com returning after partition.
    heartbeat[7043]: 2011/07/26_13:56:09 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
    heartbeat[7043]: 2011/07/26_13:56:09 WARN: Deadtime value may be too small.
    heartbeat[7043]: 2011/07/26_13:56:09 info: See FAQ for information on tuning deadtime.
    heartbeat[7043]: 2011/07/26_13:56:09 info: URL: http://linux-ha.org/FAQ#heavy_load
    heartbeat[7043]: 2011/07/26_13:56:09 info: Link ha2.example.com:eth1 up.
    heartbeat[7043]: 2011/07/26_13:56:09 WARN: Late heartbeat: Node ha2.example.com: interval 104930 ms
    ipfail[7069]: 2011/07/26_13:56:09 info: Link Status update: Link ha2.example.com/eth1 now has status up
    heartbeat[7043]: 2011/07/26_13:56:09 info: Status update for node ha2.example.com: status active
    ipfail[7069]: 2011/07/26_13:56:09 info: Status update: Node ha2.example.com now has status active
    harc[7916]:     2011/07/26_13:56:09 info: Running /etc/ha.d/rc.d/status status
    heartbeat[7043]: 2011/07/26_13:56:12 info: Heartbeat shutdown in progress. (7043)
    # Node 2's heartbeat NIC is detected as alive again, so heartbeat restarts.
    heartbeat[7932]: 2011/07/26_13:56:13 info: Giving up all HA resources.
    ResourceManager[7945]:  2011/07/26_13:56:13 info: Releasing resource group: ha1.example.com IPaddr::172.16.1.100/24/eth0:0 httpd
    ResourceManager[7945]:  2011/07/26_13:56:13 info: Running /etc/init.d/httpd  stop
    # The resource manager stops the application that was running
    ResourceManager[7945]:  2011/07/26_13:56:13 info: Running /etc/ha.d/resource.d/IPaddr 172.16.1.100/24/eth0:0 stop
    IPaddr[8037]:   2011/07/26_13:56:13 INFO: ifconfig eth0:0 down
    IPaddr[8008]:   2011/07/26_13:56:13 INFO:  Success
    # The corresponding VIP is taken down as well
    ResourceManager[8067]:  2011/07/26_13:56:13 info: Releasing resource group: ha2.example.com IPaddr::172.16.1.101/24/eth0:1 vsftpd
    # Release the FTP service that originally belonged to ha2.example.com
    ResourceManager[8067]:  2011/07/26_13:56:13 info: Running /etc/init.d/vsftpd  stop
    ResourceManager[8067]:  2011/07/26_13:56:14 info: Running /etc/ha.d/resource.d/IPaddr 172.16.1.101/24/eth0:1 stop
    IPaddr[8161]:   2011/07/26_13:56:14 INFO: ifconfig eth0:1 down
    # Stop the service, then bring down the interface alias.
    IPaddr[8132]:   2011/07/26_13:56:14 INFO:  Success
    heartbeat[7932]: 2011/07/26_13:56:14 info: All HA resources relinquished.
    heartbeat[7043]: 2011/07/26_13:56:16 info: killing /usr/lib/heartbeat/ipfail process group 7069 with signal 15
    heartbeat[7043]: 2011/07/26_13:56:17 info: Received shutdown notice from 'ha2.example.com'.
    heartbeat[7043]: 2011/07/26_13:56:17 info: Resource takeover cancelled - shutdown in progress.
    heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBFIFO process 7045 with signal 15
    heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBWRITE process 7046 with signal 15
    heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBREAD process 7047 with signal 15
    heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBWRITE process 7048 with signal 15
    heartbeat[7043]: 2011/07/26_13:56:19 info: killing HBREAD process 7049 with signal 15
    heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7049 exited. 5 remaining
    heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7047 exited. 4 remaining
    heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7046 exited. 3 remaining
    heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7048 exited. 2 remaining
    heartbeat[7043]: 2011/07/26_13:56:19 info: Core process 7045 exited. 1 remaining
    heartbeat[7043]: 2011/07/26_13:56:19 info: ha1.example.com Heartbeat shutdown complete.
    # The heartbeat service has been shut down
    heartbeat[7043]: 2011/07/26_13:56:19 info: Heartbeat restart triggered.
    heartbeat[7043]: 2011/07/26_13:56:19 info: Restarting heartbeat.
    heartbeat[7043]: 2011/07/26_13:56:19 info: Performing heartbeat restart exec.
    heartbeat[7043]: 2011/07/26_13:56:30 info: Version 2 support: false
    heartbeat[7043]: 2011/07/26_13:56:30 WARN: Logging daemon is disabled --enabling logging daemon is recommended
    heartbeat[7043]: 2011/07/26_13:56:30 info: **************************
    heartbeat[7043]: 2011/07/26_13:56:30 info: Configuration validated. Starting heartbeat 2.1.3
    heartbeat[8191]: 2011/07/26_13:56:30 info: heartbeat: version 2.1.3
    heartbeat[8191]: 2011/07/26_13:56:30 info: Heartbeat generation: 1311635912
    heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
    heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: bound send socket to device: eth1
    heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: bound receive socket to device: eth1
    heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ucast: started on port 694 interface eth1 to 10.1.1.2
    heartbeat[8191]: 2011/07/26_13:56:30 info: glib: ping group heartbeat started.
    heartbeat[8191]: 2011/07/26_13:56:30 info: G_main_add_TriggerHandler: Added signal manual handler
    heartbeat[8191]: 2011/07/26_13:56:30 info: G_main_add_TriggerHandler: Added signal manual handler
    heartbeat[8191]: 2011/07/26_13:56:30 info: G_main_add_SignalHandler: Added signal handler for signal 17
    heartbeat[8191]: 2011/07/26_13:56:30 info: Local status now set to: 'up'
    heartbeat[8191]: 2011/07/26_13:56:32 info: Link group1:group1 up.
    heartbeat[8191]: 2011/07/26_13:56:32 info: Status update for node group1: status ping
    heartbeat[8191]: 2011/07/26_13:56:33 info: Link ha2.example.com:eth1 up.
    heartbeat[8191]: 2011/07/26_13:56:33 info: Status update for node ha2.example.com: status up
    harc[8199]:     2011/07/26_13:56:33 info: Running /etc/ha.d/rc.d/status status
    heartbeat[8191]: 2011/07/26_13:56:33 info: Comm_now_up(): updating status to active
    heartbeat[8191]: 2011/07/26_13:56:33 info: Local status now set to: 'active'
    heartbeat[8191]: 2011/07/26_13:56:33 info: Starting child client "/usr/lib/heartbeat/ipfail" (498,496)
    heartbeat[8216]: 2011/07/26_13:56:33 info: Starting "/usr/lib/heartbeat/ipfail" as uid 498  gid 496 (pid 8216)
    heartbeat[8191]: 2011/07/26_13:56:34 info: Status update for node ha2.example.com: status active
    harc[8219]:     2011/07/26_13:56:34 info: Running /etc/ha.d/rc.d/status status
    ipfail[8216]: 2011/07/26_13:56:40 info: Status update: Node ha2.example.com now has status active
    # Check the other node's status
    ipfail[8216]: 2011/07/26_13:56:43 info: Asking other side for ping node count.
    ipfail[8216]: 2011/07/26_13:56:46 info: No giveup timer to abort.
    heartbeat[8191]: 2011/07/26_13:56:50 info: local resource transition completed.
    heartbeat[8191]: 2011/07/26_13:56:50 info: Initial resource acquisition complete (T_RESOURCES(us))
    heartbeat[8191]: 2011/07/26_13:56:50 info: remote resource transition completed.
    IPaddr[8271]:   2011/07/26_13:56:51 INFO:  Resource is stopped
    heartbeat[8235]: 2011/07/26_13:56:51 info: Local Resource acquisition completed.
    harc[8324]:     2011/07/26_13:56:51 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
    ip-request-resp[8324]:  2011/07/26_13:56:51 received ip-request-resp IPaddr::172.16.1.100/24/eth0:0 OK yes
    ResourceManager[8345]:  2011/07/26_13:56:51 info: Acquiring resource group: ha1.example.com IPaddr::172.16.1.100/24/eth0:0 httpd
    IPaddr[8372]:   2011/07/26_13:56:52 INFO:  Resource is stopped
    # Acquire the resource group
    ResourceManager[8345]:  2011/07/26_13:56:53 info: Running /etc/ha.d/resource.d/IPaddr 172.16.1.100/24/eth0:0 start
    IPaddr[8470]:   2011/07/26_13:56:54 INFO: Using calculated netmask for 172.16.1.100: 255.255.255.0
    IPaddr[8470]:   2011/07/26_13:56:54 INFO: eval ifconfig eth0:0 172.16.1.100 netmask 255.255.255.0 broadcast 172.16.1.255
    IPaddr[8441]:   2011/07/26_13:56:54 INFO:  Success
    # The VIP and its IP address are brought up
    ResourceManager[8345]:  2011/07/26_13:56:54 info: Running /etc/init.d/httpd  start
The service is back to normal! The log above is the complete log.
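One detail in the log is worth pulling out: the "Late heartbeat" warning reports an interval of 104930 ms, roughly 105 seconds, which is far beyond the configured deadtime of 30 seconds and matches ha2 having been declared dead while the link was unplugged. A rough sketch for extracting such intervals from a heartbeat log follows. `late_heartbeats` is a hypothetical helper, and it assumes the message format shown in the log above.

```shell
#!/bin/sh
# Sketch only: late_heartbeats is a hypothetical helper.
# $1 is deadtime in whole seconds; the heartbeat log is read on stdin.
late_heartbeats() {
    awk -v dead_ms="$(( $1 * 1000 ))" '
        /Late heartbeat/ {
            # The interval is the field immediately before the "ms" token.
            for (i = 2; i <= NF; i++)
                if ($i == "ms" && $(i-1) + 0 > dead_ms)
                    print $(i-1)
        }
    '
}
```

With the log file configured in ha.cf, `late_heartbeats 30 < /var/log/ha-log` prints every late-heartbeat interval (in milliseconds) that exceeded deadtime.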

Dual heartbeat links and a summary of my own understanding of HA: http://myhat.blog.51cto.com/391263/623546