A Simple corosync + pacemaker Setup
Prerequisites:
1) This setup uses two test nodes, lidefu3 and lidefu4, with the IP addresses 172.16.21.3 and 172.16.21.4 respectively;
2) The cluster service is Apache's httpd;
3) The address serving the web content, i.e. the VIP, is 172.16.21.21;
4) The OS is CentOS 6.5 64-bit
1. Preparation
To configure a Linux host as an HA node, the following preparation is usually needed:
1) Hostname resolution must work for all nodes, and each node's hostname must match the output of "uname -n". Both nodes therefore need the following in /etc/hosts:
172.16.21.3 lidefu3
172.16.21.4 lidefu4
To keep these hostnames across reboots, also run something like the following on each node:
Node1:
# sed -i 's@\(HOSTNAME=\).*@\1lidefu3@g' /etc/sysconfig/network
# hostname lidefu3
Node2:
# sed -i 's@\(HOSTNAME=\).*@\1lidefu4@g' /etc/sysconfig/network
# hostname lidefu4
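Since corosync identifies a node by "uname -n", it is worth a quick check on each node before going further:
# uname -n    # should print lidefu3 (or lidefu4 on the other node)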
2) Set up key-based ssh between the two nodes, which can be done with commands like the following:
Node1:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@lidefu4
Node2:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@lidefu3
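To confirm that key-based login works, a quick check (run from lidefu3; the mirror test applies on lidefu4):
# ssh lidefu4 'date'    # should print the remote date without asking for a password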
Install corosync (on both nodes):
yum install -y corosync
List the files the package installed:
rpm -ql corosync
Configure corosync (the configuration must be identical on both nodes):
cd /etc/corosync/
cp corosync.conf.example corosync.conf
vim corosync.conf

compatibility: whitetank        # whether to stay compatible with pre-0.8 versions
totem {                         # corosync's totem protocol
    version: 2                  # protocol version
    secauth: on                 # enable authentication; if it is off, anyone who knows our multicast address can easily join the cluster; it costs extra CPU, but enabling it is still recommended
    threads: 0                  # number of parallel threads used for authentication; 0 means the default
    interface {                 # sub-module
        ringnumber: 0           # ring number; with several NICs on one host, it keeps the other NICs from receiving the heartbeats this host sends out
        bindnetaddr: 172.16.21.0    # the network address the cluster works on
        mcastaddr: 226.16.21.1  # multicast address
        mcastport: 5405         # multicast port
        ttl: 1                  # forward the multicast at most one hop
    }
}

logging {                       # logging
    fileline: off
    to_stderr: no               # send log messages to standard error?
    to_logfile: yes             # send them to a log file?
    to_syslog: no               # send them to syslog?
    logfile: /var/log/cluster/corosync.log    # the log file
    debug: off                  # enable debug output?
    timestamp: on               # timestamp each entry (costs some I/O)
    logger_subsys {             # logging subsystem
        subsys: AMF             # the AMF subsystem below
        debug: off
    }
}

amf {                           # the AMF programming interface
    mode: disabled
}
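A note on bindnetaddr: it must be the network address, not a host address. On CentOS 6 it can be derived with ipcalc from the initscripts package (the /24 mask here is an assumption based on the addresses above):
# ipcalc -n 172.16.21.3 255.255.255.0
NETWORK=172.16.21.0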
Install pacemaker (on both nodes):
yum install pacemaker -y    # note: it apparently cannot coexist with heartbeat
rpm -q pacemaker
rpm -ql pacemaker
Configure corosync.conf so that pacemaker starts together with corosync, by appending the following at the end of corosync.conf:
service {             # define a service
    ver: 0            # version number
    name: pacemaker   # service name; pacemaker starts when corosync starts
}
aisexec {             # whose identity the service above runs under; the default is root, so this block can actually be omitted
    user: root        # run as user root
    group: root       # run as group root
}
In fact pacemaker can be started as a service in its own right; it is only because the corosync version here is fairly old that it has to be started together with corosync.
Generate the key file that secures cluster communication:
corosync-keygen    # generates the auth key; pound on the keyboard to feed it random data
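corosync-keygen reads from /dev/random, which can block for a long time on an idle machine. A common workaround on a throwaway test box (not something to do on a production host) is to point /dev/random at /dev/urandom while the key is generated:
mv /dev/random /dev/random.orig
ln -s /dev/urandom /dev/random
corosync-keygen
rm /dev/random
mv /dev/random.orig /dev/random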
Install crmsh and pssh on both nodes. crmsh serves as the command-line interface to pacemaker, and it depends on pssh:
yum install crmsh-1.2.6-4.el6.x86_64.rpm pssh-2.3.1-2.el6.x86_64.rpm
Copy the authkey and corosync.conf files over to the other node (lidefu4):
scp -p authkey corosync.conf lidefu4:/etc/corosync/
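To confirm both files arrived intact (authkey must stay mode 0400, which scp -p preserves):
ssh lidefu4 'ls -l /etc/corosync/authkey /etc/corosync/corosync.conf'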
Start corosync on both nodes:
service corosync start
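Thanks to the key-based ssh set up earlier, the second node can be started remotely from the first:
ssh lidefu4 'service corosync start'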
Check the log to see whether startup succeeded:
less /var/log/cluster/corosync.log
Check that the corosync engine started correctly:
grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
Apr 20 17:39:36 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Apr 20 17:39:36 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'
Check that the initial membership notifications went out correctly:
grep TOTEM /var/log/cluster/corosync.log
Apr 20 17:39:36 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Apr 20 17:39:36 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Apr 20 17:39:36 corosync [TOTEM ] The network interface [172.16.21.3] is now up.
Apr 20 17:39:37 corosync [TOTEM ] Type of received message is wrong... ignoring 25.
Apr 20 17:39:37 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
...
Check whether any errors occurred during startup. The errors below say that pacemaker will soon stop running as a corosync plugin and that cman is recommended as the cluster infrastructure layer instead; here they can safely be ignored.
grep ERROR: /var/log/cluster/corosync.log | grep -v unpack_resources
Apr 20 17:39:37 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Apr 20 17:39:37 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
Check that pacemaker started correctly:
grep pcmk_startup /var/log/cluster/corosync.log
Apr 20 17:39:37 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Apr 20 17:39:37 corosync [pcmk ] Logging: Initialized pcmk_startup
Apr 20 17:39:37 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
Apr 20 17:39:37 corosync [pcmk ] info: pcmk_startup: Service: 9
Apr 20 17:39:37 corosync [pcmk ] info: pcmk_startup: Local hostname: lidefu3
If crmsh is installed, the startup state of the cluster nodes can be checked with:
crm status    # show the status
Last updated: Sun Apr 20 18:14:01 2014
Last change: Sun Apr 20 18:00:25 2014 via crmd on lidefu3
Stack: classic openais (with plugin)
Current DC: NONE
1 Nodes configured, 2 expected votes
0 Resources configured


Node lidefu3: UNCLEAN (offline)
Look at the help for the crm command. crm works in two modes, batch mode and interactive mode (a batch-mode sketch follows the help listing below):
crm               # run the crm command by itself
crm(live)# help   # "live" means changes apply to the running configuration, i.e. they take effect as soon as they are made

This is crm shell, a Pacemaker command line interface.

Available commands:

        cib              manage shadow CIBs
        resource         resources management
        configure        CRM cluster configuration
        node             nodes management
        options          user preferences
        history          CRM cluster history
        site             Geo-cluster support
        ra               resource agents information center
        status           show cluster status
        help,?           show help (help topics for list of topics)
        end,cd,up        go back one level
        quit,bye,exit    exit the program
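In batch mode the same commands are passed as shell arguments instead of typed interactively; a couple of sketches:
crm status                                       # same as status inside the shell
crm configure show                               # dump the current configuration
crm configure property stonith-enabled=false     # a one-shot configure command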
First disable stonith, since there is no stonith device (these commands run at crm's configure level):
configure
property stonith-enabled=false    # disable stonith
verify                            # check the change for errors
commit                            # commit it
show                              # show the full cluster configuration
Note: when a cluster has no quorum, resources will not be moved even if a node fails.
Look at the help for primitive:
help primitive
primitive <rsc> {[<class>:[<provider>:]]<type>|@<template>}
        [params attr_list]
        [meta attr_list]
        [utilization attr_list]
        [operations id_spec]
        [op op_type [<attribute>=<value>...] ...]

attr_list :: [$id=<id>] <attr>=<val> [<attr>=<val>...] | $id-ref=<id>
id_spec :: $id=<id> | $id-ref=<id>
op_type :: start | stop | monitor
At the ra level, list the resource agent classes with classes and the agents within a class with list:
classes    # list the resource agent classes
lsb
ocf / heartbeat pacemaker
service
stonith
crm(live)ra# list lsb    # list the resources this agent class can manage
NetworkManager     abrt-ccpp          abrt-oops          abrtd              acpid              atd
auditd             autofs             blk-availability   bluetooth          corosync           corosync-notifyd
...
Look at the parameters a resource can take with meta:
classes
list ocf
meta ocf:IPaddr2

Parameters (* denotes required, [] the default):

ip* (string): IPv4 or IPv6 address
nic (string): Network interface
broadcast (string): Broadcast address
iflabel (string): Interface label
lvs_support (boolean, [false]): Enable support for LVS DR
Define the resources with primitive:
primitive webip ocf:IPaddr2 params ip=172.16.21.21
primitive webserver lsb:httpd
verify
commit
show
Define a group. With corosync/pacemaker the group is defined after the resources, unlike heartbeat, where the group has to be defined first:
group webservice webip webserver
verify
commit
show
Access the web page:
curl http://172.16.21.21    # a browser on Windows works too
this is fours page33333333333333
Put node 3 into standby:
node
standby lidefu3
cd ..
status
Access the page again:
curl http://172.16.21.21
this is four page444444444444
Bring node 3 back online and see whether the resources move back:
crm node online lidefu3
crm status    # they do not move back, because none of the steps so far defined any preference for the resources
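If the group is supposed to drift back to lidefu3 once it is online, one hedged option (location constraints are covered properly further down) is to give it a node preference; the constraint id here is made up:
crm configure location webservice_prefers_lidefu3 webservice 100: lidefu3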
One way to delete a resource:
crm
configure
edit    # the configuration can be edited and the resource deleted directly here, but this is not recommended
Stop the resource with stop, and only then delete it:
crm
resource
stop webservice
status webservice
Delete the resource with delete:
crm
configure
delete webservice    # note: this deletes only the webservice group itself, not the resources inside it
Define a colocation constraint, which keeps two resources together, much like a group:
Usage:
...............
colocation <id> <score>: <rsc>[:<role>] <rsc>[:<role>] ...
        [node-attribute=<node_attr>]
...............
Example:
...............
colocation dummy_and_apache -inf: apache dummy
colocation c1 inf: A ( B C )
...............
configure
colocation webserver_and_webip inf: webserver webip    # the first resource follows the second, i.e. webserver follows webip
verify
commit
Show the detailed information for all resources:
configure
show xml
Define an order constraint; startup is assumed to be sequential:
Usage:
...............
order <id> {kind|<score>}: <rsc>[:<action>] <rsc>[:<action>] ...
        [symmetrical=<bool>]

kind :: Mandatory | Optional | Serialize
...............
Example:
...............
order c_apache_1 Mandatory: apache:start ip_1
order o1 Serialize: A ( B C )
order order_2 Mandatory: [ A B ] C
...............
order webip_before_webserver mandatory: webip webserver
A location constraint expresses which node a resource would rather run on; a group's preference for a node is the sum of its member resources' scores.
location <id> <rsc> {node_pref|rules}

node_pref :: <score>: <node>

rules ::
        rule [id_spec] [$role=<role>] <score>: <expression>
        [rule [id_spec] [$role=<role>] <score>: <expression> ...]

id_spec :: $id=<id> | $id-ref=<id>
score :: <number> | <attribute> | [-]inf
expression :: <simple_exp> [bool_op <simple_exp> ...]
bool_op :: or | and
simple_exp :: <attribute> [type:]<binary_op> <value>
        | <unary_op> <attribute>
        | date <date_expr>
type :: string | version | number
binary_op :: lt | gt | lte | gte | eq | ne
unary_op :: defined | not_defined

date_expr :: lt <end>
        | gt <start>
        | in_range start=<start> end=<end>
        | in_range start=<start> <duration>
        | date_spec <date_spec>
duration|date_spec ::
        hours=<value>
        | monthdays=<value>
        | weekdays=<value>
        | yearsdays=<value>
        | months=<value>
        | weeks=<value>
        | years=<value>
        | weekyears=<value>
        | moon=<value>
...............
Examples:
...............
location conn_1 internal_www 100: node1

location conn_1 internal_www \
        rule 50: #uname eq node1 \
        rule pingd: defined pingd

location conn_2 dummy_float \
        rule -inf: not_defined pingd or pingd number:lte 0
location webip_on_lidefu4 webip 200: lidefu4
Configure a global property so that when one host dies and the survivor no longer has quorum, it can still run the resources:
configure
property no-quorum-policy=ignore
verify
commit
Define default attributes for resources:
crm(live)configure# rsc_defaults
allow-migrate=         interval-origin=       multiple-active=       restart-type=
description=           is-managed=            priority=              target-role=
failure-timeout=       migration-threshold=   resource-stickiness=
Among these, resource-stickiness is the resource's preference for staying on the node it is currently running on; it applies only to the current node.
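Scores are simply added up. A hedged sketch of how stickiness interacts with the location constraint above: with webip pulled toward lidefu4 by 200 points, a default stickiness of 100 per resource gives the two grouped resources 100 + 100 = 200 points for staying where they are, enough to cancel the pull:
crm configure rsc_defaults resource-stickiness=100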
Everything defined so far only moves resources when a node fails; nothing yet says what to do when a resource itself fails. So next, resource failure.
killall httpd         # terminate the running service abnormally
crm
status                # check the status; it still shows the service as running normally
resource              # enter the resource level
stop webserver        # stop the service
stop webip            # stop the service
cleanup webserver     # clean up the abnormally terminated resource
cleanup webip         # clean up the abnormally terminated resource
Define monitoring. When a resource fails, pacemaker will try to restart the service, and if the restart fails it moves the resource to another node.
Usage:
...............
monitor <rsc>[:<role>] <interval>[:<timeout>]
...............
Example:
...............
monitor apcfence 60m:60s
...............
monitor webserver 20s:15s
verify
commit
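Equivalently, the monitor operation can be declared when the primitive itself is defined; a sketch, assuming the resource is being created from scratch rather than modified afterwards:
crm configure primitive webserver lsb:httpd \
        op monitor interval=20s timeout=15s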
With the monitor defined, try stopping httpd on the node currently running it, wait more than 15 seconds, and check whether it comes back up. In this author's case it was restarted successfully.
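A quick way to exercise the monitor, run on the node currently holding the resources (the sleep length is an assumption based on the 20s interval above):
killall httpd    # simulate an httpd crash
sleep 25         # wait for at least one monitor interval to pass
crm status       # webserver should be running again, restarted here or moved to the peer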