A detailed guide to corosync for high-availability clusters on Linux
1. corosync plays the role that heartbeat used to: it provides the Messaging Layer and collects heartbeat and membership information between the nodes.
pacemaker plays the role of haresources: it provides the CRM (cluster resource manager) that manages the resource configuration.
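Once corosync has been started in step 2.5, this layering can be seen directly in the process table: with the service { ver: 0 } stanza used below, pacemaker is loaded as a corosync plugin and its daemons run as children of the corosync process. A quick check (a sketch; daemon names as shipped with pacemaker 1.1):
# ps auxf | grep -A 8 coros[y]nc    # stonithd, cib, lrmd, attrd, pengine and crmd should appear under corosync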
2. Lab setup: a two-node cluster consisting of node1.willow.com (IP 1.1.1.18) and node2.willow.com (IP 1.1.1.19).
Configure node1.willow.com as follows (the cluster configuration on node2.willow.com is identical):
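Since the commands below address the peer node simply as node2, both machines are assumed to resolve each other's hostnames, typically via /etc/hosts (a sketch of the assumed entries, identical on both nodes; password-less SSH between the nodes also makes the scp/ssh steps smoother):
# vim /etc/hosts
1.1.1.18    node1.willow.com    node1
1.1.1.19    node2.willow.com    node2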
2.1. Install corosync, pacemaker, and the other required packages:
cluster-glue-1.0.6-1.6.el5.i386.rpm
cluster-glue-libs-1.0.6-1.6.el5.i386.rpm
corosync-1.2.7-1.1.el5.i386.rpm
corosynclib-1.2.7-1.1.el5.i386.rpm
heartbeat-3.0.3-2.3.el5.i386.rpm
heartbeat-libs-3.0.3-2.3.el5.i386.rpm
libesmtp-1.0.4-5.el5.i386.rpm
pacemaker-1.1.5-1.1.el5.i386.rpm
pacemaker-cts-1.1.5-1.1.el5.i386.rpm
pacemaker-libs-1.1.5-1.1.el5.i386.rpm
resource-agents-1.0.4-1.1.el5.i386.rpm
#yum --nogpgcheck localinstall *.rpm
2.2. Edit the corosync configuration file:
#cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
#vim /etc/corosync/corosync.conf
totem {
        version: 2
        # enable authentication/encryption of totem traffic (requires the authkey generated in step 2.3)
        secauth: on
        threads: 2
        interface {
                ringnumber: 0
                # network address of the interface corosync binds to
                bindnetaddr: 1.1.1.0
                # multicast group and port used for cluster communication
                mcastaddr: 226.98.1.21
                mcastport: 5405
        }
}
logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: no
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
amf {
        mode: disabled
}
# start pacemaker as a corosync plugin (ver: 0)
service {
        ver: 0
        name: pacemaker
}
# run the ais/corosync helper processes as root
aisexec {
        user: root
        group: root
}
2.3. Generate the authkey authentication file:
#corosync-keygen
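corosync-keygen reads from /dev/random, so on an idle machine it may pause while waiting for entropy (keyboard or disk activity helps). A quick sanity check of the result (a sketch):
# ls -l /etc/corosync/authkey      # the generated key should be owned by root with mode 0400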
2.4. Copy authkey and corosync.conf from node1 to node2 so that both nodes use identical files, and create the log directory on both nodes:
# mkdir /var/log/cluster
# cd /etc/corosync/
# scp -p authkey corosync.conf node2:/etc/corosync/
# ssh node2 'mkdir /var/log/cluster'
2.5. Start the corosync service on both nodes:
# service corosync start
# ssh node2 'service corosync start'
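To confirm that corosync came up and is listening on the totem multicast port configured above (5405), a quick check on either node (a sketch):
# service corosync status
# netstat -unlp | grep corosync    # should show corosync bound to UDP port 5405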
2.6. Check the logs for engine startup, TOTEM membership, and error messages:
#grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
Aug 05 09:36:14 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
Aug 05 09:36:14 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
#grep TOTEM /var/log/cluster/corosync.log
Aug 05 09:36:14 corosync [TOTEM ] Initializing transport (UDP/IP).
Aug 05 09:36:14 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Aug 05 09:36:15 corosync [TOTEM ] The network interface [1.1.1.18] is now up.
Aug 05 09:36:15 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 05 09:36:42 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
The ERROR lines below are expected at this point: no STONITH resource has been defined yet (STONITH is disabled in step 3).
#grep ERROR: /var/log/cluster/corosync.log
Aug 05 09:37:17 node1.willow.com pengine: [9917]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Aug 05 09:37:17 node1.willow.com pengine: [9917]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Aug 05 09:37:17 node1.willow.com pengine: [9917]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Aug 05 09:52:17 node1.willow.com pengine: [9917]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Aug 05 09:52:17 node1.willow.com pengine: [9917]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Aug 05 09:52:17 node1.willow.com pengine: [9917]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
#grep pcmk_startup /var/log/cluster/corosync.log
Aug 05 09:36:15 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Aug 05 09:36:15 corosync [pcmk ] Logging: Initialized pcmk_startup
Aug 05 09:36:15 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 4294967295
Aug 05 09:36:15 corosync [pcmk ] info: pcmk_startup: Service: 9
Aug 05 09:36:15 corosync [pcmk ] info: pcmk_startup: Local hostname: node1.willow.com
2.7. Monitor the cluster with crm_mon (see the sketch below).
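A couple of common crm_mon invocations (a sketch); both nodes should show as Online once membership has formed:
# crm_mon          # interactive view, refreshes periodically; exit with Ctrl+C
# crm_mon -1       # print the cluster status once and exit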
3. Configure cluster-wide properties: disable STONITH, since no fencing device is available in this lab
# crm configure property stonith-enabled=false
# crm configure verify
# crm configure commit
4. View the current configuration with:
# crm configure show
5. Add cluster resources
5.1. Add an IP address resource (the virtual IP 1.1.1.100):
# crm configure primitive WebIP ocf:heartbeat:IPaddr params ip=1.1.1.100
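To confirm the VIP came up on the active node (a sketch; ocf:heartbeat:IPaddr configures the address as an interface alias, so it is also visible with ifconfig):
# crm status                       # shows which node WebIP was started on
# ip addr show | grep 1.1.1.100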
5.2. Add the httpd resource:
# crm configure primitive httpd lsb:httpd
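For an lsb: resource the init script must exist on both nodes, and the service must not be started or enabled outside the cluster, otherwise the init system and pacemaker fight over it. The usual preparation on both nodes (a sketch):
# chkconfig httpd off
# service httpd stop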
5.3. Check the resource status:
# crm status
5.4. Simulate a failure of the current node and watch what happens:
# crm node standby
The status shows that node1.willow.com is offline, but the WebIP resource has not started on node2.willow.com. This is because the cluster is now "WITHOUT quorum": having lost quorum, the cluster no longer meets the conditions for normal operation, which makes no sense for a two-node cluster. We can therefore tell the cluster to ignore the failed quorum check with the following command:
# crm configure property no-quorum-policy=ignore
Note: crm status now shows WebIP and httpd running on different nodes, which is clearly not what we want; sections 6 and 7 below address this.
Summary of useful commands:
# crm node online                  # bring the current node back online
# crm resource cleanup WebIP       # clean up the resource's state
# crm resource cleanup httpd       # clean up the resource's state
# crm configure edit               # edit the generated configuration directly
6. Use a group to keep the WebIP and httpd resources on the same node
# crm configure group webservice WebIP httpd
# crm configure verify
# crm configure commit
# crm status                       # verify that both resources are now bound to the same node
7. Manage resources with constraints (first stop the webservice group from step 6, clean up, and delete it):
# crm resource stop webservice
# crm resource cleanup webservice
# crm resource cleanup WebIP
# crm resource cleanup httpd
# crm configure delete webservice
# crm configure verify
# crm configure commit
7.1. Colocation constraint (keep httpd together with WebIP):
# crm configure colocation httpd_with_WebIP inf: httpd WebIP
# crm configure show xml
# crm configure verify
# crm configure commit
7.2. Order constraint (start WebIP before httpd):
# crm configure order WebIP_before_httpd mandatory: WebIP httpd
# crm configure verify
# crm configure commit
7.3. Location constraint: prefer a specific node for the resource:
# crm configure location WebIP_on_node1 WebIP rule 100: #uname eq node1.willow.com
# crm configure verify
# crm configure commit
8. Resource stickiness: make resources prefer to stay on the node where they are currently running. With resource-stickiness=200 and the location score of 100 from step 7.3, stickiness wins (200 > 100), so a resource running on node2 stays put even after node1.willow.com comes back online.
# crm configure rsc_defaults resource-stickiness=200
# crm configure verify
# crm configure commit
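To see how the stickiness and the location score from step 7.3 combine, the allocation scores can be inspected with the policy-engine test tool; a sketch, assuming the ptest utility shipped with this pacemaker release:
# ptest -sL        # -L uses the live CIB, -s prints the allocation score of each resource on each node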
9. Serving the web pages from an NFS share: the Filesystem resource is defined as follows
# crm configure primitive filesystem ocf:heartbeat:Filesystem params device=1.1.1.20:/web/ha directory=/var/www/html/ fstype=nfs
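This assumes the NFS server at 1.1.1.20 exports /web/ha to the cluster subnet. A sketch of the server-side setup (not part of the original steps):
# vim /etc/exports
/web/ha    1.1.1.0/24(ro)
# exportfs -arv
In practice the filesystem resource should also be grouped or ordered together with WebIP and httpd, so the DocumentRoot is mounted before Apache starts.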
10. Failover test commands:
# crm node standby
# crm node online
11. Summary: you can also work in the interactive crm shell, issuing the configuration commands step by step or getting help on any of them:
# crm
crm(live)# configure
crm(live)configure# help group
crm(live)configure# verify
crm(live)configure# commit
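The shell can also be used to browse the available resource agents before writing primitives; a short sketch using the agents from the steps above:
# crm
crm(live)# ra
crm(live)ra# classes
crm(live)ra# list ocf heartbeat
crm(live)ra# meta ocf:heartbeat:IPaddr
crm(live)ra# end
crm(live)# quit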