This procedure consists of three phases:
Part 1: Compiling and installing DRBD
node1:
Create a new, unused partition: /dev/sda5
vi /etc/sysconfig/network-scripts/ifcfg-eth0
Edit it to contain the following lines:
DEVICE=eth0
BOOTPROTO=static
HWADDR=00:0C:29:77:F8:D2
IPADDR=192.168.0.2
NETMASK=255.255.255.0
ONBOOT=yes
vi /etc/sysconfig/network
Edit it to contain the following lines:
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=node1.a.org
hostname node1.a.org
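The new IP address only takes effect after the interface is restarted (the same applies on node2 once its interface file is edited); a minimal sketch, assuming the standard RHEL/CentOS init scripts:
service network restart
ifconfig eth0    # confirm the new address is now assigned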
vi /etc/hosts
Add:
192.168.0.2 node1.a.org node1
192.168.0.3 node2.a.org node2
scp /etc/hosts 192.168.0.3:/etc
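An optional sanity check that host-name resolution now works, assuming node2 is already reachable at 192.168.0.3:
ping -c 2 node2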
Download the source package: drbd-8.3.10.tar.gz
tar xf drbd-8.3.10.tar.gz
cd drbd-8.3.10
./configure
make rpm
make km-rpm
cd /usr/src/redhat/RPMS/i386/
rpm -ivh drbd*
modprobe drbd      # load the drbd kernel module
lsmod | grep drbd  # verify the module is loaded
vi /etc/drbd.d/global_common.conf and add:
global {
usage-count no;
# minor-count dialog-refresh disable-ip-verification
}
common {
protocol C;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
# split-brain "/usr/lib/drbd/notify-split-brain.sh root";
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
}
startup {
wfc-timeout 120;
degr-wfc-timeout 120;
}
disk {
on-io-error detach;
fencing resource-only;
}
net {
cram-hmac-alg "sha1";
shared-secret "mydrbdlab";
}
syncer {
rate 100M;
}
}
Define a resource in /etc/drbd.d/web.res with the following content:
resource web {
on node1.a.org {
device /dev/drbd0;
disk /dev/sda5;
address 192.168.0.2:7789;
meta-disk internal;
}
on node2.a.org {
device /dev/drbd0;
disk /dev/sda5;
address 192.168.0.3:7789;
meta-disk internal;
}
}
scp /etc/drbd.d/global_common.conf /etc/drbd.d/web.res node2:/etc/drbd.d/
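Before initializing the metadata, it can be worth letting drbdadm parse the configuration to catch syntax errors early; a sketch, assuming the default /etc/drbd.conf that includes the files under /etc/drbd.d/:
drbdadm dump web    # prints the parsed definition of the web resource, or an error if the config is invalid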
Initialize the resource:
drbdadm create-md web
Start the service: /etc/init.d/drbd start
node2:
Create a new, unused partition: /dev/sda5
vi /etc/sysconfig/network-scripts/ifcfg-eth0
Edit it to contain the following lines:
DEVICE=eth0
BOOTPROTO=static
HWADDR=00:0C:29:77:F8:D2    # use node2's own MAC address here
IPADDR=192.168.0.3
NETMASK=255.255.255.0
ONBOOT=yes
vi /etc/sysconfig/network
Edit it to contain the following lines:
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=node2.a.org
hostname node2.a.org
Download the source package: drbd-8.3.10.tar.gz
tar xf drbd-8.3.10.tar.gz
cd drbd-8.3.10
./configure
make rpm
make km-rpm
cd /usr/src/redhat/RPMS/i386/
rpm -ivh drbd*
modprobe drbd      # load the drbd kernel module
lsmod | grep drbd  # verify the module is loaded
Initialize the resource:
drbdadm create-md web
Start the service: /etc/init.d/drbd start
Set the current node as the primary node: drbdadm -- --overwrite-data-of-peer primary web
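After one node has been promoted, the initial full sync starts. The following is only a sketch of how to watch it and then put a filesystem on the device, run on the primary node; the mount point /mnt/drbd is just an example:
watch -n 1 'cat /proc/drbd'    # wait until the disk states show UpToDate/UpToDate
drbd-overview                  # should report Primary/Secondary on this node
mke2fs -j /dev/drbd0           # create an ext3 filesystem on the drbd device
mkdir /mnt/drbd
mount /dev/drbd0 /mnt/drbd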
Part 2: Installing OpenAIS
node1:
Set up key-based SSH communication between the two nodes:
ssh-keygen -t rsa
ssh-copy-id -i .ssh/id_rsa.pub [email protected]
scp /etc/hosts 192.168.0.12:/etc/
Install the following packages:
cluster-glue
cluster-glue-libs
heartbeat
openaislib
resource-agents
corosync
heartbeat-libs
pacemaker
corosynclib
libesmtp
pacemaker-libs
yum --nogpgcheck -y localinstall *.rpm
Modify the configuration file:
cd /etc/corosync
cp corosync.conf.example corosync.conf
vi corosync.conf and add:
service {
ver: 0
name: pacemaker
}
aisexec {
user: root
group: root
}
In this file, bindnetaddr must be set to the network address of the nodes' subnet: 192.168.0.0
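For reference, bindnetaddr lives in the interface sub-section of the totem block in corosync.conf; a sketch of what that block might look like here (mcastaddr and mcastport are taken from the stock example file and are only assumptions):
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}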
mkdir /var/log/cluster
ssh node2 'mkdir /var/log/cluster'
Generate the authentication key file used for inter-node communication (corosync-keygen reads from /dev/random, so it may pause until enough entropy is available):
corosync-keygen
scp -p corosync.conf authkey node2:/etc/corosync/    # copy these two files to node2
Start the service: /etc/init.d/corosync start
Verify that there are no errors:
Check whether the corosync engine started correctly:
# grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/messages
Jun 14 19:02:08 node1 corosync[5103]: [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
Jun 14 19:02:08 node1 corosync[5103]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Jun 14 19:02:08 node1 corosync[5103]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1397.
Jun 14 19:03:49 node1 corosync[5120]: [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
Jun 14 19:03:49 node1 corosync[5120]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Check whether the initial membership notifications were sent correctly:
# grep TOTEM /var/log/messages
Jun 14 19:03:49 node1 corosync[5120]: [TOTEM ] Initializing transport (UDP/IP).
Jun 14 19:03:49 node1 corosync[5120]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jun 14 19:03:50 node1 corosync[5120]: [TOTEM ] The network interface [192.168.0.5] is now up.
Jun 14 19:03:50 node1 corosync[5120]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Check whether any errors occurred during startup:
# grep ERROR: /var/log/messages | grep -v unpack_resources
Check whether pacemaker started correctly:
# grep pcmk_startup /var/log/messages
Jun 14 19:03:50 node1 corosync[5120]: [pcmk ] info: pcmk_startup: CRM: Initialized
Jun 14 19:03:50 node1 corosync[5120]: [pcmk ] Logging: Initialized pcmk_startup
Jun 14 19:03:50 node1 corosync[5120]: [pcmk ] info: pcmk_startup: Maximum core file size is: 4294967295
Jun 14 19:03:50 node1 corosync[5120]: [pcmk ] info: pcmk_startup: Service: 9
Jun 14 19:03:50 node1 corosync[5120]: [pcmk ] info: pcmk_startup: Local hostname: node1.a.org
If none of the above shows errors, node2 can be started:
ssh node2 -- /etc/init.d/corosync start
crm status  # check the nodes' status
crm configure property stonith-enabled=false  # disable STONITH: there is no STONITH device, so leaving it enabled may cause errors
crm configure show  # view the current configuration
Add a resource: crm configure primitive WebIP ocf:heartbeat:IPaddr params ip=192.168.0.10
crm status  # verify the resource has started on node1.a.org
crm configure property no-quorum-policy=ignore  # with this setting, resources can run on node2.a.org when node1.a.org goes offline
Configure this cluster as a highly available web cluster:
yum -y install httpd
crm configure primitive WebSite lsb:httpd  # add the httpd service as a cluster resource
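Since WebSite and WebIP must run on the same node, and httpd should only start after the address is up, one would typically also add colocation and ordering constraints; a sketch with illustrative constraint names:
crm configure colocation WebSite-with-WebIP inf: WebSite WebIP
crm configure order WebSite-after-WebIP inf: WebIP WebSite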
node2:
vi /etc/sysconfig/network-scripts/ifcfg-eth0
Edit it to contain the following lines:
DEVICE=eth0
BOOTPROTO=static
HWADDR=00:0C:29:77:F8:D2    # use node2's own MAC address here
IPADDR=192.168.0.12
NETMASK=255.255.255.0
ONBOOT=yes
vi /etc/sysconfig/network
Edit it to contain the following lines:
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=node2.a.org
hostname node2.a.org
Set up key-based SSH communication between the two nodes:
ssh-keygen -t rsa
ssh-copy-id -i .ssh/id_rsa.pub [email protected]
yum -y install httpd
Part 3: DRBD + Pacemaker
1. View the current cluster configuration and make sure the global properties are suitable for a two-node cluster:
# crm configure show
node node1.a.org
node node2.a.org
property $id="cib-bootstrap-options" \
dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
last-lrm-refresh="1308059765" \
no-quorum-policy="ignore"
In the output above, make sure stonith-enabled and no-quorum-policy are present with the values shown. If they are not, set them with the following commands:
# crm configure property stonith-enabled=false
# crm configure property no-quorum-policy=ignore
2. Define the already configured drbd device /dev/drbd0 as a cluster service.
1) As required for cluster-managed services, first make sure the drbd service is stopped on both nodes and will not start automatically at boot:
# drbd-overview
0:web Unconfigured . . . .
# chkconfig drbd off
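It can also be worth double-checking that drbd is disabled in every runlevel; a quick sketch:
# chkconfig --list drbd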
2) Configure drbd as a cluster resource:
The resource agent (RA) for drbd is provided under the OCF provider linbit, at /usr/lib/ocf/resource.d/linbit/drbd. The following commands list this RA and display its metadata:
# crm ra classes
heartbeat
lsb
ocf / heartbeat linbit pacemaker
stonith
# crm ra list ocf linbit
drbd
# crm ra info ocf:linbit:drbd
This resource agent manages a DRBD resource
as a master/slave resource. DRBD is a shared-nothing replicated storage
device. (ocf:linbit:drbd)
Master/Slave OCF Resource Agent for DRBD
Parameters (* denotes required, [] the default):
drbd_resource* (string): drbd resource name
The name of the drbd resource from the drbd.conf file.
drbdconf (string, [/etc/drbd.conf]): Path to drbd.conf
Full path to the drbd.conf file.
Operations' defaults (advisory minimum):
start timeout=240
promote timeout=90
demote timeout=90
notify timeout=90
stop timeout=100
monitor_Slave interval=20 timeout=20 start-delay=1m
monitor_Master interval=10 timeout=20 start-delay=1m
drbd must run on both nodes at the same time, but only one node can be Master (in the primary/secondary model) while the other is Slave; it is therefore a special kind of cluster resource, a multi-state clone. The nodes are differentiated into Master and Slave roles, and when the service first starts both nodes are in the Slave state.
[root@node1 ~]# crm
crm(live)# configure
crm(live)configure# primitive webdrbd ocf:linbit:drbd params drbd_resource=web op monitor role=Master interval=50s timeout=30s op monitor role=Slave interval=60s timeout=30s
crm(live)configure# master MS_Webdrbd webdrbd meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
crm(live)configure# show webdrbd
primitive webdrbd ocf:linbit:drbd \
params drbd_resource="web" \
op monitor interval="15s"
crm(live)configure# show MS_Webdrbd
ms MS_Webdrbd webdrbd \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
crm(live)configure# verify
crm(live)configure# commit
View the current cluster status:
# crm status
============
Last updated: Fri Jun 17 06:24:03 2011
Stack: openais
Current DC: node2.a.org - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ node2.a.org node1.a.org ]
Master/Slave Set: MS_Webdrbd
Masters: [ node2.a.org ]
Slaves: [ node1.a.org ]
From the information above, the Primary node for the drbd service is currently node2.a.org and the Secondary node is node1.a.org. You can also run the following command on node2 to verify that the current host has become the Primary node for the web resource:
# drbdadm role web
Primary/Secondary
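To actually use /dev/drbd0 from the cluster, the usual next step is a Filesystem resource that is colocated with, and ordered after, the DRBD Master. The following is only a sketch under assumed names and paths (WebFS, /var/www/html, ext3), not part of the configuration shown above:
crm configure primitive WebFS ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/var/www/html" fstype="ext3"
crm configure colocation WebFS-on-Master inf: WebFS MS_Webdrbd:Master
crm configure order WebFS-after-drbd inf: MS_Webdrbd:promote WebFS:start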