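--------------Preamble------------------
First, thanks to Jiu Ge (酒哥) for the book 构建高可用Linux服务器 (Building Highly Available Linux Servers). Reading it and following its configurations made the DRBD+HeartBeat+NFS approach much clearer to me.
Put simply, DRBD is RAID-1 over the network. There are usually two or more nodes; the disk block device created on each node is mapped to a local DRBD device, and the DRBD devices on the nodes are then kept in sync with one another over the network.
Heartbeat increases the availability of DRBD. When a node fails, it automatically switches the DRBD device over to the backup node and runs the follow-up steps: rebinding the virtual IP, promoting the DRBD device, mounting the disk, and starting NFS via scripts. All of this happens only among the back-end nodes, and front-end users access Heartbeat's virtual IP, so users notice nothing.
One last gripe: installing via yum was a real trap here. From now on, unless it is strictly necessary, I would rather install from source.
---------------Let's get started----------------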
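OS version: CentOS 6.3 x64 (kernel 2.6.32)
DRBD: DRBD-8.4.3
HeartBeat: from the EPEL repository (a real trap)
NFS: shipped with the system
HeartBeat VIP: 192.168.7.90
node1 DRBD+HeartBeat: 192.168.7.88 (drbd1.example.com)
node2 DRBD+HeartBeat: 192.168.7.89 (drbd2.example.com)
(node1) marks configuration for the primary node only
(node2) marks configuration for the secondary node only
(node1,node2) marks configuration required on both nodes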
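I. DRBD configuration: see http://showerlee.blog.51cto.com/2047005/1211963
II. Heartbeat configuration
This continues from the system environment and installation set up in the DRBD article:
1. Install heartbeat (CentOS 6.3 does not ship a Heartbeat package by default, so it has to come from a third-party repository) (node1,node2)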
# wget ftp://mirror.switch.ch/pool/1/mirror/scientificlinux/6rolling/i386/os/Packages/epel-release-6-5.noarch.rpm
# rpm -Uvh epel-release-6-5.noarch.rpm
# yum --enablerepo=epel install heartbeat -y
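2. Configure heartbeat
(node1)
# vi /etc/ha.d/ha.cf
---------------
# Logging
logfile /var/log/ha-log
logfacility local0
# Heartbeat interval (seconds)
keepalive 2
# Time (seconds) without heartbeats before the peer is declared dead
deadtime 5
# IP address of the peer node:
ucast eth0 192.168.7.89
# off: after a failed node recovers, the node currently holding the resources keeps them; the recovered node does not take them back
auto_failback off
# Define the cluster nodes (names must match `uname -n` on each node)
node drbd1.example.com drbd2.example.com
---------------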
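(node2)
# vi /etc/ha.d/ha.cf
---------------
# Logging
logfile /var/log/ha-log
logfacility local0
# Heartbeat interval (seconds)
keepalive 2
# Time (seconds) without heartbeats before the peer is declared dead
deadtime 5
# IP address of the peer node:
ucast eth0 192.168.7.88
# off: after a failed node recovers, the node currently holding the resources keeps them; the recovered node does not take them back
auto_failback off
# Define the cluster nodes (names must match `uname -n` on each node)
node drbd1.example.com drbd2.example.com
---------------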
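The two ha.cf files above differ only in the ucast peer address, so they could be generated from a single template. The following is a minimal sketch under that assumption. The function name gen_ha_cf is made up for illustration and is not part of Heartbeat; the values simply mirror the configuration above.

```shell
# Hypothetical helper: print an ha.cf body for the given peer IP.
# Only the ucast line differs between node1 and node2.
gen_ha_cf() {
    peer_ip=$1
    cat <<EOF
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 5
ucast eth0 $peer_ip
auto_failback off
node drbd1.example.com drbd2.example.com
EOF
}

# Example (on node1, whose peer is node2):
#   gen_ha_cf 192.168.7.89 > /etc/ha.d/ha.cf
```

Generating both files from one function keeps every other setting identical on the two nodes, which is what Heartbeat expects.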
Edit the authentication file shared by the two nodes: (node1,node2)
# vi /etc/ha.d/authkeys
--------------
auth 1
1 crc
--------------
# chmod 600 /etc/ha.d/authkeys
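The crc method above only detects corrupted packets and does not authenticate the peer. The authkeys file also accepts md5 and sha1 with a shared key. The sketch below assumes sha1 is wanted; gen_authkeys is a hypothetical helper name, and the generated file has to be identical on both nodes.

```shell
# Hypothetical helper: print an authkeys body that uses sha1 with a random key.
gen_authkeys() {
    # 64 random bytes hashed to a 40-character hex string used as the shared key
    key=$(head -c 64 /dev/urandom | sha1sum | cut -d' ' -f1)
    printf 'auth 1\n1 sha1 %s\n' "$key"
}

# Example (run once on node1, then copy the file to node2):
#   ( umask 077; gen_authkeys > /etc/ha.d/authkeys )
#   scp -p /etc/ha.d/authkeys drbd2.example.com:/etc/ha.d/authkeys
```

The umask in the usage example keeps the file at mode 600, which matches the chmod above.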
Edit the cluster resource file: (node1,node2)
# vi /etc/ha.d/haresources
--------------
drbd1.example.com IPaddr::192.168.7.90/24/eth0 drbddisk::r0 Filesystem::/dev/drbd0::/data::ext4 killnfsd
--------------
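Note: the scripts named in this file, such as IPaddr and Filesystem, live in /etc/ha.d/resource.d/. You can also place service start scripts in that directory (for example mysql, www) and add the same script name to /etc/ha.d/haresources, so that the script is started together with heartbeat.
IPaddr::192.168.7.90/24/eth0: uses the IPaddr script to configure the floating VIP
drbddisk::r0: uses the drbddisk script to promote the DRBD resource r0 to Primary on takeover and demote it to Secondary on release
Filesystem::/dev/drbd0::/data::ext4: uses the Filesystem script to mount and unmount the disk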
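To make the haresources syntax more concrete: each entry has the form script::arg1::arg2, and on takeover Heartbeat runs the scripts from left to right with the arguments followed by start (on release it runs them in reverse order with stop). The sketch below only prints those calls for a given line; explain_haresources is a hypothetical name, not a Heartbeat tool.

```shell
# Hypothetical helper: show how one haresources line maps to script calls.
# $1: a full haresources line, e.g. "node IPaddr::1.2.3.4/24/eth0 drbddisk::r0"
explain_haresources() {
    set -- $1            # intentional word splitting on whitespace
    echo "preferred node: $1"
    shift
    for res in "$@"; do
        # "script::arg1::arg2" becomes "script arg1 arg2 start"
        echo "$(printf '%s' "$res" | sed 's/::/ /g') start"
    done
}

# Example:
#   explain_haresources "$(grep -v '^#' /etc/ha.d/haresources | head -n 1)"
```

For the line used in this article, the third printed line would be "drbddisk r0 start", which is exactly how the drbddisk script further down receives its resource name and command.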
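Create the script file killnfsd, which restarts the NFS service: (node1,node2)
Note: after the NFS service switches over, the directory shared by NFS reportedly has to be mounted again, otherwise errors occur (not yet verified)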
# vi /etc/ha.d/resource.d/killnfsd
-----------------
killall -9 nfsd; /etc/init.d/nfs restart;exit 0
-----------------
Make it executable:
# chmod 755 /etc/ha.d/resource.d/killnfsd
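Create the DRBD script file drbddisk: (node1,node2)
Note:
This is another big pitfall. Anyone unfamiliar with the Heartbeat directory layout will probably get stuck here, because a default yum install of Heartbeat does not create the drbddisk script in /etc/ha.d/resource.d/, and the file cannot be found anywhere else on the local system after installation.
In my case, the virtual IP could not be pinged after starting Heartbeat. Looking through /var/log/ha-log, I found this line:
ERROR: Cannot locate resource script drbddisk
I then checked /etc/ha.d/resource.d/ and found there was indeed no drbddisk script. I finally found the code via Google and created the script, after which the test passed: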
# vi /etc/ha.d/resource.d/drbddisk
-----------------------
#!/bin/bash
#
# This script is intended to be used as resource script by heartbeat
#
# Copyright 2003-2008 LINBIT Information Technologies
# Philipp Reisner, Lars Ellenberg
#
###
DEFAULTFILE="/etc/default/drbd"
DRBDADM="/sbin/drbdadm"
if [ -f $DEFAULTFILE ]; then
    . $DEFAULTFILE
fi
if [ "$#" -eq 2 ]; then
    RES="$1"
    CMD="$2"
else
    RES="all"
    CMD="$1"
fi
## EXIT CODES
# since this is a "legacy heartbeat R1 resource agent" script,
# exit codes actually do not matter that much as long as we conform to
# http://wiki.linux-ha.org/HeartbeatResourceAgent
# but it does not hurt to conform to lsb init-script exit codes,
# where we can.
# http://refspecs.linux-foundation.org/LSB_3.1.0/
#LSB-Core-generic/LSB-Core-generic/iniscrptact.html
####
drbd_set_role_from_proc_drbd()
{
    local out
    if ! test -e /proc/drbd; then
        ROLE="Unconfigured"
        return
    fi
    dev=$( $DRBDADM sh-dev $RES )
    minor=${dev#/dev/drbd}
    if [[ $minor = *[!0-9]* ]] ; then
        # sh-minor is only supported since drbd 8.3.1
        minor=$( $DRBDADM sh-minor $RES )
    fi
    if [[ -z $minor ]] || [[ $minor = *[!0-9]* ]] ; then
        ROLE=Unknown
        return
    fi
    if out=$(sed -ne "/^ *$minor: cs:/ { s/:/ /g; p; q; }" /proc/drbd); then
        set -- $out
        ROLE=${5%/**}
        : ${ROLE:=Unconfigured} # if it does not show up
    else
        ROLE=Unknown
    fi
}
case "$CMD" in
    start)
        # try several times, in case heartbeat deadtime
        # was smaller than drbd ping time
        try=6
        while true; do
            $DRBDADM primary $RES && break
            let "--try" || exit 1 # LSB generic error
            sleep 1
        done
        ;;
    stop)
        # heartbeat (haresources mode) will retry failed stop
        # for a number of times in addition to this internal retry.
        try=3
        while true; do
            $DRBDADM secondary $RES && break
            # We used to lie here, and pretend success for anything != 11,
            # to avoid the reboot on failed stop recovery for "simple
            # config errors" and such. But that is incorrect.
            # Don't lie to your cluster manager.
            # And don't do config errors...
            let --try || exit 1 # LSB generic error
            sleep 1
        done
        ;;
    status)
        if [ "$RES" = "all" ]; then
            echo "A resource name is required for status inquiries."
            exit 10
        fi
        ST=$( $DRBDADM role $RES )
        ROLE=${ST%/**}
        case $ROLE in
            Primary|Secondary|Unconfigured)
                # expected
                ;;
            *)
                # unexpected. whatever...
                # If we are unsure about the state of a resource, we need to
                # report it as possibly running, so heartbeat can, after failed
                # stop, do a recovery by reboot.
                # drbdsetup may fail for obscure reasons, e.g. if /var/lock/ is
                # suddenly readonly. So we retry by parsing /proc/drbd.
                drbd_set_role_from_proc_drbd
        esac
        case $ROLE in
            Primary)
                echo "running (Primary)"
                exit 0 # LSB status "service is OK"
                ;;
            Secondary|Unconfigured)
                echo "stopped ($ROLE)"
                exit 3 # LSB status "service is not running"
                ;;
            *)
                # NOTE the "running" in below message.
                # this is a "heartbeat" resource script,
                # the exit code is _ignored_.
                echo "cannot determine status, may be running ($ROLE)"
                exit 4 # LSB status "service status is unknown"
                ;;
        esac
        ;;
    *)
        echo "Usage: drbddisk [resource] {start|stop|status}"
        exit 1
        ;;
esac
exit 0
-----------------------
Make it executable:
# chmod 755 /etc/ha.d/resource.d/drbddisk
Start the HeartBeat service on both nodes, starting with node1: (node1,node2)
# service heartbeat start
# chkconfig heartbeat on
If the virtual IP 192.168.7.90 answers ping at this point, the configuration works
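Ping only shows that some node answers for the VIP. To see which node actually holds it, the address list on each node can be checked. The sketch below is a small filter over the output of `ip -o -4 addr show`; has_vip is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given IPv4 address appears in
# `ip -o -4 addr show` output read from stdin.
has_vip() {
    # -F: fixed string, so the dots in the address are not regex wildcards
    grep -Fq "inet $1/"
}

# Example (run on each node):
#   ip -o -4 addr show | has_vip 192.168.7.90 && echo "this node holds the VIP"
```

With the configuration in this article, only the node that currently runs the haresources group should report the VIP.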
III. Configure NFS: (node1,node2)
# vi /etc/exports
-----------------
/data *(rw,no_root_squash)
-----------------
Restart the NFS service:
# service rpcbind restart
# service nfs restart
# chkconfig rpcbind on
# chkconfig nfs off
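NFS is set not to start at boot here, because the script /etc/ha.d/resource.d/killnfsd controls starting NFS.
IV. Final test
Mount the virtual IP 192.168.7.90 from a separate Linux client. A successful mount shows that NFS+DRBD+HeartBeat is working.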
# mount -t nfs 192.168.7.90:/data /tmp
# df -h
---------------
......
192.168.7.90:/data 1020M 34M 934M 4% /tmp
---------------
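Testing the availability of DRBD+HeartBeat+NFS:
1. Transfer a file into the mounted /tmp directory, restart the DRBD service on the primary in the middle of the transfer, and observe what happens.
In my test the transfer resumed from where it was interrupted.
2. Reboot the Primary host while everything is in a normal state, then check whether its DRBD role returns to Primary, whether the client can still use the mount, whether previously written files are still there, and whether new files can be written.
In my test it recovered normally; the client did not need to remount the NFS share, the earlier data was present, and files could be written directly.
3. When the Primary host has to be shut down for repair because of hardware damage or some other reason, the Secondary has to be promoted to Primary. How is this done manually?
If the machine can still boot normally, follow the steps below. If it cannot boot, forcibly promote the Secondary to Primary; once the failed machine can boot again, deal with any "split brain" in a follow-up repair.
First unmount the NFS directory on the client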
# umount /tmp
(node1)
Unmount the DRBD device:
# service nfs stop
# umount /data
Demote:
# drbdadm secondary r0
Check the status; the node has been demoted:
# service drbd status
-----------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:45:19
m:res cs ro ds p mounted fstype
0:r0 Connected Secondary/Secondary UpToDate/UpToDate C
-----------------
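Before promoting node2 it may be worth checking mechanically that the resource is Connected and that both disks are UpToDate, as in the status output above. The sketch below parses `service drbd status` output from stdin; safe_to_switch is a hypothetical helper name, and the column layout is assumed to match the output shown in this article.

```shell
# Hypothetical helper: succeed only if resource $1 is Connected and
# both disks are UpToDate in `service drbd status` output on stdin.
safe_to_switch() {
    awk -v res="$1" '
        # status lines look like: 0:r0 Connected Secondary/Secondary UpToDate/UpToDate C
        $1 ~ (":" res "$") {
            found = 1
            ok = ($2 == "Connected" && $4 == "UpToDate/UpToDate")
        }
        END { exit (found && ok) ? 0 : 1 }
    '
}

# Example:
#   service drbd status | safe_to_switch r0 && drbdadm primary r0
```

Promoting a node whose disk is not UpToDate, or while the peer is disconnected, is where split-brain situations tend to start, so a failed check is a reason to stop and look at the state by hand.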
(node2)
Promote:
# drbdadm primary r0
Check the status; the node has been promoted:
# service drbd status
----------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:49:06
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Secondary UpToDate/UpToDate C
----------------
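The DRBD directory is not mounted yet; let Heartbeat take care of mounting it:
Note: if the Heartbeat log shows this error during the restart:
ERROR: glib: ucast: error binding socket. Retrying: Permission denied
check whether SELinux is disabled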
# service heartbeat restart
# service drbd status
-----------------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:49:06
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Secondary UpToDate/UpToDate C /data ext4
------------------------
Heartbeat mounted the DRBD directory successfully
Run the NFS mount test again on the client:
# mount -t nfs 192.168.7.90:/data /tmp
# ll /tmp
------------------
1 10 2 2222 3 4 5 6 7 8 9 lost+found orbit-root
------------------
Reboot the host that was just promoted, and check the status once it is back up:
# service drbd status
------------------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:49:06
m:res cs ro ds p mounted fstype
0:r0 WFConnection Primary/Unknown UpToDate/DUnknown C /data ext4
------------------------
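HeartBeat mounted the DRBD directory successfully, DRBD carried on seamlessly on the backup node, and the client using the NFS mount point did not notice the failure.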
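Test 3 mentions repairing a split brain after a forced promotion but does not show the commands. As a hedged sketch, the helper below only prints the manual recovery sequence commonly used with DRBD 8.4, so it can be reviewed before anything is run. split_brain_cmds is a hypothetical name, and deciding which node is the "victim" (the one whose changes get discarded) is left to the administrator.

```shell
# Hypothetical helper: print (not run) the usual DRBD 8.4 manual split-brain
# recovery commands for one side. $1 = victim|survivor, $2 = resource name.
split_brain_cmds() {
    case "$1" in
        victim)
            # Changes made on this node since the split are thrown away.
            echo "drbdadm disconnect $2"
            echo "drbdadm secondary $2"
            echo "drbdadm connect --discard-my-data $2"
            ;;
        survivor)
            # Only needed if this node is in StandAlone state.
            echo "drbdadm connect $2"
            ;;
        *)
            echo "usage: split_brain_cmds victim|survivor RESOURCE" >&2
            return 1
            ;;
    esac
}

# Example:
#   split_brain_cmds victim r0      # review, then run on the node to be overwritten
#   split_brain_cmds survivor r0    # review, then run on the node whose data is kept
```

Printing instead of executing is deliberate: --discard-my-data destroys one side's changes, so the commands are better copied by hand after checking both nodes.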
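4. Finally, once the machine that went down is back to normal, does it take the Primary resources back?
After the reboot it does not reclaim the resources; the primary/secondary roles have to be switched manually.
Note: this is due to the following parameter in /etc/ha.d/ha.cf: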
--------------------
auto_failback off
--------------------
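It means that after a failed server recovers, the new primary keeps the resources and the recovered old server does not take them back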
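With auto_failback off, moving the resources back is a manual step. Heartbeat ships the hb_takeover and hb_standby scripts for this, but their install directory varies by package. The sketch below just searches a list of candidate directories; find_hb_tool is a hypothetical helper, and the directories in the usage example are assumptions to be checked on the actual system.

```shell
# Hypothetical helper: print the first executable named $1 found in the
# directories given as the remaining arguments.
find_hb_tool() {
    tool=$1
    shift
    for dir in "$@"; do
        if [ -x "$dir/$tool" ]; then
            echo "$dir/$tool"
            return 0
        fi
    done
    return 1
}

# Example (on the recovered node, to take the resources back):
#   tool=$(find_hb_tool hb_takeover /usr/share/heartbeat /usr/lib64/heartbeat /usr/lib/heartbeat) && "$tool"
```

Running hb_standby on the node that currently holds the resources should have the same effect from the other side.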
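5. None of the tests above used heartbeat for automatic failover. When the production DRBD primary goes down, does the backup node take over immediately and seamlessly? Does heartbeat+drbd really deliver high availability?
First mount the NFS share on the client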
# mount -t nfs 192.168.7.90:/data /tmp
a. Simulate stopping the heartbeat service on the primary node node1. Does the standby node node2 take over the service?
(node1)
# service drbd status
----------------------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:45:19
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Secondary UpToDate/UpToDate C /data ext4
----------------------------
# service heartbeat stop
(node2)
# service drbd status
----------------------------------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:49:06
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Secondary UpToDate/UpToDate C /data ext4
-----------------------------------------
The standby took over seamlessly. Check whether the client can still use the NFS share
# cd /tmp
# touch test01
# ls test01
------------------
test01
------------------
Test passed.
b. Simulate a crash of the primary node (hard power-off). Does the standby node node2 take over the service?
(node1)
Force a shutdown by cutting power to the node1 virtual machine
(node2)
# service drbd status
-------------------------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:49:06
m:res cs ro ds p mounted fstype
0:r0 WFConnection Primary/Unknown UpToDate/DUnknown C /data ext4
-------------------------------
The standby took over seamlessly. Check whether the client can still use the NFS share
# cd /tmp
# touch test02
# ls test02
------------------
test02
------------------
Once node1 has booted again, check the DRBD status:
# service drbd status
------------------------------
drbd driver loaded OK; device status:
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by [email protected], 2013-05-27 20:49:06
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Secondary UpToDate/UpToDate C /data ext4
-------------------------------
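node1 has reconnected and is in the UpToDate state. Test passed.
Note: there is some chance that node2 fails to take over when the heartbeat service on node1 is stopped, so this setup carries a maintenance cost. I do not run it in production myself, so I recommend plenty of simulated-failure drills before actually going live.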
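For such drills it helps to measure how long the client actually stalls during a takeover instead of judging by eye. One simple approach, sketched here under the assumption that the NFS share is mounted on /tmp as in the tests above: a loop on the client appends a timestamp to a file on the share once per second, and afterwards the largest gap between consecutive timestamps is computed. max_gap and the probe file name are made up for illustration.

```shell
# Hypothetical helper: print the largest gap (in seconds) between consecutive
# epoch timestamps, one per line, read from the given file(s) or stdin.
max_gap() {
    awk '
        NR > 1 { gap = $1 - prev; if (gap > max) max = gap }
        { prev = $1 }
        END { print max + 0 }
    ' "$@"
}

# Example (on the NFS client, while a failover is triggered on the servers):
#   while true; do date +%s >> /tmp/ha_probe.log; sleep 1; done   # stop with Ctrl-C
#   max_gap /tmp/ha_probe.log
```

A gap close to 1 means the write loop never blocked; a larger value is roughly how long the share was unavailable during that drill.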
-------All done----------
Reference: Jiu Ge (酒哥), 构建高可用LINUX服务器 (Building Highly Available Linux Servers)