Reference: Cluster Software Installation
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/ch-startup-HAAA#s1-clusterinstall-HAAA
# yum install pcs pacemaker fence-agents-all
In practice corosync was installed explicitly as well:
[root@server2 ~]# yum install -y fence-agents-all corosync pacemaker pcs
Post-install check:
[root@server3 ~]# rpm -q pacemaker
pacemaker-1.1.20-5.el7_7.1.x86_64
[root@server3 ~]# grep hacluster /etc/passwd
hacluster:x:189:189:cluster user:/home/hacluster:/sbin/nologin
Configure the host names
[root@server2 ~]# hostnamectl set-hostname server4.example.com
vi /etc/hosts
192.168.122.143 server3.example.com s3
192.168.122.58 server4.example.com s4
[root@server3 ~]# ssh-keygen          # accept the defaults (press Enter)
[root@server3 ~]# ssh-copy-id s4      # copy the public key to the peer
[root@server3 ~]# ssh s4              # verify passwordless login to the peer
Start pcsd on both servers:
systemctl start pcsd
systemctl enable pcsd
[root@server3 ~]# pcs cluster auth server3.example.com server4.example.com
Username: root
Password:
Error: s3: Username and/or password is incorrect
Error: Unable to communicate with s4
[root@server3 ~]#
[root@server3 ~]# pcs cluster auth server3.example.com server4.example.com
Username: hacluster
Password:
Error: Unable to communicate with server4.example.com
server3.example.com: Authorized
[root@server3 ~]#
Authentication with the root user does not work; the hacluster user must be used.
Following the official documentation, add the firewall configuration:
[root@server3 .ssh]# firewall-cmd --permanent --add-service=high-availability
[root@server3 .ssh]# firewall-cmd --add-service=high-availability
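To confirm the rule is active, the service list can be checked (a quick verification step, not part of the original notes):
[root@server3 .ssh]# firewall-cmd --list-services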
I also restarted the pcsd service.
The hacluster password has to be set, on both machines.
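A minimal sketch of that step, assuming the same password is chosen on both nodes:
[root@server3 ~]# passwd hacluster
[root@server4 ~]# passwd hacluster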
[root@server4 ~]# pcs cluster auth server3.example.com server4.example.com
Username: hacluster
Password:
server4.example.com: Authorized
server3.example.com: Authorized
After authentication succeeded, and per the official documentation, I reverted the login shell I had earlier granted to the hacluster user manually:
# usermod -s /sbin/nologin hacluster
Run on one of the nodes:
pcs cluster setup --start --name mytest_cluster server3.example.com server4.example.com
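As an optional check (not in the original log), the corosync configuration that pcs generated can be inspected; it should list both nodes:
[root@server3 ~]# cat /etc/corosync/corosync.conf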
Both machines show the same output:
[root@server3 ~]# pcs cluster status
Cluster Status:
Stack: corosync
Current DC: server3.example.com (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
Last updated: Wed Sep 25 23:30:26 2019
Last change: Wed Sep 25 23:28:19 2019 by hacluster via crmd on server3.example.com
2 nodes configured
0 resources configured
PCSD Status:
server3.example.com: Online
server4.example.com: Online
[root@server3 ~]#
Shut everything down and took a break.
After the reboot, pcs cluster status showed the cluster was not running. Enable automatic startup, from either machine:
[root@server4 ~]# pcs cluster enable --all
server3.example.com: Cluster Enabled
server4.example.com: Cluster Enabled
The cluster still has to be started manually: [root@server4 ~]# pcs cluster start
If it is started only on server4, that node shows both machines as online, but pcs cluster status on server3 still reports the cluster as not running there.
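To start both nodes in one command instead, the --all flag works from either machine (the same flag I use later when restarting the whole cluster):
# pcs cluster start --all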
After rebooting server3, pcs status looks normal:
PCSD Status:
server4.example.com: Online
server3.example.com: Online
Fencing configuration was skipped; see:
Chapter 5. Fencing: Configuring STONITH
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/ch-fencing-haar#s1-stonithlist-HAAR
[root@server3 ~]# pcs stonith list|grep -i virt
fence_virt - Fence agent for virtual machines
fence_xvm - Fence agent for virtual machines
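Just for reference, a fence_xvm STONITH resource for KVM guests would look roughly like the sketch below; the resource name, domain name and key file path are assumptions, and fence_virtd would also need to be set up on the hypervisor, which is why I skipped it here:
# pcs stonith create fence_s3 fence_xvm port="server3" pcmk_host_list="server3.example.com" key_file="/etc/cluster/fence_xvm.key"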
Following the document below, create an active/passive web service.
Chapter 2. An active/passive Apache HTTP Server in a Red Hat High Availability Cluster
Resources to prepare: one floating IP address and one shared disk.
In virt-manager, open the hardware details (the light-bulb icon) of server3; Disk2's path is /var/lib/libvirt/images/hd4ks-clone.raw. Attach the same file to server4.
[root@server4 ~]# fdisk -l
Disk /dev/vdb: 209 MB, 209715200 bytes, 409600 sectors
The disk is ready to use.
The next steps can be run on either node; in practice fdisk -l on server3 did not show the vdb disk, so I ran them on server4 and wiped the old ext filesystem signature.
[root@server4 ~]# pvcreate /dev/vdb
WARNING: ext3 signature detected on /dev/vdb at offset 1080. Wipe it? [y/n]: y
Wiping ext3 signature on /dev/vdb.
Physical volume "/dev/vdb" successfully created.
[root@server4 ~]#
After rebooting, server3 still could not see vdb, so I ticked the "shareable" option for the disk on both VMs, shut both down and started them again.
[root@server3 ~]# vgcreate my_vg /dev/vdb
Volume group "my_vg" successfully created
[root@server3 ~]# lvcreate -L 200 my_vg -n my_lv
Volume group "my_vg" has insufficient free space (49 extents): 50 required.
[root@server3 ~]# lvs      # the 200 MiB LV was not created for lack of space, so lvs prints nothing
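An alternative that avoids guessing a size that fits is to hand all free extents to the LV (not what I did below):
[root@server3 ~]# lvcreate -l 100%FREE -n my_lv my_vg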
[root@server3 ~]# lvcreate -L 190 my_vg -n my_lv
Rounding up size to full physical extent 192.00 MiB
Logical volume "my_lv" created.
[root@server3 ~]# mkfs.ext4 /dev/my_vg/my_lv
At this point server4 could only see vdb and the PV; even after running vgscan the new volume group still did not show up.
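When the peer cannot see freshly created LVM metadata, a device rescan is worth trying first (generic LVM commands; vgscan at least did not help in my case):
[root@server4 ~]# pvscan
[root@server4 ~]# vgscan
[root@server4 ~]# lvscan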
2.2. Web Server Configuration
Install on both nodes: yum install -y httpd wget
So that the resource agent can check Apache's status, the following also needs to be added to /etc/httpd/conf/httpd.conf:
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
The resource agent does not use systemd, so the logrotate configuration needs the following change so that Apache can still be reloaded:
/etc/logrotate.d/httpd
Remove the line:
/bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true
and replace it with:
/usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c "PidFile /var/run/httpd.pid" -k graceful > /dev/null 2>/dev/null || true
Run on one of the nodes:
# mount /dev/my_vg/my_lv /var/www/
# mkdir /var/www/html
# mkdir /var/www/cgi-bin
# mkdir /var/www/error
# restorecon -R /var/www
# cat <<-END >/var/www/html/index.html
Hello
END
# umount /var/www
I tested dropping the minus sign before END (<<END instead of <<-END); the result is the same, since the minus only strips leading tab characters and there are none here.
2.3. Exclusive Activation of a Volume Group in a Cluster
The cluster requires that the volume group not be activated outside the cluster software.
/etc/lvm/lvm.conf
VGs listed in volume_list are activated automatically, so the list must not include the volume group that the cluster will manage.
# lvmconf --enable-halvm --services --startstopservices
This command changes the following parameters and stops lvmetad:
locking_type is set to 1      (already the default)
use_lvmetad is set to 0       (the default was 1)
grep -E "locking_type|use_lvmetad" /etc/lvm/lvm.conf
List the VG names: # vgs --noheadings -o vg_name    (surprisingly many options for such a simple query)
Rebuild the initramfs and reboot (I skipped this step at first) so that the boot image does not activate the volume group:
# dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)
-H installs only the drivers needed to boot this host; -f overwrites the existing image file.
If the kernel has been updated, reboot first and then run the command above.
Deactivate the volume group manually on both nodes:
[root@server4 ~]# vgchange -a n my_vg
0 logical volume(s) in volume group "my_vg" now active
2.4. Creating the Resources and Resource Groups with the pcs Command
Four resources (the LVM volume group, a Filesystem, a floating IPaddr2 address and the application) make up the resource group apachegroup, which guarantees they all run on the same node.
The LVM volume group:
[root@server3 ~]# pcs resource create my_lvm LVM volgrpname=my_vg \
exclusive=true --group apachegroup
Assumed agent name 'ocf:heartbeat:LVM' (deduced from 'LVM')
Resource Group: apachegroup
my_lvm (ocf::heartbeat:LVM): FAILED (Monitoring)[ server3.example.com server4.example.com ]
Failed Resource Actions:
* my_lvm_monitor_0 on server3.example.com 'unknown error' (1): call=5, status=complete, exitreason='The volume_list filter must be initialized in lvm.conf for exclusive activation without clvmd'
Per the error message, when clvmd is not used the volume_list parameter must be set in lvm.conf; the guide mentions this but I had skipped it.
/etc/lvm/lvm.conf
volume_list = []
Since the machines have no other VGs, the value is empty, but the parameter must be present.
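For illustration, if a node also had a local root VG named rhel (an assumption; my machines have none), the line would list it while still excluding my_vg:
volume_list = [ "rhel" ]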
None of the following commands fixed it:
pcs resource restart my_lvm      # restart the resource
pcs resource disable my_lvm      # stop the resource
pcs resource enable my_lvm       # start the resource
pcs resource cleanup my_lvm      # clear the resource failure
pcs cluster stop --all           # stop all cluster nodes
pcs cluster start --all          # start all cluster nodes
reboot
pcs resource show     # shows the resource as stopped; 'show' can be omitted
Resource Group: apachegroup
my_lvm (ocf::heartbeat:LVM): Stopped
Rebuild the initramfs.
Back up the old image first; it turns out the two machines run different kernels:
[root@server3 ~]# cp -p /boot/initramfs-3.10.0-1062.1.1.el7.x86_64.img /boot/initramfs-3.10.0-1062.1.1.el7.x86_64.img.bakok1
[root@server4 ~]# cp -p /boot/initramfs-3.10.0-957.el7.x86_64.img /boot/initramfs-3.10.0-957.el7.x86_64.img.bakok1
[root@server3 ~]# dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)
[root@server3 ~]# reboot
The resource still would not start.
Deleting and recreating it, even with exclusive activation turned off, did not help either:
pcs resource delete my_lvm
pcs resource create my_lvm LVM volgrpname=my_vg \
> exclusive=false --group apachegroup
Check the logs:
[root@server3 ~]# journalctl -xe
Sep 26 17:13:54 server3.example.com pengine[3312]: error: Resource start-up disabled since no STONITH resources have been defined
Sep 26 17:13:54 server3.example.com pengine[3312]: error: Either configure some or disable STONITH with the stonith-enabled option
Sep 26 17:13:54 server3.example.com pengine[3312]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Sep 26 17:13:54 server3.example.com pengine[3312]: notice: Removing my_lvm from server3.example.com
Sep 26 17:13:54 server3.example.com pengine[3312]: notice: Removing my_lvm from server4.example.com
[root@server3 ~]# pcs property show --all
stonith-enabled: true
[root@server3 ~]# pcs property set stonith-enabled=false    # set on one node only; cluster properties apply cluster-wide
[root@server3 ~]# pcs property show --all |grep stonith-enabled
stonith-enabled: false
Re-add the resource; this time it starts:
[root@server3 ~]# pcs resource create my_lvm LVM volgrpname=my_vg \
> exclusive=true --group apachegroup
Assumed agent name 'ocf:heartbeat:LVM' (deduced from 'LVM')
[root@server3 ~]# pcs status
[root@server3 ~]# pcs resource
Resource Group: apachegroup
my_lvm (ocf::heartbeat:LVM): Started server3.example.com
lvdisplay now shows LV Status available.
pcs resource create my_fs Filesystem \
    device="/dev/my_vg/my_lv" directory="/var/www" fstype="ext4" \
    --group apachegroup
Check the mount with df.
pcs resource create VirtualIP IPaddr2 ip=192.168.122.30 \
    cidr_netmask=24 --group apachegroup
Check the floating IP with ip addr.
pcs resource create Website apache \
    configfile="/etc/httpd/conf/httpd.conf" \
    statusurl="http://127.0.0.1/server-status" --group apachegroup
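The status URL that the agent polls can also be checked by hand from the active node, since the Require local directive added earlier allows local access (a verification step, not part of the original log):
[root@server3 ~]# curl http://127.0.0.1/server-status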
Status check:
[root@server3 ~]# pcs status
Cluster name: mytest_cluster
Stack: corosync
Current DC: server3.example.com (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
Last updated: Thu Sep 26 17:27:24 2019
Last change: Thu Sep 26 17:26:27 2019 by root via cibadmin on server3.example.com
2 nodes configured
4 resources configured
Online: [ server3.example.com server4.example.com ]
Full list of resources:
Resource Group: apachegroup
my_lvm (ocf::heartbeat:LVM): Started server3.example.com
my_fs (ocf::heartbeat:Filesystem): Started server3.example.com
VirtualIP (ocf::heartbeat:IPaddr2): Started server3.example.com
Website (ocf::heartbeat:apache): Started server3.example.com
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@server3 ~]#
Access the application.
Opening http://192.168.122.30 in Firefox shows Hello.
[xy@xycto ~]$ curl http://192.168.122.30
Failover tests
(1) Reboot
[root@server3 ~]# reboot
It switched over to server4 within seconds.
(2) Kill the process and stop the cluster software
[root@server4 ~]# ps -ef|grep httpd
root 7389 1 0 17:36 ? 00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache 7391 7389 0 17:36 ? 00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache 7392 7389 0 17:36 ? 00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache 7393 7389 0 17:36 ? 00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache 7394 7389 0 17:36 ? 00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache 7395 7389 0 17:36 ? 00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
root 7642 3468 0 17:36 pts/0 00:00:00 grep --color=auto httpd
[root@server4 ~]# kill -9 7389
The process is restarted automatically; I killed it four times with no failover, just a warning:
Failed Resource Actions:
* Website_monitor_10000 on server4.example.com 'not running' (7): call=42, status=complete, exitreason='',
    last-rc-change='Thu Sep 26 17:37:11 2019', queued=0ms, exec=0ms
The configuration shows a 10-second monitor interval, while the kill-and-restart cycle probably took only about 3 seconds.
[root@server4 ~]# pcs config
Operations: monitor interval=10s timeout=20s (Website-monitor-interval-10s)
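If faster detection were wanted, the monitor interval could in principle be shortened with pcs resource update (a sketch, not something I actually changed):
# pcs resource update Website op monitor interval=5s timeout=20s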
Time for a harsher test:
# mv /sbin/httpd /sbin/httpdbak
Killing the process after that triggered a failover immediately, in roughly one second.
The logs show the software is smart enough not to actually retry the start 1,000,000 times:
Sep 26 17:54:08 server3.example.com apache(Website)[9380]: ERROR: apache httpd program not found
Sep 26 17:54:08 server3.example.com apache(Website)[9396]: ERROR: environment is invalid, resource considered stopped
Sep 26 17:54:08 server3.example.com lrmd[3325]: notice: Website_monitor_10000:9320:stderr [ ocf-exit-reason:apache httpd program not found ]
Sep 26 17:54:08 server3.example.com lrmd[3325]: notice: Website_monitor_10000:9320:stderr [ ocf-exit-reason:environment is invalid, resource considered stopped ]
Sep 26 17:54:08 server3.example.com crmd[3332]: notice: server3.example.com-Website_monitor_10000:41 [ ocf-exit-reason:apache httpd program not found\nocf-exit-reason:environment is invalid, resource considered stopped\n ]
...
Sep 26 17:54:09 server3.example.com pengine[3331]: warning: Processing failed start of Website on server3.example.com: not installed
Sep 26 17:54:09 server3.example.com pengine[3331]: notice: Preventing Website from re-starting on server3.example.com: operation start failed 'not installed' (5)
Sep 26 17:54:09 server3.example.com pengine[3331]: warning: Forcing Website away from server3.example.com after 1000000 failures (max=1000000)
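The failure count behind that 1000000 figure can be viewed and reset per resource (commands I did not need at this point):
# pcs resource failcount show Website
# pcs resource failcount reset Website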
It may not recover on its own.
After stopping the cluster software on server4 (pcs cluster stop):
2 nodes configured
4 resources configured
Online: [ server3.example.com ]
OFFLINE: [ server4.example.com ]
Full list of resources:
Resource Group: apachegroup
my_lvm (ocf::heartbeat:LVM): Started server3.example.com
my_fs (ocf::heartbeat:Filesystem): Started server3.example.com
VirtualIP (ocf::heartbeat:IPaddr2): Started server3.example.com
Website (ocf::heartbeat:apache): Stopped
Failed Resource Actions:
* Website_start_0 on server3.example.com 'not installed' (5): call=44, status=complete, exitreason='environment is invalid, resource considered stopped',
Manually clear the failed state:
[root@server3 ~]# pcs resource cleanup Website
It still would not start:
Sep 26 18:13:53 server3.example.com LVM(my_lvm)[3460]: WARNING: LVM Volume my_vg is not available (stopped)
Sep 26 18:13:53 server3.example.com crmd[3321]: notice: Result of probe operation for my_lvm on server3.example.com: 7 (not running)
Sep 26 18:13:53 server3.example.com crmd[3321]: notice: Initiating monitor operation my_fs_monitor_0 locally on server3.example.com
Sep 26 18:13:53 server3.example.com Filesystem(my_fs)[3480]: WARNING: Couldn't find device [/dev/my_vg/my_lv]. Expected /dev/??? to exist
The cluster on server4 has to be started again: [root@server4 ~]# pcs cluster start
After that, everything recovers automatically.
Summary:
1. After a damaged httpd binary is restored, things may still misbehave; a reboot is the safest fix.
2. When the cluster software is stopped on one node, do not reboot the other node.
(3) Remove the floating IP address
[root@server4 ~]# ip addr del 192.168.122.30/24 dev eth0
The address is brought back up automatically on the same node.
[root@server4 ~]# ip link set down dev eth0
After that, server4 can only be reached through the KVM / virt-manager console, and server4 itself still does not show a switchover.
On server3, however, the failover has completed and the web page is reachable as before.
Online: [ server3.example.com ]
OFFLINE: [ server4.example.com ]
Full list of resources:
Resource Group: apachegroup
my_lvm (ocf::heartbeat:LVM): Started server3.example.com
my_fs (ocf::heartbeat:Filesystem): Started server3.example.com
VirtualIP (ocf::heartbeat:IPaddr2): Started server3.example.com
Website (ocf::heartbeat:apache): Started server3.example.com
[xy@xycto ~]$ curl http://192.168.122.30
Hello
Bring the NIC on server4 back up (ifconfig eth0 up): the floating address is removed from server4 automatically, the node rejoins the cluster, and pcs status shows the same output on both machines.