Installing and Verifying Pacemaker on CentOS

    • Set up SSH trust between the two nodes (not strictly required)
    • Add authentication between the nodes (run on one node)
    • Configure the cluster
    • Exclusive volume group activation
    • Create the resource group
      • pcs status shows the resource failing to start
      • Add the remaining resources
    • Verification

Reference: Cluster Software Installation
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/ch-startup-HAAA#s1-clusterinstall-HAAA

# yum install pcs pacemaker fence-agents-all

In practice corosync was installed explicitly as well:
[root@server2 ~]# yum install -y fence-agents-all corosync pacemaker pcs

Post-install check:

[root@server3 ~]# rpm -q pacemaker
pacemaker-1.1.20-5.el7_7.1.x86_64
[root@server3 ~]# grep hacluster /etc/passwd
hacluster:x:189:189:cluster user:/home/hacluster:/sbin/nologin

Configure the hostnames:
[root@server2 ~]# hostnamectl set-hostname server4.example.com

vi /etc/hosts
192.168.122.143 server3.example.com s3
192.168.122.58 server4.example.com s4

Set up SSH trust between the two nodes (not strictly required)

[root@server3 ~]# ssh-keygen      # accept the defaults
[root@server3 ~]# ssh-copy-id s4  # copy the public key to the peer
[root@server3 ~]# ssh s4          # verify passwordless login to the peer

Start pcsd on both servers:
systemctl start pcsd
systemctl enable pcsd

Add authentication between the nodes (run on one node)

[root@server3 ~]# pcs cluster auth server3.example.com server4.example.com
Username: root
Password: 
Error: s3: Username and/or password is incorrect
Error: Unable to communicate with s4
[root@server3 ~]#
[root@server3 ~]# pcs cluster auth server3.example.com server4.example.com
Username: hacluster
Password: 
Error: Unable to communicate with server4.example.com
server3.example.com: Authorized
[root@server3 ~]#

The root user cannot be used for this authentication.
Per the official docs, add the firewall configuration:
[root@server3 .ssh]# firewall-cmd --permanent --add-service=high-availability
[root@server3 .ssh]# firewall-cmd --add-service=high-availability

The pcsd service was also restarted.
The hacluster password must be set; it needs to be changed on both machines.
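
A minimal way to do that (run on each node; using the same password on both keeps things simple):

# passwd hacluster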

[root@server4 ~]# pcs cluster auth server3.example.com server4.example.com
Username: hacluster
Password: 
server4.example.com: Authorized
server3.example.com: Authorized

After authentication succeeds, per the official docs, revoke the login shell that was manually granted to the hacluster user earlier:

# usermod  -s /sbin/nologin hacluster

Configure the cluster

Run on one of the nodes:
pcs cluster setup --start --name mytest_cluster server3.example.com server4.example.com
Both machines show the same output.

[root@server3 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: server3.example.com (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
 Last updated: Wed Sep 25 23:30:26 2019
 Last change: Wed Sep 25 23:28:19 2019 by hacluster via crmd on server3.example.com
 2 nodes configured
 0 resources configured

PCSD Status:
  server3.example.com: Online
  server4.example.com: Online
[root@server3 ~]# 

Shut the machines down for the night.
After restarting, pcs cluster status reported the cluster as not running, so enable autostart (can be done from either machine):
[root@server4 ~]# pcs cluster enable --all
server3.example.com: Cluster Enabled
server4.example.com: Cluster Enabled
The cluster still has to be started manually this time:
[root@server4 ~]# pcs cluster start
When started only on server4, both nodes show as online there, but pcs cluster status on server3 still reports the cluster as not running.
After rebooting server3, the pcs status is normal:
PCSD Status:
server4.example.com: Online
server3.example.com: Online

Fencing configuration was skipped for now; see:
Chapter 5. Fencing: Configuring STONITH
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/ch-fencing-haar#s1-stonithlist-HAAR

[root@server3 ~]# pcs stonith list|grep -i virt
fence_virt - Fence agent for virtual machines
fence_xvm - Fence agent for virtual machines
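
If you do want fencing for these KVM guests, a rough sketch (it assumes fence_virtd is already configured on the hypervisor and the key file /etc/cluster/fence_xvm.key has been distributed to the guests; the resource name my_fence is illustrative):

# pcs stonith describe fence_xvm      # list the parameters the agent accepts
# pcs stonith create my_fence fence_xvm key_file=/etc/cluster/fence_xvm.key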

Following the guide below, create an active/passive web service:
Chapter 2. An active/passive Apache HTTP Server in a Red Hat High Availability Cluster

Prepare the resources: one floating IP address and one shared disk.
In virt-manager, open server3's hardware details (the lightbulb icon); Disk 2's path is /var/lib/libvirt/images/hd4ks-clone.raw. Add the same file to server4.
[root@server4 ~]# fdisk -l
Disk /dev/vdb: 209 MB, 209715200 bytes, 409600 sectors
It is ready to use as-is.

Run on either node.
In practice fdisk -l on server3 did not show the vdb disk, so this was done on server4, after deleting the existing ext partition.

[root@server4 ~]# pvcreate /dev/vdb
WARNING: ext3 signature detected on /dev/vdb at offset 1080. Wipe it? [y/n]: y
  Wiping ext3 signature on /dev/vdb.
  Physical volume "/dev/vdb" successfully created.
[root@server4 ~]#

After rebooting, s3 still could not see vdb. Tick the 'shareable' option for the disk on both VMs, shut both down, then start them again.
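
For reference, the same disk can be attached from the host's CLI instead of virt-manager; a sketch, assuming the guests are named server3 and server4 in libvirt (run once per guest):

# virsh attach-disk server3 /var/lib/libvirt/images/hd4ks-clone.raw vdb \
    --mode shareable --subdriver raw --persistent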

[root@server3 ~]# vgcreate my_vg /dev/vdb
  Volume group "my_vg" successfully created
[root@server3 ~]# lvcreate -L 200 my_vg -n my_lv
  Volume group "my_vg" has insufficient free space (49 extents): 50 required.
[root@server3 ~]# lvs    # with less than 200 MB free the create fails, so lvs shows nothing
[root@server3 ~]# lvcreate -L 190 my_vg -n my_lv
  Rounding up size to full physical extent 192.00 MiB
  Logical volume "my_lv" created.
[root@server3 ~]# mkfs.ext4 /dev/my_vg/my_lv
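
To see how much space is actually usable before picking a size, the free extents can be queried directly:

# vgs my_vg -o vg_free,vg_extent_size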

At this point server4 only sees vdb and the PV; even after running vgscan the VG is still not visible.

2.2. Web Server Configuration
Install on both nodes: yum install -y httpd wget
So that the resource agent can check Apache's status, add the following to /etc/httpd/conf/httpd.conf:

<Location /server-status>
    SetHandler server-status
    Require local
</Location>
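
To sanity-check the status handler before handing httpd over to the cluster, Apache can be started briefly by hand (remember to stop it again; the cluster expects to own it):

# systemctl start httpd
# curl http://127.0.0.1/server-status
# systemctl stop httpd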

The resource agent does not use systemd, so /etc/logrotate.d/httpd needs the following change for log rotation to still reload Apache.
Delete the line:
/bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true
and replace it with:
/usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c "PidFile /var/run/httpd.pid" -k graceful > /dev/null 2>/dev/null || true
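
The new rule can be exercised without waiting for the nightly cron run (only meaningful on the node currently serving the resource):

# logrotate -f /etc/logrotate.d/httpd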

Run on one of the nodes:

# mount /dev/my_vg/my_lv /var/www/
# mkdir /var/www/html
# mkdir /var/www/cgi-bin
# mkdir /var/www/error
# restorecon -R /var/www
# cat <<-END >/var/www/html/index.html
<html>
<body>Hello</body>
</html>
END
# umount /var/www

Tested: dropping the minus before END (<<END instead of <<-END) behaves the same here; the '-' only strips leading tabs from the heredoc.

Exclusive volume group activation

2.3. Exclusive Activation of a Volume Group in a Cluster
The cluster requires that volume groups not be activated outside of the cluster software.
In /etc/lvm/lvm.conf, any VG listed in volume_list is activated automatically, so the list must not include the volume group the cluster will manage.
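
For example, on a machine whose root filesystem lived on a local volume group (hypothetical name rhel_root), the entry would look like:

volume_list = [ "rhel_root" ]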

# lvmconf --enable-halvm --services --startstopservices

This command changes the following parameters and stops lvmetad:
locking_type is set to 1 (already the default)
use_lvmetad is set to 0 (the default was 1)
grep -E "locking_type|use_lvmetad" /etc/lvm/lvm.conf
To list the VG names: # vgs --noheadings -o vg_name   (quite a few options for such a simple query)

Rebuild the initramfs and reboot (I skipped this at first) to keep the boot image from activating the volume group:

# dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)

-H installs only the drivers needed to boot this host; -f overwrites the existing file.
If the kernel has been updated, reboot before running the command above.
Manually deactivate the volume group on both nodes:
[root@server4 ~]# vgchange -a n my_vg
0 logical volume(s) in volume group "my_vg" now active

Create the resource group

2.4. Creating the Resources and Resource Groups with the pcs Command
Four resources (the LVM volume group, a Filesystem, a floating IP via IPaddr2, and the application) form the resource group apachegroup, which guarantees they all run on the same machine.
The LVM volume group:

[root@server3 ~]# pcs resource create my_lvm LVM volgrpname=my_vg \
 exclusive=true --group apachegroup
Assumed agent name 'ocf:heartbeat:LVM' (deduced from 'LVM')

pcs status shows the resource failing to start:

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	FAILED (Monitoring)[ server3.example.com server4.example.com ]

Failed Resource Actions:
* my_lvm_monitor_0 on server3.example.com 'unknown error' (1): call=5, status=complete, exitreason='The volume_list filter must be initialized in lvm.conf for exclusive activation without clvmd'

Per the error message, when clvmd is not running the volume_list parameter must be configured in lvm.conf for exclusive activation; the guide mentions this, but I had overlooked it.
/etc/lvm/lvm.conf
volume_list = []
Since this machine has no other VGs the value is empty, but the parameter must be present.
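
A quick check that the filter works: manual activation of my_vg outside the cluster should now be refused.

# vgchange -a y my_vg    # expect 0 volumes activated, since my_vg is not in volume_list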

None of the following resolved it:
pcs resource restart my_lvm     # restart the resource
pcs resource disable my_lvm     # stop the resource
pcs resource enable my_lvm      # start the resource
pcs resource cleanup my_lvm     # clear the resource failure
pcs cluster stop --all          # stop the cluster on all nodes
pcs cluster start --all         # start the cluster on all nodes
reboot
pcs resource show               # shows the resource as stopped ('show' can be omitted)

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Stopped

Rebuild the initramfs.
Back up first; note that the two machines turn out to be running different kernels:

[root@server3 ~]# cp -p /boot/initramfs-3.10.0-1062.1.1.el7.x86_64.img /boot/initramfs-3.10.0-1062.1.1.el7.x86_64.img.bakok1
[root@server4 ~]# cp -p /boot/initramfs-3.10.0-957.el7.x86_64.img /boot/initramfs-3.10.0-957.el7.x86_64.img.bakok1
[root@server3 ~]# dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)
[root@server3 ~]# reboot
It still fails to start.
Deleting and re-creating the resource, even with exclusive activation turned off, does not help either:
pcs resource delete my_lvm
pcs resource create my_lvm LVM volgrpname=my_vg \
> exclusive=false --group apachegroup

Check the logs:

[root@server3 ~]# journalctl -xe
Sep 26 17:13:54 server3.example.com pengine[3312]:    error: Resource start-up disabled since no STONITH resources have been defined
Sep 26 17:13:54 server3.example.com pengine[3312]:    error: Either configure some or disable STONITH with the stonith-enabled option
Sep 26 17:13:54 server3.example.com pengine[3312]:    error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Sep 26 17:13:54 server3.example.com pengine[3312]:   notice: Removing my_lvm from server3.example.com
Sep 26 17:13:54 server3.example.com pengine[3312]:   notice: Removing my_lvm from server4.example.com
[root@server3 ~]# pcs property show --all
stonith-enabled: true
[root@server3 ~]# pcs property set stonith-enabled=false    # only needs to be set on one node
[root@server3 ~]# pcs property show --all |grep stonith-enabled
 stonith-enabled: false
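
crm_verify (shipped with pacemaker) gives a quick confirmation that the live configuration no longer reports errors:

# crm_verify -L -V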

Re-add the resource; this time it starts successfully:

[root@server3 ~]# pcs resource create my_lvm LVM volgrpname=my_vg \
>  exclusive=true --group apachegroup
Assumed agent name 'ocf:heartbeat:LVM' (deduced from 'LVM')
[root@server3 ~]# pcs status
[root@server3 ~]# pcs resource
 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
lvdisplay now shows:  LV Status              available

Add the remaining resources

pcs resource create my_fs Filesystem \
  device="/dev/my_vg/my_lv" directory="/var/www" fstype="ext4" --group apachegroup
Check the mount with df.

pcs resource create VirtualIP IPaddr2 ip=192.168.122.30 \
  cidr_netmask=24 --group apachegroup
Check the floating IP with ip ad.

pcs resource create Website apache \
  configfile="/etc/httpd/conf/httpd.conf" \
  statusurl="http://127.0.0.1/server-status" --group apachegroup

Verification

Status check:

[root@server3 ~]# pcs status
Cluster name: mytest_cluster
Stack: corosync
Current DC: server3.example.com (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
Last updated: Thu Sep 26 17:27:24 2019
Last change: Thu Sep 26 17:26:27 2019 by root via cibadmin on server3.example.com

2 nodes configured
4 resources configured

Online: [ server3.example.com server4.example.com ]

Full list of resources:

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started server3.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started server3.example.com
     Website	(ocf::heartbeat:apache):	Started server3.example.com

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@server3 ~]#

Access the application:
Browsing to http://192.168.122.30 in Firefox shows Hello.
[xy@xycto ~]$ curl http://192.168.122.30
<html>
<body>Hello</body>
</html>
At this point systemctl status httpd on both machines reports the service as not started (Active: inactive (dead)); the cluster runs httpd directly rather than through systemd.

Failover tests
(1) Reboot
[root@server3 ~]# reboot
Failover to server4 takes only seconds.
(2) Kill the process, then stop the cluster software

[root@server4 ~]# ps -ef|grep httpd
root      7389     1  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7391  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7392  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7393  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7394  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7395  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
root      7642  3468  0 17:36 pts/0    00:00:00 grep --color=auto httpd
[root@server4 ~]# kill -9 7389

The process gets restarted automatically; four kills in a row triggered no failover, only a warning:

Failed Resource Actions:
* Website_monitor_10000 on server4.example.com 'not running' (7): call=42, status=complete, exitreason='',
    last-rc-change='Thu Sep 26 17:37:11 2019', queued=0ms, exec=0ms

The configuration shows a 10-second monitor interval, while the killed process is restarted in an estimated 3 seconds, so the monitor never catches the outage.
[root@server4 ~]# pcs config
Operations: monitor interval=10s timeout=20s (Website-monitor-interval-10s)
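
If faster detection is wanted, the monitor interval can be tightened; a sketch with illustrative values:

# pcs resource update Website op monitor interval=5s timeout=20s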
Time for a more drastic measure:

# mv /sbin/httpd /sbin/httpdbak

After that, killing the process triggers an immediate failover, within about a second.

The logs show the software is fairly smart about it; it did not actually retry 1,000,000 times:

Sep 26 17:54:08 server3.example.com apache(Website)[9380]: ERROR: apache httpd program not found
Sep 26 17:54:08 server3.example.com apache(Website)[9396]: ERROR: environment is invalid, resource considered stopped
Sep 26 17:54:08 server3.example.com lrmd[3325]:   notice: Website_monitor_10000:9320:stderr [ ocf-exit-reason:apache httpd program not found ]
Sep 26 17:54:08 server3.example.com lrmd[3325]:   notice: Website_monitor_10000:9320:stderr [ ocf-exit-reason:environment is invalid, resource considered stopped ]
Sep 26 17:54:08 server3.example.com crmd[3332]:   notice: server3.example.com-Website_monitor_10000:41 [ ocf-exit-reason:apache httpd program not found\nocf-exit-reason:environment is invalid, resource considered stopped\n ]
...
Sep 26 17:54:09 server3.example.com pengine[3331]:  warning: Processing failed start of Website on server3.example.com: not installed
Sep 26 17:54:09 server3.example.com pengine[3331]:   notice: Preventing Website from re-starting on server3.example.com: operation start failed 'not installed' (5)
Sep 26 17:54:09 server3.example.com pengine[3331]:  warning: Forcing Website away from server3.example.com after 1000000 failures (max=1000000)
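
That 1000000 is the default migration-threshold; lowering it makes the resource move to the other node after fewer local failures (the value 3 is illustrative):

# pcs resource meta Website migration-threshold=3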

The resource may not recover on its own.
After stopping the cluster software on s4 (pcs cluster stop):

2 nodes configured
4 resources configured

Online: [ server3.example.com ]
OFFLINE: [ server4.example.com ]

Full list of resources:

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started server3.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started server3.example.com
     Website	(ocf::heartbeat:apache):	Stopped

Failed Resource Actions:
* Website_start_0 on server3.example.com 'not installed' (5): call=44, status=complete, exitreason='environment is invalid, resource considered stopped',

Manually clear the failure state:
[root@server3 ~]# pcs resource cleanup Website
It still fails to start:

* Website_start_0 on server3.example.com 'unknown error' (1): call=63, status=Timed Out, exitreason='',
After a reboot, all resources showed as Stopped.
Checking the logs with journalctl was not very useful; I did not dig for the key message.
Sep 26 18:13:53 server3.example.com LVM(my_lvm)[3460]: WARNING: LVM Volume my_vg is not available (stopped)
Sep 26 18:13:53 server3.example.com crmd[3321]:   notice: Result of probe operation for my_lvm on server3.example.com: 7 (not running)
Sep 26 18:13:53 server3.example.com crmd[3321]:   notice: Initiating monitor operation my_fs_monitor_0 locally on server3.example.com
Sep 26 18:13:53 server3.example.com Filesystem(my_fs)[3480]: WARNING: Couldn't find device [/dev/my_vg/my_lv]. Expected /dev/??? to exist

The cluster software on server4 has to be started:
[root@server4 ~]# pcs cluster start
Everything then recovers automatically.

Takeaways:
1. After the broken-httpd exercise, anomalies can linger even once the binary is restored; rebooting is the safest recovery.
2. When the cluster software has been stopped on one machine, do not reboot the other one.

(3) Remove the floating IP address
[root@server4 ~]# ip addr del 192.168.122.30/24 dev eth0
The address is re-added automatically on the same node.
[root@server4 ~]# ip link set down dev eth0
With the NIC down, s4 can only be reached via the KVM/virt-manager console; s4 itself shows no failover,
but on server3 the failover has already completed, and the web page remains reachable:

Online: [ server3.example.com ]
OFFLINE: [ server4.example.com ]

Full list of resources:

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started server3.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started server3.example.com
     Website	(ocf::heartbeat:apache):	Started server3.example.com
[xy@xycto ~]$ curl http://192.168.122.30

Hello

Bring s4's NIC back up (ifconfig eth0 up): the floating address is automatically removed from s4, the node rejoins the cluster, and pcs status shows the same output on both machines.
