HA test summary
The current HA master failover behavior is as follows.
Environment: machine A and machine B
Failover logic:
1. While the master process on machine A is the primary node, machine B does the following when it starts its Master process:
    1. Send a heartbeat; the master on the primary node replies with 1001 to confirm the connection.
    2. Sync the fsimage from the primary node to the backup node. Note: this is done only once, when the master process starts; the fsimage is never synced again afterwards.
    3. Periodically sync the fslog file; the backup node receives the fslog incrementally, so the file only grows.
    4. Send a heartbeat every 3 seconds; if there is no response for more than 5 seconds, raise a timeout error.
2. When the master process on machine A is killed, the control process on machine A restarts it. If it keeps getting killed (the retry count is exceeded), the master process on that machine is set to Error (crm_mon shows the master in error state).
3. Once the HA control process on the primary node considers the local master process to be in error, it starts the master process on the backup machine as the new primary (the switch completes within 20 seconds).
4. After the backup node (B) has been promoted to primary, its master process replays the fslog file so that the data is brought fully in sync.
5. If the master process on machine A is started again but HA does not refresh its status, HA still considers it to be in Error state. So if the master process on node B then goes down, the master on machine A is not promoted to primary and no master is left serving external requests, unless the HA on machine B is restarted so that the master status gets refreshed.
6. Following on from 5: once the Master process on machine A has been started, it can keep working as the primary node, but crm_mon still reports its status as Error.
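A minimal sketch of the manual check and refresh described in steps 5 and 6 (which node gets the heartbeat restart is left to the operator; these notes mention B here and A in the Notes section):
    # watch the master resource state that HA reports (refreshes every 5 s)
    /usr/sbin/crm_mon -i 5
    # if a restarted master is still shown as Error, restart heartbeat on the
    # relevant node so HA refreshes the master's status
    /etc/init.d/heartbeat restart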
Test points:
1. A primary and B backup, switch by killing the master process: ok
2. A primary and B backup, switch by restarting /etc/init.d/heartbeat: ok (bug fixed)
3. Switch with a large fsimage: ok (bug fixed)
4. Switch with a large fslog: ok
5. Switch during a put operation: error handling; the data transfer fails and the bad data blocks remain in the system and must be removed by hand.
6. Switch during an rmfs operation: error handling; only the delete action is logged, which is fast, and the actual deletion is done by the dataservers, so the switch has no effect.
7. Repeated back-and-forth switching: needs manual intervention (restart heartbeat). ok
8. While putting a large file (3 GB), an fslog error occurred and the master restarted; it is not clear whether this is related to HA.
Common issues:
1. Configure the ha.cf file; some keys need to be modified:
    ucast eth0 10.0.38.33   // this should be the other machine's IP address
    node dc_13              // you should list every node in the cluster here
    ping 10.0.38.156        // a ping node, only used to test whether the network is reachable
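A minimal ha.cf sketch built around those entries; the timing, logging, and second node name are assumptions, only the ucast/node/ping lines mirror the values above:
    # /etc/ha.d/ha.cf (sketch; timing/logging values and dc_14 are assumptions)
    logfile   /var/log/ha-log
    keepalive 2               # seconds between heartbeat packets
    deadtime  10              # declare the peer dead after this many seconds
    initdead  30              # extra grace period right after startup
    udpport   694
    ucast eth0 10.0.38.33     # the other machine's IP address
    ping 10.0.38.156          # ping node, only used to test network reachability
    node dc_13                # one "node" line per cluster member
    node dc_14
    crm yes                   # enable the v2 CRM so crm_mon / cib.xml apply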
2. Configure the haresources file.
    There are three columns:
    the first column is the hostname of the primary node;
    the second is an IP address that is never otherwise used on this network;
    the third is the application you want heartbeat to manage, usually a script in /etc/init.d (call it "any_server").
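For example, a single line along these lines (the failover IP here is a placeholder):
    # /etc/ha.d/haresources: primary-node  failover-IP  resource-script
    dc_13 10.0.38.200 any_server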
3. Configure any_server.
    It is a script in /etc/init.d and is called by heartbeat.
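Heartbeat invokes the script with start/stop (and status) arguments, so a skeleton for /etc/init.d/any_server might look like this; the commands that actually start and stop the master process are placeholders:
    #!/bin/sh
    # /etc/init.d/any_server -- skeleton resource script called by heartbeat
    case "$1" in
      start)
        echo "starting master"
        # /path/to/master &          # placeholder: launch the master process
        ;;
      stop)
        echo "stopping master"
        # placeholder: stop the running master process
        ;;
      status)
        # report running/stopped so HA can probe the resource
        echo "unknown"
        ;;
      *)
        echo "usage: $0 {start|stop|status}"
        exit 1
        ;;
    esac
    exit 0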
4. Update the cib.xml file.
    Whenever you modify the configuration files, you should run /usr/lib64/heartbeat/haresources2cib.py;
    it regenerates the cib.xml file.
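For example (this assumes the script's defaults pick up the ha.cf/haresources files just edited):
    # regenerate the CIB after changing ha.cf / haresources
    /usr/lib64/heartbeat/haresources2cib.py
    # the generated file ends up under /var/lib/heartbeat/crm/cib.xml;
    # restart heartbeat so the cluster reads the new configuration
    /etc/init.d/heartbeat restart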
5. Fix for the master process flapping between the primary and backup machines.
    The problem:
    when the heartbeat daemon on the primary (machine A) is restarted,
    1. when A's heartbeat stops, HA makes machine B the primary server;
    2. when A's heartbeat starts again, HA makes machine A the primary server again.
    This flapping causes problems such as data not being picked up and the master process not starting.
Solution:
    restrict the automatic fail-back to machine A through configuration:
    modify the relevant configuration item in /var/lib/heartbeat/crm/cib.xml.
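The notes do not record the exact item. In heartbeat 2.x a common way to stop a resource from failing back is resource stickiness, so the change presumably resembled the nvpair below; the property name and value are an assumption, not the actual edit:
    # sketch, not the exact edit from the notes: give resources "stickiness"
    # so they stay on B instead of moving back when A's heartbeat returns
    # (exact property spelling varies between versions)
    #   <nvpair id="default_resource_stickiness"
    #           name="default_resource_stickiness" value="INFINITY"/>
    grep -n "stickiness" /var/lib/heartbeat/crm/cib.xml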
6. Issues installing HA on 64-bit machines
    1. libnet version problem: installing the 64-bit rpm package directly often fails with errors in send.c; downloading the source package and building it from source solved the problem.
    2. With a default installation, check the /etc/ha.d/shellfuncs file to make sure HA_BIN points at the /usr/lib64/ directory; if the files were copied from an earlier 32-bit installation, it defaults to /usr/lib/.
    3. The user and group must be added before installing; adding them afterwards does not help. If the user cannot be found at install time, the directories do not get the right permissions, and the system is forced to restart when heartbeat is started.
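A sketch of the corresponding pre-install steps; the haclient group and hacluster user are heartbeat's stock account names, assumed here from the default packaging:
    # add the group and user *before* installing heartbeat
    groupadd haclient
    useradd -g haclient hacluster
    # after installing on a 64-bit box, confirm HA_BIN points at /usr/lib64
    grep -i ha_bin /etc/ha.d/shellfuncs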
Notes:
1. The logic of fsimage and fslog synchronization:
    when the slave master starts, the primary master sends the fsimage to the slave master server; after that the primary master never sends the fsimage again and only sends fslog records, so the fslog on the slave master keeps growing.
    When the master switches, the slave master replays the fslog to bring its fsimage file up to date.
2. The heartbeat wait time is 5 s.
3. When /etc/init.d/delcae start is run it fails; the reason is probably that a master process already exists. The message it prints is not very clear.
4. About master switching: there are masters A and B; master A is configured as the primary master and B as the slave master.
    4.1 When the master process on A is killed (sometimes HA restarts it; once it is really down, kill it again), HA switches the primary master role to B.
    At this point you can monitor HA with /usr/sbin/crm_mon -i 5; the master process on A is in error status, so HA will not restart that process on A. Heartbeat has to be restarted on A (/etc/init.d/heartbeat restart) if you want it to work again.
    Otherwise, even if the master on B goes down, the master on A will not take over.
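The switch described in 4.1 can be reproduced roughly like this (the pkill pattern for the master process is a placeholder):
    # in one terminal, watch the cluster state every 5 seconds
    /usr/sbin/crm_mon -i 5
    # on machine A, kill the master process; repeat if HA restarts it,
    # until crm_mon shows it as Error and B is promoted to primary
    pkill master
    # when done, restart heartbeat on A to bring it back into service
    /etc/init.d/heartbeat restart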