IBM HACMP Two-Node Hot Standby Implementation, Part 2

IV. Common HACMP Failures and How They Are Handled


HACMP detects and responds to three types of failure: (1) network adapter failure, (2) network failure, and (3) node failure. Each is described below.


1. Network adapter failure


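Besides the TCP/IP network, an HACMP cluster has a non-TCP/IP network. This is in effect a "heartbeat" line whose sole purpose is to tell whether a node has crashed or only the network has failed. As shown in the figure below, once a node joins the cluster (that is, HACMP has started normally on that node), each of its network adapters and the non-TCP/IP network continuously send and receive Keep-Alive (K-A) signals. The K-A parameters are tunable: after a set number of consecutive packets are lost, HACMP concludes that the peer's adapter, network, or node has failed. With K-A in place, HACMP detects an adapter failure easily, because K-A packets sent to a failed adapter are lost.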
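
The detection rule above (declare a failure after a set number of consecutive lost K-A packets) can be sketched in a few lines of Python. This is only an illustration of the counting logic, not HACMP code: the class name, the function name, and the threshold of 3 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks consecutive missed keep-alive packets for one peer interface."""
    miss_limit: int = 3          # hypothetical threshold, not an HACMP default
    consecutive_misses: int = 0
    failed: bool = False

    def record(self, received: bool) -> bool:
        """Record one keep-alive interval; return True while the peer counts as failed."""
        if received:
            # Any packet that arrives resets the counter.
            self.consecutive_misses = 0
            self.failed = False
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses >= self.miss_limit:
                self.failed = True
        return self.failed


def first_failure_index(samples, miss_limit=3):
    """Return the index of the interval at which failure is declared, or None."""
    monitor = HeartbeatMonitor(miss_limit=miss_limit)
    for i, received in enumerate(samples):
        if monitor.record(received):
            return i
    return None


if __name__ == "__main__":
    # Two misses, a recovery, then three misses in a row.
    print(first_failure_index([True, False, False, True, False, False, False]))
```

Isolated losses never trip the monitor, because the counter resets whenever a packet arrives; only an unbroken run of misses does.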


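The cluster manager (the "brain" of HACMP) on node 1 then generates a swap-adapter event and runs that event's script. (HACMP ships event scripts for most common environments; they are written with standard AIX commands and HACMP utilities.) Every node has at least two network adapters: a service adapter, which serves external clients, and a standby adapter, which only the cluster manager knows about. Applications and clients are unaware of the standby adapter.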


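When a swap-adapter event occurs, the cluster manager moves the IP address of the original service adapter to the standby adapter and moves the standby address to the failed adapter, while the other nodes on the network refresh their ARP caches. The swap completes within a few seconds (3 seconds on Ethernet) and is transparent to applications and clients: they see a delay, but connections are not dropped.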
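
The address exchange can be modelled as a swap between two entries in an interface table. The sketch below only illustrates the bookkeeping: the interface names en0 and en1 and the function name are hypothetical, and the real work is done by the HACMP event script, not by code like this.

```python
def swap_adapter(interfaces, failed_if):
    """Return a new interface table with service and standby settings exchanged.

    `interfaces` maps an interface name to {"role": ..., "ip": ...}.
    The input table is left unmodified.
    """
    if interfaces[failed_if]["role"] != "service":
        raise ValueError("this sketch only handles a failed service adapter")
    standby_if = next(
        (name for name, cfg in interfaces.items() if cfg["role"] == "standby"),
        None,
    )
    if standby_if is None:
        raise RuntimeError("no standby adapter available")

    swapped = {name: dict(cfg) for name, cfg in interfaces.items()}
    # The service address moves to the standby adapter, and the standby
    # address moves to the failed adapter.
    swapped[failed_if] = dict(interfaces[standby_if])
    swapped[standby_if] = dict(interfaces[failed_if])
    return swapped


if __name__ == "__main__":
    before = {
        "en0": {"role": "service", "ip": "202.168.0.1"},   # hypothetical names
        "en1": {"role": "standby", "ip": "172.17.0.1"},
    }
    print(swap_adapter(before, "en0"))
```

Because the service address itself does not change, clients keep using the same IP address; only the ARP mapping to a hardware address has to be refreshed.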


2. Network failure


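If all K-A packets sent to both the service and standby adapters on node1 are lost, but K-A traffic on the non-TCP/IP network continues, HACMP concludes that node1 is still healthy and that the network has failed. HACMP then runs the corresponding event.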


3. Node failure


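If K-A packets are lost not only on the TCP/IP network but also on the non-TCP/IP network, HACMP concludes that the node has failed and generates a node-down event. Resource takeover follows: the resources on the shared disk array are taken over by the backup node. Takeover consists of a series of operations: Acquire disks, Varyon VG, Mount file systems, Export NFS file systems, Assume IP network Address, Restart highly available applications. Of these, IP address takeover and application restart are performed by HACMP; the rest are performed by AIX.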
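
The three cases differ only in which K-A paths have gone silent, so the decision can be written as a small classification function. This is a sketch of the reasoning in the three sections above, not of HACMP internals. The names are invented for the example, and a loss of the serial link alone, which the article does not discuss, is reported here as "no fault".

```python
from enum import Enum


class Fault(Enum):
    NONE = "none"
    ADAPTER = "adapter"
    NETWORK = "network"
    NODE = "node"


# Takeover operations in the order the article lists them.
TAKEOVER_STEPS = [
    "Acquire disks",
    "Varyon VG",
    "Mount file systems",
    "Export NFS file systems",
    "Assume IP network Address",
    "Restart highly available applications",
]


def classify(service_ok, standby_ok, serial_ok):
    """Classify a peer's state from which keep-alive paths still deliver packets."""
    if service_ok and standby_ok:
        return Fault.NONE       # serial-only loss is outside the three cases
    if service_ok or standby_ok:
        return Fault.ADAPTER    # one adapter silent, the other still answers
    if serial_ok:
        return Fault.NETWORK    # both adapters silent, heartbeat line alive
    return Fault.NODE           # nothing answers: treat the node as down


if __name__ == "__main__":
    fault = classify(service_ok=False, standby_ok=False, serial_ok=False)
    print(fault.value)
    if fault is Fault.NODE:
        for step in TAKEOVER_STEPS:
            print(" -", step)
```

The non-TCP/IP link is what separates the last two cases: without it, a failed network and a failed node would look identical to the surviving node.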


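When an entire node fails, HACMP moves the failed node's service IP address to the backup node so that clients on the network keep using the same address. This process is called IP address takeover (IPAT). When a node goes down, clients reconnect automatically to the takeover node if IP address takeover is configured; likewise, if application takeover is configured, the application restarts automatically on the takeover node, so the system continues to provide service. To make an application eligible for takeover, define it in HACMP as an application server and give HACMP the full path name of the start script that starts it and of the stop script that stops it. Configuring application takeover in HACMP is therefore simple. What matters is writing the start and stop scripts, which requires the user to understand their own application.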
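
An application server definition boils down to three values: a name, a start script, and a stop script, with both scripts given as full path names. The sketch below models that record and checks the full-path requirement. The class, the server name, and the script paths are hypothetical and exist only for illustration.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicationServer:
    """The three values an application server definition needs."""
    name: str
    start_script: str
    stop_script: str

    def problems(self):
        """Return a list of human-readable problems; empty if the record looks sane."""
        issues = []
        for label, path in (("start", self.start_script), ("stop", self.stop_script)):
            # AIX paths are POSIX paths, so posixpath gives the same answer
            # regardless of the machine running this check.
            if not posixpath.isabs(path):
                issues.append(f"{label} script must be a full path name: {path}")
        return issues


if __name__ == "__main__":
    # Hypothetical server name and script locations.
    server = ApplicationServer("db_server", "/usr/local/ha/start_db.sh", "stop_db.sh")
    for issue in server.problems():
        print(issue)
```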


4. Other failures


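HACMP checks only whether adapters, networks, and nodes have failed, and responds with the appropriate swap or takeover. For any other failure, HACMP takes no action by default.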


a. Disk failure


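Disks are usually configured as RAID-5 or mirrored to make storage highly available. RAID-5 distributes parity across the disks in a group, so when one disk in the group fails, the remaining disks can reconstruct its data from the parity. RAID-5 is usually implemented in hardware, for example by the SSA adapter of the 7133. If two disks in the same group fail, all data in that group is very likely lost. Mirroring writes the same data to at least two physical locations. It is therefore less efficient than RAID-5 and uses more disks, but it is safer than RAID-5 and easy to set up with the AIX LVM (Logical Volume Manager).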
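
Parity-based recovery is easiest to see on a single stripe. In the sketch below the parity block is the bytewise XOR of the data blocks, so XOR-ing the surviving blocks with the parity yields the missing one. It is a toy model under simplifying assumptions: a real RAID-5 array rotates parity across disks and works on fixed-size stripes in hardware, and the function names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"AIX1", b"HACM", b"P452"]       # three data blocks of one stripe
    parity = xor_blocks(data)                # stored on a fourth disk
    # The disk holding data[1] fails; rebuild it from the rest.
    print(rebuild_missing([data[0], data[2]], parity))
```

With two blocks missing there is one equation and two unknowns, which is why losing a second disk in the same group loses the data.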


b. Disk adapter


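A storage device must attach to a host through an adapter card: a SCSI Adapter for SCSI devices, an SSA Adapter for SSA devices. If that card fails, the attached devices become unusable. There are several ways to address this.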


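One approach is to use multiple adapters. Each host has two or more adapters, each attached to one copy of the mirrored data, so whether a disk or an adapter fails, the host can still reach a good copy of the data and there is no single point of failure. This is not hard to implement, but it requires multiple adapters and mirrored data. It does not need HACMP.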


Another approach uses a single adapter and relies on the HACMP Error Notification Facility.


The Error Notification Facility is HACMP's tool for monitoring other devices: any error reported to AIX can be caught and acted on. HACMP provides a smit interface that simplifies the configuration.


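As described above, LVM can mirror disks: when one disk fails, a copy of the data remains on the mirror and can still be read and written. At that point, however, the data is no longer redundant, and if the mirror disk also fails, all data is lost. In this case the missing-PV message (LVM_PVMISS) is therefore displayed prominently on the console, prompting the user to check the error log, find the fault, and repair it. Here too, HACMP provides the interface and works with AIX facilities to monitor for failures.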


c. Application failure


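If the user's application makes kernel calls, or is started as root, an application failure can easily bring down the operating system and crash the machine. This is effectively a node failure, and HACMP responds with takeover. If only the application itself dies while AIX keeps running, the most HACMP does is use the Error Notification Facility for monitoring; it takes no action on the application itself. However, if the application uses the API provided by the AIX SRC (System Resource Controller), it can be restarted automatically after it goes down. Besides the SRC API, clinfo in HACMP also provides an API of this kind.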


clinfo is the Cluster Information daemon. It maintains state information for the whole cluster, and the clinfo API lets applications act on that state.


d. HACMP failure


If the HACMP processes on a cluster node go down, HACMP escalates this to a node failure, and resource takeover occurs.


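To summarize, HACMP takes full responsibility only for diagnosing three classes of failure (adapter, network, and node) and for carrying out IP address swap or takeover and the takeover of system resources as a whole (hardware, file systems, applications, and so on). For failures outside these three classes, AIX base functions can be combined with mechanisms that HACMP provides, such as the Error Notification Facility and the clinfo API, to monitor the failure and act on it.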


-------------------------Appendix: HACMP configuration menus-------------------------




Configuring the network topology


smit hacmp:
The screen shows the HACMP menu:
HACMP
Cluster Configuration
Cluster Services
Cluster System Management
Cluster Recovery Aids
RAS Support
---------- end of screen ----------
Select Cluster Configuration:
Cluster Configuration
Cluster Topology
Cluster Security
Cluster Resources
Cluster Snapshots
Cluster Verification
Cluster Custom Modification
Restore System Default Configuration from Active Configuration
Advanced Performance Tuning Parameters
---------- end of screen ----------
Select Cluster Topology
Cluster Topology
Configure Cluster
Configure Nodes
Configure Networks
Configure Adapters
Configure Sites
Configure Global Networks
Configure Network Modules
Configure Topology Services and Group Services
Show Cluster Topology
Synchronize Cluster Topology
---------- end of screen ----------
Select Configure Cluster
Configure Cluster
Add a Cluster Definition
Change / Show Cluster Definition
Remove Cluster Definition
---------- end of screen ----------
Select Add a Cluster Definition and fill in the fields:
Add a Cluster Definition
[Entry Fields]
**NOTE: Cluster Manager MUST BE RESTARTED
in order for changes to be acknowledged.**

* Cluster ID [188] (enter a value) #
* Cluster Name [test] (enter a value)
---------- end of screen ----------
After the definition is added, return to Cluster Topology
Cluster Topology
Configure Cluster
Configure Nodes
Configure Networks
Configure Adapters
Configure Sites
Configure Global Networks
Configure Network Modules
Configure Topology Services and Group Services
Show Cluster Topology
Synchronize Cluster Topology
---------- end of screen ----------
Select Configure Nodes
Configure Nodes
Add Cluster Nodes
Change / Show Cluster Node Name
Remove a Cluster Node
---------- end of screen ----------
Select Add Cluster Nodes and fill in the fields (add two nodes, m851 and m852):
Add Cluster Nodes
[Entry Fields]
Node Names [m851] (enter the node name)
---------- end of screen ----------
Add Cluster Nodes
[Entry Fields]
Node Names [m852] (enter the node name)
---------- end of screen ----------
After the nodes are added, return to Cluster Topology
Cluster Topology
Configure Cluster
Configure Nodes
Configure Networks
Configure Adapters
Configure Sites
Configure Global Networks
Configure Network Modules
Configure Topology Services and Group Services
Show Cluster Topology
Synchronize Cluster Topology
---------- end of screen ----------
Select Configure Adapters
Configure Adapters
Adapters on IP-based network
Adapters on Non IP-based network
---------- end of screen ----------
Select Adapters on IP-based network
Adapters on IP-based network
Discover Current Network Configuration
Add an Adapter
Change / Show an Adapter
Remove an Adapter
---------- end of screen ----------
Select Add an Adapter and configure m851_boot
Add an IP-based Adapter

[Entry Fields]
Adapter IP Label m851_boot
New Adapter IP Label [] +
* Network Type [ether] +
* Network Name [test_eth] +
* Network Attribute [public] +
* Adapter Function [boot] +
Adapter IP address [202.168.0.11]
Adapter Hardware Address []
Node Name [m851] +
Netmask [255.255.255.0] +

---------- end of screen ----------
Configure m851_stb
Add an IP-based Adapter

[Entry Fields]
Adapter IP Label m851_stb
New Adapter IP Label [] +
* Network Type [ether] +
* Network Name [test_eth] +
* Network Attribute [public] +
* Adapter Function [standby] +
Adapter IP address [172.17.0.1]
Adapter Hardware Address []
Node Name [m851] +
Netmask [255.255.255.0] +
---------- end of screen ----------
Configure m851_svc
Add an IP-based Adapter

[Entry Fields]
Adapter IP Label m851_svc
New Adapter IP Label [] +
* Network Type [ether] +
* Network Name [test_eth] +
* Network Attribute [public] +
* Adapter Function [service] +
Adapter IP address [202.168.0.1]
Adapter Hardware Address [0x0002556affff]
Node Name [m851] +
Netmask [255.255.255.0] +
---------- end of screen ----------
Configure m852_boot
Add an IP-based Adapter

[Entry Fields]
Adapter IP Label m852_boot
New Adapter IP Label [] +
* Network Type [ether] +
* Network Name [test_eth] +
* Network Attribute [public] +
* Adapter Function [boot] +
Adapter IP address [202.168.0.12]
Adapter Hardware Address []
Node Name [m852] +
Netmask [255.255.255.0] +
---------- end of screen ----------
Configure m852_stb
Add an IP-based Adapter

[Entry Fields]
Adapter IP Label m852_stb
New Adapter IP Label [] +
* Network Type [ether] +
* Network Name [test_eth] +
* Network Attribute [public] +
* Adapter Function [standby] +
Adapter IP address [172.17.0.2]
Adapter Hardware Address []
Node Name [m852] +
Netmask [255.255.255.0] +
---------- end of screen ----------
Configure m852_svc
Add an IP-based Adapter
[Entry Fields]
Adapter IP Label m852_svc
New Adapter IP Label [] +
* Network Type [ether] +
* Network Name [test_eth] +
* Network Attribute [public] +
* Adapter Function [service] +
Adapter IP address [202.168.0.2]
Adapter Hardware Address [0x0002556ad9ff]
Node Name [m852] +
Netmask [255.255.255.0] +
---------- end of screen ----------
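
The six adapter definitions above follow a pattern: on each node the boot and service addresses share a subnet (202.168.0.0/24) while the standby address sits on a different one (172.17.0.0/24). That layout matches the rule commonly applied to classic HACMP IP address takeover, but the article does not state the rule, so treat it as an assumption here. A short script can check a planned address table against it before anything is typed into smit; the function names are made up for the example.

```python
import ipaddress

NETMASK = "255.255.255.0"

# (IP label, node, adapter function, address) as entered in the screens above.
ADAPTERS = [
    ("m851_boot", "m851", "boot",    "202.168.0.11"),
    ("m851_stb",  "m851", "standby", "172.17.0.1"),
    ("m851_svc",  "m851", "service", "202.168.0.1"),
    ("m852_boot", "m852", "boot",    "202.168.0.12"),
    ("m852_stb",  "m852", "standby", "172.17.0.2"),
    ("m852_svc",  "m852", "service", "202.168.0.2"),
]


def subnet(ip, netmask=NETMASK):
    """Return the network that an address/netmask pair belongs to."""
    return ipaddress.ip_interface(f"{ip}/{netmask}").network


def check_node(node, adapters):
    """Return a list of subnet-layout problems for one node (empty if none)."""
    nets = {function: subnet(ip) for _label, n, function, ip in adapters if n == node}
    errors = []
    # Assumed rule: boot and service share a subnet, standby does not.
    if nets["boot"] != nets["service"]:
        errors.append(f"{node}: boot and service are on different subnets")
    if nets["standby"] == nets["service"]:
        errors.append(f"{node}: standby is on the service subnet")
    return errors


if __name__ == "__main__":
    for node in ("m851", "m852"):
        print(node, check_node(node, ADAPTERS) or "ok")
```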

(4) Add a TTY on each of the two hosts:
smit tty
Select: Add a TTY
Add a TTY
Type or select values in entry fields.
Press Enter AFTER making all desired changes.

[TOP] [Entry Fields]
TTY tty2
TTY type tty
TTY interface rs232
Description Asynchronous Terminal
Status Available
Location 01-S4-00-00
Parent adapter sa3
PORT number [0] +
Enable LOGIN disable +
BAUD rate [9600] +
PARITY [none] +
BITS per character [8] +
Number of STOP BITS [1] +
[MORE...35]

Configure a TTY on the second host in the same way.

(5) Configure the heartbeat TTY
Cluster Topology
Configure Cluster
Configure Nodes
Configure Networks
Configure Adapters
Configure Sites
Configure Global Networks
Configure Network Modules
Configure Topology Services and Group Services
Show Cluster Topology
Synchronize Cluster Topology
---------- end of screen ----------
Select Configure Adapters
Configure Adapters
Adapters on IP-based network
Adapters on Non IP-based network
---------- end of screen ----------
Select Adapters on Non IP-based network
Adapters on Non IP-based network
Add an Adapter
Change / Show an Adapter
Remove an Adapter
---------- end of screen ----------
Select Add an Adapter and fill in the fields
Add an Adapter

[Entry Fields]
Adapter Label m851_tty
New Adapter Label []
Network Type [rs232] +
* Network Name [test_tty] +
* Device Name [/dev/tty2]
* Node Name [m851] +
---------- end of screen ----------

Add an Adapter

[Entry Fields]
Adapter Label m852_tty
New Adapter Label []
Network Type [rs232] +
* Network Name [test_tty] +
* Device Name [/dev/tty3]
* Node Name [m852] +
---------- end of screen ----------



Configuring resource groups:


smit hacmp:
The screen shows the HACMP menu:
HACMP
Cluster Configuration
Cluster Services
Cluster System Management
Cluster Recovery Aids
RAS Support
---------- end of screen ----------
Select Cluster Configuration:
Cluster Configuration
Cluster Topology
Cluster Security
Cluster Resources
Cluster Snapshots
Cluster Verification
Cluster Custom Modification
Restore System Default Configuration from Active Configuration
Advanced Performance Tuning Parameters
---------- end of screen ----------
Select Cluster Resources
Cluster Resources

Define Resource Groups
Define Application Servers
Configure Application Monitoring
Define Tape Resources
Define Highly Available Communication Links
Discover Current Volume Group Configuration
Configure Dynamic Node Priority Policies
Change/Show Resources/Attributes for a Resource Group
Cluster Events
Change/Show Run Time Parameters
Change/Show Cluster Lock Manager Resource Allocation
Show Cluster Resources
Synchronize Cluster Resources
---------- end of screen ----------

Source: ITPUB blog, http://blog.itpub.net/15149581/viewspace-627154/. If you repost this article, please credit the source; otherwise legal action may be taken.
