DB2 HADR + TSA operations: TSA commands for adding resource groups

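Tivoli System Automation (TSA) is high-availability cluster management software. A DB2 TSA + HADR setup can automatically detect a failure and switch over between the HADR primary and standby. This article covers the common TSA commands, how to add CDC or DSG instances to a TSA cluster, and how to analyze TSA errors.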

Common commands:
lsrpdomain/lsrpnode - query domain and node information:

[db2inst1@p0-pbd-pbd-db2 ~]$ lsrpdomain
Name        OpState RSCTActiveVersion MixedVersions TSPort GSPort
hadr_domain Online  3.2.4.4           No            12347  12348
[db2inst1@p0-pbd-pbd-db2 ~]$ lsrpnode
Name           OpState RSCTVersion
p0-pbd-pbd-db2 Online  3.2.4.4
p0-pbd-pbd-db1 Online  3.2.4.4
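
When many nodes are involved, it can help to filter this output for nodes that are not Online. The following is a minimal sketch, not part of the original procedure: the function name offline_nodes and the sample data (one node changed to Offline for illustration) are assumptions, and the parsing relies on the three-column layout shown above.

```shell
#!/bin/sh
# Print the names of peer-domain nodes whose OpState is not "Online".
# Reads `lsrpnode` output on stdin; the first line is the column header.
offline_nodes() {
    awk 'NR > 1 && NF >= 2 && $2 != "Online" { print $1 }'
}

# Sample input modeled on the output above, with one node set to Offline
# purely for illustration (hypothetical data).
sample_lsrpnode() {
    cat <<'EOF'
Name           OpState RSCTVersion
p0-pbd-pbd-db2 Online  3.2.4.4
p0-pbd-pbd-db1 Offline 3.2.4.4
EOF
}

sample_lsrpnode | offline_nodes
```

On a real cluster the equivalent call would be lsrpnode | offline_nodes; empty output means every node is Online.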

lssam - query resource status:
[db2inst1@p0-pbd-pbd-db2 ~]$ lssam
Online IBM.ResourceGroup:cdc_I2KFK38-rg Nominal=Online
        '- Online IBM.Application:cdc-I2KFK38-rs
                |- Offline IBM.Application:cdc-I2KFK38-rs:p0-pbd-pbd-db1
                '- Online IBM.Application:cdc-I2KFK38-rs:p0-pbd-pbd-db2
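
To find out on which node a given resource is currently running, the lssam tree can be parsed. The sketch below is an illustration under assumptions: the function name online_node is made up here, the input must be plain text without color escape sequences (lssam is documented to accept -nocolor, but verify this on your release), and the line layout is taken from the output above.

```shell
#!/bin/sh
# Print the node(s) on which the IBM.Application resource $1 is Online.
# Reads plain-text `lssam` output on stdin.
online_node() {
    awk -v res="IBM.Application:$1:" '
        {
            # A per-node line looks like:
            #   <tree-prefix> <state> IBM.Application:<resource>:<node>
            for (i = 1; i < NF; i++) {
                if ($i == "Online" && index($(i + 1), res) == 1) {
                    print substr($(i + 1), length(res) + 1)
                }
            }
        }'
}

# Sample input copied from the lssam output above.
sample_lssam() {
    cat <<'EOF'
Online IBM.ResourceGroup:cdc_I2KFK38-rg Nominal=Online
        '- Online IBM.Application:cdc-I2KFK38-rs
                |- Offline IBM.Application:cdc-I2KFK38-rs:p0-pbd-pbd-db1
                '- Online IBM.Application:cdc-I2KFK38-rs:p0-pbd-pbd-db2
EOF
}

sample_lssam | online_node cdc-I2KFK38-rs
```

The trailing colon in the match prefix keeps the aggregate resource line (which has no node suffix) from being reported.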


lsrg -Ab -V -g - query the status and attributes of a resource group:

[db2inst1@p0-pbd-pbd-db2 ~]$ lsrg -Ab -V -g cdc_I2KFK38-rg
Starting to list resource group information.
lsrg: Executed on Thu Aug 31 09:50:58 2023 at "p0-pbd-pbd-db2", master node "p0-pbd-pbd-db2".

Displaying Resource Group information:
All Attributes
For Resource Group "cdc_I2KFK38-rg".


Resource Group 1:
        Name                             = cdc_I2KFK38-rg
        MemberLocation                   = Collocated
        Priority                         = 0
        AllowedNode                      = ALL
        NominalState                     = Online
        ExcludedList                     = {}
        Subscription                     = {}
        Owner                            =
        Description                      =
        InfoLink                         =
        Requests                         = {}
        Force                            = 0
        ActivePeerDomain                 = hadr_domain
        OpState                          = Online
        TopGroup                         = cdc_I2KFK38-rg
        MoveStatus                       = [None]
        ConfigValidity                   =
        LockState                        = 0
        AutomationDetails[CompoundState] = Satisfactory
                          [DesiredState] = Online
                         [ObservedState] = Online
                          [BindingState] = Bound
                       [AutomationState] = Idle
                          [ControlState] = Startable
                           [HealthState] = Not Applicable
Completed listing resource group information.
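
Scripts often need just one attribute from this listing, for example LockState or OpState. The following sketch extracts a single value from the "Name = Value" lines; the function name rg_attr is an assumption for illustration, and the sample is an abbreviated copy of the output above.

```shell
#!/bin/sh
# Print the value of attribute $1 from `lsrg -Ab -V -g <group>` output on stdin.
rg_attr() {
    awk -v key="$1" '
        {
            pos = index($0, "=")
            if (pos == 0) next                      # not an attribute line
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)          # trim the attribute name
            gsub(/^[ \t]+|[ \t]+$/, "", v)          # trim the value
            if (k == key) { print v; exit }
        }'
}

# Abbreviated sample taken from the listing above.
sample_lsrg_attrs() {
    cat <<'EOF'
Resource Group 1:
        Name                             = cdc_I2KFK38-rg
        NominalState                     = Online
        OpState                          = Online
        LockState                        = 0
        AutomationDetails[CompoundState] = Satisfactory
EOF
}

sample_lsrg_attrs | rg_attr OpState
sample_lsrg_attrs | rg_attr LockState
```

On a live system this would be used as lsrg -Ab -V -g cdc_I2KFK38-rg | rg_attr LockState.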


chrg -o online(offline) - start or stop a resource group and also change its Nominal State
rgreq -o start(stop) - start or stop a resource group without changing its Nominal State
rgreq -o lock(unlock) - lock or unlock a resource group.

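Locking a resource group stops TSA from automatically starting or stopping it in response to the resource groups it depends on. You can wait until the depended-on resource group has switched over and is confirmed Online, and only then unlock, so that the resource group comes up cleanly. For example, suppose many CDC instances and DSG replication instances run on a DB2 HADR pair, their processes have been added to TSA resource groups, and those groups depend on the HADR primary. Wherever the primary is, the CDC and DSG processes run on that same machine, which ensures they can read the latest logs.
During a high-availability switchover drill, lock the CDC and DSG resource groups before shutting down the HADR standby machine. If they are not locked and the log gap between the former standby and the primary is large, then once the new primary comes up, CDC and DSG cannot find the latest log and fail with an error.
The following one-liner locks every CDC and DSG resource group: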
lsrg |egrep -i "dsg|cdc" | grep -v db2inst1|awk '{print "rgreq -o lock " $1}' | sh
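
Because the one-liner above pipes straight into sh, it can be useful to preview the generated commands first, and to reuse the same filter for unlocking after the drill. The sketch below only prints commands and never runs them. The function name gen_rgreq, the "Resource Group names:" header line, and the dsg_demo-rg group are assumptions made for illustration.

```shell
#!/bin/sh
# Print "rgreq -o <action> <group>" for every CDC/DSG resource group name
# read on stdin. Nothing is executed; pipe the result to sh once reviewed.
gen_rgreq() {
    action=$1
    case "$action" in
        lock|unlock) ;;
        *) echo "usage: gen_rgreq lock|unlock" >&2; return 2 ;;
    esac
    # Same filter as the one-liner above: keep dsg/cdc groups, skip the DB2
    # groups whose names contain db2inst1.
    grep -Ei 'dsg|cdc' | grep -v 'db2inst1' |
        awk -v a="$action" 'NF { print "rgreq -o " a " " $1 }'
}

# Sample lsrg-style listing (header line and dsg_demo-rg are hypothetical).
sample_lsrg() {
    cat <<'EOF'
Resource Group names:
cdc_I2KFK38-rg
dsg_demo-rg
db2_db2inst1_db2inst1_TKYLCDC-rg
db2_db2inst1_p0-pbd-pbd-db2_0-rg
EOF
}

sample_lsrg | gen_rgreq lock
```

A typical sequence would be lsrg | gen_rgreq lock | sh before the drill and lsrg | gen_rgreq unlock | sh once the new primary is confirmed Online. The db2inst1 exclusion matters here because the HADR group name db2_db2inst1_db2inst1_TKYLCDC-rg would otherwise match "cdc".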

lsrsrc IBM.Application - list the attributes of all resources, including the CDC/Db2 monitor scripts and their timeouts.
resetrsrc -s 'Name =="db2_db2inst1_0-rs"' IBM.Application - reset the state of a resource.

Sample lsrsrc IBM.Application output:
resource 57:
        Name                  = "db2_db2inst1_p0-pbd-pbd-db2_0-rs"
        ResourceType          = 0
        AggregateResource     = "0x2028 0xffff 0xe38eb1e1 0xa0a9fe1d 0x96244eb2 0x54fb9408"
        StartCommand          = "/usr/sbin/rsct/sapolicies/db2/db2V105_start.ksh db2inst1 0"
        StopCommand           = "/usr/sbin/rsct/sapolicies/db2/db2V105_stop.ksh db2inst1 0"
        MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/db2V105_monitor.ksh db2inst1 0"

resource 58:
        Name                  = "db2_db2inst1_p0-pbd-pbd-db2_0-rs"
        ResourceType          = 1
        AggregateResource     = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
        StartCommand          = "/usr/sbin/rsct/sapolicies/db2/db2V105_start.ksh db2inst1 0"
        StopCommand           = "/usr/sbin/rsct/sapolicies/db2/db2V105_stop.ksh db2inst1 0"
        MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/db2V105_monitor.ksh db2inst1 0"

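The following cdc_tsa.sh script adds a CDC instance to a TSA resource group. In the cdc_I2KFK38-rg resource group shown earlier, for example, I2KFK38 is the CDC instance name.
vi cdc_tsa.sh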

OsUser=cdcuser
# Shell variable names are case-sensitive: this must be InstName to match the ${InstName} references below
InstName=test
ResourceName=cdc_${InstName}-rs
ResourceGroupName=cdc_${InstName}-rg
dependondb2ResourceName='IBM.ResourceGroup:db2_db2inst1_db2inst1_TKYLCDC-rg'

# Define the CDC instance as a floating IBM.Application resource (ResourceType=1) on both nodes
mkrsrc IBM.Application Name="${ResourceName}" ResourceType=1 \
    StartCommand="/usr/sbin/rsct/sapolicies/cdc/${InstName}_start.sh" \
    StopCommand="/usr/sbin/rsct/sapolicies/cdc/${InstName}_stop.sh" \
    MonitorCommand="/usr/sbin/rsct/sapolicies/cdc/${InstName}_jiankong.sh" \
    MonitorCommandPeriod=10 MonitorCommandTimeout=120 \
    StartCommandTimeout=900 StopCommandTimeout=900 \
    UserName="${OsUser}" RunCommandsSync=1 ProtectionMode=0 \
    NodeNameList='{"p0-pbd-pbd-db2","p0-pbd-pbd-db1"}'

mkrg ${ResourceGroupName}

# Lock the resource group
rgreq -o lock ${ResourceGroupName}

# Take the resource group offline
chrg -o Offline ${ResourceGroupName}

# Add the resource to the resource group
addrgmbr -g ${ResourceGroupName} IBM.Application:${ResourceName}

# Create the DependsOn relationship from the resource to the DB2 resource group
mkrel -p DependsOn -S IBM.Application:${ResourceName} -G ${dependondb2ResourceName} ${ResourceName}_DependsOn_db2-rel

# Bring the resource group online
chrg -o Online ${ResourceGroupName}

# Unlock the resource group
rgreq -o unlock ${ResourceGroupName}
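
The script above references a monitor script (${InstName}_jiankong.sh) that the article does not show. TSA interprets the exit code of an IBM.Application MonitorCommand as the resource's operational state, with 1 meaning Online and 2 meaning Offline. The sketch below shows only that mapping; the function name monitor_rc is an assumption, and the actual health check for a CDC instance (shown as a commented, hypothetical pgrep call) depends on your installation.

```shell
#!/bin/sh
# Run the health check given as arguments and translate its result into the
# exit codes TSA expects from a MonitorCommand:
#   1 = Online, 2 = Offline (0 would be reported as Unknown).
monitor_rc() {
    if "$@" >/dev/null 2>&1; then
        return 1    # check succeeded: resource is Online
    else
        return 2    # check failed: resource is Offline
    fi
}

# Demonstration with stand-in checks (true = healthy, false = not running).
rc=0; monitor_rc true || rc=$?
echo "healthy check   -> exit code $rc"
rc=0; monitor_rc false || rc=$?
echo "unhealthy check -> exit code $rc"

# In a real monitor script the last lines might look like this
# (hypothetical process pattern, adjust to your CDC installation):
#   monitor_rc pgrep -u cdcuser -f "<cdc process pattern>"
#   exit $?
```

Keeping the check itself as an argument means the same wrapper can serve both CDC and DSG monitor scripts, and the check should finish well within the MonitorCommandTimeout configured in mkrsrc.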


TSA problem diagnosis:
Diagnostic logs:
1) /var/log/messages
2) /var/ct/hadr_domain/log/mc, which contains the following directories:
drwxr-x--- 2 root root   6 Jul 23 14:42 IBM.ConfigRM
drwxr-xr-x 2 root root 4096 Jul 23 14:42 IBM.GblResRM 
drwxr-xr-x 2 root root 4096 Jul 23 14:42 IBM.RecoveryRM
drwxr-xr-x 2 root root 4096 Jul 23 14:42 IBM.StorageRM
drwxr-xr-x 2 root root  78 Jul 23 14:42 IBM.TestRM

As shown above, each resource manager daemon has its own directory. For TSA, focus on GblResRM and RecoveryRM.
1) IBM.GblResRM – The “eyes and hands” of the cluster.
Responsible for start, stop, monitor and cleanup of IBM.Application resources. In the context of DB2, it is responsible for managing all DB2 defined entities.

Basically passive. It invokes monitor commands for resources based on defined intervals and services IBM.RecoveryRM requests.

2) IBM.RecoveryRM – The “brain” of the cluster.
Inputs are RMC events from other resource managers (IBM.GblResRM for IBM.Application resources, IBM.ConfigRM for hosts and network adapters, etc.), commands from users, and the resource model.

Output is commands issued to other resource managers to start/stop/cleanup resources.

Structured as a rule engine that determines how to respond to incoming events, and an optimizer component (called a “binder”) to determine resource placement if resources need to move between hosts.

Use rpttr -o dtic trace.29.sp to format the trace files of each Resource Manager.
