Db2 HADR: the replay-only window and DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW

Environment

  • Db2 v11.5.0.0
  • Red Hat Enterprise Linux release 9.0 (Plow)

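Concepts

In a Db2 HADR system, if ROS (reads on standby) is enabled, queries can run on the standby, which takes some load off the primary.

However, certain operations on the primary, such as CREATE TABLE or REORG, cause the standby to enter a "replay-only window". Specifically, all existing connections to the standby are terminated and new connection attempts are blocked. Connections are allowed again only after the replay completes.

This behavior (or rather, limitation) clearly makes for a poor user experience. Starting with Db2 V11.5, a registry variable named DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW is introduced. Its default value is ON, which avoids the replay-only window as far as possible. Specifically, operations on the primary such as CREATE TABLE and REORG no longer disconnect connections on the standby. However, if there is a lock conflict (for example, a connection on the standby holds a lock on a table while the primary runs ALTER TABLE against that table), that connection is still disconnected.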

Tests

Prepare a Db2 HADR environment with a database named HADR.

Test 1

Test point: with DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW turned off, running DDL on the primary forces off the connections on the standby.

Because the test environment is Db2 V11.5, DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW is on by default. To test the replay-only window, turn it off on the standby:

db2set DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW=OFF

Then verify the setting:

[db2inst1@limb1 ~]$ db2set -all
[i] DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW=OFF
[i] DB2_STANDBY_ISO=UR
[i] DB2_HADR_ROS=ON
[i] DB2_ATS_ENABLE=YES
[i] DB2COMM=TCPIP
[g] DB2SYSTEM=limb1.fyre.ibm.com
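To check just this one variable from a script, the listing above can be filtered. The following is a minimal sketch, not part of the original test. get_registry_value is a hypothetical helper name (it is not a Db2 command), and it assumes the "[scope] NAME=VALUE" line layout shown in the output above.

```shell
# Print the value of one registry variable from `db2set -all` output on stdin.
# get_registry_value is a hypothetical helper name, not a Db2 command.
get_registry_value() {
    # Lines look like "[i] NAME=VALUE": strip the "[x] NAME=" prefix and
    # print what remains. Only an exact NAME match is printed; if the
    # variable is set at several scopes, one value is printed per scope.
    sed -n "s/^\[[a-z]] $1=//p"
}

# Usage on the standby (assumes a Db2 instance environment):
#   db2set -all | get_registry_value DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW
```

Matching on "NAME=" rather than a bare prefix keeps DB2_HADR_ROS from also matching DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW.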

Note: after changing DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW, the standby database must be deactivated and activated again:

[db2inst1@limb1 ~]$ db2 deactivate db hadr
DB20000I  The DEACTIVATE DATABASE command completed successfully.

Note: if the database still has connections, the deactivate operation fails. Disconnect all connections first, then deactivate.

[db2inst1@limb1 ~]$ db2 activate db hadr
DB20000I  The ACTIVATE DATABASE command completed successfully.
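The restart sequence above can be wrapped in a small function. This is a sketch under assumptions rather than the procedure used in the test: restart_standby and the DB2_CMD override are hypothetical names introduced here, and the FORCE APPLICATION ALL step is one possible way to clear connections before the deactivate.

```shell
# Restart a standby database so a changed registry variable takes effect.
# restart_standby and DB2_CMD are hypothetical names for this sketch.
# DB2_CMD defaults to the real `db2` CLP; set DB2_CMD=echo for a dry run.
restart_standby() {
    db=$1
    db2_cmd=${DB2_CMD:-db2}
    # Caution: FORCE APPLICATION ALL disconnects every application in the
    # instance, not only those on this database, and it is asynchronous,
    # so the deactivate may need to be repeated if connections linger.
    "$db2_cmd" force application all || return 1
    "$db2_cmd" deactivate db "$db"   || return 1
    "$db2_cmd" activate db "$db"
}

# Dry run: print the commands instead of executing them.
#   DB2_CMD=echo restart_standby hadr
```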

Next, connect to the database on both the primary and the standby, and confirm that table T1 is accessible. On the primary, for example:

[db2inst1@myrmidon1 ~]$ db2 connect to hadr

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.5.0.0
 SQL authorization ID   = DB2INST1
 Local database alias   = HADR

[db2inst1@myrmidon1 ~]$ db2 "select * from t1"

C1          C2
----------- -----------
          1         111
          2         222

  2 record(s) selected.

The same works on the standby.

Now run a DDL statement on the primary:

[db2inst1@myrmidon1 ~]$ db2 "alter table t1 add column c3 int"
DB20000I  The SQL command completed successfully.

At this point, the connection on the standby is disconnected:

[db2inst1@limb1 ~]$ db2 "select * from t1"
SQL1224N  The database manager is not able to accept new requests, has
terminated all requests in progress, or has terminated the specified request
because of an error or a forced interrupt.  SQLSTATE=55032

The DDL runs very quickly, so the replay-only window is short, and the standby accepts connections again almost immediately.
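Since the window closes quickly, a client script can simply retry the connection a few times. A minimal sketch, assuming the db2 command returns a non-zero exit code when the connect fails. retry is a hypothetical helper name, and the attempt count and delay in the usage line are arbitrary.

```shell
# Run a command up to MAX times, sleeping DELAY seconds between failures.
# retry is a hypothetical helper; it returns 0 as soon as the command
# succeeds and 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Usage on the standby (assumes a Db2 instance environment):
#   retry 5 2 db2 connect to hadr
```

A fixed delay is enough here because the window in this test lasts only as long as the replay of a single DDL statement. A longer REORG or LOAD would call for a larger delay or more attempts.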

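Test 2

Test point: with DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW turned on, running DDL on the primary does not force off the connections on the standby.

On the standby, turn on DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW, verify the setting, and then deactivate and activate the database again: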

[db2inst1@limb1 ~]$ db2set DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW=ON

[db2inst1@limb1 ~]$ db2set -all
[i] DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW=ON
[i] DB2_STANDBY_ISO=UR
[i] DB2_HADR_ROS=ON
[i] DB2_ATS_ENABLE=YES
[i] DB2COMM=TCPIP
[g] DB2SYSTEM=limb1.fyre.ibm.com

[db2inst1@limb1 ~]$ db2 deactivate db hadr
DB20000I  The DEACTIVATE DATABASE command completed successfully.

[db2inst1@limb1 ~]$ db2 activate db hadr
DB20000I  The ACTIVATE DATABASE command completed successfully.

Next, connect to the database on both the primary and the standby, and confirm that table T1 is accessible. On the primary, for example:

[db2inst1@myrmidon1 ~]$ db2 connect to hadr

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.5.0.0
 SQL authorization ID   = DB2INST1
 Local database alias   = HADR

[db2inst1@myrmidon1 ~]$ db2 "select * from t1"

C1          C2          C3
----------- ----------- -----------
          1         111           -
          2         222           -

  2 record(s) selected.

The same works on the standby.

Now run a DDL statement on the primary:

[db2inst1@myrmidon1 ~]$ db2 "alter table t1 add column c4 int"
DB20000I  The SQL command completed successfully.

This time, the connection on the standby is still alive:

[db2inst1@limb1 ~]$ db2 "select * from t1"

C1          C2          C3          C4
----------- ----------- ----------- -----------
          1         111           -           -
          2         222           -           -

  2 record(s) selected.

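As shown, DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW effectively kept the standby from entering the replay-only window.

Test 3

Test point: even with DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW turned on, connections that have a lock conflict are still forced off.

On the standby, query table T1 and keep the lock held (the +c option turns off autocommit, so the transaction stays open):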

[db2inst1@limb1 ~]$ db2 +c "select * from t1 with rr"

C1          C2          C3          C4
----------- ----------- ----------- -----------
          1         111           -           -
          2         222           -           -

  2 record(s) selected.

Then run a DDL statement on the primary:

[db2inst1@myrmidon1 ~]$ db2 "alter table t1 add column c5 int"
DB20000I  The SQL command completed successfully.

At this point, the connection on the standby is disconnected:

[db2inst1@limb1 ~]$ db2 "select * from t1"
SQL1224N  The database manager is not able to accept new requests, has
terminated all requests in progress, or has terminated the specified request
because of an error or a forced interrupt.  SQLSTATE=55032

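This is because the standby hit a lock conflict while replaying the log, so the conflicting connection was disconnected.

DDL statements and operations that cause a replay-only window

Many DDL statements and operations cause the standby to enter a replay-only window or, if DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW is on, force off the connections that have a lock conflict. For example: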

  • CREATE, ALTER, or DROP TABLE
  • CREATE, ALTER, or DROP WORKLOAD
  • CREATE or DROP EVENT MONITOR
  • Reorg
  • Load
  • Runstats

Note: DDL on a WORKLOAD still causes a replay-only window even when DB2_HADR_ROS_AVOID_REPLAY_ONLY_WINDOW is on.

Monitoring the replay-only window

The relevant monitor elements are:

  • STANDBY_REPLAY_ONLY_WINDOW_ACTIVE
  • STANDBY_REPLAY_ONLY_WINDOW_START
  • STANDBY_REPLAY_ONLY_WINDOW_TRAN_COUNT

Method 1: db2pd

This can be run on either the primary or the standby:

[db2inst1@myrmidon1 ~]$ db2pd -hadr -db hadr

Database Member 0 -- Database HADR -- Active -- Up 0 days 01:50:39 -- Date 2023-05-25-02.22.38.394267

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = ASYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = PEER
                           HADR_FLAGS = TCP_PROTOCOL
                  PRIMARY_MEMBER_HOST = myrmidon1.fyre.ibm.com
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = limb1.fyre.ibm.com
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 05/25/2023 02:00:30.703818 (1685005230)
          HEARTBEAT_INTERVAL(seconds) = 30
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 213
                HADR_TIMEOUT(seconds) = 120
        TIME_SINCE_LAST_RECV(seconds) = 8
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000005
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.001
                  LOG_HADR_WAIT_COUNT = 254
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 46080
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
            PRIMARY_LOG_FILE,PAGE,POS = S0040380.LOG, 225, 168584769991
            STANDBY_LOG_FILE,PAGE,POS = S0040380.LOG, 225, 168584769991
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0040380.LOG, 225, 168584769991
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 05/25/2023 02:11:02.000000 (1685005862)
                     STANDBY_LOG_TIME = 05/25/2023 02:11:02.000000 (1685005862)
              STANDBY_REPLAY_LOG_TIME = 05/25/2023 02:11:02.000000 (1685005862)
         STANDBY_RECV_BUF_SIZE(pages) = 4300
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 25600
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 0
             READS_ON_STANDBY_ENABLED = Y
    STANDBY_REPLAY_ONLY_WINDOW_ACTIVE = N

The standby is not currently in a replay-only window, so only STANDBY_REPLAY_ONLY_WINDOW_ACTIVE is shown.
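For scripted checks, a single field can be pulled out of the db2pd listing. A minimal sketch, assuming the "NAME = VALUE" layout shown above. hadr_field is a hypothetical helper name, not part of Db2.

```shell
# Print the value of one field from `db2pd -hadr` output read on stdin.
# hadr_field is a hypothetical helper name, not part of Db2.
hadr_field() {
    awk -F' = ' -v key="$1" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # field names are right-aligned
            if (name == key && !found) {
                value = $2
                sub(/[ \t\r]+$/, "", value)      # drop trailing whitespace
                print value
                found = 1                        # report the first match only
            }
        }'
}

# Usage (assumes a Db2 instance environment):
#   db2pd -hadr -db hadr | hadr_field STANDBY_REPLAY_ONLY_WINDOW_ACTIVE
```

Only the first match is printed, so with more than one standby this sketch reports the first standby listed.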

Method 2: mon_get_hadr

Run this on the primary:

[db2inst1@myrmidon1 ~]$ db2 "select STANDBY_ID, STANDBY_REPLAY_ONLY_WINDOW_ACTIVE, STANDBY_REPLAY_ONLY_WINDOW_START, STANDBY_REPLAY_ONLY_WINDOW_TRAN_COUNT from table (mon_get_hadr(NULL))"

STANDBY_ID STANDBY_REPLAY_ONLY_WINDOW_ACTIVE STANDBY_REPLAY_ONLY_WINDOW_START STANDBY_REPLAY_ONLY_WINDOW_TRAN_COUNT
---------- --------------------------------- -------------------------------- -------------------------------------
         1 N                                 -                                                                    -

  1 record(s) selected.

Likewise, because the standby is not currently in a replay-only window, only STANDBY_REPLAY_ONLY_WINDOW_ACTIVE has a value.
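The same query can feed a script that lists the standbys currently inside a replay-only window. A minimal sketch: standbys_in_window is a hypothetical helper name, and it assumes the query selects exactly two columns (STANDBY_ID, then STANDBY_REPLAY_ONLY_WINDOW_ACTIVE) with the db2 -x option suppressing the column headings.

```shell
# Read rows of "STANDBY_ID  ACTIVE_FLAG" on stdin and print the IDs of
# standbys whose flag is Y. standbys_in_window is a hypothetical helper.
standbys_in_window() {
    awk '$2 == "Y" { print $1 }'
}

# Usage on the primary, after `db2 connect to hadr`:
#   db2 -x "select STANDBY_ID, STANDBY_REPLAY_ONLY_WINDOW_ACTIVE from table (mon_get_hadr(NULL))" | standbys_in_window
```

An empty result means no standby reported an active window at the time of the query.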

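Recommendations

To minimize the impact of replay-only windows caused by DDL and similar operations:

  • Run these operations during a dedicated maintenance window
  • Run multiple operations together in one batch
  • Run REORG and RUNSTATS only on the tables that need them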

References

  • https://www.ibm.com/docs/en/db2/11.5?topic=standby-replay-only-window-hadr-database
