DM (Dameng) Primary/Standby: Recovering a Disconnected Standby Online and Rejoining It to the Cluster

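I. Primary restarts after a failure (before any standby takes over)

If the primary is restarted immediately after a failure, its daemon enters the Startup state and goes through the daemon startup procedure again: it marks the archive of every standby whose data is consistent as valid, marks the archives of the remaining standbys as invalid, and reopens the primary. Once the Open succeeds, the instance continues as the primary. When the daemon later detects that a standby with an invalid archive status is healthy again, it starts the Recovery procedure to resynchronize data between primary and standby.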

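1. Handling a standby failure

When a standby fails (a hardware fault or an internal NIC fault), the primary handles the situation somewhat differently in manual switchover mode and in automatic switchover mode.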

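Manual switchover mode

In manual switchover mode, when a standby failure is detected and the Failover conditions are met, the primary's daemon switches to the Failover state immediately and performs the corresponding failure handling. If the conditions are not met, the daemon keeps its current state.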

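Conditions for the primary's daemon to switch to Failover in manual switchover mode:

1. The standby instance has failed, the network between primary and standby has failed, or the standby found an LSN mismatch while replaying logs. In any of these three cases the primary fails to ship logs to the standby and hangs, leaving the primary instance in the Suspend state.

2. The archive status from the primary to this standby is Valid (this restriction does not apply to read/write-splitting clusters).

3. The primary's daemon is in the Startup, Open or Recovery state.

4. No monitor command is currently being executed.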
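
The four conditions above can be read as a single predicate. The following is a minimal sketch in Python, not DM's actual implementation: the `PrimaryView` class, its field names and the string values for states are all hypothetical and only mirror the wording of the list.

```python
from dataclasses import dataclass

# Hypothetical names throughout; they only mirror the four conditions above.
FAILOVER_DAEMON_STATES = {"STARTUP", "OPEN", "RECOVERY"}


@dataclass
class PrimaryView:
    sync_to_standby_failed: bool     # condition 1: standby down, network fault, or LSN mismatch
    instance_state: str              # condition 1: expected to be "SUSPEND"
    standby_archive_status: str      # condition 2: "VALID" or "INVALID"
    daemon_state: str                # condition 3
    monitor_command_running: bool    # condition 4
    read_write_split: bool = False   # read/write-splitting clusters skip condition 2


def can_failover_manual(p: PrimaryView) -> bool:
    """Return True when all four manual-mode Failover conditions hold."""
    cond1 = p.sync_to_standby_failed and p.instance_state == "SUSPEND"
    cond2 = p.read_write_split or p.standby_archive_status == "VALID"
    cond3 = p.daemon_state in FAILOVER_DAEMON_STATES
    cond4 = not p.monitor_command_running
    return cond1 and cond2 and cond3 and cond4
```

If any one condition fails, the function returns `False`, which corresponds to the daemon keeping its current state.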

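Automatic switchover mode

In automatic switchover mode, the primary's daemon decides on its own whether to switch to the Failover state or to the Confirm state. If the conditions for neither transition are met, it keeps its current state.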

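Conditions for the primary's daemon to switch directly to Failover, without entering the Confirm state, in automatic switchover mode:

1. All four conditions listed above for manual switchover mode hold.

2. The standby instance has failed while the standby's daemon is still healthy.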
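
The choice between Failover, Confirm and staying put can be sketched as a small decision function. This is an illustrative sketch only: the function name, its parameters and the state strings are hypothetical. The four manual-mode conditions are collapsed into one boolean, and the Confirm branch follows the rule given in the next paragraph (condition 1 holds, condition 2 does not).

```python
def next_daemon_state_auto(manual_conditions_met: bool,
                           standby_instance_failed: bool,
                           standby_daemon_ok: bool,
                           current_state: str) -> str:
    """Pick the primary daemon's next state in automatic switchover mode."""
    if not manual_conditions_met:
        # Condition 1 fails: neither transition applies, keep the current state.
        return current_state
    if standby_instance_failed and standby_daemon_ok:
        # Conditions 1 and 2 both hold: go straight to Failover.
        return "FAILOVER"
    # Only condition 1 holds: wait for the confirm monitor.
    return "CONFIRM"
```

For example, a network fault that also cuts off the standby's daemon satisfies condition 1 but not condition 2, so the sketch returns `"CONFIRM"`.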

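If only condition 1 holds and condition 2 does not, the primary's daemon first enters the Confirm state and waits for a confirmation message from the confirm monitor. Once the daemon is in the Confirm state, one of the following cases applies: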

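1. The network connection between the primary and the confirm monitor is normal

The primary's daemon receives the confirmation message returned by the confirm monitor. If the confirm monitor determines that Failover may proceed, the daemon switches to the Failover state and performs the corresponding handling; if the confirm monitor determines that the Failover conditions are not met, the daemon stays in the Confirm state. The confirm monitor allows the primary to perform Failover when: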

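1) The primary's daemon is in the Confirm state

2) The primary instance is healthy and in the Suspend state

3) The primary has not been taken over and no other primary exists

4) No takeover/switchover command is being executed

5) Every standby whose archive is currently valid can join the primary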

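2. The network connection between the primary and the confirm monitor is abnormal, or no confirm monitor has been started. The primary is allowed to switch to the Failover state and perform failure handling once the following conditions are met: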

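1) The primary instance is healthy and in the Suspend state

2) The standby's daemon is healthy

3) The primary has not been taken over and no other primary exists

4) No takeover/switchover command is being executed

5) The standby was able to join the primary before it failed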

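3. By the time the network between the primary and the confirm monitor recovers, the primary has already been taken over. The old primary's daemon switches to the Startup state and re-evaluates whether it can join the new primary.

Once the primary's daemon enters the Failover state, it performs the following steps (the flow is the same in automatic and manual switchover mode):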

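1) For real-time primary/standby or MPP primary/standby, notify the primary to set the archive status of every standby to which archive shipping failed to invalid.

2) Notify the primary to Open again.

3) Switch the primary's daemon to the Open state.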
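
The confirm monitor's check and the three Failover steps above can be sketched together as follows. This is a hedged illustration rather than DM's code: the `Standby` and `Primary` classes, their fields, the `cluster_type` values and both function names are hypothetical. Only the case where the confirm monitor is reachable is modelled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Standby:
    name: str
    archive_status: str = "VALID"
    send_failed: bool = False      # archive shipping to this standby failed


@dataclass
class Primary:
    daemon_state: str = "CONFIRM"
    instance_state: str = "SUSPEND"
    cluster_type: str = "REALTIME"  # hypothetical values: "REALTIME", "MPP", or anything else
    standbys: List[Standby] = field(default_factory=list)


def monitor_allows_failover(primary: Primary, taken_over: bool,
                            switch_cmd_running: bool,
                            valid_standbys_can_join: bool) -> bool:
    """The five conditions the confirm monitor checks before approving Failover."""
    return (primary.daemon_state == "CONFIRM"
            and primary.instance_state == "SUSPEND"
            and not taken_over
            and not switch_cmd_running
            and valid_standbys_can_join)


def run_failover(primary: Primary) -> Primary:
    """Apply the three Failover steps in order."""
    primary.daemon_state = "FAILOVER"
    # Step 1: real-time or MPP only, invalidate archives of standbys that failed to receive.
    if primary.cluster_type in ("REALTIME", "MPP"):
        for standby in primary.standbys:
            if standby.send_failed:
                standby.archive_status = "INVALID"
    # Step 2: reopen the primary instance.
    primary.instance_state = "OPEN"
    # Step 3: the daemon returns to the Open state.
    primary.daemon_state = "OPEN"
    return primary
```

After `run_failover`, the primary serves requests again while the failed standby stays marked invalid, which is what later allows the Recovery procedure to pick it up.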

II. Recovery log

 Clear all ep g_dw_status finished, Recovery finished!
 switch sub_state to sub_stat_start!
 设置GRP1守护进程为OPEN(SUB:STARTUP)状态
 dm_connect_async connection 6 is in progress
 非自动切换模式下20s没有收到远程守护进程消息
 Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DM01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128083643) CLSN(1280836
 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128042324) CLSN(128042680) S
 dm_connect_async connection 6 is timeout
 dm_connect_async connection 6 is in progress
 dm_connect_async connection 6 is timeout
 dw2_send_port_set from dmmonitor vio(6) set, mid(1673602727), from name:dmmonitor, ip:::ffff:192.168.12.125, mon_confirm:FALSE
 dw2_send_port_set to dmwatcher vio(8) set, mid(-1), to name:DM02, ip:192.168.12.126
 ohis_inst_info_copy_low, inst(DM02) apply info changed, old info[p_db_magic:1486960128, n_apply_ep:1], new info to set[p_db_magic:0, n_apply_ep:0
 远程实例的模式、状态或者归档状态发生变化,原状态是:
 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128042324) CLSN(128042680) S
 远程实例的模式、状态或者归档状态发生变化,新状态是:
 dw2_send_port_set from dmmonitor vio(10) set, mid(1673602730), from name:dmmonitor, ip:::ffff:192.168.12.125, mon_confirm:FALSE
 远程实例的模式、状态或者归档状态发生变化,原状态是:
 远程实例的模式、状态或者归档状态发生变化,新状态是:
 Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(UNKNOWN) 实例状态(SHUTDOWN) 归档状态(UNKNOWN) POCNT(0) FLSN(0) CLSN(0) SLSN(0) SSL
 ohis_inst_info_copy_low, inst(DM02) apply info changed, old info[p_db_magic:0, n_apply_ep:0], new info to set[p_db_magic:1486960128, n_apply_ep:1
 远程实例的模式、状态或者归档状态发生变化,原状态是:
 Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(UNKNOWN) 实例状态(SHUTDOWN) 归档状态(UNKNOWN) POCNT(0) FLSN(0) CLSN(0) SLSN(0) SSL
 远程实例的模式、状态或者归档状态发生变化,新状态是:
 Instance: 守护进程状态(UNIFY EP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(MOUNT) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(12804579
 远程实例的模式、状态或者归档状态发生变化,原状态是:
 Instance: 守护进程状态(UNIFY EP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(MOUNT) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(12804579
 远程实例的模式、状态或者归档状态发生变化,新状态是:
 Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(128045793)
 远程实例的模式、状态或者归档状态发生变化,原状态是:
 Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(128045793)
 远程实例的模式、状态或者归档状态发生变化,新状态是:
 Instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DM02) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(8) FLSN(128045793) CLSN(128045793) SL
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 switch sub_state to pre_set_dw_stat!
 设置GRP1守护进程为RECOVERY(SUB:STARTUP)状态
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 dw2_notify_set_dw_stat, dseq = 1671462826, from_dw_stat: NONE, to_dw_stat: DW_RECOVERY
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 设置GRP1守护进程子状态为WAIT_SET_DW_STAT状态
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=217, dseq=1671462826, code=0
 dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
 notify ep(DM01) set dw_stat to DW_RECOVERY success!
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 检测到实例(DM02)可恢复,执行恢复流程
 开始向实例(DM02)发送归档日志
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 dw2_rarch_send to DM02[seqno: 0], dseq = 1671462827
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 设置GRP1守护进程子状态为WAIT_SEND_ARCH状态
 [ohis_check_can_recover, p_iname:DM01, n_p_apply=0, p_apply_db_magic=1486960128, p_apply_seqno_arr=[1162045, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 [ohis_check_can_recover, s_iname:DM02, n_s_apply=1, s_apply_db_magic=1486960128, s_apply_seqno_arr=[1141458, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 dw2_rarch_send to DM02[seqno: 0], dseq = 1671462828
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 dw2_rarch_send to DM02[seqno: 0], dseq = 1671462829
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 dw2_rarch_send to DM02[seqno: 0], dseq = 1671462830
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 dw2_rarch_send to DM02[seqno: 0], dseq = 1671462831
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 dw2_rarch_send to DM02[seqno: 0], dseq = 1671462832
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 检测到实例(DM02)发送归档成功,设置为当前恢复实例
 dw2_notify_sql_exec, dseq = 1671462833, sql: ALTER DATABASE SUSPEND
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 设置GRP1守护进程子状态为WAIT_TO_SUSPEND状态
 向实例(DM02)发送归档日志成功,实例(DM01)转入suspend状态
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=1, dseq=1671462833, code=100
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=1, dseq=1671462833, code=0
 dw2_clear_ep_cmd_info_with_recv_inst_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
 转入suspend状态后,再次发送归档日志
 dw2_rarch_send to DM02[seqno: 0], dseq = 1671462834
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 设置GRP1守护进程子状态为WAIT_SEND_ALL_ARCH状态
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=100
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=210, dseq=1671462834, code=0
 发送归档完毕,设置实例(DM02)归档有效
 dw2_notify_chg_arch_status, dseq = 1671462835, rstat = 0
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 设置GRP1守护进程子状态为WAIT_SET_ARCH状态
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=100, dseq=1671462835, code=100
 实例(DM02)归档状态发生变化:INVALID --> VALID
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=100, dseq=1671462835, code=0
 dw2_clear_ep_cmd_info_with_recv_inst_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
 设置实例(DM02)归档有效成功,通知实例(DM01)OPEN
 dw2_notify_sql_exec, dseq = 1671462836, sql: ALTER DATABASE OPEN FORCE
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 设置GRP1守护进程子状态为WAIT_TO_OPEN状态
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=1, dseq=1671462836, code=0
 dw2_clear_ep_cmd_info_with_recv_inst_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
 dw2_set_recover_info, instance:DM02, recover flag:TRUE, from monitor:FALSE, last_recv_time:1673602836, recover retry time:60
 本地守护进程为RECOVERY状态,本机实例为PRIMARY & OPEN,实例(DM02)故障恢复完成
 将实例(DM02)从恢复列表中删除
 不存在可恢复备库
 dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info.
 设置GRP1守护进程子状态为SUB_STATE_CLEAR状态
 Clear all ep dw_stat value!
 dw2_notify_set_dw_stat, dseq = 1671462837, from_dw_stat: DW_RECOVERY, to_dw_stat: NONE
 Send tcp msg to local ep DM01, hpc_seqno:0, code:0
 设置GRP1守护进程子状态为WAIT_CLEAR状态
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=217, dseq=1671462837, code=100
 dw2_group_get_curr_ep_retcode, ep(DM01) cmd_ret:cmd=217, dseq=1671462837, code=0
 dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info, and reset curr_ep to NULL.
 notify ep(DM01) set dw_stat to NONE success!
 dw2_clear_ep_cmd_info_low, clear ep(DM01) cmd info.
 Clear all ep g_dw_status finished, Recovery finished!
 switch sub_state to sub_stat_start!
 设置GRP1守护进程为OPEN(SUB:STARTUP)状态
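
The log above is long, and the repeated `ohis_check_can_recover` lines make the actual sequence hard to see. As a reading aid, the sketch below pulls out only the daemon sub-state changes. It assumes the log has been saved as text and that sub-state lines keep the exact wording shown above; the function name and the regular expression are illustrative, not part of any DM tool.

```python
import re
from typing import List

# Matches lines such as "设置GRP1守护进程子状态为WAIT_SEND_ARCH状态".
# Group 1 is the group name (e.g. GRP1), group 2 is the sub-state.
SUB_STATE_RE = re.compile(r"设置([A-Za-z0-9_]+)守护进程子状态为([A-Z0-9_]+)状态")


def sub_state_sequence(log_text: str) -> List[str]:
    """Return the daemon sub-states in the order they appear in the log."""
    return [m.group(2) for m in SUB_STATE_RE.finditer(log_text)]
```

Run against the log in this section, this yields WAIT_SET_DW_STAT, WAIT_SEND_ARCH, WAIT_TO_SUSPEND, WAIT_SEND_ALL_ARCH, WAIT_SET_ARCH, WAIT_TO_OPEN, SUB_STATE_CLEAR and WAIT_CLEAR, which is the order of the recovery steps: enter Recovery, ship archives, suspend the primary, ship the remaining archives, mark the standby's archive valid, reopen the primary, then clear the state.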

III. Recovery flow chart

[Figure 1: flow chart of recovering a disconnected standby online and rejoining it to the cluster (image not included)]
