Handling Redo During Failover in a Physical Data Guard Configuration

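I discussed redo handling under Oracle Data Guard with my boss. In a Data Guard environment, archived log files can be applied on the standby. But if the primary crashes outright and nobody can log in to it, the key question when converting the standby into the primary is what to do with the primary's online redo, because the data in it is exactly the data that may be lost.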

 

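So I ran an experiment to verify how that redo can be handled: copy the primary's online redo logs directly to the standby, apply them with recover, and start the standby only after the apply has finished. As long as the primary's redo log files are still readable, no data is lost this way.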

 

Of course, running Data Guard in Maximum Protection mode also prevents data loss, but that mode has a larger impact on the primary. For the protection modes, see this blog post:

 

Oracle Data Guard Theory

http://blog.csdn.net/xujinyang/article/details/6833263

 

1. First, the archive gap case:

 

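When some logs from the Primary Database are not successfully sent to the Standby Database, an archive gap occurs; the missing logs are the gap. Data Guard can detect and resolve archive gaps automatically, without DBA intervention. This requires configuring two parameters, FAL_CLIENT and FAL_SERVER (FAL: Fetch Archive Log). As the name FAL suggests, this is a "fetch" initiated by the Standby Database: the standby is the FAL_CLIENT, and it fetches the gap from the FAL_SERVER. In 10g, the FAL_SERVER can be the Primary Database or another Standby Database, for example: FAL_SERVER='PR1,ST1,ST2';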

 

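Both FAL_CLIENT and FAL_SERVER are Oracle Net names. The FAL_CLIENT sends its request to the FAL_SERVER over the network, and the FAL_SERVER sends the missing logs back to the FAL_CLIENT over the network, but the two transfers do not necessarily use the same connection. So when the FAL_CLIENT sends a request, it includes its FAL_CLIENT parameter value to tell the FAL_SERVER where to send the missing logs. That value is also an Oracle Net name; it is defined on the FAL_SERVER side and points to the FAL_CLIENT.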

 

 

Besides automatic gap resolution, the DBA can also resolve a gap manually. The steps are:

 

1) Check whether there is a log gap:

SQL> SELECT UNIQUE THREAD#, MAX(SEQUENCE#) OVER(PARTITION BY THREAD#) LAST FROM V$ARCHIVED_LOG;

SQL> SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# FROM V$ARCHIVE_GAP;

2) If there is one, copy the missing logs over to the standby

3) Register these logs manually:

SQL> ALTER DATABASE REGISTER LOGFILE 'path';
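When the gap spans several sequences, typing one REGISTER statement per file is error-prone. The following is only a rough sketch of how the statements could be generated from one V$ARCHIVE_GAP row. It assumes the archived logs are named thread_sequence_resetlogsid.dbf under /u01/archive, which is the pattern that appears in the recovery output later in this post; the function name and the sample gap (sequences 17 to 18) are made up for illustration.

```python
def register_statements(thread, low_seq, high_seq, resetlogs_id,
                        archive_dir="/u01/archive"):
    """Build one REGISTER LOGFILE statement per missing sequence.

    low_seq and high_seq are treated as inclusive, matching how
    V$ARCHIVE_GAP reports LOW_SEQUENCE# and HIGH_SEQUENCE#.
    The file name pattern is an assumption; adjust it to your
    LOG_ARCHIVE_FORMAT.
    """
    statements = []
    for seq in range(low_seq, high_seq + 1):
        path = f"{archive_dir}/{thread}_{seq}_{resetlogs_id}.dbf"
        statements.append(f"ALTER DATABASE REGISTER LOGFILE '{path}';")
    return statements


# Hypothetical gap: thread 1, sequences 17 through 18.
for stmt in register_statements(1, 17, 18, 734225750):
    print(stmt)
```

The printed statements can then be pasted into SQL*Plus on the standby once the files have actually been copied there.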

 

2. Redo files

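       Normally a redo log file is archived only after it fills up (or after a log switch), and in this Data Guard environment the standby is kept in sync through those archived files. Suppose an archive has just completed, some transactions are then committed, and the primary crashes at exactly that moment and cannot be logged in to. If we fail over using the archived logs alone, data will certainly be lost.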

 

       Before the failover we can copy the primary's redo over and apply it. The experiment below verifies this. For background on switchover and failover in a Data Guard environment, see my blog post:

       Oracle Data Guard on Linux: Physical Standby Setup Example

       http://blog.csdn.net/xujinyang/article/details/6829555

The last part of that post describes both kinds of role transition.

 

2.1 Operations on the primary:

 

SQL> select max(sequence#) from v$archived_log;

MAX(SEQUENCE#)

--------------

            13

SQL> alter system switch logfile;

System altered.

SQL> select max(sequence#) from v$archived_log;

MAX(SEQUENCE#)

--------------

            14

SQL> create table dave (id number,name varchar(20));

Table created.

SQL> insert into dave values(1,'dave');

1 row created.

SQL> commit;

Commit complete.

SQL> shutdown immediate

Database closed.

Database dismounted.

ORACLE instance shut down.

SQL>

 

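On the primary we first archived the current log, then created a table named dave and inserted one row. Finally we shut the instance down to simulate a primary crash.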

 

SQL> select sequence#,applied from v$archived_log;

 

 SEQUENCE# APP

---------- ---

         8 YES

         9 YES

        10 YES

        11 YES

        12 YES

        13 YES

        14 YES

        15 YES

        16 YES

 

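This output shows that the primary's archived logs up to sequence 16 have already been applied. If one had not been applied, we could register it manually with: SQL> ALTER DATABASE REGISTER LOGFILE 'path';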
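Checking the APPLIED column by eye gets tedious once the list is long. As a rough sketch, assuming the query output has been saved as text in the two-column layout shown above, a small Python helper (the function name is made up for illustration) can list the sequences that are not yet applied:

```python
def unapplied_sequences(sqlplus_output):
    """Return the sequence numbers whose APPLIED value is not YES.

    Expects the text output of
        select sequence#, applied from v$archived_log;
    Header lines, separators, blank lines and the "N rows selected."
    footer are skipped because they are not "<number> <word>" pairs.
    """
    pending = []
    for line in sqlplus_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            if parts[1].upper() != "YES":
                pending.append(int(parts[0]))
    return pending


# Hypothetical sample in which sequence 16 has not been applied yet.
sample = """
 SEQUENCE# APP
---------- ---
        14 YES
        15 YES
        16 NO

3 rows selected.
"""
print(unapplied_sequences(sample))
```

An empty list means every archived log in the output is marked as applied, which is the state we want before recovering from the copied redo files.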

 

 

Copy the primary's redo files to the corresponding directory on the standby:

 

[oracle@dg1 orcl]$ scp redo01.log 192.168.6.3://u01/app/oracle/oradata/orcl/

oracle@192.168.6.3's password:

redo01.log                        100%   50MB 368.4KB/s   02:19   

[oracle@dg1 orcl]$ scp redo02.log 192.168.6.3://u01/app/oracle/oradata/orcl/

oracle@192.168.6.3's password:

redo02.log                       100%   50MB 517.2KB/s   01:39   

[oracle@dg1 orcl]$ scp redo03.log 192.168.6.3://u01/app/oracle/oradata/orcl/

oracle@192.168.6.3's password:

redo03.log                      100%   50MB 445.2KB/s   01:55   

[oracle@dg1 orcl]$

 

2.2 Apply the redo on the standby:

      

SQL> select sequence#,applied from v$archived_log;

 

 SEQUENCE# APP

---------- ---

         8 YES

         9 YES

        10 YES

        11 YES

        12 YES

        13 YES

        14 YES

        15 YES

        16 YES

 

9 rows selected.

 

SQL> alter database recover managed standby database cancel;

 

Database altered.

 

SQL> recover standby database until cancel;

ORA-00279: change 509016 generated at 11/05/2010 11:40:27 needed for thread 1

ORA-00289: suggestion : /u01/archive/1_17_734225750.dbf

ORA-00280: change 509016 for thread 1 is in sequence #17

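-- By default the prompt asks for archived log 17. In fact the archive for sequence 17 has not been generated yet, so we ignore the suggestion and recover with the redo log we just copied over.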

 

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

/u01/app/oracle/oradata/orcl/redo01.log   -- note: I typed this path in by hand

Log applied.

Media recovery complete.

 

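-- I got lucky here and it worked on the first try. There are actually three redo logs and we cannot tell in advance which one to use, so we can only try them one by one.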

 

SQL> recover standby database until cancel;

ORA-00279: change 509209 generated at 11/05/2010 11:46:35 needed for thread 1

ORA-00289: suggestion : /u01/archive/1_17_734225750.dbf

ORA-00280: change 509209 for thread 1 is in sequence #17

 

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

/u01/app/oracle/oradata/orcl/redo02.log

ORA-00310: archived log contains sequence 15; sequence 17 required

ORA-00334: archived log: '/u01/app/oracle/oradata/orcl/redo02.log'
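The ORA-00310 message above is useful when trying the redo files one by one, because it states which sequence the rejected file holds and which one recovery needs. As a minimal sketch, assuming the error text keeps exactly the wording shown above, it can be parsed like this in Python (the function name is hypothetical):

```python
import re

# Matches the ORA-00310 wording shown in the transcript above.
ORA_00310 = re.compile(
    r"ORA-00310: archived log contains sequence (\d+); "
    r"sequence (\d+) required"
)


def parse_ora_00310(message):
    """Return (contained_sequence, required_sequence), or None if the
    text does not contain an ORA-00310 error."""
    match = ORA_00310.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


error_text = "ORA-00310: archived log contains sequence 15; sequence 17 required"
print(parse_ora_00310(error_text))
```

Here redo02.log holds sequence 15 while recovery asks for sequence 17, so this file is simply not the one to apply at this point.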

 

2.3 Now switch the standby to the primary role:

 

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;

SQL> SELECT DATABASE_ROLE FROM V$DATABASE;

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

SQL> ALTER DATABASE OPEN;   -- or shutdown immediate + startup

 

The steps above are the usual method, but here it failed with an error:

 

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY

*

ERROR at line 1:

ORA-16139: media recovery required

 

In fact, the recovery was already done above. Opening the database read only shows that the data has been written:

 

SQL> alter database open read only;

Database altered.

SQL> select * from dave;

 

        ID NAME

---------- --------------------

         1 dave

 

2.4 Force a failover and see what happens:

SQL> shutdown immediate   

ORA-01109: database not open

Database dismounted.

ORACLE instance shut down.

 

SQL> startup mount;

ORACLE instance started.

 

Total System Global Area  184549376 bytes

Fixed Size                  1218412 bytes

Variable Size              62916756 bytes

Database Buffers          117440512 bytes

Redo Buffers                2973696 bytes

Database mounted.

 

SQL> alter database activate standby database;

Database altered.

 

SQL> alter database open;

Database altered.

 

SQL> select open_mode from v$database;

OPEN_MODE

----------

READ WRITE

 

Oracle's description of this kind of failover:

http://download.oracle.com/docs/cd/B19306_01/server.102/b14239/scenarios.htm#i1035282

12.8.2 Failing Over to a Physical Standby Database with a Time Lag

A standby database configured to delay application of archived redo log files can be used to recover from user errors or data corruptions on the primary database. In most cases, you can query the time-delayed standby database to retrieve the data needed to repair the primary database (for example, to recover the contents of a mistakenly dropped table). In cases where the damage to the primary database is unknown or when the time required to repair the primary database is prohibitive, you can also consider failing over to a time-delayed standby database.

Assume that a backup file was inadvertently applied twice to the primary database and that the time required to repair the primary database is prohibitive. You choose to fail over to a physical standby database for which the application of archived redo log files is delayed. By doing so, you transition the standby database to the primary role at a point before the problem occurred, but you will likely incur some data loss. The following steps illustrate the process:

1.     Initiate the failover by issuing the appropriate SQL statements on the time-delayed physical standby database:

       SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
       SQL> ALTER DATABASE ACTIVATE PHYSICAL STANDBY DATABASE;
       SQL> SHUTDOWN IMMEDIATE;
       SQL> STARTUP

The ACTIVATE statement immediately transitions the standby database to the primary role and makes no attempt to apply any additional redo data that might exist at the standby location. When using this statement, you must carefully balance the cost of data loss at the standby location against the potentially extended period of downtime required to fully repair the primary database.

2.     Re-create all other standby databases in the configuration from a copy of this new primary database.

 

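Summary:

1. Unless the situation is exceptional, use the normal role transition method.

2. Before recovering with the redo, make sure all archived logs have been applied. After the redo has been applied, the standby can be activated with a forced failover.

3. I built a DG environment specifically to verify this; the experiment is done and the DG environment is wrecked too. What a hassle.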

 

 

 

 

 

 
