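Environment: AIX 5.3, Oracle 9i 9.2.0.8
Data Guard mode: maximum performance, ARCH transport
Physical link: China Telecom 4M leased line (shared by 8 databases)
After the customer's eight disaster-recovery standby databases were moved from the local machine room in Shanghai to an IDC data center in Beijing, the primary of one of them, the FMDB database, could not archive normally under Monday's production load, and the database hung.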
Mon Dec 24 10:00:47 2012
ARC0: Complete FAL archive (thread 1 sequence 55730 destination fmpdb_new)
ARC0: Evaluating archive log 2 thread 1 sequence 57890
ARC0: Unable to archive log 2 thread 1 sequence 57890
Log actively being archived by another process
ARC0: Evaluating archive log 4 thread 1 sequence 57891
ARC0: Beginning to archive log 4 thread 1 sequence 57891
Creating archive destination LOG_ARCHIVE_DEST_2: 'fmpdb_new'
Creating archive destination LOG_ARCHIVE_DEST_1: '/oradata/fmpdb/arch/1_57891.dbf'
Mon Dec 24 10:03:19 2012
ORACLE Instance fmpdb - Can not allocate log, archival required
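This message means the database has an outstanding archival request that cannot complete normally. At this point the database may become very slow or hang.
Investigation showed that the problem was caused by insufficient bandwidth on the leased line. But why would this Data Guard database hang just because it could not archive normally?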
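First, a word about our production environment. The production site and the disaster-recovery site are connected by a single 4M leased line. The theoretical throughput of a 4M line is 4096/8 = 512 KB/s, and with line overhead the real figure is lower than that. This one line has to carry the synchronization traffic of 8 databases.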
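To put the bandwidth figure in perspective, here is a rough back-of-the-envelope sketch. The 4096 kbit/s line and the 8 databases come from the text above; the even split between databases and the 100 MB redo log size are assumptions made only for illustration, since the actual log size is not stated.

```python
# Back-of-the-envelope estimate of how long one redo log takes to cross the line.
LINK_KBIT_PER_S = 4096   # 4M leased line, from the text
DATABASES = 8            # databases sharing the line, from the text
REDO_LOG_MB = 100        # ASSUMPTION: hypothetical online redo log size


def link_kb_per_s(kbit_per_s):
    """Convert a line rate in kbit/s to KB/s (8 bits per byte)."""
    return kbit_per_s / 8


def per_db_kb_per_s(kbit_per_s, databases):
    """Share per database, ASSUMING the line is split evenly."""
    return link_kb_per_s(kbit_per_s) / databases


def ship_seconds(log_mb, kb_per_s):
    """Seconds needed to push log_mb megabytes at kb_per_s KB/s."""
    return log_mb * 1024 / kb_per_s


if __name__ == "__main__":
    share = per_db_kb_per_s(LINK_KBIT_PER_S, DATABASES)
    print("whole line :", link_kb_per_s(LINK_KBIT_PER_S), "KB/s")
    print("per database:", share, "KB/s")
    print("one redo log:", ship_seconds(REDO_LOG_MB, share), "seconds")
```

Under these assumptions each database gets only a small fraction of the line, so a log that is archived locally in seconds can take many minutes to reach the standby.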
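In maximum performance mode on 9i, the archiver reads the online redo log 1 MB at a time. It writes that 1 MB chunk to the local archive directory on the primary, then writes the same chunk to the standby archive destination, and repeats this until the whole redo log has been read. When the network is extremely slow, tnsping returns no result at all (the link is not working properly, yet no error comes back), so the writes to the standby cannot proceed, the primary cannot finish archiving, and the database hangs. It is this alternating, chunk-by-chunk way of shipping archive logs in 9i that causes the failure under extreme bandwidth constraints. A search showed that Oracle exposes a hidden parameter in 9.2.0.5 and later to address this problem.
Oracle has officially documented this problem; see document ID 260040.1: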
Refining Remote Archival Over a Slow Network with the ARCH Process [ID 260040.1]
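It describes setting the hidden parameter "_LOG_ARCHIVE_CALLOUT" to LOCAL_FIRST=TRUE, which forces the database to archive locally first and only then ship the log to the standby database:
alter system set "_LOG_ARCHIVE_CALLOUT"='LOCAL_FIRST=TRUE' scope=both;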
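Because the database default is log_archive_max_processes=2, it is advisable to add archiver processes until their number matches the number of online redo log groups, i.e. log_archive_max_processes=6 (production has 6 online redo log groups; the parameter accepts values from 1 to 10).
Setting _LOG_ARCHIVE_CALLOUT='LOCAL_FIRST=TRUE' together with log_archive_max_processes=6 generally resolves the problem.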
Purpose
When archiving locally and remotely using the ARCH process where the remote destination is across a saturated or slow network you can receive the following errors in the alert log:
ARC0: Evaluating archive log 2 thread 1 sequence 100
ARC0: Unable to archive log 2 thread 1 sequence 100
Log actively being archived by another process
If the ARCH process is unable to archive at the rate at which online logs are switched then it is possible for the primary database to suspend while waiting for archiving to complete. The following discussion describes how this can occur.
Default Behavior for 9iR2 and Below
The ARCH process sits in a very tight loop waiting for an update to the controlfile that states an online log needs to be archived. Once the update occurs the ARCH process builds a list of archive destinations that need to be serviced. Once this list is complete, the ARCH process will read a one megabyte chunk of data from the online log that is to be archived. This one megabyte chunk is then sent to the first destination in the list. When the write has completed, the same one megabyte chunk is written to the second destination. This continues until all of the data from the online log being archived has been written to all destinations. So it can be said that archiving is only as fast as the slowest destination.
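The chunk-by-chunk loop described above can be modelled in a few lines. This is only a timing sketch, not Oracle's implementation: the function name, the destination rates and the log size are all hypothetical, and each write is assumed to take simply chunk size divided by destination throughput.

```python
# Toy timing model of the 9iR2 default: every 1 MB chunk is written to each
# destination in turn before the next chunk is read from the online log.

def interleaved_archive_seconds(log_mb, dest_mb_per_s, chunk_mb=1):
    """Total seconds to archive log_mb to every destination, chunk by chunk.

    dest_mb_per_s is a list of destination throughputs in MB/s
    (hypothetical values; the first entry stands for the local disk).
    """
    total = 0.0
    remaining = log_mb
    while remaining > 0:
        chunk = min(chunk_mb, remaining)
        for rate in dest_mb_per_s:   # same chunk goes to each destination
            total += chunk / rate
        remaining -= chunk
    return total


if __name__ == "__main__":
    local = 50.0           # ASSUMPTION: local disk at 50 MB/s
    remote = 64 / 1024     # ASSUMPTION: 64 KB/s share of a slow line
    print("local only      :", interleaved_archive_seconds(100, [local]))
    print("local and remote:", interleaved_archive_seconds(100, [local, remote]))
```

In this model the local copy cannot finish any sooner than the remote copy, which is the sense in which archiving is only as fast as the slowest destination.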
A common misconception is that if the LOG_ARCHIVE_DEST_n parameter for a particular destination has the OPTIONAL attribute set, then that destination will not impede local archiving. This is true during error situations while archiving to that destination, e.g. a network disconnect error, but not during an archival over a slow network, which is not an error situation. In error situations, whether the destination is marked OPTIONAL or MANDATORY, Data Guard will close that destination and continue transmitting to all other valid destinations. Transmitting to the closed destination will be attempted again only after the time specified in the REOPEN attribute has expired and a log switch has occurred. This process will continue for the number of times specified by the MAX_FAILURE attribute. During this time, it is possible that the log writer process recycles through the available online redo log groups and tries to use the online redo log file which has not yet been transmitted successfully to the remote destination. If the destination is marked OPTIONAL, log writer will reuse the online redo log file for the next set of redo. If the destination is marked MANDATORY, log writer will not be able to reuse that online redo log file, and the primary database will delay processing until that online redo log file has been successfully transmitted to the remote destination.
However, the situation is very different if the transmission is being done over a slow network. In this case, no error is encountered and the destination is not closed. Transmission continues, but is very slow. Ultimately, with the unavailability of any more online redo log groups, Log writer may suspend because the archive process is taking a long time to complete its archival, including local archival.
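How quickly log writer runs out of online redo log groups can be estimated with a small model. It is a deliberate simplification: it assumes a single archiver working through the logs one at a time, a fixed log switch interval, and a fixed archive time per log. The function name and the numbers in the demo are hypothetical, apart from the 6 redo log groups mentioned earlier.

```python
# Toy model: when does log writer first need a group that is not yet archived?

def first_stall(groups, switch_s, archive_s, max_logs=10_000):
    """Return (log_number, time_s) of the first stall, or None if none occurs.

    groups    -- number of online redo log groups
    switch_s  -- seconds between log switches (ASSUMED constant)
    archive_s -- seconds to archive one log to all destinations (ASSUMED constant)
    """
    finish_prev = 0.0
    for k in range(1, max_logs + 1):
        filled_at = k * switch_s                 # log k is full and switched out
        start = max(filled_at, finish_prev)      # single archiver, one log at a time
        finish_prev = start + archive_s
        reuse_at = (k + groups - 1) * switch_s   # log writer wants this group back
        if finish_prev > reuse_at:
            return k, reuse_at
    return None


if __name__ == "__main__":
    # 6 groups as in the environment above; 300 s and 1600 s are hypothetical.
    print(first_stall(groups=6, switch_s=300, archive_s=1600))
    print(first_stall(groups=6, switch_s=300, archive_s=200, max_logs=100))
```

As long as archiving one log takes less time than filling one, the model never stalls. Once the archive time exceeds the switch interval, the spare groups only postpone the moment when log writer has to wait.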
Refining the Default Behavior
The following underscore parameter was introduced as of 9.2.0.5 to allow the DBA to change this default behavior:
_LOG_ARCHIVE_CALLOUT='LOCAL_FIRST=TRUE'
This is a dynamic parameter, so you can set it this way:
SQL> alter system set "_LOG_ARCHIVE_CALLOUT"='LOCAL_FIRST=TRUE' scope=both;
If the above parameter is set then the ARCH process will begin archiving to the local destination first. Once the redo log has been completely and successfully archived to at least one local destination, it will then be transmitted to the remote destination. This is the default behavior beginning with Oracle Database 10g Release 1.
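The effect of archiving locally first can be illustrated with a simple timeline. This is a sketch under stated assumptions rather than a description of Oracle internals: the function name and the rates are hypothetical, and it assumes the remote transfer starts only after the local archive has completed.

```python
# Toy timeline for LOCAL_FIRST=TRUE: finish the local archive, then ship remotely.

def local_first_timeline(log_mb, local_mb_per_s, remote_mb_per_s):
    """Return (local_done_s, remote_done_s) for one redo log.

    local_done_s  -- when the local archive copy is complete
    remote_done_s -- when the standby has received the whole log
    """
    local_done = log_mb / local_mb_per_s
    remote_done = local_done + log_mb / remote_mb_per_s
    return local_done, remote_done


if __name__ == "__main__":
    # ASSUMPTIONS: 100 MB log, 50 MB/s local disk, 64 KB/s share of the line.
    local_done, remote_done = local_first_timeline(100, 50.0, 64 / 1024)
    print("local archive complete after", local_done, "s")
    print("standby copy complete after ", remote_done, "s")
```

The standby still receives the log just as late, but the local copy now completes at local disk speed. Whether the online log can be reused at that point also depends on the remote destination's OPTIONAL or MANDATORY setting, which this sketch does not model.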
Starting in 9.2.0.7 patchsets, one ARCH process will begin acting as a ‘dedicated’ archiver, handling only local archival duties. It will not perform remote log shipping or service FAL requests. This is a backport of behavior from 10gR1 to 9iR2.