PURPOSE
-------
The purpose of this document is to explain the benefits and functionality of
fusion recovery in an Oracle Real Application Clusters Environment.
SCOPE & APPLICATION
-------------------
This document is intended for Oracle Real Application Clusters database
administrators who would like to understand how fusion recovery works
and how it can increase the availability of their clustered database.
Crash/Instance Recovery for Cache Fusion
----------------------------------------
Because past images may exist in remote buffer caches, instance and crash recovery
are handled differently in a RAC environment than in previous versions. There are
two major differences. First, thread recovery of the failed instance(s) is done by a
surviving instance's SMON process instead of a foreground process. Second, bounded
instance and crash recovery now uses a two-pass log read during thread recovery, in
which SMON uses BWRs (block written records) to eliminate blocks from the recovery
set. This enhancement should shorten recovery time when past images exist. If an
instance fails, the sequence is:
1. The instance, or instances, dies.
2. Failure is detected by cluster manager or CGS.
3. Reconfiguration occurs: all locks owned by the departing instance are
remastered (see Note 139435.1 for more info), and SMON performs the first
pass read of the redo threads of the failed instances.
4. SMON claims the locks needed to recover the blocks found by the first pass read.
5. Locks are obtained, the second pass over the redo threads of the failed
instances is performed, and blocks become available as they are recovered.
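The first pass read in step 3 can be illustrated with a small sketch. This is not Oracle code: the `RedoRecord` type, the `first_pass` function, and the two record kinds are hypothetical names, and the model is simplified to assume that a BWR for a block makes all earlier redo for that block unnecessary. Each thread is assumed to be already sorted by SCN.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class RedoRecord:
    scn: int      # system change number of the record
    block: int    # block the record refers to
    kind: str     # "CHANGE" or "BWR" (hypothetical labels)


def first_pass(threads):
    """Return [(block, first_dirty_scn)] for blocks that still need recovery.

    threads: list of per-instance redo lists, each sorted by SCN.
    """
    recovery_set = {}
    # Merge all failed threads into a single SCN-ordered stream.
    for rec in heapq.merge(*threads, key=lambda r: r.scn):
        if rec.kind == "BWR":
            # Block was written to disk: earlier redo is no longer needed.
            recovery_set.pop(rec.block, None)
        elif rec.block not in recovery_set:
            # First change seen since the last write: record first-dirty SCN.
            recovery_set[rec.block] = rec.scn
    # Order entries by first-dirty SCN, the order used to claim locks.
    return sorted(recovery_set.items(), key=lambda item: item[1])


thread_a = [RedoRecord(10, 1, "CHANGE"), RedoRecord(30, 1, "BWR"),
            RedoRecord(40, 2, "CHANGE")]
thread_b = [RedoRecord(20, 3, "CHANGE"), RedoRecord(50, 1, "CHANGE")]
result = first_pass([thread_a, thread_b])
```

In this example block 1 is dropped at SCN 30 by its BWR and re-enters the set at SCN 50, because it was modified again after the write.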
After an instance dies and the failure is detected, the SMON process of a surviving
instance starts the first pass log read of the failed instance's redo thread. If
more than one instance failed, SMON merges their redo threads in SCN order to ensure
that changes are applied in the correct sequence. SMON also finds BWRs (block
written records) in the redo stream and removes entries that are no longer needed
for recovery because the blocks they refer to were already written to disk. The
final product of the first pass log read is a recovery set that contains only blocks
modified by the failed instance with no subsequent BWR to indicate that the blocks
were later written. The entries in the recovery set are ordered by first-dirty SCN,
which determines the order in which instance recovery locks are acquired. For each
block in the recovery set, the recovering SMON process then informs the master node
of the block's lock element that it will be taking ownership of the block and lock
for recovery. This is handled differently depending on ownership of the lock
element (LE), as described below:
Case 1: LE not open (or in NL0 mode) on recovering instance, no other instances own
lock element:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| | | | | |
---------------- ----------------- -----------------
Action: Acquire the lock element in XL0 mode, read the block from disk, and apply the
redo changes. DBWR writes out the recovery buffer when recovery is complete:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| XL0 | | | | |
---------------- ----------------- -----------------
|
keep block in recovery list
Case 2: LE not open (or in NL0 mode) on recovering instance, other instance has LE
in SL0 or XL0 mode:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| | | XL0 | | |
---------------- ----------------- -----------------
Action: No recovery is needed because a current copy of the buffer already exists on
another instance. Remove the block entry from the recovery set:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| | | XL0 | | |
---------------- ----------------- -----------------
|
remove block from recovery list
Case 3: LE not open (or in NL0 mode) on recovering instance, other instance has LE
in SG# or XG#:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| | | XG0 | | |
---------------- ----------------- -----------------
Action: Initiate a write of the current block. No recovery is needed because a current
copy of the buffer already exists on another instance, so remove the block entry from
the recovery set. Write completion releases the recovery buffer and lock as usual:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| NG1 | | | | |
---------------- ----------------- -----------------
| |
| write block to disk
remove block from recovery list
Case 4: LE not open (or in NL0 mode) on recovering instance, other instance has LE
in NG1.
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| | | NG1 | | |
---------------- ----------------- -----------------
Action: Get consistent read image of latest past image based on SCN, apply redo
changes and write out recovery buffer when complete.
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| acquires XG0 | | NG1 | | |
---------------- ----------------- -----------------
| |
| send CR block to recovering instance
keep block in recovery list
Case 5: LE open in recovering instance in SL0 or XL0, other instance has no lock.
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| XL0 | | | | |
---------------- ----------------- -----------------
Action: No recovery is needed because a current copy of the buffer already exists on
the recovering instance. Remove the block entry from the recovery set:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| XL0 | | | | |
---------------- ----------------- -----------------
|
remove block from recovery list
Case 6: LE open in recovering instance in SG# or XG#, other instance doesn't matter:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| XG0 | | NG1 | | |
---------------- ----------------- -----------------
Action: Initiate write of current block, no recovery needed on recovering instance.
Release recovery buffer and decrement past image count when block write completes.
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| XG0 | | NG1 | | |
---------------- ----------------- -----------------
|
write block to disk
remove block from recovery list
Case 7: LE open in recovering instance in NG1 mode, other instance has LE in SG# or
XG# mode.
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| NG1 | | XG0 | | |
---------------- ----------------- -----------------
Action: Initiate write of current block on remote instance, no recovery needed on
recovering instance. Release recovery buffer and decrement past image count when
block write completes:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| NG1 | | XG0 | | |
---------------- ----------------- -----------------
| |
| write block to disk
remove block from recovery list
Case 8: LE open in recovering instance in NG1 mode, other instance has LE in NG#
mode:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| NG1 | | NG0 | | |
---------------- ----------------- -----------------
Action: Get consistent read copy of block from highest past image based on SCN.
Apply redo changes and write out recovery buffer when complete:
---------------- ----------------- -----------------
| Recovering | | Other Open | | Failed |
| Instance | | Instance | | Instance |
| | | | | |
| Lock Held | | Lock Held | | Lock Held |
| on LENUM 123: | | on LENUM 123: | | on LENUM 123: |
| acquires XG1 | | NG0 | | |
---------------- ----------------- -----------------
| |
| send CR block to recovering instance
keep block in recovery list
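The eight cases above can be summarized as a lookup keyed on the lock mode held by the recovering instance and by the other open instance. The sketch below is only an illustration: `lock_class`, `CASES`, and `classify` are hypothetical names, any NG-prefixed mode is assumed to be treated like NG1, and combinations that the cases do not describe raise an error rather than guess.

```python
def lock_class(mode):
    """Collapse an LE mode string such as 'XG0' into a coarse class."""
    if mode is None or mode == "NL0":
        return "none"            # LE not open, or held in NL0 mode
    prefix = mode[:2]
    if prefix in ("SL", "XL"):
        return "local"           # SL0 or XL0
    if prefix in ("SG", "XG"):
        return "global"          # SG# or XG#
    if prefix == "NG":
        return "pi"              # NG#: holder has a past image
    raise ValueError(f"unknown lock mode: {mode!r}")


# (recovering class, other class) -> (case, keep in recovery list?, action)
CASES = {
    ("none", "none"):   (1, True,  "acquire XL0, read block from disk, apply redo"),
    ("none", "local"):  (2, False, "current copy exists on other instance"),
    ("none", "global"): (3, False, "other instance writes current block to disk"),
    ("none", "pi"):     (4, True,  "get CR image of latest past image, apply redo"),
    ("local", "none"):  (5, False, "current copy already cached, no recovery needed"),
    ("pi", "global"):   (7, False, "other instance writes current block to disk"),
    ("pi", "pi"):       (8, True,  "get CR copy from highest-SCN past image, apply redo"),
}


def classify(recovering_mode, other_mode):
    """Return (case number, keep_in_recovery_list, action) for one block."""
    mine, theirs = lock_class(recovering_mode), lock_class(other_mode)
    if mine == "global":
        # Case 6: the other instance's mode does not matter.
        return (6, False, "write current block to disk")
    try:
        return CASES[(mine, theirs)]
    except KeyError:
        raise ValueError("combination not covered by the eight cases") from None


case, keep, action = classify("NG1", "NG0")
```

Only cases 1, 4, and 8 keep the block in the recovery list; in every other case a current copy of the block already exists in some surviving cache.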
After the lock claims described in the cases above, the recovering instance should
hold locks on every block in the recovery set. Other instances will not be able to
acquire these locks until the recovery operation is completed. While blocks are
cached for recovery, instance recovery buffers cannot be replaced or aged out except
by another recovery buffer request. At this point the second pass log read and redo
application can begin. During the second pass log read, the redo threads of the
failed instances are again merged by SCN and the redo is applied to the datafiles.
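A minimal sketch of the second pass follows, under stated assumptions: redo records are modeled as `(scn, block, change)` tuples, each recovery buffer is a dictionary carrying the SCN of the block image it was built from, and redo at or below that SCN is skipped as already reflected in the buffer. The function name `second_pass` and the data layout are hypothetical and chosen only for illustration.

```python
import heapq


def second_pass(threads, recovery_blocks, buffers):
    """Apply merged redo in SCN order to blocks in the recovery set.

    threads:         per-instance lists of (scn, block, change), sorted by SCN
    recovery_blocks: set of block numbers produced by the first pass
    buffers:         block -> {"scn": int, "changes": list} recovery buffers
    Returns the number of redo records applied.
    """
    applied = 0
    # Merge on SCN only, so ties never compare the change payloads.
    for scn, block, change in heapq.merge(*threads, key=lambda r: r[0]):
        if block not in recovery_blocks:
            continue                      # eliminated during the first pass
        buf = buffers[block]
        if scn <= buf["scn"]:
            continue                      # change already present in the buffer
        buf["changes"].append(change)     # stand-in for applying the redo
        buf["scn"] = scn
        applied += 1
    return applied


buffers = {1: {"scn": 15, "changes": []}, 2: {"scn": 0, "changes": []}}
thread_1 = [(10, 1, "a"), (40, 2, "c")]
thread_2 = [(20, 1, "b"), (30, 3, "x")]
count = second_pass([thread_1, thread_2], {1, 2}, buffers)
```

Here the record at SCN 10 is skipped because the buffer for block 1 is already at SCN 15, and the record for block 3 is skipped because that block is not in the recovery set.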
Instance Recovery Failure Scenarios:
o If recovery fails without the death of the recovering instance, instance
recovery will be restarted.
o If the recovering instance dies, a surviving instance (if one exists) will
acquire the instance recovery enqueue and start recovery. Crash recovery
will be necessary if all instances are down.
o If a non-recovering instance fails, SMON will abort recovery, release the
IR enqueue, and the next live instance will re-attempt instance recovery.
o If there are I/O errors the file is taken offline and instance recovery
is restarted. If the file is the system datafile the recovering instance
will crash; eventually all instances in the cluster will go down and
media recovery will be required.
o If block corruption is encountered during redo application, online block
recovery will attempt to clean up the block so that instance recovery
can proceed.
Online Block Recovery for Cache Fusion
--------------------------------------
When a data buffer becomes corrupt in an instance's cache, the instance will
initiate online block recovery. Block recovery will occur if either a foreground
process dies while applying changes or an error is generated during redo application.
In the first case, PMON initiates block recovery and in the second case the
foreground process initiates block recovery. Online block recovery consists of
finding the block's predecessor and applying redo changes from the online logs of the
thread in which corruption occurred. The predecessor of a fusion block is its most
recent past image. If there is no past image then the block on disk is the
predecessor. For non-fusion blocks, the disk copy is always the predecessor.
If the LE of the block needing recovery is held in XL0 status then the predecessor
will be located on disk.
If the LE of the block needing recovery is held in XG# status then the predecessor
will exist in another instance's buffer cache. The instance with the highest SCN PI
image of the block will send a consistent read copy of the block to the recovering
instance.
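The predecessor rules above can be sketched as a small selection function. The name `find_predecessor`, the `past_images` mapping of instance name to PI SCN, and the returned tuple shape are all assumptions made for this illustration.

```python
def find_predecessor(le_mode, disk_scn, past_images):
    """Pick the block image that online block recovery starts from.

    le_mode:     lock mode of the block needing recovery, e.g. "XL0" or "XG1"
    disk_scn:    SCN of the block image on disk
    past_images: instance name -> SCN of the past image that instance holds
    Returns (source, scn), where source is "disk" or an instance name.
    """
    if le_mode.startswith("XG") and past_images:
        # The instance holding the highest-SCN past image sends a
        # consistent read copy of the block to the recovering instance.
        holder = max(past_images, key=past_images.get)
        return holder, past_images[holder]
    # XL0, or no past image anywhere: the disk copy is the predecessor.
    return "disk", disk_scn


source, scn = find_predecessor("XG1", 100, {"inst2": 120, "inst3": 150})
```

Redo from the online logs of the thread in which the corruption occurred would then be applied on top of the returned image.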
Media Recovery for Cache Fusion
-------------------------------
Cache fusion does not impact the existing mechanism for media recovery.
RELATED DOCUMENTS
-----------------
Note 139436.1 - Understanding 9i Real Application Clusters Cache Fusion
Note 139435.1 - Fast Reconfiguration in 9i Real Application Clusters