Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.1.0.7 to 12.1.0.1 [Release 11.1 to 12.1]
Oracle Database - Enterprise Edition - Version 12.1.0.2 to 12.1.0.2 [Release 12.1]
Information in this document applies to any platform.
***Checked for relevance on 3-Jul-2015***

PURPOSE
Oracle Active Data Guard is a data protection and availability solution for the Oracle Database. The purpose of this document is to give Database Administrators best-practice recommendations for configuring key parameters, database features, system features, and operational practices that enable the best corruption detection, prevention, and automatic repair in an MAA or Data Guard configuration. This note also provides additional background on each parameter, performance considerations, and relevant Oracle documentation references.

SCOPE
This document is intended for Database Administrators who want to learn how to prevent, detect, and automatically repair various data block corruptions. A corrupt block is a block that has been changed so that it differs from what Oracle Database expects to find. This note covers two data block corruption types:
Block corruptions can also be divided into interblock corruption and intrablock corruption:
DETAILS

Causes of corrupted blocks
Block corruptions can be caused by various failures, including but not limited to the following:
Block corruptions can also be caused by operator errors, such as copying backups over existing data files or restoring inconsistent database backups. Since corruptions can happen anywhere in the system and software stack, MAA recommends a comprehensive list of architectural and configuration best practices. Active Data Guard is a strategic and important component for achieving the most comprehensive Oracle data protection.

Corruption Summary
The table outlines block corruption checks for various manual operational checks and for runtime and background corruption checks. Manual checks are checks that the DBA and operations team can incorporate, such as running RMAN backups, RMAN "check logical" validations, or the ANALYZE VALIDATE STRUCTURE command on important objects. Manual checks are especially important for validating data that is rarely updated or queried. Runtime checks are far superior in that they catch corruptions almost immediately, or during runtime for actively queried and updated data. Runtime checks can prevent corruptions or automatically fix corruptions, resulting in better data protection and higher application availability. A new background check has been introduced in Exadata to automatically scan and scrub disks intelligently with no application overhead and to automatically fix physically corrupted blocks.
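For example, a manual RMAN logical validation of the whole database might look like the following sketch (run from the RMAN prompt; this scans the datafiles without producing a backup):

```sql
# From the RMAN prompt: check every datafile for both physical
# and logical (intra-block) corruption.
BACKUP VALIDATE CHECK LOGICAL DATABASE;
```

Any corrupt blocks found by the validation are recorded in the V$DATABASE_BLOCK_CORRUPTION view, which can then be queried from SQL*Plus.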
Configuration DetailsConfigure at Primary Database:
Configure at Data Guard Standby Database:
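The individual parameter values are covered in the background sections that follow; as a condensed sketch (values reflect the general recommendations discussed below and must be validated against your own workload):

```sql
-- Primary database:
ALTER SYSTEM SET DB_BLOCK_CHECKSUM = FULL;       -- physical corruption detection
ALTER SYSTEM SET DB_BLOCK_CHECKING = FULL;       -- or MEDIUM if FULL costs too much
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = TYPICAL;  -- lost write detection

-- Data Guard standby database (same corruption parameters apply;
-- DB_BLOCK_CHECKING is especially important here if it cannot be
-- enabled on the primary):
ALTER SYSTEM SET DB_BLOCK_CHECKSUM = FULL;
ALTER SYSTEM SET DB_BLOCK_CHECKING = FULL;
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = TYPICAL;
```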
Review the additional background on each of these settings provided in the sections below, especially if tests show that any of the above recommendations have a greater than acceptable impact on the performance of your application.

Deploy Primary and Standby on Oracle Exadata Database Machine and Exadata Storage Servers
In addition to the settings above, which provide optimal data protection for the Oracle Database on any platform supported by Oracle, the Exadata Database Machine and SPARC SuperCluster also implement the comprehensive Oracle Hardware Assisted Resilient Data (HARD) specifications, providing a unique level of validation for Oracle block data structures. The Exadata HARD checks include support for spfiles, control files, log files, Oracle data files, and the Data Guard broker file when residing on Exadata Storage, and they work during ASM rebalance and ASM resync operations. Oracle Exadata Storage Server Software detects corruptions introduced into the I/O path between the database and storage, and stops corrupted data from being written to disk when a HARD check fails. This eliminates a large class of failures that the database industry had previously been unable to prevent. Examples of the Exadata HARD checks include: 1) redo and block checksum, 2) correct log sequence, 3) block type validation, 4) block number validation, and 5) Oracle data structures such as block magic number, block size, sequence#, and block header and tail data structures. Exadata HARD checks are the most comprehensive set of Oracle data block checks initiated from the storage software (cellsrv), and they work transparently once the database's DB_BLOCK_CHECKSUM parameter is enabled. Except for the case of Exadata storage, the Oracle HARD initiative has ended. Most past storage HARD implementations provided only checksums and very simple data block checks.
Starting with Oracle Exadata Software 11.2.3.3.0 and Oracle Database 11.2.0.4, Oracle Exadata Storage Server Software provides Automatic Hard Disk Scrub and Repair. This feature automatically inspects and repairs hard disks periodically when the hard disks are idle. If bad sectors are detected on a hard disk, Oracle Exadata Storage Server Software automatically sends a request to Oracle ASM to repair the bad sectors by reading the data from another mirror copy. By default, the hard disk scrub runs every two weeks. It is very lightweight and adds enormous value by fixing physical block corruptions even in infrequently accessed data.

Deploy Oracle Data Integrity eXtensions (DIX) with T10 Data Integrity Field (DIF) when not on Exadata
The Oracle Linux team has collaborated with hardware vendors and Oracle database development to extend Oracle data integrity extensions from Oracle's operating system (Linux) through various vendors' host adapters down to the storage device. With these extensions, DIX provides end-to-end data integrity for reads and writes through checksum validation. The prerequisite is to use certified storage, HBA, and disk firmware. An example of this partnership is DIX integration with Oracle Linux, Emulex or QLogic Host Bus Adapters, and any T10 DIF-capable storage array such as EMC VMAX. Refer to the following documentation for more information.
General Guidance on Performance Trade-offs
Performance implications are discussed in each of the sections below. In general, the processing that accompanies higher levels of corruption checking, automatic repair, or fast point-in-time recovery will create additional overhead on primary and standby systems. While this overhead is reduced with every Oracle release as validation and repair algorithms are enhanced, the usual recommendation for conducting thorough performance testing still applies.

DB_BLOCK_CHECKSUM - Background
This parameter determines whether DBWn and the direct loader will calculate a checksum (a number calculated from all the bytes stored in the block) and store it in the cache header of every data block and redo log block when writing to disk. The checksum is used to validate that a block is not physically corrupt, detecting corruptions caused by underlying disks, storage systems, or I/O systems. If checksum validation fails when the parameter is set to FULL, Oracle will attempt to recover the block by reading it from disk (or from another instance) and applying the redo needed to fix the block. Corruptions are recorded as ORA-600 or ORA-01578 errors in the database or ASM alert logs.
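As a small illustration of working with this parameter (a sketch; choose SCOPE to match how your instance manages its parameter file):

```sql
-- Inspect the current value from SQL*Plus.
SHOW PARAMETER db_block_checksum

-- TYPICAL computes and verifies checksums on disk writes and reads;
-- FULL additionally verifies the checksum before in-memory block
-- changes (e.g. update/delete redo application), catching
-- corruption earlier.
ALTER SYSTEM SET DB_BLOCK_CHECKSUM = FULL SCOPE = BOTH;
```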
DB_BLOCK_CHECKING - Background
This parameter specifies whether Oracle performs logical intra-block checking for database blocks (an in-memory semantic check). Block checking verifies block contents, including header and user data, when changes are made to the block, and prevents in-memory corruptions from being written to disk. It performs a logical validation of the integrity of a block by walking through the data on the block, making sure it is self-consistent. When DB_BLOCK_CHECKING is set to MEDIUM or FULL, block corruptions that are detected in memory are automatically repaired by reading the good block from disk and applying the required redo. If for any reason the corruption cannot be repaired, an error is reported and the data block write is prevented. All corruptions are reported as ORA-600 or ORA-01578 errors in the database or ASM alert logs.
Oracle recommends setting DB_BLOCK_CHECKING to FULL at both primary and standby databases. Workload specific testing is required to assess whether the performance overhead of FULL is acceptable. If tests show unacceptable performance impact, then set DB_BLOCK_CHECKING to MEDIUM. Performance testing is particularly important given that overhead is incurred on every block change. Block checking typically causes 1% to 10% overhead, but for update and insert intensive applications (such as Redo Apply at a standby database) the overhead can be much higher. OLTP compressed tables also require additional checks that can result in higher overhead depending on the frequency of updates to those tables. If performance concerns prevent setting DB_BLOCK_CHECKING to either FULL or MEDIUM at a primary database, then it becomes even more important to enable this at the standby database. This protects the standby database from logical block corruption that would be undetected at the primary database. Note: When DB_BLOCK_CHECKING is set on the primary database, end-to-end checksums introduced in Oracle Database 11g make it unnecessary to use DB_BLOCK_CHECKING at the standby to detect primary database corruption. Oracle, however, still recommends enabling this parameter on the standby database for the following reasons:
DB_LOST_WRITE_PROTECT - Background
This parameter enables lost write detection. A data block lost write occurs when an I/O subsystem acknowledges the completion of a block write when, in fact, the write did not occur in persistent storage or, in some cases, an older version of the block was written instead.
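A minimal sketch of enabling this protection (TYPICAL is the commonly recommended value in Data Guard configurations; validate the overhead in your own environment):

```sql
-- On the primary and on each standby: record buffer cache block
-- reads in the redo so lost writes can be detected downstream.
-- TYPICAL covers read-write tablespaces; FULL also covers
-- read-only tablespaces.
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = TYPICAL SCOPE = BOTH;
```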
Starting with Oracle 11.2.0.4, there is a new Data Guard broker configuration-level property, PrimaryLostWriteAction. This property allows the user to choose whether the primary continues operation or is shut down when a lost write is detected. The default value is CONTINUE. Refer to the Oracle Data Guard Broker 11g Release 2 (11.2) documentation. Setting DB_LOST_WRITE_PROTECT in a non-Data Guard environment is still advantageous for better troubleshooting and debugging of lost write problems. Starting in Oracle 12.1, Active Data Guard with lost write protection can clearly detect lost writes on the standby. When the recovery process detects a lost write on the standby, the following error is reported: ORA-00753: recovery detected a lost write of a data block. Cause: A data block write to storage was lost during normal redo database operation on the standby database or during recovery on a primary database. For an example of the benefit of using DB_LOST_WRITE_PROTECT, refer to the Data Guard Protection From Lost-Write Corruption demo at http://www.oracle.com/technetwork/database/features/availability/demonstrations-092317.html.

Oracle Automatic Storage Management (ASM) - Background
Read errors can be the result of a loss of access to the entire disk or of media corruptions on an otherwise healthy disk. Oracle ASM tries to recover from read errors on corrupted sectors on a disk. When a read error by the database or Oracle ASM triggers the Oracle ASM instance to attempt bad block remapping, Oracle ASM reads a good copy of the extent and copies it to the disk that had the read error.
Another benefit of Oracle ASM-based mirroring is that the database instance is aware of the mirroring. For many types of physical block corruptions, such as a bad checksum, the database instance proceeds through the mirror sides looking for valid content and continues without errors. If the process in the database that encountered the read error can obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides. When encountering a write error, a database instance sends the Oracle ASM instance a disk offline message.
When the Oracle ASM instance receives a write error message from a database instance, or when an Oracle ASM instance encounters a write error itself, the Oracle ASM instance attempts to take the disk offline. Oracle ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline. The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before Oracle ASM or database I/O occurs. For information about the remap command, see "remap". When ASM detects any block corruption, ASM logs the error to the ASM alert.log file. The same corruption error may not appear in the database alert.log or in the application if ASM can correct the corruption automatically. Starting with Oracle 12c, Oracle ASM disk scrubbing checks for logical data corruptions and repairs them automatically in normal and high redundancy disk groups. The feature is designed so that it does not have any impact on regular input/output (I/O) operations in production systems. The scrubbing process repairs logical corruptions using the Oracle ASM mirror disks, and uses Oracle ASM rebalancing to minimize I/O overhead. The scrubbing process is visible in fields of the V$ASM_OPERATION view. Refer to the Oracle Automatic Storage Management Administrator's Guide 12c Release 1 (12.1). These ASM benefits are available for all databases using ASM. Since every Exadata Database Machine uses ASM, all of these benefits are always available to Exadata customers.

Oracle Flashback Technologies - Background
Flashback Database is used for fast point-in-time recovery to recover from human errors that cause widespread damage to a production database. It is also used for fast reinstatement of a failed primary database as a new standby database following a Data Guard failover.
Flashback Database uses flashback logs to rewind an Oracle Database to a previous point in time. See My Oracle Support Document 565535.1 for Flashback Database best practices. See Section 13.2 of Data Guard Concepts and Administration for fast reinstatement using Flashback Database. Note that new optimizations in Oracle Database 11.2.0.2 reduce the impact on load operations when Flashback Database is enabled.
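A hedged sketch of enabling Flashback Database (the destination, size, and retention shown are placeholders, not recommendations; size the fast recovery area per Document 565535.1):

```sql
-- Configure a fast recovery area to hold the flashback logs.
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE = 100G;
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST = '+RECO';       -- placeholder destination
ALTER SYSTEM SET DB_FLASHBACK_RETENTION_TARGET = 1440;  -- minutes (24 hours)

-- From 11g onward this can be done while the database is open.
ALTER DATABASE FLASHBACK ON;
```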
Starting in Oracle 11g, manual RMAN block media recovery automatically searches the flashback logs for good copies of blocks to help repair physical data block corruptions quickly.

Oracle Data Guard
Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data. One of the key benefits of Oracle Data Guard is its continuous Oracle-aware validation of all changes, using multiple checks for physical and logical consistency of structures within an Oracle data block and redo, before updates are applied to a standby database. This isolates the standby database from data corruptions that can occur on the primary system.

Active Data Guard Automatic Block Repair - Background
If Oracle detects a physical block corruption on either the primary or a standby database in a configuration that uses Active Data Guard, Oracle automatically repairs the corrupt block using a valid copy from the other database. This repair is transparent to primary database applications and users; no application modifications are necessary. If the nature of the corruption makes automatic repair impossible (e.g., file header corruption, exceeding the 60-second timeout for one block repair, or reaching 100 outstanding block corruption incidents), an ORA-1578 error is returned to the application. Manual block media recovery can be executed at a later time. Automatic Block Repair requires an Active Data Guard standby and Data Guard real-time apply, and the following database initialization parameters must be configured on the standby database:
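As a sketch of the stated prerequisites (an Active Data Guard standby with real-time apply), assuming a physical standby managed without the broker:

```sql
-- On the physical standby: stop redo apply, open read-only
-- (Active Data Guard), then restart apply in real time.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;
```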
Active Data Guard auto block repair can fix physical block corruptions, which are the most common type of block corruption. It does not address logical block corruptions, which are normally prevented by setting DB_BLOCK_CHECKING on the primary and standby. If enabling DB_BLOCK_CHECKING incurs an unacceptable performance impact on the primary, we recommend enabling it on your standby database.
Additional Operational Practices to detect block corruptions
To verify the integrity of the structure of a table, index, cluster, or materialized view, use the ANALYZE statement with the VALIDATE STRUCTURE option. If the structure is valid, no error is returned. However, if the structure is corrupt, you receive an error message. For example, in rare cases such as hardware or other system failures, an index can become corrupted and not perform correctly. When validating the index, you can confirm that every entry in the index points to the correct row of the associated table. If the index is corrupt, you can drop and re-create it. If a table, index, or cluster is corrupt, you should drop it and re-create it. If a materialized view is corrupt, perform a complete refresh and ensure that you have remedied the problem. If the problem is not corrected, drop and re-create the materialized view.
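An illustrative sketch of these checks (the schema and object names here are hypothetical):

```sql
-- Validate a table's structure and, with CASCADE, the structure of
-- its indexes and the correspondence between table rows and index
-- entries. No error message means the structures are valid.
ANALYZE TABLE scott.emp VALIDATE STRUCTURE CASCADE;

-- Validate a single index; if it reports corruption, drop and
-- re-create the index as described above.
ANALYZE INDEX scott.pk_emp VALIDATE STRUCTURE;
```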
REFERENCES
NOTE:1265884.1 - Resolving ORA-752 or ORA-600 [3020] During Standby Recovery
NOTE:565535.1 - Flashback Database Best Practices & Performance
http://www.oracle.com/technetwork/database/availability/maa-datacorruption-bestpractices-396464.pdf