Corruption detection, prevention, and automatic repair with Oracle Data Guard

Corruption in an Oracle database block means that a block doesn't contain the data that the database expects to find. This can be caused by various failures in the hardware environment, including disks, disk controllers, memory or network components or software errors in the operating system, firmware, the volume manager, and the Oracle database software itself.

Oracle offers some initialization parameters to control the level of corruption prevention and detection. Of course, a higher level brings performance issues with it. In a Data Guard configuration, using the standby database for corruption detection and prevention will bring higher data protection and availability with less performance effect on the primary database.

Let's first start with learning the three types of block corruption in Oracle databases.

  • Physical block corruption: In a physically corrupted database block, the block header may be corrupted, the block may be misplaced or fractured, or the block checksum may be invalid. These types of corruptions are reported by the Oracle database as a ORA-01578 error in the alert log.
  • Logical block corruption: In a logical block corruption, the block contains a valid checksum; however, the block content is corrupt. This corruption is not reported in the alert log but if db_block_checking is enabled, ORA-600 internal errors may show up.
  • The third type of Oracle database corruptions are caused by stray writes, lost writes, or misdirected writes. In this case, the block may not be corrupted as described in the first two types; however, the content of the block is older, stale,or in the wrong location.

Now we'll learn about preventing and detecting these corruptions, especially in a Data Guard configuration, by studying the related initialization parameters. There are three important parameters in this study: DB_BLOCK_CHECKSUM, DB_BLOCK_ CHECKING, and DB_LOST_WRITE_PROTECT.

DB_BLOCK_CHECKSUM

This is the initialization parameter used to detect physical corruptions in a database. As we know, a checksum is the data calculated from the arbitrary data with a specific function. It can be recalculated anytime and compared with the stored result of the previous instance to ensure integrity of data. When we use DB_BLOCK_CHECKSUM, the Oracle database calculates a checksum and stores it in the header of each data block when writing to the disk. The following are the possible values of this initialization parameter:

  • OFF (FALSE): Checksums are calculated only for the SYSTEM tablespace data blocks. The user's tablespace and log checksum are not performed. The FALSE value is preserved for backward compatibility, and has the same effect as OFF.
  • TYPICAL (TRUE): When a block of any tablespace is read, checksum is calculated and compared. Also, at the last write of the block, the new checksum is stored. The TRUE value is preserved for backward compatibility and has the same effect as TYPICAL.
  • FULL: In addition to checksum calculations in the TYPICAL mode, Oracle also verifies checksum before the update/delete statements. Also, Oracle gives every log block a checksum before writing it to the current log. Before 11g, log block checksum was performed by LGWR; however, in 11g, the database creates foreground processes for this purpose for better performance. Note that, when checksum validation fails in the FULL mode, Oracle will try to recover the block using the data version on disk and redo data.

In a Data Guard environment, Oracle recommends setting this parameter to FULL on both primary and standby databases. Oracle also indicates that setting it to FULL causes 4 percent to 5 percent overhead in a primary database, whereas the TYPICAL mode causes 1 percent to 2 percent overhead. If setting FULL in the primary database has unacceptable performance degradation, consider setting it as TYPICAL on the primary database and FULL on the standby database.

DB_BLOCK_CHECKING

This parameter specifies whether the database will perform block checking for database blocks and detect logical corruptions. Oracle controls the header and the data in the block if it's logically consistent.

The following are the possible values of this initialization parameter:

  • OFF (FALSE): Only semantic block checking is performed for the blocks of the SYSTEM tablespace. No block checking is performed for the other tablespaces.
  • LOW: Only block header checks are performed when the block content changes. This setting has very limited benefit for corruption detection and prevention, because there's no block checking on the data blocks itself.
  • MEDIUM: Block checking is performed for all objects except indexes.
  • FULL (TRUE): All the LOW and MEDIUM checks are performed for all objects. When MEDIUM or FULL is being used, block corruptions detected in memory will be automatically repaired using the data version on disk and redo data.

In a Data Guard environment, Oracle recommends setting this parameter to FULL at both the primary and standby databases for the highest level of detection and prevention against logical corruptions. However, the performance effect of using this checking can be very high. Oracle states that block checking typically causes 1 percent to 10 percent overhead on the primary database; for update- and insert-intensive applications, the performance effect may me even higher. We should test its effect on the primary database and if the FULL value is unacceptable in terms of performance effect, we should consider setting it to MEDIUM or LOW.

When we cannot set it to FULL or MEDIUM on the primary database because of performance issues, it becomes more important to enable it on the standby database. The performance effect of block checking on Redo Apply may also be high; in some cases it may halve the Redo Apply rate. So we must test and evaluate the effect. We can sum up by saying that it's good practice to set the highest degree of logical corruption detection and prevention on a standby database using the DB_BLOCK_ CHECKING parameter.

DB_LOST_WRITE_PROTECT

Lost-write corruption is a serious type of corruption that occurs on the storage layer. The I/O subsystem acknowledges to the database that the write operation is completed, but it is actually not. The DB_LOST_WRITE_PROTECT initialization parameter can be used to detect the lost write. Lost-write detection on the standby database is an 11g feature and it's most effective when used with Data Guard.

The following are the possible values of this initialization parameter:

  • NONE: Lost-write detection is disabled.
  • TYPICAL: Lost-write detection is enabled for read-write tablespaces. Buffer cache reads are recorded in the redo log and this information is used to detect lost writes. When set in the physical standby database, the MRP process will check for lost writes in read-write tablespaces and stop recovery if detected. Thus, corruption will not be applied on the standby database.
  • FULL: Lost-write detection for read-only tablespaces is included besides read-write tablespaces.

The recommended setting is FULL for both primary and standby databases, and for most cases its performance effect on the primary database and Redo Apply is negligible.

Automatic block media repair

In Oracle 11gR2, when Active Data Guard is being used with Real-Time Apply, if a physical corruption is detected on the primary database, Oracle will automatically try to repair the corruption using the non-corrupted block on the standby database. This operation is also valid in the opposite direction, which means standby database corruption will be repaired using the data block on the primary database. A notification will be printed in the alert log about the automatic block media repair operation in the meantime; this repair operation is completely transparent to database users.

In order to run ABMR successfully, the following initialization parameters must be configured:

  • The LOG_ARCHIVE_CONFIG parameter with a DG_CONFIG list on both the primary and standby databases
  • The LOG_ARCHIVE_DEST_n parameter for the primary database
  • The FAL_SERVER parameter for the standby database with the Oracle Net service name pointing to the primary database

Tip

We can also manually repair a corrupted data block with the RMAN command's RECOVER BLOCK command. By default, this command will try to use an Active Data Guard physical standby database if it exists. In order to exclude the standby database as a source to repair corruption, we must use the EXCLUDE STANDBY option of this command.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset