CHAPTER 9


Recovering Exadata

You may have heard the saying “disk drives spin, and then they die.” It’s not something we like to think about, but from the moment you power up a new system, your disk drives begin aging. Disk drives have come a long way in the past 30 years, and typical life expectancy has improved dramatically. At the end of the day, though, it’s a matter of “when” a disk will fail, not “if.” And we all know that many disk drives fail long before they should. Knowing how to diagnose disk failures and what to do when they occur has generally been the responsibility of the system administrator or storage administrator. For many DBAs, Exadata is going to change that. Many Exadata systems out there are being managed entirely by the DBA staff. Whether or not this is the case in your data center, the procedure for recovering from a disk failure on Exadata is going to be a little different from what you are used to.

Oracle database servers have traditionally required two types of backups: operating system backups and database backups. Exadata compute nodes rely on industry standard hardware RAID and Linux logical volumes to ensure that they are resilient to hardware failures and easy to manage. Exadata adds storage cells to the mix and, with that, comes a whole new subsystem that must be protected and, on occasion, restored. The storage cell is a fairly resilient piece of hardware that employs Linux software RAID to protect the operating system filesystems. As such, it is unlikely that a single disk failure would necessitate an operating system restore. The more likely causes would be human error, a failed patch install, or a bug. Remember that these physical disk devices also contain grid disks (database volumes), so a loss of one of these disks would most likely mean a loss of database storage as well. Oracle has engineered several features into Exadata to protect your data and reduce the impact of such failures. This chapter will discuss some of the more common storage failure scenarios, how to diagnose them, and how to recover with minimal downtime.

Note  One of the most challenging aspects of writing this chapter is the rapidly changing nature of the commands and scripts we will be discussing. In many cases, recovery tasks will have you working very closely with the hardware layer of Exadata. So, as you read this chapter, keep in mind that with each new version of Exadata hardware and software, the commands and scripts discussed in this chapter may change. Be sure to check the Oracle documentation for the latest updates to the commands and scripts discussed here.

Exadata Diagnostic Tools

Exadata is a highly complex blend of hardware and software that work together to produce an incredibly resilient delivery platform. The complexity of the platform can be a bit daunting at first. There are simply a lot of moving parts that one must understand in order to maintain the platform effectively. Oracle provides a wealth of diagnostic tools that can be used to verify, analyze, and report important information about the configuration and health of the system. In this section, we’ll discuss some of those tools and how to use them.

Sun Diagnostics: sundiag.sh

Installed on every Exadata database server and storage cell is the sundiag.sh script, located in the /opt/oracle.SupportTools directory. On newer releases of the Exadata Storage Server software, the script is installed via the exadata-sun-computenode or exadata-sun-cellnode RPM package. If for some reason you don’t find it installed on your system, you can download it from My Oracle Support. Refer to MOS Doc ID 761868.1. This script is run from the root account and collects diagnostic information needed for troubleshooting hardware failures. The files it collects are bundled in the familiar tar format and then compressed using bzip2.
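
Running the script requires no special setup, as the following sketch shows. The optional snapshot argument (which also gathers an ILOM snapshot, as described in the directory list later in this section) takes noticeably longer to run, and the options accepted can vary between Exadata Storage Server releases, so check MOS Doc ID 761868.1 for the syntax that matches your image version:

[root@enkcel05 ~]# /opt/oracle.SupportTools/sundiag.sh            # basic hardware and OS collection
[root@enkcel05 ~]# /opt/oracle.SupportTools/sundiag.sh snapshot   # also gathers an ILOM snapshot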

sundiag.sh Output

The sundiag.sh script creates an archive with the hostname, serial number, and timestamp of the run. For example, running the script on our lab system produced an output file named as follows:

/tmp/sundiag_enkcel05_XXXXXXXX_2014_11_17_12_23.tar.bz2

Now, let’s take a look at the diagnostic files collected by sundiag.sh. Files are compressed in an archive, with folders for the following components:

  • asr: This directory contains files associated with the configuration of Automatic Service Request.
  • cell: This directory contains output of various CellCLI commands as well as log files related to the Exadata Storage Server software stack. Output includes configuration information on the cell disks, grid disks, and Flash Cache, along with log files for the cellsrv and management service processes. This directory is not present when sundiag.sh is run on an Exadata compute node.
  • disk: This directory contains binary files related to the hard disks, generated by the LSI disk controller.
  • ilom: If the ilom or snapshot options are used, this directory will contain ILOM data collection output.
  • messages: This directory contains copies of the dmesg and messages system logs from the syslog utility.
  • net: This directory contains diagnostic information related to the assorted networks on the node. Files include InfiniBand diagnostics, lists of any firewall rules, and network device configuration files, along with the output of the ethtool command.
  • raid: The raid directory contains disk controller configuration information from the parted, fdisk, and mdstat commands, along with RAID controller output from the MegaCli64 command.
  • sysconfig: Files that do not fall into the other categories are left here. Files are named after the commands that generated them. Examples include df-hl.out, lspci-vvv.out, and CheckHWnFWProfile.log.

    Some of the more important files created by the sundiag.sh script are described below. These files are found on both compute nodes and storage servers.

  • messages: This is a copy of the /var/log/messages file from your system. The messages file is rotated and aged out automatically by the operating system. If your system has been running for a while, you will have several of these files enumerated in ascending order from current (messages) to oldest (messages.4). This file is maintained by the syslog daemon and contains important information about the health and operation of the operating system.
  • dmesg: This file is created by the dmesg command and contains diagnostic kernel-level information from the kernel ring buffer. The kernel ring buffer contains messages sent to or received from external devices connected to the system such as disk drives, keyboard, video, and so on.
  • lspci: This file contains a list of all the PCI devices on the system.
  • lsscsi: The lsscsi file contains a list of all the SCSI devices on the system.
  • fdisk-l and parted: The fdisk-l and parted files contain a listing of all disk device partitions in your system.
  • megacli64: The sundiag.sh script runs the MegaCli64 command with various options that interrogate the MegaRAID controller for information on the configuration and status of your disk controller and attached disk drives. There is a wealth of information collected by the MegaRAID controller that can be easily tapped into using the MegaCli64 command. For example, the megacli64-PdList_short.out file shows a summary of the RAID configuration of the disk drives on a compute node:
    Slot 00 Device 11 (HITACHI H106030SDSUN300GA3D01247NLV9ZD  ) status is: Online,
    Slot 01 Device 10 (HITACHI H106030SDSUN300GA3D01247NGXRZF  ) status is: Online,
    Slot 02 Device 09 (HITACHI H106030SDSUN300GA3D01246NLV1JD  ) status is: Online,
    Slot 03 Device 08 (HITACHI H106030SDSUN300GA3D01247NH06DD  ) status is: Online,
  • Information in these files includes an event log and a status summary of your controller and disk drives. For example, the following listing shows a summary of the state of the physical disk drives attached to one of our database servers (from the megacli64-status.out file):
    Checking RAID status on enkx3db01.enkitec.com
    Controller a0:  LSI MegaRAID SAS 9261-8i
    No of Physical disks online : 4
    Degraded : 0
    Failed Disks : 0
  • It is hard to say whether Exadata uses the MegaCli64 command to monitor predictive failure for disk drives or if the developers have tapped into SMART metrics through an API, but this information is available to you at the command line. Documentation for MegaCli64 is sparse, but the sundiag.sh output is a good place to start if you are interested in peeking under the hood at some of the metrics Exadata collects to determine the health of your disk subsystem. A brief example of querying the controller directly follows this list.
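
If you want to query the RAID controller yourself rather than waiting for a full sundiag.sh collection, MegaCli64 can be run directly. The following is a minimal sketch; the path shown is where the utility is normally installed on Exadata servers, but verify it, and the option syntax, against your own sundiag.sh output before relying on it:

[root@enkx3db01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -PdList -aALL | grep -E 'Slot|Firmware state'
[root@enkx3db01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LdInfo -Lall -aALL | grep -iE 'virtual drive|state'

The first command summarizes the state of each physical drive behind the controller, much like the megacli64-PdList_short.out file described above; the second reports the state of the logical drive built on top of them.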

If you run the sundiag.sh script on your storage cells, additional data is collected about the cell configuration, alerts, and special log files that do not exist on the database server. The following list describes these additional files collected by sundiag.sh; most of them are generated by CellCLI commands that you can also run interactively, as shown in the sketch after the list.

  • cell-detail: The cell-detail file contains detailed site-specific information about your storage cell. This is output from the CellCLI command LIST CELL DETAIL.
  • celldisk-detail: This file contains a detailed report of your cell disks. The report is created using the CellCLI command LIST CELLDISK DETAIL. Among other things, it shows the status, logical unit number (LUN), and physical device partition for your cell disks.
  • lun-detail: This report is generated using the CellCLI command LIST LUN DETAIL. It contains detailed information about the underlying LUNs on which your cell disks are configured. Included in this report are the names, device types, and physical device names (such as /dev/sdw) of your LUNs.
  • physicaldisk-detail: The physicaldisk-detail file contains a detailed report of all physical disks and FMODs used by the storage cell for database type storage and Flash Cache. It is generated using the CellCLI command LIST PHYSICALDISK DETAIL, and it includes important information about these devices such as the device type (hard disk or flash disk), make and model, slot address, and device status.
  • physicaldisk-fail: This file contains a listing of all physical disks (including flash disks) that do not have a status of Normal. This would include disks with a status of Not Present, which is a failed disk that has been replaced but not yet removed from the configuration. When a physical disk is replaced, its old configuration remains in the system for seven days, after which it is automatically purged.
  • griddisk-detail: This file contains a detailed report of all grid disks configured on the storage cell. It is created using the CellCLI command LIST GRIDDISK DETAIL and includes, among other things, the grid disk name, cell disk name, size, and status of all grid disks you have configured on the storage cell.
  • griddisk-status: This file contains the name and status of each grid disk configured on the storage cell. It is created using the CellCLI command LIST GRIDDISK ATTRIBUTES NAME, STATUS, ASMMODESTATUS, ASMDEACTIVATIONOUTCOME and includes details on the status of the grid disk from the perspective of both the storage server and ASM.
  • flashcache-detail: This report contains the list of all FMODs that make up the Cell Flash Cache. It is the output of the CellCLI command LIST FLASHCACHE DETAIL and includes the size and status of the Flash Cache. Also found in this report is a list of all flash cell disks that are operating in a degraded mode.
  • flashlog-detail: This report contains the list of all FMODs that make up the cell flash log area. It is the output of the CellCLI command LIST FLASHLOG DETAIL and includes the size and status of the flash log area. Also found in this report is a list of all flash cell disks that are operating in a degraded mode.
  • alerthistory: The alerthistory file contains a detailed report of all alerts that have occurred on the storage cell. It is created using the CellCLI command LIST ALERTHISTORY.
  • alert.log: The alert.log file is written to by the cellsrv process. Similar to a database or ASM alert log file, the storage cell alert.log contains important runtime information about the storage cell and the status of its disk drives. This file is very useful in diagnosing problems with cell storage. On Exadata storage cells running version 12c, there are multiple alert logs, one for each of the offload servers.
  • ms-odl.trc: The ms-odl.trc contains detailed runtime, trace-level information from the cell’s management server process.
  • ms-odl.log: This file is written to by the cell’s management server process. It is not included in the collection created by the sundiag.sh script, but we have found it very useful in diagnosing problems that occur in the storage cell. It also contains normal, day-to-day operational messages. Storage cells maintain their log files by rotating them, similar to the way the operating system rotates the system log (/var/log/messages). The ms-odl.log file records these tasks as well as more critical tasks such as disk failures.
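
When you need only one of these reports, there is no need to wait for a full sundiag.sh collection; the underlying CellCLI commands can be run interactively on the storage cell. The following minimal sketch, run as root or celladmin on the cell, reproduces the griddisk-status, physicaldisk-detail, and alerthistory reports:

[root@enkx4cel01 ~]# cellcli -e list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome
[root@enkx4cel01 ~]# cellcli -e list physicaldisk detail
[root@enkx4cel01 ~]# cellcli -e list alerthistory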

Cell Alerts

As part of its monitoring features, Exadata tracks over 70 metric and alert types in the storage cell. Additional alerts may be defined using Grid Control’s monitoring and alerting features. Alert severities fall into four categories: Information, Warning, Critical, and Clear. These categories are used to manage alert notifications. For example, you may choose to get an e-mail alert notification for critical alerts only. The Clear severity is used to notify you when a component has returned to Normal status. The LIST ALERTHISTORY DETAIL command can be used to generate a detailed report of the alerts generated by the system. The following listing is an example of an alert generated by the storage cell:

name:                   209_1
alertMessage:           "All Logical drives are in WriteThrough caching mode.
                        Either battery is in a learn cycle or it needs to be
                        replaced. Please contact Oracle Support"
alertSequenceID:        209
alertShortName:         Hardware
alertType:              Stateful
beginTime:              2011-01-17T04:42:10-06:00
endTime:                2011-01-17T05:50:29-06:00
examinedBy:
metricObjectName:       LUN_CACHE_WT_ALL
notificationState:      1
sequenceBeginTime:      2011-01-17T04:42:10-06:00
severity:               critical
alertAction:            "Battery is either in a learn cycle or it needs
                        replacement. Please contact Oracle Support"

When the battery subsequently returns to Normal status, a follow-up alert is generated with a severity of Clear, indicating that the component has returned to normal operating status:

name:                   209_2
alertMessage:           "Battery is back to a good state"
...
severity:               clear
alertAction:            "Battery is back to a good state. No Action Required"

When you review alerts, you should get in the habit of setting the examinedBy attribute of the alert so you can keep track of which alerts are already being investigated. If you set the examinedBy attribute, you can use it as a filter on the LIST ALERTHISTORY command to report all alerts that are not currently being attended to. By adding the severity filter, you can further reduce the output to just critical alerts. For example:

LIST ALERTHISTORY WHERE severity = 'critical' AND examinedBy = '' DETAIL

To set the examinedBy attribute of the alert, use the ALTER ALERTHISTORY command and specify the name of the alert you wish to alter. For example, we can set the examinedBy attribute for the Battery alert as follows:

CellCLI> alter alerthistory 209_1 examinedBy="acolvin"
Alert 209_1 successfully altered

CellCLI> list alerthistory attributes name, alertMessage, examinedby where name=209_1 detail
         name:                   209_1
         alertMessage:           "All Logical drives are in WriteThrough caching mode.
                                 Either battery is in a learn cycle or it needs to be
                                 replaced. Please contact Oracle Support"
         examinedBy:             acolvin

There is quite a bit more to say about managing, reporting, and customizing Exadata alerts. An entire chapter would be needed to cover the subject in detail. In this section, we’ve only touched on the basics. Fortunately, once you get e-mail configured for alert notification, very little must be done to manage these alerts. In many environments, e-mail notification is all that is used to catch and report critical alerts.
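
E-mail notification is configured at the cell level with the ALTER CELL command. The following is a minimal sketch; the SMTP server and addresses are placeholders for your own environment, and the exact notification attributes supported can vary between Exadata Storage Server releases, so compare them against the output of LIST CELL DETAIL on your cells:

CellCLI> ALTER CELL smtpServer='smtp.example.com', -
                    smtpFromAddr='exadata-alerts@example.com', -
                    smtpToAddr='dba-team@example.com', -
                    notificationMethod='mail', -
                    notificationPolicy='critical,warning,clear'

CellCLI> ALTER CELL VALIDATE MAIL

The VALIDATE MAIL clause sends a test message so you can confirm delivery before a real alert ever fires.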

Backing Up Exadata

When we took delivery of our first Exadata system, one of our primary questions was, “How can we back up everything so we can restore it to working order if something goes horribly wrong?” When our Exadata arrived in May 2010, the latest version of the Cell software was 11.2.1.2.1. At the time, the only way to back up a database server was to use third-party backup software or standard Linux commands like tar. Oracle is constantly developing new features for Exadata and, less than a year later, Exadata X2 database servers were released with the native Linux Logical Volume Manager (LVM). This was a big step forward because the LVM has built-in snapshot capabilities that provide an easy method of taking backups of the operating system. Storage cells use a built-in method for backup and recovery.

In this section, we’ll take a look at the various methods Oracle recommends for backing up Exadata database servers and storage cells. We’ll also take a brief look at Recovery Manager (RMAN) and some of the features Exadata provides that improve the performance of database backup and recovery. After that, we’ll take a look at what it takes to recover from some of the more common types of system failure. It may surprise you, but the focus of this chapter is not database recovery. There are very few Exadata-specific considerations for database backup and recovery. A majority of the platform-specific backup and recovery methods pertain to the system volumes containing the operating system and Exadata software. Hence, we’ll spend quite a bit of time discussing recovery, from the loss of a cell disk to the loss of a system volume on the database servers or storage cells.

Backing Up the Database Servers

Exadata compute nodes have a default configuration that utilizes the Linux Logical Volume Manager (LVM). Logical volume managers provide an abstraction layer for physical disk partitions, similar to the way ASM abstracts its underlying physical storage devices. LVM volume groups are comparable to ASM disk groups: both are made up of one or more physical disks (or disk partitions). Volume groups are carved up into logical volumes, in which file systems can be created, much as databases use ASM disk groups to create tablespaces for storing tables, indexes, and other database objects. Abstracting physical storage from the file systems allows the system administrator to grow and shrink the logical volumes (and file systems) as needed.

There are a number of other advantages to using the LVM to manage storage for the Exadata database servers, but our focus here is the backup and restore capability the Linux LVM provides, namely LVM snapshots. In addition to their convenience and ease of use, LVM snapshots eliminate many of the typical challenges of simple backups taken with the tar command or third-party backup products. Depending on the amount of data in the backup set, file system backups can take quite a while to complete. Such backups are not consistent to a point in time, meaning that if you must restore a file system from backup, the data in your files will represent various points in time from the beginning of the backup process to its end. Applications that continue to run during the backup cycle can hold locks on files, causing them to be skipped (not backed up). And open applications will inevitably make changes to data during the backup cycle; even if you are able to back up these open files, you have no way of knowing whether they are in a usable state unless the application is shut down before the backup is taken.

LVM snapshots are instantaneous because no data is actually copied when the snapshot is created. You can think of a snapshot as an index of pointers to the physical data blocks that make up the contents of your file system. When a file is changed or deleted, the original blocks of the file are written to the snapshot volume. So even if a backup takes hours to complete, it remains consistent with the moment the snapshot was created. Now, let’s take a look at how LVM snapshots can be used to create a consistent file system backup of the database server.

System Backup Using LVM Snapshots

Creating file system backups using LVM snapshots is a pretty simple process. First, you need to create a destination for the final copy of the backups. This can be SAN or NAS storage or simply an NFS file system shared from another server. If you have enough free space in the volume group to store your backup files, you can create a temporary logical volume to stage your backups before sending them off to tape. This can be done using the lvcreate command. Before creating a new logical volume, make sure you have enough free space in your volume group using the vgdisplay command:

[root@enkx4db01 ~]# vgdisplay
--- Volume group ---
  VG Name               VGExaDb
...
  VG Size               1.63 TB
  PE Size               4.00 MB
  Total PE              428308
  Alloc PE / Size       47104 / 184.00 GB
  Free  PE / Size       381204 / 1.45 TB
...

The vgdisplay command shows the size of our volume group, physical extents (PE) currently in use, and the amount of free space available in the volume group. The Free PE/Size attribute indicates that we have 1.45TB of free space remaining in the volume group.

First, we’ll mount an NFS share from another system as the destination for our backups. We will call this /mnt/nfs:

[root@enkx4db01 ~]# mount -t nfs -o rw,intr,soft,proto=tcp,nolock <ip>:/share /mnt/nfs

Next, we’ll create and label LVM snapshots for / and /u01 using the lvcreate and e2label commands. Notice the -L1G and -L5G options used to create these snapshots. The -L parameter determines the size of the snapshot volume. When data blocks are modified or deleted after the snapshot is created, the original copy of each block is written to the snapshot, so it is important to size the snapshot with enough space to hold a copy of all blocks that change during the backup. Because the snapshot only needs to exist for the duration of the backup, 5GB-10GB is typically enough space. If the snapshot runs out of space, it will be deactivated.

[root@enkx4db01 ~]# lvcreate -L1G -s -n root_snap /dev/VGExaDb/LVDbSys1
  Logical volume "root_snap" created
[root@enkx4db01 ~]# e2label /dev/VGExaDb/root_snap DBSYS_SNAP
[root@enkx4db01 ~]# lvcreate -L5G -s -n u01_snap /dev/VGExaDb/LVDbOra1
  Logical volume "u01_snap" created
[root@enkx4db01 ~]# e2label /dev/VGExaDb/u01_snap DBORA_SNAP

Next, mount the snapshot volumes. We use the file system labels (DBSYS_SNAP and DBORA_SNAP) to ensure that the correct volumes are mounted. After they are mounted, they can be copied to the NFS mountpoint. The df command displays our new file system and the logical volumes we want to include in our system backup, VGExaDb-LVDbSys1 (logical volume of the root file system) and VGExaDb-LVDbOra1 (logical volume of the /u01 file system). Notice that the /boot file system does not use the LVM for storage. This file system must be backed up using the tar command. This isn’t a problem because the /boot file system is fairly small and static so we aren’t concerned with these files being modified, locked, or open during the backup cycle.

[root@enkx4db01 ~]# mkdir -p /mnt/snaps/u01
[root@enkx4db01 ~]# mount -L DBSYS_SNAP /mnt/snaps
[root@enkx4db01 ~]# mount -L DBORA_SNAP /mnt/snaps/u01
[root@enkx4db01 mnt]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   27G  1.6G  95% /
/dev/sda1             496M   40M  431M   9% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       99G   53G   41G  57% /u01
tmpfs                 252G     0  252G   0% /dev/shm
192.168.10.9:/nfs/backup
                      5.4T  182G  5.2T   4% /mnt/nfs
/dev/mapper/VGExaDb-root_snap
                       30G   27G  1.4G  96% /mnt/snaps
/dev/mapper/VGExaDb-u01_snap
                       99G   53G   41G  57% /mnt/snaps/u01

Now that we have snapshots ensuring consistent / and /u01 file systems, we are ready to take a backup. To prove that these snapshots are consistent, we’ll copy the /etc/hosts file to a test file in the /root directory. If snapshots work as they are supposed to, this file will not be included in our backup because it was created after the snapshot was created. The command looks like this:

[root@enkx4db01 ~]# cp /etc/hosts /root/test_file.txt

Because the snapshots are mounted, we can browse them just like any other file system. Snapshot file systems look and feel just like the original file systems, with one exception. If we look in the mounted snapshot for the test file we created (or any other change after the snapshot was taken), we don’t see it. It’s not there because the file was created after the snapshots were created:

[root@enkx4db01 ~]# ls -l /root/test_file.txt
-rw-r--r-- 1 root root 1724 Nov 24 14:23 /root/test_file.txt   <- the test file we created

[root@enkx4db01 ~]# ls -l /mnt/snaps/root/test_file.txt
ls: /mnt/snaps/root/test_file.txt: No such file or directory    <- no test file in the snapshot

Once the snapshots are mounted, they can be backed up using any standard Linux backup software. For this test, we’ll use the tar command to create a tarball backup of the / and /u01 file systems to the NFS share. Since we are backing up a snapshot, we don’t have to worry about files that are open, locked, or changed during the backup. Notice that we’ve also included the /boot directory in this backup.

[root@enkx4db01 ~]# cd /mnt/snaps
[root@enkx4db01 snap]# tar -pjcvf /mnt/nfs/backup.tar.bz2 * /boot \
      --exclude nfs/backup.tar.bz2 --exclude /mnt/nfs \
      > /tmp/backup_tar.stdout 2> /tmp/backup_tar.stderr

When the backup is finished, you should check the error file /tmp/backup_tar.stderr for any issues logged during the backup. If you are satisfied with the backup, you can unmount and drop the snapshots. You will create a new set of snapshots each time you run a backup. After the backup is copied, you can optionally unmount and drop the temporary logical volume you created:

[root@enkx4db01 snap]# cd /

[root@enkx4db01 /]# umount /mnt/snaps/u01
[root@enkx4db01 /]# rm -Rf /mnt/snaps/u01

[root@enkx4db01 /]# umount /mnt/snaps
[root@enkx4db01 /]# rm -Rf /mnt/snaps

[root@enkx4db01 /]# lvremove /dev/VGExaDb/root_snap
Do you really want to remove active logical volume root_snap? [y/n]: y
  Logical volume "root_snap" successfully removed

[root@enkx4db01 /]# lvremove /dev/VGExaDb/u01_snap
Do you really want to remove active logical volume u01_snap? [y/n]: y
  Logical volume "u01_snap" successfully removed

Early models of Exadata V2 did not implement LVM for managing file system storage. Without LVM snapshots, getting a clean system backup would require shutting down the applications on the server (including the databases), or purchasing third-party backup software. Even then, there would be no way to create a backup in which all files are consistent with the same point in time. LVM snapshots fill an important gap in the Exadata backup and recovery architecture and offer a simple, manageable strategy for backing up the database servers. Later in this chapter, we’ll discuss how these backups are used for restoring the database server when file systems are lost or damaged.

Backing Up the Storage Cell

The first two disks in a storage cell contain the Linux operating system. These Linux partitions are commonly referred to as the system volumes. Backing up the system volumes using industry standard Linux backup software is not recommended. So, how do you back up the system volumes? Well, the answer is that you don’t. Exadata automatically does this for you through the use of an internal USB drive called the CELLBOOT USB flash drive. If you are the cautious sort, you can also create your own cell recovery image using an external USB flash drive. In addition to the CELLBOOT USB flash drive, Exadata also maintains, on a separate set of disk partitions, a full copy of the system volumes as they were before the last patch was installed. These backup partitions are used for rolling back a patch. Now, let’s take a look at how these backup methods work.

CELLBOOT USB Flash Drive

You can think of the internal CELLBOOT USB flash drive as you would any external USB drive you would plug into your laptop. The device can be seen using the parted command as follows:

[root@enkx4cel01 ~]# parted /dev/sdac print
Model: ORACLE UNIGEN-UFD (scsi)
Disk /dev/sdac: 4010MB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      11.3kB  4008MB  4008MB  primary  ext3

Just for fun, we mounted the internal USB flash drive to take a peek at what Oracle included in this backup. The following listing shows the contents of this device:

[root@enkx4cel01 ~]# mount /dev/sdac1 /mnt/usb

[root@enkx4cel01 ~]# ls -al /mnt/usb
total 95816
drwxr-xr-x 7 root root     4096 Oct  3 21:26 .
drwxr-xr-x 9 root root     4096 Nov 23 15:49 ..
-r-xr-x--- 1 root root     2048 Aug 17  2011 boot.cat
-r-xr-x--- 1 root root       16 Oct  9  2013 boot.msg
drwxr----- 2 root root     4096 Oct  3 20:14 cellbits
drwxrwxr-x 2 root root     4096 Oct  3 20:15 grub
-rw-r----- 1 root root       16 Oct  3 20:14 I_am_CELLBOOT_usb
-rw-r----- 1 root root      805 Oct  3 19:53 image.id
-rw-r----- 1 root root      441 Oct  3 19:55 imgboot.lst
-rw-rw-r-- 1 root root  8280755 Jul 14 04:12 initrd-2.6.32-300.19.1.el5uek.img
-rw-r----- 1 root root  7381429 Oct  3 20:14 initrd-2.6.39-400.128.17.el5uek.img
-rw-r----- 1 root root 70198394 Oct  3 19:55 initrd.img
-r-xr-x--- 1 root root    10648 Aug 17  2011 isolinux.bin
-r-xr-x--- 1 root root      155 Apr 14  2014 isolinux.cfg
-rw-r----- 1 root root       25 Oct  3 20:14 kernel.ver
drwxr----- 4 root root     4096 Nov  7 16:28 lastGoodConfig
drwxr-xr-x 3 root root     4096 Oct  3 21:38 log
drwx------ 2 root root    16384 Oct  3 20:11 lost+found
-r-xr-x--- 1 root root    94600 Aug 17  2011 memtest
-r-xr-x--- 1 root root     7326 Aug 17  2011 splash.lss
-r-xr-x--- 1 root root     1770 Oct  9  2013 trans.tbl
-rwxr-x--- 1 root root  4121488 Jul 14 04:12 vmlinuz
-rwxr-xr-x 1 root root  3688864 Jul 14 04:12 vmlinuz-2.6.32-300.19.1.el5uek
-rwxr----- 1 root root  4121488 Oct  3 20:08 vmlinuz-2.6.39-400.128.17.el5uek

In this backup, we see the Linux boot images and all the files required to boot Linux and restore the operating system. Notice that you also see a directory called lastGoodConfig. This directory is a backup of the /opt/oracle.cellos/iso/lastGoodConfig directory on our storage cell. There is also a directory called cellbits containing the Cell Server software. Not only do we have a complete copy of everything needed to recover our storage cell to a bootable state on the internal USB drive, but we also have an online backup of all of our important cell configuration files and Cell Server binaries.

External USB Drive

In addition to the built-in CELLBOOT USB flash drive, Exadata also provides a way to create your own external bootable recovery image using a common 1–8GB USB flash drive you can buy at a local electronics store. Exadata will create the rescue image on the first external USB drive it finds, so before you create this recovery image, you must remove all other external USB drives from the system or the script will throw a warning and exit.

Recall that Exadata storage cells maintain two versions of the operating system and cell software: active and inactive. These are managed as two separate sets of disk partitions for the / and /opt/oracle file systems as can be confirmed using the imageinfo command, as follows:

[root@enkx4cel01 ~]# imageinfo | grep device
Active system partition on device: /dev/md6
Active software partition on device: /dev/md8
Inactive system partition on device: /dev/md5
Inactive software partition on device: /dev/md7

The imageinfo command shows the current (Active) and previous (Inactive) system volumes on the storage cell. Using the df command, we can see that we are indeed currently using the Active partitions (/dev/md6 and /dev/md8) identified in the output from the imageinfo command:

[root@enkx4cel01 ~]# df | egrep 'Filesystem|md6|md8'

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md6              10317752   6632836   3160804  68% /
/dev/md8               2063440    654956   1303668  34% /opt/oracle

By default, the make_cellboot_usb command will create a rescue image of your active configuration (the one you are currently running). The -inactive option allows you to create a rescue image from the previous configuration. The inactive partitions are the system volumes that were active when the last patch was installed.

To create an external rescue image, all you have to do is plug a USB flash drive into one of the USB ports on the front panel of the storage cell and run the make_cellboot_usb command.

Caution  The rescue image will be created on the first external USB drive found on the system. Before creating an external rescue image, remove all other external USB drives from the system.

For example, the following listing shows the process of creating an external USB rescue image. The output from the make_cellboot_usb script is fairly lengthy, a little over 100 lines, so we won’t show all of it here. Some of the output excluded from the following listing includes output from the fdisk command that is used to create partitions on the USB drive, formatting of the file systems, and the many files that are copied to create the bootable rescue disk.

[root@enkx4cel01 oracle.SupportTools]# ./make_cellboot_usb
[WARNING] More than one USB devices suitable for use as Oracle Exadata Cell start up boot device.
Candidate for the Oracle Exadata Cell start up boot device     : /dev/sdad
Partition on candidate device                                  : /dev/sdad1
The current product version                                    : 12.1.1.1.1.140712
Label of the current Oracle Exadata Cell start up boot device  :
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: cell usb        : /dev/sdad
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: mnt sys         : /
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: preserve        : preserve
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: mnt usb         : /mnt/usb.make.cellboot
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: lock            : /tmp/usb.make.cellboot.lock
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: serial console  :
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: kernel mode     : kernel
2014-11-25 10:12:27 -0600  [DEBUG] set_cell_boot_usb: mnt iso save    :
2014-11-25 10:12:27 -0600  Create CELLBOOT USB on device /dev/sdad
...
2014-11-25 10:15:11 -0600  Copying ./isolinux.cfg to /mnt/usb.make.cellboot/. ...
2014-11-25 10:15:44 -0600  Copying ./trans.tbl to /mnt/usb.make.cellboot/. ...
2014-11-25 10:15:48 -0600  Copying ./isolinux.bin to /mnt/usb.make.cellboot/. ...
2014-11-25 10:15:48 -0600  Copying ./boot.cat to /mnt/usb.make.cellboot/. ...
2014-11-25 10:15:48 -0600  Copying ./initrd.img to /mnt/usb.make.cellboot/. ...
2014-11-25 10:16:26 -0600  Copying ./memtest to /mnt/usb.make.cellboot/. ...
2014-11-25 10:16:29 -0600  Copying ./boot.msg to /mnt/usb.make.cellboot/. ...
2014-11-25 10:16:30 -0600  Copying ./vmlinuz-2.6.39-400.128.17.el5uek to /mnt/usb.make.cellboot/. ...
2014-11-25 10:16:31 -0600  Copying ./cellbits/ofed.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:16:38 -0600  Copying ./cellbits/commonos.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:17:51 -0600  Copying ./cellbits/sunutils.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:18:11 -0600  Copying ./cellbits/cellfw.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:19:30 -0600  Copying ./cellbits/doclib.zip to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:20:34 -0600  Copying ./cellbits/debugos.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:26:07 -0600  Copying ./cellbits/exaos.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:27:19 -0600  Copying ./cellbits/cellboot.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:27:26 -0600  Copying ./cellbits/cell.bin to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:30:52 -0600  Copying ./cellbits/kernel.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:31:37 -0600  Copying ./cellbits/cellrpms.tbz to /mnt/usb.make.cellboot/./cellbits ...
2014-11-25 10:33:59 -0600  Copying ./initrd-2.6.39-400.128.17.el5uek.img to /mnt/usb.make.cellboot/. ...
2014-11-25 10:34:20 -0600  Copying ./splash.lss to /mnt/usb.make.cellboot/. ...
2014-11-25 10:34:26 -0600  Copying ./image.id to /mnt/usb.make.cellboot/. ...
2014-11-25 10:34:32 -0600  Copying ./imgboot.lst to /mnt/usb.make.cellboot/. ...
2014-11-25 10:34:37 -0600  Copying ./vmlinuz to /mnt/usb.make.cellboot/. ...
2014-11-25 10:34:44 -0600  Copying lastGoodConfig/* to /mnt/usb.make.cellboot/lastGoodConfig ...
/opt/oracle.cellos
...
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: mnt sys        : /
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: grub template  : USB_grub.in
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: boot dir       : /mnt/usb.make.cellboot
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: kernel param   : 2.6.39-400.128.17.el5uek
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: marker         : I_am_CELLBOOT_usb
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: mode           :
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: sys dev        :
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: Image id file: //opt/oracle.cellos/image.id
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: System device where image id exists: /dev/md5
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: Kernel version: 2.6.39-400.128.17.el5uek
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: System device with image_id (/dev/md5) and kernel version (2.6.39-400.128.17.el5uek) are in sync
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: Full kernel version: 2.6.39-400.128.17.el5uek
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: System device for the next boot: /dev/md5
2014-11-25 10:37:01 -0600  [DEBUG] set_grub_conf_n_initrd: initrd for the next boot: /mnt/usb.make.cellboot/initrd-2.6.39-400.128.17.el5uek.img
2014-11-25 10:37:01 -0600  [INFO] set_grub_conf_n_initrd: Set /dev/md5 in /mnt/usb.make.cellboot/I_am_CELLBOOT_usb
2014-11-25 10:37:01 -0600  [INFO] Set kernel 2.6.39-400.128.17.el5uek and system device /dev/md5 in generated /mnt/usb.make.cellboot/grub/grub.conf from //opt/oracle.cellos/tmpl/USB_grub.in
2014-11-25 10:37:01 -0600  [INFO] Set /dev/md5 in /mnt/usb.make.cellboot/initrd-2.6.39-400.128.17.el5uek.img
33007 blocks
2014-11-25 10:37:12 -0600  [WARNING] restore_preserved_cell_boot_usb: Unable to restore logs and configs. Archive undefined


    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.]
grub> root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
grub> setup (hd0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  16 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd0)1+16 p (hd0,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.

Here you can see that the make_cellboot_usb script copies over all of the storage cell software (cellbits) and configuration files (lastGoodConfig) it needs to recover the storage cell. Finally, you see that the Grub boot loader is installed on the USB drive so you can boot the system from it. When the script completes, you can remove the external USB disk from the system. This rescue disk can later be used for restoring your storage cell to working condition should the need arise.

Backing Up the Database

Exadata represents a leap forward in capacity and performance. Just a few years ago, large databases were described in terms of gigabytes. Today, it’s not uncommon to find databases measured in terabytes. It wasn’t long ago when a table was considered huge if it contained tens of millions of rows. Today, we commonly see tables that contain tens of billions of rows. This trend makes it clear that we will soon see databases measured in exabytes. As you might imagine, this creates some unique challenges for backup and recovery. The tools for backing up Exadata databases have not fundamentally changed, and the need to complete backups in a reasonable period of time is becoming increasingly difficult to achieve. Some of the strategies we’ll discuss here will not be new; however, we will be looking at ways to leverage the speed of the platform so backup performance can keep pace with the increasing volume of your databases.

Disk-Based Backups

Oracle 10g introduced us to a new feature called the Flash Recovery Area, which extended Recovery Manager’s structured approach to managing backups. Recently, this feature has been renamed to the Fast Recovery Area (FRA). The FRA is a storage area much like any other database storage. It can be created on raw devices, block devices, file systems, and, of course, ASM. Since the FRA utilizes disk-based storage, it provides a very fast storage medium for database recovery. This is especially true when using Exadata’s high-performance storage architecture. Eliminating the need to retrieve backups from tape can shave hours and sometimes days off the time it takes to recover your databases. And, since the FRA is an extension of the database, Oracle automatically manages that space for you. When files in the FRA are backed up to tape, they are not immediately deleted. They are, instead, kept online as long as there is enough free space to do so. When more space is needed, the database deletes (in a FIFO manner) enough of these files to provide the needed space.
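
Pointing a database at the FRA takes nothing more than two initialization parameters, set in the order shown because the size must be defined before the destination. The following is a minimal sketch; the disk group name and size are placeholders, although most Exadata deployments create a RECO disk group for exactly this purpose:

SQL> alter system set db_recovery_file_dest_size=2000G scope=both sid='*';
SQL> alter system set db_recovery_file_dest='+RECO' scope=both sid='*';

Once these are set, RMAN writes backups to the FRA by default, and the space is aged out automatically as described above.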

Tape-Based Backups

Using the FRA for disk-based backups can greatly improve the time it takes to recover your databases, but it does not eliminate the need for tape backups. As a matter of fact, tape-based backups are required for backing up the FRA. Moving large quantities of backup data to tape can be a challenge, and, with the volume of data that can be stored on Exadata, the need for high-performance tape backups is critical. Exadata V2 comes equipped with Gigabit Ethernet (GigE) ports that are each capable of delivering throughput up to 1000 megabits per second. Exadata X2-2 and later come with 10 Gigabit Ethernet ports, capable of delivering up to 10 times the throughput of the GigE ports of the V2. The problem is that even the 10 GigE ports between Exadata and the tape library’s media server may not be fast enough to keep up.

A common solution to this problem is to install a 40 Gbps QDR InfiniBand card (or two) into the media server, allowing it to be linked directly into the spare ports on the Exadata InfiniBand network switch. Figure 9-1 illustrates a common backup configuration that leverages the high-speed InfiniBand network inside the Exadata rack to provide high-speed backups to tape.


Figure 9-1. Exadata backup architecture

For very large databases, one InfiniBand card may not provide the throughput needed to complete backups in a reasonable time. For Oracle RAC databases, backups can be parallelized and distributed across any or all nodes in the RAC cluster. For Exadata full-rack configurations, this means you can have up to eight nodes (in a single rack) participating in the backup workload. Installing additional InfiniBand cards into the media server allows you to increase the throughput in 40 Gbps increments (roughly 3.2GB/sec effective) up to the limits of the media server. An additional media server can be added to the configuration and load-balanced to extend performance even further. Oracle’s MAA group published a very good white paper entitled “Backup and Recovery Performance and Best Practices for Exadata Cell and the Sun Oracle Database Machine,” in which they reported backup rates of up to 2,509 MB/sec or 8.6 TB/hr for tape backups.
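
Distributing the backup workload across RAC nodes is standard RMAN practice rather than anything Exadata-specific. The sketch below allocates one tape channel on each of two instances; the credentials, service names, and media management settings are placeholders that depend entirely on your backup software and TNS configuration:

run {
  # one SBT channel per instance; connect strings are hypothetical
  allocate channel t1 device type sbt connect 'backup_admin/password@exdb1';
  allocate channel t2 device type sbt connect 'backup_admin/password@exdb2';
  backup as backupset database plus archivelog;
}

Adding more channels per node, or more nodes, scales the backup throughput until the media server or tape drives become the bottleneck.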

Backup from Standby Database

If you are planning to set up a disaster recovery site using Data Guard, you have the option of offloading your database backups to the standby database. This is not in any way an Exadata feature, so we will only touch briefly on the subject. The main purpose of the standby database is to take over the production load in the event that the primary database experiences a total failure. However, using a physical standby database also provides an additional backup for your primary database. If a datafile from the primary database is lost, a replacement datafile from the standby database can be used to replace it. Once the file has been restored to the primary database, archived redo logs are used to recover the datafile up to the current SCN of the database. The standby database is typically mounted (but not open) during normal operations. Cold backups can be made from the standby to the Fast Recovery Area (FRA) and then to tape. Backups from the standby database can be restored directly to the primary database. This provides three levels of recovery to choose from before deciding whether a failover to the standby is necessary.
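
From RMAN’s perspective, backing up the standby looks just like backing up any other database; you simply connect to the standby instance as the target. The sketch below assumes a recovery catalog is in use, which is what allows backups taken on the standby to be cataloged and later restored to the primary; the connect strings are placeholders:

$ rman target sys@standby_db catalog rmancat@rcatdb
RMAN> backup as backupset database plus archivelog;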

It is best to use an Exadata platform for your standby database. This is because although tables that use Hybrid Columnar Compression (HCC) will replicate to non-Exadata databases just fine, you will not be able to read from them. Typically, the database kernel on non-Exadata databases cannot read HCC compressed data (there are a few exceptions, such as when data resides on a ZFS Storage Appliance). For example, the following error is returned when you select from an HCC table on a standard 11.2.x database:

SQL> select distinct segment_name from bigtab_arch_high;
 select distinct segment_name from bigtab_arch_high
                        *
ERROR at line 1:
ORA-64307: hybrid columnar compression is only supported in tablespaces residing on Exadata storage

Your compressed data is still intact. You just cannot read it unless you first uncompress it. HCC compressed tables can be uncompressed on non-Exadata databases using the ALTER TABLE MOVE command as follows:

SQL> alter table BIGTAB_ARCHIVE_HIGH move nocompress;

Partitioned tables can be uncompressed in a similar manner, and the operation can be parallelized using the parallel option, as you can see in the following command:

SQL> alter table BIGTAB_ARCHIVE_HIGH move partition JAN_2011 nocompress parallel;

Once the table is uncompressed, it can be read from a non-Exadata database. Keep in mind that with the high degree of compression HCC provides, you must take into consideration the additional disk storage that will be required by the uncompressed table or partition, which can be quite substantial.

Exadata Optimizations for RMAN

When RMAN performs an incremental backup on the Exadata platform, cellsrv filters out unwanted blocks and sends back only those that have changed since the last level 0 or level 1 backup. This improves the performance of incremental backups and reduces the workload on the database server. But even when only a relatively small number of blocks have changed, discovering them is a very I/O-intensive process because every block in the database must be examined to determine which ones have changed since the last incremental backup. This is true for both Exadata and non-Exadata databases. The only difference is where the work is done—on the database server or on the storage cells. A few years ago, Oracle 10g introduced block change tracking (BCT) to address this problem. Of course, this was long before Exadata came onto the scene. This feature maintains a bitmap structure in a file called the block change tracking file. Each bit in the BCT file (1 bit per 32K of data) represents a group of blocks in the database. When a data block is modified, Oracle flips a bit in the BCT file representing the group of blocks in which the changed block resides. When an incremental backup is taken, RMAN retrieves the whole group of blocks (represented by a flipped bit in the BCT file) and examines them to determine which one changed. Block change tracking introduces minimal overhead on the database server and is a very efficient way to track changed blocks. And, since it greatly reduces the number of blocks that must be examined during a backup, it improves backup performance while reducing the workload on the database server and storage grid.

For the Exadata platform, you may choose to allow cellsrv to do all of the block filtering for incremental backups, or you may use it in tandem with block change tracking. Block change tracking seems to provide the most benefit when fewer than 20 percent of the blocks in the database have changed since the last level 0 or level 1 backup. If your database is close to that threshold, you should do some testing to determine whether or not BCT improves incremental backup performance. The BLOCKS_SKIPPED_IN_CELL column of the V$BACKUP_DATAFILE view shows the number of blocks that were read and filtered out at the storage cell. This offloading is transparent and requires no user intervention or special parameters to be set ahead of time.
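
Enabling block change tracking and checking the effect of cell-side filtering are both one-liners. The following is a minimal sketch; the +DATA disk group is a placeholder for wherever you want the change tracking file to live:

SQL> alter database enable block change tracking using file '+DATA';

SQL> select file#, datafile_blocks, blocks_read, blocks_skipped_in_cell
  2  from   v$backup_datafile
  3  order  by file#;

After an incremental backup, comparing BLOCKS_READ to DATAFILE_BLOCKS shows how effective block change tracking was, while BLOCKS_SKIPPED_IN_CELL shows how many of the blocks that were read were filtered out at the storage cell instead of being sent to the database server.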

Wait Events

There are two Exadata-specific wait events that are triggered by database backup and recovery operations on the Exadata platform: cell smart incremental backup and cell smart restore from backup. These wait events are covered in more detail in Chapter 10, and a simple query for spotting them follows the list below.

  • cell smart incremental backup: This wait event occurs when Exadata offloads incremental backup processing to the storage cells. The P1 column of the V$SESSION_WAIT view contains the cell hash number. This hash value can be used to compare the relative backup performance of each storage cell and determine if there is a performance problem on any of the cells.
  • cell smart restore from backup: This wait event occurs during restore operations when Exadata offloads the task of initializing a file to the storage cells. The P1 column of V$SESSION_WAIT contains the cell hash number. This hash value can be used to compare the relative restore performance of each storage cell and determine if there is a performance problem on any of the cells.
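
A quick way to see whether these offloaded operations are occurring, and how much time they account for, is to query the cumulative wait statistics. A minimal sketch:

SQL> select event, total_waits, time_waited_micro/1000000 seconds
  2  from   v$system_event
  3  where  event in ('cell smart incremental backup', 'cell smart restore from backup');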

Recovering Exadata

A better title for this section might be “When Things Go Wrong.” After all, that’s usually about the time we realize how little practical experience we have recovering our systems. As corporate America continues to squeeze every drop of productive time out of our workweek, DBAs and system administrators spend most if not all of their waking hours (and sometimes sleeping hours) just “keeping the wheels on.” So, actually, practicing system recovery is more often than not treated like the proverbial “redheaded stepchild”—seldom thought about and rarely attended to. And even if we find ourselves in the enviable position of having the time to practice system recovery, it’s rare to have the spare equipment to practice on. So kudos to you if you are reading this and nothing is actually broken. In this section, we’ll be discussing Exadata system recovery using the backup methods we covered in the “Backing Up Exadata” section of this chapter.

Restoring the Database Server

Backing up and restoring the database servers can be done using third-party backup software or homegrown scripts using familiar commands such as tar and zip. The Linux Logical Volume Manager (LVM) provides the capability of backing up database servers via snapshots for creating point-in-time, tar-based backup sets. The procedure for recovering Exadata database servers is a very structured process that is specific to Exadata. In this section, we’ll be stepping through this procedure, presuming the backup was taken using the backup procedure discussed earlier in this chapter. So, if you haven’t read through that section of this chapter, you might want to take a look at it before continuing.

Caution  Before performing any of the recovery steps listed in this section, it is a good idea to open a service request with Oracle support. Many of the tools described here will require passwords or assistance that can typically only be provided by Oracle’s support organization. These steps should be performed as a last resort only.

Recovery Using LVM Snapshot-Based Backup Images

Restoring the database server using the LVM snapshot backup procedure we discussed earlier in this chapter is a fairly straightforward process. The backup image we will use in this procedure, backup.tar.bz2, is the one we created earlier in this chapter and includes the /, /boot, and /u01 file systems. The first thing you need to do is stage the backup image on an NFS file system that can be mounted by the failed database server. The server is then booted from a special diagnostics ISO boot image included on all Exadata servers. When the system boots from the diagnostics ISO, you will be prompted step-by-step through the recovery process. Let’s take a look at the basic steps for recovering a failed database server from the LVM snapshot-based backup we took earlier in this chapter:

  1. Place the LVM snapshot backup image on an NFS shared file system that is accessible to the failed server by IP address. The file we’ll be working with is named backup.tar.bz2.
  2. Attach the /opt/oracle.SupportTools/diagnostics.iso boot image (obtained from a surviving server) to the failed server through the ILOM remote console.
  3. Reboot the failed server and select the CD-ROM as the boot device. When the system boots from the diagnostics ISO, it will enter a special server recovery process.
  4. From this point on, the recovery process will include step-by-step directions. For example, the following process recovers the database server from the backup image, backup.tar.bz2. Answers to the prompts are shown in bold italics:
    Choose from following by typing letter in '()':
    (e)nter interactive diagnostics shell. Must use credentials from Oracle support to login (reboot or power cycle to exit the shell),
    (r)estore system from NFS backup archive,
    Select: r

    Are you sure (y/n) [n]: y

    The backup file could be created either from LVM or non-LVM based compute node. Versions below 11.2.1.3.1 and 11.2.2.1.0 or higher do not support LVM based partitioning. Use LVM based scheme(y/n): y

    Enter path to the backup file on the NFS server in format:
    <ip_address_of_the_NFS_share>:/<path>/:<archive_file>

    For example, 10.10.10.10:/export/:operating_system.tar.bz2

    NFS line: 10.160.242.200:/export/:backup.tar.bz2
    IP Address of this host: 10.160.242.170
    Netmask of this host: 255.255.255.0
    Default gateway: 10.160.242.1
  5. When all the above information is entered, Exadata will proceed to mount the backup image across the network and recover the system. When the recovery is finished, you will be prompted to log in. Log in as root using the password provided in the Oracle documentation.
  6. Detach the diagnostics ISO from the ILOM.
  7. Reboot the system using the reboot command. The failed server should be completely restored at this point.

When the system finishes booting, you can verify the recovery using the imagehistory command. The following listing shows that the image was created as a restore from nfs backup and was completed successfully:

[enkdb01:oracle:EXDB1] /home/oracle
> su -
Password:

[enkdb01:root] /root
> imagehistory
Version                              : 11.2.1.2.3
Image activation date                : 2010-05-15 05:58:56 -0700
Imaging mode                         : fresh
Imaging status                       : success
...

Version                              : 11.2.2.2.0.101206.2
Image activation date                : 2010-12-17 11:51:53 -0600
Imaging mode                         : patch
Imaging status                       : success

Version                              : 11.2.2.2.0.101206.2
Image activation date                : 2010-01-23 15:23:05 -0600
Imaging mode                         : restore from nfs backup
Imaging status                       : success

Generally speaking, it’s a good idea not to get too creative when it comes to customizing your Exadata database server. Oracle permits you to create new LVM partitions and add file systems to your database servers, but if you do so, your recovery will require some additional steps. They aren’t terribly difficult, but if you choose to customize your LVM partitions, be prepared to document the changes somewhere and familiarize yourself with the recovery procedures for customized systems in the Oracle documentation. Also, scripts that come from Oracle will not be aware of custom changes to the file system layout. This could lead to unexpected results when running those scripts—in particular, the LVM backup scripts provided in Exadata documentation.
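If you do add custom logical volumes or file systems, one easy way to document the layout is to capture the output of the standard LVM and file system commands and keep it alongside your backups. A quick sketch, run as root (the output file name is just a suggestion):

# Record the volume group, logical volume, and file system layout for later reference
vgdisplay -v                      > /root/lvm_layout_$(hostname -s).txt
lvs -o lv_name,vg_name,lv_size   >> /root/lvm_layout_$(hostname -s).txt
df -hT                           >> /root/lvm_layout_$(hostname -s).txt
cat /etc/fstab                   >> /root/lvm_layout_$(hostname -s).txt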

Reimaging a Database Server

If a database server must be replaced or rebuilt from scratch and there is no backup image to recover from, an image can be created from an install image provided by Oracle Support. It is a lengthy and highly complicated process, but we’ll hit the highlights here so you get a general idea of what it involves.

Before the server can be reimaged, it must be removed from the RAC cluster. This is the standard procedure for deleting a node from any 11gR2 or 12cR1 RAC cluster. First, the listener on the failed server must be shut down and disabled. Then the ORACLE_HOME for the database binaries is removed from the Oracle inventory. The VIP is then stopped and removed from the cluster configuration and the node deleted from the cluster. Finally, the ORACLE_HOME for the Grid Infrastructure is removed from the Oracle inventory.
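The exact commands vary by Grid Infrastructure version, but the sequence looks roughly like the following sketch. The node names are from our example environment (enkdb01 is the failed node, enkdb02 a surviving node); treat this as an outline, not a substitute for the documented node-deletion procedure.

# Disable and stop the listener on the failed node (run from a surviving node)
srvctl disable listener -n enkdb01
srvctl stop listener -n enkdb01

# Remove the failed node's database home from the Oracle inventory
$ORACLE_HOME/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME \
    "CLUSTER_NODES={enkdb02}" LOCAL_NODE=enkdb02

# Stop and remove the VIP, then delete the node from the cluster (as root)
srvctl stop vip -i enkdb01-vip
srvctl remove vip -i enkdb01-vip -f
crsctl delete node -n enkdb01

# Finally, remove the failed node's Grid Infrastructure home from the inventory
$GRID_HOME/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$GRID_HOME CRS=TRUE \
    "CLUSTER_NODES={enkdb02}" LOCAL_NODE=enkdb02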

The Oracle Software Delivery Cloud (formerly e-Delivery) hosts a computeImageMaker file that is used, on one of the surviving database servers, to create a bootable install image. This imagemaker file is specific to the version and platform of your Exadata system and is named as follows:

computeImageMaker_{exadata_release}_LINUX.X64_{release_date}.{platform}.tar

An external USB flash drive is used to boot the recovery image on the failed server. The USB drive doesn't need to be very big; a 2–4GB thumb drive will do. The next step is to unzip the imagemaker file you downloaded from Oracle Support on one of the other Exadata database servers in your rack. A similar recovery process for storage cells uses the first USB drive found on the system, so, before proceeding, you should remove all other external USB devices from the server. To create a bootable system image for recovering the failed database server, you will run the makeImageMedia.sh script.

When the makeImageMedia.sh script completes, you are ready to install the image on your failed server. Remove the USB drive from the good server and plug it into the failed server. Log in to the ILOM on the failed server and reboot it. When the server boots up, it will automatically find the bootable recovery image on the external USB drive and begin the reimaging process. From this point, the process is automated. First, it checks the firmware and BIOS versions on the server and updates them as needed to match your other database servers. Don't expect this step to do anything if you are reimaging a server that was already part of your Exadata system, but it is necessary if the damaged server has been replaced with new equipment. Once the hardware components are up-to-date, the new image is installed. When the reimaging process is complete, you can unplug the external USB drive and power cycle the server to boot up the new system image.
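For reference, the USB-creation half of that procedure boils down to something like the following sketch. The file name is the release-specific template shown above, and the extraction directory and script options vary by release, so check the Owner's Guide before running it.

# On a surviving database server, as root, with only the target USB drive attached
tar -pxvf computeImageMaker_{exadata_release}_LINUX.X64_{release_date}.{platform}.tar

# cd into the directory created by the tar file (named dl360 in earlier releases)
cd dl360

# Build the bootable recovery image onto the first external USB device found
./makeImageMedia.sh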

When the reimaging process is complete and the database server is back online, it will be set to factory defaults. For all intents and purposes, you should think of the reimaged server as a brand-new server. The server will enter the firstboot process, where it will ask you for any relevant network information required to complete the installation. This includes hostnames, IP addresses, DNS, and NTP servers. Once the operating system is configured, you will need to reinstall the Grid Infrastructure and database software and add the node back into the cluster. This is a well-documented process that many RAC DBAs refer to as the “add node” procedure. If you’re not familiar with the process, let us reassure you—it’s not nearly as daunting or time-consuming as you might think. Once you have the operating system prepared for the install, much of the heavy lifting is done for you by the Oracle Installer. The Exadata Owner’s Guide does an excellent job of walking you through each step of the process.
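As a rough illustration, the Grid Infrastructure half of the add node procedure amounts to something like the following sketch, run from a surviving node. Here, enkdb01 is the freshly reimaged node; the exact syntax differs slightly between 11gR2 and 12c.

# Verify the new node is ready to join the cluster
$GRID_HOME/bin/cluvfy stage -pre nodeadd -n enkdb01 -fixup

# Extend the Grid Infrastructure home to the new node
$GRID_HOME/addnode/addnode.sh -silent \
    "CLUSTER_NEW_NODES={enkdb01}" \
    "CLUSTER_NEW_VIRTUAL_HOSTNAMES={enkdb01-vip}"

# Run the root scripts on the new node when prompted, repeat the addnode step
# from $ORACLE_HOME for the database home, and add the instance with DBCA or srvctl.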

Recovering the Storage Cell

Storage cell recovery is a very broad subject. It can be as simple as replacing an underperforming or failed data disk and as complex as responding to a total system failure such as a malfunctioning chip on the motherboard. In this section, we'll discuss various types of cell recovery, including removing and replacing failed physical disks and Flash Cache modules. We will also discuss what to do if an entire storage cell dies and must be replaced.

System Volume Failure

Recall that the first two disks in the storage cell contain the Linux operating system and are commonly referred to as the “system volumes.” Exadata protects these volumes using software mirroring through the Linux operating system. Even so, certain situations may require you to recover these disks from backup. Following are some reasons for performing cell recovery:

  • System volumes (disks 1 and 2) fail simultaneously.
  • The boot partition is damaged beyond repair.
  • File systems become corrupted.
  • A patch installation or upgrade fails.

If you find yourself in any of these situations, it may be necessary, or at least more expedient, to recover the system volumes from backup. As discussed earlier, Exadata automatically maintains a backup of the last good boot configuration using a 4GB internal USB flash drive called the CELLBOOT USB flash drive. Recovering the system volumes using this internal USB flash disk is commonly referred to as the storage cell rescue procedure. The steps for performing the cell rescue procedure basically involve booting from the internal USB drive and following the prompts for the type of rescue you want to perform. By the way, since Exadata comes equipped with an Integrated Lights Out Management module (ILOM), you can perform all cell recovery operations remotely, across the network. There is no need to stand in front of the rack to perform a full cell recovery from the internal USB flash disk.

Image Note  In order to perform a sanity check of the USB recovery media, every Exadata storage cell is configured to use the USB device as its primary boot media. The cell will utilize a bootloader installed on the USB recovery media, which then points back to the system volumes. In the event that the USB recovery media is unavailable, the cell will revert back to boot from the system volumes and generate an alert that the USB recovery media is either not present or is damaged.

This section is not intended to be a step-by-step guide to cell recovery, so we’re not going to go into all the details of cell recovery from the CELLBOOT USB flash disk. The Oracle documentation should be used for that, but we will take a look at what to consider before starting such a recovery.

  • Cell Disks and Grid Disks: The rescue procedure restores the Linux system volumes only. Cell disks and their contents are not restored by the rescue procedure. If these partitions are damaged, they must be dropped and re-created. Once the grid disks are online, they can be added back to the ASM disk group and a subsequent rebalance will restore the data.
  • ASM Redundancy: Recovering a storage cell from USB backup can potentially cause the loss of all data on the system volumes. This includes your database data in the grid disks on these disk drives. If your ASM disk groups use Normal redundancy, we strongly recommend making a database backup before performing cell recovery from USB disk. With ASM High redundancy, you have a total of three copies of all your data, so it is safe to perform cell recovery without taking database backups. Even so, we’d still take a backup if at all possible. The recovery process does not destroy data volumes (cell/grid disks) unless you explicitly choose to do so when prompted by the rescue procedure.
  • Software and Patches: The rescue procedure will restore the cell to its former state, patches included, when the backup was taken. Also included in the restore are the network settings and SSH keys for the root, celladmin, and cellmonitor accounts.

Cell Rescue Options

In order to access the cell rescue options, connect to a virtual console from the lights-out management card and reboot the cell. When the grub menu shows on the screen, press a key to bring up the boot options. You will see several different options for the pair of system disks on the storage cell. The CELLBOOT USB device can be accessed through the final option, CELL_USB_BOOT_CELLBOOT_usb:in_rescue_mode. Upon booting from the CELLBOOT USB device, you have two initial options for recovery—enter a rescue shell or reimage the storage server. The rescue shell can be helpful if only a few files need to be recovered, or if the failure can be resolved via the command line. Typically, you will need to reimage the cell. While this may sound like a drastic option, Oracle has made sure that everything critical to rebuilding the operating system has been backed up. This means that you will not need to enter IP addresses, hostnames, or reset the root password. Upon choosing to reimage the storage cell, you are asked whether you would like to erase all of the data partitions and disks. This decision is based on the type of recovery that you need to perform.

  • Erase data partitions and data disks: If this option is chosen, the reimage procedure will drop the partitions on the system disks and remove the cell disk metadata from the nonsystem disks. This leaves the storage cell in a state with a reimaged operating system and no cell disks within the storage server configuration.
  • Do not erase data partitions and data disks: The storage cell will still be reimaged, but only the system volume partitions will be impacted. This option leaves all of the cell disk metadata intact, meaning that the individual cell disks will still be available and contain all of the data that resided on them before the reimage. The cell disks will still have to be imported, but that can be done with a simple import celldisk all force command in CellCLI. This option is valuable if the ASM disks for that cell have not been dropped from the ASM disk groups. Importing the cell disks will avoid the need for an I/O-intensive rebalance.
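For example, after a reimage that preserved the data partitions, the remaining cleanup might look like the following sketch; the disk group and failure group names are from our example environment. Import the cell disks, confirm the grid disks are visible, and then bring the cell's disks back online from an ASM instance:

CellCLI> import celldisk all force
CellCLI> list griddisk attributes name, status

SYS:+ASM2> alter diskgroup DATA_DG online disks in failgroup CELL03;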

So, what happens if, for some reason, the internal CELLBOOT USB flash disk cannot be used for the rescue procedure? If this happens, you can follow the compute node reimage procedure and download the storage cell imagemaker software from the Oracle Software Delivery Cloud. Create a bootable USB device and boot the storage cell using that. Keep in mind that you will need to enter all of the network information specific to that cell because the image you are using does not have this data. All cell disks and grid disks will need to be recreated as well. Hopefully, this should never happen due to the constant validation checks performed on the storage cells. Newer versions of the Exadata Storage Server software include periodic checks of the CELLBOOT USB device and will automatically rebuild the device if it becomes corrupted.

Cell Disk Failure

ASM handles the temporary or permanent loss of a cell disk through its redundant failure group technology. As a result, the loss of a cell disk should not cause any interruption to the databases as long as the disk group is defined with Normal redundancy. If High redundancy is used, the disk group can survive the simultaneous loss of two cell disks, even when they are in different failure groups. Recall that on Exadata, each storage cell constitutes a separate failure group. This means that with Normal redundancy, you can lose an entire storage cell (12 cell disks) without impact to your databases. With High redundancy, you can lose two storage cells simultaneously and your databases will continue to service your clients without interruption. That's pretty impressive. Redundancy isn't cheap, though. For example, consider a disk group with 30 terabytes of raw space (the capacity you would see with External redundancy). With Normal redundancy, that 30 terabytes becomes 15 terabytes of usable space. With High redundancy, it becomes 10 terabytes of usable storage. Also keep in mind that the database will typically read the primary copy of your data unless it is unavailable. On Oracle 12c, a disk failure will enable the even read feature, which reads from the disk with the lightest load, regardless of whether it holds the primary or a mirrored copy of the data. Normal and High redundancy provide no performance benefits. They are used strictly for fault tolerance. The key is to choose a redundancy level that strikes a balance between resiliency and budget.
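You can see this trade-off on your own system by querying V$ASM_DISKGROUP. The USABLE_FILE_MB column already accounts for mirroring and for the space reserved to tolerate a failure, so it is the number to watch. A quick sketch:

SYS:+ASM2> select name, type, total_mb/1024 raw_gb, usable_file_mb/1024 usable_gb
             from v$asm_diskgroup
            order by 1;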

Simulated Disk Failure

In this section, we’re going to test what happens when a cell disk fails. The system used for these tests was a quarter rack, Exadata V2. We’ve created a disk group called SCRATCH_DG, defined as follows:

SYS:+ASM2> CREATE DISKGROUP SCRATCH_DG NORMAL REDUNDANCY
  FAILGROUP CELL01 DISK 'o/192.168.12.3/SCRATCH_DG_CD_05_cell01'
  FAILGROUP CELL02 DISK 'o/192.168.12.4/SCRATCH_DG_CD_05_cell02'
  FAILGROUP CELL03 DISK 'o/192.168.12.5/SCRATCH_DG_CD_05_cell03'
  attribute 'compatible.rdbms'='12.1.0.2.0',
            'compatible.asm'  ='12.1.0.2.0',
            'au_size'='4M',
            'cell.smart_scan_capable'='true';

Notice that this disk group is created using three grid disks. Following Exadata best practices, we’ve used one grid disk from each storage cell. It’s interesting to note that even if we hadn’t specified three failure groups with one disk in each, ASM would have done so automatically. We then created a small, single-instance database called SCRATCH using this disk group. The disk group is configured with normal redundancy (two mirror copies for each block of data), which means our database should be able to suffer the loss of one grid disk without losing access to data or causing a crash. Since each grid disk resides on a separate storage cell, we could even suffer the loss of an entire storage cell without losing data. We’ll discuss what happens when a storage cell fails later in the chapter.

In a moment, we will take a look at what happens when a grid disk is removed from the storage cell (a simulated disk failure). But before we do, there are a few things we need to do:

  • Verify that no rebalance or other volume management operations are running
  • Ensure that all grid disks for the SCRATCH_DG disk group are online
  • Verify that taking a disk offline will not impact database operations
  • Check the disk repair timer to ensure the disk is not automatically dropped before we can bring it back online again

There are a couple of ways to verify that volume management activity is not going on. First, let’s check the current state of the disk groups using asmcmd. The ls –l command shows the disk groups, the type of redundancy, and whether or not a rebalance operation is currently underway. By the way, you could also get this information using the lsdg command, which also includes other interesting information such as space utilization, online/offline status, and more. The Rebal column in the following listing indicates that no rebalance operations are executing at the moment.

> asmcmd -p
ASMCMD [+] > ls -l
State    Type    Rebal  Name
MOUNTED  NORMAL  N      DATA_DG/
MOUNTED  NORMAL  N      RECO_DG/
MOUNTED  NORMAL  N      SCRATCH_DG/
MOUNTED  NORMAL  N      STAGE_DG/
MOUNTED  NORMAL  N      SYSTEM_DG/

Notice that not all volume management operations are shown in the asmcmd commands. If a grid disk has been offline for a period of time, there may be a considerable amount of backlogged data that must be copied to it in order to bring it up-to-date. Depending on the volume of data, it may take several minutes to finish resynchronizing a disk. Although this operation is directly related to maintaining balance across all disks, it is not technically a “rebalance” operation. As such, it will not appear in the listing. For example, even though the ls –l command in the previous listing showed a status of N for rebalance operations, you can clearly see that a disk is currently being brought online by running the next query:

SYS:+ASM2> select dg.name "Diskgroup", disk.name, disk.failgroup, disk.mode_status
       from v$asm_disk disk,
            v$asm_diskgroup dg
      where dg.group_number = disk.group_number
        and disk.mode_status <> 'ONLINE';

Diskgroup         NAME                           FAILGROUP  MODE_ST
----------------- ------------------------------ ---------- -------
SCRATCH_DG        SCRATCH_CD_05_CELL01           CELL01     SYNCING

Checking whether a disk is online or offline is a simple matter of running the following query from SQL*Plus. In the listing below, you can see that the SCRATCH_CD_05_CELL01 disk is offline: its MOUNT_STATUS is MISSING and its HEADER_STATUS is UNKNOWN:

SYS:+ASM2> select d.name, d.MOUNT_STATUS, d.HEADER_STATUS, d.STATE
 from v$asm_disk d
 where d.name like 'SCRATCH%'
 order by 1;

NAME                                               MOUNT_S HEADER_STATU STATE
-------------------------------------------------- ------- ------------ ----------
SCRATCH_CD_05_CELL01                               MISSING UNKNOWN      NORMAL
SCRATCH_CD_05_CELL02                               CACHED  MEMBER       NORMAL
SCRATCH_CD_05_CELL03                               CACHED  MEMBER       NORMAL

Still, perhaps a better way of checking the status of all disks in the SCRATCH_DG disk group would be to check the mode_status in V$ASM_DISK_STAT. The following listing shows that all grid disks in the SCRATCH_DG disk group are online:

SYS:+ASM2> select name, mode_status from v$asm_disk_stat where name like 'SCRATCH%';

NAME                                               MODE_ST
-------------------------------------------------- -------
SCRATCH_CD_05_CELL03                               ONLINE
SCRATCH_CD_05_CELL01                               ONLINE
SCRATCH_CD_05_CELL02                               ONLINE

The next thing we’ll look at is the disk repair timer. Recall that the disk group attribute disk_repair_time determines the amount of time ASM will wait before it permanently removes a disk from the disk group and rebalances the data to the surviving grid disks when read/write errors occur. Before taking a disk offline, we should check to see that this timer is going to give us enough time to bring the disk back online before ASM automatically drops it. This attribute can be displayed using SQL*Plus and running the following query. (By the way, the V$ASM views are visible whether you are connected to an ASM instance or a database instance.)

SYS:+ASM2> select dg.name "DiskGroup",
            attr.name,
            attr.value
       from v$asm_diskgroup dg,
            v$asm_attribute attr
      where dg.group_number = attr.group_number
        and attr.name like '%repair_time';

DiskGroup         NAME                      VALUE
----------------- ------------------------- ----------
DATA_DG           disk_repair_time          3.6h
DATA_DG           failgroup_repair_time     24.0h
DBFS_DG           disk_repair_time           3.6h
DBFS_DG           failgroup_repair_time     24.0h
RECO_DG           disk_repair_time           3.6h
RECO_DG           failgroup_repair_time     24.0h
SCRATCH_DG        disk_repair_time           8.5h
SCRATCH_DG        failgroup_repair_time       24h
STAGE_DG          disk_repair_time            72h
STAGE_DG          failgroup_repair_time       24h

The default value for the disk repair timer is 3.6 hours. Since this query was run on a cluster running Oracle 12c, there is also a failgroup_repair_time attribute. This is the amount of time ASM will wait before dropping the disks when an entire failure group goes missing, which is useful when a hardware failure affects an entire storage cell. These timers come into play when a storage cell is rebooted or when a disk is temporarily taken offline, but, on rare occasion, a disk can also go offline spontaneously because of an actual hardware failure. Sometimes simply pulling a disk out of the chassis and reinserting it will clear unexpected transient errors. Any data that would normally be written to the offline disk will queue up until the disk is brought back online or the disk repair time expires. If ASM drops a disk, it can be manually added back into the disk group, but doing so requires a full rebalance, which can be a lengthy process. The following command was used to set the disk repair timer to 8.5 hours for the SCRATCH_DG disk group:

SYS:+ASM2> alter diskgroup SCRATCH_DG set attribute 'disk_repair_time'='8.5h';
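Once a disk actually goes offline, you can watch the countdown in V$ASM_DISK. The REPAIR_TIMER column shows the number of seconds remaining before ASM drops the disk. A minimal sketch:

SYS:+ASM2> select name, mode_status, repair_timer
             from v$asm_disk
            where repair_timer > 0;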

Now, let’s verify whether taking a cell disk offline will affect the availability of the disk group. We can do that by checking the asmdeactivationoutcome and asmmodestatus attributes of our grid disks. For example, the following listing shows the output from the LIST GRIDDISK command when a grid disk in a normal redundancy disk group is taken offline. In this example, we have a SCRATCH_DG disk group consisting of one grid disk from three failure groups (enkcel01, enkcel02, and enkcel03). First, we’ll check the status of the grid disks when all disks are active:

[enkdb02:root] /root
> dcli -g cell_group -l root " cellcli -e list griddisk
     attributes name, asmdeactivationoutcome, asmmodestatus " | grep SCRATCH
enkcel01: SCRATCH_DG_CD_05_cell01   Yes     ONLINE
enkcel02: SCRATCH_DG_CD_05_cell02   Yes     ONLINE
enkcel03: SCRATCH_DG_CD_05_cell03   Yes     ONLINE

Now, we'll deactivate one of these grid disks at the storage cell and run the command again:

CellCLI> alter griddisk SCRATCH_DG_CD_05_cell01 inactive
GridDisk SCRATCH_DG_CD_05_cell01 successfully altered

 [enkdb02:root] /root
> dcli -g cell_group -l root " cellcli -e list griddisk
     attributes name, asmdeactivationoutcome, asmmodestatus " | grep SCRATCH
enkcel01: SCRATCH_DG_CD_05_cell01   Yes     OFFLINE
enkcel02: SCRATCH_DG_CD_05_cell02   "Cannot de-activate due to other offline disks in the diskgroup"        ONLINE
enkcel03: SCRATCH_DG_CD_05_cell03   "Cannot de-activate due to other offline disks in the diskgroup"        ONLINE

As you can see, the asmmodestatus attribute of the offlined grid disk is now set to OFFLINE, and the asmdeactivationoutcome attribute of the other two disks in the disk group warns us that these grid disks cannot be taken offline. Doing so would cause ASM to dismount the SCRATCH_DG disk group.

Image Note  Notice that we use the dcli command to run the CellCLI command LIST GRIDDISK ATTRIBUTES on each cell in the storage grid. Basically, dcli allows us to run a command concurrently on multiple nodes. The cell_group parameter is a file containing a list of all of our storage cells.

If the output from the LIST GRIDDISK command indicates it is safe to do so, we can test what happens when we take one of the grid disks for our SCRATCH_DG disk group offline. For this test, we will physically remove the disk drive from the storage cell chassis. The test configuration will be as follows:

  • For this test, we will create a new tablespace with one datafile. The datafile is set to autoextend so it will grow into the disk group as data is loaded.
  • Next, we’ll generate a considerable amount of data in the tablespace by creating a large table; a couple of billion rows from DBA_SEGMENTS should do it.
  • While data is being loaded into the large table, we will physically remove the disk from the cell chassis.
  • Once the data is finished loading, we will reinstall the disk and observe Exadata’s automated disk recovery in action.

The first order of business is to identify the location of the disk drive within the storage cell. To do this, we will use the grid disk name to find the cell disk it resides on. Then we’ll use the cell disk name to find the slot address of the disk drive within the storage cell. Once we have the slot address, we will turn on the service LED on the front panel so we know which disk to remove.

From storage cell 3, we can use the LIST GRIDDISK command to find the name of the cell disk we are looking for:

CellCLI> list griddisk attributes name, celldisk where name like 'SCRATCH.*' detail
         name:                   SCRATCH_DG_CD_05_cell03
         cellDisk:               CD_05_cell03

Now that we have the cell disk name, we can use the LIST LUN command to find the slot address of the physical disk we want to remove. In the following listing, we see the slot address we’re looking for, 16:5.

CellCLI> list LUN attributes celldisk, physicaldrives where celldisk=CD_05_cell03 detail
         cellDisk:               CD_05_cell03
         physicalDrives:         16:5

With the slot address, we can use the MegaCli64 command to activate the drive's service LED on the front panel of the storage cell. Note that the brackets ([]) around the physical drive address may need to be escaped or quoted to prevent the Bash shell from interpreting them. (Single quotes around the address work as well, by the way.)

/opt/MegaRAID/MegaCli/MegaCli64 -pdlocate -physdrv [16:5] -a0

The amber LED on the front of the disk drive should be flashing, as can be seen in Figure 9-2.

9781430262411_Fig09-02.jpg

Figure 9-2. Disk drive front panel

And, in case you were wondering, the service LED can be turned off again using the stop option of the MegaCli64 command, like this:

/opt/MegaRAID/MegaCli/MegaCli64 -pdlocate -stop -physdrv [16:5] -a0

Now that we’ve located the right disk, we can remove it from the storage cell by pressing the release button and gently pulling the lever on the front of the disk, as you can see in Figure 9-3.

9781430262411_Fig09-03.jpg

Figure 9-3. Ejected disk drive

Image Note  All disk drives in the storage cell are hot-pluggable and may be replaced without powering down the storage cell.

Checking the grid disk status in CellCLI, we see that it has been changed from Active to Inactive. This makes the grid disk unavailable to the ASM storage cluster:

CellCLI> list griddisk where name = 'SCRATCH_CD_05_cell03';
         SCRATCH_CD_05_cell03    inactive

ASM immediately notices the loss of the disk, takes it offline, and starts the disk repair timer. The ASM alert log (alert_+ASM2.log) shows that we have about 8.5 hours (30596/60/60) to bring the disk back online before ASM permanently drops it from the disk group:

alert_+ASM2.log
--------------------
Tue Dec 28 08:40:54 2010
GMON checking disk modes for group 5 at 121 for pid 52, osid 29292
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_gmon_5912.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.12.5/SCRATCH_CD_05_cell03 at offset 4198400 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Write Failed. group:5 disk:3 AU:1 offset:4096 size:4096
...
WARNING: Disk SCRATCH_DG_CD_05_CELL03 in mode 0x7f is now being offlined
WARNING: Disk SCRATCH_DG_CD_05_CELL03 in mode 0x7f is now being taken offline
...
Tue Dec 28 08:43:21 2010
WARNING: Disk (SCRATCH_DG_CD_05_CELL03) will be dropped in: (30596) secs on ASM inst: (2)
Tue Dec 28 08:43:23 2010

The status of the disk in ASM can be seen using the following query from one of the ASM instances. Notice that the SCRATCH disk group is still mounted (online):

SYS:+ASM2> select dg.name, d.name, dg.state, d.mount_status, d.header_status, d.state
             from v$asm_disk d,
                  v$asm_diskgroup dg
            where dg.name = 'SCRATCH_DG'
              and dg.group_number = d.group_number
            order by 1,2;

NAME          NAME                         STATE      MOUNT_S HEADER_STATU STATE
------------- ---------------------------- ---------- ------- ------------ ----------
SCRATCH       SCRATCH_DG_CD_05_CELL01      MOUNTED    CACHED  MEMBER       NORMAL
SCRATCH       SCRATCH_DG_CD_05_CELL02      MOUNTED    CACHED  MEMBER       NORMAL
SCRATCH       SCRATCH_DG_CD_05_CELL03      MOUNTED    MISSING UNKNOWN      NORMAL

While the disk is offline, ASM continues to poll its status to see if the disk is available. We see the following query repeating in the ASM alert log:

alert_+ASM1.log
--------------------
WARNING: Exadata Auto Management: OS PID: 5918 Operation ID: 3015:   in diskgroup  Failed
  SQL    : /* Exadata Auto Mgmt: Select disks in DG that are not ONLINE. */
select name from v$asm_disk_stat
  where
    mode_status='OFFLINE'
      and
    group_number in
      (
       select group_number from v$asm_diskgroup_stat
         where
           name='SCRATCH_DG'
             and
           state='MOUNTED'
      )

Our test database also detected the loss of the grid disk, as can be seen in the database alert log:

alert_SCRATCH.log
-----------------------
Tue Dec 28 08:40:54 2010
Errors in file /u01/app/oracle/diag/rdbms/scratch/SCRATCH/trace/SCRATCH_ckpt_22529.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.12.5/SCRATCH_CD_05_cell03 at offset 26361217024 for data length 16384
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:5 disk:3 AU:6285 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 260 in group [5.1611847437] from disk SCRATCH_CD_05_CELL03  allocation unit 6285 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 0 logical extent 1 of file 260 in group [5.1611847437] from disk SCRATCH_CD_05_CELL02 allocation unit 224
...
Tue Dec 28 08:40:54 2010
NOTE: disk 3 (SCRATCH_CD_05_CELL03) in group 5 (SCRATCH) is offline for reads
NOTE: disk 3 (SCRATCH_CD_05_CELL03) in group 5 (SCRATCH) is offline for writes

Notice that the database automatically switches to the mirror copy for data it can no longer read from the failed grid disk. This is ASM normal redundancy in action.

When we reinsert the disk drive, the storage cell returns the grid disk to a state of Active, and ASM brings the disk back online again. We can see that the grid disk has returned to a state of CACHED and a HEADER_STATUS of NORMAL in the following query:

SYS:+ASM2> select dg.name, d.name, dg.state, d.mount_status, d.header_status, d.state
             from v$asm_disk d,
                  v$asm_diskgroup dg
            where dg.name = 'SCRATCH'
              and dg.group_number = d.group_number
            order by 1,2;

NAME          NAME                      STATE      MOUNT_S HEADER_STATU STATE
------------- ------------------------- ---------- ------- ------------ ----------
SCRATCH       SCRATCH_CD_05_CELL01      MOUNTED    CACHED  MEMBER       NORMAL
SCRATCH       SCRATCH_CD_05_CELL02      MOUNTED    CACHED  MEMBER       NORMAL
SCRATCH       SCRATCH_CD_05_CELL03      MOUNTED    CACHED  MEMBER       NORMAL

It is likely that the disk group will need to catch up on writing data that queued up while the disk was offline. If the disk was reinserted before the disk_repair_time counter hit zero, the disk will simply catch up on the writes that were missed. If not, then the entire disk group will need to be rebalanced, which can take a significant amount of time. Generally speaking, the delay is not a problem because it all happens in the background. During the resilvering process, ASM redundancy allows our databases to continue with no interruption to service. You can see the status of a resync or rebalance operation through the gv$asm_operation view in ASM. Keep in mind that resync operations are only visible in Oracle 12c and on.
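For example, a query along the following lines shows any resync or rebalance that is currently running, together with ASM's own estimate of the time remaining:

SYS:+ASM2> select inst_id, group_number, operation, state, power, sofar, est_work, est_minutes
             from gv$asm_operation;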

If this had been an actual disk failure and we actually replaced the disk drive, we would need to wait for the RAID controller to acknowledge the new disk before it could be used. This doesn’t take long, but you should check the status of the disk to ensure that its status is Normal before using it. The disk status may be verified using the CellCLI command LIST PHYSICALDISK, as shown here:

CellCLI> list physicaldisk where diskType=HardDisk AND status=critical detail

When a disk is replaced, the storage cell performs the following tasks automatically:

  • The disk firmware is updated to match the other disk drives in the storage cell.
  • The cell disk is re-created to match that of the disk it replaced.
  • The replacement cell disk is brought online (status set to Normal).
  • The grid disk (or grid disks) on the failed disk will be re-created.
  • The grid disk status is set to Active.

Once the replacement grid disks are set to Active, ASM automatically opens the disk and begins the resilvering process. The Exadata Storage Server software handles all of these tasks automatically, making disk replacement a fairly painless process.

When to Replace a Cell Disk

Disk failure can occur abruptly, causing the disk to go offline immediately, or it can occur gradually, manifesting as poor I/O performance. Storage cells are constantly monitoring the disk drives. This monitoring includes drive performance, in terms of both IOPS and throughput, and SMART metrics such as temperature, speed, and read/write errors. The goal is to provide early warning for disks that are likely to fail before they actually do. When the storage cell detects a problem, an alert is generated with specific instructions on how to replace the disk. If the system has been configured for e-mail notification, these alerts will be e-mailed to you automatically. Alerts will also be sent using the other available notification methods, including Oracle Enterprise Manager and Automatic Service Request. Figure 9-4 shows an example of an e-mail alert from an Exadata storage cell. Note that the e-mail includes the name of the host, the disk that has failed, and even a picture of the front of a storage cell with a red ring around the disk that has failed. When the disk has been replaced and the alert has cleared, a follow-up e-mail will be sent with a green ring around the new disk.

9781430262411_Fig09-04.jpg

Figure 9-4. Example of an e-mail alert from a failed disk drive
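For these alerts to arrive by e-mail, the cell must know about your mail gateway and recipients. The following is a sketch of the relevant CellCLI attributes; the server name and addresses are placeholders. The VALIDATE MAIL command sends a test message so you can confirm the settings before you actually need them.

CellCLI> ALTER CELL smtpServer='mail.example.com',               -
                    smtpFromAddr='exadata-cell@example.com',     -
                    smtpToAddr='dba-team@example.com',           -
                    notificationPolicy='critical,warning,clear', -
                    notificationMethod='mail'
CellCLI> ALTER CELL VALIDATE MAIL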

In the previous section, we walked you through a simulated drive failure. Had this been an actual disk failure, the procedure for replacing the disk would follow the same steps we used for the simulation. But what happens when Exadata’s early warning system determines that a drive is likely to fail soon? When Exadata detects drive problems, it sets the physical disk status attribute accordingly. The following CellCLI command displays the status of all disks in the storage cell:

CellCLI> list physicaldisk attributes name, status where disktype = 'HardDisk'
         35:0    normal
         35:1    normal
         ...
         35:11   normal

Table 9-1 shows the various disk status values and what they mean.

Table 9-1. Disk Status Definitions

Status               Description
-------------------  -----------------------------------------------------------------------
Normal               The drive is healthy.
Predictive Failure   The disk is still working but is likely to fail soon and should be
                     replaced as soon as possible.
Poor Performance     The disk is exhibiting extremely poor performance and should be replaced.

Predictive Failure

If a disk's status shows Predictive Failure, ASM will automatically drop the grid disks on that drive and rebalance their data to other disks, according to the redundancy of each affected disk group. Once ASM has finished rebalancing and completed the drop operation, you can replace the disk drive. The following listing can be used to track the status of the ASM disks. A mode status of OFFLINE indicates that ASM has not yet finished rebalancing the disk group. Once the rebalance is complete, the disk will no longer appear in the listing. By the way, tailing the ASM alert log is also an excellent way of checking the progress of the drop.

SYS:+ASM2>select name, mode_status
            from v$asm_disk_stat
           where name like 'SCRATCH%'
           order by 1;

NAME                                               MODE_ST
-------------------------------------------------- -------
SCRATCH_CD_05_CELL01                               ONLINE
SCRATCH_CD_05_CELL02                               ONLINE
SCRATCH_CD_05_CELL03                               OFFLINE

Image Caution  The first two physical disks in the storage cell also contain the Linux operating system. The O/S partitions on these two disks are configured as mirrors of one another. If one of these disks fails, the data must be in sync with the mirror disk before you remove it. Use the CellCLI command alter cell validate configuration to verify that no mdadm errors exist before replacing the disk.

The CellCLI command VALIDATE CONFIGURATION performs this verification for you:

CellCLI> ALTER CELL VALIDATE CONFIGURATION
Cell enkcel01 successfully altered

Poor Performance

If a disk exhibits poor performance, it should be replaced. A single poorly performing cell disk can impact the performance of other healthy disks. When a disk begins performing extremely badly, its status will be set to Poor Performance. As is the case with Predictive Failure status, ASM will automatically drop all grid disks (on this cell disk) from the disk groups and begin a rebalance operation. Once the rebalance is complete, you can remove and replace the failing disk drive. You can use the CellCLI command CALIBRATE to manually check the performance of all disks in the storage cell. This command runs Oracle’s Orion calibration tool to look at both the performance and throughput of each of the disks. Ordinarily, cellsrv should be shut down before running CALIBRATE because it can significantly impact I/O performance for databases using the storage cell. If you cannot shut down cellsrv for the test, you can run CALIBRATE using the FORCE option. As daunting as that sounds, FORCE simply overrides the safety switch and allows you to run CALIBRATE while cellsrv is up and applications are using the cell disks. The following listing shows the output from the CALIBRATE command run on a healthy set of cell disks from an Exadata X4-2 high capacity cell. The test takes about ten minutes to run.

CellCLI> calibrate
Calibration will take a few minutes...
Aggregate random read throughput across all hard disk LUNs: 1123 MBPS
Aggregate random read throughput across all flash disk LUNs: 8633 MBPS
Aggregate random read IOs per second (IOPS) across all hard disk LUNs: 2396
Aggregate random read IOs per second (IOPS) across all flash disk LUNs: 260102
Calibrating hard disks (read only) ...
LUN 0_0  on drive [20:0     ] random read throughput: 141.27 MBPS, and 195 IOPS
LUN 0_1  on drive [20:1     ] random read throughput: 139.66 MBPS, and 203 IOPS
LUN 0_10 on drive [20:10    ] random read throughput: 141.02 MBPS, and 201 IOPS
LUN 0_11 on drive [20:11    ] random read throughput: 140.82 MBPS, and 200 IOPS
LUN 0_2  on drive [20:2     ] random read throughput: 139.89 MBPS, and 199 IOPS
LUN 0_3  on drive [20:3     ] random read throughput: 142.46 MBPS, and 201 IOPS
LUN 0_4  on drive [20:4     ] random read throughput: 140.99 MBPS, and 203 IOPS
LUN 0_5  on drive [20:5     ] random read throughput: 141.92 MBPS, and 198 IOPS
LUN 0_6  on drive [20:6     ] random read throughput: 141.23 MBPS, and 199 IOPS
LUN 0_7  on drive [20:7     ] random read throughput: 143.44 MBPS, and 202 IOPS
LUN 0_8  on drive [20:8     ] random read throughput: 141.54 MBPS, and 204 IOPS
LUN 0_9  on drive [20:9     ] random read throughput: 142.63 MBPS, and 202 IOPS
Calibrating flash disks (read only, note that writes will be significantly slower) ...
LUN 1_0  on drive [FLASH_1_0] random read throughput: 540.90 MBPS, and 39921 IOPS
LUN 1_1  on drive [FLASH_1_1] random read throughput: 540.39 MBPS, and 40044 IOPS
LUN 1_2  on drive [FLASH_1_2] random read throughput: 541.03 MBPS, and 39222 IOPS
LUN 1_3  on drive [FLASH_1_3] random read throughput: 540.45 MBPS, and 39040 IOPS
LUN 2_0  on drive [FLASH_2_0] random read throughput: 540.56 MBPS, and 43739 IOPS
LUN 2_1  on drive [FLASH_2_1] random read throughput: 540.64 MBPS, and 43662 IOPS
LUN 2_2  on drive [FLASH_2_2] random read throughput: 542.54 MBPS, and 36758 IOPS
LUN 2_3  on drive [FLASH_2_3] random read throughput: 542.63 MBPS, and 37341 IOPS
LUN 4_0  on drive [FLASH_4_0] random read throughput: 542.35 MBPS, and 39658 IOPS
LUN 4_1  on drive [FLASH_4_1] random read throughput: 542.62 MBPS, and 39374 IOPS
LUN 4_2  on drive [FLASH_4_2] random read throughput: 542.80 MBPS, and 39699 IOPS
LUN 4_3  on drive [FLASH_4_3] random read throughput: 543.14 MBPS, and 38951 IOPS
LUN 5_0  on drive [FLASH_5_0] random read throughput: 542.42 MBPS, and 38388 IOPS
LUN 5_1  on drive [FLASH_5_1] random read throughput: 542.69 MBPS, and 39360 IOPS
LUN 5_2  on drive [FLASH_5_2] random read throughput: 542.59 MBPS, and 39350 IOPS
LUN 5_3  on drive [FLASH_5_3] random read throughput: 542.72 MBPS, and 39615 IOPS
CALIBRATE results are within an acceptable range.
Calibration has finished.
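If you can quiesce the cell, the safer sequence is to stop cellsrv, run the calibration, and restart it afterward, as in the sketch below. Remember that stopping cellsrv takes the cell's grid disks away from ASM, so perform the same availability checks described earlier in the chapter first.

CellCLI> alter cell shutdown services cellsrv
CellCLI> calibrate
CellCLI> alter cell startup services cellsrv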

Cell Flash Cache Failure

Exadata X4-2 storage cells come equipped with four F80 PCIe Flash Cache cards. Each card has four Flash Cache disks (FDOMs) for a total of 16 flash disks. Exadata X5-2 high-capacity cells include four F160 PCIe Flash Cache cards with a total of four flash disks. These Flash Cache cards occupy slots 1, 2, 4, and 5 inside the storage cell. If a Flash Cache module fails, the performance of the storage cell will be degraded, and the card should be replaced at your earliest opportunity. If you are using some of your Flash Cache for flash disk-based grid disks, your disk group redundancy will be affected as well. These Flash Cache cards are not hot-pluggable, so replacing them will require you to power off the affected cell.

If a flash disk fails, Exadata will send you an e-mail notifying you of the failure. The e-mail will include the slot address of the card. If a specific FDOM has failed, it will include the address of the FDOM on the card (1, 2, 3, or 4). The failed Flash Cache card can be seen using the CellCLI command LIST PHYSICALDISK as follows:

CellCLI> list physicaldisk where disktype=flashdisk and status!=normal detail

         name:                 FLASH_5_3
         diskType:             FlashDisk
         flashLifeLeft:        100
         luns:                 5_3
         makeModel:            "Sun Flash Accelerator F80 PCIe Card"
         physicalFirmware:     UIO1
         physicalInsertTime:   2014-10-03T20:08:05-05:00
         physicalSize:         372.52903032302856G
         slotNumber:           "PCI Slot: 5; FDOM: 3"
         status:               critical

The slotNumber attribute here shows you where the card and FDOM are installed. In our case, the card is installed in PCIe slot 5. Once you have this information, you can shut down and power off the storage cell and replace the defective part. Keep in mind that when the cell is offline, ASM will no longer have access to the grid disks. So, before you shut down the cell, make sure that shutting it down will not impact the availability of the disk groups it supports. This is the same procedure we described in the “Cell Disk Failure” section of this chapter. Once the part is replaced and the cell reboots, the storage cell will automatically configure the cell disk on the replacement card and, if it was used for Flash Cache, you will see your Flash Cache return to its former size.
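Putting that together, a conservative sequence for replacing a flash card might look like the following sketch. As always, proceed only if every grid disk reports an asmDeactivationOutcome of Yes.

CellCLI> list griddisk attributes name, asmmodestatus, asmdeactivationoutcome
CellCLI> alter griddisk all inactive
CellCLI> alter cell shutdown services all

# Power off the cell from the OS or the ILOM, replace the card, and power it back on
shutdown -h now

# After the cell reboots, reactivate the grid disks and confirm the cache came back
CellCLI> alter griddisk all active
CellCLI> list flashcache detail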

Cell Failure

There are two main types of cell failure—temporary and permanent. Temporary cell failures can be as harmless as a cell reboot or a power failure. Extended cell failures can also be temporary in nature. For example, if a patch installation fails or a component must be replaced, it could take the cell offline for hours or even days. Permanent cell failures are more severe in nature and require the entire cell chassis to be replaced. In either case, if your system is configured properly, there will be no interruption to ASM or your databases. In this section, we’ll take a look at what happens when a cell is temporarily offline and what to do if you ever have to replace one.

Temporary Cell Failure

As discussed in Chapter 14, Exadata storage cells are Sun servers with internal disk drives running Oracle Enterprise Linux 5 or 6. If a storage cell goes offline, all the disks on that cell become unavailable to the database servers. This means that all disk groups containing database data (as well as OCR and Voting files) on that storage cell are offline for the duration of the outage. ASM failure groups provide redundancy that allows your cluster and databases to continue to run during the outage, albeit with reduced I/O performance. When grid disks are created in a storage cell, they are assigned to a failure group. Each cell constitutes a failure group, as can be seen in the following listing:

SYS:+ASM2> select dg.name diskgroup, d.name disk, d.failgroup
            from v$asm_diskgroup dg,
                 v$asm_disk d
           where dg.group_number = d.group_number
             and dg.name like 'SCRATCH%'
           order by 1,2,3;

DISKGROUP                      DISK                           FAILGROUP
------------------------------ ------------------------------ ------------------------------
SCRATCH_DG                     SCRATCH_DG_CD_05_CELL01        CELL01
SCRATCH_DG                     SCRATCH_DG_CD_05_CELL02        CELL02
SCRATCH_DG                     SCRATCH_DG_CD_05_CELL03        CELL03

Because SCRATCH_DG was created using Normal redundancy, our SCRATCH database should be able to continue even if an entire storage cell dies. In this section, we’ll be testing what happens when a storage cell goes dark. We’ll use the same disk group configuration we used for the disk failure simulation earlier in this chapter. To cause a cell failure, we’ll log in to the ILOM on storage cell 3 and power it off. Because each storage cell constitutes an ASM failure group, this scenario is very similar to losing a single cell disk, I/O performance notwithstanding. The difference, of course, is that we are losing an entire failure group. Just as we did in our cell disk failure tests, we’ll generate data in the SCRATCH database during the failure to verify that the database continues to service client requests during the cell outage.
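Powering the cell off can be done entirely from the ILOM command line over the management network. A quick sketch follows; the ILOM hostname is made up for this example.

ssh root@enkcel03-ilom.enkitec.com
-> stop /SYS        (powers the cell off; confirm when prompted)
-> start /SYS       (powers it back on for the second half of the test)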

To generate I/O for the tests, we’ll be repeatedly inserting 23205888 rows from the BIGTAB table into the bigtab2 table:

RJOHNSON:SCRATCH> insert /*+ append */ into bigtab2 nologging (select * from bigtab);
RJOHNSON:SCRATCH> commit;

While the above inserts are running, let’s power off Cell03 and take a look at the database alert log. As you can see, the database throws an error when reading from a disk on Cell03, “failed to read mirror side 1.” A couple of lines further down in the log, you see the database successfully reading the mirror copy of the extent, “successfully read mirror side 2.”

alert_SCRATCH.log
-----------------------
Fri Jan 16 21:09:45 2015
Errors in file /u01/app/oracle/diag/rdbms/scratch/SCRATCH/trace/SCRATCH_mmon_31673.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.12.5/SCRATCH_CD_05_cell03 at offset 2483044352 for data length 16384
ORA-27626: Exadata error: 12 (Network error)
...
WARNING: Read Failed. group:3 disk:2 AU:592 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 2 logical extent 0 of file 260 in group [3.689477631] from disk SCRATCH_CD_05_CELL03  allocation unit 592 reason error; if possible,will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 2 logical extent 1 of file 260 in group [3.689477631] from disk SCRATCH_CD_05_CELL01 allocation unit 589

Turning to the ASM alert log, we see that ASM also noticed the issue with Cell03 and responds by taking grid disk SCRATCH_CD_05_CELL03 offline. Notice further on that ASM is in the process of taking other grid disks offline as well. This continues until all grid disks on Cell03 are offline:

alert_+ASM2.log
-----------------------
--- Test Cell03 Failure --
Fri Jan 16 21:09:45 2015
NOTE: process 23445 initiating offline of disk 2.3915933784 (SCRATCH_CD_05_CELL03) with mask 0x7e in group 3
...
WARNING: Disk SCRATCH_CD_05_CELL03 in mode 0x7f is now being offlined
Fri Jan 16 21:09:47 2015
NOTE: process 19753 initiating offline of disk 10.3915933630 (RECO_CD_10_CELL03) with mask 0x7e in group 2

Checking the V$SESSION and V$SQL views, we can see that the insert is still running:

  SID PROG       SQL_ID         SQL_TEXT
----- ---------- -------------  -----------------------------------------
    3 sqlplus@en 9ncczt9qcg0m8  insert /*+ append */ into bigtab2 nologgi

So our databases continue to service client requests even when one-third of all storage is lost. That’s pretty amazing. Let’s power up Cell03 again and observe what happens when this storage is available again.

Looking at Cell03's alert log, we see cellsrv bring our grid disks back online again. The last thing we see in Cell03's alert log is the cell rejoining the storage grid by establishing a heartbeat with the diskmon (disk monitor) process on the database servers:

Cell03 Alert log
-----------------
Storage Index Allocation for GridDisk SCRATCH_DG_CD_05_cell03 successful [code: 1]
CellDisk v0.5 name=CD_05_cell03 status=NORMAL guid=edc5f61e-6a60-48c9-a4a6-58c403a86a7c found on dev=/dev/sdf
Griddisk SCRATCH_DG_CD_05_cell03  - number is (96)
Storage Index Allocation for GridDisk RECO_CD_06_cell03 successful [code: 1]
Storage Index Allocation for GridDisk SYSTEM_CD_06_cell03 successful [code: 1]
Storage Index Allocation for GridDisk STAGE_CD_06_cell03 successful [code: 1]
Storage Index Allocation for GridDisk DATA_CD_06_cell03 successful [code: 1]
CellDisk v0.5 name=CD_06_cell03 status=NORMAL guid=00000128-e01b-6d36-0000-000000000000 found on dev=/dev/sdg
Griddisk RECO_CD_06_cell03  - number is (100)
Griddisk SYSTEM_CD_06_cell03  - number is (104)
Griddisk STAGE_CD_06_cell03  - number is (108)
Griddisk DATA_CD_06_cell03  - number is (112)
...
Fri Jan 16 22:51:30 2015
Heartbeat with diskmon started on enkdb02.enkitec.com
Heartbeat with diskmon started on enkdb01.enkitec.com
Fri Jan 16 22:51:40 2015
...
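Once the cell has rejoined the grid, a quick query against V$ASM_DISK confirms that every disk is back online. When the resync is complete, this sketch should return no rows:

SYS:+ASM2> select dg.name, d.name, d.mode_status
             from v$asm_diskgroup dg, v$asm_disk d
            where dg.group_number = d.group_number
              and d.mode_status <> 'ONLINE';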

Summary

Exadata is a highly redundant platform with a lot of moving parts. Businesses don't typically invest in such a platform without expectations of minimal downtime. As such, Exadata is commonly used for hosting mission-critical business applications with very stringent uptime requirements. Knowing what to do when things go wrong is critical to meeting these uptime requirements. In this chapter, we discussed the proper procedures for protecting your applications and customers from component and system failures. Before your system is rolled into production, make it a priority to practice backing up and restoring system volumes, removing and replacing disk drives, and rebooting storage cells. In addition, become familiar with what happens to your databases when these failures occur. Run the diagnostic tools we've discussed in this chapter and make sure you understand how to interpret the output. If you are going to be responsible for maintaining Exadata for your company, now is the time to get comfortable with the topics discussed in this chapter.
