CHAPTER 4

image

Supporting Software

After the introduction to new trends in hardware in Chapter 3, you can now read more about how to exploit these trends with appropriate software. Choosing the operating environment for the next-generation hosting platform is equal in importance to choosing the hardware. Hardware and software should be certified both by the vendors and in-house. Strong support contracts with defined Service Level Agreements must be put into place to allow quick turnaround when troubleshooting problems.

You should not underestimate the amount of testing it takes to certify an Oracle stack, even if it is primarily for the database! This certification requires joint efforts by the storage, operating system, and database engineering teams. Effective management of this joint team effort is important to the overall success. This sounds too logical to be added to this introductory section of the chapter, but experience shows that a strict separation of duties, as seen in many large organizations, requires re-learning how to work together.

Enabling Software Solutions

After spending time on evaluating the hardware for the consolidation project, the next step is to think about deployment of the Oracle 12c database. In the author's opinion, there are two different approaches, depending on the importance of the environment. These approaches are virtualization and clustering. Both are centered on the magic terms "high availability" and "disaster recovery." They can be further broken down, resulting in the following options:

  • Virtualization-based solutions
  • Using active/passive clusters
  • Using active/active clusters

What shouldn’t be hidden from the discussion is the fact that complexity and cost increase in almost the same way as the protection from failure!

High Availability Considerations

The “high availability” baseline is a single instance Oracle database. The following discussion assumes there are no unnecessary single points of failure in the chosen hardware, which can potentially render the whole solution very vulnerable. If you want high availability, then you should know it will come at a cost.

Using virtualization-based solutions usually involves a single instance Oracle database installation in a virtual machine. The physical location of a virtual machine is often irrelevant, and, rather than mapping a virtual machine to a host, the mapping is to a logical grouping of hosts. Oracle VM Manager calls such a logical group of servers a server pool. Other vendors use different terminology but the same principle. In case a host has an unrecoverable failure and must reboot, it is possible to migrate virtual machines from the failed host to another host in the cluster, capacity permitting. However, there does not have to be a failure: (live) migration of virtual machines is applicable for planned hardware maintenance or to deal with capacity problems on a given host. Almost all virtualization solutions try to automate dealing with failure transparently, which plays nicely toward the requirement to not require a lot of hands-on management.

Another popular approach is to use an active-passive cluster. Most active/passive clusters are still easy to manage since the cluster management software takes care of resource management and failover to the passive node. You achieve higher availability than with a single instance database by means of the cluster framework. A lot of cluster frameworks exist for Oracle from all the major vendors, operating off the following simple principles:

  • A dedicated active/passive cluster consists of two nodes. More nodes often do not make sense in active/passive configurations. Quite often, such a setup is dedicated to a specific product and does not normally host more than a couple of databases.
  • Storage is provided to the active node. There are exceptions when a cluster logical volume manager, such as Oracle's Automatic Storage Management, or an alternative is in place.
  • Important entities are grouped into cluster resources. From an Oracle point-of-view, these include the OFA and database file systems, the listener process(es) and the database(s), among others. Resources can be logically grouped.
  • All resources have metadata associated with them. Among the information stored is a check interval, the command to check for the resource’s status, start retries, and a failover process.
  • An agent process monitors resources according to the metadata stored.
  • In case a resource has been detected as “failed,” and it cannot be restarted on the node it was running on, it will be started on the passive node.

Again, most of this processing is automated internally by the aforementioned agents and the cluster framework. There will be an outage to the clients, at least for the duration of the instance recovery. If you are unlucky, you might have to wait for a file system consistency check after the Oracle-related file systems are mounted on the passive node. Cluster file systems can prevent this problem from happening, but most of them require an extra license.
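To make the resource and agent concepts above more tangible, the following is a minimal sketch, in shell, of what such an agent conceptually does. It is not the code of any particular cluster framework; the resource name, check command, intervals, and retry count are invented for illustration.

#!/bin/sh
# Hypothetical agent loop: check a "listener" resource periodically,
# attempt a limited number of local restarts, then request a failover.
RESOURCE="listener"
CHECK_CMD="lsnrctl status"      # check command from the resource metadata
CHECK_INTERVAL=30               # check interval in seconds
MAX_RESTARTS=2                  # start retries before giving up locally

restarts=0
while true; do
    if ! $CHECK_CMD >/dev/null 2>&1; then
        if [ "$restarts" -lt "$MAX_RESTARTS" ]; then
            echo "$RESOURCE failed, attempting local restart"
            lsnrctl start >/dev/null 2>&1
            restarts=$((restarts + 1))
        else
            echo "$RESOURCE cannot be restarted, initiating failover"
            # a real framework would now relocate the resource group
            # to the passive node; this exit is only a placeholder
            exit 1
        fi
    else
        restarts=0
    fi
    sleep "$CHECK_INTERVAL"
done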

The final option you have at your disposal is to cluster your database in an active/active way. The only way to do so is to use the Real Application Clusters (RAC) option for the Oracle database. RAC takes away the strict 1:1 mapping of the Oracle instance to the database. Remember that the Oracle instance is the transient part of Oracle in memory, whereas the files on disk are referred to as the database. In a RAC system, you have multiple instances running on multiple hosts accessing the same database. This makes RAC a high-availability solution; it does not protect you from disasters or human error, because there is still only one copy of the data. RAC has been designed so the cluster appears as one database to the end user, and sophisticated cache coordination between the instances makes their SGAs behave like one global cache across all the cluster nodes. RAC can be hugely beneficial to applications, if they are written with the special demands of RAC in mind. RAC has many advanced workload management features available to application developers, most of which aim at making instance failure more or less transparent to the application. In the case of a RAC instance failure, and with supporting programming in the application, there is usually very little delay in processing. RAC can also be instructed to recreate failed sessions on another cluster node.
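To give a flavor of how this transparency is exposed to clients, the following is a hedged sketch of a client-side tnsnames.ora entry using Transparent Application Failover, one of the mechanisms alluded to above. The host, port, and service names are placeholders; a real deployment would typically use the SCAN listener and role-based services defined by your DBAs.

RACDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = racsvc)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
    )
  )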

Unfortunately, many RAC features are not fully exploited in application code. This becomes even more apparent when considering an advanced RAC deployment: extended distance clusters. Extended distance clusters split the RAC setup across two data centers that can be a few miles apart. Each data center uses its own storage array to store the database, while Oracle Automatic Storage Management (ASM) keeps the arrays in sync.

In the field, it turns out that deploying a RAC system can be time consuming. Any RAC deployment requires the networking, storage, and operating system support teams to work together more closely than with the classic single-instance database. Storage needs to be shared between the RAC cluster nodes, and a RAC system needs a few more IP addresses. Finally, RAC requires a little more preparation on the operating system side to work.

To streamline the RAC rollout process, Oracle developed what it calls Engineered Systems. Following the idea of a pre-packaged and pre-configured deployment, these systems allow for quicker installation and setup.

The Oracle offerings for the relational database include the Oracle Database Appliance (ODA), as well as Exadata. The Database Appliance has recently been upgraded to the X3-2 model and comprises two dual-socket servers with 10GBit/s Ethernet and 256 GB of DRAM each. Storage in the form of spinning hard disk complemented with SSD is directly attached in a bay and connected via the Serial Attached SCSI interface.

Exadata extends the idea of the database appliance and adds intelligent storage. Many workloads benefit from Exadata’s unique features, out of which the best-known is the offloading capability to the storage layer and the I/O Resource Manager.

Exadata is a wonderful piece of engineering and, from an application's point of view, is just another RAC deployment. Exadata does not require any changes to the applications compared to traditional Real Application Clusters deployments. Depending on the configuration, an Exadata rack can have between two and eight cluster nodes in addition to its own specialized and intelligent storage. From a connectivity and high availability point of view, you do not connect to Exadata any differently than you connect to your own cluster.

Oracle is not the only hardware provider who offers such solutions. Other vendors have tightly integrated solutions in their product portfolios. Exadata-like intelligent storage, however, does not exist outside of Oracle.

Unfortunately, most Oracle applications today are not RAC aware. However, the new concept of Pluggable Databases that Oracle introduced in this release mitigates that problem.

Disaster Recovery Considerations

You just read that Real Application Clusters is not a protection from disasters. You might hear the claim that a stretched RAC will protect you from disasters if the data centers are far enough away from each other. Although that is partially true, provided your pockets are deep enough, there is still a fundamental problem with having only one database: if you ever have to restore it, you will incur an outage. And with the trend toward larger and larger databases, even the fastest technology may not give you your database back within the service level agreement. The result is a major embarrassment at best.

There are many ways to protect your environment from disaster, and, because of the importance of that subject, Chapter 9 is dedicated to it. This section is just an appetizer.

image Note  There are more replication technologies available. This section is mainly concerned with bitwise identical replication.

In the Oracle world, you have the option to use Data Guard, which, in its standard form, is included with the Enterprise Edition. The simple yet remarkably effective idea is to take the redo generated by the primary database, ship it over the network, and apply it to up to 30 standby databases. The standby databases are merely mounted and cannot be queried while they apply redo. The log shipping can be configured to be synchronous or asynchronous. At the same time, Data Guard performs many checks to ensure that the data written is not corrupt. With an additional option, called Active Data Guard, you can even query your standby databases while they receive changes from production. Backups from the standby database are possible as well, allowing you to offload those from your production database. Cautious architects configure backup solutions in all data centers, allowing for flexibility should the unthinkable happen and the primary data center blacks out. Data Guard also allows for planned role changes; it can convert a standby to a full read-write clone of your production system to test hot fixes; and, last but not least, it can be used to fail over from production in the event of a catastrophe. Regulatory requirements and/or standard operational procedures often require regular DR tests, especially before a new architecture goes live. Integrating disaster recovery into the architecture right from the start can save you from having to retrofit it later.
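As a small taste of the configuration, the following hedged sketch shows the core of the redo transport setup in SQL*Plus. The net service name and DB_UNIQUE_NAME "standby" are placeholders, and a complete setup additionally involves standby redo logs, further parameters, and ideally the Data Guard broker.

SQL> alter system set log_archive_dest_2='service=standby async
  2  valid_for=(online_logfiles,primary_role) db_unique_name=standby';

System altered.

SQL> -- on the standby database, start the managed recovery process
SQL> alter database recover managed standby database disconnect from session;

Database altered.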

image Note  Chapters 9 and 10 cover Data Guard in greater detail.

The alternative to Data Guard is to use block-level replication on the storage array, which operates on a different level. Storage-level replication exists for enterprise arrays, and most use a Fibre Channel fabric to send blocks as they are from the production array to the DR array. Other than that, the principles are quite similar. The array does not normally need to know (or care) about the data it is shipping. This is both an advantage and a disadvantage. If done properly, block replication can be the one ticket to happiness when you replicate your entire application stack: application servers, database servers, and other auxiliary components. Then, all you need to do is mount your replicated storage in your DR data center and bring your systems up. Clever management of your corporate Domain Name System can make it simple to swap the old IP addresses with the new ones.
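To illustrate the DNS aspect, here is a hedged sketch using nsupdate against a BIND-style dynamic zone. The host name, address, TTL, and key file are invented, and many organizations will use their own DNS management tooling and change process instead.

[root@drhost ∼]# nsupdate -k /etc/dns-keys/failover.key <<EOF
update delete db-prod.example.com. A
update add db-prod.example.com. 300 A 10.2.1.15
send
EOF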

Virtualization Examples

The next sections are intended to show you some examples of virtualization technologies. The technologies are chosen to provide you with an overview of what is possible on different platforms. The selection is by no means complete! The market leader in virtualization, VMware, for example, is not included in the list. It simply does not need to be introduced anymore.

The first technology presented is Oracle Solaris Zones. Zones are an example of operating system virtualization: one operating system image is the basis for multiple isolated copies of itself, each operating in complete independence. Solaris is available for x86- and SPARC-based systems. Linux Kernel Virtual Machines are conceptually similar, as are, to an extent, IBM's Logical Partitions.

Following the introduction of Oracle Solaris Zones, you are shown another popular x86-virtualization technology based on a so-called bare-metal hypervisor: Oracle VM Server for x86. With Oracle VM, you boot into a very minimalistic operating system that runs directly on the "bare" hardware and provides all access to the hardware for virtual machines.

image Note  Non-enterprise or desktop virtualization products are not covered in this chapter. They are aimed at end-users to run operating systems in isolation on their desktops, not at data centers.

Let me stress again that the selection of examples is neither exhaustive nor a definitive shortlist of the relevant virtualization technologies. All of them serve a purpose, and the listing here does not imply a ranking or measure of quality. Let's start the discussion with an overview of Oracle Solaris Zones.

Oracle Solaris Zones

Solaris zones are an example of operating system virtualization. A Solaris system, x86 and SPARC alike, can be subdivided into logical units called zones. Another name for a zone is a container, and the two terms are often used synonymously. The use of Solaris zones is a very easy and attractive way to make use of a large server that offers plenty of resources. A zone is an isolated environment and, in many respects, resembles a virtual machine from desktop virtualization products, although its management requires experience with the command line.

When you install Solaris 10 or 11, you automatically find one zone already created: the so-called global zone. It is then up to you to create additional zones, referred to as "non-global zones." This section uses the simpler term "zone" when referring to non-global zones. Just as with Oracle 12c, you have the option not to use zones at all, in which case your software will be installed in the global zone. You should, however, consider creating a non-global zone to install your software in; this can make many tasks a lot easier.

In most cases, a zone runs the same version of Solaris as the global zone. Recent versions of Solaris support what is termed a Branded Zone. A Branded Zone, or BrandZ, allows you to run Solaris 8, Solaris 9, and Red Hat Enterprise Linux 3 in a zone, in addition to Solaris 10. Solaris 11 has reduced the number of supported brands to Solaris 10, in addition to the native Solaris 11 brand. The idea behind a Branded Zone is to allow the user to migrate to Solaris 10 but keep applications that cannot possibly be ported to the current Solaris release.

An Introduction to Solaris Zones

Solaris zones have been the weapon of choice for Solaris virtualization, especially on non-SPARC servers where Logical Domains are not available. A first massive wave of consolidation rolled through data centers some years ago, after zones were first introduced. The zone concept is very compelling, elegant, and simple. Instead of having powerful servers sitting idle most of the time when there are no development or test activities, many of these environments could be consolidated into one server. In addition, a zone isolates applications in such a way that even root-owned processes in zone1 cannot view processes in zone2. Block-level replication on the array allows for very simple yet effective disaster recovery solutions. As you will see, each zone has its own root directory that can reside on block-replicated storage. Application-specific file systems can equally be storage-replicated. Before starting with a discussion of how to set up a zone, a little bit of terminology is explained in Table 4-1.

Table 4-1. Necessary Terminology for Solaris Zones

IPS: The Image Packaging System is a framework for managing software in Solaris 11. It replaces the System V Release 4 package management system used in previous versions of Solaris. The SVR4 package management system is still available for backward compatibility. So, instead of pkgadd, you will from now on have to get familiar with the pkg command and its many options.

ZFS: ZFS is the next-generation Solaris file system and supplements UFS in Solaris. It has many advanced features built in, such as volume management, high storage capacity, integrity checking, copy-on-write clones, Access Control Lists, and many more. In ZFS, you aggregate disks or LUNs into storage pools. The storage pool is conceptually similar to the Linux LVM2 volume group. It can be configured as a striped set of LUNs, as a (RAID 1) mirror, or, alternatively, as RAIDZ, which is similar to RAID 5.

You create the actual ZFS file systems on top of the storage pool. It is strongly encouraged to create hierarchies of ZFS file systems, grouping similar file systems under a common top-level container. The concept behind this hierarchy is very elegant: ZFS file systems inherit metadata properties from their parent container, such as mount points, quotas, caches, and so on.

A ZFS dataset is a generic name for a clone, an actual file system, or a snapshot. In the context of zones, a dataset is most often a file system.

Global Zone: The initial zone that exists after the operating system has been installed.

Non-Global Zone: Non-global zones are user-defined containers, similar in concept to virtual machines as known from desktop virtualization products.
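To illustrate the property inheritance mentioned for ZFS in Table 4-1, consider the following small, hedged example. The pool name, disk name, and the compression property are chosen purely for demonstration:

root@solaris:∼# zpool create demopool c1t1d0
root@solaris:∼# zfs create -o compression=on demopool/dbzones
root@solaris:∼# zfs create demopool/dbzones/zone1
root@solaris:∼# zfs get -r compression demopool/dbzones
NAME                    PROPERTY     VALUE     SOURCE
demopool/dbzones        compression  on        local
demopool/dbzones/zone1  compression  on        inherited from demopool/dbzones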

Zones can have multiple lifecycle states. The important ones from a DBA point of view are “installed” and “running.” A zone with the status “installed” is shut down, waiting to be started. The other states are of more interest to the system administrator who creates the zone.

Storage for Solaris zones merits a closer look. For mid- and higher-tier systems (user acceptance testing, production of course), it makes sense to think in terms of disaster recovery, and this is where the isolation aspect of the zones really shines.

image Note  By the way, you can apply the same principle subsequently shown to almost any virtualization technology.

You could, for example, use a dedicated storage device to copy the zones to your disaster recovery center via block-level replication. Figure 4-1 demonstrates this.

9781430244288_Fig04-01.jpg

Figure 4-1. Possible storage setup with SAN replication for zones. The same principle applies to Oracle VM, which is explained in the section "Oracle VM Server for x86"

As you can see in Figure 4-1, the LUNs used for Zones 1 and 2 are provided by a storage array. The use of an ISL (Inter-Switch Link) allows the replication of the LUNs' content to a remote data center. The standby LUNs in Array B continuously receive changes from the primary site via the inter-array replication. In the event of a disaster, the zones can be started on the secondary host and resume operations. There are no more problems with missing configuration items, such as the listener.ora file, or with different patch levels between the primary and standby databases: the replication of the Oracle binaries ensures an identical configuration.

Solaris 11 brings a few noteworthy changes to the zone model you may know from Solaris 10. You read previously that the number of supported brands for Branded Zones has been reduced: only Solaris 10 and 11 are available, and migration paths exist to move Solaris 10 zones to Solaris 11. Solaris 10 zones allowed you to use the loopback device to map certain file systems necessary for the operation of the zone, such as /lib, /platform, /sbin, and /usr, as well as others, into the non-global zone, making very efficient use of disk space. The change from the SVR4 package system used before Solaris 11 to the new IPS renders this approach unsuitable. For most configurations, a zone is now a full-root zone. Although that might seem disadvantageous at first, it removes a few headaches you could have with Oracle when installing files outside its mount point, into /usr, for example. Since /usr was often loopback-mounted from the global zone, it was a read-only file system. Copying the oraenv-related files into /usr was a problem, as was the installation of certain backup software.

Where it was possible to use a non-ZFS file system for the zone root in Solaris 10, the use of a ZFS dataset is mandatory in Solaris 11. You can still make use of UFS to store Oracle data files; there is no requirement to use only ZFS in zones. Finally, the package footprint of a newly installed zone is minimal by default. You will see later how to add packages to a zone to support an Oracle database installation.

Creation of a Solaris Zone

A zone can be created using the zonecfg command. When executing zonecfg, you specify which resources you would like to assign to the zone you are creating. Instead of boring you with all the options available, this section walks you through the creation of a zone used by an Oracle database. The root file system in the default zone requires only 1 GB of disk space, but it is recommended to set aside more than that, especially if you are planning to install Oracle; 8 GB should be sufficient for most deployments. In addition to the root file system, an Oracle mount point will be created as well. It is possible to create an Oracle installation in the global zone and present that installation to each non-global zone, but this approach has a number of disadvantages. Many users, therefore, decide to create dedicated LUNs for each zone's Oracle home. If the importance of the environment merits it, these can be added to the replication set. The subsequent example is built to the following requirements:

  • Create a virtual machine/zone with name “zone1.”
  • Assign a root volume of 8 GB of storage, using the ZFS file system /zones/zone1/root from the global zone.
  • Create a mount point for the database binaries in /u01/app on the ZFS file system /zones/zone1/app of 15 GB.
  • Create a mount point for the Oracle database files in /u01/oradata on the ZFS file system /zones/zone1/oradata of 30 GB.
  • Create a network interface with a dedicated IP address.
  • Use the ZFS storage pool zone1pool for all storage requirements of zone1 to allow for block level replication.

The example assumes you are logged into the global zone as root, and it assumes a Solaris 11 installation. The first step in the zone creation is to create the ZFS data sets. It is good practice to create a hierarchy of file systems to make best use of the metadata property inheritance. In the example, the zone's root file system is stored in /zones/zone1, whereas the Oracle installation is local to the zone, in a ZFS file system called /zones/zone1/app. Finally, the database is created in /zones/zone1/oradata. The data sets are created as follows:

root@solaris:∼# zfs create -o mountpoint=/zones/zone1 zone1pool/zone1
root@solaris:∼# zfs create zone1pool/zone1/root
root@solaris:∼# zfs set quota=8G zone1pool/zone1/root
root@solaris:∼# zfs create zone1pool/zone1/app
root@solaris:∼# zfs set quota=15G zone1pool/zone1/app
root@solaris:∼# zfs create zone1pool/zone1/oradata
root@solaris:∼# zfs set quota=30G zone1pool/zone1/oradata

The mount points for the Oracle database file systems must be set to legacy to allow them to be presented to the zone using the "add fs" command.

root@solaris:∼# zfs set mountpoint=legacy zone1pool/zone1/app
root@solaris:∼# zfs set mountpoint=legacy zone1pool/zone1/oradata

This creates all the file systems as required. You can use the zfs list command to view their status. You will also note that the file systems are already mounted, and you do not need to edit the vfstab file at all.

root@solaris:∼# zfs list | egrep "NAME|zone1"
NAME                       USED  AVAIL  REFER  MOUNTPOINT
zone1pool                  377K  24.5G    31K  /zone1pool
zone1pool/zone1            127K  24.5G    34K  /zones/zone1
zone1pool/zone1/app         31K  15.0G    31K  legacy
zone1pool/zone1/oradata     31K  30.0G    31K  legacy
zone1pool/zone1/root        31K  8.00G    31K  /zones/zone1/root
root@solaris:∼#

After the storage is provisioned, configure the zone. To do so, start zonecfg and supply the desired zone name as shown:

root@solaris:∼# zonecfg -z zone1
zone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create
create: Using system default template 'SYSdefault'
zonecfg:zone1> set zonepath=/zones/zone1/root
zonecfg:zone1> set autoboot=true
zonecfg:zone1> set bootargs="-m verbose"
zonecfg:zone1> set limitpriv="default,sys_time"
zonecfg:zone1> set scheduling-class=FSS
zonecfg:zone1> add fs
zonecfg:zone1:fs> set type=zfs
zonecfg:zone1:fs> set special=zone1pool/zone1/app
zonecfg:zone1:fs> set dir=/u01/app
zonecfg:zone1:fs> end
zonecfg:zone1> add fs
zonecfg:zone1:fs> set type=zfs
zonecfg:zone1:fs> set special=zone1pool/zone1/oradata
zonecfg:zone1:fs> set dir=/u01/oradata
zonecfg:zone1:fs> end
zonecfg:zone1> remove anet
Are you sure you want to remove ALL 'anet' resources (y/[n])? Y
zonecfg:zone1> add anet
zonecfg:zone1:anet> set linkname=net0
zonecfg:zone1:anet> set lower-link=auto
zonecfg:zone1:anet> end
zonecfg:zone1> verify
zonecfg:zone1> commit
zonecfg:zone1> exit

Although this is a very basic example, it already looks quite complex, but do not despair: it's surprisingly intuitive! The commands you execute create a new zone, set the zone's "root" file system to the previously created ZFS dataset, and then set the scheduler class and a few more attributes to make the zone play nicely with others before adding the file systems for Oracle. The anet resource defines the automatic network device and is responsible for dynamically adding the necessary virtual network card when the zone starts. In previous versions of Solaris, you had to pre-create the VNIC before starting the zone; life has been made easier. Once the configuration information is added, you verify the settings and, finally, commit them. In the following step, you then install the zone. This requires an Internet connection or a local IPS repository in the network.
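If you are unsure which IPS repository the installation will pull its packages from, you can check the configured publisher beforehand. The following is a hedged example; your output will show your own origin, for example a local repository server:

root@solaris:∼# pkg publisher
PUBLISHER                   TYPE     STATUS   URI
solaris                     origin   online   http://pkg.oracle.com/solaris/release/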

root@solaris:∼# zoneadm -z zone1 install
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20120706T093705Z.zone1.install
       Image: Preparing at /zones/zone1/root/root.
 
 Install Log: /system/volatile/install.2038/install_log
 AI Manifest: /tmp/manifest.xml.iEaG9d
  SC Profile: /usr/share/auto_install/sc_profiles/enable_sci.xml
    Zonename: zone1
Installation: Starting ...
 
              Creating IPS image
              Installing packages from:
                  solaris
                      origin:  http://pkg.oracle.com/solaris/release/
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                              167/167 32062/32062  175.8/175.8
 
PHASE                                        ACTIONS
Install Phase                            44313/44313
 
PHASE                                          ITEMS
Package State Update Phase                   167/167
Image State Update Phase                         2/2
Installation: Succeeded
 
        Note: Man pages can be obtained by installing pkg:/system/manual
 
 done.
 
        Done: Installation completed in 1817.202 seconds.
 
 
  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
 
              to complete the configuration process.
 
Log saved in non-global zone as /zones/zone1/root/root/var/log/zones/zoneadm.20120706T093705Z.zone1.install

Congratulations! The zone is now installed, but not yet started. Start it using the following command:

root@solaris:∼# zoneadm -z zone1 boot

Configuring a Zone

After the prompt returns from the execution of the last command, the zone is ready for you to log in. Two ways exist to do so: either via the zone's IP address or via the zlogin utility. Since the network within the zone is not yet configured, you need to use the zlogin utility.

root@solaris:∼# zlogin zone1
[Connected to zone 'zone1' pts/2]
Oracle Corporation      SunOS 5.11      11.0    November 2011

As you can see, the file systems are present, just as requested. Notice that the devices allocated to the Oracle mount points on the left-hand side refer to the devices in the global zone.

root@zone1:∼# df -h
Filesystem             Size   Used  Available Capacity  Mounted on
rpool/ROOT/solaris      24G   317M        24G     2%    /
/dev                     0K     0K         0K     0%    /dev
zone1pool/zone1/app     15G    31K        15G     1%    /u01/app
zone1pool/zone1/oradata
                        24G    31K        24G     1%    /u01/oradata
zone1pool/zone1/root/rpool/ROOT/solaris/var
                        24G    25M        24G     1%    /var
proc                     0K     0K         0K     0%    /proc
ctfs                     0K     0K         0K     0%    /system/contract
mnttab                   0K     0K         0K     0%    /etc/mnttab
objfs                    0K     0K         0K     0%    /system/object
swap                   6.1G   248K       6.1G     1%    /system/volatile
sharefs                  0K     0K         0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap2.so.1
                        24G   317M        24G     2%    /lib/libc.so.1
fd                       0K     0K         0K     0%    /dev/fd
swap                   6.1G     0K       6.1G     0%    /tmp

However, the system is not yet configured. This is performed using the sysconfig command. Exit your previous session, log into the zone in console mode (using zlogin -Cd zone1), enter sysconfig configure, and then reboot the zone. The console mode does not disconnect you, and, when the zone comes up again, the System Configuration Tool guides you through the setup process. After another reboot, your zone is functional. Since the installation of Oracle binaries is identical to a non-zone installation, it is not shown here.
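Because the package footprint of a new zone is minimal, you will typically add a few packages before installing Oracle. The following hedged sketch shows the IPS workflow from inside the zone; the package names are examples only, so check the Oracle installation guide for the authoritative prerequisite list:

root@zone1:∼# pkg install system/manual       # man pages, as suggested in the install log
root@zone1:∼# pkg search -r unzip             # search the configured repository for a package
root@zone1:∼# pkg install compress/unzip      # hypothetical prerequisite for the Oracle installer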

Monitoring Zones

An invaluable tool for the system administrator or power-DBA is zonestat. Very often, it is difficult to work out where time is spent on a multi-zone system. All the DBA sees is a high load average, but it is not always apparent what is causing it. Enter zonestat, a great way of monitoring the system. It should be noted, however, that the output of the command, when executed in a non-global zone, will not report the resource usage of other individual zones.

When invoked on the global zone, however, you get an overview of what is happening. When invoked in its most basic form from the global zone, you see the following output:

root@solaris:∼# zonestat 5 2
Collecting data for first interval...
Interval: 1, Duration: 0:00:05
SUMMARY                   Cpus/Online: 8/8   PhysMem: 8191M   VirtMem: 9.9G
                    ---CPU----  --PhysMem-- --VirtMem-- --PhysNet--
               ZONE  USED %PART  USED %USED  USED %USED PBYTE %PUSE
            [total]  8.00  100% 1821M 22.2% 2857M 27.9%   278 0.00%
           [system]  0.00 0.00% 1590M 19.4% 2645M 25.8%     -     -
              zone1  8.00  100% 83.1M 1.01% 75.7M 0.73%     0 0.00%
             global  0.01 0.19%  148M 1.80%  136M 1.33%   278 0.00%
 
Interval: 2, Duration: 0:00:10
SUMMARY                   Cpus/Online: 8/8   PhysMem: 8191M   VirtMem: 9.9G
                    ---CPU----  --PhysMem-- --VirtMem-- --PhysNet--
               ZONE  USED %PART  USED %USED  USED %USED PBYTE %PUSE
            [total]  7.98 99.8% 1821M 22.2% 2858M 27.9%   940 0.00%
           [system]  0.00 0.00% 1590M 19.4% 2645M 25.8%     -     -
              zone1  7.97 99.6% 83.1M 1.01% 75.7M 0.73%     0 0.00%
             global  0.01 0.22%  148M 1.80%  136M 1.33%   940 0.00%

The previous system has the global zone and one user zone started and running, while a load generator in zone1 stresses all eight CPUs. You have the option to tune in on individual zones, as well as to request more detailed information. In the previous output, you can see that zone1 is very busy; physical memory and virtual memory, as well as network traffic, are no reason for concern. What should ring alarm bells is the %PART column: the total CPU resources are almost 100% in use, meaning that zone1 consumes all of the host's processing capacity. Time to investigate!

Zonestat can also be used to perform longer-term monitoring and record statistics in the background. To gather statistics over a 12-hour period with data points every 30 seconds and a summary report for high usage every 30 minutes, you could use the following command:

root@solaris:∼# zonestat -q -R high 30s 12h 30m

Deleting a Zone

Deleting a zone requires care: the subsequent commands also remove the zone's ZFS datasets. In other words, as soon as you confirm the operation, the zone is gone. You should have a valid and tested backup before proceeding. If you built it according to the suggestion provided (all user data on separate datasets), then you won't lose everything, but you do still have to go through a painful recovery of the zone itself.

So if you are absolutely sure that you want to remove the zone, shut it down first, wait a couple of hours to see if anyone complains, and then delete it. It is not unheard of that a zone was officially declared not to be in use when someone actually was using it. The shutdown command can be initiated from within the zone itself, or via the global zone:

root@solaris:∼# zoneadm -z zone1 shutdown
root@solaris:∼# zoneadm list -civ
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              solaris  shared
   - zone1            installed  /zones/zone1                   solaris  excl

The zone is now shut down. Last chance to check if the zone is not needed! If you are sure, then proceed by removing the zone data. This must be performed from the global zone as root. Consider this example:

root@solaris:∼# zoneadm -z zone1 uninstall
Are you sure you want to uninstall zone zone1 (y/[n])? y
Progress being logged to /var/log/zones/zoneadm.20120704T114857Z.zone1.uninstall
root@solaris:∼# zoneadm list -civ
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              solaris  shared
   - zone1            configured /zones/zone1                   solaris  excl

As you can see, the zone is no longer installed, but rather configured. To really get rid of it, you need to use the zonecfg tool again with the delete option:

root@solaris:∼# zonecfg -z zone1 delete
Are you sure you want to delete zone zone1 (y/[n])? y
root@solaris:∼# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     5.11G  10.5G    39K  /rpool
rpool/ROOT                2.01G  10.5G    31K  legacy
rpool/ROOT/solaris        2.01G  10.5G  1.54G  /
rpool/ROOT/solaris/var     476M  10.5G   474M  /var
rpool/dump                2.06G  10.6G  2.00G  -
rpool/export                98K  10.5G    32K  /export
rpool/export/home           66K  10.5G    32K  /export/home
rpool/export/home/martin    34K  10.5G    34K  /export/home/martin
rpool/swap                1.03G  10.6G  1.00G  -

As you can see, even the ZFS datasets for the zone have been removed.

Further Reading

If you are interested in exploring zones in more detail, then you might find the following references useful:

  • Oracle® Solaris Administration: Oracle Solaris Zones, Oracle Solaris 10 Zones, and Resource Management
  • Oracle® Solaris Administration: ZFS File System
  • Oracle® Solaris Administration: Network Interfaces and Network Virtualization
  • Solaris 11 manual page: solaris(5)
  • Solaris 11 manual page: brands(5)
  • Solaris 11 manual page: zones(5)

Oracle VM Server for x86

Oracle entered the virtualization market relatively late, but it has since been aggressively promoting and developing its products. One of the company's first virtualization products was Oracle VM, which is today known as Oracle VM Server for x86. It is based on the Xen type 1 bare-metal hypervisor and, unsurprisingly, runs on x86 hardware. Since the inception of Oracle VM, a number of products have been branded Oracle VM, making it important to add more context as to which one is meant. At the time of this writing, some of the additional products with Oracle VM in their name were:

  • Oracle VM VirtualBox
  • Oracle VM Server for SPARC
  • Oracle VM Manager

Oracle VM VirtualBox was initially developed by innotek GmbH, which was acquired by Sun Microsystems in 2008, only to be taken over by Oracle in 2010. VirtualBox is a desktop virtualization product (that is, it runs on top of a desktop operating system). VirtualBox is a popular solution and is available under a liberal license.

Oracle VM Server for SPARC is the new name for Sun Logical Domains, or LDOMs. Currently designed mainly for Chip MultiThreading technology, such as that employed in Oracle's T-Series hardware, Oracle VM Server for SPARC promises lower overhead, since the virtualization technology is already included in the CPU. Logical Domains in this respect are not to be confused with Dynamic Domains, which are only available on M-Series hardware.

Oracle VM is an upcoming competitor in the x86 virtualization world, but competition in this market segment is tough. The version covered in this section is Oracle VM 3.1.1, which is current at the time of this writing.

Introduction to the Xen Hypervisor

The Xen hypervisor is the result of a research project at the University of Cambridge. As a type 1 hypervisor, it does not require a host operating system to run; rather, it starts on bare metal. Although available in many commercial offerings, the core of Xen is a true open source solution used to run all kinds of workloads, including Amazon's Elastic Compute Cloud. Thanks to a lot of development effort, Xen is able to run a large number of operating systems, including Linux, Solaris, Windows, and even some BSD derivatives.

Xen was different from other virtualization solutions available at the time it was developed. The majority of virtualization solutions at that time used binary translation to execute multiple copies of operating systems on the same hardware. This is termed “full virtualization” in Xen. With full virtualization, guests do not require any modification, and your Windows 2000 CD-ROM could be used to boot virtual hardware almost identically to physical hardware.

Binary translation results in a lot of CPU overhead, since operations of the guest operating system must be translated by the host and implemented in a way that is safe for running multiple operating systems at the same time. In addition to translating CPU instructions, the virtual machine monitor also has to keep track of the memory used both within the guest and on the host. Finally, some hardware has to be emulated. Before the advent of hardware-assisted virtualization technologies, these three tasks were done in very cleverly engineered software, and a performance penalty often applied when running virtualized workloads.

Xen, conversely, used a different approach, called para-virtualization, as illustrated in Figure 4-2. In a para-virtualized environment, the guest operating systems are aware that they are being virtualized, allowing for tighter integration with the hypervisor and resulting in better efficiency. For obvious reasons, this requires changes to the operating system, and it is no surprise that Linux became the most significant platform for Xen deployments.

9781430244288_Fig04-02.jpg

Figure 4-2. Conceptual view of the Xen architecture with para-virtualized and hardware-virtualized machines

The Xen architecture is built around the hypervisor, the control domain, and the user domains. The hypervisor is a small piece of software governing access to the hardware of the underlying host. The control domain or dom0 in Xen parlance is a para-virtualized Linux that uses the Xen hypervisor to access the hardware and manages the guests.

One of the biggest benefits offered by Xen's para-virtualization, even more so when it was released, is in the area of I/O operations. The I/O devices ("disks") in a para-virtualized environment merely link to the I/O drivers in the control domain, eliminating the need for I/O virtualization. However, many other important aspects of the operating system also benefit from para-virtualization.

Initially, Xen was not developed as part of the mainline Linux kernel, but that has changed. Two distinct pieces of software are needed to use Xen: the Xen software itself and changes to the kernel. Before the code merge described subsequently in more detail, the Xen software, including patches to the kernel, had to be applied before a Linux system could boot as a dom0. Red Hat Enterprise Linux 5 supports booting a dom0 kernel based on 2.6.18.x, but has abandoned that approach with the current release, Red Hat Enterprise Linux 6. SuSE Linux has continued to include the patches in its kernels up to SuSE Linux Enterprise 11. The so-modified kernel was said to be "xenified," and it was different from the non-Xen kernel.

At the same time, efforts were underway to include the Xen patches in the mainline Linux kernel. There were two main lines of work: making Linux work as a guest domain (domU) and allowing it to boot as a control domain (dom0). Both efforts are grouped under the heading "pvops," or para-virtualized ops. The aim of the pvops infrastructure for Xen was to provide a single kernel image capable of booting into the Xen and non-Xen roles. A lot of code had to be rewritten to comply with the strict coding quality requirements of the mainline kernel. Starting with kernel 2.6.24, domU support was gradually added with each release, and the xen.org wiki states that, as of 2.6.32.10, domU support should be fairly stable. It was more problematic to get dom0 support into the upstream kernel, and many times the Xen code was not allowed to exit the staging area. The breakthrough came with kernel 2.6.37, explained in an exciting blog post by Konrad Rzeszutek Wilk: "Linux 3.0 – How did we get initial domain (dom0) support there?" Although 2.6.37 offered very limited support to actually run guests, at least it booted as dom0, and without the forward-ported patches from 2.6.18. With Linux kernel 3.0, dom0 support was workable and could be included in Linux distributions.

There was a time, shortly after Red Hat announced it would not support a Xen dom0 in Enterprise Linux 6, when the future of Xen looked bleak. Thanks to many talented and enthusiastic individuals, however, Xen is back. And it did not come a moment too soon: the advances in processor technology have made the full-virtualization approach a worthy contender to para-virtualization. Today, it is claimed that the virtualization overhead for fully virtualized guests is less than 10%.

During its development, Xen has incorporated support for what it calls hardware virtualization. You previously read that para-virtualized operating systems are made aware of the fact that they are virtualized. Para-virtualization requires access to, and modification of, the operating system code. Closed-source operating systems, such as Windows, therefore, could initially not be run under Xen, but this has been addressed with Xen version 3.x. Thanks to hardware virtualization support in processors, Xen can now run what it refers to as Hardware Virtualized Machines (HVM). Virtualization support in x86-64 processors is available from Intel and AMD. Intel calls its extensions VT-x; AMD calls its support AMD-V, although it is also often referred to as "Secure Virtual Machine" in BIOS settings and literature.
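On Linux, a quick way to check whether the processor advertises these extensions is to look for the vmx (Intel) or svm (AMD) CPU flags. This is a hedged example; the count simply reflects the number of CPU threads reporting the flag on the test machine:

[root@oraclevmserver1 ∼]# egrep -c '(vmx|svm)' /proc/cpuinfo
8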

As part of the initiatives to do more work in the processor, another set of bottlenecks has been tackled by the chip vendors. Recall from the previous discussion that the memory structures of the guests had to be carefully shadowed by the hypervisor, causing overhead when modifying page tables in the guest. Luckily, this overhead can be reduced by using Extended Page Tables (EPT) in Intel processors or the AMD equivalent, called Rapid Virtualization Indexing (RVI). Both technologies address one of the issues faced by the virtual machine monitor, or hypervisor: keeping track of memory. In a native environment, the operating system uses a structure called a page table to map virtual to physical addresses. When a request is received to map a logical to a physical memory address, a page table walk, or traversal of the page tables, occurs, which is often slow. The memory management unit uses a small cache, named the Translation Lookaside Buffer (TLB), to speed up frequently used page lookups.

In the case of virtualization, it is not that simple. Multiple operating systems execute in parallel on the host, each with its own memory management routines. The hypervisor needs to tread carefully and maintain the illusion for each guest that it is running on "real" hardware with exclusive access to the page tables. The guest, in turn, believes it has exclusive access to the memory set aside for it in the virtual machine definition. The hypervisor needs to intercept and handle the memory-related calls from the guest and map them to the actual physical memory on the host. Keeping the guest's and the hypervisor's view of memory synchronized can become an overhead for memory-intensive workloads.

Extended Page Tables and Rapid Virtualization Indexing try to make the life easier for the hypervisor by speeding up memory management with hardware support. In an ideal situation, the need for a shadow page table is completely eliminated, yielding a very noticeable performance improvement.

The advances in hardware have led to the interesting situation that Xen HVM guests can perform better under certain workloads than PVM guests, reversing the initial situation. HVM guests can struggle, however, when it comes to device access. Linux benefits from so-called PV-on-HVM drivers, which are para-virtualized drivers that completely bypass the hardware emulation layer. Less code path to traverse provides better performance for disk and network access. Para-virtualized drivers also exist for Windows and other platforms. This is not to say that you should start hardware-virtualizing all your operating systems, but there might be workloads, especially those in which the guest's page tables are heavily modified, where HVM with PV-on-HVM drivers can offer a significant performance boost.
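From within a Linux HVM guest, a hedged way to verify that the para-virtualized front-end drivers are actually in use is to look for the Xen front-end kernel modules. Module names, sizes, and usage counts shown here are illustrative and may differ between kernel versions:

[root@guest ∼]# lsmod | egrep 'xen_blkfront|xen_netfront'
xen_netfront           25462  0
xen_blkfront           21998  3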

Oracle VM Server for x86 Architecture Overview

Oracle VM in its current version, 3.2.2, is based on a server and a management component, plus auxiliary infrastructure, such as external storage. One or more Oracle VM Servers run Xen 4.1.3 and Oracle's Unbreakable Enterprise Kernel 2 (UEK 2). The installation image for Oracle VM Server is very small, only about 220 MB. It installs directly on the bare metal and, in the interactive installation, asks you only a few questions before transferring the installation image to the hard disk. Each Oracle VM Server has an agent process used to communicate with the central management interface, called Oracle VM Manager.
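If you want to confirm the hypervisor version on an Oracle VM Server, a hedged, read-only check from the command line might look like the following; the exact values depend on your installation:

[root@oraclevmserver1 ∼]# xm info | egrep 'xen_major|xen_minor|xen_extra'
xen_major              : 4
xen_minor              : 1
xen_extra              : .3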

Multiple Oracle VM Servers can logically be grouped in a server pool. You get the most benefit out of a server pool configuration if the server pool makes use of shared storage in the form of Oracle's Cluster File System 2 (OCFS2). A cluster file system has been chosen to allow guest domains to be migrated dynamically from one server in the pool to another. Consider Figure 4-3 for a schematic overview of the architecture.

9781430244288_Fig04-03.jpg

Figure 4-3. Oracle VM architecture. Agent communication is still used between the servers in pool 2 and the management host, but the corresponding arrows have been omitted here

The Oracle VM Manager uses the Application Development Framework (ADF) to render the graphical user interface. The application server is WebLogic 11g, and an Oracle database optionally serves as the repository; the simple installation option provides a MySQL 5.5 database as the repository instead. Either way, do not forget to include both components in your backup strategy. Oracle VM Manager is the central tool for managing the entire infrastructure for Oracle VM. From here, you discover the servers, provision storage, configure networks, and so on. It is strongly recommended not to modify the Oracle VM Servers directly; the management interface and the hosts may otherwise get out of sync, leading to errors that are difficult to diagnose.

image Note  Oracle Enterprise Manager Cloud Control 12c has support for Oracle VM as well, but this is out of scope of this chapter.

After the initial installation of Oracle VM Server and Manager, the workflow to create a virtual machine in Oracle VM Manager includes the following steps:

  1. Discover Oracle VM Server(s).
  2. Create a network to access the virtual machines.
  3. Configure storage.
  4. Create a new server pool.
  5. Create a shareable repository to store installation files and virtual machines.
  6. Create the virtual machine.

Let’s look at these in more detail. Although the next steps may seem quite complex and labor intensive, in fairness, many of them are one-time setup tasks.

Creating a Virtual Machine

The example to follow assumes that the infrastructure is already in place. In other words, you already have configured SAN storage and networking. Oracle VM defines different types of networks, each of which can be assigned to a network port. Link aggregation is also supported and can be configured in the management interface. In the following example, the following networks have been used:

  • 192.168.1.0/24: management network for server management, cluster management, and VM live migration. This is automatically created and tied to bond0 on the Oracle VM Servers. If possible, bond0 should be configured with multiple network ports.
  • 192.168.100.0/24: used for iSCSI traffic and bound to bond1.

Shared storage is necessary to create a clustered server pool and to allow for migrations of virtual machines. It is also possible to define server pools with local storage only, but, by doing so, you do not benefit from many advanced features, especially VM migration.

The first step after a fresh installation of Oracle VM Manager and a couple of Oracle VM Servers (at least one must be available) is to connect to the management interface, which is shown in Figure 4-4.

9781430244288_Fig04-04.jpg

Figure 4-4. Oracle VM Manager main screen

You start by discovering a new server, which is then moved to the group "unassigned servers." Let's leave the servers alone for the moment. You need to make Oracle VM Manager aware of the storage provisioned. In this example, iSCSI is accessible via the storage network on 192.168.100.1. Support for other storage types exists as well; prominent examples include NFS and Fibre Channel SANs. When using iSCSI, you can create a new network on the "Networking" tab and dedicate it to storage traffic. If you have plenty of Ethernet ports on the physical hardware, you should consider bonding a few of them for resilience and potentially better performance. The option to create a network bond is on the "Servers and VMs" tab. Select the server on which to create a bonded interface, then change the perspective to "Bond Ports". A click on the plus sign allows you to create a new bonded port in a simple-to-use wizard. The bonded interface must be created before you create the new network. Two new networks have been created in Figure 4-5, matching the initial description.

9781430244288_Fig04-05.jpg

Figure 4-5. Oracle VM Manager 3 with the new networks in place

The definition in the management interface is propagated to all VM Servers, visible in the output of ifconfig:

[root@oraclevmserver1 ∼]# ifconfig | grep ^[0-9a-z] -A1
bond0     Link encap:Ethernet  HWaddr 52:54:00:AC:AA:94
          inet addr:192.168.1.21  Bcast:192.168.1.255  Mask:255.255.255.0
--
eth0      Link encap:Ethernet  HWaddr 52:54:00:AC:AA:94
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
--
eth1      Link encap:Ethernet  HWaddr 52:54:00:69:E3:E7
          inet addr:192.168.100.21  Bcast:192.168.100.255  Mask:255.255.255.0
--
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0

In the output, bond0 is the management network and eth1 is the storage network. Most production systems would use bonded devices for both. The difference is a simple mouse click when creating the interface, where you can choose between a single port and an aggregated interface. With the networks in place, you can begin to add storage to the configuration.

Begin by switching to the “Storage” tab and expand the list of “SAN Servers” in the left navigation pane. Then click the “Discover SAN Server” button. Remember, the example assumes a simple setup using iSCSI. Although the documentation makes you believe that iSCSI performs better than NFS, this is too much of a generalization. You should carefully evaluate all types of network storage for performance, resilience, and ease of administration before deploying such a scenario into production. With the advent of 10GBit/s Ethernet, the situation somewhat eases, but iSCSI still remains a protocol with significant overhead. Fibre Channel, conversely, is a well-established alternative and has proven its worth in thousands of installations.

Back in the SAN discovery wizard, you need to follow a few steps to make use of the iSCSI targets provided by your device. On the first page of the wizard, you supply a name and description for the new iSCSI target, and the target type needs to be changed to iSCSI. On the second page of the wizard, provide the name of the iSCSI host. If you are using a dedicated IP storage network, ensure that the iSCSI targets are exported via it, not via the management interface. In case you are using firewalls on the iSCSI host, ensure that port 3260 is open to allow storage discovery.

In the next step, you define administration servers for the group before you govern access to the iSCSI target. In the example, oraclevmserver1 and oraclevmserver2 both have access to the shared storage. To grant it, edit the default access group and add the iSCSI initiators to the list of servers allowed to start iSCSI sessions. Once the "SAN Server" configuration is finished, the Oracle VM Server discovers the targets to which it has access. The process of target discovery may last a minute or longer. After this, a new entry with the name chosen during the storage discovery appears in the tree structure underneath "SAN Servers." By default, it lists all the visible iSCSI targets. If your device does not appear in the list, ensure that the authentication credentials are correct and/or that the ACL on the iSCSI device allows you to perform the discovery.
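For troubleshooting only (remember that configuration changes should be made through Oracle VM Manager, not on the servers directly), a read-only discovery check from one of the Oracle VM Servers can confirm that the portal responds. This is a hedged example, and the target IQN is made up:

[root@oraclevmserver1 ∼]# iscsiadm -m discovery -t sendtargets -p 192.168.100.1
192.168.100.1:3260,1 iqn.2013-01.com.example:storage.lun1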

Next, add the unassigned Oracle VM Servers to a server pool. If you have not yet discovered the servers, you can do so in the “Servers and VMs” tab. Clicking the “Discover Server” icon opens a wizard that lets you discover servers based on IP address. The newly discovered servers are initially placed into said “unassigned servers” pool before they can be used.

Back in the tab “Servers and VMs,” click on the “Create Server Pool” button to begin this process. The wizard is self-explanatory, except for two options. Be sure to tick the “clustered storage pool” check box, and then use “Physical Disk”. Click on the magnifying glass to pick a LUN. You discovered storage in the previous steps using the “Storage” tab in the OVM Manager.

In the following step, add all the relevant Oracle VM Servers to the pool. You are nearly there! Only one more step is required in preparation for the virtual machine creation, and that is the repository creation. Change to the "Repositories" tab, and click the green plus sign to create one. The repository is used for your VMs, templates, and ISO files, plus any other auxiliary files needed for running virtual machines. Select a previously detected storage unit of at least 10 GB (preferably more) and create the repository. Notice that the repository is shared across the servers in the pool (that is, using OCFS2). Fill in the values and present it to all Oracle VM Servers. In the background, you have created two OCFS2 file systems, which are mounted on the Oracle VM Servers in the server pool:

[root@oraclevmserver ∼]# mount | grep ocfs
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/mapper/1IET_00010001 on /poolfsmnt/0...2b4 type ocfs2 (rw,_netdev,heartbeat=global)
/dev/mapper/1IET_00020001 on /OVS/Repositories/0...aa type ocfs2 (rw,_netdev,heartbeat=global)
[root@oraclevmserver ∼]#

The cluster configuration, including all settings in /etc/ocfs2/cluster.conf, has been applied transparently by Oracle VM Manager. Finally, you can create a virtual machine. In the example, a new virtual machine based on Oracle Linux 6 is created as a para-virtualized guest.

image Note  Oracle also offers pre-built templates for many scenarios from http://edelivery.oracle.com/linux.

Switch to the “Servers and VMs” tab and highlight your server pool in the tree on the left side of the screen. Change the perspective to “Virtual Machines”, then right-click the pool and select “Create Virtual Machine”. In the wizard that appears, enter the information shown in Table 4-2.

Table 4-2. The “Create Virtual Machine” Wizard

For each step of the wizard, provide the following information:

Step: How do you want to create your Virtual Machine?
  • Create a new VM, then click “Next” to continue.

Step: Create Virtual Machine
  • Server: Any
  • Repository: your repository name
  • Name: OracleLinux6
  • Enable High Availability: tick this box to allow the VM to migrate to another server in the server pool should the original host fail unexpectedly.
  • Description: enter an optional description
  • Operating System: Oracle Linux 6
  • Keymap: whichever suits your keyboard best
  • Domain Type: Xen PVM
  • The remaining values can be set depending on your Oracle VM Server hardware; a minimal installation of Oracle Linux 6 is fine with the defaults. Click “Next” to continue.

Step: Set up Networks
  • Assign one or more unassigned VNICs (virtual network interface cards) to the virtual machine from the drop-down labeled “Unassigned VNICs”. If there are no VNICs left, click “Create VNICs” to create some. Ensure the network is set to the virtual machine network.

Step: Arrange Disks
  • Select “Virtual Disk” from the drop-down menu in slot 0, then click the green plus sign in the “Actions” column to create a new disk. It is usually a good idea to name the disk in a way that allows you to identify it as belonging to a certain VM. The size should be at least 4 GB for a minimal installation of Oracle Linux 6.

Step: Boot Options
  • Add both network and disk here. Enter a network boot path pointing to the repository location for Oracle Linux 6; this is only possible for para-virtualized guests. See Chapter 5 for more information on how to make the installation tree of Oracle Linux available over the network. HVM guests require the installation ISO image to be available in the repository.

Click “Finish” to create the virtual machine description and begin the installation of the virtual machine.

Oracle VM Manager will go off and create the new virtual machine metadata on one of the Oracle VM Servers. This is exactly the same way you would create domUs in Xen from the command line, enriched with some extra information needed by Oracle VM Manager. The configuration files for virtual machines reside in the repository, which is mounted on all servers in the pool under /OVS/Repositories using the numeric repository ID. The subdirectory VirtualMachines contains another ID for each domU. Finally, the vm.cfg file contains the information:

vif = ['mac=00:21:f6:00:00:12,bridge=0004fb0010609f0']
OVM_simple_name = 'ol62pvm'
disk = ['file:/OVS/Repositories/0...a/VirtualDisks/0...e.img,xvda,w']
uuid = '0004fb00-0006-0000-5506-fc6915b4897b'
on_reboot = 'restart'
boot = 'c'
cpu_weight = 27500
memory = 1024
cpu_cap = 0
maxvcpus = 1
OVM_high_availability = False
maxmem = 1024
OVM_description = ''
on_poweroff = 'destroy'
on_crash = 'restart'
bootloader = '/usr/bin/pygrub'
name = '0004fb00000600005506fc6915b4897b'
guest_os_type = 'linux'
vfb = ['type=vnc,vncunused=1,vnclisten=127.0.0.1,keymap=en-gb']
vcpus = 1
OVM_os_type = 'Oracle Linux 6'
OVM_cpu_compat_group = ''
OVM_domain_type = 'xen_pvm'

Notice the entries beginning with OVM—they are specific to Oracle VM. With the domU definition finished, highlight it and click “Start.” The console button opens a VNC session if you have a VNC viewer installed on the client. From then on, install Linux as you normally would. Again, refer to Chapter 5 for guidance on the Oracle Linux 6 installation.

Further Reading

Xen-based para-virtualization is a fascinating topic, far too broad to explain in its entirety here. Since Oracle VM Server for x86 is based, to a large extent, on the free hypervisor, it makes a lot of sense to understand Xen before diving into Oracle’s implementation details. One of the best resources when it comes to Xen is the home of Xen: http://www.xen.org. The website has a very useful wiki as well, accessible at wiki.xen.org. Additional useful resources include:

Final Thoughts on Virtualization

The previous sections have given you a high-level overview of different approaches to virtualization. There is most likely a virtualization initiative underway in your company, but most users have so far only virtualized development and UAT/pre-production environments. The most critical environments, production and the corresponding disaster recovery systems, have most often remained untouched.

During the research for this chapter, a large international company announced that it had completely virtualized its SAP environment, demonstrating that virtualization can work even for production. There are some common pitfalls, however. Some users have gone too far with their virtualization strategy. There is a reported case in which one Unix domain containing one Oracle home plus nine databases was virtualized into nine different virtual machines on x86, each with its own operating system, its own Oracle home, and only one database. Although the approach has certainly helped increase isolation between the environments, it has also increased the management overhead. In such a scenario, it is even more important to either not create that many virtual machines or, if that cannot be avoided, to have a suitable management strategy for mass-patching operating systems and databases. Thankfully, Oracle Database 12c and the multi-tenancy option give administrators similar levels of isolation without the overhead of maintaining so many copies of the operating system.

If done right, virtualization offers tremendous potential and can challenge traditional clustered solutions for uptime and availability. The established companies in the market have very mature and elaborate management frameworks, allowing administrators to control their environments. Virtualization can also help contain what the industry refers to as “server sprawl.” Recently, vendors have also been pushing into the new “your product here”-as-a-service market. Times are interesting!

High Availability Example

You read in the introduction that there are additional ways to protect a single instance Oracle database from failure. We have just thoroughly covered the virtualization aspect, mainly because virtualization is such a big driver of innovation in the industry. In addition to virtualization, you have the option to use more than one physical server to protect the instance from hardware failures. The following section provides more information on how to achieve this goal.

When scrutinized, strict high availability requirements often turn out to be compatible with an active/passive cluster. However, if your requirements are such that even a short interruption of service is intolerable, then you need to look at the Real Application Clusters option. Similar to the previous section, this section presents an example for an active/passive cluster based on Oracle Clusterware. It is assumed that you have some background knowledge about clustering.

Oracle Clusterware HA Framework

Oracle Clusterware is Oracle’s cluster manager; it allows a group of physically separate servers to combine into one logical server. The physical servers are connected by a dedicated private network and are attached to shared storage. Oracle Clusterware consists of a set of additional operating system processes and daemons, running on each node in the cluster, that use the private network and shared storage to coordinate activity between the servers. Oracle has renamed the foundation for RAC and the high availability framework from Cluster Ready Services to Clusterware and finally to Grid Infrastructure. Throughout the section, the terms Clusterware and Grid Infrastructure are used interchangeably. This section briefly describes Oracle Clusterware, discussing its components, their functionality, and how to use it as a cluster framework. In this context, Clusterware serves the following purposes:

  • Monitor each cluster node’s health and take corrective action if a node is found to be unhealthy by means of cluster membership management.
  • Monitor cluster resources, such as networks, file systems, and databases.
  • Automatically restart cluster resources on the surviving node if necessary.

For a high level overview, consider Figure 4-6.

9781430244288_Fig04-06.jpg

Figure 4-6. Schematic overview of Clusterware in an active passive configuration

Figure 4-6 shows the most common setup for Clusterware in an active/passive configuration. Please note that the figure is greatly simplified to focus on the important aspects of the architecture. Beginning at the lower part of the figure and moving up, you see a storage array providing LUNs to boot the servers from, as well as storage for the Oracle installation and the database itself. Connectivity from the servers to the closest Fibre Channel switches is multipathed to prevent single points of failure.

The file system employed for storing the operating system and Grid Infrastructure does not really matter; it could be JFS, ZFS, or ext3/4, depending on your flavor of UNIX. What matters, though, is that you should ideally use Oracle’s Automatic Storage Management (ASM) for the database binaries and the database files, and you need to perform a clustered installation across both nodes. The simple reason is that a shared database home reduces the possibility of missing configuration steps on the passive node. The use of ASM prevents time-consuming file system consistency checks when mounting the database files on the passive node, a common problem with clusters that do not use cluster-aware file systems.

The database can, by definition, only be started on one node at a time. For license reasons, it must not be a cluster database. In normal operations, the database is mounted and opened on the active node. Should the Clusterware framework detect that the database on the active node has failed—perhaps caused by a node failure—it will try to restart the database on the same node. Should that fail, the database is restarted on the passive node. During that time, all user connections are aborted. Thanks to the cluster logical volume manager (ASM), there is no requirement to forcefully unmount any file system from the failed node and to mount it on the now active node. The failed database can almost instantly be started on the former passive node. As soon as instance recovery is complete, users can reconnect. Oracle introduced a very useful feature in Clusterware 10.2, called an application virtual IP address. Such an address can be tied to the database resource in the form of a dependency and migrate with it should the need arise. Application VIPs must be created and maintained manually, adding a little more complexity to the setup.
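
As an illustration only, an application VIP can be created with the appvipcfg utility that ships with Grid Infrastructure; the network number, address, and VIP name below are hypothetical and must be adapted to your environment:

# hypothetical values: public network number 1, a spare address, a made-up VIP name
[root@rac12node1 ~]# /u01/app/12.1.0.1/grid/bin/appvipcfg create -network=1 -ip=192.168.1.99 -vipname=apclu-vip -user=root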

An easier-to-implement alternative is available in the form of the virtual IP address, which is automatically created during the Clusterware installation. These so-called VIPs exist on every cluster node and have been implemented to avoid waiting for lengthy TCP timeouts. If you try to connect to a net service name and the server has crashed, you may have to wait a long time for the operating system to report a timeout. The Clusterware VIP is a cluster resource, meaning it can be started on the passive node in the cluster to return a “this address no longer exists” message to the requestor, speeding up connection requests. A typical net service name definition for an active/passive cluster looks like this:

activepassive.example.com =
(DESCRIPTION=
  (ADDRESS_LIST=
    (FAILOVER=YES) (LOAD_BALANCE=NO)
    (ADDRESS=(PROTOCOL=tcp)(HOST=activenode-vip.example.com)(PORT=1521))
    (ADDRESS=(PROTOCOL=tcp)(HOST=passivenode-vip.example.com)(PORT=1521)))
  (CONNECT_DATA=
    (SERVICE_NAME=activepassive.example.com)
  )
)

This way, the active node, which should be up and running most of the time, is the preferred connection target. In case of a node failover, however, the active node’s VIP migrates to the passive node and immediately rejects incoming connection requests, prompting the client to try the next address in the list. This way, no change is required on the application servers in case of a node failure. Once the former active node is repaired, you should relocate the database back to its default location.

Installing a Shared Oracle RDBMS Home

This section assumes that Clusterware is already installed on your servers. The first step in creating the shared Oracle RDBMS home is to create the ASM Cluster File System (ACFS). ACFS is a POSIX-compliant file system created on top of an ASM disk group. In many scenarios, it makes sense to create a dedicated disk group for the ACFS file system (the keyword is block-level replication). Once the new ASM disk group is created, you can create a so-called ASM volume on top of it. The volume is managed internally by an entity called the ASM Dynamic Volume Manager (ADVM). Think of ADVM as a logical volume manager. The ASM dynamic volume does not need to be the same size as the ASM disk group, and ADVM volumes can be resized online, allowing for corrections if you are running out of space.
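
For example, assuming the mount point that is created later in this section, a mounted ACFS file system and its underlying ADVM volume could be grown online with acfsutil; the size is, of course, just an illustration:

# grow the ACFS file system (and its ADVM volume) by 2 GB while it stays mounted
[root@rac12node1 ~]# acfsutil size +2G /u01/app/oracle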

The choice went to ACFS for the simple reason that it guarantees a consistent configuration across nodes. In many active/passive clusters, changes are not properly applied to all nodes, leaving the passive node outdated and unsuitable for role transitions. It is very often the little configuration changes—maybe an update of the local tnsnames.ora file to point to a different host—that can turn a simple role reversal into a troubleshooting nightmare. If there is only one Oracle home, configuration changes cannot accidentally be left out on the passive cluster node.

Once the ADVM volume is created, you need to create a file system on it. ADVM supports ext3 but, since this file system will be mounted concurrently on both cluster nodes, it must be a cluster file system. Oracle provides one in the form of ACFS—the ASM Cluster File System. First, you need to create the ASM disk group for the ASM Cluster File System. Note that you need to set a number of compatibility attributes on the disk group to allow for the creation of an ACFS:

SQL> create diskgroup acfs external redundancy
  2  disk 'ORCL:ACFS_001'
  3  attribute 'compatible.asm'='12.1', 'compatible.advm'='12.1';
 
Diskgroup created.

Mount this disk group in all cluster nodes. The easiest way to do so is to use srvctl to start the diskgroup:

[grid@rac12node1 ∼]$ srvctl start diskgroup -g ACFS
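
A quick way to confirm that the disk group is indeed mounted on all cluster nodes is to query its Clusterware resource; for example:

# list the nodes on which the disk group resource is currently running
[grid@rac12node1 ~]$ srvctl status diskgroup -g ACFS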

Next, create an ACFS volume that will later contain the file system. You can use either the graphical user interface—ASM Configuration Assistant or ASMCA—or the command line. It is easier to use the latter in scripts, so it was chosen here. Instead of asmcmd, the primary tool used in Oracle 11g, you can now make use of ASMCA in silent mode to create the ACFS volume and file system. The command options used are silent and createVolume:

[grid@rac12node1 ∼]$ asmca -silent -createVolume -volumeName orahomevol 
> -volumeDiskGroup ACFS -volumeSizeGB 8
 
Volume orahomevol created successfully.

The syntax is self-explanatory. ASMCA has been instructed to create a volume with the name “orahomevol” on disk group ACFS with a size of 8 GB. This operation completes quickly. The venerable asmcmd command proves that the volume was indeed created:

[grid@rac12node1 ∼]$ asmcmd volinfo -G ACFS -a
Diskgroup Name: ACFS
 
         Volume Name: ORAHOMEVOL
         Volume Device: /dev/asm/orahomevol-137
         State: ENABLED
         Size (MB): 8192
         Resize Unit (MB): 32
         Redundancy: UNPROT
         Stripe Columns: 4
         Stripe Width (K): 128
         Usage:
         Mountpath:

With the volume in place, you can create the file system on top. Again, ASMCA is used to perform this task:

[grid@rac12node1 ∼]$ asmca -silent -createACFS -acfsVolumeDevice /dev/asm/orahomevol-137 
> -acfsMountPoint /u01/app/oracle
 
ASM Cluster File System created on /dev/asm/orahomevol-137 successfully. Run the generated
ACFS registration script /u01/app/grid/cfgtoollogs/asmca/scripts/acfs_script.sh as
privileged user to register the ACFS with Grid Infrastructure and to mount the ACFS. The
ACFS registration script needs to be run only on this node: rac12node1.
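
Running the generated script as root on the node named in the output, and checking the mount point afterward, might look like this (the script path is taken from the asmca output above):

[root@rac12node1 ~]# /u01/app/grid/cfgtoollogs/asmca/scripts/acfs_script.sh
# verify that the new cluster file system is mounted
[root@rac12node1 ~]# mount | grep /u01/app/oracle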

The script creates the ACFS file system, registers it as a resource with Clusterware, and starts it. The final configuration of the file system is:

[grid@rac12node1 ∼]$ srvctl config filesystem -d /dev/asm/orahomevol-137
Volume device: /dev/asm/orahomevol-137
Canonical volume device: /dev/asm/orahomevol-137
Mountpoint path: /u01/app/oracle
User:
Type: ACFS
Mount options:
Description:
Nodes:
Server pools:
Application ID:
ACFS file system is enabled
[grid@rac12node1 ∼]$
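
Should the file system not yet be mounted on the second node, you can start its Clusterware resource explicitly; a minimal example:

# mount the ACFS file system on all configured nodes via its Clusterware resource
[grid@rac12node1 ~]$ srvctl start filesystem -d /dev/asm/orahomevol-137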

Installing the Shared Database Binaries

When you have made sure that the shared file system is mounted on each host, you can install the database binaries. Again, you should install the binaries for a single instance only. This way you are protected from accidentally starting a database as a cluster database, which would constitute a license violation. When selecting the location for the database installation, double-check that you are installing on the newly created ACFS instead of a local file system.

The installation of the database binaries is completely identical to an installation to a non-shared file system. You can read more about installing the database binaries for a single instance Oracle database in Chapter 6.

Creating the Database

With the Oracle home created on a shared ACFS device, you can now start the Database Configuration Assistant from it. To make maximum use of the shared file system, you should consider placing the diagnostic destination on the cluster file system as well. All other database-related files should be placed in ASM. Remember that the database must be a single instance Oracle database.

If your engineering department has supplied a database template, the configuration of the database could not be easier. Here is an example for a single instance database creation in which the default data file location is set to disk group DATA and the fast recovery area is defined in disk group +RECO. Also note how the audit file destination and diagnostic destination are explicitly placed on the ACFS file system:

[oracle@rac12node1 ∼]$ /u01/app/oracle/product/12.1.0.1/dbhome_1/bin/dbca 
> -silent -createDatabase -templateName active_passive.dbc -gdbName apclu
> -sysPassword sys -createAsContainerDatabase true -numberOfPDBs 1
> -pdbName apress -systemPassword sys -storageType ASM -diskGroupName DATA
> -recoveryGroupName RECO -totalMemory 1536
Enter PDBADMIN User Password:
  
Copying database files
1% complete
[...]
74% complete
Creating Pluggable Databases
79% complete
100% complete
Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/apclu/apclu.log" for further details.
[oracle@rac12node1 bin]$

image Note  Pluggable Databases (PDBs) are one of the most interesting new features in Oracle 12c. You can read more about the technical details pertaining to PDBs in Chapter 7.

The silent creation of a database is a very powerful command; refer to the online help or the official documentation set for more information about its options.

Registering the Database with Clusterware

As part of its creation, the database is registered in Clusterware as a single instance database resource. Under normal circumstances that would be a much-appreciated service, but in this special case it needs changing. An Oracle single instance database deployed on a multi-node cluster does not have the ability to “move” around in the cluster: the registration ties the instance to the cluster node where it was created.

To gain some flexibility, a new custom resource needs to be created. This resource allows the database to be started on more than one cluster node: in the event of a node failure, it is relocated to the surviving node without user intervention. Since the database files all reside in ASM—mounted concurrently on all nodes—no time-consuming file system integrity check is needed. The first step is to remove the database configuration from the cluster registry:

[oracle@rac12node1 ∼]$ srvctl stop database -d apclu
[oracle@rac12node1 ∼]$ srvctl remove database -d apclu
Remove the database apclu? (y/[n]) y

Note that this command cleans the entry from the oratab file as well. The following steps require a little more planning. Recall from the introduction that Clusterware manages resources during their life cycle. The major states a resource can be in are:

  • Stopped: the resource is stopped, either intentionally or because it crashed.
  • Started: the resource is up and running.
  • Intermediate: something went wrong when starting the resource.

The status of each resource managed by Clusterware can be viewed using the crsctl utility, as shown here:

[oracle@rac12node1 ∼]$ /u01/app/12.1.0.1/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac12node2               STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac12node1               STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac12node1               STABLE

A resource can be either local to a cluster node or a cluster resource. A local resource always stays on its cluster node and does not relocate to other nodes. A cluster resource, however, is aware that it is deployed in a cluster and can react to node failures. Two prerequisites must be met to register a custom resource. The first is a so-called action script that implements callbacks to start, stop, and check the resource; the second is a resource description, called a profile, that references the action script. A very basic action script implementing the minimum amount of code follows. It lacks elegance and sufficient error checking; you are encouraged to improve it as needed.

#!/bin/bash
 
# enable shell tracing when the environment variable TRACE is set to YES
[[ "$TRACE" == "YES" ]] && {
        set -x
}
 
PATH=/usr/local/bin:$PATH
ORACLE_SID=apclu
ORACLE_HOME=/u01/app/oracle/product/12.1.0.1/dbhome_1
ORACLE_BASE=/u01/app/oracle
# operating system account that owns the RDBMS installation
RDBMS_OWNER=oracle
 
export PATH ORACLE_SID ORACLE_HOME ORACLE_BASE
 
PS=/bin/ps
SU=/bin/su
 
# run the SQL command passed as $1 in sqlplus as SYSDBA; switch to the RDBMS
# owner first when the script is invoked as another user (for example root)
run_sqlplus() {
        if [[ $(whoami) != "$RDBMS_OWNER" ]]; then
                # su - starts a fresh login shell, so pass the environment explicitly
                echo "$1" | $SU - "$RDBMS_OWNER" -c \
                  "ORACLE_SID=$ORACLE_SID ORACLE_HOME=$ORACLE_HOME $ORACLE_HOME/bin/sqlplus / as sysdba"
        else
                echo "$1" | $ORACLE_HOME/bin/sqlplus / as sysdba
        fi
}
 
case $1 in
'start')
        run_sqlplus "startup"
        RC=$?
        ;;
'stop')
        run_sqlplus "shutdown"
        RC=$?
        ;;
'check')
        # the database counts as up when its SMON background process is present
        $PS -ef | grep -v grep | grep ora_smon_${ORACLE_SID} > /dev/null
        RC=$?
        ;;
*)
        echo "invalid command passed to ${0} - must be one of start, stop or check"
        RC=1
esac
 
# we are done - Clusterware only evaluates the exit code
if (( RC == 0 )) ; then
        exit 0
else
        # the actual error code from the command does not matter
        exit 1
fi

Moving the script to the ACFS file system allows for easier maintenance, and do not forget to make it executable for its owner.
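
For example, assuming the script location referenced in the resource profile shown later in this section:

# make the action script executable for the oracle owner and the dba group
[oracle@rac12node1 ~]$ chmod 750 /u01/app/oracle/admin/apclu/scripts/ha_apclu.sh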

image Note  The script assumes that the oratab files on both cluster nodes are properly maintained, otherwise it will fail.
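
An oratab entry matching the example might look like the following on both nodes; the trailing N tells the legacy dbstart script not to start the database at boot, because Clusterware takes care of that:

# /etc/oratab on rac12node1 and rac12node2
apclu:/u01/app/oracle/product/12.1.0.1/dbhome_1:N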

Armed with the script you can create the new resource. The command used for this purpose is crsctl add resource.

[oracle@rac12node1 scripts]$ /u01/app/12.1.0.1/grid/bin/crsctl add resource -h
Usage:
  crsctl add resource <resName> -type <typeName> [[-file <filePath>] | [-attr "<specification>[,...]"]] [-f] [-i]
     <specification>:   {<attrName>=<value> | <attrName>@<scope>=<value>}
        <scope>:   {@SERVERNAME(<server>)[@DEGREEID(<did>)] |
                       @CARDINALITYID(<cid>)[@DEGREEID(<did>)] }
where
     resName         Add named resource
     typeName        Resource type
     filePath        Attribute file
     attrName        Attribute name
     value           Attribute value
     server          Server name
     cid             Resource cardinality ID
     did             Resource degree ID
     -f              Force option
     -i              Fail if request cannot be processed immediately

This command gives you a lot of freedom in designing your resource profile; for the purpose of our failover database, however, it is sufficient to add the resource as shown:

[oracle@rac12node1 scripts]$ /u01/app/12.1.0.1/grid/bin/crsctl 
> add resource apress.apclu.db -type cluster_resource -file add_res_attr

The file referenced in the command contains key/value pairs defining the resource.

ACL=owner:oracle:rwx,pgrp:dba:rwx,other::r--
ACTION_SCRIPT=/u01/app/oracle/admin/apclu/scripts/ha_apclu.sh
AUTO_START=restore
CARDINALITY=1
DEGREE=1
DESCRIPTION=Custom resource for database apclu
PLACEMENT=restricted
HOSTING_MEMBERS=rac12node1 rac12node2
RESTART_ATTEMPTS=2
SCRIPT_TIMEOUT=60
START_DEPENDENCIES=
hard(ora.DATA.dg,ora.RECO.dg,ora.acfs.orahomevol.acfs) weak(type:ora.listener.type,uniform:ora.ons)
pullup(ora.DATA.dg,ora.acfs.orahomevol.acfs)
STOP_DEPENDENCIES=
hard(intermediate:ora.asm,shutdown:ora.DATA.dg,shutdown:ora.RECO.dg,ora.acfs.orahomevol.acfs)
FAILURE_INTERVAL=60
FAILURE_THRESHOLD=1

Some of these directives have been taken from the original resource profile, especially those defining the start and stop dependencies. In the example, /u01/app/oracle/ is an ACFS file system. In a nutshell, this profile allows the database to be started on rac12node1 and rac12node2, the “hosting members.” For this variable to have an effect, you need to define a placement policy of “restricted.” The other variables are internal to Clusterware. If you are interested, you can check their meaning in the Clusterware Administration and Deployment Guide, Appendix B. Do not forget to check that the database is registered in /etc/oratab on both hosts.
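
Once the resource is registered, you can verify that the profile was stored as intended, for example by printing a few key attributes:

# display selected attributes of the newly registered resource profile
[oracle@rac12node1 scripts]$ /u01/app/12.1.0.1/grid/bin/crsctl stat res apress.apclu.db -p | grep -E 'ACTION_SCRIPT|PLACEMENT|HOSTING_MEMBERS'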

Managing the Database

Instead of using the familiar srvctl command or even sqlplus directly, you need to interact with the database using the crsctl utility. Although this might sound daunting, it is not. Starting the database, for example, is accomplished using the following command:

[oracle@rac12node1 scripts]$ /u01/app/12.1.0.1/grid/bin/crsctl 
> start resource apress.apclu.db
> -n preferredHost
CRS-2672: Attempting to start 'apress.apclu.db' on 'rac12node1'
CRS-2676: Start of 'apress.apclu.db' on 'rac12node1' succeeded

You should make a habit of explicitly specifying, as part of the command, the host the database should start on. The syntax is very easy to understand: the crsctl command takes a verb and an object to work on. In this case, the verb is start and the object is the resource named “apress.apclu.db”. The optional parameter “-n” specifies which host the resource should start on.

Conversely, you stop the resource with the “stop” verb:

[oracle@rac12node1 scripts]$ /u01/app/12.1.0.1/grid/bin/crsctl 
> stop resource apress.apclu.db
CRS-2673: Attempting to stop 'apress.apclu.db' on 'rac12node1'
CRS-2677: Stop of 'apress.apclu.db' on 'rac12node1' succeeded

You can also manually relocate the database if you have to:

[oracle@rac12node1 scripts]$ /u01/app/12.1.0.1/grid/bin/crsctl relocate resource 
> apress.apclu.db -s rac12node2 -n rac12node1
CRS-2673: Attempting to stop 'apress.apclu.db' on 'rac12node2'
CRS-2677: Stop of 'apress.apclu.db' on 'rac12node2' succeeded
CRS-2672: Attempting to start 'apress.apclu.db' on 'rac12node1'
CRS-2676: Start of 'apress.apclu.db' on 'rac12node1' succeeded

The syntax is straightforward: you specify the source host with -s and the destination with -n. Each takes a valid hosting member as an argument. The relocation happens transparently if the current active node experiences an outage, planned or unplanned. Clusterware executes the code in the check callback of the action script to see whether the resource is still available and makes RESTART_ATTEMPTS attempts to start the resource on the same host. Since this is impossible—the host went down—it relocates the resource to the former passive node. Once initiated, the time it takes to fail over to the other node is the same as a recovery from an instance crash on the same server. As a rule of thumb, the busier the system, the longer the recovery time.
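
To check at any time which node currently hosts the database resource, you can query its state; for example:

# show the current state and hosting node of the failover database resource
[oracle@rac12node1 scripts]$ /u01/app/12.1.0.1/grid/bin/crsctl stat res apress.apclu.db -t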

Summary

This chapter discussed possible software solutions for your consolidation project. You first read about virtualization examples using Solaris Zones, followed by an introduction to Oracle VM Server. In the last section, you read about creating a single instance database and making it highly available on a clustered ASM configuration, with Oracle Clusterware providing heartbeat and cluster membership management. This approach is best integrated with Clusterware; other cluster-aware logical volume managers would introduce their own cluster membership stack, which could potentially conflict with Oracle’s.

Every vendor currently tries to position its solution as the most suitable technology for consolidation in the (private) cloud. Such solutions are built on top of the respective management frameworks for resilience and high availability. The other approach is to cluster Oracle databases on physical hardware to provide protection from instance failure and to keep applications running. There is no clear winner yet, and times are interesting. Whichever technology you choose, you should spend sufficient time testing it in production-like environments before defining the service and rolling it out to potential users.
