CHAPTER 3


Supporting Hardware

After a quick introduction to some of the most noteworthy Oracle 12.1 New Features, it is time to get back to the core theme of this book: consolidation. What is important for any new project is even more so for consolidation projects: defining the right infrastructure for the task ahead. Consolidation projects most likely differ from other projects in that the infrastructure platform—once decided upon—is to be used in many deployments in exactly the same way. A successful consolidation project has the potential of replacing a lot of hardware. But only if done right! And this is exactly why such a large chapter is dedicated to the matter of selecting hardware.

But hardware is not the only important subject to consider. Once the hardware platform has been agreed upon by all parties, the next step is to think about the deployment of Oracle databases; the next chapter is written with that in mind. You read about the author’s preference for the Linux platform on x86–64 hardware in previous chapters, and you will find this thread throughout the rest of the book. Although many believe that Linux on x86–64 is the future mainstream platform, there is, of course, no obligation for you to follow the same path. This is another of the key elements you will find throughout the book: use what your staff is comfortable with. If you would like to transition to new systems, or to new processes and tasks, please do so only after appropriate change management has been implemented. Getting buy-in from the engineers and operational team supporting your systems is important, not only because it is needed for smooth operations and internal support, but also because being proud of what you do adds extra motivation to the team.

This chapter aims at evaluating and introducing options related to the consolidation hardware platform, taking into account the developments in CPUs, operating systems, and storage. When evaluating these, it is important not to focus on individual components but rather to keep the big picture in mind. You can read about enabling software in the following chapter. Again, there is a wealth of options here, and the fact that the ones you read about made it into the book does not mean they are the best options for all tasks out there. The chapter is not about comparing and ranking: it is about introducing and familiarizing you with some of the available products.

Enabling Hardware

This chapter will explore some of the potential hardware choices for the consolidation platform. As you just read in the introduction, the choice of hardware should be made after careful consideration. The impact of selecting an inappropriate platform can be felt painfully if it turns out that some factors have not been taken into account. Once budget is allocated for a specific iteration of the consolidated service, it might be difficult to arrange for major changes. After all, it’s economies of scale your company is after!

The focus of the chapter is going to be on the Linux operating system running on the x86–64 platform. The main reason lies in the platform’s great current and future potential. The mantra of commodity computing has always been to use easy-to-replace standard building blocks.

The established vendors in the server market offer a comprehensive suite of support products around these industry standard servers, and making use of them allows you to benefit from economies of scale. First of all, there is a benefit of reducing the platform footprint. Training and employing staff capable of supporting multiple platforms to a high degree of professionalism is expensive. Especially in large organizations where all major platforms are in use, a great deal of time has to be spent to certify a given platform with storage, the operating system, and software stack. Naturally some of this work can be reduced by limiting the number of supported platforms. The number of supported platforms depends on the strategic direction of the company and also on the budget available. Certain vendors, however, may not be able to comply with the platform strategy, but that is not a problem either. Support arrangements can be put in place between the vendor and the application team to cover the nonstandard platform as an exception. The vendor should then be persuaded to adjust its product to be in line with the corporate strategy.

One often-heard argument against limiting the number of supported platforms is that of the vendor lock-in. Being supplied by only one or very few parties for mission critical systems can be either a big advantage (as the vendor marketing will tell you) or a disadvantage (what the other vendors will tell you). There is no clear-cut answer to the question of vendor diversity. This again is often down to managerial decisions and relationships with vendors, but the Linux-operating environment at least gives you a lot of choice.

However, despite the strong preference for Linux in this book, you should not feel compelled to rush out and replace your existing estate with x86–64 servers. During many site visits it has become apparent that the staff responsible for “their” operating system (and the servers running it) often have strong feelings towards that platform. In addition to the necessary skills that administrators must possess to manage a platform, it is important to use adequate change management when introducing something new. The concerns and worries of staff should be taken into consideration, and the budget forecast should include training or new hires as well.

After so much theory it is time to get back to the real matter: hardware! This section starts off with the question of whether blades make sense for you or whether rack-mounted systems fit into your organization better. It then discusses the changes in the hardware space that have happened in the past two years before exploring more advanced aspects of the x86–64 platform.

Blades or Rack-Mounted?

Blades have established themselves as suitable alternatives to the classical 19-inch rack-mounted server for many workloads. Blades are usually smaller than rack-mounted servers, but they also have fewer components and fewer possibilities for expansion. Some of the components you would normally find inside a rack-mounted server will be in the so-called blade enclosure. The small form factor for blades will allow a very high density, and when they were first introduced, blades seemed ideal to reduce the space required in the data center. However, with the high density comes a heightened requirement to cool the blades and enclosure to prevent them from overheating. With the introduction of chips capable of adjusting their clock rate depending on workload, cooling becomes even more important. Often the chips need to be in their comfort zone when it comes to their Thermal Design Power (TDP). As soon as the processor temperature rises too high, it will clock down and run at reduced speed to prevent damage to its components. Sufficient cooling of modern CPUs is therefore essential to stable CPU performance! Luckily the ever shrinking manufacturing process for processors allows for a reduction of cooling requirements. Processor vendors have realized that a power-efficient processor is a real necessity in the drive to cut cost in data centers.

There is no general industry standard blade enclosure, but you can expect most blade enclosures to contain power supply units, networking equipment to support storage and client connectivity, and other shared infrastructure. The introduction of Data Center Ethernet or Converged Ethernet allows vendors to use high speed Ethernet switches for network and storage traffic, potentially reducing the number of cables in the data center. Mature graphical management interfaces allow the administrator to perform lights-out management of the blades. Graphical representations of the blade enclosure allow the administrator to see which slots are free and which ones are occupied. Often warnings and alerts can be sent via SNMP (Simple Network Management Protocol) traps to monitoring solutions.

A blade enclosure would be of little use if it were not for the blades. These small yet powerful computers are added to the enclosure either horizontally or vertically. Depending on the model, they can occupy either a full slot or half a slot; the terms full-width and half-width (or height) have been coined for these. Blades are highly versatile and configurable. If your data center can accommodate them, they are definitely worth evaluating.

Rack-mounted servers have been the predominant form of servers and probably still are. The typical rack-mounted server is measured in unit height; the width is predetermined by the rack you are mounting the server in. The industry currently uses mainly 19-inch- and occasionally 23-inch-wide racks. A rack unit corresponds to 1.75 inches, and a typical 19-inch rack has support for 42 units. With the basic parameters set, the IT department is free to choose whatever hardware is 19 inches wide and otherwise fits inside the rack. The benefit of rack-mounted servers is that you can mix and match hardware from different vendors in the same cage. Rack-mounted servers can usually take more memory and processors than their blade counterparts. Recent benchmark results available from the TPC website refer to high-end x86-64 servers of five and even seven units in height, taking a massive two to four terabytes of memory. The next generation of Xeon processors to be released in 2013/2014 will push that limit even further.

Blades seem well suited for clustered applications, especially if individual blades can boot from the SAN and generally have little static configuration on internal storage or the blade itself. Some systems allow the administrator to define a personality of the blade, such as network cards and their associated hardware addresses, the LUN(s) on which the operating system is stored, and other metadata defining the role of the blade. Should a particular blade in a chassis fail, the blade’s metadata can be transferred to another one, which can be powered up and resume the failed blade’s role. Total outage time, therefore, can be reduced, and a technician has a little more time to replace the failed unit.

Rack-mounted servers are very useful when it comes to consolidating older, more power-hungry hardware on the same platform. They also generally allow for better extension in the form of available PCIe slots compared to a blade. Harnessing the full power of a 5U or even 7U server requires advanced features from the operating system, such as support for the Non-Uniform Memory Architecture (NUMA) in modern hardware. You can read more about making best use of your new hardware in Chapter 4.

Regardless of which solution you decide to invest in for your future consolidation platform, you should consider answering the following questions about your data center:

  • How much does the data center management charge you for space?
  • How well can the existing air conditioning system cope with the heat?
  • Is the raised floor strong enough to withstand the weight of another fully populated rack?
  • Is there enough power to deal with peaks in demand?
  • Can your new hardware be efficiently cooled within the rack?
  • Is your supporting infrastructure, especially networking and fiber channel switches, capable of connecting the new systems the best possible way? You definitely do not want to end up in a situation where you bought 10 Gbps Ethernet adapters, for example, and your switches cannot support more than 1 Gbps.
  • Does your network infrastructure allow for a sufficiently large pool of IP addresses to connect the system to its users?

There are many more questions to be asked, and the physical deployment of hardware is an art on its own. All vendors provide planning and deployment guides, and surely you can get vendor technicians and consultants to advise you on the future deployment of their hardware. You might even get the luxury of a site survey wherein a vendor technician inspects corridors, elevators, raised flooring, and power, among other things, to ensure that the new hardware fits physically when it is shipped.

Let’s not forget at this stage that the overall goal of the consolidation efforts is to reduce cost. If the evaluation of hardware is successful, it should be possible to benefit from economies of scale by limiting yourself to one hardware platform, possibly in a few different configurations to cater to the different demands of applications. The more standardized the environment, the easier it is to deliver new applications with a quick turnaround.

With the basic building block in sight, the next question is: what should be added as peripheral hardware? Which options do you have in terms of CPU, memory, and expansion cards? Which storage option should you use? The following sections introduce some changes in the hardware world which have happened over the last few years.

Changes in the Hardware World

The world of hardware is changing at a very rapid pace. The core areas where most technological advance is visible are processor architecture, the explosion of available memory, storage, and networking infrastructure. Combined, these changes present unique challenges not only to the operating system but also to the software running on it. Until not too long ago it was unthinkable to have 160 CPU threads exposed to the operating system in the x86–64 world, and scheduler algorithms had to be developed to deal with this large number. In addition to the explosion of the number of CPU cores and threads, you also have vast amounts of memory at your disposal. Commercial x86-64 servers can now address up to four terabytes. Non-uniform memory access (NUMA) has also become more important on x86-64 platforms. Although the NUMA factor is most relevant in systems with four NUMA nodes and more, understanding NUMA in Linux will soon become crucial when working with large servers.

Exadata was the first widely available platform running Oracle workloads that moved Remote Direct Memory Access into focus. Most of us associate Infiniband with RDMA, but there are other noteworthy implementations, such as iWARP and RoCEE (RDMA over Converged Enhanced Ethernet), for example. Also, Infiniband is a lot more than you might think: it is a whole series of protocols for various different use cases, ranging from a carrier for classic IP (“IP over IB”) to the low-latency Exadata protocol based on RDS, or Reliable Datagram Sockets, to transporting SCSI (“SRP,” the SCSI RDMA Protocol).

Changes to storage are so fundamental that they deserve a whole section of their own. Classic fiber channel arrays are getting more and more competition in the form of PCI Express flash cards, as well as small-form-factor flash arrays connected via Infiniband, allowing for ultra-low latency. New vendors are challenging the established storage companies with new and interesting concepts that do not fit the model in which many vendors placed their products over the last decade. This trend can only be beneficial for the whole industry. Even if the new storage startups are quickly swallowed by the established players in the storage market, their dynamics, products, and ideas will live on. No one can any longer ignore the changes flash memory has brought to the way enterprises store data.

Thoughts About the Storage Backend

Oak Table member James Morle has written a great paper titled “Sane SAN 2010.” In it he also predicted changes to the way enterprise storage arrays are going to be built, alluding to the flash revolution in the data center. Over the last decade the use of NAND flash has become prolific, and for good reason. Flash storage can offer a lot more I/O operations per second in a single unit, while at the same time requiring a lot less space and cooling. Given the right transport protocol, it can also boost performance massively. The pace of development of flash memory has outpaced that of magnetic disk in the relatively short time flash has existed. Advances in capacity for magnetic disks have been made consistently in the past, but the performance of the disks has not increased at the same speed. Before starting a discussion about storage tiering and the flash revolution, a little bit of terminology must be explained.

Bandwidth/throughput: these figures indicate how much data you can transmit between your storage backend and the database host per unit of time. Most decision support systems are highly bandwidth hungry.

Response time/latency: the quicker the I/O request can be satisfied, the better. Response time is the time it takes from issuing an I/O request to its completion. Online transaction processing systems are usually sensitive to changes in response times.

What are typical ballpark figures to expect from your storage array? In terms of latency you should expect 6–8 milliseconds for single block random reads and 1 or 2 milliseconds for small sequential writes such as log writes. These figures most likely do not reflect reads from physical disk, but rather reads served by the caches in the arrays. With the wide variety of existing storage solutions and the progress made on an almost monthly basis, it is next to impossible to give a valid figure for expected throughput, which is why you do not find one here.

Storage Tiering

Storage tiers have been common in the storage array for quite some time. Storage engineering maps different classes of storage to tiers, often with the help of the storage array’s vendor. The different classes of storage are taken from a matrix of storage types, such as DRAM, flash memory, and spinning disk, and performance attributes, such as RAID levels. A third dimension to the matrix is the transport protocol. Current mainstream transport media include Fiber Channel, Fiber Channel over Ethernet, and iSCSI, as well as the Network File System. As you can imagine, the number of permutations is large, and it requires careful planning to decide which combinations should be made available as storage tiers. One approach to storage tiering relying solely on hard disks could resemble the following:

  1. 15k RPM Fiber Channel disks in a RAID 10
  2. 15k RPM Fiber Channel disks in a RAID 5
  3. 10k RPM Fiber Channel disks in RAID 5
  4. Direct NFS with dedicated appliances

In the above list, the lowest number should be the “best” storage tier. Storage tiering has often been used to enable organizations to implement data life-cycle models. Frequently accessed, or “hot,” data was placed on the better-quality storage tiers, and cold, or very infrequently used, data was placed on lower-end storage. In the Oracle world, that often meant placing objects on tablespaces defined within a storage tier. Moving data from one tier to another required either the use of “alter table ... move” commands or alternatively the implementation of calls to the database package DBMS_REDEFINITION. Needless to say, this was a non-automated task that had to be performed during maintenance windows. High-end enterprise arrays nowadays try to perform the same task automatically in the background. Predictable performance is more important than stellar performance on day one, followed by abysmal performance on day two. Time will tell if the automated models are sophisticated and stable enough to guarantee consistent execution times and adherence to the agreed service levels.
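
To illustrate the manual approach, consider the following sketch of a tier move via SQL*Plus. The schema, table, and tablespace names are made up for illustration; note that moving a table leaves its indexes in an unusable state, which is why a rebuild follows immediately:

[oracle@server1 ∼]$ sqlplus / as sysdba <<EOF
-- relocate a rarely accessed table to the cheaper storage tier
alter table soe.order_archive move tablespace tier3_data;
-- the move invalidates existing indexes; rebuild them on the matching tier
alter index soe.order_archive_pk rebuild tablespace tier3_index;
EOF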

The commoditization of flash storage has changed the classic storage tier model presented in the preceding list. The introduction of flash memory, or what some call solid state disk, has fundamentally changed the storage industry.

The Flash Revolution

The proliferation of NAND flash memory, colloquially referred to as solid state disk or SSD, has changed the storage industry profoundly. Where it was previously necessary to short-stroke many 15k RPM fiber-channel disks to achieve a high number of I/O operations per second, flash memory offers the same performance characteristics, plus potentially lower access times and less congestion on disk, making it a very attractive solution. Very high I/O performance can now be achieved in smaller, less power-hungry, and easier-to-cool solutions either inside the database host or externally connected.

Interestingly, consumers benefit from flash storage in a similar way to enterprise customers, although the way the storage is actually presented to the hardware differs greatly between the two customer groups. Most enterprise offerings for flash-based memory, which in reality is NAND flash, fall into the following categories:

  • Connected internally to the server or blade
      • As a 2.5-inch or 3.5-inch solid state disk
      • As a PCI Express card
  • Externally attached
      • Via Fiber Channel
      • Via PCIe card
      • Via Infiniband

Internally attached 2.5-inch SSDs do not play a major role in enterprise computing: the majority of these disks use a SATA 6G interface and are made for high-end consumer desktops and graphics workstations. The next category of internally attached flash memory is more interesting. Recent processors feature PCI Express version 3, offering a lot more bandwidth at a reduced overhead compared to the previous PCI Express version 2.x.

A WORD ABOUT PCI EXPRESS

PCI Express, short for Peripheral Component Interconnect Express (PCIe), is the x86 world’s standard way to add functionality to a server which isn’t already available on the mainboard. Examples of such PCIe cards are 10 Gigabit Ethernet cards, Fiber Channel Host Bus Adapters, Infiniband cards, and the like. PCIe has been designed to replace older standards such as the Accelerated Graphics Port and the older PCI-X and PCI standards. Unlike some of the standards it replaces, PCIe is a high-speed point-to-point serial I/O bus.

When considering PCIe bandwidth, server vendors often specify the number of lanes to a card slot. These lanes, broadly speaking, equate to bandwidth. Industry standard servers use PCIe x4, x8, and x16 lanes for slots, most of which are version 2. Every processor or mainboard supports a certain maximum number of PCIe lanes. The exact number is usually available from the vendor website.

PCI Express is currently available in version 3. Thanks to more efficient encoding, the protocol overhead could be reduced compared to PCIe version 2.x, and the net bitrate could be doubled.

PCIe 3.0 has a transfer rate of eight giga-transfers per second (GT/s). Compared to 250 MB/s per lane in the initial PCIe 1.0, PCIe 3.0 has a bandwidth of 1 GB/s per lane. With a PCIe 3.0 x16 slot, a theoretical bandwidth of 16 GB/s is possible, which should be plenty for even the most demanding workloads. Most systems currently deployed, however, still use PCIe 2.x with exactly half the bandwidth of PCIe 3.0: 500 MB/s per lane. The number of cards supporting PCIe 3.0 has yet to increase, although that is almost certainly going to happen while this book is in print.
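
A simple way to find out which PCIe generation and lane count a card has actually negotiated is to query its link capability and status with lspci from the pciutils package. The bus address and the output shown here are purely illustrative:

[root@server1 ∼]# lspci -vv -s 21:00.0 | grep -E 'LnkCap|LnkSta'
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s L1
        LnkSta: Speed 8GT/s, Width x8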

PCIe is possibly the best way to connect the current ultra-fast flash solutions so that they are least slowed down by hardware and additional protocol bottlenecks. Such cards use single-level cells (SLC) for best performance or multi-level cells (MLC) for best storage capacity. According to vendor specifications, such devices have low microsecond response times and offer hundreds of thousands of I/O operations per second. When it comes to the fastest available storage, PCIe x4 or x8 cards are hard to beat. A PCIe card will show up as a storage device under Linux and other supported operating systems, just like a LUN from a storage array, making it simple to either add it into an ASM disk group as an ASM disk or alternatively create a suitable file system, such as XFS, on top of it. The downside to the use of PCIe flash memory is the fact that a number of these cards cannot easily be configured for redundancy in hardware. PCIe cards are also not hot-swappable, requiring the server to be powered off if a card needs to be replaced. Nor can they be shared between hosts (yet?), making them unsuitable for Oracle configurations requiring shared storage.
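
As a quick sketch of the file system route: assuming the card’s driver exposes a block device, here called /dev/flash0 (actual device names vary by vendor and driver), creating and mounting an XFS file system on it could look like this:

[root@server1 ∼]# mkfs.xfs /dev/flash0
[root@server1 ∼]# mkdir -p /u02/flash
[root@server1 ∼]# mount -o noatime /dev/flash0 /u02/flash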

Another use case for PCIe flash memory is as a second-level buffer cache, a feature known as Database Smart Flash Cache. With today’s hardware taking terabytes of DRAM, this solution should be carefully evaluated to see whether your applications on the new hardware platform actually benefit. Finally, some vendors allow you to use the PCIe flash device as a write-through cache between the database host and a Fiber Channel attached array. PCIe flash devices used in this way can speed up reads because those reads do not need to use the fiber channel protocol to access data on the array. Since the flash device is write-through, failure of the card does not impact data integrity.
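
Enabling Database Smart Flash Cache boils down to two initialization parameters, shown in the following hedged sketch. The file location and size are hypothetical, and the feature is only available on Oracle Linux and Solaris:

[oracle@server1 ∼]$ sqlplus / as sysdba <<EOF
-- both parameters require an instance restart to take effect
alter system set db_flash_cache_file='/u02/flash/flash_cache.dat' scope=spfile;
alter system set db_flash_cache_size=100G scope=spfile;
EOF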

External flash based storage solutions can be connected in a number of ways, with Fiber Channel probably the most common option. Most established vendors of storage arrays offer a new storage tier inside the array based on flash memory. For most customers, that approach is very easy to implement because it does not require investment into new network infrastructure. It also integrates seamlessly into the existing fabrics, and the skillset of the storage team does not need to be extended either. In short, Fiber Channel is a great solution for established data centers to quickly get some of the benefits of flash memory. Using Fiber Channel to address “SSD” inside an array is a bit slower, though—the Sane SAN 2010 paper explains the difference in great detail. Essentially the round trip time of Fiber Channel is higher than the blazing fast Remote Direct Memory Access available with Infiniband’s SCSI RDMA Protocol (SRP).

Some vendors exploit the capabilities of PCIe and allow the external array to be connected to the database server via a PCIe card instead of Fiber Channel. All storage communication has to use that wire, and the storage system behaves as if it were directly plugged into a PCIe slot. Thus the storage system benefits greatly from PCIe’s high bandwidth and low latency properties. Although again a very fast solution, this design does not allow the array to be shared between hosts, the same limitation as with directly attached PCIe cards. The best, although probably most expensive, way to attach an external array to the database hosts is to use Infiniband and the previously mentioned SCSI RDMA Protocol, or SRP.

USING INFINIBAND TO TRANSPORT SCSI COMMANDS

The Small Computer System Interface has proven remarkably versatile over the many years of its existence. When it came out in the 1980s, SCSI was mostly used in expensive enterprise hardware, initially to address hard disks or tape drives. Like so many protocols it started out as a parallel bus, which has since been replaced by a serial version known as Serial Attached SCSI, or SAS for short.

SAS has emerged as the de-facto standard for enterprise-class direct attached storage, in which the stricter rules about signaling and cable-length do not matter as much. The current implementations of the SCSI protocol do not allow for more than 10-20m cable length.

The cable-length limitations have been addressed with the introduction of Fiber Channel. FC is a networking protocol, and despite being capable of transporting many workloads, it has found its niche as the predominant form of transporting SCSI commands over a distance. Almost every single modern database server is “SAN-connected,” which means it has its storage provided by an external array. Fiber Channel is divided into a number of protocol layers that remind you of the ISO/OSI layers; the upper level layer is primarily concerned with mapping SCSI to the new method of transportation.

Fiber Channel has evolved over time from Gigabit Fiber Channel in 1997 to 16 Gb/s Fiber Channel available in 2011. Since changing from one generation to another requires investment in supporting infrastructure, it is expected that 8 Gb/s FC will remain mainstream for a few more years.

Some users thought that the evolution of Fiber Channel as the method of choice for transporting SCSI commands did not happen quickly enough. Alternative approaches to the lower level protocols of Fiber Channel are Fiber Channel over Ethernet, Internet SCSI (iSCSI), and other less widely used ones. Most of these additional protocols use layers of the TCP/IP stack. The idea behind that approach is to be able to make use of existing know-how and potentially infrastructure—Ethernet is well understood.

One of the fastest (but also most expensive) ways to transport SCSI commands today is by using Remote Direct Memory Access, RDMA. RDMA is often implemented as Infiniband, which has seen a lot more widespread use since the introduction of Oracle’s Exadata. RDMA allows zero-copy networking between hosts, bypassing many parts of the operating system, taking load off the CPUs, and considerably benefitting the latency of operations. Of the many uses Infiniband permits, the SCSI RDMA Protocol is one and is most useful for general purpose storage solutions.

Infiniband is so much faster than Fiber Channel because of its incredibly low latencies, its high bandwidth, and the zero-copy feature. Quad Data Rate (QDR) Infiniband has been available since 2008 and is in use in Oracle’s Exadata. It offers up to 10 Gb/s per link, and Exadata, just like most Infiniband systems, has four-lane ports accumulating to 40 Gb/s. The next evolution, called Fourteen Data Rate (FDR) Infiniband, is already available today. FDR increases the speed to 14 Gb/s per link; since most IB ports will again be four-lane ports, this offers a total of 56 Gb/s. A new encoding method also promises less overhead, but you are well advised to combine FDR with PCIe version 3 to make use of the enormous bandwidth on offer if you do not want to bottleneck on the second-generation PCIe cards currently in use. Infiniband is a new technology for most of us, for which your organization needs to start acquiring hardware as well as operational know-how. This can be a significant investment.
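
If you already have Infiniband cards at your disposal, the ibstat utility from the infiniband-diags package reports the rate a port has negotiated. The adapter name and the output below are illustrative for a QDR card:

[root@server1 ∼]# ibstat mlx4_0 1
State: Active
Physical state: LinkUp
Rate: 40
Link layer: InfiniBand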

In an attempt to categorize flash-based storage irrespective of the way it is attached to the server and to put it into perspective, you can refer to Hennessy and Patterson’s Computer Architecture: A Quantitative Approach. There you find a Memory Hierarchy ranging from super-fast access to registers within the processor to very slow access to external storage such as tape. Sources vary in terms of access times, but the general dimensions are undisputed. Consider the following shortened memory hierarchy in Table 3-1:

Table 3-1. An Attempt to Define the Memory Hierarchy with a Select Few Examples

  • Processor register (approximate latency: picoseconds): Processor registers store information retrieved from higher-tier caches such as the processor L1 cache.
  • Level 1 cache (less than 1 nanosecond): Usually implemented as Static RAM (SRAM) as opposed to DRAM.
  • Level 2 cache (a few nanoseconds): Can be slower if implemented off the chip but usually found on-die.
  • DRAM (∼30-50 nanoseconds, depending on locality): Dynamic Random Access Memory is otherwise referred to as main memory. A typical server would use DDR3 modules.
  • Flash memory read (microseconds): Not taking into account the time to send the data to the host, i.e., no round-trip time. For that you need to consider the Fiber Channel or, alternatively, Infiniband round trips.
  • Flash memory write (usually longer than a flash memory read but still within microseconds): Writes take longer on the physical layer; some solutions use caches to mitigate the write penalty.
  • 15k hard disk read (a few milliseconds)

Hennessy and Patterson also graphically plotted the access time for DRAM, which is used for “main memory” in servers, and for hard disk on a logarithmic scale on the horizontal axis of a chart. The vertical axis denoted the cost per gigabyte. As you can imagine, the access time for DRAM is very, very low; however, that comes at the expense of an exorbitant cost per GB. According to the graph, a DRAM module provided approximately 50 nanoseconds response time at a cost of roughly 100 dollars per GB. In the same year the cost per GB of magnetic storage was a lot lower than 1 dollar; however, the access time could be around 8,000,000 nanoseconds, the equivalent of 8 milliseconds. The costs obviously relate to the time the graph was created and do not reflect current prices.

image Note  Let’s not forget that this is an academic comparison. You currently will not be able to build a system with enough DRAM and backup batteries to compete for capacity in a cost-effective way with a multiple-shelf, enterprise storage array.

What is striking, though, by looking at the numbers is what is called the access time gap: less than 100 nanoseconds to access DRAM compared to 8,000,000 nanoseconds for magnetic storage is quite a difference.

You can also see from the above table that the latency for flash memory access is a lot lower than for hard disks, but before you start to base your whole storage estate on SSD, you need to know a little more. Let’s approach the complex topic of SSDs using the categorization in Figure 3-1 as our basis.

9781430244288_Fig03-01.jpg

Figure 3-1. An attempt at a categorization of SSD

The type of SSD is important, but there is little to say other than that (D)RAM-based SSDs do not have a significant market share anymore. Practically all SSDs today are NAND-flash based. This leads to the next item in the list: DRAM is volatile memory, which means that if there is a power cut, then all the data is potentially lost. Vendors realized that early on and added batteries to their DRAM-based solutions. NAND-based flash memory does not exhibit the same behavior. Please note that the following sections focus on NAND-based flash memory.

You already read about the access paths to the NAND storage; they are listed again for the sake of completeness. What you need to remember is that accessing SSD via PCIe is the quickest path to the storage. This is followed by Remote Direct Memory Access and finally by Fiber Channel.

The type of memory cell bubble denotes how many bits a cell in a NAND-based SSD can store. To better understand the significance of that sentence, you need to know that the SSD’s base memory unit is a cell. In other words, information is stored in a non-volatile way in an array of cells. It is the amount of information that can be stored in a cell that is significant to this discussion: a single-level cell stores one bit, which leads to fast transfer speeds. As an added advantage, SLCs last longer than their multi-level counterparts. Fast and reliable—there has to be a downside, and that is cost. The higher cost associated with SLC means that such cells are almost exclusively found in enterprise-class solutions. You can also expect high-performance flash memory to be based on single-level cells.

Multi-Level Cells store two bits per cell. Most MLCs, therefore, are slower due to the way the data is accessed. For the same reason individual MLC cells wear out more quickly than SLCs. However, the MLC-based SSDs allow for larger capacity. As a rule of thumb, you would buy SLC for performance and MLC for capacity. But let’s not forget that both SLC and MLC are still a lot faster than magnetic storage.

Triple-Level Cells are not really new, but they do not seem to make commercial sense yet in the enterprise segment. TLC SSDs exist for the consumer market. The advantage of storing three bits per cell is higher capacity but, similar to the step from SLC to MLC, you get even more wear and slower performance.

Another term often heard in the context of SSD is wear leveling. You read in the previous paragraphs that individual cells can wear out over the usage time of the device. The wearing of the cell is caused by writes. The controller managing the device therefore tries to spread the write load over as many cells as possible, completely transparently. The fewer writes a cell has to endure, the longer it will potentially last.

Multiple cells are organized in pages, which in turn are grouped into blocks. Most enterprise-type SSDs use a page size of 4 KB and a block size of 512 KB. These blocks are addressed much like any other block device, i.e., hard disk, making 1:1 replacements easy and straightforward. For the same reason you could set the sector size of the SSD in Linux and other operating systems to 4k. Read and write operations allow you to access random pages. Erase (delete) operations, however, require a modification of the whole block. In the worst case, if you need to erase a single page (usually 4 KB), then you have to delete the entire block. The storage controller obviously preserves the non-affected memory, writing it back together with the modified data. Such an operation is undesirable simply because it adds latency. Additionally, the write operation adds to the individual cell’s wear factor. If possible, instead of modifying existing cells, the controller will try to write to unused cells, which is a lot faster. Most SSDs, therefore, reserve a sizeable chunk of space which is not accessible from the operating system. Erase operations can then either be completed in the background after the write has completed or be deferred until space pressure arises. The more data is stored on the SSD, the more difficult it is for the controller to find free space. While performance of SSDs, therefore, is generally very good, you might sometimes see certain outliers in write performance. All of a sudden, some writes incur up to 50 milliseconds of additional latency. Such outliers are called the write cliff and are caused by the phenomenon just described. When getting an SSD on loan, it is important to check how full the device is, a figure which is often available from the driver.

When measuring the performance of SSDs with Oracle, it is important to use direct I/O. Using direct I/O allows Oracle to bypass the file system cache, making performance numbers a lot more realistic. Without direct I/O a request to read from the storage layer might well be satisfied from the operating system’s file system cache. A blazingly fast response time in an extended trace file cannot simply be attributed to a well-functioning I/O subsystem, for there is no extra information as to where the requested block was found. When you instruct Oracle to bypass the file system cache, the times reported in the Active Session History and other performance-related information are more likely to reflect the true nature of the I/O request.
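
The same principle applies to synthetic benchmarks run outside the database: they should bypass the page cache, too. A sketch using the fio utility follows, with made-up file name and sizes; the important flag is direct=1:

[oracle@server1 ∼]$ fio --name=randread --filename=/u02/flash/fio.dat --size=4g \
      --rw=randread --bs=8k --direct=1 --numjobs=4 --runtime=120 --time_based \
      --group_reporting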

When testing a flash memory solution, you might also want to consider the use of a very small Oracle SGA to ensure that I/Os generally are not satisfied from the buffer cache. This is easier said than done since Oracle allocates certain memory areas based, for example, on the number of CPUs as reported in the initialization parameter cpu_count. If you want to set your buffer cache to 48 MB, which is among the lowest possible values, you probably have to lower your cpu_count to 2 and use manual SGA management to size the various pools accordingly.
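
A minimal sketch of such a test configuration follows. The values are purely illustrative, and manual SGA management is enabled by setting sga_target to 0:

[oracle@server1 ∼]$ sqlplus / as sysdba <<EOF
alter system set cpu_count=2 scope=spfile;
alter system set sga_target=0 scope=spfile;
alter system set db_cache_size=48m scope=spfile;
alter system set shared_pool_size=256m scope=spfile;
-- restart the instance for the new values to take effect
shutdown immediate
startup
EOF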

Putting it Together

So far, you have read a lot about different ways to attach storage to your consolidated platform. Why all that detail? In the author’s experience the DBA (or database engineer) is the best person to consult when it comes to rolling out an Oracle database solution. Why? It’s a simple answer: the database administrator knows how storage, client-facing, and internal networking matter to his or her database. The operating system, storage, networking, and every other component serve only one purpose: to allow the database to execute. You will find congenial storage administrators, Unix gurus from the days of Dennis Ritchie, Brian Kernighan, and Ken Thompson, but ultimately their view is focused on their specific area of expertise. Therefore it is more often than not the database administrator or engineer who can provide the big picture! Hopefully, after you read these sections of the book, this can be you!

Besides the organizational difficulties just described, there is another reason all the different technologies have been laid out. When you are thinking about the future hosting platform, it is beneficial to roll out a uniform hardware platform. Storage tiering is certainly going to stay with us for the foreseeable future. Its main uses will be information lifecycle management and offering a different cost structure to internal clients. Whether moving data from one tier to another is going to be an automatic process or manual depends largely on the maturity of the automatic solution and the comfort level of engineering to release that feature into production.

Flash storage is going to be increasingly used, and there is a potential to design different hardware classes as the consolidation target. One could think of the platform arrangement described in Table 3-2.

Table 3-2. Potential Hardware Platforms

Gold

Rack-mounted servers, using multiple rack units (≥ 4U) with lots of DRAM for data-intensive processing. Many fast CPU cores with latest-generation processors.

Flash storage could be provided externally via Infiniband or, more traditionally, via 8 or 16 Gb/s Fiber Channel. Magnetic disk is available in the form of Fiber Channel attached storage array(s).

This platform should primarily be used for more critical production databases with high I/O demands. If high availability is a concern, then the database might be clustered with the Real Application Clusters option, hence the requirement to use flash storage outside the server itself.

Silver

This could be the mainstream consolidation platform. The server could use two to four sockets with reasonably fast CPU cores, or dual sockets with very fast cores. Total memory capacity is less than for the gold platform for architectural reasons.

High I/O requirements could be satisfied by PCIe-based SSD as an option. A PCIe SSD can either store data the way a directly attached block device does, or it can act as a write-through cache.

In addition to the “bread and butter” production workloads, such a platform could be used as an integration testing platform for the gold servers to save cost. If possible, it should not be used as a UAT platform for gold servers. Using different architecture and hardware has never been a great recipe to ensure a smooth testing period—and more importantly—a smooth production rollout.

Bronze

The bronze platform could be aimed at development and integration systems. These are early in the development cycle, and rapid provisioning of a production clone to debug a serious issue is more important than the latest and greatest technology, memory, etc. Another use case for these systems is as repositories for slowly growing data.

There would most likely be no flash storage due to cost considerations.

Table 3-2’s matrix has only taken storage and CPUs into account. There are, of course, many more factors influencing the final offering from the engineers. The purpose of this section was to encourage you to think outside the box and include more than the immediately obvious aspects in the mix. The preceding table has deliberately been kept short. Offering too many options could confuse less technical users of the hosting service, leading to lots of questions that could be avoided.

Finally, you should be aware of specific hardware solutions for running Oracle workloads. Exadata has already been mentioned in the sections above. The other option you have from Oracle is the Oracle Database Appliance (ODA). Both of these are part of the Engineered Systems initiative by Oracle, which aims at giving customers a balanced platform for database processing. Both Exadata and the ODA benefit from the fact that their hardware and software stacks are standardized. Over and above that standardization, Oracle will provide you with patches to the whole system. In the case of Exadata, the storage tier as well as the database servers are supplied with regular patches. Patches to the database running on top of these are also closely related and specifically provided by Oracle. The benefit of such a combination is a further reduced effort in certifying the whole stack by the engineering department. In the ideal case, only the patch application needs to be thoroughly tested before it is rolled out. It’s also possible to create your own platform. You can read more about that approach in the following sections.

Consolidation Features in Linux

The Linux platform appears to the author as one of the most dynamic operating systems currently available for enterprise use. This can be a blessing and a curse at the same time. Engineering departments usually need to spend time to certify the operating system with the various components that are required to convert industry-standard hardware into a platform suitable for mass rollout. As you can imagine, this certification takes time, and the scarcer the resources, the longer it takes. You read in the introduction chapter that resources in all departments, but especially in engineering, are becoming scarce. Quick release cycles for Linux kernels seem counterproductive in this context. This might be a reason why you see enterprise distributions seemingly using the same kernel (like 2.6.18.x for Red Hat 5.x) even though the change logs indicate hundreds of backports from the mainline kernel.

One of the open-source ideals, however, is “release small, but release often,” and that will not change. New features in enterprise-class hardware are often well supported by the Linux distributions. While the mainline kernel as maintained by Linus Torvalds and other subsystem maintainers moves quickly, the enterprise distributions are careful not to release too often. Oracle Linux appears to be becoming an exception, as the Unbreakable Enterprise Kernel (“UEK”) is updated at a faster pace. Each new kernel potentially provides new functionality and support for hardware, which makes it advisable to check the release notes. The following sections explain some interesting features in Linux that appeared with Oracle Linux/Red Hat Enterprise Linux 6.

Non-Uniform Memory Architecture with Intel X86–64

The Non-Uniform Memory architecture has been mentioned in a number of places in this book, and this section is finally providing more information about it. You read earlier that x86–64-based servers have the capability to address huge amounts of memory compared to systems a few years ago. So what does that mean? In the days of the classic Symmetrical Multiprocessor (SMP) systems, memory access had to be shared by all processors on a common bus. The memory controller was located on the Front Side Bus, or FSB. Main memory was attached to a chip called the Northbridge on the mainboard. Needless to say, this approach did not scale well with increasing numbers of CPUs and their cores. Additionally, high speed I/O PCI cards could also be attached to the Northbridge, further adding to the scalability problem.

The memory-access problem was addressed by moving the memory closer to the processor. Modern CPUs have memory controllers on the chip, allowing them to address their local memory at very high speeds. Furthermore, there is no shared bus between CPUs; all CPU-to-CPU connections in the x86–64 architecture are point-to-point. In some configurations, however, an additional hop is required to access remote memory. In such systems, which are not in scope of this section, CPU 1, for example, can only directly talk to CPU 2. If it needs to communicate with CPU 3, then it has to ask CPU 2 first, adding more latency. A typical dual-socket system is shown in Figure 3-2:

9781430244288_Fig03-02.jpg

Figure 3-2. Simplified schematic design of a two-socket system

Current x86–64 processors have multiple DDR3 memory channels—individual memory channels, however, are not shown in this figure. The operating system has the task of masking the location of memory from its processes. The goal is to allow applications that are not NUMA-aware to execute on NUMA hardware. However, to take advantage of local memory, they should be! Remember for the rest of the section that local memory access is faster and hence preferred. If local memory cannot be used, then remote memory needs to be used instead. This can happen if a process migrates (or has to migrate) from one CPU to another, in which case local and remote memory are reversed. The previously mentioned point-to-point protocols allow the CPU to access remote memory. This is where the most common NUMA-related problem lies. With every Oracle database it is important to have predictable performance. There are a number of reports from users who have experimented with NUMA: some processes executed a lot faster than without NUMA, but some others did not. The reason was eventually determined to be access to remote memory in a memory-sensitive application. This had the following consequences:

  • Oracle’s “oracle-validated” and “oracle-rdbms-server-12cR1-preinstall” RPMs disable NUMA entirely by adding numa=off to the boot loader command line.
  • Multiple notes on My Oracle Support do not openly discourage the use of NUMA but make it very clear that enabling NUMA can cause performance changes. A change can be positive or negative; you are also told to test the impact of enabling NUMA first, but that almost goes without saying.

So how can you make use of NUMA? First you have to enable it—check that your boot loader does not have numa=off appended to the kernel command line, as shown here:

kernel /boot/vmlinuz-2.6.xxx ro root=... rhgb quiet numa=off
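
The parameters the running kernel was actually booted with can be checked in /proc/cmdline. If numa=off shows up there, remove it from the boot loader configuration and reboot; the root device in this example is illustrative:

[root@server1 ∼]# cat /proc/cmdline
ro root=/dev/mapper/vg00-lv_root rhgb quiet numa=off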

It might additionally be necessary to enable NUMA in the BIOS or EFI of your machine. After a reboot your server is potentially NUMA aware. Linux uses the numactl package, among other software, to control NUMA. To list your NUMA nodes in the server, you use the following command:

[oracle@server1 ∼]$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 1622 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8192 MB
node 1 free: 41 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8192 MB
node 2 free: 1534 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8176 MB
node 3 free: 1652 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
[oracle@server1 ∼]$

In this example, you see a small system with four memory sockets. For each NUMA node, the IDs of its CPUs are reported, which are in fact cores. In addition to the cores per NUMA node, you get the size and utilization of the node’s memory. The previously shown server has 32 GB available in total, with about 8 GB per NUMA node. The node distance map is based on the ACPI SLIT—the Advanced Configuration and Power Interface’s System Locality Information Table. Don’t worry about the names and technology. What matters is that the numbers in the table show the relative latency to access memory from a particular node, normalized to a base value of 10. Additional information, similar to the /proc/meminfo output but per NUMA node, can be found in /sys/devices/system/node/*/meminfo and related files in the same directory. If you have allocated large pages, then you will find more information about these in /sys/devices/system/node/*/hugepages/.
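
The numastat utility, which is also part of the numactl package, provides a quick overview of how memory allocations have been satisfied per node. The figures in this example are illustrative; a steadily growing numa_miss column would indicate that processes are allocating memory away from their preferred node:

[oracle@server1 ∼]$ numastat
                           node0           node1           node2           node3
numa_hit                25897121        19248223        21004818        20559907
numa_miss                  14893           20117           11240            9771
numa_foreign               20117           14893            9771           11240
interleave_hit             12705           12719           12711           12708
local_node              25881210        19230981        20988762        20545003
other_node                 30804           37359           27296           24675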

The numactl utility is not only used to expose the hardware configuration, but it can also actively influence how processes and their memory access are handled by applying what is called the NUMA policy. By default, the NUMA policy matches the one shown here:

[oracle@server1 ∼]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3

The default NUMA policy sets the current node as the preferred one and does not enforce binding processes to memory nodes or CPUs. If, for example, you want to override the default policy, you have plenty of options:

[root@server1 ∼]# numactl
usage: numactl [--interleave=nodes] [--preferred=node]
               [--physcpubind=cpus] [--cpunodebind=nodes]
               [--membind=nodes] [--localalloc] command args ...
       numactl [--show]
       numactl [--hardware]
       numactl [--length length] [--offset offset] [--shmmode shmmode]
               [--strict]
               [--shmid id] --shm shmkeyfile | --file tmpfsfile
               [--huge] [--touch]
               memory policy | --dump | --dump-nodes
 
memory policy is --interleave, --preferred, --membind, --localalloc
nodes is a comma delimited list of node numbers or A-B ranges or all.
cpus is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
length can have g (GB), m (MB) or k (KB) suffixes

The meaning of all these options is well described in the manual page for numactl. However, there is a better way to control memory and process allocation, described in the next section. Although the Oracle database has supported NUMA for quite a while, you should be very careful when setting the NUMA-related initialization parameters in an instance. I would like to encourage you to resort to Control Groups instead, unless you are on Oracle Linux 5 without the Unbreakable Enterprise Kernel. But even then it is important, just as the My Oracle Support notes say, to thoroughly test the implications of enabling NUMA support.

Control Groups

Control Groups, or cgroups for short, are an interesting new feature that made it into the latest enterprise Linux kernels. Control Groups are available with Oracle’s Kernel UEK in Oracle Linux 5 or with Oracle Linux 6, where it does not matter which kernel you use. Control Groups are a great way to divide a powerful multi-core, NUMA-enabled server into smaller, more manageable logical entities. Some of the concepts you will read about in this section sound similar to Solaris projects, which are part of the Solaris resource management framework.

image Note  You can read more about Solaris Zones in Chapter 4.

The current Linux implementation is not quite as advanced as its Solaris counterpart yet. But who knows, maybe the Linux community will get a very similar feature: development on LXC, or Linux Containers, has already begun in earnest.

You read in the previous section that modern computer systems have local and remote memory in a non-uniform memory architecture. Depending on your system’s microarchitecture, accessing remote memory can be more costly, or a lot more costly, than accessing local memory. It is a good idea to access local memory if possible, as you have seen in the previous section. Very large multi-processor systems with more than four sockets can easily be divided vertically into logical units matching a socket. In such a situation it is crucial to have enough local memory available! Assuming that your system has four NUMA nodes and 64 GB of memory, you would expect each NUMA node to have approximately 16 GB of RAM.

Control Groups work by grouping processes and attaching the groups to subsystems (resource controllers) such as memory and cpu. All children of processes belonging to a control group will also automatically be associated with that cgroup. Cgroups are not limited to managing CPU and memory: further subsystems have been added to the framework. You can query your system to learn more about which subsystems are available and in use:

[root@server1 cgroup]# lssubsys -am
cpuset /cgroup/cpuset
cpu /cgroup/cpu
cpuacct /cgroup/cpuacct
memory /cgroup/memory
devices /cgroup/devices
freezer /cgroup/freezer
net_cls /cgroup/net_cls
blkio /cgroup/blkio

image Note  This section is about the cpuset subsystem only. The complete reference to Control Groups can be found in the Oracle Linux Administrator’s Solution Guide for Release 6 in Chapter 8. You are encouraged to review it to get a better understanding of all the interesting features available in the other subsystems!

The preceding output shows the default configuration from a recent Oracle Linux 6 system with package libcgroup installed. If enabled, the cgconfig service, which is part of the package, will read the configuration file in /etc/cgconfig.conf and mount the subsystems accordingly. Interestingly, the control group hierarchy is not exposed in the output of the mount command but rather in /proc/mounts.
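
You can verify the mounted hierarchies as shown in this abbreviated example:

[root@server1 ∼]# grep cgroup /proc/mounts
cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
[...]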

From a consolidation point of view, it makes sense to limit processes to CPU socket(s). The aim of this section is to create two databases and bind all their respective processes to an individual socket and its memory. Control Groups are externalized to userland through a set of virtual files, similar to the sys file system (sysfs) in Linux. Figure 3-3 demonstrates the relationship between the different subsystems and Control Groups.

image

Figure 3-3. The cgroup example explained

Thankfully for the administrator, the first step of defining mount points for each subsystem is already done for us with the installation of the libcgroup package. You read earlier that the cgroup mount point resembles the sysfs mount, following the UNIX paradigm that everything is a file. The contents of the top-level cpuset directory are shown here:

[root@server1 ∼]# ls /cgroup/cpuset/
cgroup.clone_children  cpuset.memory_migrate           cpuset.sched_relax_domain_level
cgroup.event_control   cpuset.memory_pressure          notify_on_release
cgroup.procs           cpuset.memory_pressure_enabled  release_agent
cpuset.cpu_exclusive   cpuset.memory_spread_page       tasks
cpuset.cpus            cpuset.memory_spread_slab
cpuset.mem_exclusive   cpuset.mems
cpuset.mem_hardwall    cpuset.sched_load_balance
[root@server1 ∼]#

This is the top-level hierarchy, or the hierarchy’s root. Some of the properties in this directory are read-write, but most are read-only. If you make changes to a property here, the change will be inherited by the child groups. These children are the actual control groups, which need to be created next:

[root@server1 cgroup]# cgcreate -t oracle:dba -a oracle:dba -g cpuset:/db1
[root@server1 cgroup]# cgcreate -t oracle:dba -a oracle:dba -g cpuset:/db2

It is indeed that simple! The cgcreate command will create control groups db1 and db2 for the cpuset controller, as shown by the -g flag. The -a and -t flags allow the oracle account to administer and add tasks to the cpuset. Behind the scenes you will see that it creates two subdirectories in the cpuset directory, each with its own set of virtual configuration files. These control groups will inherit information from their parent, with the exception of the cpuset.memory_pressure_enabled and release_agent fields. From an Oracle point of view, all that remains to be done is to define which CPUs and which memory nodes should be assigned to the cgroup. You use the cgset command to do so. There is a confusingly large number of attributes that can be set, but luckily the documentation is quite comprehensive. Following the example from the previous section, the first six “CPUs,” which are cores in reality, are mapped to db1; the other six are assigned to db2. The same happens for memory:

[root@server1 cpuset]# cgset -r cpuset.cpus=0-5 db1
[root@server1 cpuset]# cgset -r cpuset.mems=0 db1
[root@server1 cpuset]# cgset -r cpuset.cpus=6-11 db2
[root@server1 cpuset]# cgset -r cpuset.mems=2 db2

Note that memory is allocated by NUMA node; the CPU cores are enumerated the same way the operating system sees them. Further enhancements could include a decision on whether the CPUs should be usable exclusively by the processes in a cgroup. If cpuset.cpu_exclusive is set to 1, no other processes can utilize the CPUs in that cgroup—use this feature with care! The same applies to cpuset.mem_exclusive, but related to memory.
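Should you decide that you need exclusivity, the attributes are set the same way as the others:

[root@server1 cpuset]# cgset -r cpuset.cpu_exclusive=1 db1
[root@server1 cpuset]# cgset -r cpuset.mem_exclusive=1 db1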

The manual way to start a process in a certain cgroup is to use the cgexec command. Consider this example for starting an Oracle database:

[oracle@server1 ∼]$ cgexec -g cpuset:db1 /u01/app/oracle/product/12.1.0.1/dbhome_1/bin/sqlplus
 
SQL*Plus: Release 12.1.0.1.0 Production on Fri Jun 28 13:49:51 2013
 
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
 
Enter user-name: / as sysdba
Connected to an idle instance.
 
SQL> startup
ORACLE instance started.
...
Database opened.
SQL> exit

An alternative approach would be to add the PID of your current shell to the tasks pseudo-file in the control group directory. This way all the shell’s child processes automatically belong to the control group:

[oracle@server1 ∼]$ echo $$ > /cgroup/cpuset/db1/tasks

Now start the Oracle database as normal. Since the oracle account has been specified as an administrator of the cgroup during its creation, this operation is valid.

In addition to the manual way of adding processes to cgroups, you can make use of the new initialization parameter processor_group_name, available on Linux and Solaris. So instead of using cgexec, you simply set the initialization parameter to point to the existing cgroup:

SQL> show parameter processor_group_name
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
processor_group_name                 string      db1

There are a few ways to validate the cgroup mapping. The most common are probably these:

  • The taskset command
  • Checking the /proc/pid/cgroup pseudo file
  • Using custom options to the ps command

Using the taskset approach, you need the process ID of your database first:

[oracle@server1 ∼]$ ps -ef | grep CDB1
oracle    2149     1  0 14:24 ?        00:00:00 ora_pmon_CDB1
oracle    2151     1  0 14:24 ?        00:00:00 ora_psp0_CDB1
oracle    2153     1  1 14:24 ?        00:00:06 ora_vktm_CDB1
 
[root@server1 ∼]# taskset -cp 2149
pid 2149's current affinity list: 0-5

This confirms that the cgroup settings are indeed in place. If you want to know the name associated with the control group as well, check the /proc entry for the process:

[oracle@server1 ∼]$ cat /proc/2149/cgroup | grep cpuset
1:cpuset:/db1

And finally, you can view the same information with the ps command as well. Feel free to remove fields you do not need:

[oracle@server1 ∼]$ ps -ax --format uname,pid,cmd,cls,pri,rtprio,cgroup
...
oracle    2149 ora_pmon_CDB1                TS  19      - ... cpuset:/db1
oracle    2151 ora_psp0_CDB1                TS  19      - ... cpuset:/db1
oracle    2153 ora_vktm_CDB1                RR  41      1 ... cpuset:/db1
...

With that knowledge it should be relatively simple to track down shared pool and other memory-related problems with the database.

Up to this point, all the control group configuration is transient. In other words, if your server reboots, the control group configuration will be gone. This can become a serious headache because it might prevent databases from starting if the processor_group_name parameter is set but the corresponding control group is not defined at the operating system level. Thankfully, you can dump your configuration into a file to be read by the cgconfig service. Use the cgsnapshot utility to write the configuration to standard output and copy and paste the relevant bits into the main configuration file, /etc/cgconfig.conf. This way the configuration will be read and applied during the system boot.
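As an illustration—your cgsnapshot output is the authoritative source—the db1 group created earlier would translate into an /etc/cgconfig.conf entry roughly like this:

group db1 {
        perm {
                task {
                        uid = oracle;
                        gid = dba;
                }
                admin {
                        uid = oracle;
                        gid = dba;
                }
        }
        cpuset {
                cpuset.cpus = "0-5";
                cpuset.mems = "0";
        }
}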

In addition to the processor_group_name parameter, a little additional setup work needs to be performed. First of all, it is recommended to set another initialization parameter named use_dedicated_broker to true, and also to enable the new connection broker in the listener configuration file by setting DEDICATED_THROUGH_BROKER_listener_name to on, followed by a reload of the listener. Note that if you are using the new multi-threaded architecture, the use_dedicated_broker parameter is already set to true. You can read more about the multi-threaded architecture in Chapter 2.
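A minimal sketch of those two steps, assuming the default listener name LISTENER, could look like this:

SQL> alter system set use_dedicated_broker = true scope=spfile;

# listener.ora entry, followed by "lsnrctl reload"
DEDICATED_THROUGH_BROKER_LISTENER = ON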

Benchmarks

When building a consolidated platform, you might want to consider testing hardware from different vendors to ensure that your choice optimally fits your environment, standard operating procedures, and deployment process. Getting hardware on loan for evaluation is a great way of testing combinations of hardware, software, networking, and storage. From the outset of this testing, you should ensure the following:

  • Availability of dedicated resources. For example, if you are evaluating a storage solution, you need resources from storage engineering or whoever is responsible. If necessary, you need operating system support to compile drivers and package them in the form of RPMs. If you are resource-constrained, it might be better to postpone the evaluation until the respective teams can dedicate contiguous blocks of time. There is nothing more frustrating than having to return the hardware without having been able to come to a conclusion.
  • Define a test plan. Your evaluation should be planned well in advance. Benchmarks, workloads to be run, and (performance) statistics to be gathered need to be defined and documented for each iteration. A minimum number of iterations should be defined as well to minimize statistical outliers.
  • Ensure that the hardware can be installed and configured. Your data center needs to have enough rack space, network ports, and electricity, and the raised floor must support the weight of the hardware. This sounds trivial, but neglecting these factors can lead to significant delay. Referring back to the first bullet point, your data center management also needs to be available to add cables and to configure the essential remote access console for the engineers to take over. If Fibre Channel-based storage is required, then the servers need to be connected to the right fabric switches, and initial zoning needs to be performed.
  • Vendor support must be secured, at least for the duration of the evaluation. The support you get during the evaluation must be at least adequate—bear in mind that the vendor ought to be very keen to sell. Poor support quality during the proof-of-concept phase might well be an indication of things to come. Potential support problems should be treated seriously and put on the critical path if necessary.
  • If you are migrating to a different hardware platform such as x86, then you need support from the operational team in case you have any questions about the application after the migration has taken place. This favor can easily be returned to the operational team by scripting the migration process or at least documenting it to a good, detailed standard.

There are a number of benchmarks you can employ, with varying degrees of meaningfulness. The best benchmarks for Oracle workloads are your own applications. No synthetic benchmark, regardless of how well it is written, can represent your workload. If your applications perform well on the new platform, then there is a higher probability of the platform’s acceptance and ultimate success. The worst possible outcome for the consolidated platform would be the lack of acceptance!

image Caution  Some of the benchmarks listed and described below can perform destructive testing and/or cause high system load. Always ensure that your benchmark does not have any negative impact on your environment! Do not benchmark outside your controlled and isolated lab environment! Always carefully read the documentation that comes with the benchmark, and be especially careful when it comes to benchmarks that perform writes.

If you cannot find representative workloads from within the range of applications in use in your organization, you may have to resort to an off-the-shelf benchmark. Large Enterprise Resource Planning systems usually come with a benchmark, but those systems are unlikely candidates for your consolidation platform. If you want to test different components of your system, specialized benchmarks come to mind, but those don’t test the platform end-to-end. For storage related testing you could use these benchmarks, for example:

  • iozone
  • bonnie++
  • hdbench
  • fio
  • And countless others . . .

Of all the available storage benchmarks, FIO is particularly interesting. It is a very flexible benchmark, written by Jens Axboe to test different I/O schedulers in Linux, and it will be presented in more detail below.

Network-related benchmarks, for example, include the officially Oracle-sanctioned iperf and others. Kevin Closson’s Silly Little Benchmark tests memory performance, and so does the University of Virginia’s STREAM benchmark. The chips on the mainboard, including the CPU and also cooling, can be tested using the High Performance Linpack benchmark, which is often used in High Performance Computing (HPC).

Each of these benchmarks can be used to get the bigger picture, but none is really suited to assess the system’s qualities when used in isolation. More information about the suitability of the platform, especially with respect to storage performance, can be obtained by using Oracle IO Numbers (ORION) or the Silly Little Oracle Benchmark (SLOB) written by Kevin Closson. The sections below present a small selection of the available benchmarks.

FIO

FIO is an intelligent I/O testing platform written by Jens Axboe, whom we also have to thank for the Linux I/O schedulers and much more. Out of all the I/O benchmarks, I like FIO because it is very flexible and offers a wealth of output for the performance analyst. As with most I/O-related benchmarks, FIO works best in conjunction with other tools to get a more complete picture.

The tool’s greatest strength is its flexibility, but that flexibility requires a little more time to learn all its ins and outs. However, if you take the time to learn FIO, you will automatically learn more about Linux as well. FIO benchmark runs are controlled by plain text files with instructions, called job files. A sample file is shown here; it will be used later in the chapter:

[oracle@server1 fio-2.1]$ cat rand-read-sync.fio
[random-read-sync]
rw=randread
size=1024m
bs=8k
directory=/u01/fio/data
ioengine=sync
iodepth=8
direct=1
invalidate=1
ioscheduler=noop

In plain English, the job named “random-read-sync” uses a random-read workload, creates a file of 1 GB in size in the directory /u01/fio/data/, and uses a block size of 8 KB, which matches Oracle’s standard database block size. A maximum of 8 outstanding I/O requests is allowed, and the Linux I/O scheduler to be used is the noop scheduler, since the storage in this case is non-rotational. The use of direct I/O is also requested, and the page cache is invalidated first to avoid file system buffer hits, for consistency with the other scripts in the test harness—direct I/O will bypass the page cache anyway, so strictly speaking, the directive is redundant.

Linux can use different ways to submit I/O to the storage subsystem—synchronous and asynchronous. Synchronous I/O is also referred to as blocking I/O, since the caller has to wait for the request to finish. Oracle will always use synchronous I/O for single block reads such as index lookups. This is true, for example, even when asynchronous I/O is enabled. The overhead of setting up an asynchronous request is probably not worth it for a single block read.

There are situations wherein synchronous processing of I/O requests does not scale well enough. System designers therefore introduced asynchronous I/O. The name says it all: when you issue I/O requests asynchronously, the requestor can continue with other tasks and will “reap” the outstanding I/O request later. This approach greatly enhances concurrency but, at the same time, makes tracing a little more difficult. As you will also see, the latency of individual I/O requests increases with a larger queue depth. Asynchronous I/O is available on Oracle Linux with the libaio package.
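You can quickly check whether the package is installed; the version shown here is simply what a current Oracle Linux 6 system might report:

[root@server1 ∼]# rpm -q libaio
libaio-0.3.107-10.el6.x86_64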

Another I/O variant is called direct I/O, which instructs a process to bypass the file system cache. The classic UNIX file system buffer cache is known as the page cache in Linux. This area of memory is the reason why users see so little free memory in Linux: those portions of memory that are not needed by applications are used for the page cache instead.

Unless instructed otherwise, the kernel will cache information read from disk in said page cache. This cache is not dedicated to a single process, allowing other processes to benefit from it as well. The page cache is a great concept for many Linux applications but not necessarily beneficial for the Oracle database. Why not? Remember from the preceding introduction that the result of regular file I/O is stored in the page cache in addition to being copied to the user process’s buffers. This is wasteful in the context of Oracle, since Oracle already employs its own buffer cache to counter the relatively slow disk I/O operations.

There are cases in which enabling direct I/O causes performance problems, but those can easily be caught in test and development environments. Using direct I/O allows the performance analyst to get more accurate information from the performance pages in Oracle: if you see a 1 ms response time, you cannot otherwise be sure whether that is a great response time from the storage array or rather a cached block from the operating system. If possible, you should consider enabling direct I/O after ensuring that it does not cause performance problems.
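For file system based databases, direct and asynchronous I/O are both controlled by the filesystemio_options initialization parameter; SETALL enables both. The parameter is not dynamic, so an instance restart is required:

SQL> alter system set filesystemio_options = setall scope=spfile;

System altered.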

To demonstrate the difference between synchronous and asynchronous I/O, consider the following job file for asynchronous I/O.

[oracle@server1 fio-2.1]$ cat rand-read-async.fio
[random-read-async]
rw=randread
size=1024m
bs=8k
directory=/u01/fio/data
ioengine=libaio
iodepth=8
direct=1
invalidate=1
ioscheduler=noop

Let’s compare the two—some information has been removed in an effort not to clutter the output. First, the synchronous benchmark:

[oracle@server1 fio-2.1]$ ./fio rand-read-sync.fio
random-read-sync: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=sync, iodepth=8
fio-2.1
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [39240KB/0KB/0KB /s] [4905/0/0 iops] [eta 00m:00s]
random-read-sync: (groupid=0, jobs=1): err= 0: pid=4206: Mon May 27 23:03:22 2013
  read : io=1024.0MB, bw=39012KB/s, iops=4876, runt= 26878msec
    clat (usec): min=71, max=218779, avg=203.77, stdev=605.20
     lat (usec): min=71, max=218780, avg=203.85, stdev=605.20
    clat percentiles (usec):
     |  1.00th=[  189],  5.00th=[  191], 10.00th=[  191], 20.00th=[  193],
     | 30.00th=[  195], 40.00th=[  197], 50.00th=[  205], 60.00th=[  207],
     | 70.00th=[  209], 80.00th=[  211], 90.00th=[  213], 95.00th=[  215],
     | 99.00th=[  225], 99.50th=[  233], 99.90th=[  294], 99.95th=[  318],
     | 99.99th=[  382]
    bw (KB  /s): min=20944, max=39520, per=100.00%, avg=39034.87, stdev=2538.87
    lat (usec) : 100=0.08%, 250=99.63%, 500=0.29%, 750=0.01%, 1000=0.01%
    lat (msec) : 4=0.01%, 20=0.01%, 250=0.01%
  cpu          : usr=0.71%, sys=6.17%, ctx=131107, majf=0, minf=27
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=131072/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
   READ: io=1024.0MB, aggrb=39012KB/s, minb=39012KB/s, maxb=39012KB/s,
         mint=26878msec, maxt=26878msec
Disk stats (read/write):
  sda: ios=130050/20, merge=0/4, ticks=25188/12, in_queue=25175, util=94.12%

The important information here is that the test itself took 26,878 milliseconds to complete, and the storage device completed an average of 4,876 I/O operations per second for an average bandwidth of 39,012 KB/s. Latencies are broken down into submission latency (“slat”—not applicable for synchronous I/O; you will see it in the output below) and completion latency (“clat”). The last number, recorded as “lat” in the output above, is the total latency and should be the sum of submission and completion latency.

Other important information is that our CPU has not been terribly busy while it executed the benchmark, but this is an average over all cores. The IO depths row shows the I/O depth over the duration of the benchmark execution. As you can see, the use of synchronous I/O mandates an I/O depth of 1. The “issued” row lists how many reads and writes have been issued. Finally, the result is repeated at the bottom of the output. Let’s compare this with the asynchronous test:

[oracle@server1 fio-2.1]$ ./fio rand-read-async.fio
random-read-async: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=8
fio-2.1
Starting 1 process
random-read-async: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [r] [100.0% done] [151.5MB/0KB/0KB /s] [19.4K/0/0 iops] [eta 00m:00s]
random-read-async: (groupid=0, jobs=1): err= 0: pid=4211: Mon May 27 23:04:28 2013
  read : io=1024.0MB, bw=149030KB/s, iops=18628, runt=  7036msec
    slat (usec): min=5, max=222754, avg= 7.89, stdev=616.46
    clat (usec): min=198, max=14192, avg=408.59, stdev=114.53
     lat (usec): min=228, max=223110, avg=416.61, stdev=626.83
    clat percentiles (usec):
     |  1.00th=[  278],  5.00th=[  306], 10.00th=[  322], 20.00th=[  362],
     | 30.00th=[  398], 40.00th=[  410], 50.00th=[  414], 60.00th=[  422],
     | 70.00th=[  430], 80.00th=[  442], 90.00th=[  470], 95.00th=[  490],
     | 99.00th=[  548], 99.50th=[  580], 99.90th=[  692], 99.95th=[  788],
     | 99.99th=[ 1064]
    bw (KB  /s): min=81328, max=155328, per=100.00%, avg=149248.00, stdev=19559.97
    lat (usec) : 250=0.01%, 500=96.31%, 750=3.61%, 1000=0.06%
    lat (msec) : 2=0.01%, 20=0.01%
  cpu          : usr=3.80%, sys=21.68%, ctx=129656, majf=0, minf=40
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=131072/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
   READ: io=1024.0MB, aggrb=149030KB/s, minb=149030KB/s, maxb=149030KB/s,
         mint=7036msec, maxt=7036msec
Disk stats (read/write):
  sda: ios=126931/0, merge=0/0, ticks=50572/0, in_queue=50547, util=95.22%

Apart from the fact that the test completed a lot faster—7,036 msec vs. 26,878 msec—you can see that bandwidth is a lot higher. The libaio benchmark provides a lot more IOPS as well: 18,628 vs. 4,876. A greater effective I/O depth yields better bandwidth in the asynchronous case at the expense of the latency of individual requests. You do not need to spend too much time trying to find the maximum I/O depth on your platform, since Oracle will transparently make use of the I/O subsystem, and you should probably not change any potential underscore parameter.

You should also bear in mind that maximizing I/O operations per second is not the only target variable when optimizing storage. A queue depth of 32, for example, will provide a large number of IOPS, but this number means little without the corresponding response time. Compare the output below, which has been generated with the same job file as before but a queue depth of 32 instead of 8:

[oracle@server1 fio-2.1]$ ./fio rand-read-async-32.fio
random-read-async: (g=0): rw=randread, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=32
[...]
  read : io=1024.0MB, bw=196768KB/s, iops=24595, runt=  5329msec
[...]
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=99.84%, 4=0.13%

The number of IOPS has increased from 18,628 to 24,595, and the execution time is down to 5,329 milliseconds instead of 7,036, but this has come at a cost. Instead of microseconds, 99.84% of the I/O requests now complete in milliseconds.

When you are working with storage and Linux, FIO is a great tool to help you understand the storage solution as well as the hardware connected to it. FIO has a lot more to offer with regard to workloads. You can read and write sequentially or randomly, and you can mix these as well—even within a single benchmark run! The permutations of I/O type, block sizes, direct and buffered I/O, combined with the options to run multiple jobs, use differently sized memory pages, and so on, make FIO one of the best benchmark tools available. There is a great README as part of FIO, and its author maintains a HOWTO document as well, for those who want to know all the details: http://www.bluestop.org/fio/HOWTO.txt.
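As a quick illustration of the mixed workloads just mentioned, here is a hypothetical job file for a 70/30 random read/write split. Remember that write testing will destroy the contents of the target files, so run it only in the lab:

[oracle@server1 fio-2.1]$ cat rand-rw-mix.fio
[random-rw-mix]
rw=randrw
rwmixread=70
size=1024m
bs=8k
directory=/u01/fio/data
ioengine=libaio
iodepth=8
direct=1
invalidate=1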

Oracle I/O Numbers

The ORION tool has been available for quite some time, initially as a separate download from Oracle’s website and now as part of the database installation. This, in a way, is a shame, since a lot of its initial attractiveness came from the fact that the host did not require a database installation. If you can live with a different, lower version number, then you can still download ORION for almost all platforms from Oracle’s OTN website. At the time of this writing, the following URL allowed you to download the binaries: http://www.oracle.com/technetwork/topics/index-089595.html. The same link also allows you to download the ORION user’s guide, which is more comprehensive than this section. From 11.2 onwards you can find the ORION documentation as part of the official documentation set in the “Performance Tuning Guide.” Instead of the 11.1 binaries downloadable as a standalone package, this book will use the ones provided with the Oracle server installation.

ORION works best during the evaluation phase of a new platform, without user data on your logical unit numbers (LUNs). (As with all I/O benchmarking tools, write testing is destructive!) The package will use asynchronous I/O and large pages where possible, for reads and writes, to concurrently submit I/O requests to the operating system. Asynchronous I/O in this context means the use of libaio on Linux, or an equivalent library, as discussed in the FIO benchmark section above. You can quite easily see that the ORION slave processes use io_submit and io_getevents by executing strace on them:

[root@server1 ∼]# strace -cp 28168
Process 28168 attached - interrupt to quit
^CProcess 28168 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.79    0.323254           4     86925           io_submit
  2.11    0.006968           1     12586           io_getevents
  0.10    0.000345           0     25172           times
------ ----------- ----------- --------- --------- ----------------
100.00    0.330567                124683           total
[root@server1 ∼]#

The process with ID 28168, to which the strace session attached, was a running ORION session.

So what does ORION allow you to do? It’s easiest to start off with the way it generates I/O workloads. Internally ORION uses a two-dimensional matrix depicting small I/O and large I/O requests. Columns in the matrix represent small I/O requests; rows are large I/O requests. This matrix approach is illustrated in Table 3-3:

Table 3-3. The Point Matrix Used Internally by ORION

image

Small I/O requests are user-configurable but default to 8k, which equals the default block size of modern Oracle databases and most other databases. Large I/O requests can also be defined as an input parameter to the benchmark run; they default to 1 MB, which is in line with the default maximum size of a single I/O on most platforms. Benchmarks generate load by increasing the concurrency of outstanding I/O requests, either with or without additional large I/O requests. For each element in the matrix, ORION records information and presents it later in text files. To use ORION, you need to pass an argument to the mandatory parameter “run.” It can take the following values:

  • Simple: A simple run that tests random small 8k I/Os in isolation, i.e., uses row 0 in the above matrix: multiple concurrent small I/Os but no large ones. It then runs tests using multiple concurrent large I/Os, i.e., stays in column 0. It does not combine small and large I/Os.
  • Normal: Traverses the whole matrix and tests various combinations of small and large I/Os. This run can take a while to complete.
  • DSS: Tests only random large I/Os at increasing concurrency, i.e., stays in column 0 in the above matrix.
  • OLTP: Tests only small I/Os, i.e., stays in row 0 in the above matrix.
  • Advanced: Allows you to define your own workload by picking rows, columns, or individual data points in the matrix. Advanced testing is out of scope for this chapter.

There are some concerns about whether ORION really represents a true database workload. One argument against ORION is that it does not have to deal with a buffer cache and its maintenance, nor some of the other work an active Oracle instance has to perform. Despite these few downsides, ORION is a good tool, albeit not 100 percent accurate. It certainly is good enough to get a better understanding of your storage subsystem when used as a baseline.

Before you can start a benchmark run with ORION, you need to define a set of targets you want to test against. These are usually LUNs provisioned from the storage array or a directly attached block device, maybe even via PCIe. The LUNs need to be listed in an input file and terminated with a new line. If you were testing a system with 5 LUNs for a future ASM disk group “DATA,” the file could have the following contents:

/dev/mapper/asm_data_disk001p1
/dev/mapper/asm_data_disk002p1
/dev/mapper/asm_data_disk003p1
/dev/mapper/asm_data_disk004p1
/dev/mapper/asm_data_disk005p1

Note that the devices have been partitioned in this setup. This was done to align the partition with the storage array block boundaries, based on a vendor recommendation. Partitioning a LUN also indicates that it is in use and thus prevents it from being accidentally deleted.

image Tip  If you are using flash-based storage, then it is almost always required to align the partition at a 4k boundary—remember, the page size for NAND flash is usually 4k. You should also review the vendor documentation to update tunables in /sys/block/sdx/queue/.
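One way to achieve such an alignment—shown here purely as a sketch against one of the example devices, assuming it already carries a partition label—is to start the first partition at a 1 MiB offset, which is a multiple of the 4k page size:

[root@server1 ∼]# parted -a optimal /dev/mapper/asm_data_disk001 mkpart primary 1MiB 100%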

The LUNs need to be saved in a file called orion.lun. You can optionally pass a parameter “testname” to ORION, in which case the LUNs need to be saved as testname.lun. ORION will make use of large pages by default; if your system does not have large pages configured, ORION will error unless you specify the -hugenotneeded flag. To perform a simple, read-only OLTP-like test named “test1,” you need a file test1.lun containing the LUNs to test against. Make sure that the oracle account has access rights to these LUNs. You invoke ORION using its complete path.
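For example, assuming the test1.lun file resides in the current working directory, the invocation could look like this:

[oracle@server1 temp]$ $ORACLE_HOME/bin/orion -run oltp -testname test1

After the calibration run, you can check the output of the “summary” file to get an idea about your storage subsystem: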

ORION VERSION 12.1.0.1.0
 
Command line:
-run oltp -testname test1
 
These options enable these settings:
Test: test1
Small IO size: 8 KB
Large IO size: 1024 KB
IO types: small random IOs, large random IOs
Sequential stream pattern: RAID-0 striping for all streams
Writes: 0%
Cache size: not specified
Duration for each data point: 60 seconds
Small Columns:,      1,      2,      3,      4,      5,      6,      7,      8,      9,
10,     11,     12,     13,     14,     15,     16,     17,     18,     19,     20
Large Columns:,      0
Total Data Points: 21
 
Name: /dev/sdc  Size: 1073741824
1 files found.
 
Maximum Small IOPS=9899 @ Small=17 and Large=0
Small Read Latency: avg=1711 us, min=791 us, max=12316 us, std dev=454 us @ Small=17 and Large=0
 
Minimum Small Latency=123.58 usecs @ Small=1 and Large=0
Small Read Latency: avg=124 us, min=41 us, max=4496 us, std dev=76 us @ Small=1 and Large=0
Small Read / Write Latency Histogram @ Small=17 and Large=0
        Latency:                # of IOs (read)          # of IOs (write)
        0 - 1           us:             0                       0
        2 - 4           us:             0                       0
        4 - 8           us:             0                       0
        8 - 16          us:             0                       0
       16 - 32          us:             0                       0
       32 - 64          us:             10                      0
       64 - 128         us:             364355                  0
      128 - 256         us:             90531                   0
      256 - 512         us:             4281                    0
      512 - 1024        us:             2444                    0
     1024 - 2048        us:             244                     0
     2048 - 4096        us:             38                      0
     4096 - 8192        us:             8                       0
     8192 - 16384       us:             0                       0
    16384 - 32768       us:             0                       0
    32768 - 65536       us:             0                       0
    65536 - 131072      us:             0                       0
   131072 - 262144      us:             0                       0
   262144 - 524288      us:             0                       0
   524288 - 1048576     us:             0                       0
  1048576 - 2097152     us:             0                       0
  2097152 - 4194304     us:             0                       0
  4194304 - 8388608     us:             0                       0
  8388608 - 16777216    us:             0                       0
 16777216 - 33554432    us:             0                       0
 33554432 - 67108864    us:             0                       0
 67108864 - 134217728   us:             0                       0
134217728 - 268435456   us:             0                       0

This output has been produced from a testbed environment; the numbers are not to be used for comparison with a real storage backend.

Large I/Os are random by default but can be defined as sequential if desired. To add write testing to the benchmark, you need to tell ORION to do so. The -write flag indicates what percentage of I/Os should be writes; according to the documentation, the default is 0.
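As a hypothetical example, a run with 20 percent writes could be started as follows—but read the caution below first:

[oracle@server1 temp]$ $ORACLE_HOME/bin/orion -run oltp -testname test1 -write 20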

image Caution  Write testing WILL DESTROY existing data. Double or triple check that the LUNs you test against do NOT contain data!

At the end of the benchmark, you are presented with a lot of information. All the result files are prefixed with the test name you chose with an added time stamp:

[oracle@server1 temp]$ ls -l
total 104
-rw-r--r--. 1 oracle oracle 19261 Jul 22 10:13 test1_20130722_0953_hist.txt
-rw-r--r--. 1 oracle oracle   742 Jul 22 10:13 test1_20130722_0953_iops.csv
-rw-r--r--. 1 oracle oracle   761 Jul 22 10:13 test1_20130722_0953_lat.csv
-rw-r--r--. 1 oracle oracle   570 Jul 22 10:13 test1_20130722_0953_mbps.csv
-rw-r--r--. 1 oracle oracle  1859 Jul 22 10:13 test1_20130722_0953_summary.txt
-rw-r--r--. 1 oracle oracle 18394 Jul 22 10:13 test1_20130722_0953_trace.txt
-rw-r--r--. 1 oracle oracle    20 Jul 22 09:46 test1.lun
[oracle@server1 temp]$

The summary file is the first one to look at. It shows the command line parameters, I/O sizes, and the I/O type, among other information. From a storage-backend perspective, the matrix data points are most interesting:

Duration for each data point: 60 seconds
Small Columns:,      1,      2,      ...     20
Large Columns:,      0
Total Data Points: 21

Here you see that the OLTP test does not use large I/Os at all. The name and size of the LUNs used are also recorded just prior to the performance figures:

Maximum Small IOPS=9899 @ Small=17 and Large=0
Small Read Latency: avg=1711 us, min=791 us, max=12316 us, std dev=454 us @ Small=17 and Large=0
 
Minimum Small Latency=123.58 usecs @ Small=1 and Large=0
Small Read Latency: avg=124 us, min=41 us, max=4496 us, std dev=76 us @ Small=1 and Large=0
Small Read / Write Latency Histogram @ Small=17 and Large=0

Following that information, you are shown the same latency histogram you were presented at the end of the ORION run. In the above example, that’s the histogram for the data point of 17 small I/Os and 0 large I/Os. All other histograms can be found in the test1_20130722_0953_hist.txt file. The other files contain the information listed in Table 3-4.

Table 3-4. Files Generated During an ORION Benchmark (taken from the file headers)

*_hist.txt: Contains histograms of the latencies observed for each data point test. Each data point test used a fixed number of outstanding small and large I/Os. For each data point, histograms are printed for the latencies of small reads, small writes, large reads, and large writes. The value specifies the number of I/Os that were observed within the bucket’s latency range.

*_iops.csv: Contains the rates sustained by small I/Os in IOPS. Each value corresponds to a data point test that used a fixed number of outstanding small and large I/Os.

*_lat.csv: Contains the average latency sustained by small I/Os in microseconds. Each value corresponds to a data point test that used a fixed number of outstanding small and large I/Os.

*_mbps.csv: Contains the rates sustained by large I/Os in MBps. Each value corresponds to a data point test that used a fixed number of outstanding small and large I/Os.

*_trace.txt: Raw data.

The use of ORION should have given you a better understanding of the capabilities of your storage subsystem. Bear in mind that the figures do not represent a true Oracle workload due to the lack of synchronization on Oracle’s shared memory structures. Furthermore, ORION does not use the pread/pwrite calls Oracle employs for single block I/O. However, as an initial test of your storage subsystem, it should be a good enough approximation.

There is a lot more to ORION which could not be covered here, especially when it comes to testing I/O performance of multiple LUNs. It is possible to simulate striping and mirroring, and even to simulate log write behavior, by instructing the software to stream data sequentially.

Silly Little Oracle Benchmark

Don’t be fooled by the name—SLOB is one of the most interesting and hotly discussed Oracle-related storage benchmarks you can get hold of. It has been released by Kevin Closson and has been discussed extensively on social media and weblogs. SLOB is designed to test the Oracle storage backend and is available in source code. A C program controls the execution of the actual benchmark work, which is defined in shell scripts and SQL. Since its initial release SLOB has been enhanced, and a new version of the benchmark has been released as SLOB 2. The following section is about SLOB 2 unless stated otherwise.

image Caution  Do NOT run SLOB outside your controlled lab environment! It has the potential to cause serious trouble, such as creating severe queuing delays and a very noticeable performance hit on your shared storage system. The purpose of this section is to give you a tool to evaluate storage solutions for your consolidation environment, before the solution is deployed. So again, do not run SLOB outside your lab, and especially not on non-lab storage arrays! Also ensure that you read the accompanying documentation carefully and familiarize yourself with the implications of running SLOB.

Before you can actually use SLOB, you need to perform some setup work. The first step is to create a database, and then you load the test data. SLOB is made available here:

http://kevinclosson.wordpress.com/2013/05/02/slob-2-a-significant-update-links-are-here/

At the time of this writing, the May 5 release was current. Download and uncompress the file to a convenient location on the server you want to benchmark. Let’s assume you unzipped it to /home/oracle/SLOB.

Creating and Populating the SLOB Database

The first step when working with the tool is to create the database. For the purpose of our testing, the database will be created as a regular database as opposed to a Container Database. You can read more about the various database types available in Oracle 12.1 in Chapter 7. Despite SLOB having been developed for databases pre-Oracle 12.1, you can use the supplied init.ora and database creation script. Before you launch them, add the database to the oratab file, for example:

SLOB:/u01/app/oracle/product/12.1.0.1/dbhome_1:N

Next, review the initialization file found in ∼/SLOB/misc/create_database_kit. Change the file to match your environment, especially with regard to the Oracle Managed Files (OMF) parameters. You should definitely consider changing the value for compatible to at least 12.0. You may also need to increase the value for processes. You can optionally create a password file if you would like to connect remotely. Ensure that the directories referenced in the initialization file exist and that the database owner has read-write permissions to them. The file used for the below example is shown here for reference. Instead of Automatic Memory Management, as in the original create.ora file, Automatic Shared Memory Management is used.

db_create_file_dest = '/u01/oradata/'
db_name = SLOB
compatible=12.1.0.1.0
UNDO_MANAGEMENT=AUTO
db_block_size = 8192
db_files = 300
processes = 1000
sga_target=10G
filesystemio_options=setall

Note that the above initialization file will only be used to create the database. The initialization file for the actual benchmark test runs is different and can be found in the top-level SLOB directory, named simple.ora. Nevertheless, once the database is created, keep a note of the control_files parameter to ensure you do not have to dig around looking for the control files later.
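Once the instance is started, an easy way to record it is this:

SQL> show parameter control_files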

Now create the database by executing cr_db.sql. A few minutes later, a minimalistic database is ready. It is not in archivelog mode, which could be important for your testing.

In the next step the wait kit needs to be compiled. This is not normally a problem on Linux-based systems but can be a bit of a challenge on Solaris and AIX. If you followed the prerequisites for the database installation on Linux, you should already have a compiler, linker, and interpreter for Makefiles. On other platforms you should have everything except the C compiler “cc.” Regardless of platform, once you have the compiler suite in place, you use the make command in ∼/SLOB/wait_kit to compile the code, as shown here on a Linux system:

[oracle@server1 SLOB]$ cd wait_kit/
[oracle@server1 wait_kit]$ make
rm -fr *.o mywait trigger create_sem
cc     -c -o mywait.o mywait.c
cc -o mywait mywait.o
cc     -c -o trigger.o trigger.c
cc -o trigger trigger.o
cc     -c -o create_sem.o create_sem.c
cc -o create_sem create_sem.o
cp mywait trigger create_sem ../
rm -fr *.o
[oracle@server1 wait_kit]$

The empty database has to be populated with test data next. SLOB works by creating a number of schemas (users)—128 by default—on the tablespace named IOPS, created as part of the cr_db.sql script. A new feature of SLOB 2 allows you to scale the data volume per user. Unless defined otherwise, the setup script used to create the test data will create just a single table named CF1 with a unique index in each schema. The table structure is specially prepared for the benchmark so that a block contains a single row only. By default, 10,000 rows are created based on a seed table, but in completely random order so that no two users have the same dataset.

If you would like to scale the dataset per user, you need to modify the SLOB configuration script slob.conf and modify the SCALE variable. The default of 10,000 rows equates to approximately 80 MB per user on the lab test system:

SQL> select sum(bytes/power(1024,2)) mb,segment_type
  2    from dba_segments
  3   where owner = 'USER1'
  4   group by segment_type
  5  /
 
        MB SEGMENT_TYPE
---------- ------------------
     .1875 INDEX
        80 TABLE

The total space needed for the default 128 users therefore is approximately 10,240 MB. Larger data volumes require a larger value for the SCALE variable. If you are considering increasing the data set by adding more users, then don’t. Comments in the setup script indicate that the tested maximum number of users is 128, so instead of increasing their number, you should use the SCALE variable.
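Purely to illustrate the arithmetic: since every row occupies its own 8 KB block, raising SCALE to 1,000,000 rows would grow each user’s table to roughly 8,000 MB:

# slob.conf: one hundred times the default data volume per user
SCALE=1000000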

The data is loaded into the database by calling the setup.sh script. It takes two parameters: the tablespace to store the data on, and optionally the number of users:

[oracle@server1 SLOB]$ ./setup.sh
FATAL: ./setup.sh args
Usage : ./setup.sh: <tablespace name> <optional: number of users>
[oracle@server1 SLOB]$ ./setup.sh IOPS 128
 
NOTIFY: Load Parameters (slob.conf):
 
LOAD_PARALLEL_DEGREE == 12
SCALE == 10000
ADMIN_SQLNET_SERVICE == ""
CONNECT_STRING == "/ as sysdba"
NON_ADMIN_CONNECT_STRING ==
 
NOTIFY: Testing connectivity to the instance to validate slob.conf settings.
NOTIFY: ./setup.sh: Successful test connection: "sqlplus -L / as sysdba"
 
NOTIFY: Creating and loading seed table.
 
Table created.
 
PL/SQL procedure successfully completed.
 
NOTIFY: Seed table loading procedure has exited.
NOTIFY: Setting up user   1  2  3  4  5  6  7  8  9  10  11  12
[...]
NOTIFY: Setting up user   121  122  123  124  125  126  127  128
 
Table dropped.
 
NOTIFY: ./setup.sh: Loading procedure complete (158 seconds). Please check
./cr_tab_and_load.out for any errors
 
[oracle@server1 SLOB]$

Depending on the quality of your storage subsystem, this process can take a little while. You should heed the advice and check the logfile for errors.

Benchmarking Storage with SLOB

The SLOB kit allows you to run three different benchmarks, according to the documentation. Unlike the previously introduced benchmarks, which focused on physical I/O, SLOB also allows you to check how well your architecture can deal with logical I/O. The test cases are categorized in the documentation into focus areas:

Physical I/O on Oracle data files. For this test you need a small buffer cache, ideally less than 100 MB. The test is useful to see how well your storage subsystem is attached to the server and how much physical I/O it can provide.

Logical I/O test. This test case requires a large SGA and especially a large buffer cache. It tests how well your architecture can scale logical I/O. If you think of today’s multi-socket systems with their complex memory arrangements in the form of level 1, level 2, and last-level caches, plus locally and remotely attached DRAM, then this test suddenly starts to make a lot of sense.

The final test is about redo generation. Like the previous LIO test, it too requires a large SGA—large enough that the database writer processes do not have to flush dirty buffers to disk.

Controlling SLOB

SLOB 2 is controlled slightly differently from the initial release of the software, by means of a configuration file. The slob.conf file controls aspects of the benchmark such as the allowed level of concurrency as well as the ratio between reads and writes. In the initial SLOB release, you needed to call the main benchmark script, runit.sh, with two parameters: one indicated the number of readers, the other the number of writers. A session was either a reader or a writer; mixing the two within a session was not possible. SLOB 2 uses a configuration file to define the characteristics of a given benchmark execution. A sample file is shown here:

[oracle@server1 SLOB]$ cat slob.conf
 
UPDATE_PCT=0
RUN_TIME=60
WORK_LOOP=0
SCALE=10000
WORK_UNIT=256
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=12
SHARED_DATA_MODULUS=0
 
# Settings for SQL*Net connectivity:
#ADMIN_SQLNET_SERVICE=slob
#SQLNET_SERVICE_BASE=slob
#SQLNET_SERVICE_MAX=2
#SYSDBA_PASSWD="change_on_install"
 
export UPDATE_PCT RUN_TIME WORK_LOOP SCALE WORK_UNIT LOAD_PARALLEL_DEGREE REDO_STRESS SHARED_DATA_MODULUS
 
[oracle@server1 SLOB]$

As you can easily see, these are BASH-style variables sourced into the runit.sh script at runtime to provide the configuration. The most important variables for the purpose of evaluating storage are listed here:

UPDATE_PCT. This parameter replaces the explicit numbers of readers and writers used in the initial release. It specifies what percentage of statements should be DML. A setting of 0 or 100 is equivalent to a call to runit.sh with 0 writers or 0 readers, respectively. The readme states that values between 51 and 99 are non-deterministic.

RUN_TIME. Setting RUN_TIME to a value in seconds allows you to terminate a given benchmark run after a given interval. The documentation recommends setting WORK_LOOP to 0 when you use RUN_TIME.

WORK_LOOP. Instead of terminating an execution of runit.sh after a fixed amount of time, you can alternatively specify WORK_LOOP to measure how long it takes your system to complete a given number of iterations of the workload. When doing so, you should set RUN_TIME to a large number to allow all WORK_LOOP iterations to complete.
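Putting these variables together—the numbers are purely illustrative—a five-minute run with 20 percent DML would be configured like this, leaving everything else at its default:

UPDATE_PCT=20
RUN_TIME=300
WORK_LOOP=0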

For a basic set of tests, the other parameters can be left at their defaults. But SLOB 2 does not stop there. SLOB is made for the Oracle researcher, and there are plenty of ways to experiment with it.

Executing a Benchmark

The actual benchmark is initiated by the script runit.sh. It takes just one parameter: the number of sessions to be spawned. The script performs the following steps for locally executed tests in a single-instance Oracle database environment:

  1. Defines a default set of parameters in case those in slob.conf are incomplete before it sources the slob.conf file into the currently executing environment.
  2. Provides sanity checks to the environment.
  3. Performs a log switch in the database.
  4. Sets up operating system monitoring using vmstat, iostat, and mpstat.
  5. Creates the required number of user sessions. Values from slob.conf are passed to the script responsible for the benchmark, named slob.sql. The processes do not execute just yet—they are “held in check” by a semaphore.
  6. Creates an AWR snapshot and gets the current wall clock time.
  7. Starts the execution of all withheld processes.
  8. Waits for the execution of all sessions to finish.
  9. Calculates the run time, creates another snapshot, and generates AWR reports in text and HTML format.

Notice that the resulting AWR reports are not preserved; subsequent executions of SLOB will overwrite them. Users of SLOB, especially Yury Velikanov, have commented that for very long-running benchmarks you need to take care with automatic AWR snapshots. Depending on your AWR settings, it is perfectly possible for an automatic snapshot to occur in the middle of your benchmark run. The logic in the runit.sh script takes the last two snapshots and generates the AWR report, which causes problems if an automatic snapshot has been taken after the initial one. The automatic snapshot interval should therefore be set to a high value—but not 0, because an interval of 0 turns off snapshots altogether, including manual ones.

These concerns about losing the AWR information of previous benchmarks can easily be addressed by writing a small test harness. A sample wrapper script around runit.sh is shown here:

#!/bin/bash
#
# Martin Bach 2013
#
# runit_wrapper.sh is a wrapper to preserve the output of the supporting files from
# individual SLOB version 2 test runs
 
# list of files to preserve. Not all are necessarily generated during a test run.
FILES="awr.txt awr.*html  awr*.gz iostat.out vmstat.out mpstat.out db_stats.out "
FILES="$FILES sqlplus.out slob.conf"
 
# setting up ...
[[ ! -f $ORACLE_HOME/bin/sqlplus || -z $ORACLE_SID ]] && {
        echo ERR: Your Oracle environment is not set correctly it appears.
        echo ERR: source oraenv into your session and try again
        exit
}
 
# Two command line parameters are needed
(( $# != 2 )) && {
        echo ERR: wrong number of command line parameters
        echo usage: $0 testname num_sessions
        exit
}
 
TESTNAME=$1
NUM_SESSIONS=$2
 
# The script creates a directory to hold the test results. That will fail
# if the directory already exists
[[ -d $TESTNAME ]] && {
        echo ERR: A test with name $TESTNAME has been found already.
        echo ERR: Please double check your parameters
        exit
}
 
echo INFO: preparing to preserve output for test $TESTNAME
mkdir $TESTNAME || {
        echo ERR: cannot create a directory to store the output for $TESTNAME
        exit
}
 
# set the automatic gathering of snapshots to a very high value.
# Note that turning it off (interval => 0) means you cannot even
# take manual snapshots.
echo INFO: increasing AWR snapshot interval
 
$ORACLE_HOME/bin/sqlplus / as sysdba > /dev/null <<EOF
spool $TESTNAME/awr_config.txt
select systimestamp as now from dual;
select * from DBA_HIST_WR_CONTROL;
exec DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval=>360)
exit
EOF
 
# finally run the benchmark. Keep the output too
./runit.sh $NUM_SESSIONS 2>&1 | tee ${TESTNAME}/runit.sh.log
 
# preserve the generated files in $TESTNAME
for FILE in $FILES; do
        [[ -f $FILE ]] && {
        echo INFO: copying file $FILE to directory $TESTNAME
                cp $FILE $TESTNAME 2> /dev/null
        }
done
 
# create a copy of awr.txt for the use with awr_info.sh
cp ${TESTNAME}/awr.txt ${TESTNAME}/awr.txt.${NUM_SESSIONS}
 
# finally preserve all initialisation parameters
$ORACLE_HOME/bin/sqlplus / as sysdba 2> /dev/null <<EOF
create pfile='$(pwd)/${TESTNAME}/init.ora' from memory;
exit
EOF
 
echo INFO: done

Notes on SLOB Usage

If you intend to run the physical I/O, or PIO, test, you need a small SGA—which is easier said than done. With multi-socket/multi-core servers, Oracle will automatically derive default values for initialization parameters, making it difficult to get small buffer caches. Limiting cpu_count to a low value—2 or 4—is usually a good starting point, and you could also use manual SGA management, as we know it from Oracle 9i. Below is a sample initialization file with manual SGA management, based on the previously mentioned simple.ora initialization file, which you should review and adapt if needed:

db_create_file_dest = '/u01/oradata/'
db_name = SLOB
compatible=12.1.0.1.0
UNDO_MANAGEMENT=AUTO
db_block_size = 8192
db_files = 300
processes = 500
control_files = /u01/oradata/SLOB/controlfile/o1_mf_812m3gj5_.ctl
 
shared_pool_size = 600M
large_pool_size = 16M
java_pool_size = 0
streams_pool_size = 0
db_cache_size=48M
cpu_count = 2
 
filesystemio_options=setall
parallel_max_servers=0
recyclebin=off
pga_aggregate_target=1G
workarea_size_policy=auto
...

There are some further options you can set that have been omitted here for brevity. Starting the database with this parameter file allowed the creation of a really small SGA, as you can see from the query below:

SQL> select component,current_size/power(1024,2) mb
  2   from v$sga_dynamic_Components
  3  where current_size <> 0
  4  /
 
COMPONENT                                                                MB
---------------------------------------------------------------- ----------
shared pool                                                             600
large pool                                                               16
DEFAULT buffer cache                                                     48

According to the documentation, limiting the cpu_count parameter does not restrict the benchmark kit’s ability to drive I/O. Running SLOB with an increasing number of sessions, you might notice that throughput increases up to the point at which the CPUs are completely saturated and become the bottleneck. When reviewing the AWR reports, you should pay attention to the I/O latencies and the throughput. The most interesting sections are the “Load Profile,” “Top 10 Foreground Events by Total Wait Time,” and the “IO Profile”:

IO Profile                  Read+Write/Second     Read/Second    Write/Second
∼∼∼∼∼∼∼∼∼∼                  ----------------- --------------- ---------------
            Total Requests:          22,138.9        22,132.9             6.0
         Database Requests:          22,132.8        22,128.0             4.8
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:               0.9             0.0             0.9
                Total (MB):             173.0           173.0             0.1
             Database (MB):             172.9           172.9             0.0
      Optimized Total (MB):               0.0             0.0             0.0
                 Redo (MB):               0.0             0.0             0.0
         Database (blocks):          22,133.8        22,128.3             5.5
 Via Buffer Cache (blocks):          22,133.8        22,128.3             5.5
           Direct (blocks):               0.0             0.0             0.0

The above output shows the AWR report for an execution of runit.sh 16 and a 16MB buffer cache. This observation is confirmed by collectl, too, which has been running in parallel:

#<----CPU[HYPER]-----><---------------Disks---------------->
#cpu sys inter  ctxsw KBRead  Reads Size KBWrit Writes Size
   2   1 42236  47425 182168  22771    8      0      0    0
   2   1 43790  47359 182272  22776    8     16      1   16
   2   1 42955  46644 179040  22380    8     24      3    8

Which processing led to those numbers? So far the benchmark’s work has not been described in detail. The logic is defined in the slob.sql file. This new file combines the application logic that was previously split across two files, readers.sql and writers.sql, and it follows the same model.

For a read-only workload, in which UPDATE_PCT is set to 0 in slob.conf, you will exclusively see select statements similar to this one, taken from the AWR report:

SQL> select * from table(dbms_xplan.display_cursor('309mwpa4tk161'))
 
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
SQL_ID 309mwpa4tk161, child number 0
-------------------------------------
SELECT COUNT(C2) FROM CF1 WHERE CUSTID > ( :B1 - :B2 ) AND (CUSTID <
:B1 )
 
Plan hash value: 1497866750
 
------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |       |       |       |   259 (100)|          |
|   1 |  SORT AGGREGATE                        |       |     1 |   133 |            |          |
|*  2 |   FILTER                               |       |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED | CF1   |   256 | 34048 |     259 (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                   | I_CF1 |   256 |       |       2 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(:B1>:B1-:B2)
   4 - access("CUSTID">:B1-:B2 AND "CUSTID"<:B1)
 
23 rows selected.

As you can see, the reader part of the script performs an index range scan. It is advisable to tail the database’s alert log when creating the data set, as well as when running the benchmark, to ensure that you are not running out of space or other system resources, such as processes.

Summary

This chapter has taken you on a journey through possible hardware solutions for your consolidation project. When thinking about the consolidation platform, you should definitely consider the Linux operating system. It is dynamically developed, and many great minds spend a lot of time contributing to it. When this chapter was written, Oracle used Linux as its primary development platform, which gives adopters of the platform early access to bug fixes and enhancements.

As an additional benefit, it runs on industry-standard hardware, which has been deployed thousands of times. Also, Oracle has invested a lot of effort in the Linux platform, as well as in making it future-proof. It has been said in the introduction, but it should be repeated here: the ideas you just read about should invite you to reflect on them critically! Test, evaluate, and sample until you have come to the conclusion that the platform you have chosen for iteration 1 of the hosting platform is right for your organization.
