Optimization and tuning on IBM POWER7 and IBM POWER7+
This chapter describes the optimization and tuning of IBM POWER7 and POWER7+ systems. It covers the following topics:
1.1 Introduction
The focus of this publication is on gathering the correct technical information and laying out simple guidance for optimizing code performance on IBM POWER7 and POWER7+ systems that run the AIX or Linux operating systems. There is much straightforward performance optimization that can be performed with a minimum of effort and without extensive previous experience or in-depth knowledge. This optimization work can:
Substantially improve the performance of the application that is being optimized for the POWER7 system
Typically carry over improvements to systems that are based on related processor chips
Improve performance on other platforms
The new POWER7+ processor offers higher clock rates and larger caches than the original POWER7 processor, along with some new features, but is otherwise similar to the POWER7 processor. Most of the technical information, and all of the guidance for optimizing performance on POWER7, also applies to POWER7+. Any differences in POWER7+ and any new features in that chip are specifically noted; you can otherwise presume that any information presented for POWER7 also applies to POWER7+.
This guide strives to focus on optimizations that tend to be positive across a broad set of IBM POWER processor chips and systems. While specific guidance is given for the POWER7 and POWER7+ processors, the general guidance is applicable to the IBM POWER6®, POWER5, and even to earlier processors.
This guide is directed at personnel who are responsible for performing migration and implementation activities on IBM POWER7-based servers, including system administrators, system architects, network administrators, information architects, and database administrators (DBAs).
1.2 Outline of this guide
The first section of this guide lays out simple strategies for optimizing performance (see 1.5, “Optimizing performance on POWER7” on page 5). We describe a set of straightforward steps to set up the environment for performance tuning and optimization, followed by an explanation of how to perform these simple investigative steps. These steps are the most valuable areas on which to focus.
Next, this guide describes deployment, that is, system setup and configuration choices that ensure good performance on POWER7. Together, these simple optimization strategies and deployment guidance satisfy the requirements for most environments and can deliver substantial improvements.
Finally, this guide describes some of the more advanced investigative techniques that can be used to identify performance bottlenecks in an application. It is here that optimization efforts move into the code internals of an application, and improvements are typically made by modifying source code. Of necessity, coverage in this last area is fairly rudimentary, focusing on general areas of investigation and the tooling that you should use.
Most of the remaining material in this guide is technical information that was developed by domain experts at IBM:
We provide hardware information about the POWER7 and POWER7+ processors (see Chapter 2, “The POWER7 processor” on page 21), highlighting the important features from a performance perspective and laying out the basic information that is drawn upon by the material that follows.
Next, we describe the system software stack, examining the IBM POWER Hypervisor™ (see Chapter 3, “The POWER Hypervisor” on page 55), the AIX and Linux operating systems and system libraries (see Chapter 4, “AIX” on page 67 and Chapter 5, “Linux” on page 97), and the compilers (see Chapter 6, “Compilers and optimization tools for C, C++, and Fortran” on page 107). Java (see Chapter 7, “Java” on page 125) also receives extensive coverage.
In Chapter 4, “AIX” on page 67, we examine a set of operating system-specific optimization opportunities. Then, we highlight some of the areas in which AIX uses some new features of the POWER7+ processor. Next, we provide detailed coverage about the new Dynamic System Optimizer (DSO) feature, building on previous work on the Active System Optimizer (ASO). ASO or DSO can be enabled to autonomously use some of the hardware- or operating system-specific optimizations that are described in this guide. The chapter concludes with a short description of AIX preferred practices regarding system setup and maintenance.
In Chapter 5, “Linux” on page 97, we describe the two primary Linux operating systems that are used on POWER7: Red Hat Enterprise Linux (RHEL) for POWER and SUSE Linux Enterprise Server (SLES) for POWER. The minimum supported levels of Linux, including service pack (SP) levels, are SLES11/SP1 and RHEL6/GA, which provide full support and usage of POWER7 technologies and systems. Linux is based on community efforts that are focused not only on the Linux kernel, but also all of the complementary packages, tools, toolchains, and GNU Compiler Collection (GCC) compilers that are needed to effectively use POWER7. IBM provides the expertise for Power Systems by developing, optimizing, and pushing open source changes to the Linux communities.
Finally, we cover important information about IBM middleware, DB2 (see Chapter 8, “DB2” on page 137) and IBM WebSphere Application Server (see Chapter 9, “WebSphere Application Server” on page 147). Various applications use middleware, and it is critical that the middleware is correctly tuned and performs well. The middleware chapters cover how these products are optimized for POWER7, including select preferred practices for tuning and deploying these products.
Three appendixes are included:
Appendix A, “Analyzing malloc usage under AIX” on page 151 explains some simple techniques for analyzing how an application is using the system memory allocation routines (malloc and related functions in the C library). malloc is often a bottleneck for application performance, especially under AIX. AIX has an extensive set of optimized malloc implementations, and it is easy to switch between them without rebuilding or changing an application. Knowing how an application uses malloc is key to choosing the best memory allocation alternatives that AIX offers. Even Java applications often make extensive use of malloc, either in Java Native Interface (JNI) code that is part of the application itself or in the Java class libraries, or in binary code that is part of the software development kit (SDK).
Appendix B, “Performance tooling and empirical performance analysis” on page 155 describes some of the important performance tools that are available on the IBM Power Architecture under AIX or Linux, and strategies for using them in empirical performance analysis efforts.
These performance tools are most often used as part of the advanced investigative techniques that are described in 1.5, “Optimizing performance on POWER7” on page 5, except for the new performance advisors, which are intended as investigative tools, appropriate for a broader audience of users.
Appendix C, “POWER7 optimization and tuning with third-party applications” on page 185 describes preferred practices for migrating third-party products (Oracle, Sybase ASE and Sybase IQ, SAS, and SAP) to Power Systems. We describe implementation tasks and how to meet environment requirements. Links to comprehensive documentation about current migrations are included.
After you review the advice in this guide, and if you would like more information, begin by visiting the IBM Power Systems website at:
1.3 Conventions that are used in this guide
In this guide, our convention for indicating sections of code or command examples is shown in Table 1-1.
Table 1-1 Conventions that are used in this guide
Type of example                                  Format that is used in this guide  Example of our convention
Commands and command options within text         Monofont, bolded                   ldedit
Command lines or code examples outside of text   Monofont                           ldedit -btextpsize=64k -bdatapsize=64k -bstackpsize=64k
Variables in command lines                       Monofont, italicized               ldedit -btextpsize=64k -bdatapsize=64k -bstackpsize=64k <executable>
Variables that are limited to specific choices   Monofont, italicized               -mcmodel={medium|large}
1.4 Background
Some trends in processor design are making it more important than ever to consider analyzing and working to improve application performance. In the past, two of the ways in which newer processor chips delivered higher performance were by:
Increasing the clock rate
Making microarchitectural improvements that increase the performance of a single thread
Often, upgrading to a new processor chip gave existing applications a 50% or possibly 100% performance improvement, leaving little incentive to spend much effort to get an uncertain amount of additional performance. However, the approach in the industry has shifted, so that newer processor chips do not deliver substantially increased clock rates, as compared to the previous generation. In some cases, clock rates declined in newer designs. Recent designs also generally offer more modest improvements in the performance of a single execution thread.
Instead, the focus has shifted to delivering multiple cores per processor chip, and to delivering more hardware threads in each core (known as simultaneous multi-threading (SMT), in IBM Power Architecture terminology). This situation means that some of the best opportunities for improving the performance of an application are in delivering scalable code by having an application make effective use of multiple concurrent threads of execution.
Coupled with the trend toward aggressive multi-core and multi-threaded designs, there are changes in the amount of cache and memory bandwidth available to each hardware thread. Cache sizes and chip-level bandwidth are increasing at a slower rate than the growth of hardware threads, meaning that the amount of cache per thread is not growing as rapidly. In some cases, it decreases from one generation to the next. Again, this situation shows where deeper analysis and performance optimization efforts can provide some benefits.
1.5 Optimizing performance on POWER7
This section provides guidance for optimizing code performance on POWER7 when you use the AIX or Linux operating systems. The POWER7+ processor is a superset of the POWER7 processor, so all optimizations described for POWER7 apply equally for POWER7+. We cover the more prominent performance opportunities that are noted in past optimization efforts. The guidance is organized into three broad categories:
Lightweight tuning covers simple prescriptive steps for tuning application performance on POWER7. These simple steps can be carried out without detailed knowledge of the internals of the application that is being optimized and usually without modifying the application source code. Simple system utilization and performance tools are used for understanding and improving your application performance. The steps and tools are general guidelines that apply to all types of applications. Although they are simple and straightforward, they often lead to significant performance improvements. It is possible to accomplish these steps in as little as two days or so for a small application, or, at most, two weeks for a large and complex application.
 
Performance improvement: Consider lightweight tuning to be the starting point for any performance improvement effort.
Deployment guidelines cover tuning considerations that are related to the:
 – Configuration of a POWER7 system to deliver the best performance
 – Associated runtime configuration of the application itself
There are many choices in a deployment, some of which are unrelated to the performance of a particular application, and so, at best, this section can present some guidelines and preferred practices. Understanding logical partitions (LPARs), energy management, I/O configurations, and using multi-threaded cores are examples of typical system considerations that can impact application performance.
 
Performance improvement: Consider deployment guidelines to be the second required activity for any reasonably extensive performance effort.
Deep performance analysis covers performance tools and general strategies for identifying and fixing application bottlenecks. This type of analysis requires more familiarity with performance tooling and analysis techniques, sometimes requiring a deeper understanding of the application internals, and often requiring a more dedicated and lengthy effort. Often, a simpler analysis is all that is required to identify serious bottlenecks in an application, but a more detailed investigation is required to perform an exhaustive search for all of the opportunities for increasing performance.
 
Performance improvement: Consider this the last activity that is undertaken, with simpler analysis steps, for a moderately serious performance effort. The more complex iterative analysis is reserved for only the most performance critical applications.
This section provides only minimal background on the guidance provided. Detailed material about these topics is incorporated in the chapters that follow and in the appendixes. The following chapters and appendixes also cover many other performance topics that are not addressed here.
 
Guidance for POWER7 and POWER7+: The guidance that is provided in this book specifically applies to POWER7 and POWER7+ processor chips and systems. The POWER7+ processor is a superset of the POWER7 processor, so all optimizations described for POWER7 apply equally to POWER7+. The guidance that is provided also generally applies to previous generations of POWER processor chips and systems, including POWER5 and POWER6. When our guidance is not applicable to all generations of Power Systems, we note that for you.
1.5.1 Lightweight tuning and optimization guidelines
This section covers building and performance testing applications on POWER7, and gives a brief introduction to the most important simple performance tuning opportunities that are identified for POWER7. More details about these and other opportunities are presented in the later chapters of this guide.
Performance test beds and workloads
In performance work, when you are tuning and optimizing an application for a particular processor, you must run and measure performance levels on that processor. Although there are some characteristics that are shared among processor chips in the same family, each generation of processor chip has unique performance characteristics. Optimizing code for POWER7 requires that you set up a test bed on a POWER7 system.
The POWER7+ processor has higher clock speeds and larger caches than POWER7, and applications should see higher performance on that new chip. Aside from those differences, the performance characteristics of the two chips are the same. A performance effort specifically targeting POWER7+ should use a POWER7+ system, but otherwise a POWER7 system can be used.
You want to see good performance across a range of newer systems, with a special emphasis on optimizing for the latest design. For Power Systems, the previous POWER6 generation is still commonly used. For this reason, it is best to have multiple test bed environments: a POWER7 system for most optimization work and a POWER6 system for limited testing to ensure that all tuning is beneficial on the previous generation of hardware.
POWER7 and POWER6 processors are dissimilar in some respects, and some simple steps can be taken to ensure good performance of a single binary running on either system. In particular, see the information in “C, C++, and Fortran compiler options” on page 8.
Performance test beds must be sized and configured for performance and scalability testing. Choose your scalability goals based on the requirements that are placed on an application; the test bed must accommodate at least the minimum requirements. For example, when you target a multi-threaded application to scale up to four cores on POWER7, it is important that the test bed be at least a 4-core system and that tests are configured to run in various configurations (1-core, 2-core, and 4-core). You want to be able to measure performance across the different configurations so that scalability can be computed. Ideally, a 4-core system delivers four times the performance of a 1-core system, but in practice, the scalability is generally less than ideal. Scalability bottlenecks might not be clearly visible if the only testing done for this example were in a 4-core configuration.
With the multi-threaded POWER7 cores (see 2.2, “Multi-core and multi-thread scalability” on page 23), each processor core can be instantiated with one, two, or four logical CPUs within the operating system, so a 4-core server, with SMT4 mode (four hardware threads per core), means that the operating system is running 16 logical CPUs. Also, larger-core servers are becoming more pervasive, with scaling considerations well beyond 4-core servers.
The performance test bed must be a dedicated logical partition (LPAR). You must ensure that there is no other activity on the system (including on other LPARs, if any, configured on the system) when performance tests are run. Performance testing initially should be done in a non-virtualized environment to minimize the factors that affect performance. Ensure that the LPAR is running an up-to-date version of the operating system, at the level that is expected for the typical usage of the application. Keep the test bed in place after any performance effort so that performance can occasionally be monitored, which ensures that later maintenance of an application does not introduce a performance regression.
Choosing the appropriate workloads for performance work is also important. Ideally, a workload should:
Be representative of the expected actual usage of the application.
Have simple measures of performance that are easily collected and compared, such as run time or transactions/second.
Be easy to set up and run in an automated environment, with a fairly short run time for a fast turnaround in performance experiments.
Have a low run-to-run variability across duplicated runs, such that extensive tests are not required to obtain a statistically significant measure of performance.
Produce a result that is easily tested for correctness.
When an application is being optimized for both the AIX and Linux operating systems, much of the performance work can be undertaken on just one of the operating systems. However, some performance characteristics are operating system-dependent, so some analysis must be performed on both operating systems. In particular, perform profiling and lock analysis separately for both operating systems to account for differences in system libraries and kernels. Each operating system also has unique scalability considerations. More operating system-specific optimizations are detailed in Chapter 4, “AIX” on page 67 and Chapter 5, “Linux” on page 97.
Build environment and build tools
The build environment, if separate from the performance test bed, must be running an up-to-date operating system. Only recent operating system levels include Application Binary Interface (ABI) extensions to use or control newer hardware features.
Critically, all compilers that are used to build an application should be up-to-date versions that offer full support for the target processor chip. Older levels of a compiler might tolerate newer processor chips, but they do not capitalize on the unique features of the latest processor chips. For the IBM XL compilers on AIX or Linux, XLC11 and XLF13 are the first compiler versions that have processor-specific tuning for POWER7. The new XLC12 compiler is generally recommended. For the GCC compiler on Linux, the IBM Advance Toolchain 4.0 and 5.0 versions contain an updated GCC compiler that is recommended for POWER7. The IBM XL Fortran compiler is generally recommended over gfortran where optimized floating point performance is most important. Compiler levels do not change for POWER7+.
For the GCC compiler on Linux, the GCC compilers that come with RHEL and SLES recognize and take advantage of the Power Architecture and optimizations. For improved optimizations and newer GCC technology, the IBM Advance Toolchain package provides an updated GCC compiler and optimized toolchain libraries that you should use with POWER7.
The Advance Toolchain is a key performance technology available for Power Systems running Linux. It includes newer, Power optimized versions of compilers (GCC, G++, and gfortran), utilities, and libraries, along with various performance tools. The full Advance Toolchain must be installed in the build environment, and the Advance Toolchain runtime package must be installed in the performance test bed. The Toolchain is designed to coexist with the GCC compilers and toolchain that are provided in the standard Linux distributions. More information is available in “GCC, toolchain, and IBM Advance Toolchain” on page 98.
Along with the compilers for C/C++ and Fortran, there is the separate IBM Feedback Directed Program Restructuring (FDPR®) tool to optimize performance. FDPR takes a post-link executable image (such as one produced by static compilers) and applies additional optimizations. FDPR is another tool that can be considered for optimizing applications that are based on an executable image. More details can be found in 6.3, “IBM Feedback Directed Program Restructuring” on page 114.
Java also contains a dynamic Just-In-Time (JIT) compiler, and only newer versions are tuned for POWER7 or POWER7+. However, Java compilations to binary code take place at application execution time, so a newer Java release must be installed on the performance test bed system. For IBM Java, tuning for POWER7 was introduced in Java 6 SR7, and that is the recommended minimum version. Newer versions of Java contain more improvements, though, as described in 7.1, “Java levels” on page 126. Java 7 is generally the preferred version to use for POWER7 or POWER7+ because of improvements over previous versions.
C, C++, and Fortran compiler options
For the static compilers, the important compilation options to consider are as follows:
Basic optimization options: With the IBM XL compilers, the minimum suggested optimization level to use is -O, while for GCC it is -O3. Higher levels of optimization are better for some types of code, and you should experiment with them. The XL compilers also have optimization options, such as -qhot, that can be evaluated. More options are detailed in Chapter 6, “Compilers and optimization tools for C, C++, and Fortran” on page 107. The more aggressive optimization options might not work for all programs and might need to be coupled with the strict options described in this list (see page 9).
Target processor chip options: It is possible to build a single executable that runs on various POWER processors. However, that executable does not take advantage of some of the features added to later processor chips, such as new instructions. If only a restricted range of newer processor chips must be supported, consider using the compilation options that enable the usage of newer features. With the XL compilers, for example, if the executable must run only on POWER7 processors, the -qarch=pwr7 option can be specified. The equivalent GCC option is -mcpu=power7.
Target processor chip tuning options: The XL compiler -qtune option specifies that the code produced must be tuned to run optimally on particular processor chips. The executable that is produced still runs on other processor chips, but might not be tuned for them. Some of the possible options to consider are:
 – -qarch=ppc64 -qtune=pwr7 for an executable that is optimized to run on POWER7, but that can run on all 64-bit implementations of the Power Architecture (POWER7, POWER6, POWER5, and so on)
 – -qtune=balanced for an executable that must run well across both POWER6 and IBM POWER7 Systems™
 – -mtune=power7 to tune for POWER7 on GCC
Strict options: Sometimes the compilers can produce faster code by subtly altering the semantics of the original source code. An example of this scenario is expression reorganization. Especially for floating point code, the effect of expression reorganization can produce different results. For some applications, these optimizations must be prevented to achieve valid results. For the XL compilers, certain semantic-altering transformations are allowed by default at higher optimization levels, such as -O3, but those transformations can be disabled by using the -qstrict option (for example, -O3 -qstrict). For GCC, the default is strict mode, but you can use -ffast-math to enable optimizations that are not concerned with Not a Number (NaN), signed zeros, infinities, floating point expression reorganization, or setting the errno variable. The new -Ofast GCC option includes -O3 and -ffast-math, and might include other options in the future.
Source code compatibility options: The XL compilers assume that the C and C++ source code conforms to language rules for aliasing. On occasion, older source code fails when compiled with optimization, because the code violates the language rules. A workaround for this situation is to use the -qalias=noansi option. The GCC workaround is the -fno-strict-aliasing option.
Profile Directed Feedback (PDF): PDF is an advanced optimization feature of the compilers that you should consider for performance critical applications.
Interprocedural Analysis (IPA): IPA is an advanced optimization feature of the compilers that you should consider for performance critical applications.
 
Compiler options: Compiler options do not change for POWER7+.
A simple way to experiment with the C, C++, and Fortran compilation options is to repeatedly build an application with different option combinations, and then to run it and measure performance to see the effect. If higher optimization levels produce invalid results, try adding one or both of the -qstrict and -qalias options with the XL compilers, or -fno-strict-aliasing with GCC.
Not all source files must be compiled with the same set of options, but all files should be compiled with at least the minimum optimization level. There are cases where optimization was omitted on just one or two important source files, and that omission caused an application to suffer from reduced performance.
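For example, a test matrix of builds might look like the following commands, where app.c is a hypothetical source file and each line represents one experiment using the options described above:
xlc -O3 -qarch=pwr7 -qtune=pwr7 -c app.c
xlc -O3 -qhot -qarch=pwr7 -qtune=pwr7 -c app.c
xlc -O3 -qstrict -qarch=ppc64 -qtune=balanced -c app.c
gcc -O3 -mcpu=power7 -mtune=power7 -c app.c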
Java options
Many Java applications are performance sensitive to the configuration of the Java heap and garbage collection (GC). Experimentation with different heap sizes and GC policies is an important first optimization step. For generational GC, consider using the options that specify the split between nursery space (also known as the new or young space) and tenured space (also known as the old space). Most Java applications have modest requirements for long-lived objects in the tenured space, but frequently allocate new objects with a short life span in the nursery space.
If 64-bit Java is used, use the -Xcompressedrefs option.
By default, recent releases of Java use 64 KB medium pages for the Java heap, which is the equivalent of explicitly specifying the -Xlp64k option. If older releases of Java are used, we strongly encourage using the -Xlp64k option. Otherwise, those releases default to using 4 KB pages. Often, there is some additional performance improvement that is seen in using larger 16 MB large pages by using the -Xlp16m option. However, using 16 MB pages normally requires explicit configuration by the administrator of the AIX or Linux operating system to reserve a portion of the memory to be used exclusively for large pages. (For more information, see 7.3.2, “Configuring large pages for Java heap and code cache” on page 128.) As such, the medium pages are a better choice for general use, and the large pages can be considered for performance critical applications. The new Dynamic System Optimizer (DSO) facility in AIX (see 4.2, “AIX Active System Optimizer and Dynamic System Optimizer” on page 84) autonomously takes advantage of 16 MB pages without administrator configuration. This facility might be appropriate for cases where a very large Java heap is being used.
On Power Systems, the -Xcodecache option often delivers a small improvement in performance, especially in a large Java application. This option specifies the size of each code cache that is allocated by the JIT compiler for the binary code that is generated for Java methods. Ideally, all of the compiled Java method binary code fits into a single code cache, eliminating the small penalty that occurs when one Java method calls another method when the binary code for the two methods is in different code caches. To use this option, determine how much code space is being used, and then set the size of the option correctly. The maximum size of each code cache that is allocated is 32 MB, so the largest value that can be used for this option is -Xcodecache32m. For more information, see 7.3.5, “JIT code cache” on page 129.
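As an illustrative starting point only (the heap sizes, code cache size, and class name are hypothetical and must be tuned through experimentation), a 64-bit IBM Java invocation might combine these options as follows, where -Xgcpolicy:gencon selects the generational GC policy and -Xmn sets the size of the nursery space:
java -Xms2g -Xmx2g -Xmn512m -Xgcpolicy:gencon -Xcompressedrefs -Xlp64k -Xcodecache8m MyApp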
The JIT compiler automatically uses an appropriate optimization level when it compiles Java methods, and recent Java releases automatically use all of the new features of the target POWER7 or POWER7+ processor of the system on which an application is running.
For more information about Java performance, see Chapter 7, “Java” on page 125.
Optimized libraries
Optimized libraries are important for application performance. This section covers some considerations that are related to standard libraries for AIX or Linux, libraries for Java, or specialized mathematical subroutine libraries that are available for the Power Architecture.
AIX malloc
The AIX operating system offers various memory allocation packages (the standard malloc() and related routines in the C library). The default package offers good space efficiency and performance for single-threaded applications, but it is not a good choice for the scalability of multi-threaded applications. Choosing the correct malloc package on AIX is important for performance. Even Java applications can make extensive use of malloc through JNI code or internally in the Java Runtime Environment (JRE).
Fortunately, AIX offers a number of different memory allocation packages that are appropriate for different scenarios. These different packages are chosen by setting environment variables and do not require any code modification or rebuilding of an application.
Choosing the best malloc package requires some understanding of how an application uses the memory allocation routines. Appendix A, “Analyzing malloc usage under AIX” on page 151 shows how to easily collect the required information. Following the data collection, experiment with various alternatives, alone or in combination. Some alternatives that deliver high performance include:
Pool malloc: The pool front end to the malloc subsystem optimizes the allocation of memory blocks of 512 bytes or less. It is common for applications to allocate many small blocks, and pools are particularly space- and time-efficient for that allocation pattern. Thread-specific pools are used for multi-threaded applications. The pool malloc is a good choice for both single-threaded and multi-threaded applications.
Multiheap malloc: The multiheap malloc package uses up to 32 separate heaps, reducing contention when multiple threads attempt to allocate memory. It is a good choice for multi-threaded applications.
Using the pool front end and multiheap malloc in combination is a good alternative for multi-threaded applications. Small memory block allocations, typically the most common, are handled with high efficiency by the pool front end. Larger allocations are handled with good scalability by the multiheap malloc. A simple example of specifying the pool and multiheap combination is by using the environment variable setting:
MALLOCOPTIONS=pool,multiheap
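For example, the setting can be applied to a single run by prefixing the command that starts the application (the application name is hypothetical):
MALLOCOPTIONS=pool,multiheap ./myapp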
For more information about malloc alternatives, see Chapter 4, “AIX” on page 67 and “Malloc” on page 68.
Linux Advance Toolchain libraries
The Linux Advance Toolchain contains replacements for various standard system libraries. These replacement libraries are optimized for specific processor chips, including POWER5, POWER6, and POWER7. After you install the Linux Advance Toolchain, the dynamic linker automatically has programs use the library that is optimized for the processor chip type in the system.
The libraries in Linux Advance Toolchain Version 5.0 and later are optimized to use the multi-core facilities in POWER7.
Mathematical Acceleration Subsystem Library and Engineering and Scientific Subroutine Library
The Mathematical Acceleration Subsystem (MASS) library contains both optimized and vectorized versions of some basic mathematical functions and runs on AIX and Linux. The MASS library is included with the XL compilers and is automatically used by the compilers when the -O3 -qhot=level=1 compilation options are used. The MASS routines can be used automatically with the Advance Toolchain GNU Compiler Collection (GCC) by using the -mveclibabi=mass option, but the library is not included with the compiler and must be separately installed. Explore the use of MASS for applications that use basic mathematical functions. Good results occur when you use the vector versions of the functions. The MASS routines do not necessarily provide the same precision of results as do standard libraries.
The Engineering and Scientific Subroutine Library (ESSL) contains an extensive set of advanced mathematical functions and runs on AIX and Linux. Avoid having applications write their own versions of functions, such as the Basic Linear Algebra Subprograms (BLAS). Instead, use the Power optimized versions in ESSL.
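As a sketch of how these libraries are engaged (file names are hypothetical; verify the library names and prerequisites against your MASS and ESSL installations), consider the following compile and link commands:
xlf -O3 -qhot=level=1 compute.f             # XL automatically substitutes MASS routines
gcc -O3 -ffast-math -mveclibabi=mass -c compute.c   # requires the separately installed MASS library
xlc -O3 solver.c -lessl                     # link against the ESSL serial library for BLAS routines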
java/util/concurrent
For Java, all of the standard class libraries are included with the JRE. One package of interest for scalability optimization is java/util/concurrent. Some classes in java/util/concurrent are more scalable replacements for older classes, such as java/util/concurrent/ConcurrentHashMap, which can be used as a replacement for java/util/Hashtable. ConcurrentHashMap might be slightly less efficient than Hashtable when run in smaller system configurations where scalability is not an issue, so there can be trade-offs. Also, switching packages requires a source code change, albeit a simple one.
Tuning to capitalize on hardware performance features
For almost all applications, using 64 KB pages is beneficial for performance. Newer Linux releases (RHEL5, SLES11, and RHEL6) default to 64 KB pages, but AIX defaults to 4 KB pages. On AIX, 64 KB pages are enabled for applications through one or a combination of the following methods:
1. Using an environment variable setting:
LDR_CNTRL=TEXTPSIZE=64K@DATAPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K
2. Modifying the executable file with:
ldedit -btextpsize=64k -bdatapsize=64k -bstackpsize=64k <executable>
3. Using linker options at build time:
cc -btextpsize:64k -bdatapsize:64k -bstackpsize:64k ...
ld -btextpsize:64k -bdatapsize:64k -bstackpsize:64k ...
All of these mechanisms for enabling 64 KB pages can be safely used when they run on older hardware or operating system levels that do not support 64 KB pages. When the needed support is not in place, the system simply defaults to using 4 KB pages.
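On AIX, you can verify the supported and in-use page sizes with simple queries; for example (the process ID shown is hypothetical):
pagesize -a       # list the page sizes that this system supports
svmon -P 1234567  # show the page sizes in use by a particular process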
As mentioned in “Java options” on page 10, the recent Java releases default to using 64 KB pages. For Java, it is important that the Java heap space uses 64 KB pages, which are enabled by the -Xlp64k option in older releases (a minimum Linux level of RHEL5, SLES11, or RHEL6 is required).
Larger 16 MB pages are also supported on the Power Architecture and might provide an additional performance boost when compared to 64 KB pages. However, the usage of 16 MB pages requires explicit configuration by the administrator of the AIX or Linux operating system. The new DSO facility in AIX (see 4.2, “AIX Active System Optimizer and Dynamic System Optimizer” on page 84) autonomously uses 16 MB pages without any administrator configuration, and might be appropriate for cases in which a very large memory space is being used by an application.
For certain types of non-numerical applications, turning off the default POWER7 hardware prefetching improves performance. In specific cases, disabling hardware prefetching is beneficial for Java programs, WebSphere Application Server, and DB2. One way to control hardware prefetching is at the partition level, where prefetching is turned off by running the following commands:
AIX: dscrctl -n -s 1
Linux: ppc64_cpu --dscr=1
Controlling prefetching in this way might not be appropriate if different applications are running in a partition, because some applications might run best with prefetching enabled. There are also mechanisms to control prefetching at the process level.
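Before disabling prefetching, record the current setting so that it can be restored; for example (consult the command documentation for your operating system level):
AIX: dscrctl -q (query the current setting) and dscrctl -n -s 0 (restore the default)
Linux: ppc64_cpu --dscr (query the current setting) and ppc64_cpu --dscr=0 (restore the default)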
POWER7 allows not only prefetching to be enabled or disabled, but it also allows the fine-tuning of the prefetch engine. Such fine-tuning is especially beneficial for scientific/engineering and memory-intensive applications.1 Hardware prefetch tuning is autonomously used when DSO in AIX is enabled.
For more information about hardware prefetching, the DSO facility, and hardware and operating system tuning and usage for optimum performance, see Chapter 2, “The POWER7 processor” on page 21, Chapter 4, “AIX” on page 67, and Chapter 5, “Linux” on page 97.
Scalability considerations
Aside from the scalability considerations already mentioned (such as AIX malloc tuning and java/util/concurrent usage), there is one Linux operating system setting that enhances scalability in some cases: setting sched_compat_yield to 1. This task is accomplished by running the following command:
sysctl -w kernel.sched_compat_yield=1
This setting makes the Completely Fair Scheduler more compatible with earlier versions of Linux. Use this setting for Java environments, such as for WebSphere Application Server. For more information about multiprocessing with the Completely Fair Scheduler, go to:
1.5.2 Deployment guidelines
This section discusses deployment guidelines as they relate to virtualized and non-virtualized environments, and the effect of partition size and affinity on deployments.
Virtualized versus non-virtualized environments
Virtualization is a powerful technique that is applicable to situations where many applications are consolidated onto a single physical server. This consolidation leads to better usage of hardware and simplified system administration. Virtualization is efficient on the Power Architecture, but it does come with some costs. For example, the Virtual I/O Server (VIOS) partition that is allocated for a virtualized deployment consumes a portion of the hardware resources to support the virtualization. For situations where few business-critical applications must be supported on a server, it might be more appropriate to deploy with non-virtualized resources. This situation is particularly true in cases where the applications have considerable network requirements.
Virtualized environments offer many choices for deployment, such as dedicated or non-dedicated processor cores and memory, IBM Micro-Partitioning® that uses fractions of a physical processor core, and memory compression. These alternatives are explored in Chapter 3, “The POWER Hypervisor” on page 55. When you set up a virtualized deployment, it is important that system administrators have a complete understanding of the trade-offs inherent in the different choices and the performance implications of those choices. Some deployment choices, such as enabling memory compression features, can disable other performance features, such as support for 64 KB memory pages.
The POWER7 processor and affinity performance effects
The IBM POWER7 is the latest processor chip in the IBM Power Systems family. The POWER7 processor chip is available in configurations with four, six, or eight cores per chip, as compared to the POWER5 and POWER6, each of which have two cores per chip. Along with the increased number of cores, the POWER7 processor chip implements SMT4 mode, supporting four hardware threads per core, as compared to the POWER5 and POWER6, which support only two hardware threads per core. Each POWER7 processor core supports running in single-threaded mode with one hardware thread, an SMT2 mode with two hardware threads, or an SMT4 mode with four hardware threads.
Each SMT hardware thread is represented as a logical processor in AIX or Linux. When the operating system runs in SMT4 mode, it has four logical processors for each dedicated POWER7 processor core that is assigned to the partition. To gain full benefit from the throughput improvement of SMT, applications must use all of the SMT threads of the processor cores.
Each POWER7 chip has memory controllers that allow direct access to a portion of the memory DIMMs in the system. Any processor core on any chip in the system can access the memory of the entire system, but it takes longer for an application thread to access the memory that is attached to a remote chip than to access data in the local memory DIMMs.
For more information about the POWER7 hardware, see Chapter 2, “The POWER7 processor” on page 21. This short description provides some background to help understand two important performance issues that are known as affinity effects.
Cache affinity
The hardware threads for each core of a POWER7 processor share a core-specific cache space. For multi-threaded applications where different threads are accessing the same data, it can be advantageous to arrange for those threads to run on the same core. By doing so, the shared data remains resident in the core-specific cache space, as opposed to moving between different private cache spaces in the system. This enhanced cache affinity can provide more efficient utilization of the cache space in the system and reduce the latency of data references.
Similarly, the multiple cores on a POWER7 processor share a chip-specific cache space. Again, arranging the software threads that are sharing the data to run on the same POWER7 processor (when the partition spans multiple sockets) often allows more efficient utilization of cache space and reduced data reference latencies.
Memory affinity
By default, the POWER Hypervisor attempts to satisfy the memory requirements of a partition using the local memory DIMMs for the processor cores that are allocated to the partition. For larger partitions, however, the partition might contain a mixture of local and remote memory. Ideally, an application that is running on a particular core or chip uses only local memory. This enhanced memory affinity reduces the latency of memory accesses.
Partition sizes and affinity
In terms of partition sizes and affinity, this section describes Power dedicated LPARs, shared resource environments, and memory requirements.
Power dedicated LPARs
Dedicated LPAR deployments generally use larger partitions, ranging from just one POWER7 core up to a partition that includes all of the cores and memory in a large symmetric multi-processor (SMP) system. A smaller partition might run a single application, and a larger partition typically runs multiple applications, or multiple instances of a single application. A common example of multiple instances of a single application is in deployments of WebSphere Application Server.
With larger partitions, one of the most important performance considerations is often which cores and memory are allocated to a partition. For partitions of eight cores or less, the POWER Hypervisor attempts to allocate all cores for the partition from a single POWER7 chip and attempts to allocate only memory local to the chip that is used. Those partitions generally and automatically have good cache and memory affinity. However, it might not be possible to obtain resources for each of the LPARs from a single chip.
For example, assume that you have a 32-core system with four chips, each with eight cores. If five partitions are configured, each with six cores, the fifth LPAR would spread across three chips. Start the most important partition first to obtain resources from a single chip. (The order of starting partitions is one consideration in obtaining the best performance for high priority workloads.)
Another example is when the partition sizes are mixed. Here, starting the smaller partitions first consumes resources that are spread across many chips, so the larger partitions started later are also spread across multiple chips, when they might have been contained on a single chip had they been started first. It is a preferred practice to start higher priority partitions first, so that there is a better opportunity for them to obtain good affinity characteristics in their core and memory allocations. The affinity of the cores and memory that is allocated to a partition can be determined by running the AIX lssrad -va command. For more information about partition resource allocation and the lssrad command, see Chapter 3, “The POWER Hypervisor” on page 55.
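The output of lssrad -va resembles the following example (the values shown are illustrative only and vary by system), where each Scheduler Resource Allocation Domain (SRAD) lists the memory and the logical CPUs that share affinity:
REF1   SRAD        MEM      CPU
0
          0   15662.56     0-15
1
          1   15857.19    16-31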
For partitions larger than eight cores, the partition always spans more than one chip and has a mixture of local and remote memory. For these larger partitions, it is often useful to manually force good affinity for an application. Manual affinity can be forced by binding applications so that they can run only on particular cores, and by specifying to the operating system that only local memory can be used by the application.
Consider an example where you run four instances of WebSphere Application Server on a partition of 16 cores on a POWER7 system that is running in SMT4 mode. Each instance of WebSphere Application Server would be bound to run on four of the cores of the system. Because each of the cores has four SMT threads, each instance of WebSphere Application Server is bound to 16 logical processors. Good memory and cache affinity on AIX can therefore be ensured by completing the following steps:
1. Set the AIX MEMORY_AFFINITY environment variable, typically to the value MCM. This setting tells the AIX operating system to use local memory when an application thread requires physical memory to be allocated.
2. Start the four instances of WebSphere Application Server by running the following execrset commands, which bind the execution to the specified set of logical processors:
 – execrset -c 0-15 -m 0 -e <command to start first WebSphere Application Server instance>
 – execrset -c 16-31 -m 0 -e <command to start second WebSphere Application Server instance>
 – execrset -c 32-47 -m 0 -e <command to start third WebSphere Application Server instance>
 – execrset -c 48-63 -m 0 -e <command to start fourth WebSphere Application Server instance>
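Because an environment variable that is set as a command prefix applies only to that process, both steps can be combined on a single command line, as in this sketch:
MEMORY_AFFINITY=MCM execrset -c 0-15 -m 0 -e <command to start first WebSphere Application Server instance>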
Some important items to understand in this example are:
For a particular number of instances and available cores, the most important consideration is that each instance of an application runs only on the cores of one POWER7 processor chip.
Memory and logical processor binding is not done independently because doing so can negatively affect performance.
The workload must be evenly distributed over WebSphere Application Server processes for the binding to be effective.
There is an assumed mapping of logical processors to cores and chips, which is always established at boot time. This mapping can be altered if the SMT mode of the system is changed by running smtctl -w now. It is always best to reboot to change the SMT mode of a partition to ensure that the assumed mapping is in place.
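For example, on AIX, the current SMT mode can be displayed and changed as follows (verify the syntax for your AIX level):
smtctl               # display the current SMT mode and logical processors
smtctl -t 4 -w boot  # set SMT4 mode, effective at the next reboot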
Along with the operating system facilities for manually establishing cache and memory affinity, the AIX Active System Optimizer (ASO) can be enabled to autonomously establish enhanced affinity.
For more information about ASO, the MEMORY_AFFINITY environment variable, the execrset command, and related environment variables and commands, see Chapter 4, “AIX” on page 67.
The same forced affinity can be established on Linux by running taskset or numactl.
For example:
numactl -C 0-15 -l <command to start first WebSphere Application Server instance>
numactl -C 16-31 -l <command to start second WebSphere Application Server instance>
numactl -C 32-47 -l <command to start third WebSphere Application Server instance>
numactl -C 48-63 -l <command to start fourth WebSphere Application Server instance>
The -l option on these numactl commands is the equivalent of the AIX MEMORY_AFFINITY=MCM environment variable setting.
Even for partitions of eight cores or less, better cache affinity can be established with multiple application instances by using logical processor binding commands. With eight cores or less, the performance effects typically range up to about 10% improvement with binding. For partitions of more than eight cores that span more than one POWER7 processor chip, using manual affinity results in significantly higher performance.
Shared resource environments
Virtualized deployments that share cores among a set of partitions also can use logical processor binding to ensure good affinity within the guest operating system (AIX). However, the real dispatching of physical cores is handled by the underlying host operating system (POWER Hypervisor).
The POWER Hypervisor uses a three-level affinity mechanism in its scheduler to enforce affinity as much as possible. The reason why absolute affinity is not always possible is that partitions can expand and use unused cycles of other LPARs. This process is done using uncapped mode in Power, where the uncapped cycles might not always have affinity. Therefore, binding logical processors that are seen at the operating system level to physical threads seen at the hypervisor level works only in some cases in shared partitions. Achieving a high level of affinity is difficult when multiple partitions share resources from a single pool, especially at high utilization, and when partitions are expanding to use other partition cycles. Therefore, creating large shared processor core pools that span across chips tends to create remote memory accesses. For this reason, it might be less desirable to use larger partitions and large processor core pools where high-level affinity performance is expected.
Virtualized deployments can use Micro-Partitioning, where a partition is allocated a fraction of a core. Micro-Partitioning allows a core allocation as small as 0.1 cores in older firmware levels, and as small as 0.05 cores starting at the 760 firmware level, when coupled with supporting operating system levels. This powerful mechanism provides great flexibility in deployments. However, very small core allocations are more appropriate for situations in which many virtual machines are often idle, so that active 0.05 core LPARs can use those idle cycles. Also, there is one negative performance effect in deployments with very small partitions, in particular with 0.1 or fewer cores at high system utilization: Java warm-up times can be greatly increased. In a Java execution, the JIT compiler produces binary code for Java methods dynamically. Steady-state optimal performance is reached after a portion of the Java methods are compiled to binary code. With very small partitions, there might be a long warm-up period before reaching steady-state performance, because a 0.05 core LPAR cannot get additional cycles from other LPARs while those LPARs are consuming their own cycles. However, if the workload that is running on such a small LPAR does not need more than 5% of a processor core's capacity, then the performance impact is mitigated.
Memory requirements
For good performance, there should be enough physical memory available so that application data does not need to be frequently paged in and out between memory and disk. The physical memory that is allocated to a partition must be enough to satisfy the requirements of the operating system and the applications that are running on the partition.
Java is sensitive to having enough physical memory available to contain the Java heap because Java applications often have frequent GC cycles where large portions of the Java heap are accessed. If portions of the Java heap are paged out to disk by the operating system because of a lack of physical memory, then GC cycles can cause a large amount of disk activity, which is known as thrashing.
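A quick check for thrashing is to watch the paging columns in vmstat output while the workload runs; sustained non-zero paging rates indicate memory pressure. For example:
vmstat 5    # watch the pi/po columns on AIX, or the si/so columns on Linux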
1.5.3 Deep performance optimization guidelines
Performance tools for AIX and Linux are described in Appendix B, “Performance tooling and empirical performance analysis” on page 155. A deep performance optimization effort typically uses those tools and follows this general strategy:
1. Gather general information about the execution of an application when it is running on a dedicated POWER7 performance system. Important statistics to consider are:
 – The user and system CPU usage of the application: Ideally, a multi-threaded application generates a high overall CPU usage with most of the CPU time in user code. Too high a system CPU usage is generally a sign of a locking bottleneck in the application. Too low an overall usage usually indicates some type of resource bottleneck, such as network or disk. For low CPU usage, look at the number of runnable threads reported by the operating system, and try to ensure that there are as many runnable threads as there are logical processors in the partition.
 – The network utilization of the application: Networks can be a bottleneck in execution either because of bandwidth or latency issues. Link aggregation techniques are often used to solve networking issues.
 – The disk utilization of the application: High disk I/O issues are increasingly being solved by using solid-state disks (SSDs).
Common operating system tools for gathering this general information include topas (AIX), top (Linux), vmstat, and netstat. Detailed CPU usage information is available by running sar. This command diagnoses cases where some logical processors are saturated and others are underutilized, an issue that is seen with network interrupt processing on Linux.
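For example, the following invocations (the intervals and counts are arbitrary) collect these basic statistics:
vmstat 5 12      # CPU, memory, and run queue statistics at 5-second intervals
sar -P ALL 5 12  # per-logical-processor CPU usage, showing saturated or idle CPUs
netstat -i       # per-interface packet counts and errors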
2. Collect a time-based profile of the application to see where execution time is concentrated. Some possible areas of concern are:
 – Particular user routines or Java methods with a high concentration of execution time. This situation is an indication of a poor coding practice or an inefficient algorithm that is being used in the application itself.
 – Particular library routines or Java class library methods with a high concentration of execution time. First, determine whether the hot routine or method is legitimately used to that extent. Look for alternatives or more efficient versions, such as using the optimized libraries in the Linux Advance Toolchain or the vector routines in the MASS library (for more information, see “Mathematical Acceleration Subsystem Library and Engineering and Scientific Subroutine Library” on page 11).
 – A concentration of execution time in the pthreads library (see “Java profiling example” on page 178) or in kernel locking (see “AIX kernel locking services” on page 38) routines. This situation is associated with a locking issue. This locking might ultimately arise at the system level (as seen with malloc locking issues on AIX), or at the application level in Java code (associated with synchronized blocks or methods in Java code). The source of locking issues is not always immediately apparent from a profile. For example, with AIX malloc locking issues, the time that is spent in the malloc and free routines might be quite low, with almost all of the impact appearing in kernel locking routines.
The tools for gathering profiles are tprof (AIX) and OProfile (Linux) (both tools are described in “Rational Performance Advisor” on page 161). The curt tool (see “AIX trace-based analysis tools” on page 165) also provides a breakdown that describes where CPU time is consumed, and includes more useful information, such as a system call summary.
3. Where there are indications of a locking issue, collect locking information.
With locking problems, the primary concern is to determine where the locking originates in the application source code. Cases such as AIX malloc locking can be easily solved just by switching to a more scalable memory allocation package through the MALLOCTYPE and MALLOCOPTIONS environment variables. In this case, you should examine how malloc is used and consider making changes at the source code level. For example, rather than repeatedly allocating many small blocks of memory by calling malloc for each block, the application can allocate an array of blocks and then internally manage the space.
As mentioned in “java/util/concurrent” on page 12, Java locking issues that are associated with some older classes, such as java/util/Hashtable, can be easily solved by using java/util/concurrent/ConcurrentHashMap.
For Java programs, use the Java Lock Monitor (see “Java Health Center” on page 177). For non-Java programs, use the splat tool on AIX (see “AIX trace-based analysis tools” on page 165).
4. For Java, the WAIT tool is a powerful, easy-to-use analysis tool that is based on collecting thread state information.
Using the WAIT tool requires installing and running only a data collection shell. The shell collects various information about the Java program execution, the most important of which is a set of javacore files. The javacore files show the state of all of the threads at the time the file was dumped. The collected data is submitted to an online tool using a web browser, and the tool analyzes the data and displays the results with a GUI. The GUI presents information about thread states and has powerful features to drill down to see call chains.
The WAIT tool results combine many of the features of a time-based profile, a lock monitor, and other tools. For Java programs, the WAIT tool might be one of the first analysis tools to consider because of its versatility and ease of use.
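Note that on IBM Java, a javacore file can also be triggered manually by sending the SIGQUIT signal to the Java process (the process ID is hypothetical):
kill -3 1234567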
For more information about IBM Whole-system Analysis of Idle Time, which is the browser-based (that is, no-install) WAIT tool, go to:
 
Guidance about POWER processor chips: The guidance that is provided in this publication generally applies to previous generations of POWER processor chips and systems, including POWER5 and POWER6. When our guidance is not applicable to all generations of Power Systems, we note that for you.
 

1 http://dl.acm.org/citation.cfm?id=2370837 (available for purchase or with access to the ACM Digital Library)