Java
This chapter describes the optimization and tuning of Java based applications that are running on a POWER7 processor-based server. It covers Java levels, 32-bit versus 64-bit Java, memory and page size considerations, garbage collection tuning, application scaling, and related publications.
7.1 Java levels
You should use Java 6 SR7 or later for POWER7 systems for two primary reasons. First,
64 KB pages are used for JVM text, data, and stack memory segments and for the Java heap by default on systems where 64 KB pages are available. Second, the JIT compiler in Java 6 SR7 and later takes advantage of POWER7-specific hardware features for performance. As a result, Java 6 SR7 and later perform better on POWER7 than older releases of the JVM.
Although 64 KB pages have been available since AIX 5L V5.3 and POWER5+ systems, the default page size on AIX is still 4 KB. JVM releases before Java 6 SR7 use the default memory pages that are provided by AIX unless an AIX environment variable is set to override them. (For more information, see “Tuning to capitalize on hardware performance features” on page 12.) Likewise, the Java heap is mapped with the default page size unless the appropriate option is used to override it. (For more information, see 7.3.1, “Medium and large pages for Java heap and code cache” on page 127.) In practice, performance-savvy users already override the default with 64 KB pages to obtain most of the performance benefit of larger memory pages on Power Systems. Java 6 SR7 and later use 64 KB pages almost everywhere 64 KB pages are available and applicable, so with Java 6 SR7 or later, no overriding option is necessary and no AIX environment variable must be set.
The JIT compiler automatically detects the platform it is running on and generates the binary code best suited to, and performing best on, that platform. Java 6 SR7 and later recognize POWER7 and make best use of its hardware features. For example, POWER7 supports prefetch instructions for transient data, that is, data that is needed now but should be evacuated from the CPU caches with priority. Using these instructions results in more efficient use of the CPU caches and leads to better performance when the workload has identifiable transient data, and many Java objects are inherently transient. Java 6 SR7 and later take advantage of these prefetch instructions if the appropriate option is specified, as indicated in 7.3.3, “Prefetching” on page 128.
There is a newer Java release that is only available for Power when bundled with WebSphere Application Server. WebSphere Application Server V8 contains Java 6 with a later, improved V2.6 J9 virtual machine (VM). This VM is preferred over Java 6 with the older V2.4 J9 VM because of the performance improvements it contains.
The newest release of Java, Java 7, became generally available at the end of September 2011. It is available as a stand-alone release (not only when bundled with WebSphere Application Server). As with the updated Java 6 in WebSphere Application Server V8, Java 7 contains significant improvements in the JDK compared to the original Java 6. Java 7 is the preferred version of Java to use on POWER7.
7.2 32-bit versus 64-bit Java
64-bit applications that do not require large amounts of memory typically run slower than 32-bit applications. This situation occurs because of the larger data types, like 64-bit pointers instead of 32-bit pointers, which increase the demand on memory throughput.
The exception to this situation is when the processor architecture has more processor registers in 64-bit mode than in 32-bit mode, and 32-bit application performance is limited by the smaller register set. With fewer registers, the demand on memory throughput can be higher in 32-bit mode than in 64-bit mode. In such a situation, running an application in 64-bit mode is required to achieve the best performance.
The Power Architecture does not require running applications in 64-bit mode to achieve best performance because 32-bit and 64-bit modes have the same number of processor registers.
Consider the following items:
Applications with a small memory requirement typically run faster as 32-bit applications than as 64-bit applications.
64-bit applications have a larger demand on memory because of the larger data types, such as pointers being 64-bit instead of 32-bit, which leads to the following circumstances:
 – The memory footprint increases because of the larger data types.
 – The memory alignment of application data contributes to memory demand.
 – More memory bandwidth is required.
For best performance, use 32-bit Java unless the memory requirement of the application requires running in 64-bit mode.
For more information about this topic, see 7.6, “Related publications” on page 136.
7.3 Memory and page size considerations
IBM Java can take advantage of medium (64 KB) and large (16 MB) page sizes that are supported by the current AIX versions and POWER processors. Using medium or large pages instead of the default 4 KB page size can improve application performance. The performance improvement is a result of more efficient use of the hardware translation caches, which are used when translating application page addresses to physical page addresses. Applications that frequently access a vast amount of memory benefit most from using page sizes that are larger than 4 KB.
Table 7-1 shows the hardware and software requirements for 4 KB, 64 KB, and 16 MB pages:
Table 7-1 Page sizes that are supported by AIX and POWER processors
Page size   Platform            AIX version             Requires user configuration
4 KB        All                 All                     No
64 KB       POWER5+ or later    AIX 5L V5.3 and later   No
16 MB       POWER4 or later     AIX 5L V5.3 and later   Yes
7.3.1 Medium and large pages for Java heap and code cache
Medium and large pages can be enabled for the Java heap and JIT code cache independently of other memory areas. The JVM supports at least three page sizes, depending on the platform:
4 KB (default)
64 KB
16 MB
The -Xlp64k and -Xlp16m options can be used to select the desired page size for the heap and code cache; the -Xlp option is an alias for -Xlp16m. Large pages, specifically 16 MB pages, do have some processing impact and are best suited for long-running applications with large memory requirements. The -Xlp64k option provides many of the benefits of 16 MB pages with less impact and can be suitable for workloads that benefit from large pages but cannot take full advantage of 16 MB pages.
Starting with IBM Java 6 SR7, the default page size is 64 KB.
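As a sketch of how these options combine with heap settings (the class name and heap sizes here are placeholders, not recommendations):

```shell
# 16 MB pages for the Java heap and JIT code cache; requires large pages
# to be configured on the system first, as shown in 7.3.2.
java -Xlp16m -Xms4g -Xmx4g MyApp

# 64 KB pages: much of the benefit of large pages with less impact.
# On Java 6 SR7 and later this is already the default.
java -Xlp64k -Xms4g -Xmx4g MyApp
```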
7.3.2 Configuring large pages for Java heap and code cache
Large pages must be configured on AIX by the system administrator by running vmo. The following example demonstrates how to dynamically configure 1 GB of 16 MB pages:
# vmo -o lgpg_regions=64 -o lgpg_size=16777216
To permanently configure large pages, the -r option must be specified with the vmo command. Run bosboot to configure the large pages at boot time:
# vmo -r -o lgpg_regions=64 -o lgpg_size=16777216
# bosboot -a
Non-root users must have the CAP_BYPASS_RAC_VMM capability on AIX enabled to use large pages. The system administrator can add this capability by running chuser:
# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user_id>
On Linux, 1 GB of 16 MB pages are configured by running echo:
# echo 64 > /proc/sys/vm/nr_hugepages
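On Linux, the configured huge page pool can be verified after the fact; the exact counters shown depend on the kernel version:

```shell
# Request 64 huge pages (64 x 16 MB = 1 GB on POWER), then verify the pool.
echo 64 > /proc/sys/vm/nr_hugepages
grep -i huge /proc/meminfo
```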
7.3.3 Prefetching
Prefetching is an important strategy for reducing memory latency and taking full advantage of the on-chip caches. The -XtlhPrefetch option can be specified to enable aggressive prefetching of thread-local heap (TLH) memory shortly before objects are allocated. This option ensures that the memory required for new objects that are allocated from the TLH is fetched into cache ahead of time where possible, reducing latency and increasing overall object allocation speed. This option can give noticeable gains on workloads that frequently allocate objects, such as transactional workloads.
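Enabling this prefetching is a single command-line option (MyApp is a placeholder application):

```shell
# Aggressively prefetch thread-local heap memory ahead of object allocation.
java -XtlhPrefetch MyApp
```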
7.3.4 Compressed references
For large workloads, a 64-bit JVM might be necessary to meet application needs. 64-bit processes primarily offer a much larger address space, allowing for larger Java heaps and JIT code caches, and reducing the effects of memory fragmentation in the native heap. However, 64-bit processes also incur increased processing impact from higher memory usage and decreased cache utilization. This impact is present with every object allocation, because each object must now be referred to with a 64-bit address rather than a 32-bit address.
To alleviate this impact, use the -Xcompressedrefs option. When this option is enabled, the JVM uses 32-bit references to objects instead of 64-bit references wherever possible. Object references are compressed and extracted as necessary at minimal cost. The need for compression and decompression is determined by the overall heap size and the platform the JVM is running on; smaller heaps can do without compression and decompression, eliminating even this impact. To determine the compression and decompression impact for a heap size on a particular platform, run the following command:
java -Xcompressedrefs -verbose:gc -version ...
The resulting output has the following content:
<attribute name="compressedRefsDisplacement" value="0x0" />
<attribute name="compressedRefsShift" value="0x0" />
Values of 0 for the named attributes essentially indicate that no work must be done to convert between 32-bit and 64-bit references for the invocation. Under these circumstances, 64-bit JVMs running with -Xcompressedrefs can reduce the impact of 64-bit addressing even more and achieve better performance.
With -Xcompressedrefs, the maximum size of the heap is much smaller than the theoretical maximum size allowed by a 64-bit JVM, although greater than the maximum heap under a 32-bit JVM. Currently, the maximum heap size with -Xcompressedrefs is around 31 GB on both AIX and Linux.
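A quick way to check whether a particular heap size pays any compression cost is to filter the verbose GC output for the attributes shown above; the 28 GB heap here is an arbitrary example below the roughly 31 GB limit:

```shell
# Print only the compressed-references attributes for a 28 GB heap.
# Zero values for both indicate no conversion work is needed.
java -Xcompressedrefs -Xmx28g -verbose:gc -version 2>&1 | \
    grep -E 'compressedRefs(Displacement|Shift)'
```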
7.3.5 JIT code cache
JIT compilation is an important factor in optimizing performance. Because compilation is carried out at run time, it is difficult to estimate the size of a program or the number of compilations that will be carried out. The JIT compiler has a cap on how much memory it can allocate at run time to store compiled code, and for most applications the default cap is more than sufficient.
However, certain programs, especially those that take advantage of language features such as reflection, can produce a large number of compilations and use up the allowed amount of code cache. After the code cache limit is reached, no more compilations are performed. This situation can have a negative impact on performance if the program then calls many interpreted methods that cannot be compiled. The -Xjit:codetotal=<nnn> option (where nnn is a number in KB) can be used to specify the cap of the JIT code cache. The default is 64 MB for 32-bit JVMs and 128 MB for 64-bit JVMs.
Another consideration is how the code caches are allocated. If they are allocated far apart from each other (more than 32 MB apart), calls from one code cache to another carry a higher processing impact. The -Xcodecache<size> option can be used to specify how large each allocation of code cache is. For example, -Xcodecache4m means that 4 MB is allocated as code cache each time the JIT compiler needs a new piece, until the cap is reached. Typically, multiple pieces of code cache (for example, four) are available at startup to support multiple compilation threads. Alter the default code cache size only if it is insufficient, because a large but empty code cache needlessly consumes resources.
Two techniques can be used to determine whether the code cache allocation size or total limit must be altered. First, a Java core file can be produced by running kill -3 <pid> after your application reaches a steady state. The core file shows how many pieces of code cache are allocated, and the active amount of code cache can be estimated by summing them. For example, if 20 MB is needed to run the application, -Xcodecache5m (four pieces of 5 MB each) typically allocates 20 MB of code cache at startup, and the pieces are likely to be close to each other, giving better performance for cross-code cache calls. Second, to determine whether the total code cache is sufficient, the -Xjit:verbose option can be used to print method names as they are compiled. If compilation fails because the code cache limit is reached, an error to that effect is printed.
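Putting the two options together, a tuned invocation might look as follows; the 128 MB total (131072 KB) and 5 MB allocation size are illustrative values, not recommendations:

```shell
# Cap the total JIT code cache at 128 MB, allocate it in 5 MB pieces,
# and log compiled methods to confirm the cap is never reached.
java -Xcodecache5m "-Xjit:codetotal=131072,verbose" MyApp
```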
7.3.6 Shared classes
The IBM JVM supports class data sharing between multiple JVMs. The -Xshareclasses option can be used to enable it, and the -Xscmx<size> option can be used to specify the maximum cache size of the stored data, where <size> can be <nnn>K, <nnn>M, or <nnn>G for sizes in KB, MB, or GB.
The shared class data is stored in a memory-mapped cache file on disk. Sharing reduces the overall virtual storage consumption when more than one JVM shares a cache. Sharing also reduces the start time for a JVM after the cache is created. The shared class cache is independent of any running JVM and persists until it is deleted.
A shared cache can contain:
Bootstrap classes
Application classes
Metadata that describes the classes
Ahead-of-time (AOT) compiled code
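As a sketch (the cache name and size are illustrative), class sharing can be enabled as follows:

```shell
# Create or connect to a shared class cache named "appcache",
# capping the stored data at 64 MB.
java -Xshareclasses:name=appcache -Xscmx64m MyApp
```

A second JVM started with the same options attaches to the existing cache and benefits from the reduced start time.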
7.4 Java garbage collection tuning
The IBM Java VM supports multiple garbage collection (GC) strategies to allow software developers an opportunity to prioritize various factors. Throughput, latency, and scaling are the main factors that are addressed by the different collection strategies. Understanding how an application behaves regarding allocation frequencies, required heap size, expected lifetime of objects, and other factors can make one or more of the non-default GC strategies preferable. The GC strategy can be specified with the -Xgcpolicy:<policy> option.
7.4.1 GC strategy: Optthruput
This strategy prioritizes throughput at the expense of maximum latency by waiting until the last possible time to perform a GC. A global GC of the entire heap is then performed, creating a longer pause time. When GC is triggered, the GC stops all application threads and performs the three GC phases:
Mark
Sweep
Compact (if necessary)
All phases are parallelized to perform GC as quickly as possible.
The optthruput strategy is the default in the original Java 6 that uses the V2.4 J9 VM.
7.4.2 GC strategy: Optavgpause
This strategy prioritizes latency and response time by performing the initial mark phase of GC concurrently with the execution of the application. The application is halted only for the sweep and compact phases, minimizing the total time that the application is paused. Performing the mark phase concurrently with the execution of the application might affect throughput, because the CPU time that would otherwise go to the application can be diverted to low priority GC threads to carry out the mark phase. This situation can be acceptable on machines with many processor cores and relatively few application threads, as idle processor cores can be put to good use otherwise.
7.4.3 GC strategy: Gencon
This strategy employs a generational GC scheme that attempts to deal with many varying workloads and memory usage patterns. In addition, gencon also uses concurrent marking to minimize pause times. The gencon strategy works by dividing the heap into two categories:
New space
Old space
The new space is dedicated to short-lived objects that are created frequently and unreferenced shortly thereafter. The old space is for long-lived objects that survived long enough to be promoted from the new space. This GC policy is suited to workloads that have many short-lived objects, such as transactional workloads, because GC in the new space (carried out by the scavenger) is cheaper per object overall than GC in the old space. By default, up to 25% of the heap is dedicated to the new space. The division between the new space and the old space can be controlled with the -Xmn option, which specifies the size of the new space; the remaining space is then designated as the old space. Alternatively, -Xmns and -Xmnx can be used to set the starting and maximum new space sizes if a non-constant new space size is wanted. For more information about constant versus non-constant heaps in general, see 7.4.5, “Optimal heap size” on page 132.
The gencon strategy is the default in the updated Java 6 that uses the V2.6 J9 VM, and in the later Java 7 version.
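For example, a transactional workload might dedicate a fixed 256 MB new space inside a constant 1 GB heap (the sizes are illustrative, not recommendations):

```shell
# Generational, concurrent GC with a constant 256 MB new space;
# the remaining 768 MB of the 1 GB heap becomes the old space.
java -Xgcpolicy:gencon -Xmn256m -Xms1g -Xmx1g MyApp
```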
7.4.4 GC strategy: Balanced
This strategy evens out pause times across GC operations that are based on the amount of work that is being generated. This strategy can be affected by object allocation rates, object survival rates, and fragmentation levels within the heap. This smoothing of pause times is a best effort rather than a real-time guarantee. A fundamental aspect of the balanced collector's architecture, which is critical to achieving its goals of reducing the impact of large collection times, is that it is a region-based garbage collector. A region is a clearly delineated portion of the Java object heap that categorizes how the associated memory is used and groups related objects together.
During the JVM startup, the garbage collector divides the heap memory into equal-sized regions, and these region delineations remain static for the lifetime of the JVM. Regions are the basic unit of GC and allocation operations. For example, when the heap is expanded or contracted, the memory that is committed or released corresponds to a number of regions.
Although the Java heap is a contiguous range of memory addresses, any region within that range can be committed or released as required. This situation enables the balanced collector to contract the heap more dynamically and aggressively than other garbage collectors, which typically require the committed portion of the heap to be contiguous. Java heap configuration for -Xgcpolicy:balanced strategy can be specified through the -Xmn, -Xmx, and -Xms options.
7.4.5 Optimal heap size
By default, the JVM provides a considerably flexible heap configuration that allows the heap to grow and shrink dynamically in response to the needs of the application. This configuration allows the JVM to claim only as much memory as necessary at any time, thus cooperating with other processes that are running on the system. The starting and maximum size of the heap can be specified with the -Xms and -Xmx options.
This flexibility comes at a cost, as the JVM must request memory from the operating system whenever the heap must grow and return memory whenever it shrinks. This behavior can lead to various unwanted scenarios. If the application heap requirements oscillate, this situation can cause excessive heap growth and shrinkage.
If the JVM is running on a dedicated machine, the processing impact of heap resizing can be eliminated by requesting a constant sized heap. This situation can be accomplished by setting -Xms equal to -Xmx. Choosing the correct size for the heap is highly important, as GC impact is directly proportional to the size of the heap. The heap must be large enough to satisfy the application's maximum memory requirements and contain extra space. The GC must work much harder when the heap is near full capacity because of fragmentation and other issues, so 20 - 30% of extra space above the maximum needs of the application can lower the overall GC impact.
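As a rough sketch of this sizing rule (the 3 GB peak figure is a hypothetical application requirement, as measured from verbose GC logs):

```shell
# Hypothetical peak live-data requirement of the application, in MB.
PEAK_MB=3072

# Add 30% headroom so the GC is not working against a nearly full heap.
HEAP_MB=$(( PEAK_MB * 130 / 100 ))

# A constant-sized heap: -Xms equal to -Xmx eliminates resizing impact.
echo "java -Xms${HEAP_MB}m -Xmx${HEAP_MB}m MyApp"
```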
If an application requires more flexibility than can be achieved with a constant sized heap, it might be beneficial to tune the sizing parameters for a dynamic heap. One of the most expensive GC events is object allocation failure. This failure occurs when there is not enough contiguous space in the current heap to satisfy the allocation, and results in a GC collection and a possible heap expansion. If the current heap size is less than the -Xmx size, the heap is expanded in response to the allocation failure if the amount of free space is below a certain threshold. Therefore, it is important to ensure that when an allocation fails, the heap is expanded to allow not only the failed allocation to succeed, but also many future allocations, or the next failed allocation might trigger yet another GC collection. This situation is known as heap thrashing.
The -Xminf, -Xmaxf, -Xmine, and -Xmaxe group of options can be used to affect when and how the GC resizes the heap. The -Xminf<factor> option (where factor is a real number between 0 and 1) specifies the minimum free space in the heap; if the total free space falls below this factor, the heap is expanded. The -Xmaxf<factor> option specifies the maximum free space; if the total free space rises above this factor, the heap is shrunk. These options can be used to minimize heap thrashing and excessive resizing. The -Xmine and -Xmaxe options specify the minimum and maximum sizes by which to shrink and grow the heap. These options can be used to ensure that the heap has enough free contiguous space to satisfy a reasonable number of allocations before failure.
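These options combine as in the following sketch; the thresholds and step sizes are illustrative, not recommendations:

```shell
# Expand when free space drops below 30%, shrink when it rises above 60%,
# and grow or shrink in steps between 16 MB and 256 MB.
java -Xminf0.3 -Xmaxf0.6 -Xmine16m -Xmaxe256m MyApp
```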
Regardless of whether the heap size is constant, it should never be allowed to exceed the physical memory available to the process; otherwise, the operating system might have to swap data in and out of memory. An application's memory behavior can be determined by using various tools, including verbose GC logs. For more information about verbose GC logs and other tools, see “Java (either AIX or Linux)” on page 176.
7.5 Application scaling
Large workloads using many threads on multi-CPU machines face extra challenges regarding concurrency and scaling. In such cases, steps can be taken to decrease contention on shared resources and reduce the processing impact.
7.5.1 Choosing the correct SMT mode
AIX and Linux represent each SMT thread as a logical CPU. Therefore, the number of logical CPUs in an LPAR depends on the SMT mode. For example, an LPAR with four virtual processors that are running in SMT4 mode has 16 logical CPUs; an LPAR with that same number of virtual processors that are running in SMT2 mode has only eight logical CPUs.
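The arithmetic is simply virtual processors multiplied by SMT threads per core, as this small sketch shows:

```shell
# Logical CPUs seen by the operating system =
#   virtual processors x SMT threads per core.
VIRTUAL_PROCS=4

echo "SMT4: $(( VIRTUAL_PROCS * 4 )) logical CPUs"
echo "SMT2: $(( VIRTUAL_PROCS * 2 )) logical CPUs"
```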
Table 7-2 shows the number of SMT threads and logical CPUs available in ST, SMT2, and SMT4 modes.
Table 7-2 ST, SMT2, and SMT4 modes - SMT threads and CPUs available
SMT mode   Number of SMT threads   Number of logical CPUs
ST         1                       1
SMT2       2                       2
SMT4       4                       4
The default SMT mode on POWER7 depends on the AIX version and the compatibility mode the processor cores are running with. Table 7-3 shows the default SMT modes.
Table 7-3 SMT mode on POWER7 is dependent upon AIX and compatibility mode
AIX version   Compatibility mode   Default SMT mode
AIX V6.1      POWER7               SMT4
AIX V6.1      POWER6/POWER6+       SMT2
AIX 5L V5.3   POWER6/POWER6+       SMT2
Most applications benefit from SMT. However, some applications do not scale with an increased number of logical CPUs on an SMT-enabled system. One way to address such a scalability issue is to use a smaller LPAR, or to use processor binding, as described in 7.5.2, “Using resource sets” on page 133. For applications that might benefit from a lower SMT mode with fewer logical CPUs, experiment with SMT2 or ST modes (see “Hybrid thread and core” on page 79 and “Selecting different SMT modes” on page 105).
7.5.2 Using resource sets
Resource sets (RSETs) allow you to specify which logical CPUs an application can run on. They are useful when an application that does not scale beyond a certain number of logical CPUs runs on a large LPAR, for example, an application that scales up to eight logical CPUs but runs on an LPAR that has 64 logical CPUs.
Resource sets can be created with the mkrset command and attached to a process with the attachrset command. Alternatively, a resource set can be created and attached to an application in a single step with the execrset command.
The following example demonstrates how to use execrset to create an RSET with CPUs 4 - 7 and run an application that is attached to it:
execrset -c 4-7 -e <application>
In addition to running the application attached to an RSET, set the MEMORY_AFFINITY environment variable to MCM to assure that the application’s private and shared memory is allocated from memory that is local to the logical CPUs of the RSET:
MEMORY_AFFINITY=MCM
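Combining the two, an application can be launched on CPUs 4 - 7 with local memory in a single line (app.jar is a placeholder):

```shell
# Run a Java application bound to logical CPUs 4-7, with its private and
# shared memory allocated locally to those CPUs.
MEMORY_AFFINITY=MCM execrset -c 4-7 -e java -jar app.jar
```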
In general, RSETs are created on core boundaries. For example, a partition with four POWER7 cores running in SMT4 mode has 16 logical CPUs. Create an RSET with four logical CPUs by selecting four SMT threads that belong to one core; create an RSET with eight logical CPUs by selecting eight SMT threads that belong to two cores. The smtctl command can be used to determine which logical CPUs belong to which core, as shown in Example 7-1.
Example 7-1 Use the smtctl command to determine which logical CPUs belong to which core
# smtctl
This system is SMT capable.
This system supports up to 4 SMT threads per processor.
SMT is currently enabled.
SMT boot mode is not set.
SMT threads are bound to the same physical processor.
 
proc0 has 4 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0
Bind processor 2 is bound with proc0
Bind processor 3 is bound with proc0
 
proc4 has 4 SMT threads.
Bind processor 4 is bound with proc4
Bind processor 5 is bound with proc4
Bind processor 6 is bound with proc4
Bind processor 7 is bound with proc4
The smtctl output in Example 7-1 shows that the system is running in SMT4 mode, with bind processors (logical CPUs) 0 - 3 belonging to proc0 and bind processors 4 - 7 belonging to proc4. Create an RSET with four logical CPUs for either CPUs 0 - 3 or CPUs 4 - 7.
To achieve the best performance with RSETs that are created across multiple cores, all cores of the RSET must be from the same chip and in the same scheduler resource allocation domain (SRAD). The lssrad command can be used to determine which logical CPUs belong to which SRAD, as shown in Example 7-2:
Example 7-2 Use the lssrad command to determine which logical CPUs belong to which SRAD
# lssrad -av
REF1 SRAD MEM CPU
0
0 22397.25 0-31
1
1 29801.75 32-63
The output in Example 7-2 shows a system that has two SRADs. CPUs 0 - 31 belong to the first SRAD, and CPUs 32 - 63 belong to the second SRAD. In this example, create an RSET with multiple cores either using the CPUs of the first or second SRAD.
 
Authority for RSETs: A user must have root authority or the CAP_NUMA_ATTACH capability to use RSETs.
7.5.3 Java lock reservation
Synchronization and locking are an important part of any multi-threaded application. Shared resources must be adequately protected by monitors to ensure correctness, even if some resources are only infrequently shared. If a resource is primarily accessed by a single thread at any time, that thread is frequently the only thread to acquire the monitor that is guarding the resource. In such cases, the cost of acquiring the monitor can be reduced by using the -XlockReservation option. With this option, it is assumed that the last thread to acquire the monitor is also likely to be the next thread to acquire it. The lock is, therefore, said to be reserved for that thread, minimizing its cost to acquire and release the monitor. This option is suited to workloads using many threads and many shared resources that are infrequently shared in practice.
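Enabling lock reservation is again a single option (MyApp is a placeholder):

```shell
# Reserve monitors for the last acquiring thread; suited to resources
# that are shared in principle but contended only rarely.
java -XlockReservation MyApp
```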
7.5.4 Java GC threads
The GC used by the JVM takes every opportunity to use parallelism on multi-CPU machines. All phases of the GC can run in parallel, with multiple helper threads dividing the work to complete the task as quickly as possible. Depending on the GC strategy and heap size in use, it can be beneficial to adjust the number of threads that the GC uses. The number of GC threads can be specified with the -Xgcthreads<number> option. The default number of GC threads is generally equal to the number of logical processors in the partition, and it is usually not helpful to exceed this value. Reducing it, however, reduces the GC impact and can be desirable in some situations, such as when RSETs are used. The number of GC threads is capped at 64 starting with the V2.6 J9 VM.
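For example, when an application is confined to an eight-CPU RSET on a larger LPAR, matching the GC thread count to the RSET keeps GC helper threads from oversubscribing the restricted CPUs (the values are illustrative):

```shell
# Limit GC parallelism to the eight logical CPUs of the RSET.
MEMORY_AFFINITY=MCM execrset -c 0-7 -e java -Xgcthreads8 MyApp
```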
7.5.5 Java concurrent marking
The gencon policy combines concurrent marking with generational GC. If generational GC is desired but the impact of concurrent marking (both the cost of the marking thread and the extra book-keeping that is required when allocating and manipulating objects) is not, then concurrent marking can be disabled with the -Xconcurrentlevel0 option. This option is appropriate for workloads that benefit from the gencon policy's handling of object allocation and lifetimes, but also require maximum throughput and minimal GC impact while the application threads are running.
In general, for both the gencon and optavgpause GC policies, concurrent marking can be tuned with the -Xconcurrentlevel<number> option, which specifies the ratio between the amounts of heap that is allocated and heap marked. The default value is 8. The number of low priority mark threads can be set with the -Xconcurrentbackground<number> option. By default, one thread is used for concurrent marking.
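A sketch of these concurrency controls follows; the values are illustrative, not recommendations:

```shell
# gencon without concurrent marking: maximum throughput while
# application threads are running.
java -Xgcpolicy:gencon -Xconcurrentlevel0 MyApp

# optavgpause with less frequent concurrent marking (ratio 16 instead
# of the default 8) and two background mark threads.
java -Xgcpolicy:optavgpause -Xconcurrentlevel16 -Xconcurrentbackground2 MyApp
```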
For more information about this topic, see 7.6, “Related publications” on page 136.
7.6 Related publications
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter:
Java performance for AIX on POWER7 – best practices, found at:
Java Performance on POWER7, found at:
Top 10 64-bit IBM WebSphere Application Server FAQ, found at:
 