AIX
This chapter describes the optimization and tuning of a POWER7 processor-based server running the AIX operating system. It covers the following topics:
4.1 AIX and system libraries
Here we present information about AIX and system libraries.
4.1.1 AIX operating system-specific optimizations
This section describes optimization methods specific to AIX.
Malloc
Every application needs a fast, scalable, and memory-efficient allocator. However, each application’s memory request patterns are different. It is difficult to provide one common allocator or tunable that can satisfy the needs of all applications. AIX provides different memory allocators and suboptions within the allocator, so that a system administrator or developer can choose more suitable settings for their application. This section explains the available choices and when to choose them.
Memory allocators
AIX provides three different allocators, and each of them uses a different memory management algorithm and data structures. These allocators work independently, so the application developer must choose one of them by exporting the MALLOCTYPE
environment variable. The allocators are:
Default allocator
The default allocator is selected when the MALLOCTYPE environment variable is unset. This setting maintains consistent performance, even in a worst case scenario, but might not be as memory efficient as the Watson allocator. This allocator is ideal for 32-bit applications that do not make frequent calls to malloc().
Watson allocator
This allocator is selected when MALLOCTYPE=watson is set. This allocator is designed for 64-bit applications. It is memory efficient, scalable, and provides good performance. This allocator has a built-in bucket component for allocation requests up to 512 bytes. Table 4-1 provides the mapping for the allocation requests to bucket size.
Table 4-1 Mapping for allocation requests to bucket size

Request size  Bucket size    Request size  Bucket size    Request size  Bucket size    Request size  Bucket size
1 - 4         4              33 - 40       40             129 - 144     144            257 - 288     288
5 - 8         8              41 - 48       48             145 - 160     160            289 - 320     320
9 - 12        12             49 - 56       56             161 - 176     176            321 - 352     352
13 - 16       16             57 - 64       64             177 - 192     192            353 - 384     384
17 - 20       20             65 - 80       80             193 - 208     208            385 - 416     416
21 - 24       24             81 - 96       96             209 - 224     224            417 - 448     448
25 - 28       28             97 - 112      112            225 - 240     240            449 - 480     480
29 - 32       32             113 - 128     128            241 - 256     256            481 - 512     512
This allocator is ideal for 64-bit memory-intensive applications.
Malloc 3.1 allocator
This allocator is selected when MALLOCTYPE=3.1 is set. This is a bucket allocator that divides the heap into 28 hash buckets, each with a size of 2^(x+4), where x is the bucket index. This allocator provides the best performance at the cost of memory. In most cases, this algorithm can use as much as twice the amount of memory that is actually requested by the application. In addition, an extra page is required for buckets larger than 4096 bytes because objects of a page in size or larger are page-aligned. Interestingly, some earlier customer applications still use this allocator, as it is more tolerant of application memory-overwrite bugs.
Memory allocator suboptions
There are many suboptions available that can be selected by exporting the MALLOCOPTIONS environment variable. This section covers a few of the suboptions that are more relevant to performance tuning. For a complete list of options, see System Memory Allocation Using the malloc Subsystem, available at:
Multiheap
By default, the malloc subsystem uses a single heap, which causes lock contention for internal locks that are used by malloc in the case of multi-threaded applications. By enabling this option, you can configure the number of parallel heaps to be used by allocators. You can set the multiheap by exporting MALLOCOPTIONS=multiheap[:n], where n can vary between 1 and 32; 32 is the default if n is not specified.
Use this option for multi-threaded applications, as it can improve performance.
Buckets
This suboption is similar to the built-in bucket allocator of the Watson allocator. However, with this option, you can have fine-grained control over the number of buckets, the number of blocks per bucket, and the size of each bucket. This option also provides a way to view the usage statistics of each bucket, which can be used to refine the bucket settings.
If the application makes many requests of the same size, the bucket allocator can be configured to preallocate the required size by correctly specifying the bucket options. Block sizes can go beyond 512 bytes, unlike the built-in buckets of the Watson allocator or the malloc pool options.
You can enable the buckets allocator by exporting MALLOCOPTIONS=buckets. Complete details about the buckets options for fine-grained control are available1. Enabling the buckets allocator turns off the built-in bucket component if the Watson allocator is used.
malloc pools
This option enables a high performance front end to the malloc subsystem for managing storage objects smaller than 513 bytes. This suboption is similar to the built-in bucket allocator of the Watson allocator. However, this suboption maintains a set of buckets for each thread, providing lock-free allocation and deallocation for blocks smaller than 513 bytes. This suboption improves the performance of multi-threaded applications, as the time spent on locking is avoided for blocks smaller than 513 bytes.
The pool option makes small memory block allocations fast (no locking) and memory efficient (no header on each allocation object). The pool malloc both speeds up single-threaded applications and improves the scalability of multi-threaded applications.
malloc disclaim
By enabling this option, free() automatically disclaims memory. This suboption is useful for reducing the paging space requirement. This option can be set by exporting MALLOCOPTIONS=disclaim.
Use cases
Here are some use cases that you can use to set up your environment:
1. For a 32-bit single-threaded application, use the default allocator.
2. For a 64-bit application, use the Watson allocator.
3. For multi-threaded applications, use the multiheap option. Set the number of heaps proportional to the number of threads in the application.
4. For single-threaded or multi-threaded applications that make frequent allocations and deallocations of memory blocks smaller than 513 bytes, use the malloc pool option.
5. If the application's memory usage pattern shows heavy use of memory blocks of the same size (or sizes that map to a common block size in the buckets option), including sizes greater than 512 bytes, configure and use the malloc buckets option.
6. For older applications that require high performance and do not have memory fragmentation issues, use malloc 3.1.
7. Ideally, the Watson allocator, along with the multiheap and malloc pool options, is good for most multi-threaded applications; the pool front end is fast and is scalable for small allocations, while multiheap ensures scalability for larger and less frequent allocations.
8. If you notice high memory usage in the application process even after the application frees memory, the disclaim option can help.
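To compare these settings empirically, a micro-benchmark can be run under different environment settings. The following minimal C sketch (hypothetical, not from the AIX documentation) performs many small, short-lived allocations; running it with different MALLOCTYPE and MALLOCOPTIONS values exported in the shell beforehand gives a rough comparison:

/* malloc_bench.c - a hypothetical micro-benchmark for comparing AIX malloc
 * settings. Build: cc -O2 -o malloc_bench malloc_bench.c
 * Run, for example: MALLOCTYPE=watson MALLOCOPTIONS=multiheap,pool ./malloc_bench
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERATIONS 1000000
#define BLOCK_SIZE 128   /* small block: served by the pool or bucket front ends */

int main(void)
{
    clock_t start = clock();
    long i;

    for (i = 0; i < ITERATIONS; i++) {
        void *p = malloc(BLOCK_SIZE);
        if (p == NULL) {
            perror("malloc");
            return 1;
        }
        free(p);   /* immediate free stresses the allocator fast path */
    }

    printf("%d malloc/free pairs took %.2f seconds\n", ITERATIONS,
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}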
For more information about this topic, see 4.4, “Related publications” on page 94.
Pthread tunables
The AIX pthread library can be customized with a set of environment variables. Specific variables that improve scaling and CPU usage are listed here. A full description is provided in the following documentation:
AIXTHREAD_SCOPE={P|S}
The P option signifies a process-wide contention scope (M:N) while the S option signifies a system-wide contention scope (1:1). Use system scope (1:1) for AIX. Although process scope (M:N) continues to be supported, it is no longer being enhanced in AIX.
SPINLOOPTIME=n
The SPINLOOPTIME variable controls the number of times the system tries to acquire a busy mutex or spin lock without taking a secondary action, such as calling the kernel to yield the process. This control is intended for MP systems, where the expectation is that the lock, held by another actively running pthread, will be released shortly. The parameter works only within libpthreads (user threads). If locks are usually available within a short period, you might want to increase the spin time by setting this environment variable. The number of times to try a busy lock before yielding to another pthread is n. The default is 40 and n must be a positive value.
YIELDLOOPTIME=n
The YIELDLOOPTIME variable controls the number of times that the system yields the logical processor when it tries to acquire a busy mutex or spin lock before it goes to sleep on the lock. The logical processor is yielded to another kernel thread, assuming that there is another executable thread with sufficient priority. This variable is effective in complex applications, where multiple locks are in use. The number of times to yield the logical processor before blocking on a busy lock is n. The default is 0 and n must be a
positive value.
For more information about this topic, see 4.4, “Related publications” on page 94.
pollset
AIX 5L V5.3 introduced the pollset APIs. Pollsets are an AIX replacement for UNIX select() and poll(). Pollset, select(), and poll() all allow an application to efficiently query the status of file descriptors. This action is typically done to allow a single application to multiplex I/O across many file descriptors. Pollset APIs can be more efficient when the number of file descriptors that are queried becomes large.
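The basic pollset call sequence is pollset_create() to create a pollset, pollset_ctl() to register descriptors, pollset_poll() to wait for events, and pollset_destroy() to clean up (all declared in sys/pollset.h). The following minimal sketch illustrates the flow for a single descriptor; using stdin and an array size of 16 are assumptions for illustration:

/* pollset_example.c - minimal pollset flow (sketch) */
#include <stdio.h>
#include <sys/poll.h>
#include <sys/pollset.h>

int main(void)
{
    pollset_t ps;
    struct poll_ctl ctl;
    struct pollfd events[16];
    int fd = 0;                         /* stdin, purely for illustration */
    int nready;

    ps = pollset_create(-1);            /* -1: no preset descriptor limit */
    if (ps < 0) {
        perror("pollset_create");
        return 1;
    }

    ctl.cmd = PS_ADD;                   /* register fd with the pollset */
    ctl.events = POLLIN;                /* wake when fd becomes readable */
    ctl.fd = fd;
    if (pollset_ctl(ps, &ctl, 1) != 0)
        perror("pollset_ctl");

    nready = pollset_poll(ps, events, 16, -1);  /* block until an event */
    printf("%d descriptor(s) ready\n", nready);
    /* events[0..nready-1] describe the ready descriptors */

    pollset_destroy(ps);
    return 0;
}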
Efficient I/O event polling through the pollset interface on AIX contains a pollset summary and outlines its most advantageous use with Java. To see this topic, go to:
For more information about this topic, see 4.4, “Related publications” on page 94.
File system performance benefits
AIX Enhanced Journaled File System (JFS2) is the default file system for 64-bit kernel environments. Applications can capitalize on the features of JFS2 for better performance.
Direct I/O
The AIX read-ahead and write-behind JFS2 feature might not be suitable for applications that perform large I/O operations, as the cache hit ratio is low. In those cases, an application developer must evaluate Direct I/O for I/O-intensive applications.
Programs that are good candidates for direct I/O are typically CPU-limited and perform much disk I/O. Technical applications that have large sequential I/Os are good candidates. Applications that benefit from striping are also good candidates.
The direct I/O access method bypasses the file cache and transfers data directly from disk into the user space buffer, as opposed to using the normal cache policy of placing pages in kernel memory.
At the user level, file systems can be mounted using the dio option on the mount command.
At the programming level, applications enable direct I/O access to a file by passing the O_DIRECT flag to the open subroutine. This flag is defined in the fcntl.h file. Applications must be compiled with _ALL_SOURCE enabled to see the definition of O_DIRECT.
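The following minimal sketch opens a file with O_DIRECT and issues an aligned read. The file path and the 4096-byte figure are assumptions for illustration; direct I/O generally requires the user buffer, file offset, and transfer length to be aligned to the file system block size:

/* dio_example.c - minimal direct I/O sketch */
#define _ALL_SOURCE                     /* exposes O_DIRECT in fcntl.h */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    ssize_t n;
    int fd = open("/data/bigfile", O_RDONLY | O_DIRECT);  /* path is hypothetical */

    if (fd < 0) {
        perror("open with O_DIRECT");
        return 1;
    }
    if (posix_memalign(&buf, 4096, 4096) != 0) {   /* suitably aligned buffer */
        close(fd);
        return 1;
    }
    n = read(fd, buf, 4096);            /* transfers directly from disk */
    printf("read %ld bytes with direct I/O\n", (long)n);

    free(buf);
    close(fd);
    return 0;
}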
For more information, see Working with File I/O, available at:
Concurrent I/O
An AIX JFS2 inode lock imposes write serialization at the file level. Serializing write accesses prevents data inconsistency because of overlapping writes. Serializing reads with respect to writes ensures that the application does not read stale data.
However, some applications can choose to implement their own data serialization, usually at a finer level of granularity than the file. Therefore, they do not need the file system to implement this serialization for them. The inode lock hinders performance in such cases, by unnecessarily serializing non-competing data accesses. For such applications, AIX offers the concurrent I/O (CIO) option. Under concurrent I/O, multiple threads can simultaneously perform reads and writes on a shared file. Applications that do not enforce serialization for accesses to shared files should not use concurrent I/O, as it can result in data corruption because of competing accesses.
Enhanced JFS supports concurrent file access to files. Similar to direct I/O, this access method bypasses the file cache and transfers data directly from disk into the user
space buffer.
Concurrent I/O can be specified for a file either by running mount -o cio or by using the open() system call (by using O_CIO as the OFlag parameter).
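At the programming level, the usage mirrors direct I/O. A minimal sketch follows (the path is hypothetical, and the sketch assumes the same _ALL_SOURCE requirement as O_DIRECT):

/* cio_example.c - minimal concurrent I/O sketch; the application is assumed
 * to perform its own serialization of writes to this file. */
#define _ALL_SOURCE                     /* exposes O_CIO in fcntl.h */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/data/dbfile", O_RDWR | O_CIO);
    if (fd < 0) {
        perror("open with O_CIO");
        return 1;
    }
    close(fd);
    return 0;
}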
Asynchronous I/O
If an application does a synchronous I/O operation, it must wait for the I/O to complete. In contrast, asynchronous I/O operations run in the background and do not block user applications, which improves performance, because I/O operations and applications processing can run simultaneously. Many applications, such as databases and file servers, take advantage of the ability to overlap processing and I/O.
Applications can use the aio_read(), aio_write(), or lio_listio() subroutines (or their 64-bit counterparts) to perform asynchronous disk I/O. Control returns to the application from the subroutine when the request is queued. The application can then continue processing while the disk operation is being performed.
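As a minimal illustration (the file path is hypothetical, and this sketch assumes the POSIX AIO interfaces; AIX also provides a legacy AIX AIO variant selected at compile time), the following issues one asynchronous read and then waits for its completion:

/* aio_example.c - one asynchronous read using the POSIX AIO interfaces */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096];
    struct aiocb cb;
    const struct aiocb *list[1];
    ssize_t n;
    int fd = open("/data/bigfile", O_RDONLY);   /* path is hypothetical */

    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) {           /* returns once the request is queued */
        perror("aio_read");
        return 1;
    }

    /* The application can do other work here while the disk operation runs. */

    list[0] = &cb;
    aio_suspend(list, 1, NULL);         /* block until the request completes */
    n = aio_return(&cb);                /* collect the result */
    printf("read %ld bytes asynchronously\n", (long)n);

    close(fd);
    return 0;
}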
I/O completion ports
A limitation of the AIO interface that is used in a threaded environment is that aio_nwait() collects completed I/O requests for all threads in the same process. One thread collects completed I/O requests that are submitted by another thread.
Another limitation is that multiple threads cannot invoke the collection routines (such as aio_nwait()) at the same time. If one thread issues aio_nwait() while another thread is calling it, the second aio_nwait() returns EBUSY. This limitation can affect I/O performance when many I/Os must run at the same time and a single thread cannot run fast enough to collect all the completed I/Os.
On AIX, using I/O completion ports with AIO requests provides the capability for an application to capture the results of various AIO operations on a per-thread basis in a multi-threaded environment. This functionality provides threads with a method of receiving a completion status for only the AIO requests initiated by the thread.
You can enable IOCP on AIX by running smitty iocp. Verify that IOCP is enabled by running the following command:
lsdev -Cc iocp
The resulting output should match the following example:
iocp0 Available I/O Completion Ports
shmat versus mmap
Memory mapped files provide a mechanism for a process to access files by directly incorporating file data into the process address space. The use of mapped files can reduce I/O data movement because the file data does not have to be copied into process data buffers, as is done by the read and write subroutines. When more than one process maps the same file, its contents are shared among them, providing a low-impact mechanism by which processes can synchronize and communicate.
AIX provides two methods for mapping files and anonymous memory regions. The first set of services, which are known collectively as the shmat services, are typically used to create and use shared memory segments from a program. The second set of services, which are known collectively as the mmap services, is typically used for mapping files, although it can be used for creating shared memory segments as well.
Both the mmap and shmat services provide the capability for multiple processes to map the same region of an object so that they share addressability to that object. However, the mmap subroutine extends this capability beyond that provided by the shmat subroutine by allowing a relatively unlimited number of such mappings to be established. Although this capability increases the number of mappings that are supported per file object or memory segment, it can prove inefficient for applications in which many processes map the same file data into their address space. The mmap subroutine provides a unique object address for each process that maps to an object. The software accomplishes this task by providing each process with a unique virtual address, which is known as an alias. The shmat subroutine allows processes to share the addresses of the mapped objects.
shmat can be used to share memory segments in a way that is similar to how it creates and uses files. An extended shmat capability is available for 32-bit applications with their limited address spaces. If you define the EXTSHM=ON environment variable, then processes running in that environment can create and attach more than 11 shared memory segments.
Use the shmat services under the following circumstances:
When mapping files larger than 256 MB
When mapping shared memory regions that must be shared among unrelated processes (no parent-child relationship)
When mapping entire files
In general, shmat is more efficient but less flexible.
Use mmap under the following circumstances:
Many files are mapped simultaneously.
Only a portion of a file must be mapped.
Page-level protection must be set on the mapping (protection can be applied on 4 KB boundaries).
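The following sketch (sizes and path are illustrative assumptions, and error handling is reduced for brevity) contrasts the two sets of services: creating and attaching a System V shared memory segment with shmget()/shmat(), and mapping a file with mmap():

/* map_example.c - contrasting the shmat and mmap services (sketch) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

int main(void)
{
    /* shmat services: create and attach a 1 MB shared memory segment */
    int shmid = shmget(IPC_PRIVATE, 1024 * 1024, IPC_CREAT | 0600);
    void *shm = shmat(shmid, NULL, 0);

    /* mmap services: map the first 4 KB of a file, read-only */
    int fd = open("/data/somefile", O_RDONLY);   /* path is hypothetical */
    void *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

    printf("segment at %p, file mapping at %p\n", shm, map);

    munmap(map, 4096);
    close(fd);
    shmdt(shm);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}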
For more information, see General Programming Concepts: Writing and Debugging Programs, available at:
For more information about this topic, see 4.4, “Related publications” on page 94.
Large segment tunable (AIX V6.1 only)
AIX V6.1 TL5 and AIX V7.1 introduce 1 TB segment aliasing. 1 TB segments can improve the performance of 64-bit large memory applications. The optimization is specific to large shared memory (shmat() and mmap()) regions.
1 TB segments are a feature present in POWER5+, POWER6, and POWER7 processors. They can be used to reduce the hardware virtual to real translation impact. Applications that are 64-bit and that have large shared memory regions can benefit from incorporating
1 TB segments.
An overview of 1 TB segment usage can be found in the IBM AIX Version 7.1 Differences Guide, SG24-7910.
For more information about this topic, see 4.4, “Related publications” on page 94.
64-bit versus 32-bit Application Binary Interfaces
AIX provides complete support for both 32-bit and 64-bit Application Binary Interfaces (ABIs). Applications can be developed using either ABI. The 64-bit ABI provides more scaling benefits, but with both ABIs there are performance trade-offs to be considered.
Overview of 64-bit/32-bit Application Binary Interface
All current POWER processors support a 32-bit and a 64-bit execution mode. The 32-bit execution mode is a subset of the 64-bit execution mode. The modes are similar; the most significant differences are in address generation (effective addresses are truncated to 32 bits in 32-bit mode) and in the computation of some fixed-point status registers (carry, overflow, and so on). Although hardware 32-bit/64-bit mode does not affect performance, the 32-bit/64-bit ABIs provided by AIX do have performance implications and tradeoffs.
The 32-bit ABI provides an ILP32 model (32-bit integers, longs, and pointers) while the 64-bit ABI provides an LP64 model (32-bit integers and 64-bit longs and pointers). Although current POWER CPUs have 64-bit fixed-point registers, they are treated as 32-bit fixed-point registers by the 32-bit ABI (the high 32 bits of all fixed-point registers are treated as volatile or undefined by the ABI). The 32-bit ABI preserves only 32-bit fixed-point context across subroutine linkage, non-local goto (longjmp()), or signal delivery. 32-bit programs cannot attempt to use 64-bit registers when they run in 32-bit mode (32-bit ABI). In general, other registers (floating point, vector, and status registers) are the same size in both 32-bit and 64-bit ABIs.
Starting with AIX V6.1, all supervisor code (kernel, kernel extensions, and device drivers) uses the 64-bit ABI. In general, a unified system call interface is provided that offers efficient system call linkage to both 32-bit and 64-bit applications. Because the AIX V6.1 kernel is 64-bit, all systems supported by AIX V6.1 support the 64-bit ABI. Some older IBM PowerPC® CPUs supported on AIX 5L V5.3 cannot run the 64-bit ABI.
Operating system libraries provide both 32-bit and 64-bit objects, allowing full support for either ABI. Development tools (assembler, linker, and debuggers) support both ABIs.
Trade-offs
The primary motivation to choose the 64-bit ABI is to go beyond the 4 GB direct memory addressability barrier. A second reason is to improve scalability by extending some 32-bit data type limits that are in the 32-bit ABI (time_t, pid_t, and offset_t). Lastly, 64-bit mode provides access to 64-bit fixed-point registers and instructions that can improve the performance of specific fixed-point operations (long long arithmetic and 64-bit memory copies).
The 64-bit ABI does have some performance drawbacks: the larger 64-bit fixed-point registers and the LP64 model grow stack usage and data structures, which can cause a performance drawback for some applications. Also, 64-bit text is generally larger for most compiles, producing a larger i-cache footprint.
The most significant issue is typically the porting effort (for existing applications), as changing between ILP32 and LP64 normally requires a port. Large memory addressability and scalability are normally the deciding factors when you choose an application execution model.
For more information about this topic, see 4.4, “Related publications” on page 94.
Affinity APIs
Most applications must be bound to logical processors to get a performance benefit from memory affinity; binding prevents the AIX dispatcher from moving the application to processor cores in different Multi-chip Modules (MCMs) while the application runs.
The most likely way to obtain a benefit from memory affinity is to limit the application to running only on the processor cores that are contained in a single MCM. You can accomplish this task with the bindprocessor command and the bindprocessor() function. It can also be done with the resource set affinity commands (rset) and services. Often, affinity is provided as an administrator option that can be optionally enabled on large systems.
When the application requires more processor cores than contained in a single MCM, the performance benefit through memory affinity depends on the memory allocation and access patterns of the various threads in the application. Applications with threads that individually allocate and reference unique data areas might see improved performance.
The AIX Active System Optimizer (ASO) facility is capable of autonomously establishing enhanced affinity (see 4.2, “AIX Active System Optimizer and Dynamic System Optimizer” on page 84). This situation is in contrast to the manual usage of the affinity APIs documented in this section.
Processor affinity (bindprocessor)
Processor affinity is the probability of dispatching of a thread to the logical processor that was previously running it. If a thread is interrupted and later redispatched to the same logical processor, the processor's cache might still contain lines that belong to the thread. If the thread is dispatched to a different logical processor, it probably experiences a series of cache misses until its cache working set is retrieved from RAM or the other logical processor's cache. If a dispatchable thread must wait until the logical processor that it was previously running on is available, the thread might experience an even longer delay.
The highest possible degree of processor affinity is to bind a thread to a specific logical processor. Binding means that the thread is dispatched to that logical processor only, regardless of the availability of other logical processors.
The bindprocessor command and the bindprocessor() subroutine bind the thread (or threads) of a specified process to a particular logical processor. Explicit binding is inherited through fork() and exec() system calls. The bindprocessor command requires the process identifier of the process whose threads are to be bound or unbound, and the bind CPU identifier of the logical processor to be used.
While CPU binding is useful for CPU-intensive applications, it can sometimes be counter productive for I/O-intensive applications.
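A minimal sketch of the programming interface follows (binding to logical processor 0 is an arbitrary choice for illustration); bindprocessor() is declared in sys/processor.h:

/* bind_example.c - bind the current process to a logical processor (sketch) */
#include <stdio.h>
#include <sys/processor.h>
#include <unistd.h>

int main(void)
{
    /* Bind all threads of this process to logical processor 0 */
    if (bindprocessor(BINDPROCESS, getpid(), 0) != 0)
        perror("bindprocessor");

    /* ... CPU-intensive work now runs on logical processor 0 only ... */

    /* PROCESSOR_CLASS_ANY removes the binding */
    bindprocessor(BINDPROCESS, getpid(), PROCESSOR_CLASS_ANY);
    return 0;
}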
RSETS
Every process and kernel thread can have an RSET attached to it. The CPUs on which a thread can be dispatched are controlled by a hierarchy of resource sets. RSETs are mandatory bindings and are always honored by the AIX kernel. Also, RSETs can affect Dynamic Reconfiguration (DR) activities.
Resource sets
These resource sets are:
Thread effective RSET Created by ra_attachrset(). Must be a subset (improper or proper) of the thread partition RSET.
Thread partition RSET Used by WLM. Partition RSETs allow WLM to limit where a thread can run.
Process effective RSET Created by ra_attachrset(), ra_exec(), and ra_fork(). Must be a subset (improper or proper) of the process partition RSET.
Process partition RSET Used by WLM to limit where processes in a WLM class are allowed to run. Can also be created by root users using the rs_setpartition() service.
Other RSETs
Another type of RSET is the exclusive RSET. Exclusive use processor resource sets (XRSETs) allow an installation to limit the usage of the processors in XRSETs; those processors are used only by work that is explicitly attached to those XRSETs. They can be created by running the mkrset command in the 'sysxrset' namespace.
RSET data types and operations
The public shipped header file rset.h contains declarations for the public RSET data types and function prototypes.
An RSET is an opaque data type. Applications allocate an RSET by calling rs_alloc(). Applications receive a handle to the RSET. The RSET handle (datatype rsethandle_t in sys/rset.h) is then used in RSET APIs to manipulate or attach the RSET.
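As a brief sketch of the API flow (attaching a process to the full system RSET is effectively a no-op used purely for illustration; attaching usually requires root authority or the CAP_NUMA_ATTACH capability):

/* rset_example.c - allocate, query, attach, and free an RSET (sketch) */
#include <stdio.h>
#include <sys/rset.h>
#include <unistd.h>

int main(void)
{
    rsid_t rsid;
    rsethandle_t rset = rs_alloc(RS_SYSTEM);   /* all online system resources */

    printf("CPUs in the system RSET: %d\n", rs_getinfo(rset, R_NUMPROCS, 0));

    rsid.at_pid = getpid();
    /* Attach the RSET to this process as its effective RSET */
    if (ra_attachrset(R_PROCESS, rsid, rset, 0) != 0)
        perror("ra_attachrset");

    rs_free(rset);
    return 0;
}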
Summary of RSET commands
Here is a summary of the RSET commands:
lsrset: Displays RSETs stored in the system registry or RSETs attached to a process.
For example:
lsrset -av Displays all RSETs in the system registry.
lsrset -p 28026 Displays the effective RSET attached to PID 28026.
mkrset: Makes a named RSET containing specific CPUs and memory pools and places the RSET in the system registry. For example, mkrset -c 6 10-12 test/lotsofcpus creates an RSET named test/lotsofcpus that contains the specified CPUs.
rmrset: Removes an RSET from the system registry. For example:
rmrset test/lotsofcpus
attachrset: Attaches an RSET to a specified PID. The RSET can either come from the system registry or be built from CPUs or memory pools that are specified on the command. For example:
attachrset test/lotsofcpus 28026 Attaches an RSET in a register to a process.
attachrset -c 4-8 28026 Attaches an RSET with CPUs 4 - 8 to a process as an effective RSET.
attachrset -P -c 4-8 28026 Attaches an RSET with CPUs 4 - 8 to a process as a partition RSET.
detachrset: Detaches an RSET from a specified PID. For example:
detachrset 28026 Detaches an effective RSET from a PID.
detachrset -P 20828 Detaches a partition RSET from a PID.
execrset: Runs a specific program or command with a specified RSET. For example:
execrset sys/node.04.00000 -e test Runs a program test with an effective RSET from the system registry.
execrset -c 0-1 -e test2 Runs program test2 with an effective RSET that contains logical CPU IDs 0 and 1.
execrset -P -c 0-1 -e test3 Runs program test3 with a partition RSET that contains logical CPU IDs 0 and 1.
RSET manipulation and information services
This list contains only user space APIs. There are also similar kernel extension APIs. For example, krs_alloc() is the kernel extension equivalent to rs_alloc().
rs_alloc() Allocates and initializes an RSET and returns an RSET handle to
a caller.
rs_free() Frees a resource set. The input is an RSET handle.
rs_init() Initializes a previously allocated RSET. The initialization options are the same as for rs_alloc().
rs_op() Performs one of a set of operations against one or two RSETs.
rs_getinfo() Gets information about an RSET.
rs_getrad() Gets resource allocation domain information from an input RSET.
rs_numrads() Returns the number of system resource allocation domains at the specified system detail level that have available or online resources.
rs_getpartition() Gets a process's partition RSET.
rs_setpartition() Sets a process's partition RSET.
rs_discardname()
rs_getnameattr()
rs_getnamedrset()
rs_setnameattr()
rs_registername() These are services that are used to manage the RSET system registry. There are services to create, obtain, and delete RSETs in
the registry.
Attachment services
Here are the RSET attachment services:
ra_attachrset() A service to attach a work component to an RSET. The service uses the rstype_t and rsid_t parameters to identify the work component to attach to the input RSET (specified by an rsethandle_t).
ra_detachrset() Detaches an RSET from the work unit that is specified by the rstype_t/rsid_t parameters.
ra_exec() Runs a program that is attached to a specific work component. The service uses rstype_t and rsid_t to specify the work component. However, the only supported rstype_t is R_RSET. All of the various versions of exec() are supported.
ra_fork() Forks a process that is attached to a specific work component. The service uses rstype_t and rsid_t to specify the work component. However, the only supported rstype_t is R_RSET.
ra_get_attachinfo()
ra_free_attachinfo() These services retrieve the RSET attachments that are attached to a memory range. The ra_attachrset() allows RSET attachments to ranges of memory in a file or shared memory segment. These services allow those attachments to be queried.
ra_getrset() Retrieves the RSET attachment to a process or thread. The return code indicates where the returned RSET
is attached.
ra_mmap() and ra_mmapv() Maps a file or memory region into a process and attaches it to the resource set that is specified by the rstype_t and rsid_t parameters. A memory allocation policy similar to ra_attachrset() allows a caller to specify how memory is preferentially allocated when the area is accessed.
ra_shmget() and ra_shmgetv() Gets a shared memory segment with an attachment to a resource set. The RSET is specified by the rstype_t and rsid_t parameters. A memory allocation policy similar to ra_attachrset() allows a caller to specify how memory is preferentially allocated when the area is accessed.
AIX Enhanced Affinity (Scheduler Resource Allocation Domain)
AIX Enhanced Affinity is a collection of AIX internal system changes and API extensions to improve performance on POWER7 Systems. Enhanced Affinity improves performance by increasing CPU and memory locality on POWER7 Systems. Enhanced Affinity extends the AIX existing memory affinity support. AIX V6.1 technology level 6100-05 contains AIX Enhanced Affinity support.
Enhanced Affinity status is determined during system boot and remains unchanged for the life of the system. A reboot is required to change the Enhanced Affinity status. In AIX V6.1.0 technology level 6100-05, Enhanced Affinity is enabled by default on POWER7 machines. Enhanced Affinity is available only on POWER7 machines. Enhanced Affinity is disabled by default on POWER6 and earlier machines. A vmo command tunable (enhanced_memory_affinity) is available to disable Enhanced Affinity support on
POWER7 machines.
Here are two concepts that are related to Enhanced Affinity:
Scheduler Resource Allocation Domain (SRAD): An SRAD is the collection of logical CPUs and physical memory resources that are close from a hardware affinity perspective. An AIX system (partition) can consist of one or more SRADs. An SRAD represents the same collection of system resources as an existing MCM. A specific SRAD in a partition is identified by a number. It is an sradid_t data type and is often referred to as an SRADID.
SRADID: The numeric identifier of a specific SRAD. It is a short integer data type. An SRADID value is the index of the resource allocation domain at the R_SRADSDL system detail level in the system’s resource set topology.
Power Systems before POWER7 Systems provided only system topology information to dedicated CPU logical partitions. This setup limited the usefulness of RSET attachments for CPU and memory locality purposes to dedicated CPU partitions. POWER7 Systems provide system topology information for shared CPU logical partitions (SPLPAR).
You can use the AIX Enhanced Affinity services to attach SRADs to threads and memory ranges so that the application preferentially identifies the logical CPUs or physical memory to use to run the application. AIX continues to support RSET attachments to identify resources for an application.
RSET versus SRADs
When you compare RSET with SRADIDs:
1. SRADIDs can be attached to threads, shared memory segments, memory map regions, and process memory subranges. SRADIDs may not be attached at the process level (R_PROCESS). SRADIDs may not be attached to files (R_FILDES).
2. SRADID attachments are considered advisory. There are no mandatory SRADID attachments. AIX might ignore advisory SRADID attachments.
3. Process and thread RSET attachments continue to be mandatory. The process and thread resource set hierarchy continues to be enforced. Memory RSET attachments (shared memory, file, and process subrange) continue to be advisory. This situation is unchanged from previous affinity support.
API support
SRADIDs can be attached to threads and memory by using the following functions:
ra_attach() (new)
ra_fork()
ra_exec()
ra_mmap() and ra_mmapv()
ra_shmget() and ra_shmgetv()
SRADIDs can be detached from threads and memory by using the sra_detach() function (new).
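As a hedged sketch (the five-argument ra_attach() form and the rsid_t union members used here are assumptions based on the AIX documentation, and SRAD 0 is an arbitrary choice), attaching the calling thread to an SRAD looks approximately like the following. Because SRADID attachments are advisory, AIX may ignore the request:

/* srad_example.c - advisory attach of the calling thread to SRAD 0 (sketch) */
#include <stdio.h>
#include <sys/rset.h>
#include <sys/thread.h>

int main(void)
{
    rsid_t who, where;

    who.at_tid = thread_self();    /* kernel thread ID of the calling thread */
    where.at_sradid = 0;           /* target SRAD; 0 is an assumption */

    if (ra_attach(R_THREAD, who, R_SRADID, where, 0) != 0)
        perror("ra_attach");
    return 0;
}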
Hybrid thread and core
AIX provides facilities to customize simultaneous multi-threading (SMT) characteristics of CPUs running within a partition. The features require some partition-wide CPU configuration options, so their use is limited to specific workloads.
Background
SMT is a feature that is introduced in POWER5 and capitalized on by AIX. It allows a single physical processor core to simultaneously dispatch instructions from more than one hardware thread context. An overview of SMT is provided in the AIX SMT Overview.2
SMT does include some performance tradeoffs:
SMT can provide a significant throughput and capacity improvement on POWER processors. When you are in SMT mode, there is a trade-off between overall CPU throughput and performance of each hardware thread. SMT allows multiple instruction streams to be run simultaneously, but the concurrency can cause some resource conflict between the instruction streams. This conflict can result in a decrease in performance for an individual thread, but an increase in overall throughput.
Some workloads do not run well with the SMT feature. This situation is not typical for commercial workloads, but has been observed with scientific (floating point
intensive) workloads.
AIX provides options to allow SMT customization. The smtctl option allows the SMT feature to be enabled, disabled, or capped (SMT2 versus SMT4 mode on POWER7). The partition-wide tuning option, smtctl, changes the SMT mode of all processor cores in the partition. It is built on the AIX DR (dynamic reconfiguration) framework to allow hardware threads (logical processors) to be added and removed in a running partition. Because of this option’s global nature, it is normally set by system administrators. Most AIX systems (commercial) use the default SMT settings enabled (that is, SMT2 mode on POWER5 and POWER6, and SMT4 mode on POWER7).
When SMT is enabled (SMT2 or SMT4 mode), the AIX kernel takes advantage of the platform feature to dynamically change SMT modes. These mode switches are done based on system load (the number of running or waiting to run software threads) to choose the optimal SMT mode for the CPUs in the partition. The mode switching policies optimize overall workload throughput, but do not attempt to optimize individual software threads.
Hybrid thread features
AIX provides some basic features that allow more control in SMT mode. With these features, specific software threads can be bound to hardware threads assigned to ST mode CPUs. This configuration allows for an asymmetric SMT configuration where some CPUs are in high SMT mode, while others have SMT mode disabled. It allows critical software threads within a workload to receive an ST performance boost, while the remaining threads benefit from SMT mode. Typical reasons to take advantage of this hybrid mode are:
Asymmetric workload, where the performance of one thread serializes an entire workload. For example, one master thread dispatches work to many subordinate threads.
Software threads that are critical to a system administrator.
The ability to create hybrid SMT configurations is limited under current AIX releases and does require administrator or privileged configuration changes. CPUs that provide ST mode hardware threads must be placed into exclusive processor resource sets (XRSETs). XRSETs contain logical CPUs that are segregated from the general kernel dispatching pool. Software threads must be explicitly bound to CPUs in an XRSET. The only way to create an XRSET is by running the mkrset command. All of the hardware threads (logical CPUs) of each selected core must be contained in the created XRSET. To accomplish this task, run the following commands:
lsrset -av Displays the RSET topology. The system CPU topology is broken down into a hierarchy that has the form sys/node.XX.YYYYY. The largest XX value is the CPU (core) level. This command provides logical processor groups by core.
mkrset -c 4-7 sysxrset/set1 Creates an XRSET sysxrset/set1 containing logical CPUs 4 - 7.
An XRSET alone can be used to ensure that only specific work uses a CPU set. There is also the ability to restrict work execution to primary threads in an XRSET. This ability is known as an STRSET. STRSETs allow software threads to use ST execution mode independently of the load on the other CPUs in the system. Work can be placed onto STRSETs by running the following commands:
execrset -S This command allows external programs to start and be bound to an exclusive RSET.
ra_attach(R_STRSET) This API allows a thread to be bound to an STRSET.
For more information about this topic, see 4.4, “Related publications” on page 94.
Sleep and wake-up primitives (thread_wait and thread_post)
AIX provides proprietary thread_wait() and thread_post() APIs that can be used to optimize thread synchronization and communication (IPC) operations. AIX also provides several standard APIs that can be used for thread synchronization and communication. These APIs include pthread_cond_wait(), pthread_cond_signal(), and semop(). Although many applications use these standard APIs, the low-level primitives are available to optimize these operations. thread_wait() and thread_post() can be used to optimize critical applications services, such as user-mode locking or message passing. They are more efficient than the portable/standard APIs.
Here is more information about the associated subroutines:
thread_wait
The thread_wait subroutine allows a thread to wait or block until another thread posts it with the thread_post or the thread_post_many subroutine or until the time limit specified by the timeout value expires.
If the event for which the thread is waiting and for which it is posted occurs only in the future, the thread_wait subroutine can be called with a timeout value of 0 to clear any pending posts, as in the following call:
thread_wait(0)
thread_post
The thread_post subroutine notifies the thread whose thread ID is indicated by the value of the tid parameter of the occurrence of an event. If the posted thread is waiting in thread_wait, it is awakened immediately. If it is not waiting in thread_wait, the next call to thread_wait is not blocked, but returns with success immediately.
Multiple posts to the same thread without an intervening wait by the specified thread count only as a single post. The posting remains in effect until the indicated thread calls the thread_wait subroutine, upon which the posting is cleared.
thread_post_many
The thread_post_many subroutine posts one or more threads of the occurrence of the event. The number of threads to be posted is specified by the value of the nthreads parameter, while the tidp parameter points to an array of thread IDs of threads that must be posted. The subroutine works just like the thread_post subroutine, but can be used to post to multiple threads at the same time. A maximum of 512 threads can be posted in one call to the thread_post_many subroutine.
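The following sketch (hypothetical; the handoff of the thread ID is deliberately crude) shows the basic pattern: a worker blocks in thread_wait() until the main thread posts it. Note that these calls take kernel thread IDs (from thread_self()), not pthread IDs:

/* post_example.c - minimal thread_wait/thread_post pattern (sketch) */
#include <pthread.h>
#include <stdio.h>
#include <sys/thread.h>
#include <unistd.h>

static tid_t worker_tid;

static void *worker(void *arg)
{
    worker_tid = thread_self();   /* publish this thread's kernel thread ID */
    thread_wait(-1);              /* block until posted (-1: wait forever) */
    printf("worker: posted, continuing\n");
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, worker, NULL);
    sleep(1);                     /* crude: give the worker time to publish */
    thread_post(worker_tid);      /* wake the worker; a post arriving before
                                     the wait would also be remembered */
    pthread_join(t, NULL);
    return 0;
}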
For more information about this topic, see 4.4, “Related publications” on page 94.
Shared versus private loads
You can use AIX to share text for libraries and dynamically loaded modules. File permissions can be used to enable and disable sharing of loaded text.
Documentation
AIX provides optimizations that enable sharing of loaded text (libraries and dynamically loaded modules). Sharing text among processes often improves performance because it reduces resource usage (memory and disk space). It also allows unrelated software threads to share cache space when they run concurrently. Lastly, it can reduce load times when the code is already loaded by a previous program.
Applications can control whether private or shared loads are performed to shared text regions. Shared loads require that execute permissions be set for group/other on the text files. As a preferred practice, you should enable sharing.
For more information about this topic, see 4.4, “Related publications” on page 94.
Workload partitions (WPAR shared LPP installs)
Starting with AIX V6.1, the WPAR feature gives the system administrator the ability to easily create an isolated AIX operating system that can run services and applications. WPAR provides a secure and isolated environment for enterprise applications in terms of process, signal, and file system space. Any software that is running within the context of a workload partition appears to have its own separate instance of AIX.
The usage of multiple virtual operating systems within a single global operating environment can have multiple advantages. It increases administrative efficiency by reducing the number of AIX instances that must be maintained.
Applications can be installed in a shared environment or a non-shared environment. When an application is installed in a shared environment, it means that it is installed in the global environment and then the application is shared with one or more WPARs. When an application is installed in a non-shared environment, it means that it is installed in the WPAR only. Other WPARs do not have access to that application.
Shared WPAR installation
A shared installation is straightforward because installing software in the global environment is accomplished in the normal manner. What must be considered is whether the system WPARs that share a single installation will or will not interfere with each other’s operation.
For software to function correctly in a shared-installation environment, the software package must be split into shareable and non-shareable files:
Shareable files (such as executable code and message catalogs) must be installed into the shared global file systems that are read-only to all system WPARs.
Non-shareable files (such as configuration and runtime-modifiable files) must be installed into the file systems that are writable to individual WPARs. This configuration allows multiple WPARs to share a single installation, yet still have unique configuration and runtime data.
In addition to splitting the software package, the software installation process must include a synchronization step to install non-shareable files into system WPARs. To accomplish this task, the application must provide a means to encapsulate the non-shareable files within the shared global file systems so that the non-shared files can be extracted into the WPAR by some means. For example, if a vendor creates a custom-installation system that delivers files into /usr and /, then the files that are delivered into / must be archived within /usr and then extracted into / using some vendor-provided mechanism. This action can occur automatically the first time that the application is started or configured.
Finally, the software update process must work so that the shareable and non-shareable files stay synchronized. If the shared files in the global AIX instance are updated to a certain fix level, then the non-shared files in individual WPARs also must be updated to the same level. Either the update process discovers all the system WPARs that must be updated or, at start time, the application detects the out-of-synchronization condition and applies the update. Some software products manage to never change their non-shareable files in their update process, so they do not need any special handling for updates.
This type of installation sometimes takes a little effort on the part of the application, but it allows you to get the most value from using WPARs. If there is a need to run the same version of the software in several WPARs, this type of installation provides the
following benefits:
It increases administrative efficiency by reducing the number of application instances that users must maintain. The administrator saves time in application-maintenance tasks, such as applying fixes and performing backups and migrations.
It allows users to quickly deploy multiple instances of the same application, each in its own secure and isolated environment. It can take only a matter of minutes to create and start a WPAR to run a shared installation of the application.
By sharing one AIX or application image among multiple WPARs, the memory resource usage is reduced because only one copy of the application image is in real memory.
For more information about WPAR, see WPAR concepts, available at:
4.1.2 Using POWER7+ features under AIX
When the AIX operating system runs on POWER7+ processors, it transparently uses the POWER7+ on-chip encryption accelerators. For each of the uses that are described in this section, there are no application visible changes or awareness required.
AIX encrypted file system (EFS)
Integrated with the AIX Journaled File System (JFS2) is the ability to create an encrypted file system (EFS) where all data at rest in the file system is encrypted. When AIX EFS runs on POWER7+, it uses the encryption accelerators, which can show up to a 40% advantage in file system I/O-intensive operations. Applications do not need to be aware of this situation, but application and workload deployments might be able to take advantage of higher levels of security by using AIX EFS for sensitive data.
AIX Internet Protocol Security (IPSec)
When IPSec is enabled on AIX running on POWER7+, AIX transparently uses the POWER7+ encryption accelerators for all data in transit. The advantage that is provided by the accelerators is more pronounced when jumbo frames (a maximum transmission unit (MTU) of 9000 bytes) are used. Applications do not need to be aware of this situation, but application and workload deployments might be able to take advantage of higher levels of security by enabling IPSec.
AIX /dev/random (random number generation)
AIX capitalizes on the on-chip random number generator on POWER7+. Applications that use the AIX special files /dev/random or /dev/urandom transparently get the advantages of stronger hardware-based random numbers. If an application is making high frequency usage of random number generation, there may also be a performance advantage.
AIX PKCS11 Library
On POWER7+ systems, the AIX operating system PKCS11 library transparently uses the POWER7+ on-chip encryption accelerators. For an application using the PKCS11 APIs, no change or awareness by the application is required. The AIX library interface dynamically decides, based on the algorithm and data size, when to use the accelerators. Because of the cost of setup and programming of the on-chip accelerators, the advantage is limited to operations on large blocks of data (tens to hundreds of kilobytes).
4.2 AIX Active System Optimizer and Dynamic System Optimizer
Workloads are becoming increasingly complex. Typically, they involve a mix of single-threaded and multi-threaded applications with interactions that are complex and vary over time. The servers that host these workloads are also continuously evolving to support an ever-increasing demand for processing capacity and flexibility. Tuning such an environment for optimal performance is not trivial. It often requires excessive amounts of time and highly specialized skills. Apart from resources, manual tuning also has the drawback that it is static in nature, and systems must be retuned when new workloads are introduced or when the characteristics of existing ones change in time. The Active System Optimizer (ASO) and Dynamic System Optimizer (DSO) attempt to address the optimization of both the operating system and server autonomously.
4.2.1 Concepts
DSO is built on the Active System Optimizer (ASO) framework, and expands on two of the ASO optimization strategies.
Active System Optimizer
DSO is built on the ASO framework that is introduced in AIX V7.1 TL1 SP1. The ASO framework includes a user-space daemon, advanced instrumentation, and two optimization strategies. The DSO package extends ASO to provide more optimizations. When the DSO product is installed and enabled, base ASO features and the extended DSO optimizations are
all activated.
ASO contains a user-level daemon that autonomously tunes the allocation of system resources to achieve an improvement in system performance. ASO is available on the POWER7 platform in AIX V7.1 TL1 SP1 (4Q 2011) and AIX V6.1 TL8 SP1 (4Q 2012). DSO extensions are available in 4Q 2012, and require AIX V7.1 TL2 SP1 or AIX V6.1 TL8 SP1 (on the POWER7 platform).
The ASO framework works by continuously monitoring and analyzing how current workloads impact the system, and then using this information to dynamically configure the system to optimize for current workload requirements.
The ASO framework is transparent, and the administrator is not required to continuously monitor its operations. In fact, the only required tunable that ASO provides is to turn it on or off. Once turned on, ASO automatically identifies opportunities to improve performance and applies the appropriate system changes.
The ASO daemon has a maximum system usage impact of 3%. ASO also monitors the system for situations that are not suitable for certain types of performance optimization. In such situations, ASO hibernates those optimization strategies, waking up occasionally to check for a change in environment (for more information about this topic, see 4.2.3, “Workloads” on page 87 and “System requirements” on page 90).
The primary design goal of ASO/DSO is to act only when it is reasonably certain that the result is an improvement in workload performance.
Figure 4-1 illustrates the basic ASO framework. ASO uses information from the AIX kernel and the POWER7 Performance Monitoring Unit (PMU)3, 4, 5 to perform long-term runtime analysis with the aim of improving workload performance.
Figure 4-1 Basic ASO architecture that shows an optimization flow on a POWER7 system
Optimization strategies
Two optimization strategies are provided with ASO:
Cache affinity optimization
Memory affinity optimization
DSO adds two more optimizations to the ASO framework:
Large page optimization
Memory prefetch optimization
The first version that was released in 4Q 2011 included the cache and memory affinity optimizations. The 4Q 2012 version introduces the large page and data stream prefetch optimization types.
 
ASO and DSO optimizations: ASO cache and memory affinity optimizations are bundled with all AIX editions: AIX Express, AIX Standard, and AIX Enterprise. DSO large page and memory prefetch optimizations are available as a separately chargeable premium feature or bundled with AIX Enterprise.
4.2.2 ASO and DSO optimizations
The ASO framework allows multiple optimizations to be managed. Two optimizations are included with the framework. Two more optimizations are added with the DSO package.
Cache and memory affinity optimization
Power Systems are continually increasing in processing capacity in terms of the number of cores and the number of SMT threads per core. The latest Power Systems support up to 256 cores and four SMT threads per core, which allows a single logical partition (LPAR) to have 1024 logical CPUs (hardware threads). This increase in processing units has led to a hierarchy of affinity domains.
Each core forms the smallest affinity domain. Multiple cores in a chip, multiple chips in a package, and the system itself form other, higher affinity domains. System performance is close to optimal when the data that crosses between these domains is minimal. The need for cross-domain interactions arises when either:
Software threads in different affinity domains must communicate
Data being accessed is in a memory bank that is not in the same affinity domain as that of the requesting software thread
Apart from the general eligibility requirements that are listed in “System requirements” on page 90, a workload must also be multi-threaded to be considered for cache and memory
affinity optimization.
Cache affinity
ASO analyzes the cache access patterns that are based on information from the kernel and the PMU to identify potential improvements in cache affinity by moving threads of workloads closer together. If such a benefit is predicted, ASO uses proprietary algorithms to estimate the optimal size of the affinity domain for the workload, and it uses kernel services (see “Affinity APIs ” on page 75) to restrict the workload to that domain. After acting, ASO continues to monitor the workload to ensure that it performs as predicted. If the results are not as expected, ASO reverses its actions immediately.
Memory affinity
After a workload is identified and optimized for cache affinity, ASO begins monitoring the memory access patterns of the workload process private memory. If it is found that the workload can benefit from moving process private memory closer to the current affinity domain, then hot pages are identified and migrated closer using software instrumentation. Single-threaded processes are not considered for this optimization, because their process private data is already affinitized by the kernel when the thread is moved to a new affinity domain. Also, in the current version, only workloads that fit within a single Scheduler Resource Affinity Domain (SRAD, a chip/socket in POWER7) are considered.
Large page optimization
AIX allows translations of multiple memory page sizes within the same segment. Although 4 KB and 64 KB translations are allowed in the current version of AIX (Version 6.1 and greater), Version 6.1 TL8 and Version 7.1 TL2 (4Q 2012) include dynamic 16 MB translation. For workloads that use large chunks of data, using pages larger than the default size is useful because the number of TLB/ERAT misses is reduced (for general page size and TLB/ERAT information, see 2.3.1, “Page sizes (4 KB, 64 KB, 16 MB, and 16 GB)” on page 25). DSO uses this new AIX feature to promote heavily used regions of memory to 16 MB pages dynamically, potentially improving the performance of workloads that use those regions.
 
System V shared memory: In the current version, only System V shared memory is eligible for dynamic 16 MB page optimization.
Memory prefetch optimization
Power Architecture provides a special purpose register (the DSCR) to control the enablement, depth, and stride for hardware data stream prefetching (for more information, see 2.3.7, “Data prefetching using d-cache instructions and the Data Streams Control Register (DSCR)” on page 46). Setting this register appropriately can potentially benefit workloads, depending on the workload data access patterns. DSO collects information from the AIX kernel and PMU to dynamically determine the optimal setting of this register for a specific period.
4.2.3 Workloads
For ASO/DSO to consider a workload for optimization, the workload must pass certain minimum criteria, as described in this section.
Ideal workloads
Workload characteristics for each optimization are:
Cache affinity optimization and memory affinity optimization: Workloads must be long-lived (the minimum lifetime varies with the type of optimization), multi-threaded, and have stable CPU usage. The performance gain is higher for workloads that have a high amount of communication between the threads in the workload.
Large page optimization: Workloads must use large System V memory regions, for example, a database with a large shared memory region. Workloads can be either multi-threaded or a group of single-threaded processes. DSO must be active when a workload is started for this optimization to be applied.
Memory prefetch optimization: Workloads must have a large System V memory footprint, high CPU usage, and a high context switch rate. Workloads can be either multi-threaded or a group of single-threaded processes. This optimization is disabled if the DSCR register is set manually at the system level (through the dscrctl command).
Eligible workloads
A workload must meet the following minimum criteria to be eligible:
General requirements
Fixed priority. The ASO daemon runs with a fixed scheduler priority of 32. ASO does not optimize a workload if the workload or any of its threads has a fixed priority that is more favorable (numerically lower) than ASO's own priority.
Cache and memory affinity optimization
 – Multi-threaded
Workloads must be multi-threaded to be considered for optimization.
 – Workload Manager
Workloads that are classified by the Workload Manager (WLM) with tiers or minimum limits set are not optimized. Furthermore, if the system CPU capacity is fully used, ASO does not optimize processes that belong to classes with specific shares.
 
WPAR workloads: In general, WPAR workloads (which implicitly use WLM) can be optimized by ASO if minimum CPU and memory limits are not specified.
 – User-specified placement
Workloads for which placement is explicitly set by the user, such as with bindprocessor, RSET attachments (real, partition, or exclusive RSETs), and SRAD attachments, are not eligible for ASO optimization. Although ASO does not affect these workloads, AIX continues to enforce the resource constraints as normal. Furthermore, if the user attempts to place such a restriction on a workload that has been or is being optimized by ASO, then ASO undoes its optimization and lets the user restriction be placed normally. (A minimal bindprocessor sketch follows this list.)
 – CPU usage
The CPU usage of the workload should be above 0.1 cores.
 – Workload age
Workloads must be at least 10 seconds old to be considered for cache affinity optimization and at least 5 minutes old for memory affinity optimization.
Large page optimization
 – Fully populated segments
The shared memory segments should be fully populated to be considered for page size promotion.
 – Memory footprint
The workload should have a System V shared memory footprint of at least 16 GB.
 – CPU usage
CPU usage of the workload should be above two cores. A workload may be either a multi-threaded process or a collection of single-threaded processes.
 – Workload age
Workloads must be at least 60 minutes old to be considered.
Memory prefetch optimization
 – Memory footprint
The workload should have a System V shared memory footprint of at least 16 GB.
 – CPU usage
CPU usage of the workload should be above eight cores. A workload may be either a multi-threaded process or a collection of single-threaded processes.
 – Workload age
Workloads must be at least 10 minutes old to be considered.
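As an illustration of the user-specified placement rule earlier in this list, binding a process with the bindprocessor command makes it ineligible for cache and memory affinity optimization, and unbinding it makes it eligible again. This is a minimal sketch; the PID 1234567 is hypothetical:
# List the available logical processors
$ bindprocessor -q
# Bind process 1234567 to logical processor 0; ASO now skips this process
$ bindprocessor 1234567 0
# Unbind the process so that ASO can consider it again
$ bindprocessor -u 1234567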
Optimization time
When you test the effect of DSO on applications, it is important to run the tests for a sufficient length of time, which depends on the type of optimization that is being measured. For example, in the case of large page optimization, there is a small increase in system usage (less than 2%) while the pages are being promoted, so to see the benefit of large page optimization, the test must run for longer than the time that is taken to complete the promotions.
Here is the list of minimum test durations for each type of optimization. All of these times are approximate and apply only after the workload has reached a stable state.
Cache affinity optimization 30 minutes
Memory affinity optimization 1 hour
Large page optimization 12 hours for promoting 100 GB of shared memory
Memory prefetch optimization 30 minutes
4.2.4 The asoo command
The ASO framework is off by default in an AIX installation, and the asoo command must be used to enable it. The command syntax is as follows, where -p makes the change persistent across restarts:
asoo -po aso_active=1
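Because asoo follows the conventions of the other AIX tunable commands (such as vmo and schedo), the current state can be checked before it is changed. A brief sketch, assuming the standard tunables flags (-o to display or set a tunable, -p to persist it, and -L to list tunables):
# Display the current value of the aso_active tunable
$ asoo -o aso_active
# Enable ASO immediately and persistently across restarts
$ asoo -po aso_active=1
# List the ASO tunables with their current, default, and valid values
$ asoo -L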
4.2.5 Environment variables
Two environment variables are provided that control the behavior of ASO: ASO_ENABLED and ASO_OPTIONS. These variables are examined when the workload process is started.
ASO_ENABLED
ASO_ENABLED provides an administrator with the ability to alter the default behavior of ASO when it evaluates a workload for optimization.
Permitted values for this environment variable are:
ALWAYS: ASO skips some of the primary eligibility checks, such as the age of the workload and its minimum CPU usage.
NEVER: ASO excludes this process from any optimization under all circumstances.
Any other value: ASO optimizes the process as normal (the default behavior).
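Because ASO_ENABLED is an ordinary environment variable, it is set in the environment from which the workload is started. A minimal sketch with hypothetical program names:
# Exclude a job from all ASO optimization under all circumstances
$ ASO_ENABLED=NEVER ./nightly_batch_job
# Skip the workload age and minimum CPU usage checks for a short test run
$ ASO_ENABLED=ALWAYS ./cache_test_workload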
ASO_OPTIONS
ASO_OPTIONS provides an administrator with the ability to individually enable or disable the cache affinity, memory affinity, large page, and memory prefetch optimizations.
Permitted values for this environment variable are shown in Table 4-2.
Table 4-2 ASO_OPTIONS environment variable settings
Option              Values    Effect
ALL                 ON, OFF   Enables or disables all ASO optimizations. Options are
                              processed left to right; if options are redundant or
                              conflicting, the rightmost setting takes effect.
CACHE_AFFINITY      ON, OFF   Enables or disables the cache affinity optimization.
MEMORY_AFFINITY     ON, OFF   Enables or disables the memory affinity optimization.
                              It is applied only if the cache affinity optimization
                              is also applied.
LARGE_PAGE          ON, OFF   Enables or disables the 16 MB MPSS (large page)
                              optimization.
MEMORY_PREFETCH     ON, OFF   Enables or disables the data stream prefetch
                              optimization.
<unset>             (none)    All optimizations are enabled.
<any other value>   (none)    Behavior is undefined.
Multiple options can be combined, and they are processed left to right. In the following example, all optimizations are first disabled and then cache affinity is re-enabled, so only the cache affinity optimization is active:
$ export ASO_OPTIONS="ALL=OFF,CACHE_AFFINITY=ON"
System requirements
ASO/DSO optimizations can be limited based on system configuration. Here are the significant requirements and limitations:
ASO and DSO take advantage of a number of hardware features and are supported only on POWER7 processor-based systems running in native (POWER7) mode.
ASO and DSO are integrated tightly with the AIX kernel, so the various optimizations are supported at different operating system levels. The minimum levels for cache and memory affinity optimization are AIX V7.1 TL1 SP1 and AIX V6.1 TL8. Memory prefetch and large page optimization require the installation of the DSO package, which is supported on AIX V7.1 TL2 SP1 and AIX V6.1 TL8 SP1.
In a dedicated processor environment, virtual processor management (core folding) must be disabled, which is the default. Enabling PowerSaver mode on the HMC enables virtual processor management in a dedicated environment, which forces the cache and memory affinity optimizations to be disabled.
Enabling active memory sharing disables all optimizations except memory prefetch.
Enabling active memory expansion disables memory affinity and large page optimization.
In an SPLPAR environment, when CPU resources are capped, the system entitlement must be a minimum of two cores for all but memory prefetch. Memory prefetch requires eight cores.
For large page and memory prefetch optimization, the system should have a minimum of 20 GB system memory.
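A quick way to verify several of these requirements on a running LPAR is to check the system memory and the partition entitlement. The following sketch uses standard AIX commands; the grep patterns assume the usual prtconf and lparstat label text and might need adjustment:
# Check the total system memory (20 GB minimum for large page and memory prefetch)
$ prtconf | grep "Memory Size"
# Check the entitlement, mode, and type of the partition
$ lparstat -i | grep -E "Entitled Capacity|Mode|Type"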
4.2.6 Installing DSO
The AIX DSO is available as a separately chargeable premium package that includes the two newer optimizations: large page optimization and memory prefetch optimization. The package name is dso.aso, and it is installable by using installp or smitty, as with any AIX package.
ASO detects the installation of this package automatically and enables memory prefetch and large page optimization. After installation, no restart of the operating system or of ASO is required.
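A minimal installation sketch follows; the /mnt directory that holds the installation media is hypothetical:
# Preview the installation of the DSO package without making changes
$ installp -ap -d /mnt dso.aso
# Install and commit the package, expanding file systems if needed
$ installp -acX -d /mnt dso.aso
# Verify that the fileset is installed
$ lslpp -l dso.aso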
4.2.7 Log files
Information about the functioning of ASO and its various optimizations is logged in the following files:
/var/log/aso/aso.log
This file lists major ASO events, such as when ASO is enabled or disabled or when it hibernates. It also contains a basic audit trail of the optimizations that are performed on workloads. Example 4-1 shows the syslog-style configuration that controls where these messages are written and how the files are rotated.
Example 4-1 ASO log file configuration
# ASO log configuration
aso.notice /var/log/aso/aso.log rotate size 128k time 7d
aso.info /var/log/aso/aso_process.log rotate size 1024k
Example 4-2 shows a sample aso.log file that results from insufficient CPU entitlement on an SPLPAR.
Example 4-2 Sample aso.log file that shows insufficient CPU entitlement on an SPLPAR
Oct 20 02:15:04 p7e04 aso:notice aso[13238402]: [HIB] Current number of system
virtual cpus too low (1 cpus)
Oct 20 02:15:04 p7e04 aso:notice aso[13238402]: [HIB] Increase system virtual cpus
to at least 3 cpus to run ASO. Hibernating.
/var/log/aso/aso_process.log
The aso_process.log file contains everything in aso.log, plus a detailed audit trail of the workloads that are considered for optimization, the actions that are taken, and the reasons when no action is taken. See Example 4-3, in which each entry shows a timestamp, the ASO process ID (aso[5963954]), the tag of the workload ([5243360]), and, in the first entry, the name of the workload (cmd='circularBufferBenchmark').
Example 4-3 Sample aso_process.log file that shows the audit trail and the actions that are taken
Oct 21 05:52:47 localhost aso:info aso[5963954]: [SC][5243360] Considering for
optimisation (cmd='circularBufferBenchmark', utilisation=1.14, pref=0; attaching
StabilityMonitorBasic)
Oct 21 05:54:12 localhost aso:info aso[5963954]: [EF][sys_action][5243360]
Attaching (load 2.14) to domain TwoCore (cores=2,firstCpu=0)
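ASO decisions can be followed as they are made with standard commands. A brief sketch:
# Follow the detailed audit trail as new decisions are logged
$ tail -f /var/log/aso/aso_process.log
# Check whether ASO has hibernated (for example, because of low entitlement)
$ grep -i hibernat /var/log/aso/aso.log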
4.3 AIX preferred practices
This section describes AIX preferred practices and includes three subsections.
4.3.1 AIX preferred practices that are applicable to all Power Systems generations
Preferred practices for the installation and configuration of all Power Systems
generations are:
If this server is a Virtual I/O Server (VIOS), run the VIOS Performance Advisor on it. Instructions for the VIOS Performance Advisor are available at:
For logical partitions (LPARs) with Java applications, run and evaluate the output from the Java Performance Advisor, which can be run on POWER5 and POWER6, to determine whether an issue exists before you migrate to POWER7. Instructions for the Java Performance Advisor are available at:
For virtualized environments, you can also use the IBM PowerVM Virtualization Performance Advisor. Instructions for the IBM PowerVM Virtualization Performance Advisor are available at:
The number of online virtual CPUs of a single LPAR should not exceed the number of active CPUs in the pool. See the output of lparstat -i from the LPAR for the values of online virtual CPUs and active CPUs in the pool, as shown in the sketch that follows.
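A brief sketch of that check follows; the labels shown are typical lparstat -i output, and the values are illustrative:
# Compare the online virtual CPUs of this LPAR with the active CPUs in its pool
$ lparstat -i | grep -E "Online Virtual CPUs|Active CPUs in Pool"
Online Virtual CPUs                        : 4
Active CPUs in Pool                        : 16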
4.3.2 AIX preferred practices that are applicable to POWER7
Here are the AIX preferred practices that are applicable to POWER7.
Preferred practices for installation and configuration
Preferred practices for installation and configuration are:
To ensure that your system conforms to the minimum requirements, see Chapter 3, “The POWER Hypervisor” on page 55 and the references that are provided for that chapter (see 4.4, “Related publications” on page 94).
Review the POWER7 Virtualization Best Practice Guide, available at:
For more information about this topic, see 4.4, “Related publications” on page 94.
4.3.3 POWER7 mid-range and high-end High Impact or Pervasive advisory
IBM maintains a strong focus on the quality and reliability of Power Systems servers. To maintain that reliability, the currency of microcode levels on your systems is critical. Therefore, apply the latest POWER7 system firmware and management console levels for your systems as soon as possible. These service pack updates contain a collective number of High Impact or PERvasive (HIPER) fixes that continue to provide you with the system availability that you expect from IBM Power Systems.
Before you migrate to POWER7, you will see more benefit if your AIX level contains the performance bundle set of APARs. Visit IBM Fix Central (http://www.ibm.com/support/fixcentral/) to download the latest service pack (SP) for your POWER7 systems.
Furthermore, you should:
Define an upgrade or update plan to bring your servers to the latest fix levels.
Concurrently update servers that are running older firmware levels to the latest level.
See the associated SP readme files for information about the specific fixes included in each service pack. Sections of readme files that describe more tips and considerations are a helpful resource as well.
When you install firmware from the HMC, avoid the “do not auto accept” option. Selecting this advanced option can cause firmware installation problems.
Subscribe to My Notifications to receive customizable communications that contain important news; new or updated support content, such as publications, hints and tips, and technical notes; product flashes (alerts); and downloads and drivers.
4.4 Related publications
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter:
1 TB Segment Aliasing, found at:
AIX 64-bit Performance in Focus, SG24-5103
AIX Linking and Loading Mechanisms, found at:
Efficient I/O event polling through the pollset interface on AIX, found at:
Exclusive use processor resource sets, found at:
execrset command, found at:
General Programming Concepts: Writing and Debugging Programs, found at:
IBM AIX Version 7.1 Differences Guide, SG24-7910
Refer to section 1.2, “Improved performance using 1 TB segments”
load and loadAndInit Subroutines, found at:
mkrset command, found at:
Oracle Database and 1 TB Segment Aliasing, found at:
pollset_create, pollset_ctl, pollset_destroy, pollset_poll, and pollset_query Subroutines, found at:
POWER7 Virtualization Best Practice Guide, found at:
ra_attach Subroutine, found at:
Shared library memory footprints on AIX 5L, found at:
thread_post Subroutine, found at:
thread_post_many Subroutine, found at:
thread_wait Subroutine, found at:
Thread environment variables, found at: