The POWER7 processor
This chapter introduces the POWER7 processor and describes some of the technical details and features of this product. It covers the following topics:
2.1 Introduction to the POWER7 processor
The POWER7 processor is manufactured using the IBM 45 nm Silicon-On-Insulator (SOI) technology. Each chip is 567 mm² and contains 1.2 billion transistors. As shown in Figure 2-1, the chip contains eight cores, each with its own 256 KB L2 and 4 MB L3 (embedded DRAM) cache, two memory controllers, and an interconnection system that connects all components within the chip. The interconnect also extends through module and board technology to other POWER7 processors in addition to DDR3 memory and various I/O devices. The number of memory controllers and cores available for use depends upon the particular POWER7 system.
Figure 2-1 The POWER7 processor chip
Each core is a 64-bit implementation of the IBM Power ISA (Version 2.06 Revision B), and has the following features:
Multi-threaded design, capable of up to four-way SMT
32 KB, four-way set-associative L1 i-cache
32 KB, eight-way set-associative L1 d-cache
64-entry, two-way set-associative Effective to Real Address Translation (ERAT) cache for instructions
64-entry, fully associative ERAT cache for data
Aggressive branch prediction, using both local and global prediction tables with a selector table to choose the best predictor
15-entry link stack
128-entry count cache
128-entry branch target address cache
Aggressive out-of-order execution
Two symmetric fixed-point execution units
Two symmetric load/store units, which can also run simple fixed-point instructions
An integrated, multi-pipeline vector-scalar floating point unit for running both scalar and SIMD-type instructions, including the VMX instruction set and the new Vector Scalar eXtension (VSX) instruction set, and capable of up to eight flops per cycle
Hardware data prefetching with 12 independent data streams and software control
Hardware DFP capability
Adaptive power management
The POWER7 processor is designed for system offerings from 16-core blades to 256-core drawers. It incorporates a dual-scope broadcast coherence protocol over local and global SMP links to provide superior scaling attributes.
For more information about this topic, see 2.4, “Related publications” on page 51.
2.1.1 The POWER7+ processor
The POWER7+ is the same POWER7 processor core implemented in a newer technology, with more on-chip accelerators and a larger L3 cache. There are no new instructions in POWER7+ over POWER7. The differences in POWER7+ are:
Manufactured with 32-nm technology
A 10 MB L3 cache per core
On-chip encryption accelerators
On-chip compression accelerators
On-chip random number generators
2.2 Multi-core and multi-thread scalability
POWER7 Systems advancements in multi-core and multi-thread scaling are significant. A significant POWER7 performance opportunity comes from parallelizing workloads to enable the full potential of the Power platform. Application scaling is influenced by both multi-core and multi-thread technology in POWER7 processors. A single POWER7 chip can contain up to eight cores. With SMT, each POWER7 core can present four hardware threads. SMT is the ability of a single physical processor core to simultaneously dispatch instructions from more than one hardware thread context. Because there are multiple hardware threads per physical processor core, additional instructions can run at the same time. SMT is primarily beneficial in commercial environments where the speed of an individual transaction is not as important as the total number of transactions performed. SMT is expected to increase the throughput of workloads with large or frequently changing working sets, such as database servers and web servers.
Additional details about the SMT feature are described in Table 2-1 and Table 2-2.
Table 2-1   Multi-thread per core features by POWER generation

Technology    Cores/system    Maximum SMT mode    Maximum hardware threads per LPAR
IBM POWER4    32              ST                  32
IBM POWER5    64              SMT2                128
IBM POWER6    64              SMT2                128
IBM POWER7    256             SMT4                1024
Table 2-2   Multi-thread per core features by single LPAR scaling

Single LPAR scaling     AIX release      Linux
32-core/32-thread       5.3/6.1/7.1      RHEL 5/6, SLES 10/11
64-core/128-thread      5.3/6.1/7.1      RHEL 5/6, SLES 10/11
64-core/256-thread      6.1 (TL4)/7.1    RHEL 6, SLES 11 SP1
256-core/1024-thread    7.1              RHEL 6, SLES 11 SP1
Operating system support for and usage of multi-core and multi-thread technology varies by operating system and release.
Power operating systems present an SMP view of the resources of a partition. Hardware threads are presented as logical CPUs to the application stack. Many applications can use the operating system scheduler to place workloads onto logical processors and maintain the SMP programming model. In some cases, the differentiation between hardware threads per core can be used to improve performance. Placement of a workload on hardware book, drawer and node, socket, core, and thread boundaries can improve application scaling. Details about operating system binding facilities are available in 4.1, “AIX and system libraries” on page 68, and include:
Affinity/topology bindings
Hybrid thread features
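As an illustration of the binding facilities, the following minimal sketch pins the calling thread to a single logical CPU (hardware thread) through the AIX bindprocessor() interface. The choice of logical CPU 0 is an arbitrary assumption; production code would pick placement based on the partition topology.

/* Minimal sketch: bind the current kernel thread to one logical CPU. */
#include <stdio.h>
#include <unistd.h>
#include <sys/processor.h>  /* bindprocessor(), PROCESSOR_CLASS_ANY */
#include <sys/thread.h>     /* thread_self() */

int main(void)
{
    /* Logical CPUs are the SMT hardware threads presented to AIX. */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld logical CPUs online\n", ncpus);

    /* Bind this kernel thread to logical CPU 0 (assumption). */
    if (bindprocessor(BINDTHREAD, thread_self(), 0) == -1) {
        perror("bindprocessor");
        return 1;
    }
    /* ... run the placement-sensitive work here ... */

    /* Remove the binding again. */
    bindprocessor(BINDTHREAD, thread_self(), PROCESSOR_CLASS_ANY);
    return 0;
}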
Using multi-core and multi-thread features is a challenging prospect. An overview about this topic and about the application considerations is provided in 4.1, “AIX and system libraries” on page 68. Additionally, the following specific scaling topics are described in 4.1, “AIX and system libraries” on page 68:
pthread tuning
malloc tuning
For more information about this topic, see 2.4, “Related publications” on page 51.
2.3 Using POWER7 features
This section describes several features of POWER7 that can affect performance, including page sizes, cache sharing, SMT priorities, and others.
2.3.1 Page sizes (4 KB, 64 KB, 16 MB, and 16 GB)
The virtual address space of a program is divided into segments. The size of each segment can be either 256 MB or 1 TB on a Power System, and the virtual address space can consist of a mix of these segment sizes. The segments are in turn divided into units, called pages. Similarly, the physical memory on the system is divided into page-size units called page frames. The role of the Virtual Memory Manager (VMM) is to manage the allocation of real memory page frames and to manage references to virtual memory pages (the virtual address space is always larger than the available real memory). The VMM must minimize the total processor time, disk bandwidth cost, and response time that is needed to handle virtual memory page faults. The IBM Power Architecture provides support for multiple virtual memory page sizes, which provides performance benefits to an application because of hardware efficiencies that are associated with larger page sizes.1,2
The POWER5+ and later processor chips support four virtual memory page sizes: 4 KB, 64 KB, 16 MB, and 16 GB. The POWER6 processor also supports using 64 KB pages inside segments along with a base page size of 4 KB.3 The 16 GB pages can be used only within 1 TB segments.
Large pages provide multiple technical advantages:4
Reduced Page Faults and Translation Lookaside Buffer (TLB) Misses: A single large page that is being constantly referenced remains in memory. This feature eliminates the possibility of several small pages often being swapped out.
Unhindered Data Prefetching: A large page enables unhindered data prefetch (which is constrained by page boundaries).
Increased TLB Reach: This feature saves space in the TLB by holding one translation entry instead of n entries, which increases the amount of memory that can be accessed by an application without incurring hardware translation delays.
Increased ERAT Reach: The ERAT on Power is a first level and fully associative translation cache that can go directly from effective to real address. Large pages also improve the efficiency and coverage of this translation cache as well.
Large segments (1 TB) also reduce Segment Lookaside Buffer (SLB) misses and increase the reach of the SLB. The SLB is a cache of the most recently used Effective to Virtual Segment translations.
While 16 MB and 16 GB pages are intended only for particularly high performance environments, 64 KB pages are considered general purpose, and most workloads benefit from using 64 KB pages rather than 4 KB pages.
Multipage size support on Linux
On Power Systems running Linux, the default page size is 64 KB, so most, but not all, applications are expected to see a performance benefit from this default. There are cases in which an application uses many small files, which can mean that each file is loaded into a 64 KB page, resulting in poor memory utilization.
Support for 16 MB pages (hugepages in Linux terminology) is available through various mechanisms and is typically used for databases, Java engines, and high-performance computing (HPC) applications. The libhugetlbfs package is available in Linux distributions, and using this package gives you the most benefit from 16 MB pages.
Multipage size support on AIX
The pagesize -a command on AIX determines all of the page sizes that are supported by AIX on a particular system.
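A program can query the same information at run time with the vmgetinfo() VMINFO_GETPSIZES query, the programmatic equivalent of pagesize -a. The following minimal sketch assumes the documented two-call idiom (the first call sizes the array, the second fills it):

#include <stdio.h>
#include <stdlib.h>
#include <sys/vminfo.h>  /* vmgetinfo(), VMINFO_GETPSIZES, psize_t */

int main(void)
{
    /* First call: how many page sizes does this system support? */
    int num = vmgetinfo(NULL, VMINFO_GETPSIZES, 0);
    psize_t *psizes = malloc(num * sizeof(psize_t));

    /* Second call: retrieve the supported page sizes, in bytes. */
    vmgetinfo(psizes, VMINFO_GETPSIZES, num);
    for (int i = 0; i < num; i++)
        printf("supported page size: %llu bytes\n",
               (unsigned long long)psizes[i]);
    free(psizes);
    return 0;
}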
IBM AIX 5L™ Version 5.3 with the 5300-04 Technology Level supports up to four different page sizes, but the actual page sizes that are supported by a particular system vary, based on processor chip type.
AIX V6.1 supports segments with two page sizes: 4 KB and 64 KB. By default, processes use these variable page size segments. This configuration is overridden by the existing page size selection mechanism. Page sizes are an attribute of an individual segment (whether single page size or mixed per segment). A process address space can consist of a mix of segments of varying page sizes. For example, the process text segment can be 4 KB pages, and its stack and heap segments can be 64 KB page size segments.5
Because the 64 KB page size is easy to use, and because it is expected that many applications perform better when they use the 64 KB page size rather than the 4 KB page size, AIX has rich support for the 64 KB page size. No system configuration changes are necessary to enable a system to use the 64 KB page size. On systems that support the 64 KB page size, the AIX kernel automatically configures the system to use it. Table 2-3 and Table 2-4 on page 27 list the page size specifications for Power Systems.
Table 2-3   Page size support for Power hardware and AIX configuration support6

Page size    Required hardware    Requires user configuration    Restricted
4 KB         All                  No                             No
64 KB        POWER5+ or later     No                             No
16 MB        POWER4 or later      Yes                            Yes
16 GB        POWER5+ or later     Yes                            Yes
Table 2-4   Supported segment page sizes7

Segment base page size    Supported page sizes    Minimum required hardware
4 KB                      4 KB/64 KB              POWER6
64 KB                     64 KB                   POWER5+
16 MB                     16 MB                   POWER4
16 GB                     16 GB                   POWER5+
The vmo command on AIX allows configuration of the VMM tunable parameters.
The vmo tunable vmm_mpsize_support toggles the operating system multiple page size support for the extra page sizes that are provided by POWER5+ and later machines.
A value of 1 indicates that the operating system takes advantage of extra page sizes that are supported by a processor chip. A value of 2 indicates that the operating system additionally takes advantage of the capability of using multiple page sizes per segment. When set to 0, the only page sizes that the operating system recognizes are 4 KB and the system large page size.
AIX V6.1 takes advantage of this new hardware capability to combine the conservative memory usage aspects of the 4 KB page size in sparsely referenced memory regions, with the performance benefits of the 64 KB page size in densely referenced memory regions.
AIX V6.1 takes advantage of this automatically, without user intervention. This AIX feature is referred to as dynamic Variable Page Size Support (VPSS). Some applications might prefer to use a larger page size, even when a 64 KB region is not fully referenced. The page size promotion aggressiveness factor (PSPA) can be used to reduce the memory reference requirement at which a group of 4 KB pages is promoted to a 64 KB page. The PSPA can be set for the whole system by using the vmm_default_pspa vmo tunable, or for a specific process by using the vm_pattr system call.8
In addition to 4 KB and 64 KB page sizes, AIX supports 16 MB pages, also called large pages, and 16 GB pages, also called huge pages. These page sizes are intended for use only in high-performance environments, and AIX normally does not automatically configure a system to use these page sizes. However, the new Dynamic System Optimizer (DSO) facility in AIX (see 4.2, “AIX Active System Optimizer and Dynamic System Optimizer” on page 84) can autonomously configure and use 16 MB pages when enabled.
Use the vmo tunables lgpg_regions and lgpg_size to configure the number of 16 MB large pages on a system.
The following example allocates 1 GB of 16 MB large pages:
vmo -r -o lgpg_regions=64 -o lgpg_size=16777216
To use large pages, non-root users must have the CAP_BYPASS_RAC_VMM capability in AIX enabled. The system administrator can add this capability by running chuser:
chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user_id>
Huge pages must be configured using the Hardware Management Console (HMC). To do so, complete the following steps:
1. On the managed system, click Properties → Memory → Advanced Options → Show Details to change the number of 16 GB pages.
2. Assign 16 GB huge pages to a partition by changing the partition profile.
Application support to use multisize pages on AIX9
As described in Power Instruction Set Architecture Version 2.06,10 you can specify page sizes to use for four regions of a 32-bit or 64-bit process address space.
These page sizes can be configured with an environment variable or with settings in an application XCOFF binary with the ldedit or ld commands, as shown in Table 2-5.
Table 2-5   Page sizes for four regions of a 32-bit or 64-bit process address space

Region           ld or ldedit option    LDR_CNTRL environment variable    Description
Data             bdatapsize             DATAPSIZE                         Initialized data, bss, and heap
Stack            bstackpsize            STACKPSIZE                        Initial thread stack
Text             btextpsize             TEXTPSIZE                         Main executable text
Shared memory    None                   SHMPSIZE                          Shared memory that is allocated by the process
You can specify a different page size to use for each of the four regions of a process address space. Only the 4 KB and 64 KB page sizes are supported for all four memory regions. The 16 MB page size is supported only for the process data, process text, and process shared memory regions. The 16 GB page size is supported only for a process shared memory region.
You can set the preferred page sizes for an application in the XCOFF/XCOFF64 binary file by running the ldedit or ld commands.
The ld or cc commands can be used to set these page size options when you are linking an executable:
ld -o mpsize.out -btextpsize:4K -bstackpsize:64K sub1.o sub2.o
cc -o mpsize.out -btextpsize:4K -bstackpsize:64K sub1.o sub2.o
The ldedit command can be used to set these page size options in an existing executable:
ldedit -btextpsize=4K -bdatapsize=64K -bstackpsize=64K mpsize.out
You can set the preferred page sizes of a process with the LDR_CNTRL environment variable. As an example, the following command causes the mpsize.out process to use 4 KB pages for its data, 64 KB pages for its text, and 64 KB pages for its shared memory on supported hardware:
LDR_CNTRL=DATAPSIZE=4K@TEXTPSIZE=64K@SHMPSIZE=64K mpsize.out
Page size environment variables override any page size settings in an executable XCOFF header. Also, the DATAPSIZE environment variable overrides any LARGE_PAGE_DATA environment variable setting.
Rather than using the LDR_CNTRL environment variable, consider marking specific executable files to use large pages, because this limits the large page usage to the specific application that benefits from large page usage.
Page size and shared memory
To back shared memory segments of an application with large pages, specify the SHM_LGPAGE and SHM_PIN flags in the shmget() function. If large pages are unavailable, 4 KB pages are used to back the shared memory segment.
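The following minimal sketch shows the shmget() flags that are described above; the 64 MB segment size is an arbitrary assumption for the example:

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>  /* shmget(), SHM_LGPAGE, SHM_PIN on AIX */

#define SEG_SIZE (64 * 1024 * 1024)  /* 64 MB (assumption) */

int main(void)
{
    /* Request a pinned shared memory segment backed by 16 MB pages;
     * if large pages are unavailable, AIX silently uses 4 KB pages. */
    int id = shmget(IPC_PRIVATE, SEG_SIZE,
                    IPC_CREAT | SHM_LGPAGE | SHM_PIN | 0600);
    if (id == -1) {
        perror("shmget");
        return 1;
    }
    void *p = shmat(id, NULL, 0);
    /* ... use the segment ... */
    shmdt(p);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}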
Support for specifying the page size to use for the shared memory of a process with the SHMPSIZE environment variable is available starting in IBM AIX 5L Version 5.3 with the 5300-08 Technology Level, or later, and AIX Version 6.1 with the 6100-01 Technology Level, or later.
Monitoring page size that is used by an application
Monitoring page size is accomplished by running the following commands:11
The ps command can be used to monitor the base page sizes that are used for process data, stack, and text.
The vmstat command has two options available to display memory statistics for a specific page size.
The vmstat -p command displays global vmstat information, along with a breakdown of statistics per page size.
The vmstat -P command displays per page size statistics.
For more information about this topic, see 2.4, “Related publications” on page 51.
2.3.2 Cache sharing
Power Systems consist of multiple processor cores and multiple processor chips that share caches and memory in the system. The architecture uses a processor and memory layout that allows the hardware to scale to many nodes of processor chips and memory. One advantage is that systems can be used both for multiple workloads and for large workloads. However, these characteristics must be carefully weighed in the design, implementation, and evaluation of a workload. Aspects of a program, such as the allocation of data across cores and chips and the layout of data within a data structure, play a key role in maximizing performance, especially when scaling across many processor cores and chips.
Power Systems use a cache-coherent SMP design in which all the memory in the system is accessible to all of the processor cores in the system, and all of the cache is coherently maintained:12
1. Any processor core on any chip can access the memory of the entire system.
2. Any processor core can access the contents of any core cache, even if it is on a different chip.
Processor core access: In both of these cases, the processor core can access only memory or cache that it is authorized to access through normal operating system and Hypervisor memory access permissions and controls.
In POWER7 Systems, each chip consists of eight processor cores, each with on-core L1 instruction and d-caches, an L2 cache, and an L3 cache, as shown in Figure 2-2 on page 31. All of these caches are effectively shared. The L2 cache has a longer access latency than L1, and L3 has a longer access latency than L2. Each chip also has memory controllers, allowing direct access to a portion of the memory DIMMs in the system.13 Thus, it takes longer for an application thread to access data in cache or memory that is attached to a remote chip than to access data in a local cache or memory. These types of characteristics are often referred to as affinity performance effects (see “The POWER7 processor and affinity performance effects” on page 14). In many cases, systems that are built around different processor models have varying characteristics (for example, while L3 is supported, it might not be implemented on some models).
Functionally, it does not matter which core in the system an application thread is running on, or what memory the data it is accessing is on. However, this situation does affect the performance of applications, because accessing a remote memory or cache takes more time than accessing a local memory or cache.14 This effect becomes even more pronounced with the capability of modern systems to support massive scaling and the resulting possibility for remote accesses to occur across a large processor interconnection complex.
The effect of these system properties can be observed by application threads, because they often move, sometimes rather frequently, between processor cores. This situation can happen for various reasons, such as a page fault or lock contention that results in the application thread being preempted while it waits for a condition to be satisfied, and then being resumed on a different core. Any application data that is in the cache local to the original core is no longer in the local cache, because the application thread moved and a remote cache access is required.15 Although modern operating systems, such as AIX, attempt to ensure that cache and memory affinity is retained, this movement does occur, and can result in a loss in performance. For an introduction to the concepts of cache and memory affinity, see “The POWER7 processor and affinity performance effects” on page 14.
The IBM POWER Hypervisor is responsible for:
Virtualization of processor cores and memory that is presented to the operating system
Ensuring that the affinity between the processor cores and memory an LPAR is using is maintained as much as possible
However, it is important for application designers to consider affinity issues in the design of applications, and to carefully assess the impact of application thread and data placement on the cores and the memory that is assigned to the LPAR the application is running in.
Various techniques that are employed at the system level can alleviate the effect of cache sharing. One example is to configure the LPAR so that the amount of memory that is requested for the LPAR is satisfied by the memories that are locally available to processor cores in the system (the memory DIMMs that are attached to the memory controllers for each processor core). Here, it is more likely that the POWER Hypervisor is able to maintain affinity between the processor cores and memory that is assigned to the partition, improving performance.16
For more information about LPAR configuration and running the lssrad command to query the affinity characteristics of a partition, see Chapter 3, “The POWER Hypervisor” on page 55.
The rest of this section covers multiple topics that can affect application performance, including the effects of cache geometry, alignment of data, and sensitivity to the scaling of applications to more cores. Tips are provided for using the various functionalities that are provided in Power Systems and AIX.
Cache geometry
Cache geometry refers to the specific layout of the caches in the system, including their location, interconnection, and sizes. These design details change for every processor chip, even within the Power Architecture. Figure 2-2 shows the layout of a POWER7 chip, including the processor cores, caches, and local memory. Table 2-6 shows the cache sizes and related geometry information for POWER7.
Figure 2-2 POWER7 chip and local memory17
Table 2-6   POWER7 storage hierarchy18

Cache                                 POWER7                          POWER7+
L1 i-cache: capacity/associativity    32 KB, 4-way                    32 KB, 4-way
L1 d-cache: capacity/associativity/   32 KB, 8-way;                   32 KB, 8-way;
bandwidth                             2 x 16-B reads or               2 x 16-B reads or
                                      1 x 16-B write per cycle        1 x 16-B write per cycle
L2 cache: capacity/associativity/     256 KB, 8-way; private;         256 KB, 8-way; private;
bandwidth                             32-B reads and 16-B writes      32-B reads and 16-B writes
                                      per cycle                       per cycle
L3 cache: capacity/associativity/     On-chip; 4 MB/core, 8-way;      On-chip; 10 MB/core, 8-way;
bandwidth                             16-B reads and 16-B writes      16-B reads and 16-B writes
                                      per cycle                       per cycle
Optimizing for cache geometry
There are several ways to optimize for cache geometry, as described in this section.
Splitting structures into hot and cold elements
A technique for optimizing applications to take advantage of cache is to lay out data structures so that fields with a high rate of reference (that is, hot fields) are grouped, and fields with a relatively low rate of reference (that is, cold fields) are grouped.19 The concept is to place the hot elements into the same byte region of memory, so that when they are pulled into the cache, they are co-located in the same cache line or lines. Additionally, because hot elements are referenced often, they are likely to stay in the cache. Likewise, the cold elements end up in the same area of memory and in the same cache lines, so that being written out to main storage and discarded causes less of a performance degradation, because they have a much lower rate of access.

Power Systems use 128-byte cache lines. Compared to Intel processors (64-byte cache lines), these larger cache lines have the advantage of increasing the reach that is possible with the same size cache directory, and the efficiency of the cache by covering up to 128 bytes of hot data in a single line. However, they also potentially bring more data into the cache than is needed for fine-grained accesses (that is, less than 64 bytes).
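Here is a hypothetical illustration of the technique; the structure and field names are invented for the example:

/* Before: hot and cold fields are interleaved, so every update drags
 * the rarely used fields into the cache as well. */
struct conn_mixed {
    long bytes_sent;        /* hot: updated on every packet  */
    char description[200];  /* cold: read only for reporting */
    long packets_sent;      /* hot                           */
    char owner[200];        /* cold                          */
};

/* After: the hot fields are grouped at the front so they share one
 * 128-byte cache line; the cold fields are pulled in only when needed. */
struct conn_split {
    long bytes_sent;        /* hot group  */
    long packets_sent;
    char description[200];  /* cold group */
    char owner[200];
};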
As described in Eliminate False Sharing, Stop your CPU power from invisibly going down the drain,20 it is also important to carefully assess the impact of this strategy, especially when applied to systems where there are a high number of CPU cores and a phenomenon referred to as false sharing can occur. False sharing occurs when multiple data elements are in the same cache line that can otherwise be accessed independently. For example, if two different hardware threads wanted to update (store) two different words in the same cache line, only one of them at a time can gain exclusive access to the cache line to complete the store. This situation results in:
Cache line transfers between the processors where those threads are
Stalls in other threads that are waiting for the cache line
Leaving all but the most recent thread to update the line without a copy in their cache
This effect is compounded as the number of application threads that share the cache line (that is, threads that are using different data in the cache line under contention) is scaled upwards.21, 20 The discussion about cache sharing22 also presents techniques for analyzing false sharing and suggestions for addressing the phenomenon.
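A common remedy is to pad and align independently updated elements so that each one owns a full cache line. The following sketch assumes the GCC/XL aligned attribute and the 128-byte Power cache line size:

#define CACHE_LINE 128

/* One counter per thread. The aligned attribute places each element on
 * its own 128-byte line (sizeof is rounded up to the alignment), so a
 * store by one hardware thread does not invalidate the line that holds
 * the other thread's counter. */
struct padded_counter {
    volatile long count;
} __attribute__((aligned(CACHE_LINE)));

struct padded_counter counters[2];  /* no false sharing between [0] and [1] */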
Prefetching to avoid cache miss penalties
Prefetching to avoid cache miss penalties is another technique that is used to improve performance of applications. The concept is to prefetch blocks of data to be placed into the cache a number of cycles before the data is needed. This action hides the penalty of waiting for the data to be read from main storage. Prefetching can be speculative when, based on the conditional path that is taken through the code, the data might end up not actually being required. The benefit of prefetching depends on how often the prefetched data is used. Although prefetching is not strictly related to cache geometry, it is an important technique.
A caveat to prefetching is that, although it is common for the technique to improve performance for single-thread, single core, and low utilization environments, it actually can decrease performance in high thread-count per-socket and high-utilization environments. Most systems today virtualize processors and the memory that is used by the workload. Because of this situation, the application designer must consider that, although an LPAR might be assigned only a few cores, the overall system likely has a large number of cores. Further, if the LPARs are sharing processor cores, the problem becomes compounded.
The dcbt and dcbtst instructions are commonly used to prefetch data.23,24 Power Architecture ISA 2.06 Stride N Prefetch Engines to boost Application's performance25 provides an overview about how these instructions can be used to improve application performance. These instructions can be used directly in hand-tuned assembly language code, or they can be accessed through compiler built-ins or directives.
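In portable C code, these touch instructions are usually reached through compiler built-ins rather than hand-written assembly. The following sketch uses the GCC __builtin_prefetch() built-in (the XL compilers provide comparable touch built-ins); the prefetch distance of eight elements is an assumption that must be tuned for the access pattern and cache latencies:

static long sum_with_prefetch(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 3);  /* read, high locality */
        sum += a[i];
    }
    return sum;
}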
Prefetching is also automatically done by the POWER7 hardware and is configurable, as described in 2.3.7, “Data prefetching using d-cache instructions and the Data Streams Control Register (DSCR)” on page 46.
Alignment of data
Processors are optimized for accessing data elements on their naturally aligned boundaries. Unaligned data accesses might require extra processing time by the processor for individual load or store instructions. They might require a trap and emulation by the host operating system. Ensuring natural data alignment also ensures that individual accesses do not span cache line boundaries.
Similar to the idea of splitting structures into hot and cold elements, the concept of data alignment seeks to optimize cache performance by ensuring that data does not span across multiple cache lines. The cache line size in Power Systems is 128 bytes.
The general technique for alignment is to keep operands (data) on natural boundaries, such as a word or doubleword boundary (that is, an int would be aligned to be on a word boundary in memory). This technique might involve padding and reordering data structures to avoid cases such as the interleaving of chars and doubles: char; double; char; double. High-level language compilers do automatic data alignment. However, padding must be carefully analyzed to ensure that it does not result in more cache misses or page misses (especially for rarely referenced groupings of data).
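The following hypothetical structures illustrate the point; reordering the members removes most of the padding and keeps every operand naturally aligned:

/* Interleaved: each double must start on an 8-byte boundary, so the
 * compiler inserts 7 bytes of padding after each char (sizeof == 32
 * on typical ABIs). */
struct mixed {
    char   c1;   /* + 7 bytes of padding */
    double d1;
    char   c2;   /* + 7 bytes of padding */
    double d2;
};

/* Reordered: doubles first, chars packed together (sizeof == 24). */
struct reordered {
    double d1;
    double d2;
    char   c1;
    char   c2;   /* + 6 bytes of tail padding */
};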
Additionally, to achieve optimal performance, floating point and VMX/VSX have different alignment requirements. For example, the preferred VSX alignment is 16 bytes instead of the element size of the data type being used. This situation means that VSX data that is smaller than 16 bytes in length must be padded out to 16 bytes. The compilers introduce padding as necessary to provide optimal alignment for vector data types.
Sensitivity of scaling to more cores
Different processor chip versions and system models support scaling of LPARs and workloads to different numbers of cores, and might have different bus widths and latencies. As a result of all of these factors, the sensitivity of the performance of an application or workload to the number of cores it is running on changes with the processor chip version and system model.
In general terms, an application that seldom accesses memory beyond its own caches (that is, one that is core-centric) scales well across more cores. Performance loss when scaling across multiple cores tends to come from one or more of the following sources:
Increased cache misses (often from invalidations of data by other processor cores, especially for locks)
The increased cost of cache misses, which in turn drives overall memory and interconnect fabric traffic into the region of bandwidth limitations (saturating the memory busses and interconnect)
Additional cores being added to the workload in other nodes, which results in increased latency in reaching memory and caches in those nodes
Briefly, cache miss requests and returning data can end up being routed through busses that connect multiple chips and memory, which have particular bandwidth and latency characteristics. The goal for scaling across multiple cores, then, is to minimize the change in the potential penalties that are associated with cache misses and data requests as the workload size grows.
It is difficult to assess what strategies are effective for scaling to more cores without considering the complex aspects of a specific application. For example, if all of the cores that the application is running across eventually access all of the data, then it might be wise to interleave data across the processor sockets (which are typically a grouping of processor chips) to optimize them from a memory bus utilization point of view. However, if the access pattern to data is more localized so that, for most of the data, separate processor cores are accessing it most of the time, the application might obtain better performance if the data is close to the processor core that is accessing that data the most (maintaining affinity between the application thread and the data it is accessing). For the latter case, where the data ought to be close to the processor core that is accessing the data, the AIX MEMORY_AFFINITY=MCM environment variable can be set to achieve this behavior.
Programs can suffer when multiple processor cores access the same data and that data is held by a lock, because the cache line that holds the data is repeatedly invalidated. This phenomenon is often referred to as hot locks, where a lock that is holding data has a high rate of contention. Hot locks result in cache line intervention and can easily limit the ability to scale a workload, because all updates to the lock are serialized. Tools such as splat (see “AIX trace-based analysis tools” on page 165) can be used to identify hot locks.
Hot locks can be caused by the programmer having lock control access to too large an area of data, which is known as coarse-grained locking.26 In that case, the strategy to effectively deal with a hot lock is to split the lock into a set of fine-grained locks, such that multiple locks, each managing a smaller portion of the data than the original lock, now manage the data for which access is being serialized. Hot locks can also be caused by trying to scale an application to more cores than the original design intended. In that case, using an even finer grain of locking might be possible, or changes can be made in data structures or algorithms, such that lock contention is reduced.
Additionally, the programmer must spend time considering the layout of locks in the cache to ensure that multiple locks, especially hot locks, are not in the same cache line because any updates to the lock itself results in the cache line being invalidated on other processor cores. When possible, locks should be padded so that they are in their own distinct cache line.
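A sketch of this padding technique follows, assuming the GCC/XL aligned attribute. Because the alignment also rounds the structure size up to a full cache line, every lock in an array gets a line of its own:

#include <pthread.h>

#define CACHE_LINE 128

/* Each lock owns a full 128-byte cache line, so updates to one lock do
 * not invalidate the line that holds a neighboring lock. */
struct padded_lock {
    pthread_mutex_t lock;
} __attribute__((aligned(CACHE_LINE)));

struct padded_lock bucket_locks[16];  /* hypothetical fine-grained locks */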
For more information about this topic, see 2.4, “Related publications” on page 51.
2.3.3 SMT priorities
POWER5 introduced the capability to set the SMT thread priority level for each hardware thread, controlling the relative priority of the threads within a single core. The relative difference between the priorities of the hardware threads determines the number of decode cycles that each thread receives during a period.27 Typically, the SMT priority level is changed by using a special no-op OR instruction or by using the thread_set_smt_priority system call in AIX. The result can be boosted performance for the sibling SMT threads on the same processor core.
Concepts and benefits
The POWER processor architecture uses SMT to provide multiple streams of hardware execution. POWER7 provides four SMT hardware threads per core and can be configured to run in SMT4, SMT2, or single-threaded mode (SMT1 mode or, as referred to in this publication, ST mode) while POWER6 and POWER5 provide two SMT threads per core and can be run in SMT2 mode or ST mode.
By using multiple SMT threads, a workload can take advantage of more of the hardware features provided in the POWER processor than if a single SMT thread is used per core. By configuring the processor core to run in multi-threaded mode, the operating system can maximize the usage of the hardware capabilities that are provided in the system and the overall workload throughput by correctly balancing software threads across all of the cores and SMT hardware threads in the partition.
The Power Architecture provides an SMT Thread Priority mechanism by which the priority among the SMT threads in the processor core can be adjusted so that an SMT thread can receive more or less favorable performance (in terms of dispatch cycles) than the other threads in the same core. This mechanism can be used in various situations, such as to boost the performance of other threads while the thread with a lowered priority is waiting on a lock, or when waiting on other cooperative threads to reach a synchronization point.
SMT thread priority levels
Table 2-7 lists the SMT thread priority levels that are supported in the Power Architecture. The levels to which code can set the SMT priority are controlled by the privilege level at which the code is running (such as problem-state versus supervisor level). For example, code that is running in problem-state cannot set the SMT priority level to High. However, AIX provides a system call interface that allows the SMT priority level to be set to any level other than the ones that are restricted to hypervisor code.
Table 2-7   SMT thread priority levels for POWER5, POWER6, and POWER728, 29

SMT thread priority level                              Minimum privilege that is required to set level
Thread shutoff (read only; set by disabling thread)    Hypervisor
Very low                                               Supervisor
Low                                                    Problem-state
Medium low                                             Problem-state
Medium                                                 Problem-state
Medium high                                            Supervisor
High                                                   Supervisor
Very high                                              Hypervisor
Various methods for setting the SMT priority level are described in “APIs” on page 37.
AIX kernel usage of SMT thread priority and effects
The AIX kernel is optimized to take advantage of SMT thread priority by lowering the SMT thread priority in select code paths, such as when spinning in the wait process. When the kernel modifies the SMT thread priority and execution is returned to a process-thread, the kernel sets the SMT thread priority back to Medium or the level that is specified by the process-thread using an AIX system call that modified the SMT thread priority (see “APIs” on page 37).
Where to use
SMT thread priority can be used to improve the performance of a workload by lowering the SMT thread priority that is being used on an SMT thread that is running a particular process-thread when:
The thread is waiting on a lock
The thread is waiting on an event, such as the completion of an IO event
Alternatively, process-threads that are performance sensitive can maximize their performance by ensuring that the SMT thread priority level is set to an elevated level.
APIs
There are three ways to set the SMT priority when it is running on POWER processors:30, 31
1. Modify the SMT priority directly using the PPR register.32
2. Modify the SMT priority through the usage of special no-ops.33
3. Using the AIX thread_set_smt_priority system call.34
On POWER7 and earlier, code that is running in problem-state can only set the SMT priority level to Low, Medium-Low, or Medium. On POWER7+, code that is running in problem-state can additionally set the SMT priority to Very-Low.
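As an illustration of the special no-op method, the following sketch wraps two of the architected priority-setting encodings in inline assembly and uses them to lower the SMT priority while spinning; the spin-wait itself is a hypothetical example:

/* The "or Rx,Rx,Rx" forms are architected SMT priority hints. */
static inline void smt_priority_low(void)    { __asm__ volatile("or 1,1,1"); }
static inline void smt_priority_medium(void) { __asm__ volatile("or 2,2,2"); }

static void spin_until_set(volatile int *flag)
{
    smt_priority_low();      /* give sibling SMT threads more decode cycles */
    while (*flag == 0)
        ;                    /* spin */
    smt_priority_medium();   /* restore normal priority before real work */
}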
For more information about this topic, see 2.4, “Related publications” on page 51.
2.3.4 Storage synchronization (sync, lwsync, lwarx, stwcx., and eieio)
The Power Architecture storage model provides for out-of-order storage accesses, providing opportunities for performance enhancement when accesses do not need to be in order. However, when accessing storage shared by multiple processor cores or shared with I/O devices, it is important that accesses occur in the correct order that is required for the sharing mechanisms that is used.
The architecture provides mechanisms for synchronization of such storage accesses and defines an architectural model that software ought to adhere to. Several synchronization instructions are provided by the architecture, such as sync, lwsync, lwarx, stwcx., and eieio. There are also operating system-specific locking services that enforce such synchronization. Because of the inherent heavyweight nature of these mechanisms, software must be carefully designed when using them to ensure optimal performance while providing appropriate data consistency.
Concepts and benefits
The Power Architecture defines a storage model that provides weak ordering of storage accesses. The order in which memory accesses are performed might differ from the program order and the order in which the instructions that cause the accesses are run.35
The Power Architecture provides a set of instructions that enforce storage access synchronization, and the AIX kernel provides a set of kernel services that provide locking mechanisms and associated synchronization support.36 However, such mechanisms come with an inherent cost because of the nature of synchronization. Thus, it is important to intelligently use the correct storage mechanisms for the various types of storage access scenarios to ensure that accesses are performed in program order while minimizing their impact.
AIX kernel locking services
AIX provides a set of locking services that enforce synchronization by using mechanisms that are provided by the Power Architecture. These services are documented in online publications.37 The correct use of these locking services allows code to ensure that shared memory is accessed by only one producer or consumer of data at a time.
Associated instructions
The following instructions provide various storage synchronization mechanisms:
sync This instruction provides an ordering function, so that all instructions issued before the sync complete and no subsequent instructions are issued until after the sync completes.38
lwsync This instruction provides an ordering function similar to sync, but it is only applicable to load, store, and dcbz instructions that are run by the processor (hardware thread) running the lwsync instruction, and only for specific combinations of storage control attributes.39
lwarx This instruction reserves a storage location for a subsequent store by a stwcx. instruction and notifies the memory coherence mechanism of the reservation.40
stwcx. This instruction performs a store to the target location only if the location specified by a previous lwarx instruction has not been used for storage by another processor (hardware thread) or mechanism, which would invalidate the reservation.41
eieio This instruction creates a memory barrier that provides an order for storage accesses caused by load, store, dcbz, eciwx, and ecowx instructions.42
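As an illustration of the lwarx and stwcx. pairing, the following sketch implements an atomic increment in GCC-style inline assembly; the loop retries until the conditional store succeeds (that is, until no other hardware thread or mechanism touched the word between the lwarx and the stwcx.):

static inline int atomic_add_one(volatile int *addr)
{
    int old, upd;
    __asm__ __volatile__(
        "1: lwarx  %0,0,%2  \n\t"  /* load word and set the reservation */
        "   addi   %1,%0,1  \n\t"  /* compute the incremented value     */
        "   stwcx. %1,0,%2  \n\t"  /* store only if reservation holds   */
        "   bne-   1b       \n\t"  /* reservation lost: retry           */
        : "=&r"(old), "=&r"(upd)
        : "r"(addr)
        : "cr0", "memory");
    return old + 1;
}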
Where to use
Care must be taken when you use synchronization mechanisms in any processor architecture because the associated load and store instructions have a heavier weight than normal loads and stores and the barrier operations have a cost that is associated with them. Thus, it is imperative that the programmer carefully consider when and where to use such operations, so that data consistency is ensured without adversely affecting the performance of the software and the overall system.
PowerPC storage model and AIX programming43 describes where synchronization mechanisms must be used to ensure that the code adheres to the Power Architecture. Although this documentation covers how to write compliant code, it does not cover the performance aspect of using the mechanisms.
Unless the code is hand-tuned assembler code, you should take advantage of the locking services that are provided by AIX because they are tuned and provide the necessary synchronization mechanisms. Power Instruction Set Architecture Version 2.0644 provides assembler programming examples for sharing storage. For more information, see Appendix B, “Performance tooling and empirical performance analysis” on page 155.
For more information about this topic, see 2.4, “Related publications” on page 51.
2.3.5 Vector Scalar eXtension (VSX)
VSX in the Power ISA introduces support for Vector and Scalar Binary Floating Point Operations that conform to the IEEE-754 Standard for Floating Point Arithmetic. The introduction of VSX into the Power Architecture increases parallelism by providing Single Instruction Multiple Data (SIMD) execution functionality for double precision floating point, improving the performance of HPC applications.
The following VSX features are provided to increase opportunities for vectorization:
A unified register file, a set of Vector-Scalar Registers (VSRs) supporting both scalar and vector operations, is provided, eliminating the impact of vector-scalar data transfer through storage.
Support for word-aligned storage accesses for both scalar and vector operations is provided.
Robust support for IEEE-754 for both vector and scalar floating point operations is provided.
A 64-entry Unified Register File is shared across VSX, the Binary floating point unit (BFP), VMX, and the DFP unit. The 32 64-bit Floating Point Registers (FPRs), which are used by the BFP and DFP units, are mapped to registers 0 - 31 of the Vector Scalar Registers. The 32 vector registers (VRs) that are used by the VMX are mapped to registers 32 - 63 of the VSRs,45 as shown in Table 2-8.
Table 2-8   The Unified Register File

FPR0     VSR0
FPR1     VSR1
...      ...
FPR30    VSR30
FPR31    VSR31
VR0      VSR32
VR1      VSR33
...      ...
VR30     VSR62
VR31     VSR63
VSX supports Double Precision Scalar and Vector Operations and Single Precision Vector Operations. The 142 VSX instructions, which can operate on the 64 vector scalar registers, are broadly divided into two categories:46, 47, 48, 49, 50
Computational instructions: Addition, subtraction, multiplication, division, extracting the square root, rounding, conversion, comparison, and combinations of these operations
Non-computational instructions: Loads/stores, moves, select values, and so on
Compiler support for vectors
XLC supports vector processing technologies through language extensions on both AIX and Linux. GCC supports using the VSX engine on Linux. XL and GCC C implement and extend the AltiVec Programming Interface specification. In the extended syntax, type qualifiers and storage class specifiers can precede the keyword vector (or its alternative spelling, __vector) in a declaration.
Also, the XL compilers are able to automatically generate VSX instructions from scalar code when they generate code that targets the POWER7 processor. This task is accomplished by using the -qsimd=auto option with the -O3 optimization level or higher.
Table 2-9 lists the supported vector data types and the size and possible values for
each type.
Table 2-9   Vector data types

Type                         Interpretation of content    Range of values
vector unsigned char         16 unsigned char             0..255
vector signed char           16 signed char               -128..127
vector bool char             16 unsigned char             0, 255
vector unsigned short        8 unsigned short             0..65535
vector unsigned short int
vector signed short          8 signed short               -32768..32767
vector signed short int
vector bool short            8 unsigned short             0, 65535
vector bool short int
vector unsigned int          4 unsigned int               0..2^32 - 1
vector unsigned long
vector unsigned long int
vector signed int            4 signed int                 -2^31..2^31 - 1
vector signed long
vector signed long int
vector bool int              4 unsigned int               0, 2^32 - 1
vector bool long
vector bool long int
vector float                 4 float                      IEEE-754 single (32-bit) precision floating point values
vector double                2 double                     IEEE-754 double (64-bit) precision floating point values
vector pixel                 8 unsigned short             1/5/5/5 pixel
Vector types: The vector double type requires architectures that support the VSX instruction set extensions, such as POWER7. You must specify the XL -qarch=pwr7 -qaltivec compiler options when you use this type, or the GCC -mcpu=power7 or -mvsx options.
The hardware does not have instructions for supporting vector unsigned long long, vector bool long long, or vector signed long long. In GCC, you can declare these types, but the only hardware operation you can use these types for is vector floating point convert. In 64-bit mode, vector long is the same as vector long long. In 32-bit mode, these types are not permitted.
All vector types are aligned on a 16-byte boundary. An aggregate that contains one or more vector types is aligned on a 16-byte boundary, and padded, if necessary, so that each member of vector type is also 16-byte aligned. Vector data types can use some of the unary, binary, and relational operators that are used with primitive data types. All operators require compatible types as operands unless otherwise stated. For more information about the operator’s usage, see the XLC online publications51, 52, 53.
Individual elements of vectors can be accessed by using the Vector Multimedia eXtension (VMX) or the VSX built-in functions. For more information about the VMX and the VSX built-in functions, refer to the built-in functions section of Vector Built-in Functions.54
Vector initialization
A vector type is initialized by a vector literal or any expression that has the same vector type. For example:55
vector unsigned int v1;
vector unsigned int v2 = (vector unsigned int)(10); // XL only, not GCC
v1 = v2;
The number of values in a braced initializer list must be less than or equal to the number of elements of the vector type. Any uninitialized element is initialized to zero.
Here are examples of vector initialization using initializer lists:
vector unsigned int v1 = {1}; // initialize the first 4 bytes of v1 with 1
// and the remaining 12 bytes with zeros
vector unsigned int v2 = {1,2}; // initialize the first 8 bytes of v2 with 1 and 2
// and the remaining 8 bytes with zeros
vector unsigned int v3 = {1,2,3,4}; // equivalent to the vector literal
// (vector unsigned int) (1,2,3,4)
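Initialized vectors can then be operated on with the built-in functions. The following short example assumes the AltiVec/VSX built-ins described in this section and adds two vectors element by element (compile with -qaltivec for XL, or -maltivec -mvsx for GCC):

#include <altivec.h>

vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
vector float b = {10.0f, 20.0f, 30.0f, 40.0f};

vector float add_example(void)
{
    return vec_add(a, b);  /* {11.0f, 22.0f, 33.0f, 44.0f} in one SIMD op */
}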
How to use vector capability in POWER7
When you target a POWER processor that supports VMX or VSX, you can request the compiler to transform code into VMX or VSX instructions. These machine instructions can run up to 16 operations in parallel. This transformation mostly applies to loops that iterate over contiguous array data and perform calculations on each element. You can use the NOSIMD directive to prevent the transformation of a particular loop:56
Using a compiler: Compiler versions that recognize the POWER7 architecture are XL C/C++ 11.1 and XLF Fortran 13.1 or recent versions of GCC, including the Advance Toolchain, and the SLES 11SP1 or Red Hat RHEL6 GCC compilers:
 – For C:
 • xlc -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qsimd
 • gcc -mcpu=power7 -mtune=power7 -O3
 – For Fortran
 • xlf -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qsimd
 • gfortran -mcpu=power7 -mtune=power7 -O3
Using Engineering and Scientific Subroutine (ESSL) libraries with vectorization support:
 – Select routines have vector analogs in the library
 – Key FFT, BLAS routines
Vector capability support in AIX
A program can determine whether a system supports the vector extension by reading the vmx_version field of the _system_configuration structure. If this field is non-zero, then the system processor chips and operating system contain support for the vector extension. A __power_vmx() macro is provided in /usr/include/sys/systemcfg.h for performing this test. A value of 2 means that the processor chip is both VMX and VSX capable.
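The following minimal sketch shows this run time test:

#include <stdio.h>
#include <sys/systemcfg.h>  /* _system_configuration, __power_vmx() */

int main(void)
{
    if (__power_vmx())
        printf("vector extension available (vmx_version = %d)\n",
               _system_configuration.vmx_version);
    if (_system_configuration.vmx_version == 2)
        printf("processor is both VMX and VSX capable\n");
    return 0;
}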
The AIX Application Binary Interface (ABI) is extended to support the addition of vector register state and conventions. AIX supports the AltiVec programming interface specification.
A set of malloc subroutines (vec_malloc, vec_free, vec_realloc, and vec_calloc) is provided by AIX that gives 16-byte aligned allocations. Vector-enabled compilation, with _VEC_ implicitly defined by the compiler, results in any calls to older mallocs and callocs being redirected to their vector-safe counterparts, vec_malloc and vec_calloc. Non-vector code can also be explicitly compiled to pick up these same malloc and calloc redirections by explicitly defining __AIXVEC.
The alignment of the default malloc(), realloc(), and calloc() allocations can also be controlled at run time. This task can be done externally to any program by using the MALLOCALIGN environment variable, or internally to a program by using the mallopt() interface command option.57
For more information about this topic, see 2.4, “Related publications” on page 51.
2.3.6 Decimal floating point (DFP)
Decimal (base 10) data is widely used in commercial and financial applications. However, most computer systems have only binary (base two) arithmetic. There are two binary number systems in computers: integer (fixed-point) and floating point. Unfortunately, decimal calculations cannot be directly implemented with binary floating point. For example, the value 0.1 needs an infinitely recurring binary fraction, while a decimal number system can represent it exactly, as one tenth. So, using binary floating point cannot ensure that results are the same as those results using decimal arithmetic.
In general, DFP operations are emulated with binary fixed-point integers. Decimal numbers are traditionally held in a binary-coded decimal (BCD) format. Although BCD provides sufficient accuracy for decimal calculation, it imposes a heavy cost in performance, because it is usually implemented in software.
IBM POWER6 and POWER7 processor-based systems provide hardware support for DFP arithmetic. The POWER6 and POWER7 microprocessor cores include a DFP unit that provides acceleration for DFP arithmetic. The IBM Power instruction set was expanded with 54 new instructions to support the DFP unit architecture. DFP can provide a performance boost for applications that are using BCD calculations.58
How to take advantage of DFP unit on POWER
You can take advantage of the DFP unit on POWER with the following features:59
Native DFP language support with a compiler
The C draft standard includes the following new data types (these are native data types, as are int, long, float, double, and so on):
_Decimal32 7 decimal digits of accuracy
_Decimal64 16 decimal digits of accuracy
_Decimal128 34 decimal digits of accuracy
Note: The printf() function uses new options to print these new data types:
_Decimal32 uses %Hf
_Decimal64 uses %Df
_Decimal128 uses %DDf
 – The IBM XL C/C++ Compiler, release 9 or later for AIX and Linux, includes native DFP language support. Here is a list of compiler options for IBM XL compilers that are related to DFP:
 • -qdfp: Enables DFP support. This option makes the compiler recognize DFP literal suffixes, and the _Decimal32, _Decimal64, and _Decimal128 keywords.
 • -qfloat=dfpemulate: Instructs the compiler to use calls to library functions to handle DFP computation, regardless of the architecture level. You might experience performance degradation when you use software emulation.
 • -qfloat=nodfpemulate (the default when the -qarch flag specifies POWER6 or POWER7): Instructs the compiler to use DFP hardware instructions.
 • -D__STDC_WANT_DEC_FP__: Enables the referencing of DFP-defined symbols.
 • -ldfp: Enables the DFP functionality that is provided by the Advance Toolchain on Linux.
For hardware supported DFP, with -qarch=pwr6 or -qarch=pwr7, use the following command:
cc -qdfp
For software emulation of DFP (on earlier processor chips), use the following command:
cc -qdfp -qfloat=dfpemulate
 – The GCC compilers for Power Systems also include native DFP language support.
As of SLES 11 SP1 and RHEL 6, IEEE 754R DFP is fully integrated with compiler and run time (printf and DFP math) support. For older Linux distribution releases (RHEL 5/SLES 10 and earlier), you can use the freely available Advance Toolchain compiler and run time. The Advance Toolchain runtime libraries can also be integrated with recent XL (V9+) compilers for DFP exploitation.
The latest Advance Toolchain compiler and run times can be downloaded from the following website:
Advance Toolchain is a self-contained toolchain that does not rely on the base system toolchain for operability. In fact, it is designed to coexist with the toolchain shipped with the operating system. You do not have to uninstall the regular GCC compilers that come with your Linux distribution to use the Advance Toolchain.
The latest Enterprise distributions and Advance Toolchain run time use the Linux CPU tune library capability to automatically select hardware DFP or software implementation library variants, which are based on the hardware platform.
Here is a list of GCC compiler options for Advance Toolchain that are related to DFP:
 • -mhard-dfp (the default when -mcpu=power6 or -mcpu=power7 is specified): Instructs the compiler to directly take advantage of DFP hardware instructions for decimal arithmetic.
 • -mno-hard-dfp: Instructs the compiler to use calls to library functions to handle DFP computation, regardless of the architecture level. If your application is dynamically linked to the libdfp variant and running on POWER6 or POWER7 processors, then the run time automatically binds to the libdfp variant implemented with hardware DFP instructions. Otherwise, the software DFP library is used. You might experience performance degradation when you use software emulation.
 • -D__STDC_WANT_DEC_FP__: Enables the reference of DFP defined symbols.
 • -ldfp: Enables the DFP functionality that is provided by recent Linux Enterprise Distributions or the Advance Toolchain run time.
Decimal Floating Point Abstraction Layer (DFPAL), a no-additional-cost library that can be downloaded from IBM.60
Many applications that are using BCD today use a library to perform math functions. Changing to a native data type can be hard work, after which you might have an issue with one code set for AIX on POWER6 and one for other platforms that do not support native DFP. The solution to this problem is DFPAL, which is an alternative to the native support. DFPAL contains a header file to include in your code and the DFPAL library.
The header file is downloadable from General Decimal Arithmetic at http://speleotrove.com/decimal/ (search for DFPAL). Download the complete source code, and compile it on your system.
If you have hardware support for DFP, use the library to access the functions.
If you do not have hardware support (or want to compare the hardware and software emulation), you can force the use of software emulation by setting a shell variable before you run your application:
export DFPAL_EXE_MODE=DNSW
Determining if your applications are using DFP
There are two AIX commands that are used for monitoring:
hpmstat (for monitoring the whole system)
hpmcount (for monitoring a single program)
The PM_DFU_FIN (DFU instruction finish) counter in the output of the hpmstat and hpmcount commands verifies that DFP operations are completing.
The -E PM_MRK_DFU_FIN option of the tprof command uses the AIX trace subsystem to tell you which functions are using DFP and how often.
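For example, the following invocation profiles the system for 10 seconds and attributes marked DFU completions to functions (the -skex flags, which request shared-library, kernel, kernel-extension, and per-process reporting, and the sleep 10 window are illustrative choices; check the tprof documentation for your AIX level):
tprof -E PM_MRK_DFU_FIN -skex sleep 10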
For more information about this topic, see 2.4, “Related publications” on page 51.
2.3.7 Data prefetching using d-cache instructions and the Data Streams Control Register (DSCR)
The hardware data prefetch mechanism reduces the performance impact that is caused by the latency in retrieving cache lines from higher level caches and from memory. The data prefetch engine of the processor can recognize sequential data access patterns in addition to certain non-sequential (stride-N) patterns and initiate prefetching of d-cache lines from L2 and L3 cache and memory into the L1 d-cache to improve the performance of these storage reference patterns.
The Power ISA also provides cache instructions that supply hints to the prefetch engines, overriding the automatic stream detection capability of the data prefetcher. Cache instructions, such as dcbt and dcbtst, allow applications to specify the stream direction, prefetch depth, and number of units. These instructions can avoid the startup cost of the automatic stream detection mechanism.
The d-cache instructions dcbt (d-cache block touch) and dcbtst (d-cache block touch for store) affect the behavior of the prefetched lines. The syntax for the assembly language instructions is:61
dcbt RA, RB, TH
dcbtst RA, RB, TH
RA specifies a source general-purpose register for Effective Address (EA) computation.
RB specifies a source general-purpose register for EA computation.
TH indicates when a sequence of d-cache blocks might be needed.
The block that contains the byte addressed by the EA is fetched into the d-cache before the block is needed by the program. The program can later perform loads and stores from the block and might not experience the added delay that is caused by fetching the block into the cache.
The Touch Hint (TH) field is used to provide a hint that the program probably loads or stores to the storage locations specified by the Effective Address (EA) and the TH field. The hint is ignored for locations that are caching-inhibited or guarded. The encodings of the TH field depend on the target architecture that is selected with the -m flag or the .machine assembly language pseudo-op.
The range of values for the TH field is 0b01000 - 0b01111.
The dcbt and dcbtst instructions provide hints about a sequence of accesses to data elements, or indicate the expected use. Such a sequence is called a data stream, and a dcbt or dcbtst instruction in which TH is set to one of these values is said to be a data stream variant of dcbt or dcbtst.
A data stream to which a program can perform Load accesses is said to be a load data stream, and is described using the data stream variants of the dcbt instruction.
A data stream to which a program can perform Store accesses is said to be a store data stream, and is described using the data stream variants of the dcbtst instruction.
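Most applications do not encode dcbt and dcbtst directly; compilers expose them through built-in functions. As a minimal sketch (the function, the stride of eight, and the locality hint are illustrative assumptions, not from the source), GCC's __builtin_prefetch typically compiles to dcbt (read) or dcbtst (write) on Power:
#include <stddef.h>

/* Sum an array while hinting future loads to the prefetch engine. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Second argument: 0 = read (dcbt), 1 = write (dcbtst).
           Third argument: temporal locality hint, 0 - 3. */
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 3);
        sum += a[i];
    }
    return sum;
}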
The contents of the DSCR, a special-purpose register, affect how the data prefetcher responds to hardware-detected and software-defined data streams.
The layout of the DSCR register is:

Bits   55:57  58   59    60   61:63
Field  URG1   LSD  SNSE  SSE  DPFD

1 POWER7+ only
Where:
Bit 58 – LSD – Load Stream Disable
Disables hardware detection and initiation of load streams.
Bit 59 – SNSE – Stride-N Stream Enable
Enables hardware detection and initiation of load and store streams that have a stride greater than a single cache block. Such load streams are detected when LSD = 0, and such store streams are detected when SSE = 1.
Bit 60 – SSE – Store Stream Enable
Enables hardware detection and initiation of Store streams.
Bits 61:63 – DPFD – Default Prefetch Depth
Supplies a prefetch depth for hardware-detected streams and for software-defined streams for which a depth of zero is specified, or for which dcbt or dcbtst with TH=0b01010 is not used in their description.
Bits 55:57 – URG – Depth Attainment Urgency
This field is new in the POWER7+ processor. It indicates how quickly the prefetch depth should be reached for hardware-detected streams. The values and their meanings are as follows:
 – 0: Default
 – 1: Not urgent
 – 2: Least urgent
 – 3: Less urgent
 – 4: Medium
 – 5: Urgent
 – 6: More urgent
 – 7: Most urgent
The ability to enable or disable the three types of streams that the hardware can detect (load streams, store streams, or stride-N streams), or to set the default prefetch depth, allows empirical testing of any application. There are no simple rules for determining which settings are optimum overall for an application: the performance of prefetching depends on many characteristics of the application, in addition to the characteristics of the specific system and its configuration. Data prefetches are purely speculative. They can improve performance greatly when the prefetched data is in fact referenced later by the application, but they can also degrade performance by expending bandwidth on cache lines that are never referenced, or by displacing cache lines that are later referenced by the program.
Similarly, setting DPFD to a deeper depth tends to improve performance for data streams that are predominantly sourced from memory: the longer the latency to overcome, the deeper the prefetching must be to maximize performance. However, deeper prefetching also increases the possibility of stream overshoot, that is, prefetching lines beyond the end of the stream that are never referenced. Prefetching in multi-core processor implementations also has implications for other threads or processes that share cache (in SMT mode) or the same system bandwidth.
Controlling DSCR under Linux
DSCR settings on Linux are controlled with the ppc64_cpu command. Controlling the DSCR settings for an application is generally considered advanced, application-specific tuning.
Currently, setting the DSCR value is a cross-LPAR setting.
Controlling DSCR under AIX
Under AIX, DSCR settings can be controlled both through a programming API and from the command line, as follows:62,63
dscr_ctl() API
#include <sys/machine.h>
int dscr_ctl(int op, void *buf_p, int size)
Where:
op: Operation. Possible values are DSCR_WRITE, DSCR_READ, DSCR_GET_PROPERTIES, and DSCR_SET_DEFAULT.
buf_p: Pointer to an area of memory where the values are copied from (DSCR_WRITE) or copied to (DSCR_READ and DSCR_GET_PROPERTIES). For the DSCR_WRITE, DSCR_READ, and DSCR_SET_DEFAULT operations, buf_p must be a pointer to a 64-bit data area (long long *). For DSCR_GET_PROPERTIES, buf_p must be a pointer to a struct dscr_properties (defined in <sys/machine.h>).
size: Size in bytes of the area pointed to by buf_p.
Function:
The action that is taken depends on the value of the operation parameter that is defined in <sys/machine.h>:
DSCR_WRITE Stores a new value from the input buffer into the process context and in the DSCR.
DSCR_READ Reads the current value of DSCR and returns it in the output buffer.
DSCR_GET_PROPERTIES Reads the number of hardware streams that are supported by the platform, the platform (firmware) default Prefetch Depth and the Operating System default Prefetch Depth from kernel memory, and returns the values in the output buffer (struct dscr_properties defined in <sys/machine.h>).
DSCR_SET_DEFAULT Sets a 64-bit DSCR value in a buffer pointed to by buf_p as the operating system default. Returns the old default in the buffer pointed to by buf_p. Requires root authority. The new default value is used by all the processes that do not explicitly set a DSCR value using DSCR_WRITE. The new default is not permanent across reboots. For an operating system default prefetch depth that is permanent across reboots, use the dscrctl command, which adds an entry into the inittab to initialize the system-wide prefetch depth default value upon reboot (for a description of this command, see “The dscrctl command” on page 50).
Return values are:
0 if successful
-1 if an error is detected. In this case, errno is set to indicate the error. Possible values are:
EINVAL Invalid value for DSCR (DSCR_WRITE, DSCR_SET_DEFAULT).
EFAULT Invalid address that is passed to function.
EPERM Operation not permitted (DSCR_SET_DEFAULT by non-root user).
ENOTSUP Data streams that are not supported by platform hardware.
Symbolic values for the SSE and DPFD fields are defined in <sys/machine.h>:
DPFD_DEFAULT 0
DPFD_NONE 1
DPFD_SHALLOWEST 2
DPFD_SHALLOW 3
DPFD_MEDIUM 4
DPFD_DEEP 5
DPFD_DEEPER 6
DPFD_DEEPEST 7
DSCR_SSE 8
Here is a description of the dscr_properties structure in <sys/machine.h>:
struct dscr_properties {
uint version;
uint number_of_streams; /* Number of HW streams */
long long platform_default_pd; /* PFW default */
long long os_default_pd; /* AIX default */
long long dscr_res[5]; /* Reserved for future use */
};
Here is an example of using this API:
#include <sys/machine.h>
int rc;
long long dscr = DSCR_SSE | DPFD_DEEPER;
rc = dscr_ctl(DSCR_WRITE, &dscr, sizeof(dscr));
...
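Similarly, a program can inspect the current settings. The following sketch (the error handling is minimal and the printed labels are illustrative) reads the stream properties and the DSCR value that is in effect, using the dscr_properties fields described above:
#include <stdio.h>
#include <sys/machine.h>

int main(void)
{
    struct dscr_properties props;
    long long dscr = 0;

    /* Query the number of hardware streams and the default depths. */
    if (dscr_ctl(DSCR_GET_PROPERTIES, &props, sizeof(props)) == 0)
        printf("streams=%u platform_pd=0x%llx os_pd=0x%llx\n",
               props.number_of_streams,
               props.platform_default_pd,
               props.os_default_pd);

    /* Read the DSCR value currently in effect for this process. */
    if (dscr_ctl(DSCR_READ, &dscr, sizeof(dscr)) == 0)
        printf("current DSCR=0x%llx\n", dscr);

    return 0;
}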
A new process inherits the DSCR from its parent during a fork. This value is reset to the system default during exec.
When a thread is dispatched (starts running on a CPU), the value of the DSCR for the owning process is written into the DSCR. You do not need to save the value of the register in the process context when the thread is undispatched because the system call writes the new value both in the process context and in the DSCR.
When a thread runs dscr_ctl to change the prefetch depth for the process, the new value is written into the AIX process context and into the DSCR of the thread that is running the system call. If another thread in the process is concurrently running on another CPU, it starts using the new DSCR value only after the new value is reloaded from the process context area, after either an interrupt or a redispatch. This action can take as much as 10 ms (a clock tick).
The dscrctl command
The system administrator can use this command to read the current settings for the hardware streams mechanism and set a system wide value for the DSCR. The DSCR is privileged. It can be read or written only by the operating system.
To query the characteristics of the hardware streams on the system, run the following command:
dscrctl -q
Here is an example of this command:
# dscrctl -q
Current DSCR settings:
number_of_streams = 16
platform_default_pd = 0x5 (DPFD_DEEP)
os_default_pd = 0xd (DSCR_SSE | DPFD_DEEP)
To set the operating system default prefetch depth on the system temporarily (that is, for the current session) or permanently (that is, after each reboot), run the following command:
dscrctl [-n] [-b] -s <dscr_value>
The dscr_value is treated as a decimal number unless it starts with 0x, in which case it is treated as hexadecimal.
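For example, to set the operating system default prefetch depth to DPFD_DEEP (5) both immediately and at each subsequent reboot (the value 5 is illustrative), run the following command:
dscrctl -n -b -s 5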
To cancel a permanent setting of the operating system default prefetch depth at boot time, run the following command:
dscrctl -c
Applications that have predictable data access patterns, such as numerical applications that process arrays of data in a sequential manner, benefit from aggressive data prefetching. These applications should run with the default operating system prefetch depth, or with whichever settings are empirically found to be the most beneficial.
Applications that have highly unpredictable data access patterns, such as some transactional applications, can be negatively affected by aggressive data prefetching. The data that is prefetched is unlikely to be needed, and the prefetching uses system bandwidth and might displace useful data from the caches. Some WebSphere Application Server and DB2 workloads have this characteristic. In these cases, performance can be improved by disabling hardware prefetching by running the following command:
dscrctl -n -s 1
This system (partition) wide disabling is only appropriate if it is expected to benefit all of the applications that are running in the partition. However, the same effect can be achieved on an application-specific basis by using the programming API.
For more information about this topic, see 2.4, “Related publications” on page 51.
2.4 Related publications
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter:
AIX dscr_ctl API sample code, found at:
AIX Version 7.1 Release Notes, found at:
Refer to the section, The dscrctl command.
Application configuration for large pages, found at:
False Sharing, found at:
lwsync instruction, found at:
Multiprocessing, found at:
The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System, found at:
POWER6 Decimal Floating Point (DFP), found at:
POWER7 Processors: The Beat Goes On, found at:
Power Architecture ISA 2.06 Stride N prefetch Engines to boost Application's performance, found at:
https://www.power.org/documentation/whitepaper-on-stride-n-prefetch-feature-of-isa-2-06/ (registration required)
Power ISA Version 2.06 Revision B, found at:
Refer to the following sections:
 – Section 3.1: Program Priority Registers
 – Section 3.2: “or” Instruction
 – Section 4.3.4: Program Priority Register
 – Section 4.4.3: OR Instruction
 – Section 5.3.4: Program Priority Register
 – Section 5.4.2: OR Instruction
 – Book I – 4 Floating Point Facility
 – Book I – 5 Decimal Floating Point
 – Book I – 6 Vector Facility
 – Book I – 7 Vector-Scalar Floating Point Operations (VSX)
 – Book II – 4.2 Data Stream Control Register
 – Book II – 4.3.2 Data Cache Instructions
 – Book II – 4.4 Synchronization Instructions
 – Book II – A.2 Load and Reserve Mnemonics
 – Book II – A.3 Synchronize Mnemonics
 – Book II – Appendix B. Programming Examples for Sharing Storage
 – Book III – 5.7 Storage Addressing
PowerPC storage model and AIX programming: What AIX programmers need to know about how their software accesses shared storage, found at:
Refer to the following sections:
 – Power Instruction Set Architecture
 – Section 4.4.3 Memory Barrier Instructions – Synchronize
Product documentation for XL C/C++ for AIX, V12.1 (PDF format), found at:
Simple performance lock analysis tool (splat), found at:
Simultaneous Multithreading, found at:
splat Command, found at:
trace Daemon, found at:
What makes Apple's PowerPC memcpy so fast?, found at:
What programmers need to know about hardware prefetching?, found at:
 

1 Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
2 What’s New in the Server Environment of Power ISA v2.06, a white paper from Power.org, available at: https://www.power.org/documentation/whats-new-in-the-server-environment-of-power-isa-v2-06/ (registration required)
4 What’s New in the Server Environment of Power ISA v2.06, a white paper from Power.org, available at: https://www.power.org/documentation/whats-new-in-the-server-environment-of-power-isa-v2-06/ (registration required)
6 Ibid
7 Ibid
8 Ibid
9 Ibid
10 Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
12 Of NUMA on POWER7 in IBM i, available at:
13 Ibid
14 Ibid
15 Ibid
16 Ibid
17 Ibid
18 Ibid
19 Splitting Data Objects to Increase Cache Utilization (Preliminary Version, 9th October 1998), available at: http://www.ics.uci.edu/%7Efranz/Site/pubs-pdf/ICS-TR-98-34.pdf
20 Eliminate False Sharing, Stop your CPU power from invisibly going down the drain, available at: http://drdobbs.com/goparallel/article/showArticle.jhtml?articleID=217500206
21 Ibid
22 Ibid
25 Power Architecture ISA 2.06 Stride N prefetch Engines to boost Application's performance, available at: https://www.power.org/documentation/whitepaper-on-stride-n-prefetch-feature-of-isa-2-06/ (registration required)
28 Setting Very Low SMT priority requires only the Problem-State privilege on POWER7+ processors. The required privilege to set a particular SMT thread priority level is associated with the physical processor implementation that the LPAR is running on, and not the processor compatible mode. Therefore, setting Very Low SMT priority only requires user level privilege on POWER7+ processors, even when running in P6-, P6+-, or P7-compatible modes.
29 Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
30 Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
32 Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
33 Ibid
34 thread_set_smt_priority or thread_read_smt_priority System Call, available at:
35 PowerPC storage model and AIX programming: What AIX programmers need to know about how their software accesses shared storage, Michael Lyons, et al, available at: http://www.ibm.com/developerworks/systems/articles/powerpc.html
36 Ibid
38 sync (Synchronize) or dcs (Data Cache Synchronize) Instruction, available at: http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.aixassem/doc/alangref/sync.htm
39 PowerPC storage model and AIX programming: What AIX programmers need to know about how their software accesses shared storage, Michael Lyons, et al, available at: http://www.ibm.com/developerworks/systems/articles/powerpc.html
43 PowerPC storage model and AIX programming: What AIX programmers need to know about how their software accesses shared storage, Michael Lyons, et al, available at: http://www.ibm.com/developerworks/systems/articles/powerpc.html
44 Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
45 What’s New in the Server Environment of Power ISA v2.06, a white paper from Power.org, available at: https://www.power.org/documentation/whats-new-in-the-server-environment-of-power-isa-v2-06/ (registration required)
47 Vector Built-in Functions, available at:
http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp?topic=/com.ibm.xlc111.aix.doc/compiler_ref/vec_intrin_cpp.html
48 Vector Initialization, available at:
49 Engineering and Scientific Subroutine Library (ESSL) and Parallel ESSL, available at:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.essl.doc/esslbooks.html
50 What’s New in the Server Environment of Power ISA v2.06?, a white paper from Power.org, available at:
https://www.power.org/documentation/whats-new-in-the-server-environment-of-power-isa-v2-06/ (registration required)
51 Support for POWER7 processors, available at:
52 Vector Built-in Functions, available at:
53 Vector Initialization, available at:
56 Ibid
59 How to compile DFPAL?, available at: http://speleotrove.com/decimal/dfpal/compile.html
60 Ibid
61 Power ISA Version 2.06 Revision B, available at: http://power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
62 dscr_ctl subroutine, available at:
63 dscrctl command, available at: