Linux
This chapter describes the optimization and tuning of POWER7 processor-based servers running the Linux operating system. It covers Linux and system libraries (5.1) and related publications (5.2).
5.1 Linux and system libraries
This section contains information about Linux and system libraries.
5.1.1 Introduction
Linux is a solid choice for running enterprise-level workloads on IBM POWER7 processor-based servers, systems, and solutions. Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES) provide operating systems that are optimized and targeted for the Power Architecture. These operating systems run natively on the Power Architecture and are designed to take full advantage of the specialized features of Power Systems.
Both RHEL and SLES provide the tools, kernel support, optimized compilers, and tuned libraries for POWER7 systems. The Linux distributions deliver excellent performance, and additional application-specific and customer-specific tuning approaches are available. IBM provides a number of value-add packages, tools, and extensions that enable further tuning, optimizations, and products for the best possible performance on POWER7. The typical open source performance tools that Linux users are comfortable with are available on PowerLinux systems.
The Linux distributions run across the broad range of IBM Power offerings, from small Micro-Partitioning partitions on low-cost PowerLinux servers and Flex System nodes, up through the largest IBM Power 770 and Power 795 servers.
IBM premier products, such as the IBM XL compilers, IBM Java products, IBM WebSphere, and IBM DB2 database products, all provide Power-optimized support on the RHEL and SLES operating systems.
For more information about this topic, see 5.2, “Related publications” on page 106.
5.1.2 Linux operating system-specific optimizations
This section describes optimization methods specific to Linux.
GCC, toolchain, and IBM Advance Toolchain
This section describes 32-bit and 64-bit modes and CPU-tuned libraries.
Linux support for 32-bit and 64-bit modes
The compiler and run time fully support both 32-bit and 64-bit mode applications, which can run simultaneously on the same system. The compilers select the target mode through the -m32 or -m64 compiler options.
For the SUSE Linux Enterprise Server and Red Hat Enterprise Linux distributions, the shared libraries have both 32-bit and 64-bit versions. The toolchain (compiler, assembler, linker, and dynamic linker) selects the correct libraries based on the -m32 or -m64 option or the mode of the application program.
The Advance Toolchain defaults to 64-bit, as do SLES 11 and RHEL 6. Older distribution compilers defaulted to 32-bit.
Applications can use 32-bit and 64-bit execution modes, depending on their specific requirements, if their dependent libraries are available for the wanted mode.
The 32-bit mode is lighter with a simpler function call sequence and smaller footprint for stack and C++ objects, which can be important for some dynamic language interpreters and applications with many small functions.
The 64-bit mode has a larger footprint because of the larger pointer and general register size, which can be an asset when you handle large data structures or text data, where larger (64-bit) general registers are used for high bandwidth in the memory and string functions.
The handling of floating point and vector data is the same (register size, format, and instructions) for 32-bit and 64-bit modes. Therefore, for these applications, the key decision depends on the address space requirements. For 32-bit Power applications (32-bit mode applications that are running on 64-bit Power hardware with a 64-bit kernel), the address space is limited to 4 GB, which is the limit of a 32-bit address. 64-bit applications are currently limited to 16 TB of application program or data per process. This limitation is not a hardware one, but a restriction of the shared Linux virtual memory manager implementation. For applications with low latency response requirements, the larger 64-bit address space is a good trade-off because it can be used to avoid I/O latencies through memory mapped files or large local caches.
CPU-tuned libraries
If an application must support only one Power hardware platform (such as POWER7 and newer), then compiling the entire application with the appropriate -mcpu= and -mtune= compiler flags might be the best option.
For example, -mcpu=power7 allows the compiler to use all the new instructions, such as the Vector Scalar Extended category. The -mcpu=power7 option also implies -mtune=power7 if it is not explicitly set.
The -mcpu option controls the instruction mix that the compiler generates; the -mtune option controls how those instructions are scheduled (ordered) for the target processor.
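For example, a build that targets only POWER7 might compile each source file as follows (the source file name is illustrative):

$ gcc -O3 -mcpu=power7 -mtune=power7 -c compute_kernel.c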
Most applications do need to run on more than one platform, for example, in POWER6 mode and POWER7 mode. For applications that are composed of a main program and a set of shared libraries, or applications that spend significant execution time in other shared libraries (from the Linux run time or extra packages), you can create packages that automatically select the best optimization for each platform.
Linux also supports automatic CPU-tuned library selection. There are a number of implementation options for CPU-tuned library implementers, as described here. For more information, see Optimized Libraries.
The IBM Linux Technology Center works with the SUSE and Red Hat Linux distribution partners to provide some automatic CPU-tuned libraries for the C/POSIX runtime libraries. However, these libraries might not be available for all platforms or might not include the latest optimizations.
One advantage of the Advance Toolchain is that the runtime RPMs for the current release do include CPU-tuned libraries for all the currently supported POWER processors and the latest processor-specific optimization and capabilities, which are constantly updated. Additional libraries are added as they are identified. The Advance Toolchain run time can be used with either Advance Toolchain GCC or XL compilers and includes configuration files to simplify linking XL compiled programs with the Advance Toolchain runtime libraries.
These techniques are not restricted to system libraries, and can be easily applied to application shared library components. The dynamic code path and processor-tuned libraries are good starting points. With this method, the compiler and dynamic linker do most of the work. You need only some additional build time and extra media for the multiple library images.
In this example, the following conditions apply:
Your product is implemented in your own shared library, such as libmyapp.so.
You want to support Linux running on POWER5, POWER6, and POWER7 Systems.
DFP and Vector considerations:
 – Your oldest supported platform is POWER5, which does not have a DFP or the Vector unit.
 – POWER6 has DFP and a Vector Unit implementing the older VMX (vector float but no vector double) instructions.
 – POWER7 has DFP and the new Vector Scalar Extended (VSX) Unit (the original VMX instructions plus Vector Double and more).
 – Your application benefits greatly from both Hardware Decimal and high performance vector, but if you compile your application with -mcpu=power7 -O3, it does not run on POWER5 (no hardware DFP instructions) or POWER6 (no vector double instructions) machines.
You can optimize for all three Power platforms if you build and install your application and libraries correctly by completing the following steps:
1. Build the main application binary file and the default version of libmyapp.so for the oldest supported platform (in this case, use -mcpu=power5 -O3). You can still use decimal data because the Advance Toolchain and the newest SLES11 and RHEL6 include a DFP emulation library and run time.
2. Install the application (myapp) into the appropriate ./bin directory and libmyapp.so into the appropriate ./lib64 directory. The following paths provide the application main and default run time for your product:
 – /opt/ibm/myapp1.0/bin/myapp
 – /opt/ibm/myapp1.0/lib64/libmyapp.so
3. Compile and link libmyapp.so with -mcpu=power6 -O3, which enables the compiler to generate DFP and VMX instructions for POWER6 machines.
4. Install this version of libmyapp.so into the appropriate ./lib64/power6 directory.
For example:
/opt/ibm/myapp1.0/lib64/power6/libmyapp.so
5. Compile and link the fully optimized version of libmyapp.so for POWER7 with -mcpu=power7 -O3, which enables the compiler to generate DFP and all the VSX instructions. Install this version of libmyapp.so into the appropriate ./lib64/power7 directory. For example:
/opt/ibm/myapp1.0/lib64/power7/libmyapp.so
By running some extra builds, your myapp1.0 is fully optimized for the current and N-1/N-2 Power hardware releases, as sketched in the example that follows. When you start your application with the appropriate LD_LIBRARY_PATH (including /opt/ibm/myapp1.0/lib64), the dynamic linker automatically searches the subdirectories under the library path for names that match the current platform (POWER5, POWER6, or POWER7). If the dynamic linker finds the shared library in the subdirectory with the matching platform name, it loads that version; otherwise, the dynamic linker looks in the base lib64 directory and uses the default implementation. This process continues for all directories in the library path and recursively for any dependent libraries.
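The following sketch summarizes the build and install sequence. The compiler invocations, source file name, and copy commands are illustrative; only the -mcpu options and the install directories come from the steps above:

$ gcc -shared -fPIC -mcpu=power5 -O3 -o libmyapp.so myapp_lib.c
$ cp libmyapp.so /opt/ibm/myapp1.0/lib64/libmyapp.so
$ gcc -shared -fPIC -mcpu=power6 -O3 -o libmyapp.so myapp_lib.c
$ mkdir -p /opt/ibm/myapp1.0/lib64/power6
$ cp libmyapp.so /opt/ibm/myapp1.0/lib64/power6/libmyapp.so
$ gcc -shared -fPIC -mcpu=power7 -O3 -o libmyapp.so myapp_lib.c
$ mkdir -p /opt/ibm/myapp1.0/lib64/power7
$ cp libmyapp.so /opt/ibm/myapp1.0/lib64/power7/libmyapp.so
$ export LD_LIBRARY_PATH=/opt/ibm/myapp1.0/lib64
$ /opt/ibm/myapp1.0/bin/myapp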
Using the Advance Toolchain
The latest Advance Toolchain compilers and run time can be downloaded from the IBM Advance Toolchain download site.
The latest Advance Toolchain releases (starting with Advance Toolchain 5.0) add multi-core runtime libraries that enable you to take advantage of multiple cores at the application level. The toolchain currently includes a Power port of the open source version of Intel Threading Building Blocks, the Concurrent Building Blocks software transactional memory library, and the userspace RCU (URCU) library (the application-level version of the Linux kernel’s Read-Copy-Update concurrent programming technique). Additional libraries are added to the Advance Toolchain run time as needed and as resources allow.
Linux on Power Enterprise Distributions default to a 64 KB base page size, so most applications automatically benefit from larger pages. Larger (16 MB) pages are best used through the libhugetlbfs API. Large pages can be used to back shared memory, malloc storage, and (main) program text and data segments (using large pages for shared library text or data is not supported currently).
Tuning and optimizing malloc
Methods for tuning and optimizing malloc are described in this section.
Linux malloc
Generally, tuning malloc on Linux systems is application-specific.
Improving malloc performance
Linux is flexible regarding the system and application tuning of malloc usage.
By default, Linux manages malloc memory to balance the ability to reuse the memory pool against the range of default sizes of memory allocation requests. Small chunks of memory are managed on the sbrk heap. This sbrk heap is labeled as [heap] in /proc/self/maps.
When you work with Linux memory allocation, there are a number of tunables available to users. These tunables are implemented in the GNU C library malloc (malloc.c). Our examples (“Malloc environment variables” on page 101 and “Linux malloc considerations” on page 102) show two of the key tunables, which force large memory allocations away from mmap and onto the program heap, which is extended with the sbrk system call.
When you control memory for applications, the Linux malloc implementation automatically chooses between extending the heap with the sbrk system call or using mmap regions. Mmap regions are typically used for larger memory chunks. When mmap is used for large mallocs, the kernel must zero the newly mapped chunk of memory.
Malloc environment variables
Users can define environment variables to control the tunables for a program. The environment variables that are shown in the following examples caused a significant performance improvement across several real-life workloads.
To disable the usage of mmap for mallocs (which includes Fortran allocates), set the max value to zero:
MALLOC_MMAP_MAX_=0
To disable the trim threshold, set the value to negative one:
MALLOC_TRIM_THRESHOLD_=-1
Trimming and using mmap are two different ways of releasing unused memory back to the system. Disabling both changes the normal behavior of malloc across C and Fortran programs, which in some cases can change the performance characteristics of the program. To compare the default behavior with the tuned behavior, run one of the following commands:
# ./my_program
# MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 ./my_program
Depending on your application's behavior regarding memory and data locality, this change might do nothing, or might result in performance improvement.
Linux malloc considerations
The Linux GNU C run time includes a default malloc implementation that is optimized for multi-threading and medium-sized allocations. For smaller allocations (less than the MMAP_THRESHOLD), the default malloc implementation allocates blocks of storage with sbrk(), called arenas, which are then suballocated for smaller malloc requests. Larger allocations (greater than MMAP_THRESHOLD) are allocated by an anonymous mmap, one per request.
The default values are listed here:
DEFAULT_MXFAST 64 (for 32-bit) or 128 (for 64-bit)
DEFAULT_TRIM_THRESHOLD 128 * 1024
DEFAULT_TOP_PAD 0
DEFAULT_MMAP_THRESHOLD 128 * 1024
DEFAULT_MMAP_MAX 65536
Storage within arenas can be reused without kernel intervention. The default malloc implementation uses trylock techniques to detect contentions between POSIX threads, and then tries to assign each thread its own arena. This action works well when the same thread frees storage that it allocates, but it does result in more contention when malloc storage is passed between producer and consumer threads. The default malloc implementation also tries to use atomic operations and more granular and critical sections (lock and unlock) to enhance parallel thread execution, which is a trade-off for better multi-thread execution at the expense of a longer malloc path length with multiple atomic operations per call.
Large allocations (greater than MMAP_THRESHOLD) require a kernel syscall for each malloc() and free(). The Linux Virtual Memory Management (VMM) policy does not allocate any real memory pages to an anonymous mmap() until the application touches those pages. The benefit of this policy is that real memory is not allocated until it is needed. The downside is that, as the application begins to populate the new allocation with data, it experiences multiple page faults, one on the first touch of each page, to allocate and zero fill that page. The processing cost is therefore paid when the memory is first touched, rather than when the original mmap is done. In addition, this first touch timing can affect the NUMA placement of each memory page.
Such storage is unmapped by free(), so each new large malloc allocation starts with a flurry of page faults. This situation is partially mitigated by the larger (64 KB) default page size of the Red Hat Enterprise Linux and SUSE Linux Enterprise Server on Power Systems; there are fewer page faults than with 4 KB pages.
Malloc tuning parameters
The default malloc implementation provides a mallopt() API to allow applications to adjust some tuning parameters. For some applications, it might be useful to adjust the MMAP_THRESHOLD, TOP_PAD, and MMAP_MAX limits. Increasing MMAP_THRESHOLD so that most (application) allocations fall below that threshold reduces syscall and page fault impact, and improves application start time. However, this situation can increase fragmentation within the arenas and sbrk() storage. Fragmentation can be mitigated to some extent by also increasing TOP_PAD, which is the extra memory that is allocated for each sbrk().
Reducing MMAP_MAX, which is the maximum number of chunks that can be allocated with mmap(), can also limit the use of mmap(); setting MMAP_MAX to 0 disables mmap() allocations entirely. Reducing MMAP_MAX does not always solve the problem, because the run time reverts to mmap() allocations if sbrk() storage, which is the gap between the end of program static data (bss) and the first shared library, is exhausted.
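Applications can also set these limits programmatically through the mallopt() API. The following sketch shows the glibc calls; the values are illustrative only and must be chosen from the application's allocation profile:

#include <malloc.h>   /* mallopt() and the M_* tuning parameters */
#include <stdlib.h>

int main(void)
{
    /* Illustrative value: raise the mmap threshold so that most allocations
       are served from the sbrk arenas instead of anonymous mmap(). */
    mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);   /* 4 MB */

    /* Pad each sbrk() request to reduce future sbrk() calls and fragmentation. */
    mallopt(M_TOP_PAD, 1024 * 1024);              /* 1 MB */

    /* Optionally disable mmap-backed allocations entirely. */
    mallopt(M_MMAP_MAX, 0);

    void *p = malloc(8 * 1024 * 1024);   /* now satisfied without mmap() */
    free(p);
    return 0;
}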
Linux malloc and memory tools
There are several readily available tools in the Linux open source community:
The heap profiler that is used at Google to explore how C++ programs manage memory, which is part of the gperftools package.
Massif, the Valgrind heap profiler.
For more information about tuning malloc parameters, see Malloc Tunable Parameters in the GNU C Library documentation.
Thread-caching malloc (TCMalloc)
Under some circumstances, an alternative malloc implementation can prove beneficial for improving application performance. Packaged as part of Google's Perftools package (http://code.google.com/p/gperftools/?redir=1), and in the Advance Toolchain 5.0.4 release, this specialized malloc implementation can improve performance across a number of C and C++ applications.
TCMalloc uses a thread-local cache for each thread and moves objects from the memory heap into the local cache as needed. Small objects (less than 32 KB) are mapped into allocatable size classes. A thread cache contains a singly linked list of free objects per size class. Large objects are rounded up to a page size (4 KB) and handled by a central page heap, which is an array of linked lists.
For more information about how TCMalloc works, see TCMalloc: Thread-Caching Malloc in the gperftools documentation.
The TCMalloc implementation is part of the gperftools project.
Usage
To use TCMalloc, link TCMalloc into your application with the -ltcmalloc linker flag:
$ gcc [...] -ltcmalloc
You can also use TCMalloc in applications that you did not compile yourself by using LD_PRELOAD as follows:
$ LD_PRELOAD="/usr/lib/libtcmalloc.so" ./my_program
These examples assume that the TCMalloc library is in /usr/lib. With the Advance Toolchain 5.0.4, the 32-bit and 64-bit libraries are in /opt/at5.0/lib and /opt/at5.0/lib64.
Using TCMalloc with hugepages
To use large pages with TCMalloc, complete the following steps:
1. Set the environment variables for libhugetlbfs.
2. Allocate the number of large pages from the system.
3. Set up the libhugetlbfs mount point.
4. Monitor large pages usage.
TCMalloc backs only the heap allocations with large pages.
Here is a more detailed version of these steps:
1. Set the environment variables for libhugetlbfs by running the following commands:
 – # export TCMALLOC_MEMFS_MALLOC_PATH=/libhugetlbfs/
 – # export HUGETLB_ELFMAP=RW
 – # export HUGETLB_MORECORE=yes
Where:
 – TCMALLOC_MEMFS_MALLOC_PATH=/libhugetlbfs/ defines the libhugetlbfs mount point.
 – HUGETLB_ELFMAP=RW allocates both RSS and BSS (text/code and data) segments on the large pages, which is useful for codes that have large static arrays, such as Fortran programs.
 – HUGETLB_MORECORE=yes makes heap usage on the large pages.
2. Allocate the number of large pages from the system by running one of the following commands:
 – # echo N > /proc/sys/vm/nr_hugepages
 – # echo N > /proc/sys/vm/nr_overcommit_hugepages
Where:
 – N is the number of large pages to be reserved. A peak usage of 4 GB by your program requires 256 large pages (4096/16).
 – nr_hugepages is the static pool. The kernel reserves N * 16 MB of memory from the static pool to be used exclusively by the large pages allocation.
 – nr_overcommit_hugepages is the dynamic pool. The kernel sets a maximum usage of N large pages and dynamically allocates or deallocates these large pages.
3. Set up the libhugetlbfs mount point by running the following commands:
 – # mkdir -p /libhugetlbfs
 – # mount -t hugetlbfs hugetlbfs /libhugetlbfs
4. Monitor large pages usage by running the following command:
# cat /proc/meminfo | grep Huge
This command produces the following output:
HugePages_Total:
HugePages_Free:
HugePages_Rsvd:
HugePages_Surp:
Hugepagesize:
Where:
 – HugePages_Total is the total pages that are allocated on the system for LP usage.
 – HugePages_Free is the number of large pages that are still free (not yet allocated).
 – HugePages_Rsvd is the total of large pages that are reserved but not used.
 – Hugepagesize is the size of a single LP.
You can monitor large pages by NUMA nodes by running the following command:
# watch -d grep Huge /sys/devices/system/node/node*/meminfo
MicroQuill SmartHeap
MicroQuill SmartHeap is an optimized malloc that is used in published SPECcpu2006 results to optimize performance on selected benchmark components. For more information, see SmartHeap for SMP: Does your app not scale because of heap contention? from MicroQuill.
Large TOC -mcmodel=medium optimization
The Linux Application Binary Interface (ABI) on the Power Architecture is enhanced to optimize larger programs. This ABI both simplifies an application build and improves overall performance.
Previously, the TOC (-mfull-toc) defaulted to a single instruction access form that restricts the total size of the TOC to 64 KB. This configuration can cause large programs to fail at compile or link time. Previously, the only effective workaround was to compile with the -mminimal-toc option (which provides a private TOC for each source file). The minimal TOC strategy adds a level of indirection that can adversely impact performance.
The -mcmodel=medium option extends the range of the TOC addressing to +/-2 GB. This setup eliminates most TOC-related build issues. Also, as the Linux ABI TOC includes Global Offset Table (GOT) and local data, you can enable a number of compiler- and linker-based optimizations, including TOC pointer relative addressing for local static and constant data. This setup eliminates a level of indirection and improves the performance of large programs.
Currently, this optimization is only available with Advance Toolchain 4.0 and later.
The medium and large code models are 64-bit only. Advance Toolchain users must remove old -mminimal-toc options.
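For example, with a compiler that supports this option, a large 64-bit module might be built as follows (the source file name is illustrative):

$ gcc -m64 -O3 -mcmodel=medium -c large_module.c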
Selecting different SMT modes
Linux enables Power SMT capabilities. By default, the system runs at the highest SMT level. The ppc64_cpu command can be used to force the system kernel to use lower SMT levels (ST or SMT2 mode).
For example:
ppc64_cpu --smt=1 Sets the SMT mode to ST.
ppc64_cpu --smt Shows the current SMT mode.
When you run in these modes, the logical processor numbering does not change. However, in SMT2 mode, only the first two of every four logical processors are available (0, 1, 4, 5, and so on), and in ST mode, only every fourth logical processor is available (0, 4, 8, and so on).
High SMT modes are best for maximizing total system throughput, while lower SMT modes might be appropriate for high performance threads and low latency applications. For code with low levels of instruction-level parallelism (often seen in Java code, for example), high SMT modes are generally preferred.
The setaffinity API allows processes and threads to have affinity to specific logical processors.
For more information about this topic, see 5.2, “Related publications” on page 106.
Using setaffinity to bind to specific logical processors
The setaffinity API allows processes and threads to have affinity to specific logical processors. The number and numbering of logical processors is a product of the number of processor cores (in the partition) and the SMT capability of the machine (four-way SMT for POWER7).
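A minimal sketch of binding the calling thread with the Linux sched_setaffinity() call follows; the logical processor number is illustrative:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(4, &mask);   /* illustrative: bind to logical processor 4 */

    /* A pid of 0 applies the mask to the calling thread. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* From this point, the code runs only on logical processor 4. */
    return 0;
}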
Linux Completely Fair Scheduler
Java applications scale better on Linux in some cases if the sched_compat_yield scheduler tunable is set to 1 by running the following command:
sysctl -w kernel.sched_compat_yield=1
5.2 Related publications
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter:
Red Hat Enterprise Linux 6 Performance Tuning Guide, Optimizing subsystem throughput in Red Hat Enterprise Linux 6, Edition 3.0.
SUSE Linux Enterprise Server System Analysis and Tuning Guide (Version 11 SP2).
 