Compilers and optimization tools for C, C++, and Fortran
This chapter describes the optimization and tuning of POWER7 processor-based servers by using compilers and tools.
6.1 Compiler versions and optimization levels
The IBM XL compilers are updated periodically to improve application performance and add processor-specific tuning and capabilities. The XLC11/XLF13 compilers for AIX and Linux are the first versions to include POWER7 capabilities, and are the preferred versions for projects that target current generation systems. The newer XLC12/XLF14 compilers provide further performance improvements, and are preferred for template-heavy C++ code.
The enterprise Linux distributions (RHEL6.1 with GCC 4.4 and SLES11/SP1 with GCC 4.3) include GCC compilers with POWER7 enablement (through the -mcpu and -mtune options), but do not include the latest high-order optimizations. For the GNU GCC, G++, and gfortran compilers on Linux, the IBM Advance Toolchain 4.0 (GCC 4.5) and 5.0 (GCC 4.6) releases are preferred for POWER7. XLF is preferred over gfortran for its strong floating-point performance.
For all production codes, it is imperative to enable a minimum level of compiler optimization by adding the -O option for the XL compilers, or -O2 for the GNU compilers (-O3 is the preferred level). Without optimization, the compiler focuses on faster compilation and debuggability, and generates code that performs poorly at run time. In practice, many projects set up a dual build environment: a development build without optimization for use during development and debugging, and a production build with optimization for performance verification and production delivery.
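A minimal sketch of such a dual build (the file and program names are illustrative, and the exact flags can be adjusted per project):
xlc -g -o app_dev app.c        (development build: unoptimized, easiest to debug)
xlc -O -o app app.c            (production build: minimum recommended optimization)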
For projects with an increased focus on runtime performance, take advantage of the more advanced compiler optimizations. For numerical or compute-intensive codes, the XL compiler options -O3 or -qhot -O3 enable loop transformations, which improve program performance by restructuring loops to make their execution more efficient on the target system. These aggressive transformations can sometimes cause minor differences in the precision of floating point computations. If that is a concern, the original program semantics can be fully recovered with the -qstrict option.
For GCC, the minimum suggested level of optimization is -O3. The GCC default is a strict mode, but the -ffast-math option disables strict mode. The -Ofast option combines -O3 with -ffast-math in a single option. Other important options include -fpeel-loops, -funroll-loops, -ftree-vectorize, -fvect-cost-model, and -mcmodel=medium.
By default, these compilers generate code that runs on various Power Systems. Add options to exclude older processor chips that the target application does not need to support. This configuration can enable better code generation because the compiler takes advantage of capabilities that are not available on those older systems.
There are two major XL compiler options to control this support:
-qarch: Indicates the oldest processor chip generation that the binary file supports.
-qtune: Indicates the processor chip generation of most interest for performance.
For example, for an application that must run on POWER6 systems, but for which most users are on a POWER7 system, the appropriate combination is -qarch=pwr6 -qtune=pwr7. For an application that must run well across both POWER6 and POWER7 Systems in current common usage, consider using -qtune=balanced.
On GCC, the equivalent options are -mcpu and -mtune. So, for an application that must run on POWER6, but which is usually run on POWER7, the options are -mcpu=power6 and -mtune=power7.
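For example (the source file name is illustrative), the following commands build objects that run on POWER6 but are tuned for POWER7:
xlc -O -qarch=pwr6 -qtune=pwr7 -c app.c
gcc -O3 -mcpu=power6 -mtune=power7 -c app.c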
The POWER7 processor supports the VSX instruction set, which improves performance for numerical applications over regular data sets. These features can increase the performance of some computations, and can be used either manually through the Altivec vector extensions, or automatically by the XL compiler with the -qarch=pwr7 -qhot -O3 -qsimd options.
The GCC compiler equivalents are the -maltivec and -mvsx options, which you should combine with -ftree-vectorize and -fvect-cost-model. On GCC, the combination of -O3 and -mcpu=power7 implicitly enables Altivec and VSX code generation with auto-vectorization (-ftree-vectorize) and -mpopcntd. Other important options include -mrecip=rsqrt and -mveclibabi=mass (which require -ffast-math or -Ofast to be effective). If the compiler uses optimizations that depend on the MASS libraries, the link command must explicitly name the MASS library directories and library names.
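As an illustration, the following loop is a typical candidate for automatic VSX vectorization. The function and file names are illustrative, and the option sets shown are examples rather than required combinations:

void axpy(double *__restrict__ y, const double *__restrict__ x, double a, int n)
{
    int i;
    /* independent iterations over contiguous data: a good candidate for VSX */
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}

xlc -O3 -qhot -qsimd -qarch=pwr7 -qtune=pwr7 -c axpy.c
gcc -O3 -mcpu=power7 -ftree-vectorize -fvect-cost-model -c axpy.c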
For more information about this topic, see 6.4, “Related publications” on page 123.
6.2 Advanced compiler optimization techniques
This section describes some of the more advanced compiler optimization techniques.
6.2.1 Common prerequisites
Compiler analysis and transformations improve runtime performance by changing the translation of the program source into assembly code. Changes in these translations might cause the application to behave differently, possibly even causing it to produce incorrect results.
Compilers follow rules and assumptions that are part of the programming language to perform this transformation. If the programmer breaks some of these rules, the application might misbehave, and it might do so only at higher optimization levels, where the problem is more difficult to diagnose.
To put this situation into perspective, imagine a C program with the declaration "int a[4], b, c;". These variables are normally placed contiguously in memory. If the program runs a statement of the form a[4]=0, this statement breaks the language rules, but if variable b is unused, the statement might merely overwrite variable b and the program might continue to behave correctly. However, if, at a higher optimization level, the compiler eliminates variable b because it determines b is unused, the incorrect statement might overwrite variable c instead, triggering a runtime failure.
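The following small program illustrates the point. A contiguous layout is assumed here only for illustration; the behavior is undefined and varies by compiler and optimization level:

#include <stdio.h>

int a[4], b, c;              /* often placed contiguously in memory */

int main(void)
{
    c = 42;
    a[4] = 0;                /* out of bounds: may clobber b, or c if b is eliminated */
    printf("c = %d\n", c);   /* may print 0 instead of 42 at higher optimization */
    return 0;
}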
It is critical, then, to eliminate programming errors as higher optimization levels are applied. Testing the application thoroughly without optimization is a good initial step, but it is not sufficient on its own. The application must be tested at the optimization level that is used in production.
6.2.2 XL compiler family
The XL compilers provide several facilities that help identify programming errors and guide optimization.
Prerequisites
The XL compilers assist with identifying certain programming errors that are outlined in 6.2.1, “Common prerequisites” on page 109:
Static analysis/warnings: The XL compilers can identify suspicious code constructs, and provide some information about these constructs through the -qinfo=all option. You should examine the output of this option to identify suspicious code constructs and validate that the constructs are correct.
Runtime analysis or warnings: The XL compilers can make the application perform runtime checks to validate program correctness by using the -qcheck option. This option triggers a program abort when an error condition (such as a null pointer dereference or an out-of-bounds array access) occurs, making the problem much easier to identify. The option has a significant performance cost, so use it only during functional verification, not in a production environment.
Aliasing compliance: The C, C++, and Fortran languages specify rules that govern the access of data through overlapping pointers. Optimization techniques exploit these rules aggressively, and breaking them can lead to incorrect results. The compiler can be instructed not to take advantage of these rules, at a cost in runtime performance, which can be useful for older code that was written without following them. The options to request this behavior are -qalias=noansi for C/C++ and -qalias=nostd for Fortran (see the sketch after this list).
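A minimal sketch of the kind of aliasing violation these options accommodate (the names are illustrative):

#include <stdio.h>

/* Reads the bit pattern of a float through an int pointer. This violates the
   standard aliasing rules; it is only safe to compile with -qalias=noansi (XL)
   or -fno-strict-aliasing (GCC). */
static int float_bits(float *f)
{
    return *(int *)f;
}

int main(void)
{
    float x = 1.0f;
    printf("0x%08x\n", (unsigned)float_bits(&x));
    return 0;
}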
High-order transformations
The XL compilers have sophisticated optimizations to improve the performance of numeric applications. These applications often contain regular loops that process large amounts of data. The high-order transformations (HOT) optimizations in these compilers analyze these loops, identify opportunities for restructuring them to improve cache usage, improve data reuse, and expose more instruction-level parallelism to the hardware. For these types of applications, the performance impact of this option can be substantial.
There are two levels of aggressiveness to the HOT optimization framework in these compilers:
Level 0, which is the default at optimization level -O3, performs a minimal amount of loop optimization, focusing on simple opportunities while it minimizes compilation time.
Level 1, which is the default at optimization levels -O4 and up, performs full loop analysis and transformation of loops, and is preferred for numerical applications.
The HOT optimizations can be explicitly requested through the -qhot=level=0 and -qhot=level=1 options.
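For example, a numeric source file (the name is illustrative) can be compiled with full loop analysis as follows:
xlc -O3 -qhot=level=1 -c compute.c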
OpenMP
The OpenMP API is an industry specification for shared-memory parallel programming. The latest XL Compilers provide a full implementation of the OpenMP 3.0 specification in C, C++, and Fortran. You can program with OpenMP to capitalize on the incremental introduction of parallelism in an existing application by adding pragmas or directives to specify how the application can be parallelized.
For applications with available parallelism, OpenMP can provide a simple solution for parallel programming, without requiring low-level thread manipulation. The OpenMP implementation on the XL compilers is available by using the -qsmp=omp option.
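A minimal sketch of this incremental style follows; the function and file names are illustrative, and xlc_r is the thread-safe XL invocation on AIX:

#include <omp.h>

void scale(double *a, double s, int n)
{
    int i;
    #pragma omp parallel for    /* iterations are divided among threads */
    for (i = 0; i < n; i++)
        a[i] *= s;
}

xlc_r -qsmp=omp -O3 -c scale.c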
Whole-program analysis
Traditional compiler optimizations operate independently on each application source file. Inter-procedural optimizations operate at the whole-program scope, using the interaction between parts of the application on different source files. It is often effective for large-scale applications that are composed of hundreds or thousands of source files.
On the XL compilers, these capabilities are accessed by using the -qipa option. It is also implied when you use optimization levels -O4 and -O5. In this phase, the compiler saves a high-level representation of the program in the object files during compilation, and reoptimizes it at the whole-program scope during the link phase. For this situation to occur, the compiler driver must be used to link the resulting binary, instead of invoking the system linker directly.
Whole-program analysis (IPA) is especially effective on programs that use so many global variables that they overflow the default AIX TOC (Table of Contents) limit. If the application requires the -bbigtoc option to link successfully on AIX, it is likely a good candidate for IPA optimization.
There are three levels of IPA optimization on the XL compilers (0, 1, and 2). By default, -qipa implies -qipa=level=1, which performs basic program restructuring. For more aggressive optimization, apply -qipa=level=2, which performs full program restructuring during the link step, although the time that it takes to complete the link step can increase significantly.
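A minimal sketch, assuming two source files a.c and b.c; the final link goes through the compiler driver so that whole-program reoptimization can run:
xlc -O5 -c a.c
xlc -O5 -c b.c
xlc -O5 -o program a.o b.o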
Optimization that is based on Profile Directed Feedback
Profile-based optimization allows the compiler to collect information about the program behavior and use that information when making code generation decisions. It involves compiling the program twice: first, to generate an instrumented version of the application that collects program behavior data when run, and a second time to generate an optimized binary file by using the information that is collected by running the instrumented binary through a set of typical inputs for the application.
Profile-based optimization in the XL compiler is accessed through the -qpdf1 and -qpdf2 options, on top of -O or higher optimization levels. The instrumented binary file is generated by using -qpdf1 on top of all other options, and the resulting binary file writes the profile data to a file, named ._pdf by default.
The Profile Directed Feedback (PDF) framework on the XL compilers is built on top of the IPA infrastructure, with -qpdf1 and -qpdf2 implying -qipa=level=0. For the PDF2 step, it is possible to reuse the object files from the -qpdf1 compilation step, and relink only the application with the -qpdf2 option.
For PDF optimizations to be successful, the instrumented workload must be run with common workloads that reflect common usage of the application. Use multiple workloads that can exercise the program in different ways. The data for all instrumentation runs are aggregated into a single PDF file and used during optimization.
For the PDF profile data to be written out at the end of execution, the program must either implicitly or explicitly call the exit() library subroutine. Using exit() causes the code that is introduced by the PDF instrumentation to run and write out the PDF profile data. In contrast, calling the _exit() system call skips the writing of the PDF profile data file, which results in incomplete or missing profile data.
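A minimal sketch of the two-pass PDF build, assuming source files a.c and b.c and two representative inputs, sample1 and sample2:
xlc -qpdf1 -O3 -c a.c
xlc -qpdf1 -O3 -c b.c
xlc -qpdf1 -O3 -o program a.o b.o
program < sample1
program < sample2
xlc -qpdf2 -O3 -o program a.o b.o
The last command relinks only; the object files from the -qpdf1 step are reused.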
6.2.3 GCC compiler family
The information in this section applies specifically to the GCC compiler family.
Prerequisites
The GCC compiler assists with identifying certain programming errors that are outlined in 6.2.1, “Common prerequisites” on page 109:
Static analysis and warnings: The -pedantic and -pedantic-errors options warn of violations of the ISO C or ISO C++ standards.
The language standard to enforce and the aliasing compliance requirements are specified by the -std, -ansi, and -fno-strict-aliasing options. For example:
 – ISO C 1990 level: -std=c89, -std=iso9899:1990, and -ansi
 – ISO C 1999 level: -std=c99 and -std=iso9899:1999
 – Do not assume strict aliasing rules for the language level: -fno-strict-aliasing
The GCC compiler documentation contains more details about these options.1, 2, 3
High-order transformations
The GCC compilers have additional sophisticated optimizations beyond -O3 that improve the performance of numeric applications. These applications often contain regular loops that process large amounts of data. When enabled, these optimizations analyze the loops, identify opportunities for restructuring them to improve cache usage, improve data reuse, and expose more instruction-level parallelism to the hardware. For these types of applications, the performance impact of these options can be substantial. The key compiler options include (a combined example follows the list):
-fpeel-loops
-funroll-loops
-ftree-vectorize
-fvect-cost-model
-mcmodel=medium
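A minimal sketch that combines these options for a numeric source file (the name is illustrative):
gcc -O3 -mcpu=power7 -mtune=power7 -fpeel-loops -funroll-loops -ftree-vectorize -fvect-cost-model -mcmodel=medium -c compute.c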
Specifying the -mveclibabi=mass option and linking to the MASS libraries enables more loops for -ftree-vectorize. The MASS libraries support only static archives for linking, and so they require explicit naming and library search order for each platform/mode:
POWER7 32-bit: -L<MASS-dir>/lib -lmassvp -lmass_simdp7 -lmass -lm
POWER7 64-bit: -L<MASS-dir>/lib64 -lmassvp_64 -lmass_simdp7_64 -lmass_64 -lm
POWER6 32-bit: -L<MASS-dir>/lib -lmassvp6 -lmass -lm
POWER6 64-bit: -L<MASS-dir>/lib64 -lmassvp6_64 -lmass_64 -lm
ABI improvements
The -mcmodel={medium|large} option implements important ABI improvements that are further optimized in hardware for future generations of the POWER processor. This optimization extends the TOC to 2 GB and eliminates the previous requirement for -mminimal-toc or multi-TOC switching within a single program or library. The default for newer GCC compilers (including Advance Toolchain 4.0 and later) is -mcmodel=medium. This model logically extends the TOC to include local static data and constants, and allows direct data access relative to the TOC pointer.
OpenMP
The OpenMP API is an industry specification for shared-memory parallel programming. The current GCC compilers, starting with GCC 4.4 (Advance Toolchain 4.0+), provide a full implementation of the OpenMP 3.0 specification in C, C++, and Fortran. Programming with OpenMP allows you to benefit from the incremental introduction of parallelism in an existing application by adding pragmas or directives to specify how the application can be parallelized.
For applications with available parallelism, OpenMP can provide a simple solution for parallel programming, without requiring low-level thread manipulation. The GNU OpenMP implementation on the GCC compilers is available under the -fopenmp option. GCC also provides auto-parallelization under the -ftree-parallelize-loops option.
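A minimal sketch of both approaches (the file names are illustrative):
gcc -fopenmp -O3 -o program program.c
gcc -O3 -ftree-parallelize-loops=4 -o program program.c
The first command honors the OpenMP pragmas in the source; the second asks GCC to parallelize suitable loops automatically across four threads.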
Whole-program analysis
Traditional compiler optimizations operate independently on each application source file. Inter-procedural optimizations operate at the whole-program scope, using the interaction between parts of the application on different source files. It is often effective for large-scale applications that are composed of hundreds or thousands of source files.
Starting with GCC- 4.6 (Advance Toolchain 5.0), there is the Link Time Optimization (LTO) feature. LTO allows separate compilation of multiple source files but saves additional (abstract program description) information in the resulting object file. Then, at application link time, the linker can collect all the objects (with additional information) and pass them back to the compiler (GCC) for whole program IPA and final code generation.
The GCC LTO feature is enabled on the compile and link phases by the -flto option. A simple example follows:
gcc -flto -O3 -c a.c
gcc -flto -O3 -c b.c
gcc -flto -o program a.o b.o
Additional options that can be used with -flto include:
-flto-partition={1to1|balanced|none}
-flto-compression-level=n
Detailed descriptions about -flto and its related options are in Options That Control Optimization, available at:
Profiled-based optimization
Profile-based optimization allows the compiler to collect information about the program behavior and use that information when making code generation decisions. It involves compiling the program twice: first, to generate an instrumented version of the application that collects program behavior data when run, and a second time to generate an optimized binary by using the information that is collected by running the instrumented binary through a set of typical inputs for the application.
Profile-based optimization in the GCC compiler is accessed through the -fprofile-generate and -fprofile-use options, on top of the -O2 or higher optimization levels. The instrumented binary is generated by using -fprofile-generate on top of all other options, and the resulting binary writes the profile data to .gcda files by default. For example:
gcc -fprofile-generate -O3 -c a.c
gcc -fprofile-generate -O3 -c b.c
gcc -fprofile-generate -o program a.o b.o
program < sample1
program < sample2
program < sample3
gcc -fprofile-use -O3 -c a.c
gcc -fprofile-use -O3 -c b.c
gcc -fprofile-use -o program a.o b.o
Additional options that are related to GCC PDF include:
-fprofile-correction: Corrects for missing counter samples from multi-threaded applications.
-fprofile-dir=PATH: Specifies the directory for generating and using profile data.
-fprofile-generate=PATH: Combines -fprofile-generate and -fprofile-dir.
-fprofile-use=PATH: Combines -fprofile-use and -fprofile-dir.
Detailed descriptions of -fprofile-generate and its related options can be found in Options That Control Optimization, available at:
For more information about this topic, see 6.4, “Related publications” on page 123.
6.3 IBM Feedback Directed Program Restructuring
Feedback Directed Program Restructuring (FDPR) is a feedback-directed, post-link optimization tool.
6.3.1 Introduction
FDPR optimizes the executable binary file of a program by collecting information about the behavior of the program while it runs a typical workload, and then creates a new version of the program that is optimized for that workload. Both main executables and dynamically linked shared libraries are supported.
FDPR performs global optimizations at the level of the entire executable or library, including statically linked library code. Because the binary that is optimized by FDPR is not relinked, compiler and linker conventions do not need to be preserved, which allows aggressive optimizations that are not available to optimizing compilers.
The main advantage that is provided by FDPR is the reduced footprint of both code and data, resulting in more effective cache usage. The principal optimizations of FDPR include global code reordering, global data reordering, function inlining, and loop unrolling, along with various tuning options tailored for the specific Power target. The effectiveness of the optimization depends largely on how representative the collected profile is of the true workload.
FDPR runs on both Linux and AIX and produces optimized code for all versions of the Power Architecture. POWER7 is its default target architecture.
Figure 6-1 shows how FDPR is used to optimize executable programs.
Figure 6-1 FDPR operation
FDPR builds an optimized executable program in three distinct phases:
1. Instrumentation (Yellow)
 – Creates an instrumented version of the input program and an empty profile file.
 – The input program can be an executable file or a dynamically linked shared library.
2. Profiling (Green)
 – Runs the instrumented program on a representative workload.
 – The profile file is filled with count data at run time.
3. Optimization (Red)
FDPR receives the original input program along with the filled profile file and creates an optimized version of the input program.
6.3.2 FDPR supported environments
FDPR is available on the following platforms:
AIX and Power Systems: Part of the AIX 5L V5 operating system and higher for both 32-bit and 64-bit applications. For more information, see AIX 5L Performance Tools Handbook, SG24-6039:
Software Development Toolkit for PowerLinux: Available for use through the IBM SDK for PowerLinux. Linux distributions of Red Hat EL5 and above, and SUSE SLES10 and above are supported. For more information, see:
In these resources, detailed online help, including manuals, is provided for each of these environments.
6.3.3 Acceptable input formats
The input binary can be a main executable program or a shared library, originally written in any language (for example, C, C++, or Fortran), provided that it is statically compiled. Thus, Java byte code is not acceptable. Code that is written in assembly language is acceptable, but it must follow the Power ABI convention. For more information, see 64-bit PowerPC ELF Application Binary Interface Supplement 1.9, available at:
It is important that the file includes relocation information. Although this is the default on AIX, on Linux you must add -Wl,-q or -Wl,--emit-relocs to the command that is used for linking the program (or -q if the ld command is used directly).
The input binary can include debug information. FDPR correctly processes line number information so that the optimized output can be debugged.
6.3.4 General operation
FDPR is started by running the fdprpro program as follows:
$ fdprpro -a action [-p] in -o out -f prof [opt …]
The action indicates the specific processing that is requested. The most common ones are instr for the instrumentation step and opt for the optimization step.
The in, out, and prof arguments indicate the input binary, the output binary, and the profile file.
FDPR also comes with a wrapper command, named fdpr, which performs the instrumentation, profiling, and optimization steps under one roof. Run man fdpr for more information about this wrapper.
Special input and output files
FDPR has a number of options that control input and output files. One option that controls the input files is --ignored-function-list file (-ifl file).
In some cases, the structure of some functions confuses FDPR, which can result in bad code generation. The file that is specified by --ignored-function-list file (-ifl file) contains a list of functions that are considered unsafe for optimization. This configuration prevents the potential bad code generation that might otherwise occur.
In addition to the profile and the instrumented and output optimized files, FDPR can optionally produce various secondary files to help you understand the static and dynamic nature of the input binary program. These secondary files have the same base name as the output file and a special extension. The options that control important output files are:
--disassemble_text (-d) and --dump-mapper (-dm): The -d option creates a disassembly of the code segment of the program (extension .dis_text). The disassembly is useful for understanding the structure of the program as analyzed or created by FDPR. The -dm option produces a mapping of basic blocks from their original addresses to their addresses in the optimized code. This mapping can be used, for example, to understand how a specific piece of code was transformed, or by user-specific post-processing tools.
--dump-ascii-profile (-dap): This option dumps the profile file in a human readable ASCII format (extension .aprof). The .aprof file is useful for manual inspection or user-defined post-processing of the collected profile.
--verbose n (-v n), --print-inlined-funcs (-pif), and --journal file (-j file): These options generate different analyses of the optimized file. The -v n option generates general and optimization-specific statistics (.stat extension). The amount of verbosity is set by n: basic statistics are provided by -v 1, optimization-specific statistics are added at level 2, and an instruction mix at level 3. The list of inlining and inlined functions is produced with the -pif option (.inl_list extension). The -j file option produces a journal of the main optimizations, in an XML format, with detailed information about each optimization site, including the corresponding source file and line information. This information can be used by GUI tools to display optimizations in the context of the source code.
Controlling output to the console
The amount of progress information that is printed to the console can be controlled by two options. The default progress information is as follows:
fdprpro (FDPR) Version vvv for Linux/POWER
fdprpro -a opt -O3 in -o out -f prof
> reading_exe ...
> adjusting_exe ...
> analyzing ...
> building_program_infrastructure ...
> building_profiling_cfg ...
> add_profiling ...
>> reading_profile ...
>> building_control_flow_transfer_profiling ...
> pre_reorder_optimizations ...
>> derat_optimization ...
...
This information might also be interspersed with warning and debugging messages. Use the --quiet (-q) option to suppress the progress information. To limit the warning information, use the --warning l (-w l) option.
6.3.5 Instrumentation and profiling
FDPR instrumentation is performed by running the following command:
$ fdprpro -a instr in [-o out] [-f prof] [opts…]
If out is not specified, the output file is in.instr. If the profile file is not specified, in.nprof is used.
Two files are created: the instrumented program and an empty profile. The instrumented program (or shared library), when run on a representative workload, fills the profile with execution counts of the nodes and edges of the binary control flow graph (CFG). A node in this CFG is a basic block (a piece of code with single entry and exit points). An edge indicates a control transfer between two basic blocks through a branch (a regular branch, call, or return instruction).
To run the instrumented program, use the same command parameters as with the original program. As indicated in 6.3.1, “Introduction” on page 114, the workload that is exercised during the instrumented run should be representative, making the optimization step more effective. Because of the instrumentation code, the program is slower.
Successive runs of the instrumented program accumulate the counts in the same profile. Similarly, if the instrumented program is a shared library, each time the shared library participates in a process, the corresponding profile is updated with added counts.
Profiling shared libraries
When the dynamic linker searches for and links a shared library during execution, it looks for the original name that was used when the program was linked. To ensure that the instrumented library is the one that is run, ensure that the following items are true:
1. The instrumented library should have the same name as the original library. The user can rename the original or place the libraries in different folders.
2. The folder that contains the library must be in the library search path: LIBPATH on AIX and LD_LIBRARY_PATH on Linux.
Moving and renaming the profile file
The location of the profile file is specified in the instrumented program, as indicated by the -f option. However, the profile file might be moved, or if its original specification is relative, the real location can change before execution.
Use the -fdir option to set the profile directory if it is known at instrumentation time and is different from the one implied or specified by the -f option.
Use the FDPR_PROF_DIR environment variable to specify the profile directory if the profile file is not present in the relative or absolute location where it was created in the instrumentation step (or where specified originally by -fdir).
Use the FDPR_PROF_NAME environment variable to specify the profile file name if the profile file name changed, as in the example that follows.
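For example, if the profile file was moved to /tmp/profiles and renamed (the values are illustrative):
$ export FDPR_PROF_DIR=/tmp/profiles
$ export FDPR_PROF_NAME=my_prog.nprof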
Profile file descriptor
When the instrumented binary file is run, the profile file is mapped to shared memory. The process uses a default file descriptor (FD) number (1023 on Linux and 1999 on AIX) for the mapping. If the application uses this specific FD, an error can occur during the profiling phase because of this conflict. Use the -fd option to change the default FD used by FDPR:
$ fdprpro -a instr my_prog -fd <fd num>
The FD can also be changed at run time by setting the FDPR_PROF_FD environment variable:
$ export FDPR_PROF_FD=fd_num
FDPR can be used to profile several binary executable files in a single run of an application. If so, you must specify a different FD for each binary. For example:
$ fdprpro -a instr in/libmy_lib1 -o out/libmy_lib1 -f out/libmy_lib1.prof -fd 1023
$ fdprpro -a instr in/libmy_lib2 -o out/libmy_lib2 -f out/libmy_lib2.prof -fd 1022
Because environment variables are global in nature, when profiling several binary files at the same time, use explicit instrumentation options (-f, -fd, and -fdir) to differentiate between the profiles rather than using the environment variables (FDPR_PROF_FD and FDPR_PROF_NAME).
Instrumentation stack
The instrumentation uses the stack to save registers, dynamically allocating space on the stack at a default location below the current stack pointer. On AIX, this default is at offset -10240, and on Linux it is -1800. In some cases, especially in multi-threaded applications where the stack space is divided between the threads, the application can already be close to the end of the stack following a deep calling sequence, and the additional allocation can cause the application to fail. To allocate the instrumentation area closer to the current stack pointer, use the -iso option:
$ fdprpro -a instr my_prog -iso -300
6.3.6 Optimization
The optimization step is performed by running the following command:
$ fdprpro -a opt in [-o out] -f prof [opts…]
If out is not specified, the output file is in.fdpr. No profile is provided by default. If none is specified or if the profile is empty, the resulting output binary file is not optimized.
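A minimal sketch of the complete instrument, profile, and optimize cycle, assuming an executable my_prog and a representative input typical_input:
$ fdprpro -a instr my_prog -o my_prog.instr -f my_prog.nprof
$ ./my_prog.instr < typical_input
$ fdprpro -a opt my_prog -f my_prog.nprof -O3 -o my_prog.fdpr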
Code reordering
Global code reordering works in two phases: making chains and reordering the chains.
The initial chains are sequentially ordered basic blocks, with branch conditions inverted where necessary, so that branches between the basic blocks are mostly not taken. This configuration makes instruction prefetching more efficient. Chains are terminated when the heat (that is, execution count) goes below a certain threshold relative to the initial heat.
The second phase orders chains by successively merging the two most strongly linked chains, based on how frequent the calls between the chains are. Combining chains can cross function boundaries. Thus, a function can be broken into multiple chunks, with pieces of different functions placed close together when there is a high frequency of calls, branches, and returns between them. This approach improves code locality, and thus i-cache and page table efficiency.
You use the following options for code reordering:
--reorder-code (-RC): This option performs the main work of global code reordering. Use --rcaf to determine the aggressiveness level:
 – 0: no change
 – 1: standard (default)
 – 2: most aggressive.
Use --rcctf to lower the threshold for terminating chains. Use -pp to preserve function integrity and -pc to preserve CSECT integrity (AIX only). These two options limit global code reordering and might be requested for ease of debugging.
--branch-folding (-bf) and --branch-prediction (-bp): These options control important parts of the code reordering process. The -bf folds branch to branch into a single branch. The -bp sets the static branch prediction bit when taken or not taken statistics justify it.
Function inlining
FDPR performs function inlining of function bodies into their respective calling sites if the call site is selected by one of a number of user-selected filters:
Dominant callers (--selective-inlining (-si), -sidf f, and -siht f): The filter criterion here is that the call site is dominant relative to the other callers of the called function (the callee). It is controlled by two attributes. The -sidf option sets the domination percentage threshold (default 80). The -siht option further restricts the selection to functions hotter than the threshold, which is specified in percent relative to the average (default 100).
Hot functions (--inline-hot-functions f (-ihf f)): This filter selects inlining for all call sites where the call is hotter than the heat threshold (in percent, relative to the average).
Small functions (--inline-small-functions f (-isf f)): This filter selects for inlining all functions whose size, in bytes, is smaller than or equal to the parameter.
Selective hot code (--selective-hot-code-inline f (-shci f)): The filter computes how much execution count is saved if the function is inlined at a call site and selects those sites where the relative saving is above the percentage.
De-virtualization
De-virtualization is addressed by the --ptrgl-optimization (-pto) option. The full-blown call-by-pointer mechanism (ptrgl) sets a new TOC anchor, loads the function address, moves it to the count register (CTR), and jumps indirectly through the CTR. The -pto option optimizes this mechanism in cases where there are few hot targets from a calling site. In C++ terms, it de-virtualizes virtual method calls by calling the actual targets directly. The optimized code compares the address of the function descriptor, which is used for the indirect call, against the address of a hot candidate, as identified in the profile, and conditionally calls that target directly. If none of the hot targets match, the code invokes the original indirect call mechanism. The idea is that most of the time the conditional direct branches are run instead of the ptrgl mechanism. The impact of the optimization on performance depends heavily on the function call profile.
The following thresholds can help to tune the optimization and to adjust it to different workloads (a combined example follows this list):
Use -ptoht thres to set the frequency threshold for indirect calls that are to be optimized (thres can be 0 - 1, with 0.8 by default).
Use -ptosl n to set the limit of the number of hot functions to optimize in a given indirect call site (the default for n is 3).
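A minimal sketch that tunes both thresholds (the program name and values are illustrative):
$ fdprpro -a opt my_prog -f my_prog.nprof -pto -ptoht 0.6 -ptosl 2 -o my_prog.fdpr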
Loop-unrolling
Most programs spend their time in loops. This statement is true regardless of the target architecture or application. FDPR has one option to control the unrolling optimization for loops: --loop-unrolling factor (-lu factor).
FDPR optimizes loops by using a technique called loop-unrolling. By unrolling a loop n times, the number of back branches is reduced by a factor of n, so code prefetch efficiency can be improved. The downside of loop-unrolling is code inflation, which results in an increased code footprint and increased i-cache misses. Unlike traditional loop-unrolling, FDPR is able to mitigate this problem by unrolling only the hottest paths in the loop. The factor parameter determines the aggressiveness of the optimization. With -O3, the optimization is invoked with -lu 9.
By default, loops are unrolled two times. Use -lu factor to change that default.
Architecture-specific optimizations
Here are some architecture-specific optimizations:
--machine tgt (-m tgt): FDPR optimizations include general optimizations that are based on a high-level program representation as control and data flow, in addition to peephole optimizations that rely on different architecture features. Those optimizations can perform better when tuned for specific platforms. The -m flag allows the user to specify the target machine model when it is known and the program is not intended for use on multiple target platforms. The default target is POWER7.
--align-code code (-A code): Optimizing the alignment and placement of the code is crucial to the performance of the program. Correct alignment can improve instruction fetching and dispatching. The alignment algorithm in FDPR uses different techniques that are based on the target platform. Some techniques are generic for the Power Architecture, and others consider the dispatch rules of the specific machine model. If code is 1 (the default), FDPR applies a standard alignment algorithm that is adapted for the selected target machine (see -m in the previous bullet point). If code is 2, FDPR applies a more advanced version, using dispatch rules and other heuristics to decide how the program code chunks are placed relative to i-cache sectors, again based on the selected target. A value of 0 disables the alignment algorithm.
Function optimization
FDPR includes a number of function level optimizations that are based on detailed data flow analysis (DFA). With DFA, optimizations can determine the data that is contained in each register at each point in the function and whether this value is used later.
The function optimizations are:
--killed-regs (-kr): A register is considered killed at a point (in the function) if its value is not used on any ensuing path. FDPR uses the Power ABI convention that defines which registers are non-volatile (NV) across function calls. NV registers that are used inside a function are saved in its prolog and restored in its epilog. The -kr optimization analyzes called functions, looking for save and restore instructions of killed NV registers. If the register is killed at the calling site, the save and restore instructions for this register are removed. The optimization considers all calls to the function, because an NV register might be live when the function is called from a different site. When needed, the optimization might also reassign (rename) registers at the calling side to ensure that an NV register is indeed killed and can be optimized.
‑‑hco-reschedule (-hr): The optimization analyzes the flow through hot basic blocks and looks for instructions that can be moved to dominating colder basic blocks (basic block b1 dominates b2 if all paths to b2 first go through b1). For example, an instruction that loads a constant to a register is a candidate for such motion.
‑‑simplify-early-exit factor (-see factor): Sometimes a function starts with an early exit condition so that if the condition is met, the whole body of the function is ignored. If the condition is commonly taken, it makes sense to avoid saving the registers in the prolog and restoring them in the epilog. The -see optimization detects such a condition and provides a reduced epilog that restores only registers modified by computing the condition. If factor is 1, a more aggressive optimization is performed where the prolog is also optimized.
Peephole optimization
Peephole optimizations require only a small context around the specific site in the code that is problematic. The more important ones that FDPR performs are -las, -tlo, and -nop.
--load-after-store (-las): In recent Power Architecture implementations, when a load instruction from address A closely follows a store to that address, it can cause the load to be rejected. The instruction is then retried in a slower mode, which produces a large performance penalty. This behavior is also called Load-Hit-Store (LHS). With the -las optimization, the load is pushed further away from the store, thus avoiding the reject condition.
--toc-load-optimization (-tlo): The TOC (Table of Contents) is a data section in programs where pointers are kept to avoid lengthy address computation at run time. Loading an address (a pointer) is a costly operation, and FDPR can save on processing if the address is close enough to the TOC anchor (R2). In such cases, the load from the TOC is replaced by an addi Rt,R2,offset, where R2+offset = loaded address. The optimization is performed after data reordering, so that commonly accessed data is placed closer to R2, increasing the potential of this optimization. A TOC is used in 32-bit and 64-bit programs on AIX, and in 64-bit programs on Power Systems running Linux. 32-bit Linux uses a GOT, so this optimization is not relevant there.
--nop-removal (-nop): The compiler (or the linker) sometimes inserts no-operation (NOP) instructions in various places to create necessary space in the instruction stream. The most common place is following a function call. Because the call might have modified the TOC anchor register (R2), the compiler inserts a load instruction that resets R2 to its correct value for the current function. Because FDPR has a global view of the program, the optimization can remove the NOP if the called function uses the same TOC (the TOC anchor is used on AIX and on 64-bit Linux).
Data reordering
The profile that is collected by FDPR provides important information about the execution of branch instructions, thus enabling efficient code reordering. The profile does not directly indicate whether specific data objects should be placed one after the other. Nevertheless, FDPR is able to infer such placement by using the collected profile.
The relevant options are:
--reorder-data (-RD): This optimization reorders data by placing pointers and data closer to the TOC anchor, depending on their hotness. FDPR uses a heuristic where the hotness is computed as the total count of basic blocks where the pointer to the data was retrieved from the TOC.
--reduce-toc thres (-rt thres): The optimization removes from the TOC entries that are colder than the threshold. Their access, if any, is replaced by computing the address (see the -tlo optimization in “Peephole optimization” on page 122). Typically, you use -rt 0, which removes only the entries that are never accessed.
Combination optimizations
FDPR has predefined optimization sets that provide a good starting point for performance tuning:
-O: Performs code reordering (-RC) with branch prediction bit setting (-bp), branch folding (-bf), and NOP instruction removal (-nop).
-O2: Adds to -O function de-virtualization (-pto), TOC-load optimization (-tlo), function inlining (-isf 8), and some function optimizations (-hr, -see 0, and -kr).
-O3: Switches on data reordering (-RD and -rt 0), loop-unrolling (-lu), more aggressive function optimization (-see 1 and -vro), and employs more aggressive inlining (-lro and -isf 12). This set provides an aggressive but still stable set of optimizations that are beneficial for many benchmarks and applications.
-O4: Essentially turns on more aggressive inlining (-sidf 50, -ihf 20, and -shci 90). As a result, the number of branches is reduced, but at the cost of increasing code footprint. This option works well with large i-caches or with small to medium programs/threads.
For more information about this topic, see 6.4, “Related publications” on page 123.
6.4 Related publications
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter:
C/C++ Cafe (IBM Rational), found at:
FDPR, Post-Link Optimization for Linux on Power, found at:
Feedback Directed Program Restructuring (FDPR), found at:
GCC online documentation
 – All versions: http://gcc.gnu.org/onlinedocs/
 – Advance Toolchain 4.0: http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/
 – Advance Toolchain 5.0: http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/
 – Advance Toolchain 6.0 (3Q2012): http://gcc.gnu.org/onlinedocs/gcc-4.7.1/gcc/
XL Compiler Documentation:
 – C and C++ Compilers
 • C and C++ Compilers, found at:
 • Optimization and Programming Guide - XL C/C++ for AIX, V12.1, found at:
 – Fortran compilers
 • Fortran Compilers family, found at:
 • Optimization and Programming Guide - XL Fortran for AIX, V14.1, found at:
 

1 Language Standards Supported by GCC, available at:
2 Options Controlling C Dialect, available at:
3 Options That Control Optimization, and specifically the discussion of -fstrict-aliasing, available at: