Developing applications with Blue Gene/Q compilers
Applications to be run on the Blue Gene/Q system must be compiled and linked with a compiler that targets the Blue Gene/Q environment. Because compilation occurs on the front end node and not on the Blue Gene/Q system, these compilers are cross compilers.
This chapter describes the considerations for developing, compiling, and optimizing C/C++ and Fortran applications for the IBM Blue Gene/Q PowerPC A2 processor and the quad-processing extension (QPX) in the PowerPC AS v2 floating-point unit.
7.1 Programming environment overview
Figure 2-1 on page 10 shows the system calls that the CNK manages, including forwarding I/O to the I/O node kernel. Figure 6-1 on page 66 shows a summary of the messaging software stack that supports the execution of Blue Gene/Q applications.
7.2 Compilers for the Blue Gene/Q system
The Blue Gene/Q system includes support for the IBM XL family of optimizing compilers. It also supports the GNU Compiler Collection, a Python interpreter, and the GNU toolchain tools.
7.2.1 IBM XL compilers
The Blue Gene/Q system includes the IBM XL compilers. These compilers can be used to develop C, C++, and Fortran applications for the IBM Blue Gene/Q system. This family comprises the following products, which are referred to in this chapter as the IBM XL compilers for Blue Gene/Q:
XL C/C++ Advanced Edition V12.1 for Blue Gene/Q
XL Fortran Advanced Edition V14.1 for Blue Gene/Q
The information presented in this chapter is an overview of the features that are available for use with the Blue Gene/Q system. For complete documentation about these compilers, see the information at the following websites:
XL C/C++
XL Fortran
Documentation is also typically included as PDF files in the installation directories under /opt/ibmcmp.
The default installation directory for the IBM XL compilers is /opt/ibmcmp. The system administrator can specify another installation directory. See the compiler documentation for more information about changing the default installation directories for the XL compilers.
If an alternative installation location is used for the XL compiler, create a link that refers to the alternative installation location in the /opt/ibmcmp directory, for example:
ln -s /bgsys/xlcompilers/latest /opt/ibmcmp
After this link is created, you can invoke the compilers through the /opt/ibmcmp path while using the compilers that are installed in the alternative location.
The examples in this chapter are based on the default installation location of /opt/ibmcmp. If another installation location is used, the paths in the examples must also be changed to match the alternative location.
7.2.2 GNU Compiler Collection
The standard GNU Compiler Collection V4.4.6 for C, C++, and Fortran is supported on the Blue Gene/Q system. The versions of the toolchain components are:
gcc 4.4.6
binutils 2.21.1
glibc 2.12.2
gdb 7.2
For more information about the toolchain compilers, see the man pages or the GNU website.
7.2.3 Python interpreter
You can install patches to build a version of Python that runs on the Blue Gene/Q system. See Section 7.11, “Python support” on page 95 for more information.
7.2.4 Toolchain tools
The GNU toolchain provides a variety of tools. These tools are in the /bgsys/drivers/ppcfloor/gnu-linux/bin directory and have the prefix powerpc64-bgq-linux-. Some tools have added function for the Blue Gene/Q system:
gdb Contains support for remote debugging, the display of the Blue Gene/Q instruction set, and the display of QPX register contents.
objdump Disassembles instructions on the Blue Gene/Q system.
readelf Recognizes and displays the note section for the Blue Gene/Q system.
nm Recognizes the vector4double data type for the XL compiler.
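For example, the following standard binutils invocations disassemble a Blue Gene/Q program and display its note section (hello is an arbitrary program name):
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-objdump -d hello
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-readelf -n hello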
7.3 Compiling and linking applications on the Blue Gene/Q system
The following Blue Gene/Q GNU compilers are stored at /bgsys/drivers/ppcfloor/gnu-linux/bin:
powerpc64-bgq-linux-gcc
powerpc64-bgq-linux-gfortran
powerpc64-bgq-linux-g++
The names of the XL compilers for Blue Gene/Q are listed in Table 7-1. There are multiple variations for each language (C, C++, Fortran), depending on the language standard to be used. Use the thread-safe version of the compiler (the name ends in _r) to compile programs that run threads.
Table 7-1 Scripts available in the bin directory for compiling and linking

Language  Script names
C         bgc89, bgc99, bgcc, bgxlc, bgc89_r, bgc99_r, bgcc_r, bgxlc_r
C++       bgxlc++, bgxlc++_r, bgxlC, bgxlC_r
Fortran   bgf77, bgf90, bgf95, bgf2003, bgfort77, bgxlf, bgxlf_r, bgxlf90, bgxlf90_r, bgxlf95, bgxlf95_r, bgxlf2003, bgxlf2003_r
Example 7-1 shows how to compile and link a simple program.
Example 7-1 Linking and compiling a simple program
Compile and link a program with the toolchain:
$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o hello hello.c
Compile and link a program with the XL C compiler:
$ /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc -o hello hello.c
Compile and link a program with the XL Fortran compiler:
$ /opt/ibmcmp/xlf/bg/14.1/bin/bgxlf90_r -o hello hello.f
Compile and link a program with the toolchain Fortran compiler:
$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gfortran -o hello hello.f
7.4 Compiler options specific to the Blue Gene/Q system
Both the GNU compilers and the XL compilers have many options to configure compilation and linking. See the compiler documentation links for a complete list of options and descriptions. The options in this section are provided with the Blue Gene/Q versions of the compilers and are specific to programs that are compiled for the Blue Gene/Q system.
7.4.1 Options for the Blue Gene/Q system
This section presents the options for the Blue Gene/Q system.
GNU compilers
Specify the following option to use dynamic linking for the program:
-dynamic For performance reasons, the compilers for the Blue Gene/Q system default to static linking. If this option is not specified, static linking is used.
XL compilers
This section describes the options for the XL compilers.
Default options
The following options are set when the bgxl compiler invocation scripts in the bin directory of the installation folder are used. These options are the default settings:
-qarch=qp -qtune=qp These options identify that the code is targeted for Blue Gene/Q.
-q64 The Blue Gene/Q compilers generate only 64-bit code.
-qsimd=auto Use this option to indicate whether the compiler transforms code into a form that can use the QPX floating-point instruction set. This process is sometimes referred to as simdizing the code. The -qsimd option defaults to auto. To disable the auto simdization of code, use the -qsimd=noauto option.
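For example, the following commands compile one file with the default automatic simdization and then recompile it with simdization disabled (kernel.c is an arbitrary source file used for illustration):
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc_r -O3 -c kernel.c
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc_r -O3 -qsimd=noauto -c kernel.c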
Additional Blue Gene/Q options
These options are specific to the Blue Gene/Q system. These options are not set by default:
-qnostaticlink Use this option to specify that the executable program is generated with dynamic linking. For all Blue Gene/Q XL compilers, the default linking mode is static.
-qmkshrobj Use this option to generate shared libraries when linking with the Blue Gene/Q XL compilers.
-qflttrap=qpxstore Use this option to generate code that permits floating‑point exceptions to occur when the QPX floating‑point unit is used. The QPX floating‑point unit generates a limited set of floating‑point exceptions. Only the QPX store instruction can generate a floating‑point exception for not a number (NAN) or infinity (INF). This option is not enabled by default.
-qtm Use this option to process the transactional memory #pragmas in the program. When source code contains transactional memory #pragmas and is compiled with the -qtm option, the compiler generates code that uses the transactional memory on the Blue Gene/Q system. The #pragmas to identify the transactional code must be used in addition to the option -qtm for this feature to be enabled. Transactional memory is only useful when threads are present. A thread-safe compiler (that is, a compiler with _r in its name) must be used with this option. For more information about the syntax and use of transactional memory, see the links to the compiler documentation.
-qsmp=speculative Use this option to process the speculative thread #pragmas in the code. Both speculative thread constructs and this option are required to generate speculative threads. Use this option with a thread-safe compiler (that is a compiler with _r in its name). For more information about the syntax and use of speculative threads, see the links to the compiler documentation.
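As a minimal hedged sketch of the -qtm option described above (the tm_atomic pragma name is taken from the XL compiler documentation; see that documentation for the exact syntax and semantics), the following C fragment marks a shared update as a transaction. It might be compiled with a thread-safe invocation such as /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc_r -qsmp=omp -qtm:

#include <stdio.h>

#define N 1024
static double balance[N];

/* Transfer an amount between two accounts. The tm_atomic pragma marks
   the updates as one transaction; conflicting accesses from other
   threads cause the hardware to roll back and retry the region. */
static void transfer(int from, int to, double amount)
{
    #pragma tm_atomic
    {
        balance[from] -= amount;
        balance[to]   += amount;
    }
}

int main(void)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < N - 1; i++) {
        transfer(i, i + 1, 1.0);
    }
    printf("balance[0] = %.1f balance[%d] = %.1f\n", balance[0], N - 1, balance[N - 1]);
    return 0;
}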
7.4.2 Unsupported compiler options
The following compiler options are not supported by the Blue Gene/Q compilers.
GNU compilers:
-m32 The Blue Gene/Q system supports only 64-bit architecture. The -m32 option is not supported.
XL compilers:
-q32 The Blue Gene/Q system uses a 64-bit architecture. The 32-bit mode is not supported.
-qaltivec The A2 processor does not support vector single instruction, multiple data (VMX) instructions.
7.5 Support for pthreads and OpenMP
Programs that use threads can be built with the Blue Gene/Q compilers and run on the Blue Gene/Q system. Threads on the Blue Gene/Q system are implemented as pthreads that are defined by glibc in the Blue Gene/Q toolchain.
OpenMP can also be used to create threads, which are implemented by the OpenMP run‑time environment as pthreads. For more information about the threading model for the Blue Gene/Q system, see 3.9, “Threading overview” on page 17. For more information about the GNU OpenMP environment, which is also referred to as GOMP, see the GNU website at: http://gcc.gnu.org/projects/gomp/.
The IBM XL compilers for the Blue Gene/Q system support the OpenMP 3.1 standard. See the XL compiler documentation for information about the options that are required to enable OpenMP. That documentation also contains information about source code changes for the OpenMP run-time environment.
The GNU toolchain for Blue Gene/Q contains support for OpenMP 3.0.
7.5.1 Thread stack size for the Blue Gene/Q system
The thread stack size depends on the system configuration:
Minimum thread stack size
The minimum thread stack size is determined by glibc. In glibc 2.10, the value for PTHREAD_STACK_MIN is defined as 128 KB for PowerPC64. The smallest allowable stack in glibc 2.12.2 is also 128 KB. If the stack size is smaller than the minimum value, the pthread_create() function returns an error.
Default thread stack size
The default thread stack size on the Blue Gene/Q system is 4 MB.
Maximum thread stack size
The maximum size for a thread stack in a program depends on the amount of space that is available to allocate for stack space. Many factors can affect the available stack space. These factors include how much heap space is available to the process, how much heap space is already used, how many threads are being created, how much thread local storage is being used, and how many processes are running on the node. If memory errors occur when the pthread_create() function is called, there is probably not enough space to create the stack for the new thread. To set the thread stack size when creating a thread with pthread_create(), use the pthread_attr_setstacksize() function as shown in Example 7-2.
Example 7-2 Using pthread_attr_setstacksize() to set the thread stack size
#include <pthread.h>

pthread_attr_t attr;
pthread_t thd;
int rc;

pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 8000000);  /* request an 8 MB stack */
rc = pthread_create(&thd, &attr, p, 0);     /* p is the thread start routine */
To set the thread stack size when using OpenMP threads, use the OMP_STACKSIZE environment variable. For example, you can enter the following command to set the OpenMP stack to 8 MB:
export OMP_STACKSIZE=8M
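As an illustrative sketch (the array size is arbitrary), the following OpenMP program places a large automatic array on each thread stack. With the default 4 MB stack, this can overflow; setting OMP_STACKSIZE=8M as shown above avoids the problem. Compile it with a thread-safe compiler, for example /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc_r -qsmp=omp:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* About 6 MB of automatic storage per thread, which exceeds
           the default 4 MB thread stack on the Blue Gene/Q system. */
        double scratch[750000];
        scratch[0] = (double) omp_get_thread_num();
        printf("thread %d initialized its stack buffer\n", (int) scratch[0]);
    }
    return 0;
}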
7.6 Creating libraries on the Blue Gene/Q system
On the Blue Gene/Q system, two types of libraries can be created:
Static libraries
Shared (dynamically loaded) libraries
When a program is statically linked, the required code from the static libraries is linked into the program. Example 7-3 illustrates how to create a static library on the Blue Gene/Q system with the XL family of compilers.
Example 7-3 Static library creation using the XL compilers
# Compile with the XL compiler
/opt/ibmcmp/vac/bg/12.1/bin/bgxlc -c pi.c
/opt/ibmcmp/vac/bg/12.1/bin/bgxlc -c main.c
#
# Create the library
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-ar rcs libpi.a pi.o
#
# Create the executable program
/opt/ibmcmp/vac/bg/12.1/bin/bgxlc -o pi main.o -L. -lpi
Example 7-4 shows the same procedure with the GNU collection of compilers.
Example 7-4 Static library creation using the GNU compilers
# Compile with the GNU compiler
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -c pi.c
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -c main.c
#
# Create the library
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-ar rcs libpi.a pi.o
#
# Create the executable program
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o pi main.o -L. -lpi
Shared libraries are loaded at execution time.
Use the -qnostaticlink option with the XL C and C++ compilers to build a dynamic binary. The static libgcc.a is linked in by default. To use the shared version of the libgcc library, also specify -qnostaticlink=libgcc. For example, use /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc -o hello hello.c -qnostaticlink -qnostaticlink=libgcc.
Example 7-5 shows shared library creation with the XL compiler.
Example 7-5 Shared library creation using the XL compiler
# Use XL to create shared library
/opt/ibmcmp/vac/bg/12.1/bin/bgxlc -qpic -c libpi.c
/opt/ibmcmp/vac/bg/12.1/bin/bgxlc -qpic -c main.c
#
# Create the shared library
/opt/ibmcmp/vac/bg/12.1/bin/bgxlc -qmkshrobj -Wl,-soname,libpi.so.0 -o libpi.so.0.0 libpi.o
#
# Set up the soname
ln -sf libpi.so.0.0 libpi.so.0
#
# Create a linker name
ln -sf libpi.so.0 libpi.so
#
# Create the executable program
/opt/ibmcmp/vac/bg/12.1/bin/bgxlc -o pi main.o -L. -lpi -qnostaticlink -qnostaticlink=libgcc
Example 7-6 illustrates the same procedure with the GNU collection of compilers.
Example 7-6 Shared library creation using the GNU compiler
# Compile with the GNU compiler
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -fPIC -c libpi.c
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -fPIC -c main.c
#
# Create shared library
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -shared \
-Wl,-soname,libpi.so.0 -o libpi.so.0.0 libpi.o -lc
#
# Set up the soname
ln -sf libpi.so.0.0 libpi.so.0
#
# Create a linker name
ln -sf libpi.so.0 libpi.so
#
# Create the executable program
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o pi main.o -L. -lpi -dynamic
The -qnostaticlink and -qmkshrobj options can also be used in a similar manner with the XL Fortran compilers.
7.7 Running dynamically linked applications on the Blue Gene/Q system
Unlike most other platforms, the compilers that generate code to run on the Blue Gene/Q system use static linking instead of dynamic linking by default. The use of static linking improves performance. If dynamic linking is used, there are some differences in how to build a program to run on the Blue Gene/Q system.
7.7.1 Creating a program
If no linking options are specified when linking a program, a statically linked program is generated. To use dynamic linking with GNU compilers, use the -dynamic option. Example 7-7 shows how to link a program that is to run with dynamic linking.
Example 7-7 Linking a program to be run with dynamic linking
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o hello hello.c -dynamic
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc -o hello hello.c -qnostaticlink
A program that is created with a compiler that targets the Blue Gene/Q system identifies the path to the dynamic linker as /bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/lib/ld64.so.1. This directory is available on the front end nodes and the I/O node. The readelf tool displays this information. For more information, see 7.7.5, “Tools for dynamic linking” on page 88.
7.7.2 Creating a shared library
When the GNU compilers are used, a shared library for the Blue Gene/Q system is created in the same way as for Linux on IBM Power®. Compile the code that is included in a shared library with the -fpic option to identify that it contains position-independent code. Example 7-8 provides syntax for creating a shared library with GNU compilers.
Example 7-8 Creating a shared library with GNU compilers
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -c util.c -fpic
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o libtest.so util.o -shared
When the XL compilers are used, a shared library for the Blue Gene/Q system is created with the -qmkshrobj option. The -qmkshrobj option is supported for C, C++, and Fortran. Example 7-9 provides syntax for creating a shared library with XL compilers.
Example 7-9 Creating a shared library with XL compilers
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc -c util.c -qpic
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc -o libtest.so util.o -qmkshrobj
 
Important: Do not use the ld command to explicitly link a program. Use the Blue Gene/Q compilers (GNU and XL) to run the link command. This method ensures that the compiler links in dependent libraries that might be missed if the ld command is used alone.
7.7.3 Running a Blue Gene/Q dynamically linked program on a front end node
Some dynamically linked programs that are built for the Blue Gene/Q system can run on the front end node because it also uses a PowerPC64 processor. However, this configuration is not supported.
To run a program in this configuration anyway, explicitly invoke the dynamic linker with the Blue Gene/Q program as an argument, as shown in Example 7-10.
Example 7-10 Invoking the dynamic linker
/bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/lib/ld64.so.1 ./hello
7.7.4 Running a dynamically linked program on the Blue Gene/Q system
Running a dynamically linked program on the Blue Gene/Q system is similar to running a statically linked program. When dynamically linked programs are run on the Blue Gene/Q system, the paths to the required shared libraries must be known to the Blue Gene/Q dynamic linker, or the program fails to run. As described in 7.7.1, “Creating a program” on page 86, the path to the Blue Gene/Q dynamic linker is embedded in the program when it is linked. By default, the dynamic linker searches the directories that are expected to contain shared libraries for use with the Blue Gene/Q system. These locations include the directories for the Blue Gene/Q toolchain shared libraries and the Python shared library, if the Python library is installed. The linker also searches for the Message Passing Interface (MPI) and Parallel Active Messaging Interface (PAMI) shared libraries for the Blue Gene/Q driver, which are in the /usr/lib64/bgq/ directory on the I/O node.
The Blue Gene/Q dynamic linker follows the same search conventions as the native GNU dynamic linker. At program load time, the dynamic linker attempts to load all of the dependent libraries in the program that are identified as NEEDED in the dynamic section of the program Executable and Linkable Format (ELF) file, as displayed by the readelf tool. The search path order for a dynamically linked program on the Blue Gene/Q compute node contains the following locations:
The path in the DT_RPATH dynamic section of the program, if it exists
The path identified by the LD_LIBRARY_PATH environment variable that is specified on the runjob command
The path in the DT_RUNPATH dynamic section of the program, if it exists
The information provided in /etc/ld.so.bgq.cache file on the I/O node
The default paths searched by the Blue Gene/Q dynamic linker, which include /lib64/bgq and /usr/lib64/bgq on the I/O node
The default native paths /lib64 and /usr/lib64
The NEEDED, DT_RPATH, and DT_RUNPATH information for a program can be viewed using the readelf utility. For more information, see Section 7.7.5, “Tools for dynamic linking” on page 88.
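For example, the standard readelf -d option prints the dynamic section of a program, including the NEEDED, RPATH, and RUNPATH entries:
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-readelf -d hello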
7.7.5 Tools for dynamic linking
The tools that are described in this section provide information that can be used to determine how programs or shared libraries are built. They also provide information about the search paths that are used to find shared libraries.
The readelf tool
The readelf tool is in the toolchain. This tool provides information about the content of the dynamically linked program. It displays the path to the Blue Gene/Q dynamic linker, the set of shared library dependencies in the program (that are identified as NEEDED in the dynamic section of the program ELF file), and the DT_RPATH and DT_RUNPATH information. Figure 7-1 on page 89 displays this information.
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000001000040 0x0000000001000040
0x0000000000000188 0x0000000000000188 R E 8
INTERP 0x00000000000001c8 0x00000000010001c8 0x00000000010001c8
0x0000000000000015 0x0000000000000015 R 1
[Requesting program interpreter: /bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/lib/ld64.so.1]
LOAD 0x0000000000000000 0x0000000001000000 0x0000000001000000
0x0000000000000824 0x0000000000000824 R E 10000
LOAD 0x0000000000010000 0x0000000001100000 0x0000000001100000
0x0000000000000300 0x0000000000000370 RW 10000
DYNAMIC 0x0000000000010028 0x0000000001100028 0x0000000001100028
0x0000000000000170 0x0000000000000170 RW 8
NOTE 0x00000000000001e0 0x00000000010001e0 0x00000000010001e0
0x0000000000000040 0x0000000000000040 R 4
GNU_EH_FRAME 0x00000000000007d8 0x00000000010007d8 0x00000000010007d8
0x0000000000000014 0x0000000000000014 R 4
...
 
Dynamic section at offset 0x10028 contains 19 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000f (RPATH) Library rpath: [/bgusr/boger]
Figure 7-1 Output from the readelf tool
The native readelf tool and the readelf tool in the Blue Gene/Q toolchain are similar. In some cases, the Blue Gene/Q tool provides additional information.
The ldd tool
In most cases, the native ldd tool, /usr/bin/ldd, does not provide the correct information for dynamically linked programs that are created to run on the Blue Gene/Q system. Run the ldd tool in the Blue Gene/Q toolchain from a front end node to find the shared library dependencies. Figure 7-2 shows output from the Blue Gene/Q toolchain ldd tool when it is run on a front end node.
/bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/bin/ldd ./hello
linux-vdso64.so.1 => (0x00000fff9bb80000)
libc.so.6 => /bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/lib/libc.so.6 (0x00000fff9b940000)
/bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/lib/ld64.so.1 (0x0000000040450000)
Figure 7-2 Output from the ldd tool on a front end node
Figure 7-3 shows output from the Blue Gene/Q ldd tool when it is run on an I/O node that is used as a front end node.
/bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/bin/ldd ./hello
linux-vdso64.so.1 => (0x00000fffa2720000)
libc.so.6 => /lib64/bgq/libc.so.6 (0x00000fffa24e0000)
/bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/lib/ld64.so.1 (0x0000000040450000)
Figure 7-3 Output from the Blue Gene/Q ldd tool
The Blue Gene/Q toolchain shared libraries for a front end node are in the directories for the toolchain in /bgsys.
To run the equivalent of the ldd tool on the compute node, use the runjob command as shown in Figure 7-4. Long lines are separated with the backslash (\) character.
runjob --block R00-M0-N01 --corner R00-M0-N01-J00 --shape 1x1x1x1x1 \
--cwd /bgusr/boger/bgq/c --envs LD_TRACE_LOADED_OBJECTS=1 --exe hello
libc.so.6 => /lib64/bgq/libc.so.6 (0x0000001e01503000)
/bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/lib/ld64.so.1 => ld (0x0000003001000000)
Figure 7-4 Running ldd on the compute node
LD_DEBUG tracing
To trace the search paths that are used by the dynamic linker, use the LD_DEBUG environment variable. The Blue Gene/Q variable takes the same values as the Linux on Power variable. To see all of the supported values for the LD_DEBUG variable, run tracing with the LD_DEBUG=help option. See Example 7-11.
Example 7-11 LD_DEBUG=help tracing
runjob --block R00-M0-N01 --corner R00-M0-N01-J00 \
--shape 1x1x1x1x1 --raise --cwd /bgusr/boger/bgq/c --envs LD_DEBUG=help --exe ./hello
Valid options for the LD_DEBUG environment variable are:
 
libs display library search paths
reloc display relocation processing
files display progress for input file
symbols display symbol table processing
bindings display information about symbol binding
versions display version dependencies
dbgevents display debug events
all all previous options combined
statistics display relocation statistics
unused determined unused DSOs
help display this help message and exit
 
To direct the debugging output into a file instead of standard output
a filename can be specified using the LD_DEBUG_OUTPUT environment variable.
To see information about the directories that were searched to find a particular shared library, use the LD_DEBUG=libs option. See Example 7-12 on page 91.
Example 7-12 LD_DEBUG=libs debugging
runjob --block R00-M0-N01 --corner R00-M0-N01-J00 --shape 1x1x1x1x1 --raise \
--cwd /bgusr/boger/bgq/c --envs LD_DEBUG=libs --exe hello
1: find library=libc.so.6 [0]; searching
1: search path=/bgusr/boger/tls:/bgusr/boger (RPATH from file hello)
1: trying file=/bgusr/boger/tls/libc.so.6
1: trying file=/bgusr/boger/libc.so.6
1: search cache=/etc/ld.so.bgq.cache
1: trying file=/lib64/bgq/libc.so.6
1:
1:
1: calling init: /lib64/bgq/libc.so.6
1:
1:
1: initialize program: hello
1:
1:
1: transferring control: hello
1:
Hello from pid: 1 start: 0x1
1:
1: calling fini: hello [0]
1:
1:
1: calling fini: /lib64/bgq/libc.so.6 [0]
1:
To display detailed information about the libraries that were loaded, including the start address and size, use the LD_DEBUG=files option. See Example 7-13.
Example 7-13 LD_DEBUG=files debugging
runjob --block R00-M0-N01 --corner R00-M0-N01-J00 --shape 1x1x1x1x1 --raise \
--cwd /bgusr/boger/bgq/c --envs LD_DEBUG=files --exe hello
1: file=hello [0]; generating link map
1: dynamic: 0x0000000001100028 base: 0x0000000000000000 size: 0x0000000000100380
1: entry: 0x00000000011001b8 phdr: 0x0000000001000040 phnum: 7
1:
1:
1: file=libc.so.6 [0]; needed by hello [0]
1: file=libc.so.6 [0]; generating link map
1: dynamic: 0x0000001e01719e10 base: 0x0000001e01503000 size: 0x00000000002320f8
1: entry: 0x0000001e0171ad78 phdr: 0x0000001e01503040 phnum: 10
1:
1:
1: calling init: /lib64/bgq/libc.so.6
1:
1:
1: initialize program: hello
1:
1:
1: transferring control: hello
1:
Hello from pid: 1 start: 0x1
1:
1: calling fini: hello [0]
1:
1:
1: calling fini: /lib64/bgq/libc.so.6 [0]
1:
When using LD_DEBUG on a multi-node block, use the --label option on the runjob command to display which output corresponds to which node.
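For example, a sketch of such an invocation (the block name and executable are illustrative; --label prefixes each line of output so it can be attributed to a node):
runjob --block R00-M0-N01 --label --envs LD_DEBUG=libs --exe hello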
7.8 Mathematical Acceleration Subsystem Libraries
The Mathematical Acceleration Subsystem (MASS) libraries consist of tuned mathematical intrinsic functions that are available in versions for the IBM AIX® and Linux operating systems, including the Blue Gene/Q system. The MASS libraries provide improved performance over the standard mathematical library routines, are thread-safe, and support compilation of C, C++, and Fortran applications. For more information about MASS, see the Mathematical Acceleration Subsystem webpage.
The MASS libraries are included with the XL compiler collections for Blue Gene/Q, which are installed in the /opt/ibmcmp path.
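A hedged linking sketch follows. On other platforms, the scalar and vector MASS libraries are conventionally linked with -lmass and -lmassv; the library directory under /opt/ibmcmp varies with the installed XL MASS version, so the path shown here is illustrative only:
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc -O3 -o app app.c -L/opt/ibmcmp/xlmass/bg/7.3/bglib64 -lmass -lmassv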
7.9 Engineering and Scientific Subroutine Libraries
The Engineering and Scientific Subroutine Library (ESSL) for Linux on Power supports the Blue Gene/Q system. ESSL provides over 150 math subroutines that are tuned for performance on the Blue Gene/Q system; use ESSL version 5.1.1. For more information about ESSL, see the Engineering and Scientific Subroutine Library and Parallel ESSL website.
 
Important: When using IBM XL Fortran V14.1 for the Blue Gene/Q system, use ESSL V5.1.1. If incompatible versions of ESSL and Fortran are selected, the RPM installation fails with a dependency error message.
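As a hedged sketch of linking a Fortran application against ESSL (the installation directory and the libesslbg library name are assumptions based on ESSL packaging conventions; verify them against your installation):
/opt/ibmcmp/xlf/bg/14.1/bin/bgxlf90_r -O3 -o solver solver.f -L/opt/ibmmath/essl/5.1/lib64 -lesslbg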
7.10 Cross-compilation on the Blue Gene/Q system
An I/O node that is used as a front end node is a cross-compilation environment. When building an application in a cross-compilation environment, build tools such as configure and make might not provide the same results as when building natively. The results depend on whether the tools and application are designed to build correctly in the variety of cross-compilation environments that are available. Problems can occur because the configure and make steps for the build of a tool or application compile and execute small code snippets. These snippets identify characteristics of the target platform as part of the build process. If these code snippets are compiled with a cross-compiler and executed on the build machine instead of the target machine, the program might fail to execute or produce results that do not reflect the target machine.
See the following sections for information about minimizing unexpected results:
7.10.1 Configuring and building on an I/O node used as a front end node
An I/O node used as a front end node uses the same hardware as a compute node but runs the Linux operating system instead of the Compute Node Kernel (CNK). If a program is coded to use instructions that are specific to the A2 processor, it runs on an I/O node that is used as a front end node. However, it does not work on a standard front end node. If an application is configured on an I/O node that is used as a front end node, the program can be compiled and run natively. This method prevents some problems that are related to cross-compiling. An I/O node that is used as a front end node can be used to compile and run a set of small snippets. This method often provides correct results. However, it does not work for the following types of programs because CNK support is required:
Transactional memory
Speculative execution
Blue Gene/Q MPI
System calls that are supported on CNK but not on the Linux operating system
System calls that provide different results on the CNK than on the Linux operating system
Example 7-14 shows how to compile and run a simple program on an I/O node that is used as a front end node. In this case, the program is running directly on the I/O node.
Example 7-14 Compiling and running a simple program on an I/O node used as a front end node
$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o hello hello.c
$ ./hello
Hello
7.10.2 Using implicit program launching from a front end node
The Blue Gene/Q toolchain includes implicit toolchain launching. Implicit launching means invoking a Blue Gene/Q program as if it were being run natively on a front end node or an I/O node used as a front end node; the runjob command for that program is run implicitly, based on the settings of some environment variables. This capability is sometimes referred to as magic. To enable this function, the variables in Example 7-15 must be set.
Example 7-15 Using implicit program launching
$ export BG_PGM_LAUNCHER=yes
$ export RUNJOB_BLOCK=R00-M0-N00
 
To run on a single node, the following variables are also needed:
 
$ export RUNJOB_SHAPE=1x1x1x1x1
$ export RUNJOB_CORNER=R00-M0-N00-J00
Use the corresponding runjob environment variables to configure additional runjob arguments for use on the implicit runjob execution. See the man pages for the runjob command for a complete list and description of the runjob arguments. To disable implicit program launching, unset the BG_PGM_LAUNCHER environment variable.
The runjob program does not include scheduling. To run a program on a single node, you must specify which node of the block to use, as shown with the RUNJOB_SHAPE and RUNJOB_CORNER variables in Example 7-15.
Ensure that the small programs that are built during the configuration of the package are built using the Blue Gene/Q compilers. Each configuration script is unique, so general instructions for how to force the compiler cannot be provided. However, Example 7-16 works for many packages.
Example 7-16 Example configuration script for Blue Gene/Q compilers
$ ./configure CC=/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc
You can verify that the program is being run remotely in this situation by compiling a program with the Blue Gene/Q cross-compiler and executing the program on the front end node.
Example 7-17 shows a sample program.
Example 7-17 Program to verify the Blue Gene/Q implicit runjob invocation
 
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

int main(int argc, char** argv)
{
    struct utsname uts;

    uname(&uts);
    printf("machine: %s\n", uts.machine);
    if (strcmp(uts.sysname, "Linux") == 0) {
        printf("We are on Linux!\n");
    }
    else {
        printf("We are NOT on Linux!\n");
    }
    return 0;
}
Example 7-18 shows how to compile and run this program.
Example 7-18 Compiling and running a program
Set up the environment variables:
$ export RUNJOB_BLOCK=R00-M0-N03
$ export RUNJOB_SHAPE=1x1x1x1x1
$ export RUNJOB_CORNER=R00-M0-N03-J12
$ export BG_PGM_LAUNCHER=yes
 
Compile the program:
$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o test-onbgq test-onbgq.c
 
Run the program:
$ ./test-onbgq
machine: BGQ
We are NOT on Linux!
To verify that the program is launched with the runjob command on the compute node, set the RUNJOB_VERBOSE environment variable to a value that ensures verbose output. For more information, see the manual page for the runjob command.
The implicit launch support is compiled into Blue Gene/Q programs when the toolchain compilers or the XL compilers are used. A program can be created without this support by adding the -Wl,-e_start_no_magic option when linking the program, as shown in Example 7-19. When this option is used, the program does not use an implicit runjob invocation. Instead, it runs natively on the front end node.
Example 7-19 Linking a program without implicit launch support
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -o hello hello.c -Wl,-e_start_no_magic
7.11 Python support
The Python interpreter is often used for scientific applications. The Blue Gene/Q system includes RPMs for a patched version of the Python interpreter. Patches are available for Python 2.6.7, 2.6.8, 2.7.3, 3.2.2, and 3.2.3, and the interpreter can be built with either the GNU or XL compilers.
The default Python installation path is /bgsys/tools, but the interpreter can be installed in another location. For example, you can build and install the Blue Gene/Q Python library into a more efficient file system such as the IBM General Parallel File System (GPFS).
The Python interpreter is a dynamically linked application. See Example 7-6 on page 86 for information about the Blue Gene/Q environment. For information about how to build the Python interpreter, see the README file in the /bgsys/drivers/ppcfloor/tools/python directory. To see the options for the build scripts, use the -h flag.
7.11.1 Using the Python interpreter in a cross-compiled environment
The Blue Gene/Q system is a cross-compiled environment. The Python interpreter for the Blue Gene/Q system is built to run on Blue Gene/Q hardware. When building an application that must use the Python interpreter as part of its build process on the front end node, use the hostpython executable program in the installation directory. The executable programs /bgsys/tools/Python-2.6/bin/python and /bgsys/tools/Python-2.6/bin/hostpython are identical, except for the default search paths and the location of the default dynamic linker. Both programs generate the same Python compiled modules, but are used on different hosts when compiling modules.
Table 7-2 summarizes the Python executable programs that exist on a front end node and lists when they can be used.
Table 7-2 Python executable programs on a front end node

/usr/bin/python
  Host when used for build: Front end node
  When used to run Python: Front end node
  Targets: RHEL6.x on PPC64

/bgsys/tools/Python-2.6/bin/python
  Host when used for build: I/O node used as a front end node
  When used to run Python: Compute node, or I/O node used as a front end node
  Targets: Blue Gene/Q hardware

/bgsys/tools/Python-2.6/bin/hostpython
  Host when used for build: Front end node
  When used to run Python: Compute node, or I/O node used as a front end node
  Targets: Blue Gene/Q hardware
Here are some examples of how to run the Python interpreter:
To run the native Python interpreter on the front end node, but not run it on the Blue Gene/Q system, use the native Python interpreter /usr/bin/python.
To run the Python interpreter on the Blue Gene/Q compute node, use the runjob command to run the Python interpreter /bgsys/tools/Python-2.6/bin/python.
To compile a Python module as part of an application, where that module is later used to run on the Blue Gene/Q system, run the hostpython executable program on the front end node to compile the module.
To run the Python interpreter on an I/O node used as front end node, use the Python interpreter /bgsys/tools/Python-2.6/bin/python.
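For the module-compilation case in this list, a minimal sketch (assuming the module provides a conventional distutils setup.py, which is not part of this document):
$ /bgsys/tools/Python-2.6/bin/hostpython setup.py build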
7.11.2 Running the Python interpreter on the Blue Gene/Q system
The Python interpreter can be built with either the XL or GNU compilers. When the Python interpreter is built and installed on the front end node, a tar file called either bgqpython2.6.7gnu.tar.gz or bgqpython2.6.7XL.tar.gz is created and installed into /bgsys/linux/bgfs on the service node. This tar file is used at I/O node boot time to extract the files that the Python interpreter requires at run time on the I/O node. At run time, the Python shared library is installed on the I/O node in /usr/lib64/bgq, and the Python modules are installed in /bgfs/usr/lib64. The Python interpreter starts faster when the shared libraries and Python modules are accessed from these locations.
Example 7-20 shows how to verify the locations of these files.
Example 7-20 Verifying the locations of Python files
On the service node or front end node:
ls /bgsys/linux/bgfs
bgqpython2.6.7gnu.tar.gz bgqpython3.2.tar.gz
On an I/O node used as a front end node:
ls /bgfs/usr/lib64
libpython2.6.so.1.0 libpython3.2.so.1.0 python2.6 libpython2.6.so libpython3.2.so python3.2
The Blue Gene/Q Python interpreter is built and installed on the front end node. As part of this installation, the Python shared library and the python modules are packaged into a tar file for use when booting the I/O block where the Python interpreter is run. When the I/O block is booted, the Python shared library libpython2.6.so.x is in /lib64/bgq/ and the Python modules are found in /bgfs/usr/lib64/.
To compile applications, use the Python library that is on a front end node. In the Blue Gene/Q Python installation, there are two Python binaries: python and hostpython. The hostpython binary is designed to compile Python modules on the front end node. The python binary is designed to be used when running the Python interpreter on the compute node or an I/O node used as a front end node.
If the Python interpreter is installed in the default location in /bgsys/tools, the interpreters are invoked as follows:
/bgsys/tools/Python-2.6/bin/python hello.py
/bgsys/tools/Python-2.6/bin/hostpython hello.py
Example 7-21 shows how to run the Python interpreter on the Blue Gene/Q system.
Example 7-21 Running Python on the Blue Gene/Q system
$ runjob --block R00-M0-N00 : /bgsys/drivers/ppcfloor/gnu-linux/bin/python testarray.py
For more information about the Python interpreter, see the following websites:
Python Programming Language - Official Web site
The Python Tutorial
The Python Standard Library
pyMPI: Putting the py in MPI
7.12 Using the QPX floating‑point unit
The Blue Gene/Q hardware contains the quad-processing extension (QPX) to the IBM Power Instruction Set Architecture. The computational model of the QPX architecture is a vector single instruction, multiple data (SIMD) model with four execution slots and a register file that contains 32 registers, each 256 bits wide. Each register contains four elements of 64 bits, and each execution slot operates on one vector element. These registers are referred to as vector registers or quad registers.
Figure 7-5 shows the quad floating-point unit.
Figure 7-5 Blue Gene/Q quad floating-point unit
7.12.1 Using SIMD instructions in applications
There are two methods to use the QPX floating‑point instruction set in your application when using the IBM XL compiler for Blue Gene/Q:
Using automatic simdization
Using vector intrinsic functions or assembly code
Using automatic simdization with the IBM XL compiler
If your program is compiled with the XL compiler for Blue Gene/Q, simdization is enabled by default. The compiler attempts to automatically transform code to efficiently use the QPX floating‑point instruction set. Simdization is enabled at all optimization levels, but more aggressive and effective simdization occurs at higher optimization levels. Simdization is controlled by the -qsimd option and can be disabled by using the -qsimd=noauto option.
The -qreport option provides information about what code is simdized and why simdization does not occur.
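For example, the following command compiles a file and writes a listing (.lst) file that describes which loops were simdized and which were not (kernel.c is an arbitrary source file):
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc_r -O3 -qsimd=auto -qreport -c kernel.c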
For information about how to enable more simdization to occur, see the compiler documentation links in Section 7.2.1, “IBM XL compilers” on page 80.
Using the QPX vector intrinsic functions
Vector intrinsic functions are provided in the IBM XL compiler for Blue Gene/Q. The vector intrinsics are built-in functions that map directly onto the QPX instruction set. You can use these functions to tune code and enable the compiler to optimize the code as effectively as possible.
For a description of the set of vector intrinsics, see the compiler documentation links in Section 7.2.1, “IBM XL compilers” on page 80.
Example 7-22, Example 7-23, and Example 7-24 on page 100 show some common vector intrinsic functions for Blue Gene/Q. These examples are not optimized for performance and are not comprehensive.
Example 7-22 shows basic quadword load, store, and arithmetic operations in C.
Example 7-22 C example with basic quadword load, store, and arithmetic operations
 
// vector version of y[i] = a*x[i] + y[i]
// where "x" and "y" are 32-byte aligned

#include <stdio.h>

#define NPTS 8
static double __attribute__((aligned(32))) x[NPTS], y[NPTS];

int main(int argc, char * argv[])
{
    int i;
    double a = 2.0;
    vector4double av, xv, yv;
    for (i=0; i<NPTS; i++) {
        x[i] = (double) i;
        y[i] = (double) i + 1;
    }
    if ((long) x & 0x1F) printf("x is not 32-byte aligned\n");
    if ((long) y & 0x1F) printf("y is not 32-byte aligned\n");
    av = vec_splats(a);             // replicate "a" in four vector slots
    for (i=0; i<NPTS; i+=4) {
        xv = vec_ld(0L, &x[i]);     // load four contiguous elements of x[]
        yv = vec_ld(0L, &y[i]);     // load four contiguous elements of y[]
        yv = vec_madd(av, xv, yv);  // yv = av*xv + yv
        vec_st(yv, 0L, &y[i]);      // store four contiguous elements of y[]
    }
    for (i=0; i<NPTS; i++) printf("y[%d] = %.1lf\n", i, y[i]);
    return 0;
}
Example 7-23 shows basic quadword load, store, and arithmetic operations in Fortran.
Example 7-23 Fortran example with basic quadword load, store, and arithmetic operations
 
program fmain
  implicit none
  integer i
  integer, parameter :: n = 8
  real(8) a, x(n), y(n)
!IBM* align(32, x, y)
  vector(real(8)) av, xv, yv
  a = 2.0d0
  do i = 1, n
    x(i) = dble(i-1)
    y(i) = dble(i)
  end do
  if (iand(loc(x), z'1F') .ne. 0) print *, 'x is not 32-byte aligned'
  if (iand(loc(y), z'1F') .ne. 0) print *, 'y is not 32-byte aligned'
  av = vec_splats(a)            ! replicate "a" in four vector slots
  do i = 1, n, 4
    xv = vec_ld(0, x(i))        ! load four contiguous elements of x()
    yv = vec_ld(0, y(i))        ! load four contiguous elements of y()
    yv = vec_madd(av, xv, yv)   ! yv = av*xv + yv
    call vec_st(yv, 0, y(i))    ! store four contiguous elements of y()
  end do
  do i = 1, n
    write(*,'(a,i1,a,f4.1)') 'y(', i, ') = ', y(i)
  end do
end
 
The basic quadword load intrinsic, vec_ld(), returns a vector variable with values taken from the address argument rounded down to the nearest 32-byte boundary. Therefore, it is important to know the alignment of the variable that is used. The vec_lda() signaling variant of the vector load function generates an exception if the address is not 32-byte aligned. To handle arbitrary alignment, determine the alignment and use appropriate shift or permute instructions.
Alignment can be determined by a bitwise AND operation between the address and 0x1F (31 in decimal). Variables of type double can be on 8, 16, 24, or 32-byte boundaries. The shift and permute intrinsics are designed to take two vectors as input and return a vector with the required values, shifted or permuted from the input arguments. For example, if you have a target address that has 16-byte alignment, you can get a vector with four contiguous elements starting from that address by two vector loads followed by one shift. Example 7-24 shows this process in C. The third argument to the vector shift function, vec_sldw(), must be an integer constant, not an integer variable.
Similarly, the argument to the general permute control intrinsic, vec_gpci(), must be an integer constant. It is convenient to use an octal constant for that purpose because there is a one-to-one correspondence between the selected slots from the two vector arguments and the digits of the constant in octal format. For example, to select slots 3, 2, 1, and 0, you can code vec_gpci(03210) in C, or vec_gpci(o'3210') in Fortran. To maximize performance, use 32-byte aligned data.
Example 7-24 shows an example of vector and permute instructions in C.
Example 7-24 Vector shift and permute instructions in C
 
// example of vector shift and permute operations

#include <stdio.h>

#define NPTS 10
static double __attribute__((aligned(32))) x[NPTS];

int main(int argc, char * argv[])
{
    int i;
    vector4double v1, v2, xv, pctl;

    for (i=0; i<NPTS; i++) x[i] = (double) i;

    if (((long) &x[2] & 0x1F) != 16) printf("x[2] is not 16-byte aligned\n");

    v1 = vec_ld(0L, &x[2]);     // v1 has values x[0-3]
    v2 = vec_ld(0L, &x[6]);     // v2 has values x[4-7]

    xv = vec_sldw(v1, v2, 2);   // xv has values x[2-5]
    for (i=0; i<4; i++) printf("xv[%d] = %.1lf\n", i, xv[i]);

    printf("\n");

    pctl = vec_gpci(05432);     // xv gets "x" values from slots 5,4,3,2
    xv = vec_perm(v1, v2, pctl);
    for (i=0; i<4; i++) printf("xv[%d] = %.1lf\n", i, xv[i]);

    return 0;
}
 