Using GNU profiling
This appendix describes the GNU profiling function that is provided in the GNU toolchain for Blue Gene/Q.
For basic documentation about the GNU toolchain profiling tools, see the GNU gprof documentation, which is part of GNU Binutils.
The path for the Blue Gene/Q gprof utility is /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gprof.
Using the Blue Gene/Q gmon tool
The basic gmon support is described in the man pages for the GNU toolchain.
Specifying which ranks generate gmon.out files
Blue Gene/Q applications typically run simultaneously on multiple ranks. The execution path through the code, and therefore the profiling data collected, might vary from rank to rank. A separate gmon.out file is generated for each rank that writes profiling data. The files are named gmon.out.N, where N is the rank that produced the file.
On large blocks, many gmon.out files might be written. It is normally not necessary to write a file for each rank because many of the files contain similar or identical information. To reduce I/O usage, the default setting generates gmon.out files only for ranks 0 through 31. To generate gmon.out files for a different set of ranks, set the BG_GMON_RANK_SUBSET environment variable as shown in Example E-1.
Example E-1 Setting the BG_GMON_RANK_SUBSET variable
BG_GMON_RANK_SUBSET=N /* Only generate the gmon.out file for rank N. */
 
BG_GMON_RANK_SUBSET=N:M /* Generate gmon.out files for all ranks from N to M. */
 
BG_GMON_RANK_SUBSET=N:M:S /* Generate gmon.out files for ranks from N to M in steps of S. For example, 0:16:8 generates gmon.out.0, gmon.out.8, and gmon.out.16. */
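The three forms in Example E-1 can be made concrete with a small parser. The following is a minimal sketch for illustration only, not the code used by the Blue Gene/Q runtime. The names rank_subset, parse_rank_subset, and rank_in_subset are hypothetical. The sketch assumes that the stride S is counted from N, which matches the 0:16:8 example but is not stated for other starting ranks, and it treats an unset variable as the documented default of ranks 0 through 31.

```c
#include <stdlib.h>

/* Parsed form of a BG_GMON_RANK_SUBSET value (hypothetical helper type). */
struct rank_subset {
    long first;   /* N */
    long last;    /* M, equal to N for the single-rank form */
    long stride;  /* S, 1 when not given */
};

/*
 * Parse "N", "N:M", or "N:M:S". A NULL or empty string selects the
 * documented default, ranks 0 through 31.
 * Returns 0 on success and -1 for a malformed value.
 */
int parse_rank_subset(const char *spec, struct rank_subset *out)
{
    long v[3] = {0, 0, 1};
    int count = 0;
    const char *p = spec;
    char *end = NULL;

    if (spec == NULL || *spec == '\0') {
        out->first = 0;
        out->last = 31;
        out->stride = 1;
        return 0;
    }
    while (count < 3) {
        v[count] = strtol(p, &end, 10);
        if (end == p)
            return -1;            /* field has no digits */
        count++;
        if (*end == '\0')
            break;                /* consumed the whole string */
        if (*end != ':')
            return -1;            /* unexpected separator */
        p = end + 1;
    }
    if (*end != '\0')
        return -1;                /* more than three fields */
    if (count == 1)
        v[1] = v[0];              /* "N" means only rank N */
    if (v[0] < 0 || v[1] < v[0] || v[2] <= 0)
        return -1;
    out->first = v[0];
    out->last = v[1];
    out->stride = v[2];
    return 0;
}

/* Nonzero if the given rank would write a gmon.out file under this subset.
 * Assumption: the stride is counted from the first rank. */
int rank_in_subset(const struct rank_subset *rs, long rank)
{
    return rank >= rs->first && rank <= rs->last &&
           (rank - rs->first) % rs->stride == 0;
}
```

A program could pass getenv("BG_GMON_RANK_SUBSET") to parse_rank_subset() to preview which gmon.out.N files to expect before submitting a large job.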
Functions to disable gmon.out files for some nodes
The Blue Gene/Q toolchain includes the mondisable() function. This function can be called from the user program to prevent writing out a gmon.out file on the node from which it is called. This function can be used to reduce the number of gmon.out files that are generated on a large block. It can be especially useful if it is known that many nodes generate the same or similar profiling data.
Profiling for threads
The base GNU toolchain does not provide support for profiling on threads. The patches that are provided for the Blue Gene/Q toolchain include support for profiling on threads. These patches include various methods to enable and disable profiling on a per‑thread basis.
Profiling with the GNU toolchain
Profiling tools provide information about potential bottlenecks in the program. They help identify functions or sections of the code that are good candidates for optimization. With gmon profiling, two levels of profiling information can be generated: machine instruction level or full level. Select options based on the required level of detail and the acceptable amount of overhead. As with any type of performance data collection, monitoring and saving performance information consumes system resources and affects the resulting performance data. The amount of additional resource use, or profiling overhead, is greater with some options than with others. When compiling with the GNU compilers or the IBM XL compilers, enable profiling by adding -pg to the compile flags.
Using timer tick (machine instruction level) profiling
This level of profiling provides timer tick profiling information at the machine instruction level. Profiling data collection is based on the SIGPROF timer, which is enabled in the program. Each time the timer expires, the address of the currently executing instruction (the program counter) is recorded. As the program runs, the data collection builds up a sample of the instruction addresses that the program executes. To enable this type of profiling and no other performance data collection, add the -pg option on the link command but do not include it on the compile commands. This level of profiling adds the least amount of performance collection overhead. It provides profile information based on instruction addresses but does not provide call graph or call count information.
The base GNU toolchain does not profile threads. The Blue Gene/Q toolchain adds support for thread profiling, but it is not enabled by default. To enable thread profiling, link the program with the -pg option and perform one of the following steps:
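The sampling mechanism can be illustrated with standard POSIX calls. The following is a minimal sketch, not the gmon implementation: it arms an ITIMER_PROF timer with setitimer(), and a SIGPROF handler counts each tick against whichever region of the program is marked as running. A real profiler records the interrupted program counter rather than a region index. The names on_sigprof, current_region, samples, start_sampling, stop_sampling, and spin_in_region are hypothetical, and the 10 ms interval is an arbitrary choice.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define NUM_REGIONS 2

/* Region the program is currently executing (stand-in for the program counter). */
volatile sig_atomic_t current_region = 0;

/* Number of timer ticks that landed in each region. */
volatile sig_atomic_t samples[NUM_REGIONS];

/* SIGPROF handler: attribute one tick to the region that was interrupted. */
void on_sigprof(int signo)
{
    (void)signo;
    samples[current_region]++;
}

/* Install the handler and start a 10 ms CPU-time timer. Returns 0 on success. */
int start_sampling(void)
{
    struct sigaction sa;
    struct itimerval tv;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    tv.it_interval.tv_sec = 0;
    tv.it_interval.tv_usec = 10000;   /* 10 ms between ticks (arbitrary) */
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_PROF, &tv, NULL);
}

/* Disarm the timer. Returns 0 on success. */
int stop_sampling(void)
{
    struct itimerval off;

    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}

/* Burn CPU time in the given region until it has collected `want` samples. */
void spin_in_region(int region, int want)
{
    volatile unsigned long sink = 0;

    current_region = region;
    while (samples[region] < want)
        sink++;
}
```

Calling start_sampling(), then spin_in_region(0, 3) and spin_in_region(1, 1), then stop_sampling() leaves at least three samples attributed to region 0 and at least one to region 1. Because this is sampling, the counts approximate where CPU time was spent rather than measuring it exactly.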
Set the BG_GMON_START_THREAD_TIMERS environment variable on the runjob command. Set the variable to “all” to enable the SIGPROF timer on all threads that are created with the pthread_create() function. When profiling an MPI application, additional threads called comm threads might be created to assist with the MPI function; set the variable to “nocomm” to enable the SIGPROF timer on all threads except these extra threads that are created to support MPI.
Add a call to the gmon_start_all_thread_timers() function to the program so that it is called from the main thread. This call configures the SIGPROF timer on all threads that are created with pthread_create() after the point where the call is made. Threads that are created before the call to the gmon_start_all_thread_timers() function are not profiled.
Add a call to the gmon_thread_timer(int start) function from the thread to be profiled. To start the thread timer, call this function with 1 as the argument. To stop the thread timer, call this function with 0 as the argument.
The prototypes for these gmon functions are in sys/gmon.h in the Blue Gene/Q toolchain.
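The per-thread method can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the Blue Gene/Q documentation. It assumes that the compiler predefines the __bgq__ macro on Blue Gene/Q, and it supplies a stand-in for gmon_thread_timer() on other systems so that the example builds anywhere; the stand-in only records the last argument and performs no profiling. The names work, profiled_worker, run_profiled_worker, and stub_timer_state are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#ifdef __bgq__
#include <sys/gmon.h>   /* Blue Gene/Q toolchain: declares gmon_thread_timer() */
#else
/* Stand-in for other systems (assumption): remembers the last request only. */
static int stub_timer_state = 0;
static void gmon_thread_timer(int start)
{
    stub_timer_state = start;
}
#endif

/* Input and output for one worker thread (hypothetical type). */
struct work {
    unsigned long iterations;
    unsigned long result;
};

/* Thread body: profile only the loop by bracketing it with timer calls. */
void *profiled_worker(void *arg)
{
    struct work *w = arg;
    unsigned long i, sum = 0;

    gmon_thread_timer(1);            /* start the SIGPROF timer on this thread */
    for (i = 0; i < w->iterations; i++)
        sum += i;
    gmon_thread_timer(0);            /* stop the timer before the thread exits */

    w->result = sum;
    return NULL;
}

/* Create the worker, wait for it, and return 0 on success. */
int run_profiled_worker(struct work *w)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, profiled_worker, w) != 0)
        return -1;
    return pthread_join(tid, NULL);
}
```

On Blue Gene/Q, the program must still be linked with the -pg option for the timer calls to produce profiling data. Calling gmon_start_all_thread_timers() from the main thread before pthread_create() is the alternative when every subsequently created thread should be profiled without changes to the thread functions.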
Collecting call count information
In addition to instruction profiling, call count information can be collected. To collect this type of data, compile and link all files with the -pg option. This option provides the SIGPROF timer profiling information that is described in “Using timer tick (machine instruction level) profiling”, plus call graph information and procedure call counts. This level of performance data collection introduces the most overhead. Call count information is collected on all threads that execute code that was compiled with the -pg option. It is not affected or controlled by the thread profiling switches that are described in this appendix. When higher levels of compiler optimization are used, the statement mappings and procedure calls might not appear as expected because of inlining, code movement, scheduling, and other optimizations that are performed by the compiler.
To also collect timer tick profiling information for threads, enable thread-level profiling with one of the methods that are described in “Using timer tick (machine instruction level) profiling”.