Performance tooling and empirical performance analysis
This appendix describes the optimization and tuning of the POWER7 processor-based server from the perspective of performance tooling and empirical performance analysis. It covers the following topics:
Introduction
Performance advisors
AIX
Linux
Java (either AIX or Linux)
Introduction
This appendix includes a general description about performance advisors, descriptions of the specific performance advisors that are referenced in this book, and an introduction to empirical performance analysis for:
AIX
Linux
Java (either AIX or Linux)
Performance advisors
IBM developed four new performance advisors that empower users to address their own performance issues to best use their Power Systems server. These performance advisors can be run by a broad class of users.
The first three of these advisors are tools that run and analyze the configuration of a system and the software that is running on it. They also provide advice about the performance implications of the current configuration and suggestions for improvement. These three advisors are documented in “Expert system advisors” on page 156.
The fourth advisor is part of the IBM Rational Developer for Power Systems Software. It is a component of an integrated development environment (IDE), which provides a set of features for performance tuning of C and C++ applications on AIX and Linux. That advisor is documented in “Rational Performance Advisor” on page 161.
Expert system advisors
The expert system advisors are three new tools that are developed by IBM. What is unique about these applications is that they collect and interpret performance data. In one step, they collect performance metrics, analyze data, and provide a one-page visual report. This report summarizes the performance health of the environment, and includes instructions for alleviating detected problems. The performance advisors produce advice that is based on the expertise of IBM performance analysts, and IBM documented preferred practices. These expert systems focus on AIX Partition Virtualization, VIOS, and Java performance.
All of the advisors follow the same reporting format, which is a single page XML file you can use to quickly assess conditions by visually inspecting the report and looking at the descriptive icons, as shown in Figure B-1.
Figure B-1 Descriptive icons in expert system advisors (AIX Partition Virtualization, VIOS Advisor, and Java Performance Advisor)
The XML reports generated by all of the advisors are interactive. If a problem is detected, three pieces of information are shared with the user.
1. What is this?
This section explains why a particular topic was monitored, and provides a definition of the performance metric or setting.
2. Why is it Important?
This report entry explains why the topic is relevant and how it impacts performance.
3. How do I modify?
Instructions for addressing the problem are listed in this section.
VIOS Performance Advisor
The VIOS Performance Advisor provides guidance about various aspects of VIOS, including:
CPU
Shared processing pool
Memory
Fibre Channel performance
Disk I/O subsystem
Shared Ethernet adapter
The output is presented on a single page, and copies of the report can be saved, making it easy to document the settings and performance of VIOS over time. The goal of the advisor is for you to be able to self-assess the health of your VIOS and act to attain optimal performance.
Figure B-2 is a screen capture of the VIOS Performance Advisor, focusing on the FC adapter section of the report, which attempts to guide the user in determining whether any of the FC ports are saturated and, if so, to what extent. An investigate icon is displayed next to the idle FC port so that the user can confirm that the idle adapter port is intentional and the result of an administrative configuration choice.
The VIOS Performance Advisor can be found at:
Figure B-2 shows a screen capture from the VIOS Advisor.
Figure B-2 The VIOS Advisor
Virtualization Performance Advisor
The Virtualization Performance Advisor provides guidance for various aspects of an LPAR, both dedicated and shared, including:
LPAR physical memory domain allocation
Physical CPU entitlement and virtual CPU optimization
SMT effectiveness
Processor folding effectiveness
Shared processing pool
Memory optimization
Physical Fibre Channel adapter optimization
Virtual disk I/O optimization (vSCSI and NPIV)
The output is presented on a single page, and copies of the report can be saved, making it easy for the user to document the settings and performance of their LPAR over time. The goal of the advisor is for the user to be able to self-assess the health of their LPAR and act to attain optimal performance.
Figure B-3 is a snapshot of the LPAR Performance Advisor, focusing on the LPAR optimization section of the report. This section applies virtualization preferred practice guidance to the LPAR configuration, the resource usage of the LPAR, and the shared processor pool, and determines whether the LPAR configuration is optimized. If the advisor finds that the LPAR configuration is not optimal for the workload, it guides the user in determining the best possible configuration. The LPAR Performance Advisor can be found at:
Figure B-3 LPAR Virtualization Advisor
Java Performance Advisor
The Java Performance Advisor provides recommendations to improve performance of a stand-alone Java or WebSphere Application Server application that is running on an AIX machine. The guidance that is provided is categorized into four groups as follows:
Hardware and LPAR-related parameters: Processor sharing, SMT levels, memory, and so on
AIX-specific tunables: Process RSET, TCP buffers, memory affinity, and so on
JVM tunables: Heap sizing, garbage collection (GC) policy, page size, and so on
WebSphere Application Server-related settings for a WebSphere Application Server process
The guidance is based on Java tuning preferred practices. The criteria that are used to determine the guidance include the relative importance of the Java application, machine usage (test/production), and the user's expertise level.
Figure B-4 is a snapshot of Java and WebSphere Application Server recommendations from a sample run, indicating the best JVM optimization and WebSphere Application Server settings for better results, as per Java preferred practices. Details about the metrics can be obtained by expanding each of the metrics. The output of the run is a simple XML file that can be viewed by using the supplied XSL viewer and any browser. The Java Performance Advisor can be found at:
Figure B-4 Java Performance Advisor
Rational Performance Advisor
IBM Rational Developer for Power Systems Software IDE V8.5 introduces a new component that is called Performance Advisor, which provides a rich set of features for performance tuning C and C++ applications on IBM AIX and IBM PowerLinux systems. Although not directly related to the tooling described in “Expert system advisors” on page 156, Rational Performance Advisor has the same goal of helping users to best use Power hardware with tooling that offers simple collection, management, and analysis of performance data.
Performance Advisor gathers data from several sources. The raw application performance data comes from the same expert-level tprof and OProfile CPU profilers described in “AIX” on page 162 and “Linux” on page 171, and other low-level operating system tools. The debug information that is generated by the compiler allows this data to be matched back to the original source code. XLC compilers can generate XML report files that provide information about optimizations that were performed during compilation. Finally, the application build and runtime systems are analyzed to determine whether there are any potential environmental problems.
All of this data is automatically gathered, correlated, analyzed, and presented in a way that is quick to access and easy to understand (Figure B-5).
Figure B-5 Rational Performance Advisor
Key features include:
Performance Explorer organizes your performance tuning sessions and data.
System Scorecard reports on your Power build and runtime environments.
Hotspots Browser shows CPU profiling results for your application and its functions.
Hotspots Comparison Browser compares runs for regression analysis or fix verification.
The Performance Source Viewer and Outline view gives precise line-level profiling results.
Invocations Browser displays dynamic call information from your application
The Recommendations view offers expert-system guidance.
More information about Rational Performance Advisor, including a trial download, can be found in Rational Developer for Power Systems Software, available at:
AIX
This section introduces tools and techniques that are used for optimizing software for the combination of Power Systems and AIX. The intended audience for this section is software development teams. As such, this section does not address performance topics that are related to capacity planning, and system-level performance monitoring and tuning.
For capacity planning, see the IBM Systems Workload Estimator, available at:
For system-level performance monitoring and tuning information for AIX, see Performance Management, available at:
The bedrock of any empirically based software optimization effort is a suite of repeatable benchmark tests. To be useful, such tests must be representative of the manner in which users interact with the software. For many commercial applications, a benchmark test simulates the actions of multiple users that drive a prescribed mix of application transactions. Here, the fundamental measure of performance is throughput (the number of transactions that are run over a period) with an acceptable response time. Other applications are more batch-oriented, where few jobs are started and the time that is taken to completion is measured. Whichever benchmark style is used, it must be repeatable. Within some small tolerance (typically a few percent), running the benchmark several times on the same setup yields the same result.
Tools and techniques that are employed in software performance analysis focus on pinpointing aspects of the software that inhibit performance. At a high level, the two most common inhibitors to application performance are:
Areas of code that consume large amounts of CPU resources. These hotspots are usually caused by inefficient algorithms, poor coding practices, or inadequate compiler optimization.
Waiting for locks or external events. Locks are used to serialize execution through critical sections, that is, sections of code where the need for data consistency requires that only one software thread run at a time. An example of an external event is the system that is waiting for a disk I/O to complete. Although the amount of time that an application must wait for external events might be outside of the control of the application (for example, the time that is required for a disk I/O depends on the type of storage employed), simply being aware that the application is having to wait for such an event can open the door to potential optimizations.
CPU profiling
A CPU profiler is a performance tool that shows in which code CPU resources are being consumed. Tprof is a powerful CPU profiler that encompasses a broad spectrum of profiling functionality:
It can profile any program, library, or kernel extension that is compiled with C, C++, Fortran, or Java compilers. It can profile machine code that is created in real time by the JIT compiler.
It can attribute time to processes, threads, subroutines (user mode, kernel mode, shared library, and Java methods), source statements, and even individual machine instructions.
In most cases, no recompilation of object files is required.
Usage of tprof typically focuses on generating subroutine-level profiles to pinpoint code hotspots, and to examine the impact of an attempted code optimization. A common way to invoke tprof is as follows:
$ tprof -E -skeuz -x sleep 10
The -E flag instructs tprof to employ the PMU as the sampling mechanism to generate the profile. Using the PMU as the sampling mechanism provides a more accurate profile than the default time-based sampling mechanism, as the PMU sampling mechanism can accurately sample regions of kernel code where interrupts are disabled. The s, k, e, and u flags instruct tprof to generate subroutine-level profiles for shared library, kernel, kernel extension, and user-level activity. The z flag instructs tprof to report CPU time in the number of ticks (that is, samples), instead of percentages. The -x sleep 10 argument instructs tprof to collect profiling data during the running of the sleep 10 command. This command collects profile data over the entire system (including all running processes) over a period of 10 seconds.
Excerpts from a tprof report are shown in Example B-1, Example B-2 on page 164, and Example B-3 on page 164.
Example B-1 is a breakdown of samples of the processes that are running on the system. When multiple processes have the same name, they have only one line in this report: the number of processes with that name is in the “Freq” column. “Total” is the total number of samples that are accumulated by the process, and “Kernel”, “User”, and “Shared” are the number of samples that are accumulated by the processes in kernel (including kernel extensions), user space, and shared libraries. “Other” is a catchall for samples that do not fall in the other categories. The most common scenario where samples wind up in “Other” is because of CPU resources that are being consumed by machine code that is generated in real time by the JIT compiler. The -j flag of tprof can be used to attribute these samples to Java methods.
Example B-1 Excerpt from a tprof report - breakdown of samples of processes running on the system
Process Freq Total Kernel User Shared Other
======= ==== ===== ====== ==== ====== =====
wait 4 5810 5810 0 0 0
./version1 1 1672 35 1637 0 0
/usr/bin/tprof 2 15 13 0 2 0
/etc/syncd 1 2 2 0 0 0
/usr/bin/sh 2 2 2 0 0 0
swapper 1 1 1 0 0 0
/usr/bin/trcstop 1 1 1 0 0 0
rmcd 1 1 1 0 0 0
======= === ===== ====== ==== ====== =====
Total 13 7504 5865 1637 2 0
Example B-2 is a breakdown of samples of the threads that are running on the system. In addition to the columns described in Example B-1 on page 163, this report has PID and TID columns that detail the process IDs and thread IDs.
Example B-2 Excerpt from a tprof report - breakdown of threads that are running on the system
Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
wait 16392 16393 1874 1874 0 0 0
wait 12294 12295 1873 1873 0 0 0
wait 20490 20491 1860 1860 0 0 0
./version1 245974 606263 1672 35 1637 0 0
wait 8196 8197 203 203 0 0 0
/usr/bin/tprof 291002 643291 13 13 0 0 0
/usr/bin/tprof 274580 610467 2 0 0 2 0
/etc/syncd 73824 110691 2 2 0 0 0
/usr/bin/sh 245974 606263 1 1 0 0 0
/usr/bin/sh 245976 606265 1 1 0 0 0
/usr/bin/trcstop 245976 606263 1 1 0 0 0
swapper 0 3 1 1 0 0 0
rmcd 155876 348337 1 1 0 0 0
======= === === ===== ====== ==== ====== =====
Total 7504 5865 1637 2 0
Total Samples = 7504 Total Elapsed Time = 18.76s
Example B-3 from the report gives the subroutine-level profile for the version1 program. In this simple example, all of the time is spent in main().
Example B-3 Excerpt from a tprof report - subroutine-level profile for the version1 program, with all time spent in main()
Profile: ./version1
Total Ticks For All Processes (./version1) = 1637
Subroutine Ticks % Source Address Bytes
============= ====== ====== ======= ======= =====
.main 1637 21.82 version1.c 350 536
More information about using AIX tprof for Java programs is available in “Hot method or routine analysis” on page 177.
The functionality of tprof is rich. As such, it cannot be fully described in this guide. For complete tprof documentation, see tprof Command, available at:
AIX trace-based analysis tools
Trace is a powerful utility that is provided by AIX for collecting a time-sequenced log of operating system events on a Power Systems server. The AIX kernel and kernel extensions are richly instrumented with trace hooks that, when trace is activated, append trace records with context-relevant data to a pinned, kernel-resident trace buffer. These records can later be read from that buffer and logged to a disk-resident file. Further utilities are provided to interpret and summarize trace logs and generate human-readable reports. The tprof CPU profiler is one such utility. Besides tprof, two of the most commonly used trace-based utilities are curt and splat.
The curt command takes as its input a trace that is collected by using the AIX trace facility, and generates a report that breaks down how CPU time is consumed by various entities (a sample collection sequence follows this list), including:
Processes (grouped by process name)
Individual processes
Individual threads
System calls (either on a system-wide or per-thread basis)
Interrupts
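As an illustration only (the file names are arbitrary and the exact flags might vary by environment), a typical sequence for collecting a system-wide trace and generating a curt report might look like the following:
# trace -a -o trace.raw
# sleep 10; trcstop
# gensyms > gensyms.out
# curt -i trace.raw -n gensyms.out -o curt.out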
One of the most useful reports from curt is the System Calls Summary. This report provides a system-wide summary of the system calls that are executed while the trace is collected. For each system call, the following information is provided:
Count: The number of times the system call was run during the monitoring interval
Total Time: Amount of CPU time (in milliseconds) consumed in running the system call
% sys time: Percentage of overall CPU capacity that is spent in running the system call
Avg Time: Average CPU time that is consumed for each execution of the system call
Min Time: Minimum CPU time that is consumed during an execution of the system call
Max Time: Maximum CPU time that is consumed during an execution of the system call
SVC: Name and address of the system call
An excerpt from a System Calls Summary report is shown in Example B-4.
Example B-4 System Calls Summary report (excerpt)
System Calls Summary
--------------------
Count Total Time % sys Avg Time Min Time Max Time SVC (Address)
(msec) time (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ================
123647 3172.0694 14.60% 0.0257 0.0128 0.9064 kpread(2a2d5e8)
539 1354.6939 6.24% 2.5133 0.0163 4.1719 listio64(516ea40)
26496 757.6204 3.49% 0.0286 0.0162 0.0580 _esend(2a29f88)
26414 447.7029 2.06% 0.0169 0.0082 0.0426 _erecv(2a29e98)
9907 266.1382 1.23% 0.0269 0.0143 0.5350 kpwrite(2a2d588)
34282 167.8132 0.77% 0.0049 0.0032 0.0204 _thread_wait(2a28778)
As a first step, compare the mix of system calls to how the application is expected to behave. Is the mix aligned with expectations? If not, first confirm that the trace was collected while the intended workload was running. If the trace was collected at the correct time and the mix still differs from expectations, investigate the application logic. Also, examine the list of system calls for potential optimizations. For example, if select or poll is used frequently, consider employing the pollset facility (see “pollset” on page 71).
As a further breakdown, curt provides a report of the system calls run by each thread. An example report is shown in Example B-5.
Example B-5 system calls run by each thread
Report for Thread Id: 549305 (hex 861b9) Pid: 323930 (hex 4f15a)
Process Name: proc1
--------------------
Total Application Time (ms): 89.010297
Total System Call Time (ms): 160.465531
Total Hypervisor Call Time (ms): 18.303531
Thread System Call Summary
--------------------------
Count Total Time Avg Time Min Time Max Time SVC (Address)
(msec) (msec) (msec) (msec)
======== =========== ======== ======== ======== ================
492 157.0663 0.3192 0.0032 0.6596 listio64(516ea40)
494 3.3656 0.0068 0.0002 0.0163 GetMultipleCompletionStatus(549a6a8)
12 0.0238 0.0020 0.0017 0.0022 _thread_wait(2a28778)
6 0.0060 0.0010 0.0007 0.0014 thread_unlock(2a28838)
4 0.0028 0.0007 0.0005 0.0008 thread_post(2a288f8)
Another useful report that is provided by curt is the Pending System Calls Summary. This summary shows the list of threads that are in an unfinished system call at the end of the trace. An example report is given in Example B-6.
Example B-6 Threads that are in an unfinished system call at the end of the trace
Pending System Calls Summary
----------------------------
Accumulated SVC (Address) Procname (Pid Tid)
Time (msec)
============ ========================= ==========================
0.0082 GetMultipleCompletionStatus(549a6a8) proc1(323930 532813)
0.0089 _nsleep(2a28d30) proc2(270398 545277)
0.0054 _thread_wait(2a28778) proc1(323930 549305)
0.0088 GetMultipleCompletionStatus(549a6a8) proc1(323930 561437)
3.3981 listio64(516ea40) proc1(323930 577917)
0.0130 kpwrite(2a2d588) proc1(323930 794729)
For each thread in an unfinished system call, the following items are provided:
The accumulated time in the system call
The name of the system call (followed by the system call address in parentheses)
The process name, followed by the Process ID and Thread ID in parentheses
This report is useful in determining what system calls are blocking threads from proceeding. For example, threads appearing in this report with an unfinished recv call are waiting on data to be received over a socket.
Another useful trace-based tool is splat, which is the Simple Performance Lock Analysis Tool. The splat tool provides reports about the usage of kernel and application (pthread-level) locks. At the pthread level, splat can report about the usage of pthread synchronizers: mutexes, read/write locks, and condition variables. Importantly, splat provides data about the degree of contention and blocking on these objects, an important consideration in creating highly scalable and pthread-based applications.
The pthread library instrumentation does not provide names or classes of synchronizers, so the addresses are the only way that you have to identify them. Under certain conditions, the instrumentation can capture the return addresses of the function call stack, and these addresses are used with the output of the gensyms tool to identify the call chains when these synchronizers are created. The creation and deletion times of the synchronizer can sometimes be determined as well, along with the ID of the pthread that created them.
An example of a mutex report from splat is shown in Example B-7.
Example B-7 Mutex report from splat
[pthread MUTEX] ADDRESS: 00000000F0154CD0
Parent Thread: 0000000000000001 creation time: 26.232305
Pid: 18396 Process Name: trcstop
Creation call-chain ==================================================================
00000000D268606C .pthread_mutex_lock
00000000D268EB88 .pthread_once
00000000D01FE588 .__libs_init
00000000D01EB2FC .dne_callbacks
00000000D01EB280 ._libc_declare_data_functions
00000000D269F960 ._pth_init_libc
00000000D268A2B4 .pthread_init
00000000D01EAC08 .__modinit
000000001000014C .__start
| | | Percent Held ( 26.235284s )
Acqui- | Miss Spin Wait Busy | Secs Held | Real Real Comb Real
sitions | Rate Count Count Count |CPU Elapsed | CPU Elapsed Spin Wait
1 | 0.000 0 0 0 |0.000006 0.000006 | 0.00 0.00 0.00 0.00
-------------------------------------------------------------------------------------
Depth Min Max Avg
SpinQ 0 0 0
WaitQ 0 0 0
Recursion 0 1 0
Acqui- Miss Spin Wait Busy Percent Held of Total Time
PThreadID sitions Rate Count Count Count CPU Elapse Spin Wait
~~~~~~~~~~ ~~~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~
1 1 0.00 0 0 0 0.00 0.00 0.00 0.00
Acqui- Miss Spin Wait Busy Percent Held of Total Time
Function Name sitions Rate Count Count Count CPU Elapse Spin Wait Return Address Start Address Offset
^^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^
^^^^^^^^
.pthread_once 0 0.00 0 0 0 99.99 99.99 0.00 0.00 00000000D268EC98 00000000D2684180
.pthread_once 1 0.00 0 0 0 0.01 0.01 0.00 0.00 00000000D268EB88 00000000D2684180
In addition to the common header information and the [pthread MUTEX] identifier, this report lists the following lock details:
Parent thread Pthread ID of the parent pthread
Creation time Elapsed time in seconds after the first event recorded in trace (if available)
Deletion time Elapsed time in seconds after the first event recorded in trace (if available)
PID Process identifier
Process Name Name of the process using the lock
Call-chain Stack of called methods (if available)
Acquisitions The number of times the lock was acquired in the analysis interval
Miss Rate The percentage of attempts that failed to acquire the lock
Spin Count The number of unsuccessful attempts to acquire the lock
Wait Count The number of times a thread is forced into a suspended wait state while waiting for the lock to come available
Busy Count The number of trylock calls that returned busy
Seconds Held This field contains the following subfields:
CPU The total number of processor seconds the lock is held by a running thread.
Elapse(d) The total number of elapsed seconds the lock is held, whether the thread was running or suspended.
Percent Held This field contains the following subfields:
Real CPU The percentage of the cumulative processor time the lock was held by a running thread.
Real Elapsed The percentage of the elapsed real time the lock is held by any thread, either running or suspended.
Comb(ined) Spin The percentage of the cumulative processor time that running threads spend spinning while trying to acquire this lock.
Real Wait The percentage of elapsed real time that any thread was waiting to acquire this lock. If two or more threads are waiting simultaneously, this wait time is only charged one time. To learn how many threads are waiting simultaneously, look at the WaitQ Depth statistics.
Depth This field contains the following subfields:
SpinQ The minimum, maximum, and average number of threads that are spinning on the lock, whether running or suspended, across the analysis interval
WaitQ The minimum, maximum, and average number of threads that are waiting on the lock, across the analysis interval
Recursion The minimum, maximum, and average recursion depth to which each thread held the lock
Finding alignment issues
Improperly aligned code or data can cause performance degradation. By default, the IBM compilers and linkers correctly align code and data, including stack and statically allocated variables. Incorrect typecasting can result in references to storage that are not correctly aligned. There are two types of alignment issues to be concerned with:
Alignment issues that are handled by microcode in the POWER7 processor
Alignment issues that are handled through alignment interrupts.
Examples of alignment issues that are handled by microcode with a performance penalty in the POWER7 processor are loads that cross a 128-byte boundary and stores that cross a 4 KB page boundary. To give an indication of the penalty for this type of misalignment, on a 4 GHz processor, a nine-instruction loop that contains an 8-byte load that crosses a 128-byte boundary takes double the time of the same loop with the load correctly aligned.
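The following minimal sketch (hypothetical code, written only to illustrate the hazard; variable names are arbitrary) shows how an incorrect typecast can produce 8-byte loads that cross a 128-byte boundary and are then handled by microcode:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    void *raw = NULL;
    double sum = 0.0;

    /* Request a 128-byte aligned buffer so that the misalignment below is deterministic. */
    if (posix_memalign(&raw, 128, 256) != 0)
        return 1;
    memset(raw, 0, 256);

    /* Casting an odd byte offset to a double pointer creates an 8-byte load at
       offset 124, which spans bytes 124-131 and crosses a 128-byte boundary. */
    volatile double *misaligned = (volatile double *)((char *)raw + 124);

    for (long i = 0; i < 100000000L; i++)
        sum += *misaligned;   /* unaligned load, handled by microcode on POWER7 */

    printf("sum = %f\n", sum);
    free(raw);
    return 0;
}

Profiling this kind of loop with hpmcount group 38, as described next, shows a high count of LRQ unaligned load flushes relative to the instructions completed.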
Alignment issues that are handled by microcode can be detected by running hpmcount or hpmstat. The hpmcount command is a command-line utility that runs a command and collects statistics from the POWER7 PMU while the command runs. To detect alignment issues that are handled by microcode, run hpmcount to collect data for group 38. An example is provided in Example B-8.
Example B-8 Example of the results of the hpmcount command
# hpmcount -g 38 ./unaligned
Group: 38
Counting mode: user
Counting duration: 21.048874056 seconds
PM_LSU_FLUSH_ULD (LRQ unaligned load flushes) : 4320840034
PM_LSU_FLUSH_UST (SRQ unaligned store flushes) : 0
PM_LSU_FLUSH_LRQ (LRQ flushes) : 450842085
PM_LSU_FLUSH_SRQ (SRQ flushes) : 149
PM_RUN_INST_CMPL (Run instructions completed) : 19327363517
PM_RUN_CYC (Run cycles) : 84219113069
Normalization base: time
Counting mode: user
Derived metric group: General
[ ] Run cycles per run instruction : 4.358
The hpmstat command is similar to hpmcount, except that it collects performance data on a system-wide basis, rather than just for the execution of a command.
Generally, scenarios in which the ratio of (LRQ unaligned load flushes + SRQ unaligned store flushes) divided by Run instructions completed is greater than 0.5% must be further investigated. The tprof command can be used to further pinpoint where in the code the unaligned storage references are occurring. To pinpoint unaligned loads, the -E PM_MRK_LSU_FLUSH_ULD flag is added to the tprof command line, and to pinpoint unaligned stores, the -E PM_MRK_LSU_FLUSH_UST flag is added. When these flags are used, tprof generates a profile in which unaligned loads and stores are sampled instead of time.
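As a hedged check against the hpmcount output in Example B-8, the ratio works out to (4320840034 + 0) / 19327363517, which is approximately 0.22, or roughly 22% of completed instructions, far above the 0.5% guideline, so the unaligned loads warrant investigation. A possible follow-up invocation, reusing the tprof flags that are described earlier in this appendix, might look like the following:
$ tprof -E PM_MRK_LSU_FLUSH_ULD -skeuz -x sleep 10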
Examples of alignment issues that cause an alignment interrupt include execution of an lmw or lwarx instruction on a non-word-aligned boundary. These issues can be detected by running alstat. This command can be invoked with an interval, which is the number of seconds between each report. An example is presented in Example B-9.
Example B-9 Alignment issues can be addressed with the alstat command
> alstat 5
Alignment Alignment
SinceBoot Delta
2016 0
2016 0
2016 0
2016 0
2016 0
2016 0
The key metric in the alstat report is the Alignment Delta. This metric is the number of alignment interrupts that occurred during the interval. Non-zero counts in this column merit further investigation with tprof. Invoking tprof with the -E ALIGNMENT flag generates a profile that shows where the unaligned references are occurring.
For more information, see alstat Command, available at:
Finding emulation issues
Over the 20+ year evolution of the Power instruction set, a few instructions were removed. Instead of trapping programs that run these instructions, AIX emulates them in the kernel, although with a significant processing impact. Generally, programs that are written in a third-generation language (for example, C and C++) and compiled with an up-to-date compiler do not contain these emulated instructions. However, older binary files or older hand-written assembly language might contain such instructions, and because they are silently emulated by AIX, the performance penalty might not be readily apparent.
The emstat command detects the presence of these instructions. Like alstat, it is invoked with an interval, which is the number of seconds between reports. An example is shown in Example B-10.
Example B-10 The emstat command detects the presence of emulated instructions
> emstat 5
Emulation Emulation
SinceBoot Delta
0 0
0 0
0 0
0 0
0 0
The key metric is the Emulation Delta (the number of instructions that are emulated during each interval). Non-zero values merit further investigation. Invoking tprof with the -E EMULATION flag generates a profile that shows where the emulated instructions are.
For more information, see emstat Command, available at:
hpmstat, hpmcount, and tprof -E
The POWER7 processor provides a powerful on-chip PMU that can be used to count the number of occurrences of performance-critical processor events. A rich set of events is countable; examples include level 2 and level 3 d-cache misses, and cache reloads from local, remote, and distant memory. Local memory is memory that is attached to the same POWER7 processor chip that the software thread is running on. Remote memory is memory that is attached to a different POWER7 processor that is in the same CEC (that is, the same node or building block in the case of a multi-CEC system, such as a Power 780) that the software thread is running on. Distant memory is memory that is attached to a POWER7 processor that is in a different CEC from the CEC the software thread is running on.
Two commands exist to count PMU events: hpmcount and hpmstat. The hpmcount command is a command-line utility that runs a command and collects statistics from the PMU while the command runs. The hpmstat command is similar to hpmcount, except that it collects performance data on a system-wide basis, rather than just for the execution of a command.
Further documentation about hpmcount and hpmstat can be found at:
In addition to simply counting processor events, the PMU can be configured to sample instructions based on processor events. With this capability, profiles can be generated that show which parts of an application are experiencing specified processor events. For example, you can show which subroutines of an application are generating level 2 or level 3 cache misses. The tprof profiler includes this functionality through the -E flag, which allows a PMU event name to be provided to tprof as the sampled event. The list of PMU events can be generated by running pmlist -c -1. Whenever possible, perform profiling by using marked events, which provide more accurate results than unmarked events. The marked events begin with the prefix PM_MRK_.
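For example, the available marked events can be listed by running pmlist -c -1 and searching the output for names that begin with PM_MRK_ (for instance, pmlist -c -1 | grep PM_MRK); the chosen event name is then passed to tprof through the -E flag.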
For more information about using the -E flag of tprof, go to:
Linux
This section introduces tools and techniques used for optimizing software on the combination of Power Systems and Linux. The intended audience for this section is software development teams.
Empirical performance analysis using the IBM SDK for PowerLinux
After you apply the best high-level optimization techniques, a deeper level of analysis might be required to gain more performance improvements. You can use the IBM SDK for PowerLinux to help you gain these improvements.
The IBM SDK for PowerLinux is a set of tools that support:
Hot spot analysis
Analysis of ported code for missed platform-specific optimization
Whole program analysis for coding issues, for example, pipeline hazards, inlining opportunities, early exits and hidden path length, devirtualization, and branch prediction hints
Lock contention and IO delay analysis
The IBM SDK for PowerLinux can be found at:
The SDK provides an Eclipse C/C++ IDE with Linux tools integration, including graphical presentation and source code view integration with Linux execution profiling (gprof and OProfile), malloc and memory usage (Valgrind), pthread synchronization (Helgrind), SystemTap tapsets, and tapset development.
Hotspot analysis
Profile the application and look for hotspots while you run it under one or more representative workloads, using a hardware-based profiling tool such as OProfile. OProfile can be run directly as a command-line tool or under the IBM SDK for PowerLinux.
The OProfile tools can monitor the whole system (LPAR), including all the tasks and the kernel. This action requires root authority, but is the best way to profile the kernel and complex applications with multiple cooperating processes. OProfile is fully enabled to take samples using the full set of the PMU events (run ophelp for a complete list of events). OProfile can produce text file reports organized by process, program and libraries, function symbols, and annotated source file and line number or machine code disassembly.
The IBM SDK can profile applications that are associated with Eclipse projects. The SDK automates the setup and running of the profile, but is restricted to a single application, its libraries, and direct kernel calls. The SDK is easier to use, as it is hierarchically organized by percentage with program, function symbol, and line number. Clicking the line number in the profile pane jumps the source view pane to the matching source file and line number. This action simplifies edit, compile, and profile tuning activities.
The whole system profile is a good place to start. You might find that your application is consuming most of the CPU cycles, and deeper analysis of the application is the next logical step. The IBM SDK for PowerLinux provides a number of helpful tools, including integrated application profiling (OProfile and Valgrind), Migration Assistant, and the Source Code Advisor.
High kernel usage
If the bulk of the CPU cycles are consumed in the kernel or runtime libraries that are not part of your application, then a different type of analysis is required. If the kernel is consuming significant cycles, then the application might be I/O or lock contention bound. This situation can occur when an application moves to larger systems (higher core count) and fails to scale up.
I/O-bound applications can be constrained by small buffer sizes or a poor choice of access method. One issue to look for is applications that use local loopback sockets for interprocess communications (IPC). This situation is common for applications that are migrating from early scale-out designs to larger systems (and core counts). The first application change is to choose a lighter weight form of IPC for in-system communications.
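As one hedged illustration (UNIX domain sockets are only one possible lighter weight choice, and the function and path names below are hypothetical), the client-side change is often as small as the address family that is used when the socket is created:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Loopback TCP connection: traverses the full TCP/IP stack even though
   both endpoints are on the same system.  Error handling is omitted. */
int open_loopback(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    return fd;
}

/* UNIX domain connection: stays on a shorter in-kernel IPC path. */
int open_unix(const char *path)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    return fd;
}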
Excessive locking or poor lock granularity can also result in high kernel usage (in the kernel’s spin_lock, futex, and scheduler components) when applications move to larger system configurations. This situation might require adjusting the application lock strategy, and possibly the type of lock mechanism that is used:
POSIX pthread_mutex and pthread_rwlock locks are complex and heavy, and POSIX semaphores are simpler and lighter.
Use trylock forms to spin in user mode for a limited time when appropriate. Use this technique when there is normally a finite lock hold time and limited contention for the resource. It avoids context-switch and scheduler overhead in the kernel (a minimal sketch follows this list).
Reserve POSIX pthread_spinlock and sched_yield for applications that have exclusive use of the system and with carefully designed thread affinity (assigning specific threads to specific cores).
The compiler provides inline functions (__sync_fetch_and_add, __sync_fetch_and_or, and so on) that are better suited for simple atomic updates than a POSIX lock and unlock pair. Use thread-local storage, where appropriate, to avoid locking in thread-safe code.
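The following minimal sketch (hypothetical function and variable names; the spin limit is a workload-dependent tuning value) shows a bounded user-mode trylock spin that falls back to a blocking lock, and a __sync atomic update that avoids a lock entirely:

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile uint64_t counter;          /* updated atomically, no lock needed */

void update_shared_state(void (*do_work)(void))
{
    enum { SPIN_LIMIT = 1000 };            /* tune for the expected hold time */
    int spins = 0;

    /* Spin in user mode for a limited time, then give up the CPU and block. */
    while (pthread_mutex_trylock(&lock) != 0) {
        if (++spins >= SPIN_LIMIT) {
            pthread_mutex_lock(&lock);     /* blocking path for heavy contention */
            break;
        }
    }

    do_work();                             /* critical section */
    pthread_mutex_unlock(&lock);

    /* A simple counter needs only an atomic built-in, not a lock/unlock pair. */
    __sync_fetch_and_add(&counter, 1);
}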
Using the IBM SDK for PowerLinux Trace Analyzer
The IBM SDK for PowerLinux provides tools, including the SystemTap and pthread monitor, for tracking I/O and lock usage of a running application. The higher level Trace Analyzer tools can target a specific application for combined SystemTap syscall trace and Lock Trace. The resulting trace information is correlated for time strip display and analysis within the tool.
High library usage
If libraries are consuming significant cycles, then you must determine whether:
Those libraries are part of your application, or are provided by a third party or the Linux distribution
There are alternative libraries that are better optimized
You can recompile those libraries at a higher optimization
Libraries that are part of your application require the same level of empirical analysis as the rest of your application (by using source profiling and the Source Code Advisor). Libraries that are used by, but are not part of, your application suggest a number of options and strategies:
Most open source packages in the Linux environment are compiled with optimization level -O2 and tend to avoid additional (higher level GCC) compiler options. This configuration might be sufficient for a CISC processor with limited register resources, but not sufficient for a RISC based register-rich processor, such as POWER7.
A RISC-based, superscalar, out-of-order execution processor chip such as POWER7 requires more aggressive inlining and loop-unrolling to capitalize on the larger register set and superscalar design point. Also, automatic vectorization is not enabled at this lower (-O2) optimization level, and so the vector registers and ISA features go unused.
In GCC, you must specify the -O3 optimization level and inform the compiler that you are running on a newer processor chip with the Vector ISA extensions. In fact, with GCC, you need both -O3 and -mcpu=power7 for the compiler to generate code that capitalizes on the new VSX feature of POWER7.
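For example (illustrative only; the source file name is hypothetical), such a compilation might look like the following:
gcc -O3 -mcpu=power7 -c hot_loop.c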
One source of optimized libraries is the IBM Advance Toolchain for PowerLinux. The Advance Toolchain provides alternative runtime libraries for all the common POSIX C language, Math, and pthread libraries that are highly optimized (-O3 and -mcpu=) for multiple Power platforms (including POWER7). The Advance Toolchain run time RPM provides multiple CPU tuned library instances and automatically selects the specific library version that is optimized for the specific POWER5, POWER6, or POWER7 machine.
If there are specific open source or third-party libraries that are dominating the execution profile of your application, you must ask the distribution or library product owner to provide a build using higher optimization. Alternatively, for open source library packages, you can build your own optimized binary version of those packages.
Deeper empirical analysis
If simple recompilation with higher optimization options or even a more capable compiler does not provide acceptable performance, then deeper analysis is required. The IBM SDK for PowerLinux integrates the following analysis tools:
Migration Assistant analysis of non-portable code and data types
Application-specific hotspot profiling
Source Code Advisor (SCA) analysis for non-performing code idioms and induced execution hazards
The Migration Assistant analyzes the source code directly and does not require a running binary application for analysis. Profiling and the SCA do require compiled application binary files and an application-specific benchmark or repeatable workload for analysis.
The Migration Assistant
For applications that originate on another platform, the Migration Assistant (MA) can identify non-portable code that must be addressed for a successful port to Power Systems. The MA uses the Eclipse infrastructure to analyze:
Data endian dependent unions and structures
Casts with potential endian issues
Non-portable data types
Non-portable inline assembler code
Non-portable or architecture-dependent compiler built-ins
Proprietary or architectural-specific APIs
Program usage of non-portable data types and inline assembler can cause poor performance on POWER processors, and must always be investigated and addressed.
For example, the long double data type is supported for both Intel x86 and Power, but has a different size, data range, and implementation. The x86 80-bit Floating Point format is implemented in hardware and is usually faster than the AIX long double, which is implemented as an algorithm using two 64-bit doubles. Neither one is fully IEEE-compliant, and both must be avoided in cross-platform application codes and libraries.
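A quick, hedged way to see the difference on each platform (illustrative only) is to print the storage sizes that the compiler uses:

#include <stdio.h>

int main(void)
{
    /* On x86 Linux, long double is typically the 80-bit extended format
       (stored in 12 or 16 bytes); on Power, it is typically a 128-bit value
       built from two 64-bit doubles. */
    printf("sizeof(long double) = %zu\n", sizeof(long double));
    printf("sizeof(double)      = %zu\n", sizeof(double));
    return 0;
}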
Another example is a small Intel-specific optimization that uses inline x86 assembler and conditionally provides a generic C implementation for other platforms. In most cases, GCC provides an equivalent built-in function that generates optimal code for each platform. Replacing inline assembler with GCC built-in functions makes the application more portable and provides equivalent or better performance on all platforms.
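As a hedged example (the function names are hypothetical), a byte swap that is written as x86 inline assembler can be replaced by the GCC __builtin_bswap32 built-in, which is portable and generates efficient code on both x86 and Power:

#include <stdint.h>

/* x86-only version: inline assembler plus a fallback for other platforms. */
static inline uint32_t bswap32_x86(uint32_t x)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__("bswap %0" : "+r"(x));
    return x;
#else
    return ((x & 0x000000ffU) << 24) | ((x & 0x0000ff00U) << 8) |
           ((x & 0x00ff0000U) >> 8)  | ((x & 0xff000000U) >> 24);
#endif
}

/* Portable replacement: one line, no conditional compilation. */
static inline uint32_t bswap32_portable(uint32_t x)
{
    return __builtin_bswap32(x);
}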
To use the MA tool, complete the following steps:
1. Import your project into the SDK.
2. Select Project properties.
3. Check the Linux/x86 to PowerLinux application Migration check box under C/C++ General/Code Analysis.
Hotspot profiling
The IBM SDK for PowerLinux integrates the Linux OProfile hardware event profiling with the application source code view. This configuration is a convenient way to do hotspot analysis. The integrated Linux Tools profiler focuses on an application that is selected from the current SDK project.
After you run the application, the SDK opens an OProfile tab in the console window. This window shows a nested set of twisties, starting with the event (cycles by default), then program/library, function, and source line (within function). The developer drills down by opening the twisties in the profile window, revealing the next level of detail. Items are ordered by profile frequency, with the highest frequency first. Clicking the function or line number entries in the profile window causes the source view to jump to the corresponding source file or line number.
This process is a convenient way to do hotspot analysis, focusing only on the top three to five items at each level in the profile. Examine the source code for algorithmic problems, excess conversions, unneeded debug code, and so on, and make the appropriate source code changes.
With your application code (or a subset of it) imported into the SDK, it is easy to edit, compile, and profile code changes and verify improvements. As the developer makes code improvements, the hotspots in the profile change. Repeat this process until performance is satisfactory or all the profile entries at the function level are in the low single digits.
To use the integrated profiler, right-click the project and select Profile As → Profile with OProfile. If your project contains multiple applications, or the application needs setup or inputs to run the specific workload, then create Profile Configurations as needed.
Detailed analysis with the Source Code Advisor
Hotspot analysis might not find all of the latent performance problems, especially coding-style issues and some machine-specific hazards. These problems tend to be diffused across the application, and do not show up in hotspot analysis. Common examples of machine hazards include address translation, cache misses, and branch mispredictions.
Complex C++ applications, or C programs that use object-based techniques, might see performance issues that are related to using many small functions and indirect calls. Unless the compiler or optimizer can see the whole program or library, it cannot prove that it is safe to optimize these cases. However, the developer can manually optimize at the source level, because the developer knows the original intent or actual usage in context.
The SCA can find and recommend solutions for many of these coding style and machine hazards. The process generates a journal that associates performance problems (including hazards) with specific source file and line numbers.
The SCA window has a drill-down hierarchy similar to the profile window described in “Hotspot profiling” on page 175. The SCA window is organized as a list of problem categories, and then nested twisties, for affected functions and source line numbers within functions. Functions and lines are ordered by the percent of overall contribution to execution time. Associated with each problem is a plain language description and suggested solution that describes a source change or compiler or linker options that are expected to resolve the problem. Clicking the line number item jumps the source display to the associated source file and line number for editing.
SCA uses the Feedback Directed Program Restructuring (FDPR) tool to instrument your application (or library) for a code and data flow trace while you run a workload. The resulting FDPR journal is used to drive the SCA analysis. Running FDPR and retrieving the journal can be automated by clicking Profile as → Profile with Source Code Advisor.
Java (either AIX or Linux)
Focused empirical analysis of Java applications involves gathering specific types of performance information, making and assessing changes, and repeating the process. The specific areas to consider, the types of performance information to gather, and the tools to use, are described in this section.
32-bit or 64-bit JDK
All other things being equal, a 64-bit JDK that uses -Xcompressedrefs generally has about 5% lower performance than a 32-bit JDK. Without the -Xcompressedrefs option, a 64-bit JDK might have 10% or more reduced performance compared with a 32-bit JDK. Give careful consideration to the choice of a 32-bit or 64-bit JVM. It is not a good choice to take an application that suffers from excessive object allocation rates and switch to a 64-bit JVM simply to allow a larger heap size. The references in the related tools and analysis techniques information can be used to diagnose object allocation issues in an application.
For more information about this topic, see:
Java heap size, and garbage collection policies and parameters
The performance of Java applications is often influenced by the heap size, GC policy, and GC parameters. Try different combinations, which are guided by appropriate data gathering and analysis. Various tools and diagnostic options are available that can provide detailed information about the state of the JVM. The information that is provided can be used to guide tuning decisions to maximize performance for an application or workload.
Verbose GC Log
The verbose GC log is a key tool to understanding the memory characteristics of a particular workload. The information that is provided in the log can be used to guide tuning decisions to minimize GC impact and improve overall performance. Logging can be activated with the -verbose:gc option and is directed to the command terminal. Logging can be redirected to a file with the -Xverbosegclog:<file> option.
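For example (MyApp and gc.log are placeholder names), verbose GC logging can be enabled and redirected to a file as follows:
java -verbose:gc -Xverbosegclog:gc.log MyApp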
Verbose logs capture many types of GC events, such as regular GC cycles, allocation failures, heap expansion and contraction, events related to concurrent marking, and scavenger collections. Verbose logs also show the approximate length of time many events take, the number of bytes processed (if applicable), and other relevant metrics. Information relevant to many of the tuning issues for GC can be obtained from the log, such as appropriate GC policies, optimal constant heap size, optimal min and max free space factors, and growth and shrink sizes. For a detailed description of verbose log output, consult the material on this subject in the Java Diagnostics Guide, available at:
Garbage collection and memory visualizer
For large, long-running workloads, verbose logs can quickly grow in size, making them difficult to work with when analyzing an application's behavior over time. The Garbage Collection and Memory Visualizer is a tool that can parse verbose GC logs and present them in a visual manner by using graphs and other diagrams, allowing trends and totals to be easily and quickly recognized. The graphs can be used to determine the minimum and maximum heap usage and the growth and shrink rates over time, and to identify oscillating behaviors. This information can be especially helpful when you choose optimal GC parameters. The GC and Memory Visualizer can also compare multiple logs side by side, which can aid in testing various options in isolation and determining their effects.
For more information about the GC and Memory Visualizer, see Java diagnostics, IBM style, Part 2: Garbage collection with the IBM Monitoring and Diagnostic Tools for Java – Garbage Collection and Memory Visualizer, available at:
Java Health Center
The Java Health Center is the successor to both the GC and Memory Visualizer and the Java Lock Monitor. It is an all-in-one tool that provides information about GC activity, memory usage, and lock contention. The Health Center also functions as a profiler, providing sample-based statistics on method execution. The Health Center functions as an agent of the JVM being monitored and can provide information throughout the life of a running application.
For more information about the Java Health Center, see Java diagnostics, IBM style, Part 5: Optimizing your application with the Health Center, available at:
Hot method or routine analysis
A CPU profile shows a breakdown of the time that is spent in Java methods and JNI or system routines. Investigate any hot methods or routines to determine if the concentration of execution time in them is warranted or whether there is poor coding or other issues.
Some tools and techniques for this analysis include:
AIX tprof profiling. For more information, see tprof Command, available at:
Linux Oprofile profiling. For more information, see the following resources:
 – Taking advantage of oprofile, available at:
 – Oprofile with Java Support, available at:
 – OProfile manual, available at:
General information about running the profiler and interpreting the results are contained in the sections on profiling in “AIX” on page 162 and “Linux” on page 171. For Java profiling, additional Java options are required to be able to profile the machine code that is generated for methods by the JIT compiler:
AIX 32-bit: -agentlib:jpa=instructions=1
AIX 64-bit: -agentlib:jpa64=instructions=1
Linux Oprofile: -agentlib:jvmti_oprofile
The entire execution of a Java program can be profiled, for example on AIX by running the following command:
tprof -ujeskzl -A -I -E -x java …
However, it is more common to profile Java after a warm-up period so that JIT compilation activity has generally completed. To profile after a warm-up, start Java and wait an appropriate interval until steady-state performance is reached, which is anywhere from a few seconds to a few minutes for large applications. Then, invoke the profiler, for example, on AIX, by running the following command:
tprof -ujeskzl -A -I -E -x sleep 60
On Linux, Oprofile can be used in a similar fashion; for more information, see “Java profiling example”, and follow the appropriate documentation in the resources included in this section.
Java profiling example
Example B-11 contains a sample Java program that is profiled on AIX and Linux. This program does some meaningless work and is purposely poorly written to illustrate lock contention and GC impact in the profile. The program creates three threads but serializes their execution by having them attempt to lock the same object. One thread at a time acquires the lock, forcing the other two threads to wait until they can get the lock and run the code that is protected by the synchronized statement in the doWork method. While they wait to acquire the lock, the threads initially use spin locking, repeatedly checking if the lock is free. After a suitable amount of spinning, the threads block rather than continuing to use CPU resources.
Example B-11 Sample Java program
public class ProfileTest extends Thread {
 
      static Object o;  /* used for locking to serialize threads */
      static Double A[], B[], C[];
      static int Num=1000;
 
 
      public static void main(String[] args) {
          o = new Object();
          new ProfileTest().start();  /* start 3 threads */
          new ProfileTest().start();  /* each thread executes the "run" method */
          new ProfileTest().start();
      }
 
 
      public void run() {
          double sum = 0.0;
          for (int i = 0; i < 50000; i++) {
              sum += doWork();  /* repeatedly do some work */
          }
          System.out.println("sum: "+sum);  /* use the results of the work */
      }
 
 
      public double doWork() {
          double d;
          synchronized (o) {  /* serialize the threads to create lock contention */
              A = new Double [Num];
              B = new Double [Num];
              C = new Double [Num];
              initialize();
              calculate();
              d = C[0].doubleValue();
          }
          return(d);  /* use the calculated values */
      }
 
 
      public static void initialize() {
          /* Initialize A and B. */
          for (int i = 0; i < Num; i++) {
              A[i] = new Double(Math.random());  /* use new to create objects */
              B[i] = new Double(Math.random());  /* to force garbage collection */
          }
      }
 
 
      public static void calculate() {
          for (int i = 0; i < Num; i++) {
              C[i] = new Double(A[i].doubleValue() * B[i].doubleValue());
          }
      }
  }
The program also uses the Double class, creating many short-lived objects by using new. By running the program with a small Java heap, GC frequently is required to free the Java heap space that is taken by the Double objects that are no longer in use.
Example B-12 shows how this program was run and profiled on AIX. 64-bit Java was used with the options -Xms10m and -Xmx10m to specify the size of the Java heap. The profile that is generated appears in the java.prof file.
Example B-12 Results of running tprof on AIX
 # tprof -ujeskzl -A -I -E -x java -Xms10m -Xmx10m -agentlib:jpa64=instructions=1 ProfileTest
 
Starting Command java -Xms10m -Xmx10m -agentlib:jpa64=instructions=1 ProfileTest
 
 sum: 12518.481782746869
 sum: 12507.63528674597
 sum: 12526.320955364286
 stopping trace collection.
 Sun Oct 30 15:04:21 2011
 System: AIX 6.1 Node: el9-90-28 Machine: 00F603F74C00
 Generating java.trc
 Generating java.syms
 
 Generating java.prof
Example B-13 and Example B-14 on page 181 contain excerpts from the java.prof file that is created on AIX. The notable elements of the profile are:
Lock contention impact: The impact of spin locking is shown in Example B-13 as ticks in the libj9jit24.so helper routine jitMonitorEntry, in the AIX pthreads library libpthreads.a, and in the AIX kernel routine _check_lock. This Java program clearly has excessive lock contention with jitMonitorEntry consuming 26.66% of the ticks in the profile. jitMonitorEntry and other routines, such as jitMethodMonitorEntry, indicate spin locking at the Java language level, and impact in the pthreads library or _check_lock is locking at the system level, which might or might not be associated with Java locks. For example, libpthreads.a and _check_lock are active for lock contention that is related to malloc on AIX.
Example B-13 AIX profile excerpt showing kernel and shared library ticks
  Total Ticks For All Processes (KERNEL) = 690
 
Subroutine                 Ticks    %   Source                Address  Bytes
==========                 ===== ====== ======                =======  =====
._check_lock                 240   5.71 low.s                    3420     40
 
 
Shared Object                            Ticks    %    Address         Bytes
=============                            ===== ======  =======         =====
libj9jit24.so                             1157  27.51 900000003e81240 5c8878
libj9gc24.so                               510  12.13 900000004534200  91d66
/usr/lib/libpthreads.a[shr_xpg5_64.o]      175   4.16 900000000b83200  30aa0
 
 
  Profile: libj9jit24.so
 
 
  Total Ticks For All Processes (libj9jit24.so) = 1157
 
 
Subroutine                 Ticks    %   Source                Address  Bytes
==========                 ===== ====== ======                =======  =====
.jitMonitorEntry            1121  26.66 nathelp.s              549fc0    cc0
Garbage Collection impact: The impact of initializing new objects and of GC is shown in Example B-13 on page 180 as 12.13% of the ticks in the libj9gc24.so shared object. This high GC impact is caused by the excessive creation of Double objects in the sample program.
Java method execution: In Example B-14, the profile shows the time that is spent in the ProfileTest class, which is broken down by method. Some methods appear more than one time in the breakdown because they are compiled multiple times at increasing optimization levels by the JIT compiler. Most of the ticks appear in the final highly optimized version of the doWork()D method, into which the initialize()V and calculate()V methods are inlined by the JIT compiler.
Example B-14 AIX profile excerpt showing Java classes and methods
Total Ticks For All Processes (JAVA) = 1450
 
Class                                          Ticks    %
=====                                          ===== ======
ProfileTest                                     1401  33.32
java/util/Random                                  38   0.90
java/lang/Float                                    5   0.12
java/lang/Double                                   3   0.07
java/lang/Math                                     3   0.07
 
 
  Profile: ProfileTest
 
 
 
  Total Ticks For All Processes (ProfileTest) = 1401
 
 
Method                     Ticks    %   Source                Address  Bytes
======                     ===== ====== ======                =======  =====
doWork()D                   1385  32.94 ProfileTest.java     1107283bc    b54
doWork()D                      6   0.14 ProfileTest.java     110725148    464
doWork()D                      4   0.10 ProfileTest.java     110726e3c   156c
initialize()V                  3   0.07 ProfileTest.java     1107262dc    b4c
calculate()V                   2   0.05 ProfileTest.java     110724400    144
initialize()V                  1   0.02 ProfileTest.java     1107255c4    d04
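The GC impact that is identified in this profile comes from the rapid allocation of short-lived Double objects (and new Double arrays) on every call to doWork(). As a purely illustrative sketch that is not part of the original example, the same work can be expressed with primitive double arrays, which removes the per-element allocations and so would be expected to shrink the libj9gc24.so contribution considerably:

    /* Hypothetical variant of the ProfileTest fields and helper methods
       that avoids boxed Double objects. */
    static final int Num = 1000;
    static double[] A = new double[Num];
    static double[] B = new double[Num];
    static double[] C = new double[Num];

    public static void initialize() {
        for (int i = 0; i < Num; i++) {
            A[i] = Math.random();   /* primitive stores: no objects are created */
            B[i] = Math.random();
        }
    }

    public static void calculate() {
        for (int i = 0; i < Num; i++) {
            C[i] = A[i] * B[i];     /* no unboxing or reboxing */
        }
    }

With this change, doWork() no longer needs to allocate new arrays on every call, so the remaining profile would be dominated by the arithmetic and the lock contention rather than by GC.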
Example B-15 contains a shell program to collect a profile on Linux using Oprofile. The resulting profile might be similar to the previous example profile on AIX, indicating substantial time in spin locking and in GC. Depending on some specifics of the Linux system, however, the locking impact can appear in routines in the libj9thr24.so shared object, as compared to the AIX spin locking seen in libj9jit24.so. In some cases, an environment variable setting might be necessary to indicate the location of the JVMTI library that is needed for running Oprofile with Java:
Linux 32-bit: LD_LIBRARY_PATH=/usr/lib/oprofile
Linux 64-bit: LD_LIBRARY_PATH=/usr/lib64/oprofile
Example B-15 Linux shell to collect a profile using Oprofile
#!/bin/ksh
 
 # Use --no-vmlinux if we either have a compressed kernel or do not care about the kernel symbols.
 # Otherwise, use "opcontrol --vmlinux=/boot/vmlinux", for example.
 opcontrol --no-vmlinux
 
 # Stop data collection and remove daemon. Make sure we start from scratch.
 opcontrol --shutdown
 
 # Load the Oprofile module if required and make the Oprofile driver interface available.
 opcontrol --init
 
 # Clear out data from current session.
 opcontrol --reset
 
 # Select the performance counter that counts non-idle cycles and generates a sample after 500,000
 # such events.
 opcontrol -e PM_RUN_CYC_GRP1:500000
 
 # Start the daemon for data collection.
 opcontrol --start
 
 # Run the Java program. "-agentlib:jvmti_oprofile" allows Oprofile to resolve the jitted methods.
 java -Xms10m -Xmx10m -agentlib:jvmti_oprofile ProfileTest
 
 # Stop data collection.
 opcontrol --stop
 
 # Flush the collected profiling data.
 opcontrol --dump
 
 # Generate a summary report at the module level.
 opreport > ProfileTest_summary.log
 
 # Generate a long report at the function level.
opreport -l > ProfileTest_long.log
Locking analysis
Locking bottlenecks are fairly common in Java applications. Collect locking information to identify any bottlenecks, and then take appropriate steps to eliminate the problems. A common case is when older java/util classes, such as Hashtable, do not scale well and cause a locking bottleneck. An easy solution is to use java/util/concurrent classes instead, such as ConcurrentHashMap.
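As a minimal illustration of this substitution (the class and field names here are invented for the sketch and are not taken from the sample program), compare a cache that is backed by Hashtable, where every get and put acquires the same monitor, with one that is backed by ConcurrentHashMap:

    import java.util.Hashtable;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CacheExample {
        /* Every get/put on a Hashtable takes the same monitor, which can
           become a locking bottleneck when many threads use the cache. */
        private final Map<String, String> contendedCache =
                new Hashtable<String, String>();

        /* ConcurrentHashMap partitions its internal locking, so concurrent
           readers and writers rarely contend on a single monitor. */
        private final Map<String, String> scalableCache =
                new ConcurrentHashMap<String, String>();

        public String lookup(String key) {
            return scalableCache.get(key);    /* reads normally acquire no monitor */
        }

        public void store(String key, String value) {
            scalableCache.put(key, value);
        }
    }

Because both classes implement the same Map interface, the substitution usually requires only a change at the construction site, provided the code does not rely on the full synchronization semantics of Hashtable.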
Locking can occur at the Java code level or at the system level. Java Lock Monitor is an easy-to-use tool that identifies locking bottlenecks at the Java language level or in internal JVM locking. A profile that shows a significant fraction of time in kernel locking routines indicates system-level locking, which might be related to an underlying Java locking issue. Other AIX tools, such as splat, are helpful in diagnosing locking problems at the system level.
Always evaluate locking in the largest required scalability configuration (the largest number of cores).
Java Lock Monitor
The Java Lock Monitor (JLM) is a valuable tool for dealing with concurrency and synchronization in multi-threaded applications. The JLM can provide detailed information, such as how contended each monitor in the application is, how often a particular thread acquires a particular monitor, and how often a monitor is reacquired by a thread that already owns it. The locks that are surveyed by the JLM include both application locks and locks that are used internally by the JVM, such as GC locks. These statistics can be used to make decisions about GC policies, lock reservation, and so on, to make optimal use of processing resources. For more information about the Java Lock Monitor, see Java diagnostics, IBM style, Part 3: Diagnosing synchronization and locking problems with the Lock Analyzer for Java, available at:
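As a purely illustrative fragment (the names here are invented and the code is not taken from the book's examples), the following class shows the two kinds of monitor activity that such a lock monitor distinguishes: contended acquisition by different threads, and reacquisition of a monitor by the thread that already owns it:

    public class MonitorExample {
        private final Object monitor = new Object();
        private long counter;

        /* Every call acquires the monitor; with many threads calling
           increment(), this monitor shows up as heavily contended. */
        public void increment() {
            synchronized (monitor) {
                counter++;
                logValue();   /* reacquires a monitor the thread already owns */
            }
        }

        /* Because logValue() is called while the monitor is held, this
           synchronized block is a recursive acquisition, which a lock
           monitor can report separately from contended acquisitions. */
        public void logValue() {
            synchronized (monitor) {
                System.out.println("counter: " + counter);
            }
        }
    }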
Thread state analysis
Multi-threaded Java applications, especially applications that are running on top of WebSphere Application Server, often have many threads that might be blocked or waiting on locks, database operations, or file system operations. A powerful analysis technique is to look at the state of the threads to diagnose performance issues.
Always evaluate thread state analysis in the largest required scalability configuration (the largest number of cores).
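As a minimal, self-contained sketch of this technique (it uses only the standard java.lang.management API and is independent of the WAIT tool that is described next), the following program captures a snapshot of every live thread and prints its state and, where applicable, the lock it is blocked or waiting on:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadStateDump {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            /* Capture all live threads, including the monitors and
               ownable synchronizers each thread holds or waits on. */
            ThreadInfo[] infos = mx.dumpAllThreads(true, true);
            for (ThreadInfo info : infos) {
                StringBuilder line = new StringBuilder();
                line.append(info.getThreadName());
                line.append(" state=").append(info.getThreadState());
                if (info.getLockName() != null) {
                    line.append(" waiting on ").append(info.getLockName());
                }
                System.out.println(line);
            }
        }
    }

A similar snapshot, repeated over time, shows whether threads are mostly RUNNABLE (CPU-bound), BLOCKED on monitors (lock contention), or WAITING on external resources, which is the same classification that thread state analysis tools automate.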
WAIT
WAIT is a lightweight tool for assessing a wide range of performance issues, from GC behavior and lock contention to file system, database, client, and authentication server bottlenecks, and it also covers traditional performance issues, such as identifying hot methods.
WAIT was originally developed for Java and Java Platform, Enterprise Edition workloads, but a beta version that works with C/C++ native code is also available. The WAIT diagnostic capabilities are not limited to traditional Java bottlenecks such as GC problems or hot methods. WAIT employs an expert rule system to look at how Java code communicates with the wider world to provide a high-level view of system and application bottlenecks.
WAIT is also agentless, relying on javacores, ps, vmstat, and similar information, all of which are subject to availability; WAIT produces a report from whatever subset of this data can be extracted on a machine. Collecting javacore, ps, and vmstat data almost never requires changes to command lines, environment variables, and so on.
Output is viewed in a browser, such as Firefox, Chrome, Safari, or Internet Explorer, so no additional installation is needed to view a WAIT report. Reports are interactive, and clicking different elements reveals more information. Manuals, animated demonstrations, and sample reports are also available on the WAIT website.
For more information about WAIT, go to:
This site also has sample input files for WAIT, so users can try out the data analysis and visualization aspects without collecting any data.
 

3 Simple performance lock analysis tool (splat), available at:
4 splat Command, available at: