Performance monitoring tools and what they are telling us
This appendix discusses performance monitoring tools and explains what the tools are telling us while trying to tune your IBM POWER Systems environment.
The following topics are discussed:
NMON
lpar2rrd
Trace tools and PerfPMR
The hpmstat and hpmcount utilities
NMON
Any AIX administrator who has worked on any performance issue should already be very familiar with and fond of the nmon tool. For those who have not, here is a short history lesson:
nmon (Nigel’s performance MONitor) was created in 1997 by Nigel Griffiths, an employee of IBM UK Ltd. He saw a need for a simple, small and safe tool to monitor AIX in real-time, and additionally capture numerous elements of AIX usage including processor, processes, memory, network and storage.
Originally written for himself, news of its existence quickly spread.
The first releases supported AIX 4 (which was the current major release at the time) and Nigel continued to improve and enhance nmon to support subsequent AIX 5 and 6 releases.
nmon has been included and installed by default since 2008, starting with AIX 5.3 TL09, AIX 6.1 TL02, and VIOS 2.1. The version included with AIX is maintained and supported by the AIX Performance Tools team.
nmon has since been ported to Linux and is included in various distributions for PowerPC (PPC), x86 and even Raspberry Pi.
 
Note: Because nmon is included with all current releases of AIX, there is no need to install a legacy nmon release (known as classic nmon). Classic nmon was written for AIX versions that are no longer supported, and it does not support newer hardware or software generations. Neither IBM nor Nigel supports or recommends the use of classic nmon on any current AIX release.
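Because the AIX-supplied nmon is already installed, you can start using it right away. The following is a minimal sketch of an interactive session and of a typical recording run; the interval and snapshot count are arbitrary values chosen for illustration, not recommendations:
#nmon
#nmon -f -s 30 -c 120
The -f option records the snapshots to a .nmon file in the current directory for later post processing (for example, with the nmon analyser spreadsheet), -s sets the interval in seconds between snapshots, and -c sets the number of snapshots to take.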
lpar2rrd
lpar2rrd is a Perl-based tool that connects to the HMC (or IVM) via SSH and collects performance data. It is agentless, and it collects processor utilization for each LPAR as well as global processor utilization of the frames managed by the HMCs. Data is stored in an RRD database. Graphs are automatically generated and presented by a web server. Information is available for each LPAR, or grouped by shared processor pool and by entire frame.
The tool is free (GPL) and there is an option to contract support to get enhanced features. lpar2rrd can be downloaded at:
www.lpar2rrd.com
It can be installed on any UNIX system. The version used in this book is 3.20.
Trace tools and PerfPMR
The AIX system trace is designed to trace activities inside both the kernel and kernel extensions. When trace is activated, the selected trace events are recorded in an internal buffer, and the data in the buffer is written to the trace log file later. The raw trace data is in binary format, and AIX provides further utilities to process it and generate readable reports from different perspectives. Such utilities include trcrpt, tprof, curt, splat, pprof, filemon, and netpmon.
The tracing utilities are very useful for identifying difficult performance problems. In the following section we discuss some details on how to use the tools and read the reports.
 
Note: Although the trace facilities are designed to have little impact on system performance, they are not intended for auditing purposes. Do not run trace as a batch job on production systems unless IBM support personnel require you to do so.
AIX system trace basics
The system trace is activated using the trace command. Here we introduce the trace command and some of the trace options, and offer examples on how to use it.
A quick start example
Example A-1 shows how to get a filemon report from trace. This is useful when you need to postprocess the trace log file from a remote machine, or when the system load is high and trace events are being missed by filemon.
Example A-1 Quick start example of using the trace utilities
#trace -andfp -J filemon -C all -r PURR -T 30000000 -L 30000000 -o trace.raw
#trcon
#sleep 5
#trcstop
#gensyms -F > gensyms.out
#/usr/bin/trcrpt -C all -r trace.raw > trace.bin
 
#filemon -i trace.bin -n gensyms.out -O all,detailed -o filemon.out
System trace mode
The system trace can be used in asynchronous mode or in subcommand mode. Choose asynchronous mode by specifying the -a option with the trace command. The trace daemon runs in the background in this mode, and you can use trcon, trcoff, and trcstop to start tracing, stop tracing, and exit a tracing session, respectively (Example A-2).
Example A-2 Start, stop system trace in asynchronous mode
To start trace daemon and trace data collection:
#trace -a
To stop:
#trcstop
 
To start trace daemon and delay the data collection
#trace -a -d
 
To start data collection:
#trcon
 
To stop data collection:
#trcoff
 
To exit tracing session:
#trcstop
 
If -a is not specified, the system trace runs in subcommand mode, in which you interact with the tracing facilities through a dialog. This mode is seldom used and is not discussed here.
Default trace setting
Get the default trace settings with the trcctl command, as shown in Example A-3. You can see that the default trace log file is /var/adm/ras/trcfile, with a size of 2621440 bytes.
Example A-3 Get the default trace settings using trcctl
# trcctl -l
Default Buffer Size: 262144
Default Log File Size: 2621440
Default Log File: /var/adm/ras/trcfile
Non-Root User Buffer Size Maximum: 1048576
Default Components Directory: /var/adm/ras/trc_ct
Default LMT Log Dir: /var/adm/ras/mtrcdir
We suggest that you keep the default trcctl values and use trace command options, discussed later, to override them when needed. If you change the defaults by mistake, you can use trcctl -r to restore them, as shown in Example A-4.
Example A-4 Restore the default trace settings using trcctl
# trcctl -r
System trace buffer size and buffer mode options
The system trace utilities have three buffer modes. Table A-1 shows the available buffer modes and the corresponding options in parentheses.
Table A-1 Available trace buffer modes
Buffer mode
Description
Alternative (default)
Two trace buffers are allocated, and all trace events written into the buffers are captured in the trace log file. The data in one buffer is written to the trace log file while trace events are recorded into the alternate buffer.
Trace buffer wraparound occurs if the buffers switch and the trace events are not written into the trace log file in a timely manner.
Trace log wraparound also occurs when the trace log file fills up and the trace daemon continues to collect trace data. However, this can be avoided by specifying the -s option with trace, to stop tracing when the trace log fills.
Circular (-l)
The trace events wrap within the trace buffers and are not captured in the trace log file until the trace data collection is stopped.
Single (-f)
The collection of trace events stops when the trace buffer fills up and the contents of the buffer are captured in the trace log file.
This option is frequently used to analyze performance issues. Most of the examples used in Appendix A adopt this option, such as Example A-1, Example A-10, Example A-11, and so on.
Use -T to specify a trace buffer size other than the default. For alternative and circular mode, the buffer size ranges from 8192 bytes to 268,435,184 bytes. For single buffer mode, the buffer size ranges from 16,392 bytes to 536,870,368 bytes.
If you specify a trace buffer size using -T, you also need to specify the trace log file size using -L. The following criteria must be met:
In the circular and alternate modes, the trace buffer size must be one-half or less the size of the trace log file.
In the single mode, the trace log file must be at least the size of the buffer.
By default, all logical processors share the same trace buffer. However, the system trace subsystem can also use a separate trace buffer for each logical processor, which you request by specifying the -C all option. This is useful to avoid the situation where one logical processor is much more active than the others and its events overflow theirs.
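To illustrate these sizing rules, here is a minimal sketch (the sizes are arbitrary examples, not recommendations): the first command starts a circular-mode trace whose log file is twice the buffer size, and the second starts a single-buffer-mode trace whose log file equals the buffer size. Each tracing session is ended with trcstop, as usual.
#trace -a -l -T 10000000 -L 20000000 -o trace.raw
#trace -a -f -T 10000000 -L 10000000 -o trace.raw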
 
Note: The trace buffer is pinned in memory, and so it consumes more physical memory if you specify a larger value.
Trace hooks and event groups
Because there is a large number of trace events, AIX uses trace event identifiers and trace event groups to categorize and manage them. The trace event identifier is also called the trace hook identifier.
Example A-5 shows how to get all the trace hook identifiers using trcrpt. The four digits at the beginning of each line are the trace event identifier, for example, 0010 is the trace event identifier for the TRACE ON event.
Example A-5 Using trcrpt -j to get all the trace hook identifiers
#trcrpt -j|pg
0010 TRACE ON
0020 TRACE OFF
0030 TRACE HEADER
0040 TRACEID IS ZERO
0050 LOGFILE WRAPAROUND
0060 TRACEBUFFER WRAPAROUND
0070 UNDEFINED TRACE ID
0080 DEFAULT TEMPLATE
0090 trace internal events
00a0 TRACE_UTIL
1000 FLIH
1001 OFED Trace
1002 STNFS GENERAL
1003 STNFS VNOPS
1004 STNFS VNOPS HOLD/RELE
1005 STNFS ID
1006 STNFS IO
1007 STNFS OTW
1008 STNFS TREE
1009 STNFS VFSOPS
100b TRACE_UNMANAGED
1010 SYSTEM CALL
101d DISPATCH SRAD
101e DISPATCH Affinity
1020 SLIH
1022 MSTOR_ERR
1025 SFWCOM
1027 THRD_REAFF
1028 THRD_MREAFF
1029 SFW
...
Trace event groups are defined sets of trace events. Example A-6 shows how to list the trace event groups using trcrpt. The name at the beginning of each stanza is the name of the trace group. For example, tidhk is the name of the group “Hooks needed to display thread name”.
Example A-6 Using trcrpt to get all trace event groups
#trcrpt -G|pg
tidhk - Hooks needed to display thread name (reserved)
106,10C,134,139,465
...
tprof - Hooks for TPROF performance tool (reserved)
134,139,210,234,38F,465,5A2,5A5,5D8
pprof - Hooks for PPROF performance tool (reserved)
106,10C,134,135,139,419,465,467,4B0,5D8
filemon - Hooks for FILEMON performance tool (reserved)
101,102,104,106,107,10B,10C,10D,12E,130,137,139,154,15B,163,19C,1BA,1BE,1BC,1C9,221,222,228,232,2A1,2A2,3D3,419,45B,4B0,5D8,AB2
netpmon - Hooks for NETPMON performance tool (reserved)
100,101,102,103,104,106,107,10C,134,135,139,163,19C,200,210,211,212,213,215,216,252,255,256,262,26A,26B,2A4,2A5,2A7,2A8,2C3,2C4,2DA,2DB,2E6,2E7,2EA,2EB,30A,30B,320,321,32D,32E,330,331,334,335,351,352,38F,419,465,467,46A,470,471,473,474,488,48A,48D,4A1,4A2,4B0,4C5,4C6,598,599,5D8
curt - Hooks for CURT performance tool (reserved)
100,101,101D,102,103,104,106,10C,119,134,135,139,200,210,215,38F,419,465,47F,488,489,48A,48D,492,4B0,5D8,600,603,605,606,607,608,609
splat - Hooks for SPLAT performance tool (reserved)
106,10C,10E,112,113,134,139,200,419,465,46D,46E,492,5D8,606,607,608,609
perftools - Hooks for all six performance tools (reserved)
100,101,101D,102,103,104,106,107,10B,10C,10D,10E,112,113,119,12E,130,134,135,137,139,154,15B,163,19C,1BA,1BC,1BE,1BC,1C9,200,210,211,212,213,215,216,221,222,228,232,234,252,255,256,262,26A,26B,2A1,2A2,2A4,2A5,2A7,2A8,2C3,2C4,2DA,2DB,2E6,2E7,2EA,2EB,30A,30B,320,321,32D,32E,330,331,334,335,351,352,38F,3D3,419,45B,465,467,46A,46D,46E,470,471,473,474,47F,488,489,48A,48D,492,4A1,4A2,4B0,4C5,4C6,598,599,5A2,5A5,5D8,600,603,605,606,607,608,609,AB2
tcpip - TCP/IP PROTOCOLS AND NETWORK MEMORY (reserved)
252,535,536,537,538,539,254,25A,340
...
Trace event groups are useful because you usually need to trace a collection of trace events to identify a specific functional or performance problem. As an example, the most frequently used trace-based tools tprof, pprof, filemon, netpmon, curt, and splat all have their corresponding specific trace event groups that contain all necessary trace event identifiers for generating reports.
You can also use trcevgrp -l to display details about specific event groups, as shown in Example A-7.
Example A-7 Using trcevgrp to display specific event groups
#trcevgrp -l filemon
filemon - Hooks for FILEMON performance tool (reserved)
101,102,104,106,107,10B,10C,10D,12E,130,137,139,154,15B,163,19C,1BA,1BE,1BC,1C9,221,222,228,232,2A1,2A2,3D3,419,45B,4B0,5D8,AB2
Choosing trace hooks and event groups
You can choose the trace hooks and event groups using specific trace options, shown in Table A-2.
Table A-2 Trace options and descriptions
Trace options
Description
-j <event identifier>
Specifies the user-defined events for which you want to collect trace data.1
You can specify a two-digit hook ID in the form hh (which selects hh00, hh10, ..., hhF0), a three-digit hook ID in the form hhh (which selects hhh0), or a four-digit hook ID in the form hhhh.
-J <event group>
Specifies the event groups to be included.2
-k <event identifier>
Specifies the user-defined events for which you want to exclude trace data.1
-K <event group>
Specifies the event groups to be excluded.2

1 Multiple events can be separated by commas.
2 Multiple groups can be separated by commas.
 
Note: Usually you need to include the tidhk event group to get detailed process and thread information, and to exclude the vmm events if they are irrelevant to the problem you are analyzing, because there are usually many vmm events.
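For example, the following sketch includes the curt and tidhk groups and excludes the vmm group. This assumes that a vmm event group exists at your AIX level; verify it first with trcevgrp -l before relying on it:
#trcevgrp -l vmm
#trace -a -f -J curt,tidhk -K vmm -T 10000000 -L 10000000 -o trace.raw
#sleep 5; trcstop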
Merging trace log files and reordering
In alternative mode, the trace daemon normally wraps around the trace log when it fills up and continues to collect trace data, unless the -s option is specified. In circular mode, trace buffer wraparound might occur. Because of these wraparound events, the trace log file might not be in chronological order, and the post-processing tools such as curt, tprof, and filemon need the trace log file in chronological order.
You can force the reordering with the trcrpt -r option, as shown in Example A-8.
Example A-8 Reordering the trace log file
trace.raw is the original trace log file; we reorder it chronologically into trace.r:
#trcrpt -o trace.r -r trace.raw
On systems with multiple logical processors where trace was run with the -C all option, you need to specify -C all together with -r to merge, reorder, and sort the per-processor trace log files into one raw data file in chronological order. We suggest that you perform this step before every post-processing run. Refer to Example A-9 on page 322.
Example A-9 Merging, reordering, and sorting the trace log file in chronological order
#trcrpt -o trace.r -r -C all trace.raw
Generate the trace report
We demonstrated using trcrpt to generate a report in ASCII format in 4.4.3, “File system best practice” on page 163. Specifying the -O option generates more detailed reports; the most frequently used -O options are shown in Table A-3, and multiple options can be separated by commas. trcrpt can also be run through SMIT (for example, smitty trcrpt).
Table A-3 trcrpt -O options
Frequently used trcrpt -O options
Explanation
pid=on
Displays the process IDs in the trace report.
tid=on
Displays the thread ID in the trace report.
exec=on
Displays the exec path names in the trace report.
svc=on
Displays the value of the system call in the trace report.
timestamp=[0|1|2|3|4]
The most frequently used values are 0 and 1.
0 - Time elapsed since the trace was started and delta time from the previous event. The elapsed time is in seconds and the delta time is in milliseconds. This is the default.
1 - Short elapsed time. Reports only the elapsed time (in seconds) from the start of the trace.
cpuid=on
Displays the physical processor number in the trace report.
Example A-10 shows how to start the trace data collection and generate a trace report.
Example A-10 Generate a trace report
//comments: start trace immediately.
#trace -a -f -T 10000000 -L 10000000 -o trace.raw
 
//comments: monitor for 5 seconds; then stop tracing.
#sleep 5; trcstop
 
//comments: generate trace report “trace.int” from the trace log file “trace.raw”.
#trcrpt -Opid=on,tid=on,exec=on,svc=on,cpuid=on,timestamp=1 -o trace.int ./trace.raw
 
//comments: now you can read the trace report.
#more trace.int
...
 
Note: This example neither selects specific trace hooks nor excludes any, so all trace events are logged. The trace buffer might fill up before any useful trace events are written.
You can now get a more detailed report using the -O options for similar DIO demotion cases; refer to Example A-11. From this report, you can see that “dd” is causing the DIO demotion with an unaligned I/O length (0x1001).
Example A-11 Using trcrpt with -O options
//comments: include hook id “59b” for DIO activities, and “tidhk” for process name reporting.
#trace -J tidhk -j 59b -a -f -T 10000000 -L 10000000 -o trace.raw
 
//comments: monitor for 5 seconds, and then stop tracing.
#sleep 5; trcstop
 
//comments: filter the trace report to display only hook id “59b” events.
#trcrpt -d 59b -Opid=on,tid=on,exec=on,svc=on,cpuid=on,timestamp=1 -o trace.int ./trace.raw
#more trace.int
...
ID PROCESS NAME CPU PID TID I SYSTEM CALL ELAPSED APPL SYSCALL KERNEL INTERRUPT
...
59B dd 0 2162750 30933019 0.000113 JFS2 IO write: vp = F1000A0232DF8420, sid = DE115E, offset = 00000000052FB2F6, length = 1001
59B dd 0 2162750 30933019 0.000113 JFS2 IO dio move: vp = F1000A0232DF8420, sid = DE115E, offset = 00000000052FB2F6, length = 1001
59B dd 0 2162750 30933019 0.000126 JFS2 IO devstrat (pager strategy): bplist = F1000A00E067C9A0, vp = F1000A0232DF8420, sid = DE115E, lv blk = A60650, bcount = 0400
59B dd 4 2162750 30933019 0.000492 JFS2 IO gather: bp = F1000A00E067D2E0, vp = F1000A0232DF8420, sid = DE115E, file blk = 297D8, bcount = 2000
59B dd 4 2162750 30933019 0.000495 JFS2 IO devstrat (pager strategy): bplist = F1000A00E067D2E0, vp = F1000A0232DF8420, sid = DE115E, lv blk = A60650, bcount = 1400
59B dd 4 2162750 30933019 0.000514 JFS2 IO dio demoted: vp = F1000A0232DF8420, mode = 0001, bad = 0002, rc = 0000, rc2 = 0000
If you are using trace data collected on another machine, you need the trcnm output and the /etc/trcfmt file from that machine to generate the ASCII report; otherwise, the output will be incorrect. Refer to Example A-12.
Example A-12 Generate a trace report for a remote machine
On remote machine,
#trcnm > trcnm.out
#cp /etc/trcfmt trace.fmt
 
On reporting machine,
Download the “trcnm.out” and “trace.fmt” together with the trace raw data.
#trcrpt -n trcnm.out -t trace.fmt <other options>
Generate curt, tprof, filemon, splat, and netpmon reports from trace logs
You can generate curt, tprof, pprof, splat, filemon, and netpmon reports from the trace log if the related trace events are included in the trace log. Example A-13 shows an example of using the trace log to generate a curt report.
Example A-13 Generate a curt report using trace log
//comments: start trace using single buffer mode to collect trace event group “curt”, and delay starting of the trace data collection.
#/usr/bin/trace -andfp -C all -J curt -r PURR -T 20000000 -L 20000000 -o trace_bin
 
//comments: start trace data collection and monitor for 10 seconds, then stop tracing.
#trcon; sleep 10; trcstop
 
//comments: gather the symbol information necessary to run the “curt” command.
#gensyms > gensyms.out
 
//comments: reorder the trace logs to a single trace raw file “trace_curt”.
#trcrpt -C all -r trace_bin > trace_curt
 
//comments: generate curt report
#/usr/bin/curt -i trace_curt -n gensyms.out -ptes > curt.out
Example A-14 shows an example of using the trace log to generate a tprof report. Note that to run the tprof command in manual offline mode, the following files must be available:
The symbolic information file rootstring.syms
The trace log file rootstring.trc [-cpuid]
Example A-14 Generate a tprof report using trace log
//comments: start trace using single buffer mode to collect trace event group “tprof”, and delay starting of the trace data collection.
#/usr/bin/trace -andfp -C all -J tprof -r PURR -T 20000000 -L 20000000 -o myprof.trc
 
//comments: start trace data collection and monitor for 10 seconds, then stop tracing.
#trcon; sleep 10; trcstop
 
//comments: gather the symbol information necessary to run the “tprof” command.
#gensyms > myprof.syms
 
//comments: generate tprof report; rootstring equals to “myprof” here.
#tprof -uskejzlt -r myprof
# more myprof.prof
For a detailed explanation of the tprof and curt reports, refer to POWER7 and POWER7+ Optimization and Tuning Guide, SG24-8079.
Example A-15 on page 325 shows an example of using the trace log to generate a filemon report. It is good practice to avoid “trace events lost” by using -f to select the single buffer mode, as in this example.
Example A-15 Generate a filemon report using the trace log
//comments: start trace using single buffer mode to collect trace event group “filemon”, and delay starting of the trace data collection.
#trace -andfp -J filemon -C all -r PURR -T 20000000 -L 20000000 -o trace.raw
 
//comments: start trace data collection and monitor for 10 seconds, then stop tracing.
#trcon; sleep 10; trcstop
 
//comments: gather the symbol information necessary to run the “filemon” command.
#gensyms -F > gensyms.out
 
//comments: reorder the trace logs to a single trace raw file “trace.fmon”.
#/usr/bin/trcrpt -C all -r trace.raw > trace.fmon
 
//comments: generate filemon report “filemon.out”
#filemon -u -i trace.fmon -n gensyms.out -O all,detailed -o filemon.out
For a detailed explanation of the filemon report, refer to 4.4.4, “The filemon utility” on page 176.
Using the truss command
The truss command is useful for identifying application problems, used together with the system tracing utilities such as tprof and curt.
After identifying the problematic process with tprof, curt, filemon, and so on, use the truss command to trace that process. Some frequently used options are shown in Table A-4.
Table A-4 truss options
truss options
Explanation
-p <pid>
Specifies the processes for tracing the execution.
-c
Counts traced system calls, faults, and signals instead of displaying trace results line by line.
Produces a summary report after the traced command terminates or when truss is interrupted.
Use this option alone or with -f.
-d
Includes a timestamp with each line of output.
-f
Follows all child processes.
-l
Displays the thread ID in the output.
-t[!]syscall
Includes or excludes system calls from the trace process.
-u [!] [LibraryName [...]::[!]FunctionName [...] ]
Traces dynamically loaded user level function calls from user libraries.
Example A-16 gives an example of using truss to debug application issues.
Example A-16 Using truss to identify application issues
To get a summary report of the process, truss -c -p <pid> for an interval and then Ctrl+C:
#truss -c -p 6619376
^Csyscall seconds calls errors
kwrite .00 6
munmap 12.06 81
msync 24.36 81
mmap .00 79
---- --- ---
sys totals: .00 247 0
usr time: .00
elapsed: .00
 
truss the execution of the process for an interval and Ctrl+C to stop truss:
#truss -d -o truss.log -l -p 6619376
^C
With -l, the thread id is shown at each line:
#more truss.log
Mon Oct 15 16:54:36 2012
29294829: 0.0000: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30400000
30933087: kwrite(1, " m s y n c i n t e r v".., 45) = 45
30933087: 0.0007: munmap(0x30000000, 4194304) = 0
30933087: 0.0014: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 8, 20971520) = 0x30000000
29294829: 0.0062: msync(0x30400000, 4194304, 32) = 0
30933087: 0.2532: msync(0x30000000, 4194304, 32) = 0
29294829: 0.6684: munmap(0x30400000, 4194304) = 0
30933087: 0.6693: munmap(0x30000000, 4194304) = 0
29294829: 0.6699: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30000000
30671061: 0.6702: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 9, 25165824) = 0x30400000
29294829: 0.6751: msync(0x30000000, 4194304, 32) = 0
30671061: 0.7616: msync(0x30400000, 4194304, 32) = 0
 
Use “-t” flag to trace system calls, as follows:
#truss -tmmap,msync,munmap -p 6619376
munmap(0x30400000, 4194304) = 0
munmap(0x31400000, 4194304) = 0
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30400000
munmap(0x30800000, 4194304) = 0
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 8, 20971520) = 0x30800000
munmap(0x30C00000, 4194304) = 0
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 6, 12582912) = 0x30C00000
msync(0x30800000, 4194304, 32) = 0
 
Use “-u” to trace dynamically loaded user level function calls from user libraries, such as libc.a.
#truss -u libc.a::* -p 6619376
kwrite(1, " m s y n c i n t e r v".., 46) = 46
->libc.a:gettimeofday(0x200dba98, 0x0)
->libc.a:gettimeofday(0x200f9a90, 0x0)
<-libc.a:gettimeofday() = 0 0.000000
munmap(0x31400000, 4194304) = 0
<-libc.a:gettimeofday() = 0 0.000000
->libc.a:gettimeofday(0x200f9a98, 0x0)
<-libc.a:gettimeofday() = 0 0.000000
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30000000
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 9, 25165824) = 0x30400000
->libc.a:gettimeofday(0x200bda98, 0x0)
->libc.a:gettimeofday(0x200f9a90, 0x0)
 
 
Tip: You can implement most of the truss features using probevue, too. Refer to the probevue user guide at:
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/probevue_userguide.htm
Real case studies using tracing facilities
Here we introduce a real-life scenario to explain how to use the trace tools to identify a performance problem, based on the approaches discussed earlier.
 
Note: To avoid disclosure of sensitive data, we used a sample program here to duplicate the client problem.
Problem description
The client complains that I/O is quite slow when a multithreaded application writes to an mmaped data file. According to the application log, the application occasionally hangs for many seconds waiting for I/O to complete.
Also, the client believes that the I/O pattern should be sequential, but the bandwidth is limited to only 70 MBps on V7000 storage.
The sample code used to duplicate the client problem is shown in Example A-17. It opens a file and truncates it to a size that provides 4 MB of space for each thread. Each thread then repeatedly mmaps its own 4 MB region, modifies it, msyncs it, and munmaps it. We duplicated the client problem by running ./testmmap 8.
Example A-17 Sample code for real case studies
/*
 * The following [enclosed] code is sample code created by IBM
 * Corporation. This sample code is not part of any standard IBM product
 * and is provided to you solely for the purpose of demonstration.
 * The code is provided 'AS IS',
 * without warranty of any kind. IBM shall not be liable for any damages
 * arising out of your use of the sample code, even if they have been
 * advised of the possibility of such damages.
 */
/*
To compile: xlC_r testmmap.cpp -o testmmap
Problem report: [email protected]
*/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/select.h>

#define MAXTHREADNUM 128
#define INTERVAL 10000
#define BUF_SIZE 4*1024*1024

int testmmap(int arg)
{
    int fd = 0, n = 0, i = 0;
    timeval start, end;
    void *mp = NULL;
    struct stat sbuf;
    char idx = 0;
    long interval;
    int mplen = BUF_SIZE;

    if((fd = open("m.txt", O_RDWR, 0644)) >= 0)
    {
        fstat(fd, &sbuf);
        printf("file size = %d\n", sbuf.st_size);
    }
    else
    {
        printf("open failed\n");
        return 0;
    }
    while(1)
    {
        idx++;
        /* each thread maps its own 4 MB region of the file, dirties it,
           flushes it with msync, and unmaps it */
        if((mp = mmap(NULL, mplen, PROT_READ|PROT_WRITE, MAP_SHARED, fd, arg*BUF_SIZE)) != (void*)-1 )
        {
            memset(mp, idx, mplen);
            msync(mp, mplen, MS_SYNC);
            munmap(mp, mplen);
        }
    } //while
    close(fd);
    return 0;
}

extern "C" void* testfunc(void* arg)
{
    unsigned int i=0;
    int ret = 0;
    testmmap((int)arg);
    return NULL;
}

int main(int argc, char* argv[])
{
    pthread_t tid[MAXTHREADNUM] = {0};
    int ret;
    struct stat sbuf;

    if(argc < 2)
    {
        printf("usage: %s <num>\n", argv[0]);
        return 0;
    }

    int count = atoi(argv[1]);
    if(count > MAXTHREADNUM)
        count = MAXTHREADNUM;

    int fd, bufsz;
    bufsz = count * BUF_SIZE;
    if((fd = open("m.txt", O_RDWR|O_CREAT, 0644)) > 0)
    {
        fstat(fd, &sbuf);
        printf("file size = %d\n", sbuf.st_size);
        ftruncate(fd, bufsz); //file is truncated, this might cause fragmentation if the file is not allocated.
        fstat(fd, &sbuf);
        printf("file size after truncate = %d\n", sbuf.st_size);
    }
    close(fd);

    for(int i = 0; i < count; i++)
    {
        ret = pthread_create(&tid[i], NULL, testfunc, (void*)i);
        if(ret != 0)
        {
            printf("pthread_create error, err=%d\n", ret);
            return -1;
        }
    }
    void* status;
    for(int i = 0; i < count; i++)
    {
        pthread_join(tid[i], &status);
        printf("thread %d exit status = %d\n", tid[i], (int)status);
    }
    return 0;
}
Problem analysis
We generated a curt report, shown in Example A-18, using the approach illustrated in Example A-13 on page 324. We found that the msync system call was very slow, with an average elapsed time of 261.1781 ms and a maximum elapsed time of 610.8883 ms, while the average processor time consumed was only 1.3897 ms. This behavior seemed quite abnormal. Looking at the process detail section, we could see that the msync subroutine was being called by the testmmap process, whose PID was 17629396.
Example A-18 Curt report
#more curt.out
...
System Calls Summary
--------------------
Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime SVC (Address)
(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ======== ========= ========= ========= ================
48 66.7060 0.32% 1.3897 0.0279 7.5759 12536.5502 261.1781 56.6501 610.8883 msync(32a63e8)
47 21.8647 0.10% 0.4652 0.4065 0.5593 7108.8461 151.2520 3.1755 558.8734 munmap(32a63d0)
55 0.8696 0.00% 0.0158 0.0071 0.0274 509.6747 9.2668 2.4847 20.3143 mmap(32a6400)
...
 
Process Details for Pid: 17629396
Process Name: testmmap
8 Tids for this Pid: 38731907 37945371 32964661 29622293 29425739 26476695
9961589 9502927
 
Total Application Time (ms): 70.766818
Total System Call Time (ms): 89.458826
Total Hypervisor Call Time (ms): 8.406510
 
Process System Call Summary
---------------------------
Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime SVC (Address)
(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ======== ========= ========= ========= ================
48 66.7060 0.32% 1.3897 0.0279 7.5759 12536.5502 261.1781 56.6501 610.8883 msync(32a63e8)
47 21.8647 0.10% 0.4652 0.4065 0.5593 7108.8461 151.2520 3.1755 558.8734 munmap(32a63d0)
55 0.8696 0.00% 0.0158 0.0071 0.0274 509.6747 9.2668 2.4847 20.3143 mmap(32a6400)
 
Pending System Calls Summary
----------------------------
Accumulated Accumulated SVC (Address) Tid
Time (msec) ETime (msec)
============ ============ ========================= ================
0.0054 7.7179 msync(32a63e8) 38731907
0.0050 1.5159 msync(32a63e8) 9502927
0.0042 6.4927 msync(32a63e8) 37945371
0.0022 5.3160 msync(32a63e8) 9961589
0.0019 13.7364 mmap(32a6400) 32964661
Then we could trace the msync system call by using the trace or truss command. Because we already had the process ID, it was simpler to use truss directly, as in Example A-19. We saw that the msync I/O size was always 4194304 bytes (4 MB). According to the application logic, the 4 MB consists purely of dirty pages, so it should be one large block of sequential I/O and should be very fast.
Example A-19 trace msync subroutine
#truss -d -fael -tmsync,munmap -p 17629396
17629396: psargs: ./testmmap 8
Tue Oct 16 12:58:09 2012
17629396: 32964661: 0.0000: munmap(0x30400000, 4194304) = 0
17629396: 9961589: 0.0008: munmap(0x30C00000, 4194304) = 0
17629396: 9502927: 0.0014: munmap(0x30800000, 4194304) = 0
17629396: 26476695: 0.0022: munmap(0x31800000, 4194304) = 0
17629396: 9961589: 0.0074: msync(0x30000000, 4194304, 32) = 0
17629396: 29622293: 0.0409: munmap(0x31400000, 4194304) = 0
17629396: 9502927: 0.0419: msync(0x30400000, 4194304, 32) = 0
17629396: 26476695: 0.6403: msync(0x30C00000, 4194304, 32) = 0
17629396: 37945371: 0.6723: munmap(0x31000000, 4194304) = 0
17629396: 29622293: 0.6779: msync(0x31800000, 4194304, 32) = 0
17629396: 38731907: 0.7882: msync(0x30800000, 4194304, 32) = 0
17629396: 29622293: 0.9042: munmap(0x31800000, 4194304) = 0
17629396: 9502927: 0.9049: munmap(0x30400000, 4194304) = 0
17629396: 32964661: 0.9104: msync(0x31C00000, 4194304, 32) = 0
17629396: 9502927: 0.9105: msync(0x30400000, 4194304, 32) = 0
17629396: 9961589: 0.9801: munmap(0x30000000, 4194304) = 0
17629396: 38731907: 0.9809: munmap(0x30800000, 4194304) = 0
17629396: 32964661: 0.9815: munmap(0x31C00000, 4194304) = 0
17629396: 29425739: 0.9829: msync(0x31400000, 4194304, 32) = 0
17629396: 32964661: 1.0529: msync(0x30800000, 4194304, 32) = 0
However, from filemon, the average I/O size underneath msync was 8.3 512-byte blocks (~4 KB), as in Example A-20. Also, the seek ratio was 100%, which means the I/O is purely random, while the seek distance is quite short, only about 110.7 blocks (~55 KB).
Example A-20 filemon report
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
 
VOLUME: /dev/fslv05 description: /data
writes: 23756 (0 errs)
write sizes (blks): avg 8.3 min 8 max 1648 sdev 21.3
write times (msec): avg 128.851 min 3.947 max 493.747 sdev 117.692
write sequences: 23756
write seq. lengths: avg 8.3 min 8 max 1648 sdev 21.3
seeks: 23756 (100.0%)
seek dist (blks): init 5504,
avg 110.7 min 8 max 65904 sdev 1831.6
seek dist (%tot blks):init 0.01312,
avg 0.00026 min 0.00002 max 0.15713 sdev 0.00437
time to next req(msec): avg 0.107 min 0.000 max 1423.920 sdev 10.018
throughput: 37427.0 KB/sec
utilization: 0.44
You can also get some clues about this behavior by using the iostat command.
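As a rough sketch of such an iostat check (assuming hdisk1 is the disk that holds the /data file system), the -D flag produces the extended per-disk statistics, including average transfer sizes, that make small random writes visible:
#iostat -D hdisk1 5 3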
Such issues are usually caused by file fragmentation. We used the fileplace command to confirm this, as in Example A-21. There were 7981 extents in the 32 MB file.
Example A-21 fileplace -pv output
#fileplace -pv m.txt|pg
 
File: m.txt Size: 33554432 bytes Vol: /dev/fslv05
Blk Size: 4096 Frag Size: 4096 Nfrags: 8192
Inode: 6 Mode: -rw-r--r-- Owner: root Group: system
 
Physical Addresses (mirror copy 1) Logical Extent
---------------------------------- ----------------
26732319 hdisk1 1 frags 4096 Bytes, 0.0% 00000671
26732324-26732330 hdisk1 7 frags 28672 Bytes, 0.1% 00000676-00000682
26732332 hdisk1 1 frags 4096 Bytes, 0.0% 00000684
26732337 hdisk1 1 frags 4096 Bytes, 0.0% 00000689
26732348 hdisk1 1 frags 4096 Bytes, 0.0% 00000700
26732353 hdisk1 1 frags 4096 Bytes, 0.0% 00000705
...
26738415 hdisk1 1 frags 4096 Bytes, 0.0% 00006767
26738422 hdisk1 1 frags 4096 Bytes, 0.0% 00006774
26738428 hdisk1 1 frags 4096 Bytes, 0.0% 00006780
 
8192 frags over space of 8252 frags: space efficiency = 99.3%
7981 extents out of 8192 possible: sequentiality = 2.6%
Problem solution
Further analysis shows that the fragmentation is caused by concurrent writes to an unallocated mmaped file (m.txt). If you create and fully allocate m.txt in advance, the problem does not occur, as shown in Example A-22. After this adjustment, the I/O bandwidth on the V7000 storage is about 350 MBps, compared to less than 70 MBps before the adjustment.
Example A-22 Create and allocate the mmaped file before concurrent write
#dd if=/dev/zero of=./m.txt bs=1M count=32
#./testmmap 8
From the curt and filemon output in Example A-23, the average elapsed time of msync is now 26.3662 ms; the average I/O size is 2048 blocks (1 MB), and the seek percentage is 25.2%, which means that the majority of the I/O is sequential. You can also see from the fileplace output at the bottom of Example A-23 that the file sequentiality is 100%.
Example A-23 Curt and filemon report after adjustment
curt report:
...
System Calls Summary
--------------------
Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime SVC (Address)
(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ======== ========= ========= ========= ================
460 1885.0173 4.29% 4.0979 0.0192 9.0059 12128.4694 26.3662 10.4989 61.4130 msync(32a63e8)
441 294.1300 0.67% 0.6670 0.0178 1.1844 2540.1148 5.7599 0.6690 33.9758 munmap(32a63d0)
459 19.4737 0.04% 0.0424 0.0090 0.1565 751.7646 1.6378 0.0467 8.8085 mmap(32a6400)
 
filemon report:
...
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
 
VOLUME: /dev/fslv05 description: /data
writes: 3264 (0 errs)
write sizes (blks): avg 2048.0 min 2048 max 2048 sdev 0.0
write times (msec): avg 5.379 min 2.724 max 29.700 sdev 2.657
write sequences: 822
write seq. lengths: avg 8132.2 min 2048 max 32768 sdev 3776.7
seeks: 822 (25.2%)
seek dist (blks): init 22528,
avg 25164.7 min 2048 max 67584 sdev 16029.7
time to next req(msec): avg 3.062 min 0.001 max 39.448 sdev 5.873
throughput: 333775.0 KB/sec
utilization: 0.59
 
fileplace report:
#fileplace -pv m.txt
 
File: m.txt Size: 33554432 bytes Vol: /dev/fslv05
Blk Size: 4096 Frag Size: 4096 Nfrags: 8192
Inode: 6 Mode: -rw-r--r-- Owner: root Group: system
 
Physical Addresses (mirror copy 1) Logical Extent
---------------------------------- ----------------
26732416-26740351 hdisk1 7936 frags 32505856 Bytes, 96.9% 00000768-00008703
26740608-26740863 hdisk1 256 frags 1048576 Bytes, 3.1% 00008960-00009215
 
8192 frags over space of 8448 frags: space efficiency = 97.0%
2 extents out of 8192 possible: sequentiality = 100.0%
 
PerfPMR
PerfPMR is an official AIX Support tool. It provides a set of scripts to gather AIX performance data, including:
600 seconds (default) of general system performance data (monitor.sh 600)
System trace data
 – Trace data for reporting
 – Trace data for post processing tools (curt, tprof, pprof, and splat)
 – Stand alone tprof and filemon data collection
PMU events count data
Hardware and software configuration data
iptrace and tcpdump data
PerfPMR is available from this public website:
ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr
Refer to the README in the PerfPMR package for data collection instructions and for how to send data to IBM. We suggest that you collect a baseline PerfPMR data package in advance, under normal conditions, and then collect more performance data while the problem is occurring.
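As a rough sketch only (the directory names below are illustrative assumptions; the README describes the exact steps for your PerfPMR level), a typical collection run looks like this:
#mkdir /tmp/perfdata
#cd /tmp/perfdata
#perfpmr.sh 600
The argument is the collection period in seconds; 600 seconds is the default general collection period mentioned above. The perfpmr.sh script is run as root from the directory where the data is to be written, with the unpacked PerfPMR directory in your PATH or referenced by its full path.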
You can get more information about PerfPMR at the following website:
http://www-01.ibm.com/support/docview.wss?uid=aixtools-27a38cfb
 
Note: We suggest that you use only the PerfPMR tools when collecting data on production systems. Do not use the trace command directly on production systems unless you are required to do so.
The hpmstat and hpmcount utilities
IBM Power processors provide on-chip performance monitoring units (PMUs) to count performance-critical processor events. Use the PMU-based utilities to analyze specific processor performance problems, including memory affinity, TLB misses, SLB misses, and so on.
Two commands can be used to count and analyze PMU events: hpmcount and hpmstat. The hpmcount command collects PMU statistics for the execution of a specific command, while the hpmstat command collects data system wide. We already showed an example of using hpmstat to monitor SLB misses in 4.2.3, “One TB segment aliasing” on page 129.
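For example, here is a minimal sketch of counting PMU events for a single command run; the event group number (108, taken from Example A-24 later in this section) is processor-dependent, so check the pmlist output on your own system first:
#hpmcount -g 108 ./testmmap 8
This counts the selected event group only for the duration of the ./testmmap run, whereas hpmstat (Example A-25) counts system wide for a fixed interval.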
 
Note: The ASO/DSO functionality also uses PMU counters for performance analysis. When you are running the hpmstat and hpmcount commands, ASO/DSO might stop functioning for a while.
Here is another example of using hpmstat to identify memory affinity issues. As shown in the pmlist output, group #108 is used for memory affinity analysis; refer to Example A-24.
Example A-24 Determine the processor event group for memory affinity
#pmlist -g -1|pg
...
Group #108: pm_dsource12
Group name: Data source information
Group description: Data source information
Group status: Verified
Group members:
Counter 1, event 27: PM_DATA_FROM_RL2L3_MOD : Data loaded from remote L2 or L3 modified
Counter 2, event 25: PM_DATA_FROM_DMEM : Data loaded from distant memory
Counter 3, event 24: PM_DATA_FROM_RMEM : Data loaded from remote memory
Counter 4, event 31: PM_DATA_FROM_LMEM : Data loaded from local memory
Counter 5, event 0: PM_RUN_INST_CMPL : Run_Instructions
Counter 6, event 0: PM_RUN_CYC : Run_cycles
Thus you can use the hpmstat command to monitor memory affinity status, as in Example A-25. In this example, we monitored PMU event group #108, which contains the memory affinity metrics, for 20 seconds. The memory locality value is 0.667, which means that 66.7% of memory accesses are local, which indicates good memory affinity. You can try resource sets (RSETs) and vmo options to get even better memory affinity; a sketch of the resource set approach follows Example A-25.
Example A-25 Memory affinity report
# hpmstat -r -g 108 20
Execution time (wall clock time): 20.009228351 seconds
 
Group: 108
Counting mode: user+kernel+hypervisor+runlatch
Counting duration: 1040.320869363 seconds
PM_DATA_FROM_RL2L3_MOD : 66621
(Data loaded from remote L2 or L3 modified)
PM_DATA_FROM_DMEM (Data loaded from distant memory) : 0
PM_DATA_FROM_RMEM (Data loaded from remote memory) : 60922
PM_DATA_FROM_LMEM (Data loaded from local memory) : 122057
PM_RUN_INST_CMPL (Run_Instructions) : 7322552457
PM_RUN_CYC (Run_cycles) : 78132452650
 
Normalization base: time
 
Counting mode: user+kernel+hypervisor+runlatch
 
Derived metric group: Memory
 
[ ] Memory locality : 0.667
[ ] Memory load traffic : 2859.047 MBytes
[ ] Memory load bandwidth per processor : 2.748 MBytes/s
[ ] Number of loads from local memory per loads from remote memory: 2.003
[ ] Number of loads from local memory per loads from remote and distant memory: 2.003
 
Derived metric group: dL1_Reloads_percentage_per_inst
 
[ ] % of DL1 Reloads from Remote L2 or L3 (Modified) per Inst: 0.001 %
[ ] % of DL1 Reloads from Local Memory per Inst : 0.002 %
[ ] % of DL1 Reloads from Remote Memory per Inst : 0.001 %
[ ] % of DL1 Reloads from Distant Memory per Inst : 0.000 %
 
Derived metric group: General
 
[ ] Run cycles per run instruction : 10.670
[ ] MIPS : 7.039 MIPS
 
u=Unverified c=Caveat R=Redefined m=Interleaved
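As noted earlier, one way to improve memory affinity further is to bind the workload to a resource set so that its threads run on, and allocate memory from, a single affinity domain. The following is a minimal sketch only, assuming that the workload is the testmmap program from the earlier case study and that logical CPUs 0-3 and memory region 0 belong to the same domain on your system (check with lssrad -av); it is not a tuning recommendation:
#lssrad -av
#execrset -c 0-3 -m 0 -e ./testmmap 8
After such a change, rerun the hpmstat command from Example A-25 and compare the memory locality metric with the earlier baseline.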
 
 