Performance monitoring tools and what they are telling us
This appendix discusses performance monitoring tools and explains what the tools are telling us while trying to tune your IBM POWER Systems environment.
The following topics are discussed:
NMON
lpar2rrd
Trace tools and PerfPMR
The hpmstat and hpmcount utilities
NMON
Any AIX administrator who has worked on any performance issue should already be very familiar with and fond of the nmon tool. For those who have not, here is a short history lesson:
nmon (Nigel’s performance MONitor) was created in 1997 by Nigel Griffiths, an employee of IBM UK Ltd. He saw a need for a simple, small and safe tool to monitor AIX in real-time, and additionally capture numerous elements of AIX usage including processor, processes, memory, network and storage.
Originally written for himself, news of its existence quickly spread.
The first releases supported AIX 4 (which was the current major release at the time) and Nigel continued to improve and enhance nmon to support subsequent AIX 5 and 6 releases.
nmon has been included and installed by default since 2008, starting with AIX 5.3 TL09, AIX 6.1 TL02, and VIOS 2.1. The version included with AIX is maintained and supported by the AIX Performance Tools team.
nmon has since been ported to Linux and is included in various distributions for PowerPC (PPC), x86 and even Raspberry Pi.
 
Note: Because nmon is included with all current releases of AIX, there is no need to install a legacy nmon release (known as classic nmon). Classic nmon was written for AIX versions that are no longer supported, and it does not support newer hardware or software generations. Neither IBM nor Nigel supports or recommends the use of classic nmon on any current AIX release.
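Because the AIX-supplied nmon is already installed, you can start using it right away. The following is a minimal sketch of an interactive session and of a typical recording run; the interval and snapshot count are arbitrary values chosen for illustration, not recommendations:
#nmon
#nmon -f -s 30 -c 120
The -f option records the snapshots to a .nmon file in the current directory for later post processing (for example, with the nmon analyser spreadsheet), -s sets the interval in seconds between snapshots, and -c sets the number of snapshots to take.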
lpar2rrd
lpar2rrd is a Perl-based tool that connects to the HMC (or IVM) via SSH and collects performance data. It is agentless, and it collects processor utilization for each LPAR as well as global processor utilization of the frames managed by the HMCs. Data is stored in an RRD database. Graphs are automatically generated and presented by a web server. Information is available for each LPAR, or grouped by shared processor pool and by entire frame.
The tool is free (GPL) and there is an option to contract support to get enhanced features. lpar2rrd can be downloaded at:
www.lpar2rrd.com
It can be installed on any UNIX system. The version used in this book is 3.20.
Trace tools and PerfPMR
The AIX system trace is designed to trace activities inside both the kernel and kernel extensions. When trace is activated, the selected trace events are recorded in an internal buffer, and the data in the buffer is written to the trace log file later. The raw trace data is in binary format, and AIX provides further utilities to process it and generate readable reports from different perspectives. Such utilities include trcrpt, tprof, curt, splat, pprof, filemon, and netpmon.
The tracing utilities are very useful for identifying difficult performance problems. In the following section we discuss some details on how to use the tools and read the reports.
 
Note: Although the trace facilities are designed to have little impact on system performance, they are not intended for auditing purposes. Do not run trace as a batch job on production systems unless IBM support personnel require you to do so.
AIX system trace basics
The system trace is activated using the trace command. Here we introduce the trace command and some of the trace options, and offer examples on how to use it.
A quick start example
Example A-1 shows how to get a filemon report from trace. This is useful when you need to postprocess the trace log file from a remote machine, or when the system load is high and trace events are being missed by filemon.
Example A-1 Quick start example of using the trace utilities
#trace -andfp -J filemon -C all -r PURR -T 30000000 -L 30000000 -o trace.raw
#trcon
#sleep 5
#trcstop
#gensyms -F > gensyms.out
#/usr/bin/trcrpt -C all -r trace.raw > trace.bin
 
#filemon -i trace.bin -n gensyms.out -O all,detailed -o filemon.out
System trace mode
The system trace can be used in asynchronous mode or in subcommand mode. Choose asynchronous mode by specifying the -a option with the trace command. The trace daemon runs in the background in this mode, and you can use trcon, trcoff, and trcstop to start tracing, stop tracing, and exit a tracing session, respectively (Example A-2).
Example A-2 Start, stop system trace in asynchronous mode
To start trace daemon and trace data collection:
#trace -a
To stop:
#trcstop
 
To start trace daemon and delay the data collection
#trace -a -d
 
To start data collection:
#trcon
 
To stop data collection:
#trcoff
 
To exit tracing session:
#trcstop
 
If -a is not specified, the system trace runs in subcommand mode, in which you interact with the tracing facilities through a dialog. This mode is seldom used and is not discussed here.
Default trace setting
Get the default trace settings with the trcctl command, as shown in Example A-3. You can see that the default trace log file is /var/adm/ras/trcfile, with a size of 2621440 bytes.
Example A-3 Get the default trace settings using trcctl
# trcctl -l
Default Buffer Size: 262144
Default Log File Size: 2621440
Default Log File: /var/adm/ras/trcfile
Non-Root User Buffer Size Maximum: 1048576
Default Components Directory: /var/adm/ras/trc_ct
Default LMT Log Dir: /var/adm/ras/mtrcdir
We suggest that you keep the default trcctl values and use trace command options, discussed later, to override them when needed. If you change the defaults by mistake, you can use trcctl -r to restore them, as shown in Example A-4.
Example A-4 Restore the default trace settings using trcctl
# trcctl -r
System trace buffer size and buffer mode options
The system trace utilities have three buffer modes. Table A-1 shows the available buffer modes and the corresponding options in parentheses.
Table A-1 Available trace buffer modes
Buffer mode
Description
Alternative (default)
Two trace buffers are allocated, and all trace events written into the buffers are captured in the trace log file. The data in one buffer is written to the trace log file while trace events are recorded into the alternate buffer.
Trace buffer wraparound occurs if the buffers switch and the trace events are not written into the trace log file in a timely manner.
Trace log wraparound also occurs when the trace log file fills up and the trace daemon continues to collect trace data. However, this can be avoided by specifying the -s option with trace, to stop tracing when the trace log fills.
Circular (-l)
The trace events wrap within the trace buffers and are not captured in the trace log file until the trace data collection is stopped.
Single (-f)
The collection of trace events stops when the trace buffer fills up and the contents of the buffer are captured in the trace log file.
This option is frequently used to analyze performance issues. Most of the examples used in Appendix A adopt this option, such as Example A-1, Example A-10, Example A-11, and so on.
Use -T to specify a trace buffer size other than the default. For alternative and circular mode, the buffer size ranges from 8192 bytes to 268,435,184 bytes. For single buffer mode, the buffer size ranges from 16,392 bytes to 536,870,368 bytes.
If you specify a trace buffer size using -T, you also need to specify the trace log file size using -L. The following criteria must be met:
In the circular and alternate modes, the trace buffer size must be one-half or less the size of the trace log file.
In the single mode, the trace log file must be at least the size of the buffer.
By default, all logical processors share the same trace buffer. However, the system trace subsystem can also use a separate trace buffer for each logical processor, which you request by specifying the -C all option. This is useful to avoid the situation where one logical processor is much more active than the others and its events overflow theirs.
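To illustrate these sizing rules, here is a minimal sketch (the sizes are arbitrary examples, not recommendations): the first command starts a circular-mode trace whose log file is twice the buffer size, and the second starts a single-buffer-mode trace whose log file equals the buffer size. Each tracing session is ended with trcstop, as usual.
#trace -a -l -T 10000000 -L 20000000 -o trace.raw
#trace -a -f -T 10000000 -L 10000000 -o trace.raw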
 
Note: The trace buffer is pinned in memory, and so it consumes more physical memory if you specify a larger value.
Trace hooks and event groups
Because there is a large number of trace events, AIX uses trace event identifiers and trace event groups to categorize and manage them. The trace event identifier is also called the trace hook identifier.
Example A-5 shows how to get all the trace hook identifiers using trcrpt. The four digits at the beginning of each line are the trace event identifier, for example, 0010 is the trace event identifier for the TRACE ON event.
Example A-5 Using trcrpt -j to get all the trace hook identifiers
#trcrpt -j|pg
0010 TRACE ON
0020 TRACE OFF
0030 TRACE HEADER
0040 TRACEID IS ZERO
0050 LOGFILE WRAPAROUND
0060 TRACEBUFFER WRAPAROUND
0070 UNDEFINED TRACE ID
0080 DEFAULT TEMPLATE
0090 trace internal events
00a0 TRACE_UTIL
1000 FLIH
1001 OFED Trace
1002 STNFS GENERAL
1003 STNFS VNOPS
1004 STNFS VNOPS HOLD/RELE
1005 STNFS ID
1006 STNFS IO
1007 STNFS OTW
1008 STNFS TREE
1009 STNFS VFSOPS
100b TRACE_UNMANAGED
1010 SYSTEM CALL
101d DISPATCH SRAD
101e DISPATCH Affinity
1020 SLIH
1022 MSTOR_ERR
1025 SFWCOM
1027 THRD_REAFF
1028 THRD_MREAFF
1029 SFW
...
Trace event groups are defined sets of trace events. Example A-6 shows how to list the trace event groups using trcrpt. The name at the beginning of each stanza is the name of the trace group. For example, tidhk is the name of the group “Hooks needed to display thread name”.
Example A-6 Using trcrpt to get all trace event groups
#trcrpt -G|pg
tidhk - Hooks needed to display thread name (reserved)
106,10C,134,139,465
...
tprof - Hooks for TPROF performance tool (reserved)
134,139,210,234,38F,465,5A2,5A5,5D8
pprof - Hooks for PPROF performance tool (reserved)
106,10C,134,135,139,419,465,467,4B0,5D8
filemon - Hooks for FILEMON performance tool (reserved)
101,102,104,106,107,10B,10C,10D,12E,130,137,139,154,15B,163,19C,1BA,1BE,1BC,1C9,221,222,228,232,2A1,2A2,3D3,419,45B,4B0,5D8,AB2
netpmon - Hooks for NETPMON performance tool (reserved)
100,101,102,103,104,106,107,10C,134,135,139,163,19C,200,210,211,212,213,215,216,252,255,256,262,26A,26B,2A4,2A5,2A7,2A8,2C3,2C4,2DA,2DB,2E6,2E7,2EA,2EB,30A,30B,320,321,32D,32E,330,331,334,335,351,352,38F,419,465,467,46A,470,471,473,474,488,48A,48D,4A1,4A2,4B0,4C5,4C6,598,599,5D8
curt - Hooks for CURT performance tool (reserved)
100,101,101D,102,103,104,106,10C,119,134,135,139,200,210,215,38F,419,465,47F,488,489,48A,48D,492,4B0,5D8,600,603,605,606,607,608,609
splat - Hooks for SPLAT performance tool (reserved)
106,10C,10E,112,113,134,139,200,419,465,46D,46E,492,5D8,606,607,608,609
perftools - Hooks for all six performance tools (reserved)
100,101,101D,102,103,104,106,107,10B,10C,10D,10E,112,113,119,12E,130,134,135,137,139,154,15B,163,19C,1BA,1BC,1BE,1BC,1C9,200,210,211,212,213,215,216,221,222,228,232,234,252,255,256,262,26A,26B,2A1,2A2,2A4,2A5,2A7,2A8,2C3,2C4,2DA,2DB,2E6,2E7,2EA,2EB,30A,30B,320,321,32D,32E,330,331,334,335,351,352,38F,3D3,419,45B,465,467,46A,46D,46E,470,471,473,474,47F,488,489,48A,48D,492,4A1,4A2,4B0,4C5,4C6,598,599,5A2,5A5,5D8,600,603,605,606,607,608,609,AB2
tcpip - TCP/IP PROTOCOLS AND NETWORK MEMORY (reserved)
252,535,536,537,538,539,254,25A,340
...
Trace event groups are useful because you usually need to trace a collection of trace events to identify a specific functional or performance problem. As an example, the most frequently used trace-based tools tprof, pprof, filemon, netpmon, curt, and splat all have their corresponding specific trace event groups that contain all necessary trace event identifiers for generating reports.
You can also use trcevgrp -l to display details about specific event groups, as shown in Example A-7.
Example A-7 Using trcevgrp to display specific event groups
#trcevgrp -l filemon
filemon - Hooks for FILEMON performance tool (reserved)
101,102,104,106,107,10B,10C,10D,12E,130,137,139,154,15B,163,19C,1BA,1BE,1BC,1C9,221,222,228,232,2A1,2A2,3D3,419,45B,4B0,5D8,AB2
Choosing trace hooks and event groups
You can choose the trace hooks and event groups using specific trace options, shown in Table A-2.
Table A-2 Trace options and descriptions
Trace options
Description
-j <event identifier>
Specifies the user-defined events for which you want to collect trace data.1
You can specify a two-digit hook ID in the form hh (which selects hh00, hh10, ..., hhF0), a three-digit hook ID in the form hhh (which selects hhh0), or a four-digit hook ID in the form hhhh.
-J <event group>
Specifies the event groups to be included.2
-k <event identifier>
Specifies the user-defined events for which you want to exclude trace data.1
-K <event group>
Specifies the event groups to be excluded.2

1 Multiple events can be separated by commas.
2 Multiple groups can be separated by commas.
 
Note: Usually you need to include the tidhk event group to get detailed process and thread information, and to exclude the vmm events if they are irrelevant to the problem you are analyzing, because there are usually many vmm events.
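For example, the following sketch includes the curt and tidhk groups and excludes the vmm group. This assumes that a vmm event group exists at your AIX level; verify it first with trcevgrp -l before relying on it:
#trcevgrp -l vmm
#trace -a -f -J curt,tidhk -K vmm -T 10000000 -L 10000000 -o trace.raw
#sleep 5; trcstop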
Merging trace log files and reordering
In alternative mode, the trace daemon normally wraps around the trace log when it fills up and continues to collect trace data, unless the -s option is specified. In circular mode, trace buffer wraparound might occur. Because of these wraparound events, the trace log file might not be in chronological order, and the post-processing tools such as curt, tprof, and filemon need the trace log file in chronological order.
You can force the reordering with the trcrpt -r option, as shown in Example A-8.
Example A-8 Reordering the trace log file
trace.raw is the original trace log file; we reorder it chronologically into trace.r:
#trcrpt -o trace.r -r trace.raw
On systems with multiple logical processors where trace was run with the -C all option, you need to specify -C all together with -r to merge, reorder, and sort the per-processor trace log files into one raw data file in chronological order. We suggest that you perform this step before every post-processing run. Refer to Example A-9 on page 322.
Example A-9 Merging, reordering, and sorting the trace log file in chronological order
#trcrpt -o trace.r -r -C all trace.raw
Generate the trace report
We demonstrated using trcrpt to generate a report in ASCII format in 4.4.3, “File system best practice” on page 163. Specifying the -O option generates more detailed reports; the most frequently used -O options are shown in Table A-3, and multiple options can be separated by commas. trcrpt can also be run through SMIT (for example, smitty trcrpt).
Table A-3 trcrpt -O options
Frequently used trcrpt -O options
Explanation
pid=on
Displays the process IDs in the trace report.
tid=on
Displays the thread ID in the trace report.
exec=on
Displays the exec path names in the trace report.
svc=on
Displays the value of the system call in the trace report.
timestamp=[0|1|2|3|4]
The most frequently used values are 0 and 1.
0 - Time elapsed since the trace was started and delta time from the previous event. The elapsed time is in seconds and the delta time is in milliseconds. This is the default.
1 - Short elapsed time. Reports only the elapsed time (in seconds) from the start of the trace.
cpuid=on
Displays the physical processor number in the trace report.
Example A-10 shows how to start the trace data collection and generate a trace report.
Example A-10 Generate a trace report
//comments: start trace immediately.
#trace -a -f -T 10000000 -L 10000000 -o trace.raw
 
//comments: monitor for 5 seconds; then stop tracing.
#sleep 5; trcstop
 
//comments: generate trace report “trace.int” from the trace log file “trace.raw”.
#trcrpt -Opid=on,tid=on,exec=on,svc=on,cpuid=on,timestamp=1 -o trace.int ./trace.raw
 
//comments: now you can read the trace report.
#more trace.int
...
 
Note: This example neither selects specific trace hooks nor excludes any, so all trace events are logged. The trace buffer might fill up before any useful trace events are written.
You can now get a more detailed report using the -O options for similar DIO demotion cases; refer to Example A-11. From this report, you can see that “dd” is causing the DIO demotion with an unaligned I/O length (0x1001).
Example A-11 Using trcrpt with -O options
//comments: include hook id “59b” for DIO activities, and “tidhk” for process name reporting.
#trace -J tidhk -j 59b -a -f -T 10000000 -L 10000000 -o trace.raw
 
//comments: monitor for 5 seconds, and then stop tracing.
#sleep 5; trcstop
 
//comments: filter the trace report to display only hook id “59b” events.
#trcrpt -d 59b -Opid=on,tid=on,exec=on,svc=on,cpuid=on,timestamp=1 -o trace.int ./trace.raw
#more trace.int
...
ID PROCESS NAME CPU PID TID I SYSTEM CALL ELAPSED APPL SYSCALL KERNEL INTERRUPT
...
59B dd 0 2162750 30933019 0.000113 JFS2 IO write: vp = F1000A0232DF8420, sid = DE115E, offset = 00000000052FB2F6, length = 1001
59B dd 0 2162750 30933019 0.000113 JFS2 IO dio move: vp = F1000A0232DF8420, sid = DE115E, offset = 00000000052FB2F6, length = 1001
59B dd 0 2162750 30933019 0.000126 JFS2 IO devstrat (pager strategy): bplist = F1000A00E067C9A0, vp = F1000A0232DF8420, sid = DE115E, lv blk = A60650, bcount = 0400
59B dd 4 2162750 30933019 0.000492 JFS2 IO gather: bp = F1000A00E067D2E0, vp = F1000A0232DF8420, sid = DE115E, file blk = 297D8, bcount = 2000
59B dd 4 2162750 30933019 0.000495 JFS2 IO devstrat (pager strategy): bplist = F1000A00E067D2E0, vp = F1000A0232DF8420, sid = DE115E, lv blk = A60650, bcount = 1400
59B dd 4 2162750 30933019 0.000514 JFS2 IO dio demoted: vp = F1000A0232DF8420, mode = 0001, bad = 0002, rc = 0000, rc2 = 0000
If you are using trace data collected on another machine, you need the trcnm output and the /etc/trcfmt file from that machine to generate the ASCII report; otherwise, the output will be incorrect. Refer to Example A-12.
Example A-12 Generate a trace report for a remote machine
On remote machine,
#trcnm > trcnm.out
#cp /etc/trcfmt trace.fmt
 
On reporting machine,
Download the “trcnm.out” and “trace.fmt” together with the trace raw data.
#trcrpt -n trcnm.out -t trace.fmt <other options>
Generate curt, tprof, filemon, splat, and netpmon reports from trace logs
You can generate curt, tprof, pprof, splat, filemon, and netpmon reports from the trace log if the related trace events are included in the trace log. Example A-13 shows an example of using the trace log to generate a curt report.
Example A-13 Generate a curt report using trace log
//comments: start trace using single buffer mode to collect trace event group “curt”, and delay starting of the trace data collection.
#/usr/bin/trace -andfp -C all -J curt -r PURR -T 20000000 -L 20000000 -o trace_bin
 
//comments: start trace data collection and monitor for 10 seconds, then stop tracing.
#trcon; sleep 10; trcstop
 
//comments: gather the symbol information necessary to run the “curt” command.
#gensyms > gensyms.out
 
//comments: reorder the trace logs to a single trace raw file “trace_curt”.
#trcrpt -C all -r trace_bin > trace_curt
 
//comments: generate curt report
#/usr/bin/curt -i trace_curt -n gensyms.out -ptes > curt.out
Example A-14 shows an example of using the trace log to generate a tprof report. Note that to run the tprof command in manual offline mode, the following files must be available:
The symbolic information file rootstring.syms
The trace log file rootstring.trc [-cpuid]
Example A-14 Generate a tprof report using trace log
//comments: start trace using single buffer mode to collect trace event group “tprof”, and delay starting of the trace data collection.
#/usr/bin/trace -andfp -C all -J tprof -r PURR -T 20000000 -L 20000000 -o myprof.trc
 
//comments: start trace data collection and monitor for 10 seconds, then stop tracing.
#trcon; sleep 10; trcstop
 
//comments: gather the symbol information necessary to run the “tprof” command.
#gensyms > myprof.syms
 
//comments: generate tprof report; rootstring equals to “myprof” here.
#tprof -uskejzlt -r myprof
# more myprof.prof
For a detailed explanation of the tprof and curt reports, refer to POWER7 and POWER7+ Optimization and Tuning Guide, SG24-8079.
Example A-15 on page 325 shows an example of using the trace log to generate a filemon report. It is good practice to avoid “trace events lost” by using -f to select the single buffer mode, as in this example.
Example A-15 Generate a filemon report using the trace log
//comments: start trace using single buffer mode to collect trace event group “filemon”, and delay starting of the trace data collection.
#trace -andfp -J filemon -C all -r PURR -T 20000000 -L 20000000 -o trace.raw
 
//comments: start trace data collection and monitor for 10 seconds, then stop tracing.
#trcon; sleep 10; trcstop
 
//comments: gather the symbol information necessary to run the “filemon” command.
#gensyms -F > gensyms.out
 
//comments: reorder the trace logs to a single trace raw file “trace.fmon”.
#/usr/bin/trcrpt -C all -r trace.raw > trace.fmon
 
//comments: generate filemon report “filemon.out”
#filemon -u -i trace.fmon -n gensyms.out -O all,detailed -o filemon.out
For a detailed explanation of the filemon report, refer to 4.4.4, “The filemon utility” on page 176.
Using the truss command
The truss command is useful for identifying application problems, used together with the system tracing utilities such as tprof and curt.
After identifying the problematic process with tprof, curt, filemon, and so on, use the truss command to trace that process. Some frequently used options are shown in Table A-4.
Table A-4 truss options
truss options
Explanation
-p <pid>
Specifies the processes for tracing the execution.
-c
Counts traced system calls, faults, and signals instead of displaying trace results line by line.
Produces a summary report after the traced command terminates or when truss is interrupted.
Use this option alone or with -f.
-d
Includes a timestamp with each line of output.
-f
Follows all child processes.
-l
Displays the thread ID in the output.
-t[!]syscall
Includes or excludes system calls from the trace process.
-u [!] [LibraryName [...]::[!]FunctionName [...] ]
Traces dynamically loaded user level function calls from user libraries.
Example A-16 gives an example of using truss to debug application issues.
Example A-16 Using truss to identify application issues
To get a summary report of the process, truss -c -p <pid> for an interval and then Ctrl+C:
#truss -c -p 6619376
^Csyscall seconds calls errors
kwrite .00 6
munmap 12.06 81
msync 24.36 81
mmap .00 79
---- --- ---
sys totals: .00 247 0
usr time: .00
elapsed: .00
 
truss the execution of the process for an interval and Ctrl+C to stop truss:
#truss -d -o truss.log -l -p 6619376
^C
With -l, the thread id is shown at each line:
#more truss.log
Mon Oct 15 16:54:36 2012
29294829: 0.0000: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30400000
30933087: kwrite(1, " m s y n c i n t e r v".., 45) = 45
30933087: 0.0007: munmap(0x30000000, 4194304) = 0
30933087: 0.0014: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 8, 20971520) = 0x30000000
29294829: 0.0062: msync(0x30400000, 4194304, 32) = 0
30933087: 0.2532: msync(0x30000000, 4194304, 32) = 0
29294829: 0.6684: munmap(0x30400000, 4194304) = 0
30933087: 0.6693: munmap(0x30000000, 4194304) = 0
29294829: 0.6699: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30000000
30671061: 0.6702: mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 9, 25165824) = 0x30400000
29294829: 0.6751: msync(0x30000000, 4194304, 32) = 0
30671061: 0.7616: msync(0x30400000, 4194304, 32) = 0
 
Use “-t” flag to trace system calls, as follows:
#truss -tmmap,msync,munmap -p 6619376
munmap(0x30400000, 4194304) = 0
munmap(0x31400000, 4194304) = 0
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30400000
munmap(0x30800000, 4194304) = 0
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 8, 20971520) = 0x30800000
munmap(0x30C00000, 4194304) = 0
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 6, 12582912) = 0x30C00000
msync(0x30800000, 4194304, 32) = 0
 
Use “-u” to trace dynamically loaded user level function calls from user libraries, such as libc.a.
#truss -u libc.a::* -p 6619376
kwrite(1, " m s y n c i n t e r v".., 46) = 46
->libc.a:gettimeofday(0x200dba98, 0x0)
->libc.a:gettimeofday(0x200f9a90, 0x0)
<-libc.a:gettimeofday() = 0 0.000000
munmap(0x31400000, 4194304) = 0
<-libc.a:gettimeofday() = 0 0.000000
->libc.a:gettimeofday(0x200f9a98, 0x0)
<-libc.a:gettimeofday() = 0 0.000000
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 10, 29360128) = 0x30000000
mmap(0x00000000, 4194304, PROT_READ|PROT_WRITE, MAP_FILE|MAP_VARIABLE|MAP_SHARED, 9, 25165824) = 0x30400000
->libc.a:gettimeofday(0x200bda98, 0x0)
->libc.a:gettimeofday(0x200f9a90, 0x0)
 
 
Tip: You can implement most of the truss features using probevue, too. Refer to the probevue user guide at:
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/probevue_userguide.htm
Real case studies using tracing facilities
Here we introduce a real-life scenario to explain how to use the trace tools to identify a performance problem, based on the approaches discussed earlier.
 
Note: To avoid disclosure of sensitive data, we used a sample program here to duplicate the client problem.
Problem description
The client complains that I/O is quite slow when a multithreaded application writes to an mmaped data file. According to the application log, the application occasionally hangs for many seconds waiting for I/O to complete.
Also, the client believes that the I/O pattern should be sequential, but the bandwidth is limited to only 70 MBps on V7000 storage.
The sample code used to duplicate the client problem is shown in Example A-17. It opens a file and truncates it to a size that provides 4 MB of space for each thread. Each thread then repeatedly mmaps its own 4 MB region, modifies it, msyncs it, and munmaps it. We duplicated the client problem by running ./testmmap 8.
Example A-17 Sample code for real case studies
/*
 * The following [enclosed] code is sample code created by IBM
 * Corporation. This sample code is not part of any standard IBM product
 * and is provided to you solely for the purpose of demonstration.
 * The code is provided 'AS IS',
 * without warranty of any kind. IBM shall not be liable for any damages
 * arising out of your use of the sample code, even if they have been
 * advised of the possibility of such damages.
 */
/*
To compile: xlC_r testmmap.cpp -o testmmap
Problem report: [email protected]
*/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/select.h>

#define MAXTHREADNUM 128
#define INTERVAL 10000
#define BUF_SIZE 4*1024*1024

int testmmap(int arg)
{
    int fd = 0, n = 0, i = 0;
    timeval start, end;
    void *mp = NULL;
    struct stat sbuf;
    char idx = 0;
    long interval;
    int mplen = BUF_SIZE;

    if((fd = open("m.txt", O_RDWR, 0644)) >= 0)
    {
        fstat(fd, &sbuf);
        printf("file size = %d\n", sbuf.st_size);
    }
    else
    {
        printf("open failed\n");
        return 0;
    }
    while(1)
    {
        idx++;
        /* each thread maps its own 4 MB region of the file, dirties it,
           flushes it with msync, and unmaps it */
        if((mp = mmap(NULL, mplen, PROT_READ|PROT_WRITE, MAP_SHARED, fd, arg*BUF_SIZE)) != (void*)-1 )
        {
            memset(mp, idx, mplen);
            msync(mp, mplen, MS_SYNC);
            munmap(mp, mplen);
        }
    } //while
    close(fd);
    return 0;
}

extern "C" void* testfunc(void* arg)
{
    unsigned int i=0;
    int ret = 0;
    testmmap((int)arg);
    return NULL;
}

int main(int argc, char* argv[])
{
    pthread_t tid[MAXTHREADNUM] = {0};
    int ret;
    struct stat sbuf;

    if(argc < 2)
    {
        printf("usage: %s <num>\n", argv[0]);
        return 0;
    }

    int count = atoi(argv[1]);
    if(count > MAXTHREADNUM)
        count = MAXTHREADNUM;

    int fd, bufsz;
    bufsz = count * BUF_SIZE;
    if((fd = open("m.txt", O_RDWR|O_CREAT, 0644)) > 0)
    {
        fstat(fd, &sbuf);
        printf("file size = %d\n", sbuf.st_size);
        ftruncate(fd, bufsz); //file is truncated, this might cause fragmentation if the file is not allocated.
        fstat(fd, &sbuf);
        printf("file size after truncate = %d\n", sbuf.st_size);
    }
    close(fd);

    for(int i = 0; i < count; i++)
    {
        ret = pthread_create(&tid[i], NULL, testfunc, (void*)i);
        if(ret != 0)
        {
            printf("pthread_create error, err=%d\n", ret);
            return -1;
        }
    }
    void* status;
    for(int i = 0; i < count; i++)
    {
        pthread_join(tid[i], &status);
        printf("thread %d exit status = %d\n", tid[i], (int)status);
    }
    return 0;
}
Problem analysis
We generated a curt report, shown in Example A-18, using the approach illustrated in Example A-13 on page 324. We found that the msync system call was very slow, with an average elapsed time of 261.1781 ms and a maximum elapsed time of 610.8883 ms, while the average processor time consumed was only 1.3897 ms. This behavior seemed quite abnormal. Looking at the process detail section, we could see that the msync subroutine was being called by the testmmap process, whose PID was 17629396.
Example A-18 Curt report
#more curt.out
...
System Calls Summary
--------------------
Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime SVC (Address)
(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ======== ========= ========= ========= ================
48 66.7060 0.32% 1.3897 0.0279 7.5759 12536.5502 261.1781 56.6501 610.8883 msync(32a63e8)
47 21.8647 0.10% 0.4652 0.4065 0.5593 7108.8461 151.2520 3.1755 558.8734 munmap(32a63d0)
55 0.8696 0.00% 0.0158 0.0071 0.0274 509.6747 9.2668 2.4847 20.3143 mmap(32a6400)
...
 
Process Details for Pid: 17629396
Process Name: testmmap
8 Tids for this Pid: 38731907 37945371 32964661 29622293 29425739 26476695
9961589 9502927
 
Total Application Time (ms): 70.766818
Total System Call Time (ms): 89.458826
Total Hypervisor Call Time (ms): 8.406510
 
Process System Call Summary
---------------------------
Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime SVC (Address)
(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ======== ========= ========= ========= ================
48 66.7060 0.32% 1.3897 0.0279 7.5759 12536.5502 261.1781 56.6501 610.8883 msync(32a63e8)
47 21.8647 0.10% 0.4652 0.4065 0.5593 7108.8461 151.2520 3.1755 558.8734 munmap(32a63d0)
55 0.8696 0.00% 0.0158 0.0071 0.0274 509.6747 9.2668 2.4847 20.3143 mmap(32a6400)
 
Pending System Calls Summary
----------------------------
Accumulated Accumulated SVC (Address) Tid
Time (msec) ETime (msec)
============ ============ ========================= ================
0.0054 7.7179 msync(32a63e8) 38731907
0.0050 1.5159 msync(32a63e8) 9502927
0.0042 6.4927 msync(32a63e8) 37945371
0.0022 5.3160 msync(32a63e8) 9961589
0.0019 13.7364 mmap(32a6400) 32964661
Then we could trace the msync system call by using the trace or truss command. Because we already had the process ID, it was simpler to use truss directly, as in Example A-19. We saw that the msync I/O size was always 4194304 bytes (4 MB). According to the application logic, the 4 MB consists purely of dirty pages, so it should be one large block of sequential I/O and should be very fast.
Example A-19 trace msync subroutine
#truss -d -fael -tmsync,munmap -p 17629396
17629396: psargs: ./testmmap 8
Tue Oct 16 12:58:09 2012
17629396: 32964661: 0.0000: munmap(0x30400000, 4194304) = 0
17629396: 9961589: 0.0008: munmap(0x30C00000, 4194304) = 0
17629396: 9502927: 0.0014: munmap(0x30800000, 4194304) = 0
17629396: 26476695: 0.0022: munmap(0x31800000, 4194304) = 0
17629396: 9961589: 0.0074: msync(0x30000000, 4194304, 32) = 0
17629396: 29622293: 0.0409: munmap(0x31400000, 4194304) = 0
17629396: 9502927: 0.0419: msync(0x30400000, 4194304, 32) = 0
17629396: 26476695: 0.6403: msync(0x30C00000, 4194304, 32) = 0
17629396: 37945371: 0.6723: munmap(0x31000000, 4194304) = 0
17629396: 29622293: 0.6779: msync(0x31800000, 4194304, 32) = 0
17629396: 38731907: 0.7882: msync(0x30800000, 4194304, 32) = 0
17629396: 29622293: 0.9042: munmap(0x31800000, 4194304) = 0
17629396: 9502927: 0.9049: munmap(0x30400000, 4194304) = 0
17629396: 32964661: 0.9104: msync(0x31C00000, 4194304, 32) = 0
17629396: 9502927: 0.9105: msync(0x30400000, 4194304, 32) = 0
17629396: 9961589: 0.9801: munmap(0x30000000, 4194304) = 0
17629396: 38731907: 0.9809: munmap(0x30800000, 4194304) = 0
17629396: 32964661: 0.9815: munmap(0x31C00000, 4194304) = 0
17629396: 29425739: 0.9829: msync(0x31400000, 4194304, 32) = 0
17629396: 32964661: 1.0529: msync(0x30800000, 4194304, 32) = 0
However, from filemon, the average I/O size underneath msync was 8.3 512-byte blocks (~4 KB), as in Example A-20. Also, the seek ratio was 100%, which means the I/O is purely random, while the seek distance is quite short, only about 110.7 blocks (~55 KB).
Example A-20 filemon report
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
 
VOLUME: /dev/fslv05 description: /data
writes: 23756 (0 errs)
write sizes (blks): avg 8.3 min 8 max 1648 sdev 21.3
write times (msec): avg 128.851 min 3.947 max 493.747 sdev 117.692
write sequences: 23756
write seq. lengths: avg 8.3 min 8 max 1648 sdev 21.3
seeks: 23756 (100.0%)
seek dist (blks): init 5504,
avg 110.7 min 8 max 65904 sdev 1831.6
seek dist (%tot blks):init 0.01312,
avg 0.00026 min 0.00002 max 0.15713 sdev 0.00437
time to next req(msec): avg 0.107 min 0.000 max 1423.920 sdev 10.018
throughput: 37427.0 KB/sec
utilization: 0.44
You can also get some clues about this behavior by using the iostat command.
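As a rough sketch of such an iostat check (assuming hdisk1 is the disk that holds the /data file system), the -D flag produces the extended per-disk statistics, including average transfer sizes, that make small random writes visible:
#iostat -D hdisk1 5 3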
Such issues are usually caused by file fragmentation. We used the fileplace command to confirm this, as in Example A-21. There were 7981 extents in the 32 MB file.
Example A-21 fileplace -pv output
#fileplace -pv m.txt|pg
 
File: m.txt Size: 33554432 bytes Vol: /dev/fslv05
Blk Size: 4096 Frag Size: 4096 Nfrags: 8192
Inode: 6 Mode: -rw-r--r-- Owner: root Group: system
 
Physical Addresses (mirror copy 1) Logical Extent
---------------------------------- ----------------
26732319 hdisk1 1 frags 4096 Bytes, 0.0% 00000671
26732324-26732330 hdisk1 7 frags 28672 Bytes, 0.1% 00000676-00000682
26732332 hdisk1 1 frags 4096 Bytes, 0.0% 00000684
26732337 hdisk1 1 frags 4096 Bytes, 0.0% 00000689
26732348 hdisk1 1 frags 4096 Bytes, 0.0% 00000700
26732353 hdisk1 1 frags 4096 Bytes, 0.0% 00000705
...
26738415 hdisk1 1 frags 4096 Bytes, 0.0% 00006767
26738422 hdisk1 1 frags 4096 Bytes, 0.0% 00006774
26738428 hdisk1 1 frags 4096 Bytes, 0.0% 00006780
 
8192 frags over space of 8252 frags: space efficiency = 99.3%
7981 extents out of 8192 possible: sequentiality = 2.6%
Problem solution
Further analysis shows that the fragmentation is caused by concurrent writes to an unallocated mmaped file (m.txt). If you create and fully allocate m.txt in advance, the problem does not occur, as shown in Example A-22. After this adjustment, the I/O bandwidth on the V7000 storage is about 350 MBps, compared to less than 70 MBps before the adjustment.
Example A-22 Create and allocate the mmaped file before concurrent write
#dd if=/dev/zero of=./m.txt bs=1M count=32
#./testmmap 8
From the curt and filemon output in Example A-23, the average elapsed time of msync is now 26.3662 ms; the average I/O size is 2048 blocks (1 MB), and the seek percentage is 25.2%, which means that the majority of the I/O is sequential. You can also see from the fileplace output at the bottom of Example A-23 that the file sequentiality is 100%.
Example A-23 Curt and filemon report after adjustment
curt report:
...
System Calls Summary
--------------------
Count Total Time % sys Avg Time Min Time Max Time Tot ETime Avg ETime Min ETime Max ETime SVC (Address)
(msec) time (msec) (msec) (msec) (msec) (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ======== ========= ========= ========= ================
460 1885.0173 4.29% 4.0979 0.0192 9.0059 12128.4694 26.3662 10.4989 61.4130 msync(32a63e8)
441 294.1300 0.67% 0.6670 0.0178 1.1844 2540.1148 5.7599 0.6690 33.9758 munmap(32a63d0)
459 19.4737 0.04% 0.0424 0.0090 0.1565 751.7646 1.6378 0.0467 8.8085 mmap(32a6400)
 
filemon report:
...
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
 
VOLUME: /dev/fslv05 description: /data
writes: 3264 (0 errs)
write sizes (blks): avg 2048.0 min 2048 max 2048 sdev 0.0
write times (msec): avg 5.379 min 2.724 max 29.700 sdev 2.657
write sequences: 822
write seq. lengths: avg 8132.2 min 2048 max 32768 sdev 3776.7
seeks: 822 (25.2%)
seek dist (blks): init 22528,
avg 25164.7 min 2048 max 67584 sdev 16029.7
time to next req(msec): avg 3.062 min 0.001 max 39.448 sdev 5.873
throughput: 333775.0 KB/sec
utilization: 0.59
 
fileplace report:
#fileplace -pv m.txt
 
File: m.txt Size: 33554432 bytes Vol: /dev/fslv05
Blk Size: 4096 Frag Size: 4096 Nfrags: 8192
Inode: 6 Mode: -rw-r--r-- Owner: root Group: system
 
Physical Addresses (mirror copy 1) Logical Extent
---------------------------------- ----------------
26732416-26740351 hdisk1 7936 frags 32505856 Bytes, 96.9% 00000768-00008703
26740608-26740863 hdisk1 256 frags 1048576 Bytes, 3.1% 00008960-00009215
 
8192 frags over space of 8448 frags: space efficiency = 97.0%
2 extents out of 8192 possible: sequentiality = 100.0%
 
PerfPMR
PerfPMR is an official AIX Support tool. It provides a set of scripts to gather AIX performance data, including:
600 seconds (default) of general system performance data (monitor.sh 600)
System trace data
 – Trace data for reporting
 – Trace data for post processing tools (curt, tprof, pprof, and splat)
 – Stand alone tprof and filemon data collection
PMU events count data
Hardware and software configuration data
iptrace and tcpdump data
PerfPMR is available from this public website:
ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr
Refer to the README in the PerfPMR package for data collection instructions and for how to send data to IBM. We suggest that you collect a baseline PerfPMR data package in advance, under normal conditions, and then collect more performance data while the problem is occurring.
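As a rough sketch only (the directory names below are illustrative assumptions; the README describes the exact steps for your PerfPMR level), a typical collection run looks like this:
#mkdir /tmp/perfdata
#cd /tmp/perfdata
#perfpmr.sh 600
The argument is the collection period in seconds; 600 seconds is the default general collection period mentioned above. The perfpmr.sh script is run as root from the directory where the data is to be written, with the unpacked PerfPMR directory in your PATH or referenced by its full path.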
You can get more information about PerfPMR at the following website:
http://www-01.ibm.com/support/docview.wss?uid=aixtools-27a38cfb
 
Note: We suggest that you use only the PerfPMR tools when collecting data on production systems. Do not use the trace command directly on production systems unless you are required to do so.
The hpmstat and hpmcount utilities
IBM Power processors provide on-chip performance monitoring units (PMUs) to count performance-critical processor events. Use the PMU-based utilities to analyze specific processor performance problems, including memory affinity, TLB misses, SLB misses, and so on.
Two commands can be used to count and analyze PMU events: hpmcount and hpmstat. The hpmcount command collects PMU statistics for the execution of a specific command, while the hpmstat command collects data system wide. We already showed an example of using hpmstat to monitor SLB misses in 4.2.3, “One TB segment aliasing” on page 129.
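For example, here is a minimal sketch of counting PMU events for a single command run; the event group number (108, taken from Example A-24 later in this section) is processor-dependent, so check the pmlist output on your own system first:
#hpmcount -g 108 ./testmmap 8
This counts the selected event group only for the duration of the ./testmmap run, whereas hpmstat (Example A-25) counts system wide for a fixed interval.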
 
Note: The ASO/DSO functionality also uses PMU counters for performance analysis. When you are running the hpmstat and hpmcount commands, ASO/DSO might stop functioning for a while.
Here is another example of using hpmstat to identify memory affinity issues. As shown in the pmlist output, group #108 is used for memory affinity analysis; refer to Example A-24.
Example A-24 Determine the processor event group for memory affinity
#pmlist -g -1|pg
...
Group #108: pm_dsource12
Group name: Data source information
Group description: Data source information
Group status: Verified
Group members:
Counter 1, event 27: PM_DATA_FROM_RL2L3_MOD : Data loaded from remote L2 or L3 modified
Counter 2, event 25: PM_DATA_FROM_DMEM : Data loaded from distant memory
Counter 3, event 24: PM_DATA_FROM_RMEM : Data loaded from remote memory
Counter 4, event 31: PM_DATA_FROM_LMEM : Data loaded from local memory
Counter 5, event 0: PM_RUN_INST_CMPL : Run_Instructions
Counter 6, event 0: PM_RUN_CYC : Run_cycles
Thus you can use the hpmstat command to monitor memory affinity status, as in Example A-25. In this example, we monitored PMU event group #108, which contains the memory affinity metrics, for 20 seconds. The memory locality value is 0.667, which means that 66.7% of memory accesses are local, which indicates good memory affinity. You can try resource sets (RSETs) and vmo options to get even better memory affinity; a sketch of the resource set approach follows Example A-25.
Example A-25 Memory affinity report
# hpmstat -r -g 108 20
Execution time (wall clock time): 20.009228351 seconds
 
Group: 108
Counting mode: user+kernel+hypervisor+runlatch
Counting duration: 1040.320869363 seconds
PM_DATA_FROM_RL2L3_MOD : 66621
(Data loaded from remote L2 or L3 modified)
PM_DATA_FROM_DMEM (Data loaded from distant memory) : 0
PM_DATA_FROM_RMEM (Data loaded from remote memory) : 60922
PM_DATA_FROM_LMEM (Data loaded from local memory) : 122057
PM_RUN_INST_CMPL (Run_Instructions) : 7322552457
PM_RUN_CYC (Run_cycles) : 78132452650
 
Normalization base: time
 
Counting mode: user+kernel+hypervisor+runlatch
 
Derived metric group: Memory
 
[ ] Memory locality : 0.667
[ ] Memory load traffic : 2859.047 MBytes
[ ] Memory load bandwidth per processor : 2.748 MBytes/s
[ ] Number of loads from local memory per loads from remote memory: 2.003
[ ] Number of loads from local memory per loads from remote and distant memory: 2.003
 
Derived metric group: dL1_Reloads_percentage_per_inst
 
[ ] % of DL1 Reloads from Remote L2 or L3 (Modified) per Inst: 0.001 %
[ ] % of DL1 Reloads from Local Memory per Inst : 0.002 %
[ ] % of DL1 Reloads from Remote Memory per Inst : 0.001 %
[ ] % of DL1 Reloads from Distant Memory per Inst : 0.000 %
 
Derived metric group: General
 
[ ] Run cycles per run instruction : 10.670
[ ] MIPS : 7.039 MIPS
 
u=Unverified c=Caveat R=Redefined m=Interleaved
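As noted earlier, one way to improve memory affinity further is to bind the workload to a resource set so that its threads run on, and allocate memory from, a single affinity domain. The following is a minimal sketch only, assuming that the workload is the testmmap program from the earlier case study and that logical CPUs 0-3 and memory region 0 belong to the same domain on your system (check with lssrad -av); it is not a tuning recommendation:
#lssrad -av
#execrset -c 0-3 -m 0 -e ./testmmap 8
After such a change, rerun the hpmstat command from Example A-25 and compare the memory locality metric with the earlier baseline.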
 
 