Here, we'll look at switch information relating to fault management. We'll identify some key MIBs and show commands that relate to switch health.
From CISCO-STACK-MIB, the moduleStatus variable provides the operational status of the module. If the status is not ok, the value of moduleTestResult gives more detailed information about the module's failure condition(s). The possible values of moduleStatus are as follows:
other(1)—none of the following
ok(2)—status ok
minorFault(3)—minor problem
majorFault(4)—major problem
By polling this MIB object, you can keep watch on the modules installed in the switch rather than keeping track of every port on the switch. Tracking every port can be excessive, except for the trunk and other “critical” ports that you identify.
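The four moduleStatus values map naturally onto an enumeration. The following is a minimal sketch of the classification step only: it assumes the per-module values have already been collected by whatever SNMP poller you use, and the `polled` dictionary is hypothetical sample data rather than real switch output.

```python
from enum import IntEnum


class ModuleStatus(IntEnum):
    """moduleStatus values as listed in the text for CISCO-STACK-MIB."""
    OTHER = 1
    OK = 2
    MINOR_FAULT = 3
    MAJOR_FAULT = 4


def modules_needing_attention(status_by_module):
    """Return {module_number: ModuleStatus} for every module that is not ok."""
    return {
        mod: ModuleStatus(value)
        for mod, value in sorted(status_by_module.items())
        if ModuleStatus(value) is not ModuleStatus.OK
    }


# Hypothetical poll results: module number -> raw moduleStatus integer.
polled = {1: 2, 2: 2, 3: 4, 4: 3}
faulty = modules_needing_attention(polled)
for mod, status in faulty.items():
    print(f"module {mod}: {status.name}")
```

Any module returned here is a candidate for a closer look with moduleTestResult and the show commands.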
A related MIB object from CISCO-STACK-MIB is moduleTestResult, which provides the result of the module's self-test. A zero indicates that the module passed all tests. Bits set in the result indicate error conditions.
The show module and show test commands are related to the moduleStatus MIB object. For details on the output from the show module command, see Chapter 10.
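Because moduleTestResult is a bit field, a nonzero value can be broken down into the individual bits that are set. The sketch below does only that; the meaning of each bit position is not given here and must be taken from the MIB definition, so the function name and the sample value are illustrative assumptions.

```python
def failed_test_bits(module_test_result):
    """Return the positions (0 = least significant) of all bits set in the result."""
    if module_test_result < 0:
        raise ValueError("moduleTestResult is expected to be non-negative")
    return [
        bit
        for bit in range(module_test_result.bit_length())
        if module_test_result & (1 << bit)
    ]


# Zero means every self-test passed, so no bits are reported.
print(failed_test_bits(0))
# Hypothetical nonzero result with bits 2 and 5 set.
print(failed_test_bits(0b100100))
```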
The show test command shows you the status of the self-tests run against the individual modules. The status of the test results assists you in pinpointing the possible cause for minorFault or majorFault, as indicated by the values in the moduleStatus MIB.
Example 11-13 shows sample output for show test.
Switch> sh test 2
Module 2 : 48-port 4 Segment 10BaseT Ethernet Repeater
Port Status:
  Ports 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  -----------------------------------------------------------------------------
  . . . . . . . . . . . . . . . . . . . . . . . .
  25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
  ------------------------------------------------------------------------
  . . . . . . . . . . . . . . . . . . . . . . . .
LCP Diag Status for Module 2 (. = Pass, F = Fail, N = N/A)
  CPU : . Sprom : . Bootcsum : . Archsum : N
  RAM : . LTL : . CBL : N DPRAM : N SAMBA : N
  Saints : . Pkt Bufs : . Repeater : . FLASH : N
SAINT/SAGE Status :
  Saint 1 2 3 4
  -----------------
  . . . .
Packet Buffer Status :
  Saint 1 2 3 4
  -----------------
  . . . .
Loopback Status [Reported by Module 1] :
  Saint 1 2 3 4
  -----------------
  . . . .
The type of card you have installed in each slot determines what kind of output you see in the show test [mod_num] output. If the card is working properly, you should see all “.” next to the individual tests. If something failed on the card, you'll see an “F.”
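If you capture show test output in a script, the pass/fail check described above can be automated. The following is a rough sketch that assumes the diagnostic lines use the "name : result" layout of the sample output. The function name and the sample lines (with one failure injected) are hypothetical, and other card types may print different fields.

```python
import re

# Matches "<test name> : <result>" pairs, where result is ".", "F", or "N".
_DIAG_PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*([.FN])(?!\S)")


def failed_diags(diag_lines):
    """Return the names of tests whose result is 'F' in the given lines."""
    failed = []
    for line in diag_lines:
        for name, result in _DIAG_PAIR.findall(line):
            if result == "F":
                failed.append(name.strip())
    return failed


# Hypothetical diag lines in the show test layout, with one failure injected.
sample = [
    "CPU : . Sprom : . Bootcsum : . Archsum : N",
    "RAM : . LTL : . CBL : N DPRAM : N SAMBA : N",
    "Saints : . Pkt Bufs : F Repeater : . FLASH : N",
]
print(failed_diags(sample))
```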
The show log command shows you the error log of the system, such as reboot histories, module reset counts, exception errors with corresponding hex dumps, and self-test results for the supervisor modules.
This command is very useful for examining the overall health and stability of your switch. If an exception caused the Supervisor card to reset, the details are stored here.
The show log output on a switch is stored in NVRAM, so it is not cleared after a reset of the switch. You have to manually clear the log to take all values back to 0. It is good practice to clear the log every time you upgrade the software on the switch, due to possible exception counters stored under the network management processor. There is no need to store an exception count for a software release other than the current running release.
Output from show log also is good for comparing the last reset time and date of the supervisor with that of the other modules in the switch. Drawing that correlation can assist you in determining when module cards were OIRed (online insertion and removal) or reset by other methods without the entire switch resetting.
Example 11-14 shows sample output from show log.
Switch> show log
Network Management Processor (ACTIVE NMP) Log:
Reset count: 3 A
Re-boot History: Feb 18 1998 17:14:18 0 B, Feb 05 1998 15:16:28 0
Feb 05 1998 14:20:33 0
Bootrom Checksum Failures: 0 C UART Failures: 0
Flash Checksum Failures: 0 Flash Program Failures: 0
Power Supply 1 Failures: 0 Power Supply 2 Failures: 0
Swapped to CLKA: 0 Swapped to CLKB: 0
Swapped to Processor 1: 0 Swapped to Processor 2: 0
DRAM Failures: 0
Exceptions: 9
Last Exception occurred on Feb 18 1998 17:14:18 B
Software version = 2.4(2) D
NVRAM log:
Network Management Processor (STANDBY NMP) Log:
Reset count: 3 A
Re-boot History: Feb 18 1998 17:14:18 0 B, Feb 05 1998 15:16:28 0
Feb 05 1998 14:20:33 0
Bootrom Checksum Failures: 0 C UART Failures: 0
Flash Checksum Failures: 0 Flash Program Failures: 0
Power Supply 1 Failures: 0 Power Supply 2 Failures: 0
Swapped to CLKA: 0 Swapped to CLKB: 0
Swapped to Processor 1: 0 Swapped to Processor 2: 0
DRAM Failures: 0
Exceptions: 0
NVRAM log:
Module 3 Log:
Reset Count: 4 A
Reset History: Wed Feb 18 1998, 17:14:18 B
Sun Feb 15 1998, 04:34:12
Thu Feb 5 1998, 15:17:38
Thu Feb 5 1998, 14:21:43
The following items are highlighted in Example 11-14:
A “Reset count” is the number of times that particular card has reset. Notice the difference between the reset count on the two Network Management Processors (slots 1 and 2) and the slot 3 module. Slot 3 must have been reset manually or by the reset command one extra time.
B The “Re-boot History” line indicates the time and date of all the resets the card has experienced, up to 10. You can compare this to the line “Last Exception occurred…” below it.
C The failures for the Supervisor cards are highlighted here. These are cumulative counts of failures that occurred on the Network Management Processor or Supervisor card. Typically, you'll see power supply failure increase more than others because every time the switch resets, the power supply failure increments.
D The “Software version” line, as indicated here, is the software version the exception occurred in. If this is not the current software running, you should clear the log to get an accurate count of the appropriate errors that may occur with the current release of software.
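The correlation between supervisor and module reset times can be sketched in a few lines. The example below uses the timestamps from the sample show log output, normalized to a single format, and reports module resets that have no supervisor reboot close to them. The two-minute window is an assumption chosen because the sample modules come up roughly 70 seconds after the supervisor; tune it for your own hardware.

```python
from datetime import datetime, timedelta

FMT = "%b %d %Y %H:%M:%S"


def independent_resets(supervisor_times, module_times, window=timedelta(minutes=2)):
    """Return module resets with no supervisor reboot within `window` of them."""
    return [
        m for m in module_times
        if not any(abs(m - s) <= window for s in supervisor_times)
    ]


# Timestamps from the sample show log output, normalized to one format
# (weekday names and commas removed from the module entries).
supervisor = [datetime.strptime(t, FMT) for t in (
    "Feb 18 1998 17:14:18", "Feb 05 1998 15:16:28", "Feb 05 1998 14:20:33")]
module3 = [datetime.strptime(t, FMT) for t in (
    "Feb 18 1998 17:14:18", "Feb 15 1998 04:34:12",
    "Feb 05 1998 15:17:38", "Feb 05 1998 14:21:43")]

for t in independent_resets(supervisor, module3):
    print("module 3 reset without a supervisor reboot:", t.strftime(FMT))
```

With the sample data, only the Feb 15 reset of module 3 is reported, which matches the one extra reset noted for slot 3.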
The moduleUp and moduleDown traps (CISCO-STACK-MIB.traps) indicate that a module in the switch chassis has either just come online or just gone offline. Here, you can track when cards are inserted into the chassis by OIR, or track when cards are removed or having problems.
The coldStart and whyreload traps (CISCO-GENERAL-TRAPS) indicate that the switch was powered on or restarted. As on a router, these traps are sent when the switch comes up, including after an unexpected restart.
A number of syslog messages are useful for analyzing switch health, and apply directly to the MIB objects and CLI commands previously discussed. They are collected in Table 11-5.
Message | Explanation |
---|---|
SYS-5-SYS_RESET: System reset from [chars] | The switch has been reset, either by a failure or by manual intervention, such as from a change management window. |
SYS-3-MOD_MINORFAIL: Minor problem in module [dec]; SYS-3-MOD_FAILREASON: Module [dec] failed due to CBL0 error; SYS-3-MOD_FAIL: Module [dec] failed to come online | These three syslog messages indicate that some type of failure on a particular line card or Supervisor card has occurred. These can be correlated to the moduleDown trap received or to the moduleStatus MIB object. Based on this result, you should actively poll moduleStatus for the module number given by [dec] in the message. |
SYS-5-MOD_INSERT: Module [dec] has been inserted; SYS-5-MOD_REMOVE: Module [dec] has been removed; SYS-5-MOD_RESET: Module [dec] reset from [chars] | These three syslog messages explain when a module is inserted, removed, or reset, either by a failure as illustrated above or by manual intervention. |
SNMP-5-MODULETRAP: Module [dec] [[chars]] Trap; SNMP-5-COLDSTART: Cold Start Trap; SNMP-5-WARMSTART: Warm Start Trap | These three SNMP syslog messages indicate that an SNMP trap was sent out based on the message type. The moduleUp/Down trap, coldStart trap, and warmStart trap are indicated here. The warmStart trap is an indication that the switch has supervisor redundancy and the backup Supervisor card is now active. You can correlate these syslog messages to the trapd daemon running on your management station to see whether the appropriate trap was received. |
The key system resources needing evaluation on switches, such as resource errors and low clusters, cannot be gathered from SNMP MIB objects, so CLI commands are used instead.
Several show commands relate to the evaluation of system resources on a switch, including show inband, show biga, and show mbuf.
The show inband command applies to the Supervisor III engines and the show biga command applies to Supervisor I and II engines.
The show inband or show biga command shows statistics from the SAGE ASIC chip that front-ends the processor for data traffic. The chip resides on the processor card. The output you need to concern yourself with here is the field RsrcErrors. These commands can be executed only from the enable mode.
Resource errors are important to watch over time when you are experiencing performance problems. If this counter is increasing rapidly over a short amount of time, you are “starving” the resources on the switch processor, and it cannot process frames such as BPDUs, VTP, ISL, and CDP. Incrementing resource errors typically mean that the switch cannot allocate memory or buffers (mbufs) for frames received on the processor. When the switch cannot process these frames, especially BPDUs, the switched network can become unstable. For example, if the processor does not see BPDUs, ports in blocking mode can go to forwarding mode and cause a bridge loop, which can snowball and disable the network.
Example 11-15 shows sample output for show biga and show inband.
Switch (enable) sh biga
BIGA Registers:
cstat: 00 upad : FFFF pctrl : 0000 nist : 0000
sist : 0098 hica : 0000 hicb : 0000 hicc : 00
dctrl: F5FF dstat: 0000 dctrl2: 80 npim : 00F8
thead: 102FC804 ttail: 102FC804 ttmph : 102FC804 tptr : 10497E62
tdsc : 00000500 tlen : 0000 tqsel : 05
rhead: 102FA5D0 rtail: 102FA5B4 rtmph : 102FA5EC rptr : 104E5280
rdsc : 80000000 rplen: 102FA5E4 rtlen : 00000000 rlen : 1572
fltr : 00FF fc : 00 Rev : 04 CFG : 02020202
BIGA Driver:
Initializd: TRUE SpurusIntr: 00000000 NPIMShadow: 00F8
BIGA Receive:
RxDone : FALSE
First RBD : 102FA534 Last RBD : 102FC118
SoftRHead : 102FA5C0 SoftRTail : 102FA5A4
FramesRcvd: 00202501 BytesRcvd : 21197580
QueuedRBDs: 00000256 RsrcErrors: 00006520 A
BIGA Transmit:
First TBD : 102FC134 Last TBD : 102FC818
SoftTHead : 102FC134 SoftTTail : 102FC134
Free TBDs : 00000064 No TBDs : 00000000
AcknowErrs: 00000000 HardErrors: 00000000
QueuedPkts: 00000000 XmittedPkt: 01604290
XmittedByt: 136665648 Panic : 00000000
Frag<=4Byt: 00000000
Switch(enable) sh inband
Inband Driver:
DriverPtr: A067D300 Initializd: TRUE SpurusIntr: 00000000
RxDone: FALSE TxDMAWorking: FALSE RxRecovPtr: 00000000(-1)
FPGACntl: 004F Characteristics:0000 LastISRCause: 04
Transmit:
First TBD : A0681B84(0 ) Last TBD : A0682B64(0 )
TxHead : A0681D44(14 ) TxTail : A0681D44(14 )
AvailTBDs : 00000128 QueuedPkts: 00000000
XmittedPkt: 00247610 XmittedByt: 22625836
PanicEnd : 00000000 PanicNullP: 00000000
BufLenErrs: 00000000 Len0Errs : 00000000
Frag<=4Byt: 00000665 SpursTxInt: 00000000
No TBDs : 00000000 NullMbuf : 00000000
Receive:
First RBD : A067D384(0 ) Last RBD : A0681B60(511)
RxHead : A067E320(111) RxTail : A067E2FC(110)
AvailRBD : 00000512 RsrcErrors: 00000000 A
PanicNullP: 00000000 PanicFakeI: 00000000
FramesRcvd: 03173999 BytesRcvd : 246115897
RuntsRcvd : 00000000 HugeRcvd : 00000000
GT64010 IntMask: F00F0000 IntCause: 0330E083
GT64010 TX DMA (CH 1):
Count: 0000 Src : 013D5C62 Dst : 4ff10056 NRP : 000000
Cntl : 15C0
GT64010 RX DMA (CH 2):
Count: 0680 Src : 4FF20000 Dst : 01c84d80 NRP : 0067BC
Cntl : 55C0
PSI (PCI SAGE/PHOENIX Interface) FPGA:
Control : 004F TxCount : 0056
RxDMACmd: 35C0 RxBufSiz: 0680 MaxPkt : 0680
IntCause: 0000 IntMask : 0003
Monitoring RsrcErrors (A) is important over time, especially over a short time frame when switch performance problems are occurring. If this counter is incrementing slowly over a long period of time, it is not as critical.
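Because the rate of change matters more than the absolute value, a simple way to watch RsrcErrors is to sample the command output twice and compute the increase per minute. The sketch below assumes the counter is printed in decimal, as it appears in the sample output, and it does not handle counter wrap or a cleared counter. The second sample and the five-minute interval are hypothetical.

```python
import re

_RSRC = re.compile(r"RsrcErrors:\s*(\d+)")


def rsrc_errors(output):
    """Extract the RsrcErrors counter from show biga / show inband text."""
    match = _RSRC.search(output)
    if match is None:
        raise ValueError("RsrcErrors field not found")
    return int(match.group(1))


def errors_per_minute(earlier, later, seconds_apart):
    """Rate of increase between two samples of the counter."""
    if seconds_apart <= 0:
        raise ValueError("seconds_apart must be positive")
    return (later - earlier) * 60.0 / seconds_apart


first = "QueuedRBDs: 00000256 RsrcErrors: 00006520"
second = "QueuedRBDs: 00000256 RsrcErrors: 00006820"  # hypothetical later sample
rate = errors_per_minute(rsrc_errors(first), rsrc_errors(second), 300)
print(f"RsrcErrors increasing at {rate:.1f} per minute")
```

What counts as a rapid increase depends on the switch and its traffic, so compare the rate against your own baseline rather than a fixed number.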
The fixed buffers on switches are permanently set and come in two flavors: mbufs and clusters. Each mbuf is segmented into 128 bytes (116 data bytes), whereas clusters are packets greater than 1664 bytes (13 mbufs and 1508 data bytes). The only traffic that affects the mbuf and cluster counters is traffic destined to the supervisor engine, such as BPDUs, VTP, or CDP. The show mbuf all output displays the current number of free mbufs and free clusters, as well as the lowest number of each that has been free.
The critical values that need to be looked at with switch buffers are the “free” and “lowest free” mbufs and clusters because they can help identify possible memory leaks or lack of proper memory resources. Free mbufs, lowest free mbufs, clfree, and lowest clfree should be flagged if they go below 100, which is used as an initial baseline threshold.
Example 11-16 shows sample output from show mbuf.
Switch(enable) sh mbuf
MBSTATS:
  mbufs 10224 clusters 3932
  free mbufs 9946 A clfree 3675 B
  lowest free mbufs 9935 C lowest clfree 3665 D
MALLOC STATS :
  Block Size Free Blocks
  16 1
  48 2
  112 1
  144 1
  208 1
  240 1
  400 1
  > 496 4
Largest block available : 7510096
Total Memory available : 7546400 E
Total Memory used : 563952
The highlighted information from Example 11-16 is as follows:
A “free mbufs” is the number of current mbufs free for the processor. The amount of DRAM installed in the switch determines the size of the mbufs allocated at boot time, as indicated by the mbufs row.
B “clfree” is the number of current clusters free for the processor. The amount of DRAM installed in the switch determines the size of the clusters allocated at boot time, as indicated by the clusters row.
C “lowest free mbufs” is the field you need to trend and watch for memory resource usage.
D “lowest clfree” needs close attention as well because it also trends memory resource usage.
E “Total Memory available” is the amount of fixed DRAM memory allocated for mbufs.
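The baseline threshold of 100 for the free and lowest-free counters can be checked automatically against captured show mbuf output. The following is a minimal sketch, assuming the MBSTATS lines are laid out as in the sample and captured without the callout letters. The function name and the "starved" variant of the sample are hypothetical.

```python
import re

# Longer field names come first so "lowest free mbufs" is not read as "free mbufs".
_FIELD = re.compile(r"(lowest free mbufs|lowest clfree|free mbufs|clfree)\s+(\d+)")
THRESHOLD = 100  # initial baseline threshold suggested in the text


def low_buffers(output, threshold=THRESHOLD):
    """Return {field: value} for each watched field below the threshold."""
    values = {name: int(num) for name, num in _FIELD.findall(output)}
    return {name: val for name, val in values.items() if val < threshold}


sample = """MBSTATS:
mbufs 10224 clusters 3932
free mbufs 9946 clfree 3675
lowest free mbufs 9935 lowest clfree 3665
"""
print(low_buffers(sample))  # healthy sample: nothing below the threshold

# Hypothetical starved switch: lowest clfree has dipped to 42.
starved = sample.replace("3665", "42")
print(low_buffers(starved))
```

Trending the lowest-free values over days, rather than checking them once, is what reveals a slow memory leak.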