Troubleshooting
This chapter provides information to troubleshoot common problems that can occur in an IBM FlashSystem and VMware Elastic Sky X integrated (ESXi) environment. It also explains how to collect the necessary problem determination data.
This chapter includes the following sections:
8.1 Collecting data for support
This section discusses the data that needs to be collected before contacting support for assistance. When interacting with support, it is important to provide a clear problem description that empowers the support engineers to help resolve the issue. A good problem description includes:
What behavior was expected?
What behavior was observed instead?
Which resources are involved (volumes, hosts, and so forth)?
When did the problem occur?
8.1.1 Data collection guidelines for SAN Volume Controller and IBM FlashSystem
On SAN Volume Controller (SVC) and IBM FlashSystem, system logs can be collected in the product GUI by selecting Settings → Support Package (Figure 8-1).
Figure 8-1 Collecting a support package in the GUI
For more information about the level of logs to collect for various issues, see What Data Should You Collect for a Problem on IBM Spectrum Virtualize systems?
For the topics covered in the scope of this document, you typically need to gather a snap (option 4), which contains the standard logs plus new statesaves. Because this data often takes a long time to collect, it can be advantageous to manually create the statesaves first, and then collect the standard logs afterward. This task can be done by using the svc_livedump command-line interface (CLI) utility (Example 8-1 on page 179).
Example 8-1 Using svc_livedump to manually generate statesaves
IBM_FlashSystem:Cluster_9.42.162.160:superuser>svc_livedump -nodes all -y
Livedump - Fetching Node Configuration
Livedump - Checking for dependent VDisks
Livedump - Check Node status
Livedump - Preparing specified nodes - this may take some time...
Livedump - Prepare node 1
Livedump - Prepare node 2
Livedump - Trigger specified nodes
Livedump - Triggering livedump on node 1
Livedump - Triggering livedump on node 2
Livedump - Waiting for livedumps to complete dumping on nodes 1,2
Livedump - Waiting for livedumps to complete dumping on nodes 2
Livedump - Successfully captured livedumps on nodes 1,2
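Before collecting the snap, you can confirm that the livedumps were written by listing the contents of the dumps directory with the lsdumps CLI command (a sketch only; the exact output varies by system, and livedump files typically have names that begin with livedump):
lsdumps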
After you generate the necessary statesaves, collect standard logs and the latest statesaves (option 3), and use the GUI to create a support package including the manually generated livedumps. Alternatively, you can create the support package by using the CLI (Example 8-2).
Example 8-2 Using svc_snap to generate a support package in the CLI
IBM_FlashSystem:Cluster_9.42.162.160:superuser>svc_snap -gui3
Collecting data
Packaging files
Snap data collected in /dumps/snap.78E35HW-2.210329.170759.tgz
When the support package is generated by using the command line, you can download it by using the GUI or a Secure Copy Protocol (SCP) client.
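For example, assuming the cluster address from the command prompt above and the snap file name from Example 8-2, a download with a command-line SCP client might look like the following sketch (substitute your own user ID, address, and file name):
scp superuser@9.42.162.160:/dumps/snap.78E35HW-2.210329.170759.tgz .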
8.1.2 Data collection guidelines for VMware ESXi
For issues that involve the VMware ESXi hypervisor (including storage access errors), it is vital to ensure that the logs from the host side of the connection are collected in addition to those from the storage subsystem. For the VMware instructions about collecting ESXi log packages, see Collecting diagnostic information for VMware ESXi (653).
When downloading a package for an ESXi host, the default settings provide the information that is needed to analyze most problems.
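If the vSphere Client is not available, a log bundle can also be generated directly from the ESXi host shell with the vm-support utility. The following command is a minimal sketch; the working directory is illustrative and must have enough free space to hold the bundle:
vm-support -w /vmfs/volumes/datastore1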
8.1.3 Data collection guidelines for VMware Site Recovery Manager
Troubleshooting problems that involve VMware Site Recovery Manager usually requires analyzing data from the following sources:
1. The storage systems at all related sites, as shown in 8.1.1, “Data collection guidelines for SAN Volume Controller and IBM FlashSystem” on page 178.
2. The IBM Storage Replication Adapter (SRA) appliance logs in all related sites.
3. The VMware Site Recovery Manager logs.
IBM SRA log collection
Current versions of the IBM SRA are deployed inside the VMware Site Recovery Manager (SRM) server. By default, the SRA application writes its logs to /var/log/vmware/srm on the SRM server where the SRA is deployed.
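If a full SRM log bundle is not required, the SRA logs can also be captured directly from the SRM server shell. The following command is a sketch only; the archive path and name are illustrative:
tar -czf /tmp/ibm-sra-logs.tgz /var/log/vmware/srm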
VMware SRM log collection
For the VMware instructions for creating and downloading SRM logs, see Collecting diagnostic information for VMware Site Recovery Manager (1009253).
 
Important: When collecting data for problems that are related to SRM, make sure to collect data from all sites associated with the problem.
8.1.4 Data collection guidelines for IBM Spectrum Connect (VASA or vVols)
For troubleshooting issues that are associated with VASA or VMware vSphere Virtual Volumes (vVols), the following sets of data are required:
1. A support package from the storage system, as shown in 8.1.1, “Data collection guidelines for SAN Volume Controller and IBM FlashSystem” on page 178.
2. A support package from IBM Spectrum Connect.
3. A support package from the management application interfacing with IBM Spectrum Connect.
4. If the problem includes access to the data, ESXi logs as shown in 8.1.2, “Data collection guidelines for VMware ESXi” on page 179.
Collecting data for IBM Spectrum Connect
IBM Spectrum Connect logs can be collected in the following two ways:
1. Using the operating system shell
By default, IBM Spectrum Connect stores its logs in /var/log/sc. Copy the contents of this directory off the system for use, as shown in the command sketch after this list.
2. Using the IBM Spectrum Connect User Interface
In the IBM Spectrum Connect User Interface, select Settings → Support → Collect Log to gather and download the log files (Figure 8-2).
Figure 8-2 Collecting IBM Spectrum Connect logs
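As a sketch of the shell-based approach in step 1, the log directory can be archived and copied to a workstation; the archive name and destination host are illustrative:
tar -czf /tmp/sc-logs.tgz /var/log/sc
scp /tmp/sc-logs.tgz user@workstation:/tmp/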
Collecting data for VMware vCenter
vCenter logs can be collected by using the same process as for ESXi hosts, as described in 8.1.2, “Data collection guidelines for VMware ESXi” on page 179. The difference is that when you select the resources for which to collect logs, you select the vCenter Server instead of (or in addition to) an ESXi host.
Collecting data for VMware vRealize Orchestrator
For the VMware data collection instructions for the VMware vRealize Orchestrator (vRO), see Generating a log bundle from command line for a vRealize Orchestrator 7.x appliance (2150664).
Collecting data for VMware vRealize Operations Manager
For the VMware data collection instructions for the VMware vRealize Operations (vROps) Manager, see Collecting diagnostic information from vRealize Operations (2074601).
8.2 Common support cases
This section describes topics that are commonly raised with the support center. It is not meant to be a comprehensive guide to debugging interoperability issues between SVC, IBM FlashSystem, and VMware products.
8.2.1 Storage loss of access
When troubleshooting the loss of access to storage, it is important to properly classify how access was lost and which resources are involved.
The three general categories of storage loss of access events in VMware products are:
All paths down (APD)
Permanent device loss (PDL)
Virtual machine (VM) crash
All Paths Down
An APD event takes place when all the paths to a data store are marked offline. Example 8-3 shows the vmkernel log signature for an APD event.
Example 8-3 ESXi All Paths Down log signature
cpu1:2049)WARNING: NMP: nmp_IssueCommandToDevice:2954:I/O could not be issued to device "naa.600507681081025a1000000000000003" due to Not found
cpu1:2049)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.600507681081025a1000000000000003": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
cpu1:2049)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.600507681081025a1000000000000003" is blocked. Not starting I/O from device.
cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.600507681081025a1000000000000003" - issuing command 0x4124007ba7c0
cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.600507681081025a1000000000000003" - failed to issue command due to Not found (APD), try again...
cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.600507681081025a1000000000000003": awaiting fast path state update...
When all paths are lost and no path recovers before the Misc.APDTimeout value expires (140 seconds by default), the APD condition is latched. Example 8-4 shows the log signature in the vobd.log file for a latched APD state.
Example 8-4 ESXi All Paths Down timeout
[APDCorrelator] 2682686563317us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [11ace9d3-7bebe4e8] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
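The timeout can be verified or adjusted on the ESXi host by using esxcli. The following commands are a sketch that lists the current setting and sets it back to the default of 140 seconds:
esxcli system settings advanced list -o /Misc/APDTimeout
esxcli system settings advanced set -o /Misc/APDTimeout -i 140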
These issues are typically the result of errors in path recovery. Corrective actions include:
Validate that the best practice multipathing configuration is in use, as shown in 2.3, “Multi-path considerations” on page 18, and as illustrated in the command sketch after this list.
Validate that all server driver and firmware levels are at the latest supported levels.
Validate that the network infrastructure that connects the host and storage is operating correctly.
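As a sketch of the first check, the path selection policy and the state of each path for a device can be reviewed from the ESXi shell; the device identifier here is taken from the log signature in Example 8-3:
esxcli storage nmp device list -d naa.600507681081025a1000000000000003
esxcli storage core path list -d naa.600507681081025a1000000000000003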
Permanent device loss
A PDL event is the response to an unrecoverable I/O error that is returned by a storage controller. Example 8-5 shows the vmkernel log signature for a PDL event.
Example 8-5 ESXi permanent device loss log signature
cpu17:10107)WARNING: Vol3: 1717: Failed to refresh FS 4beb089b-68037158-2ecc-00215eda1af6 descriptor: Device is permanently unavailable
cpu17:10107)ScsiDeviceIO: 2316: Cmd(0x412442939bc0) 0x28, CmdSN 0x367bb6 from world 10107 to dev "naa.600507681081025a1000000000000003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x2/0x3e/0x1
cpu17:10107)Vol3: 1767: Error refreshing PB resMeta: Device is permanently unavailable
These types of events are often the result of a hardware failure or a low-level protocol error in the server host bus adapter (HBA), storage area network (SAN), or storage array. If hardware errors are found that match the time at which the PDL occurred, the hardware failure is likely the cause.
Virtual machine crash
If a VM fails in the absence of an APD or PDL event, the scenario should be treated as an operating system or application failure inside the VM. If analysis of the guest VM points to a storage I/O timeout, the cause might be latency in processing the VM’s I/O requests. In such situations, it is important to review the following sets of data:
The vmkernel log of the ESXi host that houses the VM that failed. Specifically, look for events that are related to the physical device backing the data store that houses the VM.
The storage array’s performance data. Specifically, check for peak read-and-write latency during the time when the VM failed.
Operating system and application logs for the VM that failed. Specifically, identify key timeout values and the time of the crash.
8.2.2 VMware migration task failures
The two types of migration tasks are as follows:
vMotion is a migration task that moves the running state of a VM (for example, memory and compute resources) between ESXi hosts.
Storage vMotion is a migration task that moves the storage resources of a VM (for example, VMDK files) between data stores.
vMotion tasks
vMotion tasks are largely dependent on the Ethernet infrastructure between the ESXi hosts. The only real storage interaction is at the end of the task, when file locks must move from one host to another. In this phase, it is possible for Small Computer System Interface (SCSI) reservation conflicts or file lock contention to cause the migration task to fail. The following articles describe the most frequent issues:
Storage vMotion tasks
Storage vMotion tasks are primarily dependent on storage throughput. When moving between data stores in the same storage controller, the task is typically offloaded to the storage array by using extended copy (XCOPY) (VAAI Hardware Accelerated Move). If the migration task is between storage systems, the copy is performed by using standard read and write commands.
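Whether hardware-accelerated moves are supported for a given device can be confirmed from the ESXi shell; the following sketch uses the device identifier from Example 8-3 as an illustration:
esxcli storage core device vaai status get -d naa.600507681081025a1000000000000003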
The default timeout for the task to complete is 100 seconds. If the migration takes longer than 100 seconds to complete, then the task fails with a timeout, as shown in Example 8-6.
Example 8-6 VMware Log Storage vMotion timeout
vmkernel: 114:03:25:51.489 cpu0:4100)WARNING: FSR: 690: 1313159068180024 S: Maximum switchover time (100 seconds) reached. Failing migration; VM should resume on source.
vmkernel: 114:03:25:51.489 cpu2:10561)WARNING: FSR: 3281: 1313159068180024 D: The migration exceeded the maximum switchover time of 100 seconds. ESX has preemptively failed the migration to allow the VM to continue running on the source host.
vmkernel: 114:03:25:51.489 cpu2:10561)WARNING: Migrate: 296: 1313159068180024 D: Failed: Maximum switchover time for migration exceeded(0xbad0109) @0x41800f61cee2
The timeout is generic by nature, and determining the root cause typically requires performance analysis of the storage arrays that are involved and a detailed review of the ESXi logs for the host that performs the task. In some circumstances, it might be appropriate to increase the default timeout, as described in Using Storage vMotion to migrate a virtual machine with many disks times out (1010045).
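As described in that article, the switchover timeout is a per-VM setting. A sketch of the change is to add the following line to the virtual machine's advanced configuration (the .vmx file) while the VM is powered off, where 300 seconds is an example value:
fsr.maxSwitchoverSeconds = "300"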