Chapter 7. The Troubleshooting Scenario

“There are no big problems; there are just a lot of little problems.”

—Henry Ford

The VCDX Troubleshooting Scenario demonstrates a candidate’s ability to distinguish a design issue from an implementation or operational issue. The scenario is 15 minutes long for VCDX-DV and 30 minutes long for VCDX-Cloud and VCDX-DT. You are presented with one to two slides that provide information on the scenario as a starting point. Demonstrate your ability to work through a problem in a methodical and logical manner. The problem could be design, build, or operations related.

Conducting the Troubleshooting Scenario

Troubleshooting scenarios include many options. Think of it as a mystery role-play. This is an opportunity for the panelists to see your analytical skills and decision-making process.

Thinking Aloud

In the troubleshooting scenarios, think of the architecture and implementation as you work on the root cause. Provide guidance on your approach to resolution. Consider reviewing design decisions, implementation decisions, and the methods for gaining evidence (logs, approach to resolution, and use of commands). The panelists will judge you on your approach, not just the solution. Clearly stating what you think and why will improve your chances of a better score.

Asking Questions

Think methodically, and ask questions relevant to the problematic area. For example, if the presented symptom suggests network-related issues, ask questions about the logical and physical network design. However, this might not be the culprit. Be sure to use the panelists’ responses to eliminate one area at a time.

Utilize all responses; do not ignore any of them.

Do not jump to conclusions or quickly answer, even if you are certain of the solution. This limits the opportunity to show that you understand the process of methodically troubleshooting a problem in scenarios where you might not know the answer. Even if you get to the answer, the panelists might continue to ask for additional validation.

Using a Whiteboard or Paper

Use the whiteboard or paper easel to record the answers to your questions, as well as any information presented to you on the troubleshooting slides. Draw any logical or physical diagrams relevant to the design. Illustrate what you are thinking and explain verbally the “what” and “why.”

Troubleshooting Analysis

As a starting point, not all information is provided. In addition, as in the real world, the scenario includes potentially conflicting information.

Example: Conflicting Information

• Trouble Report Item 1:

The storage administrator states that nothing was changed around the time the problem started.

• Trouble Report Item 2:

All ESXi hosts experienced loss of access to the same LUNs on a storage array.

• Analysis:

• You must determine how to address the conflict, provide justification, and demonstrate analysis skills to resolve conflicts.

• Ask the storage administrator whether any SAN events occurred around the time the problem began and ask for an explanation of what would have caused the outage. If the SAN admin still insists that nothing happened, ask for the ESXi logs (such as vmkernel/vmkernel.log) and look for SAN-related messages indicating loss of connectivity or failover events from all affected hosts.

Requirements Analysis

What are you trying to solve? What other information can you get? What constraints, assumptions, and risks are you working with to help with troubleshooting?

The Panelist Perspective

The panelists are looking for skills in evaluating a problem, determining whether it is design or implementation related, and deciding on the path needed for resolution.

A methodical approach is best. Based on the scenario presented, state the target areas you will pursue initially. The panelists will compare your approach to the paths that you could have taken to resolve the problem.

The panelists will focus on identifying your approach and how you adjust when a path taken does not work out.

Example Scenarios

For the following examples, we provide guidance on one approach for each scenario. The steps provided are a subset of what would occur in a full scenario in a panel defense session. They do not include all the steps required for achieving passing scores, so use them as a learning aid.

These examples provide two slides for each scenario. Some sample interactions between the candidate and the panelists are shown. Notes on the approach successful VCDX candidates take are provided as well.

Example Troubleshooting Scenario #1

You will be presented with one or more slides that cover the scenario information. Some information might be presented in response to your questions to the panelists. Figure 7.1 represents the first slide.

Figure 7.1. Troubleshooting Scenario Role-Play: Scenario 1, Slide 1.


Figure 7.2 shows a second slide with additional information.

Figure 7.2. Troubleshooting Scenario Role-Play: Scenario 1, Slide 2.

Analysis of Existing Information

To analyze existing information, identify validated facts, possible solution areas, and questionable details.

Validated Facts

• Applications not accessible are in VMs on the affected host.

• The host is not accessible via SSH/PuTTY.

Solution Areas

The diagram shows both network and SAN connectivity. Rule out one of them by asking further questions.

Questionable Details

The slide does not specify whether the problem is network or storage related. Rule out one area at a time by asking relevant questions.

Asking Questions

1. Has a support ticket been opened?

2. Can the host be accessed via the DCUI (Direct Console User Interface), either physically at the host or via a remote management board (such as iLO or DRAC)? This should rule out management network connectivity. If the host is not responsive even via the DCUI, it is not a management network problem.

3. Can you ping the VM? This should rule out VM network. However, it might not be conclusive—a hung VM might not respond even when there is no network problem.

4. Are other hosts experiencing this problem? If so, what is in common with these hosts?

5. If you rule out the network, get more details on the storage. Look more closely at the connectivity diagram for any inconsistencies with best practices.

6. What type of storage array is this (active/active, active/passive, or ALUA)?

7. Ask the SAN admins whether any SAN events have occurred, and request that they look through the storage array logs.

Whiteboarding

Write key points on the whiteboard. Use the answers to your questions to write down what they mean and how they affect your decision path and conclusions.

Write down the areas you want to cover and verbally explain why.

Draw out alternative connections. If you see any incorrect connectivity, draw what you propose to correct it.

Methodical Approach

As outlined earlier, identify the possible problem areas. In this example, it could be VMs, hosts, or infrastructure. Eliminate each area using the answers to your questions.

You are not expected to be an expert on each component outside of vSphere—ask questions to become familiar with these external factors.

It is always advisable to request logs, as long as you know which events you need to look for. Don’t just say, “Are there any errors in the logs?” The best approach is to identify the logs and type of errors. For example, you might ask, “Are there any path failover events in the vmkernel/vmkernel.log files?” or “Are there any SCSI sense codes?” If you find any, you could ask the panel to look up what they mean.

After you zero in on the possible problem area, ask more detailed questions.

In this example, the array is active/passive, and you identify that the cabling does not meet best practices because each fabric is connected to one Storage Processor (SP) instead of both. This is something to rule out, because it could cause a path thrashing state. Look for events that indicate one or more LUNs flapping between SPs. Are the symptoms consistent with such a state?

It is possible that Host A lost SAN connectivity on HBA1, and then later Host B lost connectivity to the SAN on HBA0. This leaves Host A with access to SP A only and Host B with access to SP B only. As a result, each of them requests the array to transfer ownership of the LUNs to the SP to which they have access. This results in the same condition of the LUNs bouncing between SPs.
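To see why this state thrashes, the failure sequence above can be modeled in a few lines. The following Python sketch is a toy model only: the `ActivePassiveLun` class, the `simulate` function, and the host and SP names are hypothetical and do not correspond to any VMware or array vendor API. It assumes that a host sends I/O to the owning SP when it can reach it and otherwise forces ownership to the first SP it can still reach.

```python
from dataclasses import dataclass


@dataclass
class ActivePassiveLun:
    """Toy model of one LUN on an active/passive array (hypothetical)."""
    owner: str = "SPA"   # storage processor that currently owns the LUN
    trespasses: int = 0  # number of times ownership has moved between SPs

    def io_via(self, sp: str) -> None:
        """Serve an I/O arriving at `sp`, moving ownership there first if needed."""
        if sp != self.owner:
            self.owner = sp
            self.trespasses += 1


def simulate(reachable_sps: dict[str, list[str]], rounds: int) -> int:
    """Alternate I/O between hosts and return the number of ownership moves.

    reachable_sps maps a host name to the SPs it can still reach. A host
    uses the current owner when it can reach it; otherwise it forces
    ownership to its first reachable SP.
    """
    lun = ActivePassiveLun()
    for _ in range(rounds):
        for sps in reachable_sps.values():
            target = lun.owner if lun.owner in sps else sps[0]
            lun.io_via(target)
    return lun.trespasses


# Both hosts can reach both SPs: ownership stays put.
healthy = simulate({"HostA": ["SPA", "SPB"], "HostB": ["SPA", "SPB"]}, rounds=5)

# Host A can reach only SP A and Host B only SP B: ownership bounces.
degraded = simulate({"HostA": ["SPA"], "HostB": ["SPB"]}, rounds=5)

print(healthy, degraded)
```

With full connectivity the ownership count stays at zero. Once each host is limited to a different SP, almost every I/O forces a transfer, which is the pattern to ask the panelists about when you request array or host log events.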

Figure 7.3 shows how connectivity failures can lead to path thrashing.

Figure 7.3. Path thrashing from connectivity failure.


If you confirm that this is the root cause, request that the cabling connectivity be changed, and clearly state the necessary changes. For example, change the Fibre Channel (FC) Switch 1 connection from SP A, Port 2 to SP B, Port 2, and change the FC Switch 2 connection from SP B, Port 2 to SP A, Port 2.

Another possibility is that the vSphere admin decided to load-balance the I/O over the SPs by changing the default Path Selection Plug-in (PSP) from VMW_PSP_MRU to VMW_PSP_FIXED, then set the preferred path for some LUNs via HBA 0 and for others via HBA 1 on Host A, but accidentally set the reverse on Host B. This would result in the same condition, even without a connectivity failure.
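One way to reason about this on the whiteboard is to tabulate, per host, which SP each LUN's preferred path leads to, and then look for LUNs where the hosts disagree. The Python sketch below illustrates that comparison. The `conflicting_luns` function and the sample table are hypothetical; in a real session the values would come from the panelists' answers, not from any API shown here.

```python
def conflicting_luns(preferred: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return the LUNs whose preferred SP differs between hosts.

    preferred maps host -> {lun -> SP reached by that host's preferred path}.
    """
    sps_by_lun: dict[str, set[str]] = {}
    for lun_map in preferred.values():
        for lun, sp in lun_map.items():
            sps_by_lun.setdefault(lun, set()).add(sp)
    # A LUN is suspect when more than one SP is preferred across the hosts.
    return {lun: sps for lun, sps in sps_by_lun.items() if len(sps) > 1}


# Hypothetical answers gathered from the panelists.
preferred_paths = {
    "HostA": {"LUN1": "SPA", "LUN2": "SPB"},
    "HostB": {"LUN1": "SPB", "LUN2": "SPA"},  # reversed by mistake
}

conflicts = conflicting_luns(preferred_paths)
for lun in sorted(conflicts):
    print(lun, sorted(conflicts[lun]))
```

Any LUN reported here is a candidate for ownership bouncing between SPs on an active/passive array, because each host keeps pulling the LUN toward a different SP.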

Figure 7.4 shows how the wrong PSP can lead to path thrashing.

Figure 7.4. Path thrashing from an incorrect PSP choice.


If this is the root cause, change the default PSP back to VMW_PSP_MRU and connect each FC switch to a port on each SP. Figure 7.5 shows the corrected design.

Figure 7.5. Corrected SAN design to resolve the path thrashing.


Does this alleviate the problem?


You are not expected to reach a resolution within the allotted 15 minutes. The main goal is to showcase your methodical, systematic approach in identifying which design error could have resulted in the problem presented.


Example Troubleshooting Scenario #2

You are presented with one or more slides that cover the scenario information. Some information might be presented in response to your questions to the panelists. Figure 7.6 represents the first slide.

Figure 7.6. Troubleshooting Scenario Role-Play: Scenario 2, Slide 1.


Figure 7.7 shows a second slide with additional information.

Figure 7.7. Troubleshooting Scenario Role-Play: Scenario 2, Slide 2.

Analysis of Existing Information

To analyze existing information, identify validated facts, possible solution areas, and questionable details.

Validated Facts

From the presented information, the following are the identified and validated facts:

1. The last change made before the problem was reported was that the administrator added a backup network.

2. Access to vCenter Server is intermittent over the network.

3. Attempts to access vCenter Server via management tools and other clients failed.

4. The backup network was added to back up the local SQL server.

Solution Areas

The following are possible solution areas:

1. Network configuration problems

2. Storage configuration problems

Questionable Details

• It is not clear what “local SQL server” means.

• It is not clear which management tools or clients are used.

Asking Questions

First, verify the nature of the problem:

1. Which “management tools” or clients are used to access the VMware vCenter Server?

2. Which version of vCenter Server and ESXi is used?

3. Is vCenter Server protected with VMware HA or other availability configuration?

4. What do you mean by “SQL server installed locally”?

Now rule out resource contention problems:

1. What is the CPU and memory utilization on the host on which vCenter is running?

2. Is VMware DRS enabled? If so, has vCenter Server been migrated with vMotion to other hosts?

3. If not, can you manually vMotion vCenter to another host?

Then rule out storage issues:

1. What storage array is used?

2. What is the default Path Selection Plug-in (PSP)?

Rule out network issues:

1. Can you ping vCenter Server’s IP address/FQDN from a physical machine?

2. Can you ping vCenter Server’s IP address/FQDN from a VM on the same host?

3. Can you ping vCenter Server’s IP address/FQDN from a VM on another host?

4. Can you access vCenter Server via remote console? If so, can you ping out to the network?

5. What exact changes were done to add the “backup network”?

6. From the IP list, I see that vDS is used. What type of physical switch is used?

Whiteboarding

Draw the virtual network configuration, including the vDS port groups and related configurations that you obtain from the panelists.

Methodical Approach

The first set of questions clarifies the nature of the problem. Is the VM not responsive, or is it running but not reachable over the network?

The second set of questions rules out performance-related issues. If the problem continues after vMotioning the VM to another host with enough resources, it is not resource related.

Asking for storage information could rule out storage-related issues such as I/O contention or path thrashing. In this case, storage might not be the issue.

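Now move on to network troubleshooting. Suppose the answers show that pinging by IP address always works but pinging by FQDN works only intermittently. This could indicate a DNS round-robin issue, in which the name resolves to more than one IP address and only some of those addresses are reachable from the client.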

If so, ask about the DNS configuration. If the IP address is registered via dynamic DNS, check the latest IP address added via the backup network configuration and how it was configured.
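The DNS check described above can be sketched in Python. This is a minimal illustration, not part of the scenario: `resolve_ipv4` wraps the standard library call `socket.getaddrinfo`, while `unexpected_addresses`, the host name `vcenter.example.com`, and all IP addresses are hypothetical. The demonstration uses a stand-in resolver so that it runs without network access.

```python
import socket
from typing import Callable, Iterable


def resolve_ipv4(fqdn: str) -> list[str]:
    """Return the distinct IPv4 addresses the local resolver reports for fqdn."""
    infos = socket.getaddrinfo(fqdn, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


def unexpected_addresses(
    fqdn: str,
    expected: Iterable[str],
    resolver: Callable[[str], list[str]] = resolve_ipv4,
) -> list[str]:
    """List addresses returned for fqdn that are not in the expected set."""
    allowed = set(expected)
    return [ip for ip in resolver(fqdn) if ip not in allowed]


def fake_resolver(_fqdn: str) -> list[str]:
    """Stand-in for DNS: the name returns a management and a backup address."""
    return ["10.0.0.10", "192.168.50.10"]


extra = unexpected_addresses("vcenter.example.com", ["10.0.0.10"],
                             resolver=fake_resolver)
print(extra)
```

If a name returns an address on the newly added backup network in addition to the management address, clients that receive the backup address may fail to connect, which would match an intermittent symptom.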

Example Troubleshooting Scenario #3

You are presented with one or more slides that cover the scenario information. Some information could be presented in response to your questions to the panelists. Figure 7.8 represents the first slide.

Figure 7.8. Troubleshooting Scenario Role-Play: Scenario 3, Slide 1.


Figure 7.9 shows a second slide with additional information.

Figure 7.9. Troubleshooting Scenario Role-Play: Scenario 3, Slide 2.

Analysis of Existing Information

To analyze existing information, identify validated facts, possible solution areas, and questionable details.

Validated Facts

Based solely on the content of the slides, not much information is provided. All you know is that it is a View 5.0 environment and that the main symptom of the problem is “severe performance problem.” Because this is not a design you provided, you should be allowed to request the original design. The second slide presents a high-level version.

Solution Areas

In a View 5.0 environment, performance issues can arise in one or more of these areas:

1. View client (a.k.a. thin client) connectivity to the View Manager.

• Networking

• Remote desktop protocols

2. VMware vSphere environment on which the virtual desktops are running.

• Computing

• Storage

• Linked clones

Questionable Details

The problem description, “severe performance problem,” is vague. Ask for the metrics that were used to quantify the performance, and determine whether the complaint is based on measurements or on users’ perception of their daily tasks compared with their experience on physical desktops. Keep in mind that this is the first time this organization is using a VMware View environment.

Asking Questions

First identify the exact nature of the problem:

1. Did you submit a support ticket with VMware? If so, ask for the information provided and any suggestions made by the VMware Support Engineer.

2. Which tests did you run to measure the performance? Which data did you use as a baseline?

3. Does this happen all the time? If not, is there a pattern?

Now start asking questions about the infrastructure.

• Client-side questions:

• What type of thin client are you using? Is it on the VMware HCL?

• Which protocols does the thin client use?

• What is the type and speed of network connecting to the View Manager?

• vSphere infrastructure questions:

Ask questions about the cluster architecture:

• What is the make and model of each ESXi server in the cluster?

• What is the type, speed, and count of CPUs/host, as well as the RAM size?

• Which version of ESXi and vCenter Server?

Network questions:

• What is the number and type of NICs per host, including the configured speed?

• What types of virtual switches are being used?

• What is the NIC teaming failover policy?

Storage questions:

• What is the storage array make and model?

• What is the storage connectivity and protocol (FC, FCoE, iSCSI, or NFS)?

• Which VMFS version is in use? If it is VMFS5, was it freshly created or upgraded from an earlier version?

• How are the LUNs provisioned (thin or thick)?

View Composer questions:

• How many parent images are used for the same desktop type and operating system?

• What are the details of linked clones and their placement on the datastores?

• Is Storage DRS in use?

Whiteboarding

As you get answers to questions for a solution area, draw or build tables with the answers. Again, be sure to note all responses you get from the panelists.

Methodical Approach

Cover each of the solution areas, and focus on the most likely root cause.

Walk your way through the design, starting with the client access. To rule that out, ask to use one of the desktops directly via vSphere Client’s Remote Console. If there is no improvement, move on to the infrastructure. Rule out the compute resources by using a smaller number of desktops, reducing quantity until you are down to a single desktop.

If the problem still exists with a single desktop, you need to focus on the linked clone details. If all desktops share the same parent image, the problem should not exist if you use a single virtual machine because there would be no contention for the storage or network resources.

Even though the problem exists with a single running desktop, it could still be a result of storage resource contention. One possibility is misalignment of the guest OS file system with the underlying physical storage. Check the VMware Knowledge Base for known issues related to file system alignment.

You are allowed to ask the panel to look up any information on the web, such as VMware Knowledge Base (KB) articles.

On ESXi 5, linked clone files grow in increments of a fixed granularity (grain) size. This can result in file system misalignment even if the parent image’s virtual disks are correctly aligned.

Consider upgrading to vSphere 5.1 and the View release that takes advantage of space-efficient sparse disks (SE sparse disks), which grow in a configurable grain size. Check with the storage vendor about the recommended grain size for better alignment.
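The cost of misalignment is simple arithmetic: an I/O that does not start on an array block boundary can span one more block than an aligned I/O of the same size. The Python sketch below illustrates this. The 4 KB array block size and the 512-byte shift are assumptions chosen for illustration, not values taken from the scenario or from a specific array, and the function names are hypothetical.

```python
def is_aligned(offset_bytes: int, block_bytes: int) -> bool:
    """True when the offset falls exactly on an array block boundary."""
    return offset_bytes % block_bytes == 0


def blocks_touched(offset_bytes: int, length_bytes: int, block_bytes: int) -> int:
    """Number of array blocks that a single I/O spans."""
    first = offset_bytes // block_bytes
    last = (offset_bytes + length_bytes - 1) // block_bytes
    return last - first + 1


ARRAY_BLOCK = 4096   # assumed array block size, in bytes
GUEST_IO = 4096      # assumed guest I/O size, in bytes

# A guest I/O that starts on a block boundary.
aligned_io = blocks_touched(8192, GUEST_IO, ARRAY_BLOCK)

# The same I/O shifted by 512 bytes, as could happen when a clone's
# growth increment is not a multiple of the array block size.
misaligned_io = blocks_touched(8192 + 512, GUEST_IO, ARRAY_BLOCK)

print(aligned_io, misaligned_io)
```

Under these assumptions the shifted I/O touches two array blocks instead of one, so the array does extra work for the same guest request. A grain size that is a multiple of the array block size avoids the shift, which is why the vendor's recommended grain size matters.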

Example Troubleshooting Scenario #4

You are presented with one or more slides that cover the scenario information. Some information might be presented in response to your questions to the panelists. Figure 7.10 represents the first slide.

Figure 7.10. Troubleshooting Scenario Role-Play: Scenario 4, Slide 1.


Figure 7.11 shows a second slide with additional information.

Figure 7.11. Troubleshooting Scenario Role-Play: Scenario 4, Slide 2.

Analysis of Existing Information

To analyze existing information, identify validated facts, possible solution areas, and questionable details.

Validated Facts

Based solely on the content of the slides, all you know is that it is a vCloud Director 5.1 environment and that the main symptom of the problem is “Some VMs lost access to public network.” Because this is not a design you provided, you should be allowed to ask to see the original design. The second slide presents a high-level version.

Solution Areas

In a vCloud 5.1 environment, VM network connectivity problems can arise in one or more of these areas:

1. Network connectivity

a. vSwitch uplinks

b. Physical switch connectivity

c. NIC teaming policy

d. vSwitch port groups

2. vShield

a. vShield Manager

b. vShield Edge Appliances

c. VXLAN configuration

Questionable Details

The problem description is a somewhat general “VMs lost connectivity.” Get some clarification on which VMs have experienced this problem and whether they are on the same host.

Asking Questions

First, identify the exact nature of the problem:

1. Did you submit a support ticket with VMware? If so, ask for the information provided and any suggestions made by the VMware Support Engineer.

2. Which VMs experienced loss of public network connectivity?

3. Are these VMs on the same host?

Now start asking questions about the infrastructure.

• Network design and connectivity:

• What types of vSwitches are in use (standard, distributed)?

• How are these switches connected to the physical switches?

• What is the NIC teaming configuration?

• If the affected VMs are on the same host, are any links down? Was there a NIC failover?

• Can you access the VMs via Virtual Machine Remote Console (VMRC)?

• Can you access the VMs via Remote Desktop Protocol (RDP), if configured?

• vShield questions:

• Are you able to access vShield Manager?

• If so, are the vShield Edge virtual appliances running?

• If you find one that is not running, can this be the root cause?

Whiteboarding

As you receive answers to your questions, draw or build tables with the answers. Again, be sure to note all responses you get from the panelists.

Methodical Approach

First, troubleshoot this as a standard network problem. If you rule out physical and virtual switch problems, consider the vCloud overall design. vShield Edge appliances control access to external networks. VMs protected by vShield appliances lose connectivity when the corresponding appliance is down.

If this is the case, explain that vSphere HA would normally be expected to detect that the appliance was down and restart it. However, because the vShield appliances do not have VMware Tools available, guest OS monitoring is not possible.

Review

The VCDX Troubleshooting Scenario is a role-play that lasts 15 minutes in the VCDX-DV program and 30 minutes in the VCDX-Cloud and VCDX-DT programs. You will see one to two slides that present the initial information. You act as the architect, and the panelists act as a customer. Additional slides might be available if “trigger” questions are raised, such as “May I see the Blue Screen of Death (BSoD) display?”

This chapter covered design troubleshooting scenarios. The intention is for you to troubleshoot the design rather than a specific technical problem. Most, if not all, designs presented will appear to test your technical troubleshooting skills. However, don’t be intimidated—just analyze the design elements that could be the possible source of the problem.

Make sure you are methodical in your approach, and write down all responses you get from the panel. Think on your feet and use the whiteboard. Rest assured that even if you do not solve the problem, the assessment looks at the journey rather than the destination. Above all, think aloud. If the panelists cannot hear your thoughts, it is exceedingly difficult for them to analyze your thought process.
