The Art of Troubleshooting

  • Given a network problem scenario, select an appropriate course of action based on a general troubleshooting strategy. This strategy includes the following steps.

    • Establish the symptoms

    • Identify the affected areas

    • Establish what has changed

    • Select the most probable cause

    • Implement a solution

    • Test the results

    • Recognize the potential effects of the solution

    • Document the solution

There is little question that at some point in your networking career, you will be called on to troubleshoot network-related problems. Correctly and swiftly identifying these problems is not done by accident; rather, effective troubleshooting requires attention to some specific steps and procedures. Although some organizations have documented troubleshooting procedures for their IT staff members, many do not. Whether you find yourself using these exact steps in your job is debatable, but the general principles will remain the same. The CompTIA objectives list the troubleshooting steps as follows:

1. Establish what the symptoms are.

2. Identify the affected areas.

3. Establish what has changed.

4. Select the most probable cause.

5. Implement a solution.

6. Test the results.

7. Recognize the potential effects of the solution.

8. Document the solution.

The following sections examine each of these areas, as well as an additional step that you might want to include in the troubleshooting process.

Establishing What the Symptoms Are

Troubleshooting a problem can be difficult, but trying to do it with limited information is often a fool's quest. Lacking information can cause you to troubleshoot the wrong problem. You might find yourself replacing a toner cartridge when someone actually just used the wrong password.

Therefore, the first step in the troubleshooting process is to establish exactly what the symptoms of the problem are. This stage of the troubleshooting process is about information gathering—a process that requires experience with the operating system being used, communication skills, and, perhaps most importantly, patience. It is very important to get as much information as possible about the problem before you charge out the door with that toner cartridge under your arm. You can glean information from three key sources: the computer (in the form of logs and error messages), the computer user, and your own observation. These sources are examined in the following sections.

Information from the Computer

If you know where to look and what to look for, a computer can help reveal where a problem lies. Many operating systems provide error messages when a problem is encountered. A Linux system, for instance, might present a Segmentation Fault error message, which indicates a memory-related error. Windows, on the other hand, might display an Illegal Operation error message to indicate a possible memory or application failure. Both of these system error messages can be cross-referenced with the operating systems' Web site information to identify the root of the problem. The information provided in these error messages can at times be cryptic, so finding the solution might be tricky.

In addition to the system-generated error messages, network operating systems can be configured to generate log files after a hardware or software failure. An administrator can then view these log files to see when the failure occurred and what the system was doing at the time. Windows NT/2000 displays error messages in the Event Viewer, Linux stores many of its system log files in the /var/log directory, and NetWare creates a file called abend.log, which contains detailed information about the state of the system at the time of the crash. When you start the troubleshooting process, make sure you are familiar enough with the operating system in use to determine whether it is trying to give you a message.
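As a rough illustration of reading system-generated logs, the following Python sketch scans lines of text for words that often mark a failure. The keyword list and the function name find_suspect_lines are assumptions made for this example; real log formats vary by operating system and service, so treat this as a starting point rather than a complete parser.

```python
from typing import Iterable, List, Tuple

# Hypothetical keywords; real logs use many different markers.
DEFAULT_KEYWORDS = ("error", "fail", "panic", "segfault")


def find_suspect_lines(
    lines: Iterable[str],
    keywords: Tuple[str, ...] = DEFAULT_KEYWORDS,
) -> List[Tuple[int, str]]:
    """Return (line_number, text) pairs for lines that mention any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()  # match regardless of capitalization
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Sample lines invented for illustration; they are not real log output.
    sample = [
        "service-a: started\n",
        "service-a: ERROR cannot open device\n",
        "service-b: link up\n",
    ]
    for number, text in find_suspect_lines(sample):
        print(number, text)
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, for example a file under /var/log on a Linux system (which file to read depends on the distribution and the service involved).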

EXAM TIP

Error Message Storage For the Network+ exam, you do not need to know where error messages are stored on the respective operating systems; you only need to know that the troubleshooting process requires you to read system-generated log errors.


Information from the User

Getting accurate information from a computer user or anyone with limited technical knowledge can be a tricky task. Having a limited understanding of computers and technical terminology can make it difficult for a non-technical person to relay the true symptoms of a problem. However, users can convey what they are trying to do and what is not working. When you interview an end user, you will likely want the following information:

  • Error frequency— If it is a repeating problem, ask for the frequency of the problem. Does the problem occur at regular intervals or sporadically? Does it happen daily, weekly, or monthly?

  • Applications in use— You will definitely want to know what applications were in use at the time of the failure. Only the end user will know this information.

  • Past problems— Ask whether this error has been a problem in the past. If it has and it was addressed, you might already have your fix.

  • User modifications— A new screensaver, a game, or other such programs have ways of ending up on users' systems. Although many of these applications can be installed successfully, sometimes they create problems. When you are trying to isolate the problem, ask the user whether any new software additions have been made to the system.

  • Error messages— Network administrators cannot be at all the computers on a network all the time. Therefore, they are likely to miss an error message when it is displayed onscreen. The end user might be able to tell you what error message appeared.

NOTE

Installation Policies Many organizations have strict policies about what can and cannot be installed on computer systems. These policies are not in place to exercise the administrator's control but rather to prevent as many crashes and failures as possible.


NOTE

Gathering Information Your communication skills will be most needed when you are gathering information from end users.


Observation Techniques

Finding a problem often involves nothing more than using your eyes, ears, and nose to locate the problem. For instance, if you are troubleshooting a workstation system and you see a smoke cloud wafting from the back of the system, looking for error messages might not be necessary. If you walk into a server room and hear the CPU fan screaming, you are unlikely to need to review the server logs to find the problem.

EXAM TIP

Observation Techniques For the Network+ exam, remember that observation techniques play a large role in the preemptive troubleshooting process, which can result in finding a small problem before it becomes a large one.


Observation techniques often come into play when you're troubleshooting connectivity errors. For instance, looking for an unplugged cable and confirming that the light-emitting diode (LED) on the network interface card (NIC) is lit requires observation on your part. Keeping an eye as well as a nose out for potential problems is part of the network administrator's role and can help in identifying a situation before it becomes a problem.

NOTE

Troubleshooting Scenario A user calls you to complain that he is unable to access the network. You confirm that he is using the correct username and password and that his account is active. He was able to access the network the previous day.

Troubleshooting Solution This situation might suggest a physical connectivity problem. Confirm that the link LED is lit on the back of the NIC and that the cable is physically attached.


Effective Questioning Techniques

Regardless of the method you are using to gather information about a problem, there are some important questions you will need to have answered. When approaching a problem, consider the following questions:

  • Is only one computer affected, or has the entire network gone down?

  • Is the problem happening all the time, or is it intermittent?

  • Does the problem happen at specific times of day, or can it happen at any time?

  • Has this problem occurred in the past?

  • Has any network equipment been moved recently?

  • Have any new applications been installed on the network?

  • Has anyone else tried to correct the problem; if so, what has that person tried?

  • Is there any documentation that relates to the problem or to the applications or devices associated with the problem?

By answering these questions, as well as others, you will gain a better idea of exactly what the problem is.

Identifying the Affected Area

Some computer problems are isolated to a single user in a single location; others affect several thousand users spanning multiple locations. Establishing the affected area is an important part of the troubleshooting process, and it will often dictate the strategies you use in resolving the problem.

EXAM TIP

Be Thorough On the Network+ exam, you might be provided with either a description of a scenario or a description augmented by a network diagram. In either case, you should read the description of the problem carefully, step by step. In most cases, the correct answer is fairly logical and the wrong answers can be identified easily.


Problems that affect many users are often connectivity issues. Such problems can often be isolated to wiring closets, network devices, and server rooms. The troubleshooting process for problems that are isolated to a single user will often begin and end at that user's workstation. The trail might indeed lead you to the wiring closet or server, but that is not likely where the troubleshooting process would begin. Understanding who is affected by a problem can provide you with the first clues about where the problem exists.

NOTE

Troubleshooting Scenario You are a network administrator managing a network that has four separate network segments: sales, administration, payroll, and advertising. Late on Tuesday evening, you get a call from several members of the sales staff, complaining that they are unable to access a network printer.

Troubleshooting Solution Because the reported problem has a common thread, the sales department, it is likely that there is a connectivity issue with the network segment the sales group is on. The problem could be a downed router, switch, hub, or authentication server. Whatever the cause, you can more easily isolate the problem if you know the location. Consider how this troubleshooting scenario would be handled differently if the error reports came simultaneously from the sales, payroll, and advertising groups.
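The idea of looking for a common thread among problem reports can be sketched in a few lines of Python. The report format (a user name paired with a segment name) and the function names are assumptions for this example; in practice the data might come from a help desk system.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Report = Tuple[str, str]  # (user, segment); a hypothetical report format


def reports_per_segment(reports: Iterable[Report]) -> List[Tuple[str, int]]:
    """Count problem reports per segment, most-reported segment first."""
    counts = Counter(segment for _user, segment in reports)
    return counts.most_common()


def describe_scope(reports: Iterable[Report]) -> str:
    """Say whether the reports point at one segment or at several."""
    segments = sorted({segment for _user, segment in reports})
    if not segments:
        return "no reports"
    if len(segments) == 1:
        return "isolated to segment: " + segments[0]
    return "spans multiple segments: " + ", ".join(segments)
```

If every report comes from the sales segment, describe_scope points at that one segment, which suggests starting with the devices that serve it. Reports from several segments suggest looking at something they share, such as a server or a backbone device.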


Establishing What Has Changed

Whether the problem is with a workstation's access to a database or with an entire network, keep in mind that everything was working at some point. Although many users claim that the “computer just stopped working,” that is unlikely. Far more likely is that there have been changes to the system or the network that caused the problem. As much as users try to convince you otherwise, computer systems do not reconfigure themselves. Therefore, establishing what was done to a system will lead you in the right direction to isolate and troubleshoot a problem.

Changes can occur on the network, on a server, or on a workstation. Each of these is discussed in the following sections.

Changes to the Network

Most of today's networks are dynamic and continually growing to accommodate new users and new applications. Unfortunately, these network changes, although intended to increase network functionality, may inadvertently cause additional problems. For instance, a new computer system added to a network might be installed with a duplicate computer name or IP address, which would prevent another computer that has the same name or address from accessing the network. Other changes that can create problems on the network include adding or removing a hub or switch, changing the network's routing information, or adding or removing a server. In fact, almost every change that the network administrator makes to the network can potentially have an undesirable impact elsewhere on the network. For this reason, all changes made to the network should be fully documented and fully thought out.

EXAM TIP

The Obvious Solutions In the Network+ exam, avoid discounting a possible answer because it seems too easy. Many of the troubleshooting questions are based on possible real-world scenarios, many of which do have very easy or obvious solutions.


Changes to the Server

Part of a network administrator's job involves some tinkering with the server. Although this might be unavoidable, it can sometimes lead to several unintentional problems. Even the most mundane of all server tasks can have a negative impact on the network. The following are some common server-related tasks that can cause problems:

  • Changes to user accounts— For the most part, changes to accounts do not cause any problems, but sometimes they do. If after making changes to user accounts a user or several users are unable to log on to the network or access a database, the problem is likely related to the changes made to the accounts.

  • Changes to permissions— Data is protected by permissions that dictate who can and cannot access the data on the drives. Permissions are an important part of system security, but changes to permissions can inadvertently prevent users from being able to access specific files.

  • Patches and updates— Part of the work involved in administering networks is to monitor new patches and updates for the network operating system and install them as needed. It is not uncommon for an upgrade or a fix to an operating system to cause problems on the network.

  • New applications— From time to time, new applications and programs—such as productivity software, firewall software, or even virus software—have to be installed on the server. When any kind of new software is added to the server, it might cause problems on the network. Knowing what has recently been installed can help you isolate a problem.

  • Hardware changes— Either because of failure or expansion, hardware on the server might have to be changed. Changes to the hardware configuration on the server can cause connectivity problems.

NOTE

Faulty Hardware Although recent changes to systems or networks account for many network problems, some problems do happen out of the blue. Faulty hardware is a good example.


Changes to the Workstation

The changes made to the systems on the network are not always under the control of the network administrator. Often, configuration changes and some software installations are performed by the end user. Such changes can be particularly frustrating to troubleshoot, and many users are unaware that the changes they make can cause problems. When looking for changes to a workstation system, consider the following:

  • Network settings— One of the configuration hotspots for workstation computer systems is the network settings. If a workstation is unable to access the network, it is a good idea to confirm that the network settings have not been changed.

  • Printer settings— Many printing problems can be isolated to changes in the printer configuration. Some client systems, such as Linux, are more adept at controlling administrative configuration screens than others; for example, Windows leaves such screens open to anyone who wants to change the configuration. When printing problems are isolated to a single system, changes in the configuration could be the cause.

  • New software— Many users love to download and install nifty screensavers or perhaps the latest 3D adventure games on their work computers. Although it may be more interesting being Zaxon the Level 43 Wizard than John the data entry clerk, the addition of extra software can cause the system to fail. Confirm with the end user that new software has not been added to the system recently.

NOTE

Troubleshooting Scenario A system that could previously log on to the network now receives an error message, saying that it cannot log on due to a duplicate IP address.

Troubleshooting Solution A duplicate IP address means that there are two systems on the network that are attempting to connect to the network using the same IP address. As you know, there can be only one. This often happens when a new system has been added to a network where Dynamic Host Configuration Protocol (DHCP) is not being used.
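Where addresses are assigned by hand, a simple inventory check can reveal duplicates before users report them. The following Python sketch assumes a list of (host name, IP address) pairs taken from your own records; the host names, addresses, and function name are hypothetical, and addresses are compared as plain strings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ips(
    assignments: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each IP address used by more than one host to those hosts."""
    hosts_by_ip = defaultdict(list)
    for host, ip in assignments:
        hosts_by_ip[ip].append(host)
    # Keep only the addresses that were assigned more than once.
    return {ip: hosts for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Example inventory invented for illustration.
    inventory = [
        ("ws1", "192.168.1.10"),
        ("ws2", "192.168.1.11"),
        ("ws3", "192.168.1.10"),
    ]
    print(find_duplicate_ips(inventory))
```

This only checks the records it is given. A system configured with an address that was never recorded will not show up, which is one reason documentation of network changes matters.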


Selecting the Most Probable Cause of the Problem

There can be many different causes for a single problem on a network, but with appropriate information gathering, it is possible to eliminate many of them. When looking for a probable cause, it is often best to look at the easiest solution first and then work from there. Even in the most complex of network designs, the easiest solution is often the right one. For instance, if a single user cannot log on to a network, it is best to confirm network settings before replacing the NIC. Remember, though, that at this point you are only trying to determine the most probable cause, and your first guess might in fact be incorrect. It might take a few tries to determine the correct cause of the problem.

NOTE

Troubleshooting Scenario A user calls you to inform you that she is unable to access email. After asking a few questions, you determine that the user has only recently started with the company and has been unable to get email since her start date.

Troubleshooting Solution In this scenario, there can be several causes of the problem: perhaps network connectivity, perhaps a bad NIC, or perhaps (most likely) email has never been configured on her workstation. Check to see if email has been configured. If it has not, configure it. If it has and it is working correctly, consider the next most likely cause of the problem.


IN THE FIELD

DEVELOPING A PLAN FOR THE SOLUTION

Although developing a plan for solving a network problem is not specifically listed in the CompTIA objectives, after identifying a cause, but before implementing a solution, you should develop a plan for the solution. This is particularly a concern for server systems in which taking the server or network offline is a difficult and undesirable prospect. After identifying the cause of a problem on the server, it is absolutely necessary to plan for the solution. For instance, the plan must include when the server or network should be taken offline and for how long, what support services are in place, and who will be involved in correcting the problem.

Planning is a very important part of the whole troubleshooting process and can involve formal or informal written procedures. Those who do not have experience troubleshooting servers may be wondering about all the formality, but this formality ensures the least amount of network or server downtime and the maximum data availability.

As far as workstation troubleshooting is concerned, very rarely is a formal planning procedure required, and this makes the process easier. Planning for workstation troubleshooting typically involves arranging a convenient time with end users in order to implement a solution.


NOTE

Escalation Procedures One of the important but often neglected parts of the planning process is the development of escalation procedures. Although many technicians have difficulty admitting that they might need help with a problem, sometimes they need to do it. Unless there are formal escalation procedures defined by an organization, the rule of thumb is simply to determine the closest available suitable source of help and start from there.


Implementing a Solution

At this point, you should be ready to implement a solution—that is, apply the patch, replace the hardware, plug in a cable, or implement some other solution. In an ideal world, your first solution would fix the problem, although unfortunately this is not always the case. If your first solution does not fix the problem, you will need to retrace your steps and start again.

EXAM TIP

Rollback Plans A common and mandatory step that you must take when working on servers and some mission-critical workstations is to develop a rollback plan. The purpose of a rollback plan is to provide a method to get back to where you were before attempting the fix. Troubleshooting should not make the problem worse!


It is important that you attempt only one solution at a time. Trying several solutions at once can make it very unclear which one actually corrected the problem.
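The combination of a rollback plan and one solution at a time can be expressed as a short loop. In this Python sketch, each candidate fix is a hypothetical triple of a name, a function that applies the fix, and a function that undoes it; problem_is_fixed stands in for whatever test you use to confirm the result. None of these names come from a real tool.

```python
from typing import Callable, List, Optional, Tuple

# A candidate fix: (name, apply, roll_back). All three are hypothetical.
Fix = Tuple[str, Callable[[], object], Callable[[], object]]


def try_fixes_one_at_a_time(
    fixes: List[Fix],
    problem_is_fixed: Callable[[], bool],
) -> Optional[str]:
    """Apply fixes in order, testing after each one.

    A fix that does not help is rolled back before the next attempt,
    so only one change is ever in effect. Returns the name of the fix
    that resolved the problem, or None if none of them did.
    """
    for name, apply, roll_back in fixes:
        apply()
        if problem_is_fixed():
            return name
        roll_back()  # return to the starting state before trying again
    return None
```

Because each unsuccessful change is undone before the next one is tried, the fix that is finally reported is the only change left in place, which makes it clear what corrected the problem.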

Testing the Results

After the corrective change has been made to the server, network, or workstation, it is necessary to test the results. This is where you find out if you were right and the remedy you applied actually worked. Don't forget that first impressions can be deceiving, and a fix that seems to work on first inspection may not actually have corrected the problem.

NOTE

Avoiding False Starts When you have completed a fix, you should test it as thoroughly as you can before informing users of the fix. Users would generally rather wait for a real fix than have two or three false starts.


The testing process is not always as easy as it sounds. If you are testing a connectivity problem, it is not difficult to ascertain whether your solution was successful. However, changes made to an application or to databases are typically much more difficult to test. It might be necessary to have people who are familiar with the database or application run the tests with you in attendance.

NOTE

Virus Activity Keep in mind when troubleshooting a network or systems on a network that the problem might be virus related. Viruses can cause a variety of problems that can often disguise themselves as other problems. Part of your troubleshooting toolkit should include a bootable virus disk with the latest virus definitions. Indicators that you might have a virus include increased error messages and missing and corrupt files.


In an ideal world, you would want to be able to fully test a solution to see if it indeed corrected the problem. However, you might not know if you were successful until all users have logged back on, the application has been used, or the database has been queried. As a network administrator, you will be expected to take the testing process as far as you realistically can, even though you might not be able to simulate certain system conditions or loads. The true test will be in a real-world application.

Recognizing the Potential Effects of the Solution

Sometimes, you will apply a fix that corrects one problem but creates another problem. Many such circumstances are hard to predict—but not always. For instance, you might add a new network application, but the application requires more bandwidth than your current network infrastructure can support. The result would be that overall network performance would be compromised.

Everything done to a network can have a ripple effect and negatively affect another area of the network. Actions such as adding clients, replacing hubs, and adding applications can all have unforeseen results. It is very difficult to always know how the changes you make to a network are going to affect the network's functioning. The safest thing to do is assume that the changes you make are going to affect the network in some way and realize that you just have to figure out how. This is where you might need to think outside the box and try to predict possible outcomes.

Documenting the Solution

Although it is often neglected in the troubleshooting process, documentation is as important as any of the other troubleshooting procedures. Documenting a solution involves keeping a record of all the steps taken during the fix—not necessarily just the solution.

EXAM TIP

Log Books Many organizations require that a log book be kept in the server room. This log book should maintain a record of everything that has been done on the network. In addition, many organizations require that administrators keep a log book of all repairs and upgrades made to networks and workstations.


For the documentation to be of use to other network administrators in the future, it must include several key pieces of information. When documenting a procedure, you should include the following information:

  • Date— When was the solution implemented? It is important to know the date because if problems occur after your changes, knowing the date of your fix makes it easier to determine whether your changes caused the problems.

  • Why— Although it is obvious when a problem is being fixed why it is being done, a few weeks later, it might become less clear why that solution was needed. Documenting why the fix was made is important because if the same problem appears on another system, you can use this information to reduce time finding the solution.

  • What— The successful fix should be detailed, along with information about any changes to the configuration of the system or network that were made to achieve the fix. Additional information should include version numbers for software patches or firmware, as appropriate.

  • Results— Many administrators choose to include information on both successes and failures. The documentation of failures may prevent you from going down the same road twice, and the documentation of successful solutions can reduce the time it takes to get a system or network up and running.

  • Who— It might be that there is information left out of the documentation or someone simply wants to ask a few questions about a solution. In both cases, if the name of the person who made a fix is in the documentation, he or she can easily be tracked down. Of course, this is more of a concern in environments where there are a number of IT staff or if system repairs are performed by contractors instead of actual company employees.
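The five items above map naturally onto a simple record. The following Python sketch shows one possible shape for a log-book entry; the class name, field names, and summary format are assumptions for this example, not a standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class FixRecord:
    """One log-book entry covering date, why, what, results, and who."""
    when: date      # Date: when the solution was implemented
    why: str        # Why: the problem that prompted the fix
    what: str       # What: the fix and any configuration changes
    results: str    # Results: what worked and what did not
    who: str        # Who: the person who made the fix
    versions: List[str] = field(default_factory=list)  # patch or firmware versions

    def summary(self) -> str:
        """Return a one-line summary suitable for a quick scan of the log."""
        return f"{self.when.isoformat()} | {self.who} | {self.why} -> {self.what}"
```

Keeping the fields separate, rather than writing one free-form note, makes it easier to search the log later by date or by the person who made the change.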

NOTE

Troubleshooting Scenario You have been away on a sunny vacation for three weeks, and when you return, there are several error messages on your company's server.

Troubleshooting Solution Part of the role of a network administrator is to review the network documentation. To troubleshoot this scenario, you should look for any documented changes that were made to the system in your absence. Specifically, you should look for network configuration changes and added software applications or operating system patches. It is likely that one of these modifications will be at the root of the problem.
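If the documented changes carry dates, narrowing them to the period of your absence is straightforward. This Python sketch assumes each log entry is a (date, description) pair; the entry format and function name are hypothetical.

```python
from datetime import date
from typing import Iterable, List, Tuple

Entry = Tuple[date, str]  # (date of change, description); hypothetical format


def changes_between(entries: Iterable[Entry], start: date, end: date) -> List[Entry]:
    """Return the entries dated from start to end inclusive, oldest first."""
    selected = [entry for entry in entries if start <= entry[0] <= end]
    return sorted(selected, key=lambda entry: entry[0])
```

Reviewing the result from oldest to newest lets you line up each change against the time the error messages first appeared.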

