Crash and hang issues

Crashes and hangs can happen on any device that runs software, and the NetScaler is no exception. While most of them are caught during testing, the difficulty of covering every use case and packet combination means that some make their way to customers. The good news is that most are fixed in the next revision of the software, which is one of the biggest reasons to stay current on NetScaler builds.

Let's start by differentiating crashes from hangs. While their impact on your application's availability can be the same, the underlying issues are very different, and so is the way you have to approach them as an administrator.

Understanding crashes

A NetScaler crash can happen for several reasons:

  • The NetScaler encounters a coding error that leads to an invalid condition, such as an invalid pointer reference, at which point it gives up on processing and dumps a core.
  • One of the packet engines becomes too slow to respond and fails to send out its heartbeats to a system process that monitors all packet engines. This can happen because the packet engine is doing something very CPU-intensive, such as processing a huge regex policy.
  • While rare, a crash of the FreeBSD software itself will eventually crash the system and result in a core file under /var/crash.

Working with crashes

Most administrators will notice a crash in the form of an unexpected reboot or a failover. One way to verify whether the issue was due to a crash is to look for newly created files in /var/core or /var/crash.
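The check for newly created files can be scripted. The following is a minimal sketch, not a Citrix-provided tool: recent_cores is a hypothetical helper name, and the two-day window in the example is an arbitrary assumption that you should adjust to the time of the incident.

```shell
# Hypothetical helper: list files under the given directories that were
# modified within the last N days. Directories that do not exist are skipped.
recent_cores() {
    days="$1"
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # -mtime -N matches files modified less than N days ago
        find "$dir" -type f -mtime "-$days" -print
    done
}

# Example from the appliance shell, using the paths named in the text:
#   recent_cores 2 /var/core /var/crash
```

No output means that no files in those directories changed within the window, which suggests the reboot or failover had another cause.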

You will need to engage Citrix Technical Support to help identify the root cause for the crash and get advice on corrective steps, which will often involve upgrading to a build that contains the fix. To facilitate the investigation, capture the following information to share with the engineer:

  • The core file under /var/core or /var/crash that matches the time of the issue
  • The show techsupport file
  • Details of any recent configuration changes made before the crash, such as introducing new services or enabling new features

While waiting for the engagement to complete, consider reverting to an earlier build if the crash appeared immediately after an upgrade. If you are running a very out-of-date build, check the release notes of later builds for fixes to similar issues and consider upgrading to the latest build, or alternatively to one of the Citrix-certified Safe Harbor builds (see the upcoming section about the various build types).

Working with hang issues

A hang is a situation where the NetScaler is stuck, either in a deadlock because two functions are waiting on each other, or because a process is running in a never-ending loop. One sign of a hang is when the device appears to power up but doesn't respond to any input. There are also cases where the device continues to handle traffic while being unreachable via GUI/SSH/Console.

In the case of a hang you will not see any core dumps. A reboot will almost certainly restore access to the unit, but it should not be the first line of troubleshooting as this will result in important diagnostic information being lost. You should instead attempt to dump a core.

Dumping a core on a VPX/MPX when console is available

You can dump a core by aborting one of the packet engines from the console. Here is a quick set of steps, taken from the knowledge base article CTX207598, on how to do this:

  1. Go to shell.
  2. Run the command pb_policy -o abort. This tells the NetScaler to dump cores if packet engines are interrupted.
  3. Do a ps -aux and note down the PID of all the packet engines.
  4. Use the kill -6 command and list the PIDs of all packet engines in the command. For example, kill -6 325 326 327 328.
  5. This will dump a core and restart the packet engines.
  6. Once the core dumps are complete, reset the pb_policy back to its default by running the shell command pb_policy -d. This is important, as the abort mode of running the system is performance-intensive.
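Steps 3 and 4 above can be sketched as a small dry-run helper that reads ps -aux output and prints the kill -6 command instead of running it. This is an illustration only: the function names pe_pids and abort_command are hypothetical, and the NSPPE process-name pattern in the example is an assumption that does not come from the article, so confirm how the packet engines appear in your own ps output before relying on it.

```shell
# Hypothetical helper: print the PIDs of processes whose command name
# matches a pattern. Reads `ps -aux` style output on stdin, where the PID
# is column 2 and the command name is assumed to start at column 11.
pe_pids() {
    awk -v pat="$1" '$11 ~ pat { print $2 }'
}

# Build, but do not run, the kill command for the matching PIDs.
abort_command() {
    pids=$(pe_pids "$1" | tr '\n' ' ')
    pids=${pids% }              # drop the trailing space left by tr
    if [ -n "$pids" ]; then
        echo "kill -6 $pids"
    fi
}

# Example (NSPPE is an assumed name pattern; verify it first):
#   ps -aux | abort_command NSPPE
```

Printing the command rather than executing it leaves a chance to review the PID list, since sending the signal to the wrong process on a production unit would be disruptive.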

Dumping a core when NetScaler is completely unresponsive

On MPX units, if the NetScaler is unresponsive via console, you can dump a core using the NMI button. This is a recessed button at the back of the NetScaler. Once the cores are available, you will need to engage Technical Support and provide the core and a collector file so that the root cause analysis can be carried out.

Understanding NetScaler Build names

GA (General Availability) builds are available for all Citrix customers to use. This is a good thing. It means that there is a very large install base for such builds and any issues present will have a greater chance of being reported and fixed in the next iteration. GA builds are of two types:

  • Maintenance builds (.M builds) are what nearly all customers run. The difference between one maintenance release (MR) and the next is mainly bug fixes and security fixes. Unless there is a clear reason to do otherwise, this is the build you are most encouraged to use.
  • Enhancement builds (.e builds) are a superset of GA builds and contain features that are not yet available in the GA version. Usually, the features that are included in the current .e builds make it into the maintenance builds of the next release of code. For example, features introduced in 10.5.e became available in the regular 11.0 version.

11.0 releases introduced naming changes in the form of .M builds (Maintenance: containing bug and security fixes) and .F builds (which introduce new features). Then, there are the special builds:

  • A Private build (the opposite of a GA build) is limited to a small set of customers. These are provided under very specific conditions, such as a customer needing the fix even before the complete range of tests is finished on the build, primarily because the bug has a very high impact. Consequently, the tradeoff needs to be very carefully considered.
  • A Debug build is one that Citrix Engineering produces in a targeted manner to capture a problem which is not reproducible in the Citrix Lab environment and for which the conditions of the failure are not well understood. The build will not fix the issue, but it does contain additional instrumentation to help diagnose the issue when it happens next.

A Safe Harbor build is a GA build that has been available publicly for at least six months and on which customers have reported very few issues. Citrix clearly calls out these builds on the download page so they are easy to notice. At the time of writing, the latest Safe Harbor build available is 10.5 56.22. There are no 11.0 Safe Harbor builds yet, but this is subject to change.

An NDPP build is a build for certain sectors, such as government agencies. Such organizations are governed by regulations that require the security status of networking devices (such as NetScaler) to be independently verified as conforming to a specific standard, namely the Network Device Protection Profile (NDPP). This particular build has passed that standard.
