This chapter focuses on the features that form the basis of most NetScaler deployments. They are available under the Traffic Management node in the UI. These features are as follows:
Let's explore and discuss troubleshooting these features one at a time.
The NetScaler started off as a high-performance load balancer, and load balancing is still its most prominent use case. In this chapter, we will look at a range of issues and questions that you will come across when setting up or managing a load-balanced environment with the NetScaler.
First, let's look at some considerations around the general settings of load balancing.
Let's consider a scenario where you've created a new load balancing (LB) vServer, or bound a service to an existing vServer that already has a number of services. You will notice that even though you have the LB method set to the default (which also happens to be the recommended) method of least connections, the NetScaler starts to send requests to the backend in a round robin fashion. This is deliberate behavior, designed to ensure that the new service you've just added doesn't get inundated with requests; after all, being a new server, it will have the least connections. This behavior is controlled by adjusting the Startup RR Factor tunable.
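This tunable is set globally from the CLI; a minimal sketch follows, where the value 100 is purely illustrative (the number of initial requests to distribute round robin before switching to the configured LB method), not a recommendation:

```
# Illustrative value; tune based on how quickly new servers can absorb load
set lb parameter -startupRRFactor 100
```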
By default, when you create a load balancing VIP (we will shorten this to LB VIP for conciseness), the NetScaler uses one of its own IPs, usually a SNIP, as the source IP when sending packets to the servers. This is controlled by the USNIP global mode setting. USIP (use the client IP as the source IP), on the other hand, is needed only for specific scenarios.
Some scenarios where USIP really is required are as follows:
Occasionally, USIP gets deployed purely with the goal of getting visibility into the Client IP. There are definitely better ways to achieve this requirement:
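One such approach is client IP header insertion, where the NetScaler adds the client's IP to an HTTP header for the server to log, avoiding USIP entirely. A hedged sketch follows; svc_web is a hypothetical service name, and the header name is just a common convention:

```
# Insert the client IP into a header instead of changing the source IP
set service svc_web -cip ENABLED "X-Forwarded-For"
```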
Why should you avoid USIP?
To get the most performance out of the NetScaler, you should choose one of the native VIP types, such as HTTP, SSL, DNS, or MYSQL, wherever possible. Apart from performance, this also gives you granular control, such as your choice of rewrite policies. For applications that don't have a native protocol, or that use a mix of sub-protocols, a layer 4 protocol such as TCP, UDP, or SSL_TCP would be the right choice. Some applications might need you to forward traffic with only minimal handling; this is when the SSL_Bridge or ANY types of VIPs come into use, where the NetScaler is essentially just flinging the packets it receives on the VIP to the services as fast as it can.
When load balancing firewalls and CloudBridge devices, there are a couple of options that are not very evident. Let's take a look at what these are, because they are the only way to achieve the necessary scenarios.
If you are setting up firewall load balancing, this will require you to have a vServer of type ANY with IP and Port set to * (wildcard) and MBF enabled, so that you are not introducing asymmetry in routing. This all works great except when you also have L3 mode set to ON (the default) and have more specific static routes, or when one of these destination IPs is available on the same NetScaler as a VIP:
The Prefer Direct Route option will route traffic to this destination VIP directly, without passing it through the firewall first. If you are using FW LB and using routing, or have a corresponding VIP, disable this option; if you are not, leave it at its default.
When you want traffic to pass through different sets of firewalls, the limitation you will run into is the NetScaler's default behavior of intercepting packets for a * VIP only once for VIPs that have the forwarding mode set to MAC instead of IP. This behavior exists to avoid issues resulting from packets running in a loop between two wildcard VIPs. To enable interception more than once, enable the -vServerSpecificMac option:
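A hedged sketch of enabling this option globally (parameter casing may vary by firmware version; check your release's command reference):

```
# Allow wildcard MAC-mode VIPs to intercept a packet more than once
set lb parameter -vServerSpecificMac ENABLED
```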
The httpOnlyCookieFlag setting, when enabled, inserts a flag called httponly when forwarding the response to the client, for example:

Set-Cookie: NSC_iuuq_wjq=ffffffffc3a01f2445a4a423660; expires=Sun, 03-May-2015 15:14:35 GMT;path=/;httponly
The significance is that the cookie is not available to applications outside of the browser. This is a recommended approach from a security perspective, as it means that even if the user accesses a server affected by Cross-Site Scripting (XSS), the cookie can't be stolen. However, you need to watch out for applications that require out-of-browser handling, classic examples being Java applets or client-side scripts that need access to this cookie. The problem you may run into is that requests generated outside of the browser will arrive at the NetScaler without a cookie and potentially end up on a different backend during load balancing, thus breaking the application.
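This flag is a global LB parameter; a minimal sketch of toggling it:

```
# Mark NetScaler-inserted persistence cookies as httponly
set lb parameter -httpOnlyCookieFlag ENABLED
```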
You should also note that this flag exists as a tunable parameter on the AAA vServer as well, which is covered in a later chapter of this book.
A related flag is the Secure flag, which tells the browser to return the cookie only in secure (HTTPS) exchanges. CTX138055 shows a way of setting this using rewrite. It follows, by definition, that you should only set this for SSL-based vServers; otherwise you will break the application, since the cookies will never be returned.
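As a rough illustration of the rewrite approach, here is a hedged sketch; this is not the exact policy from CTX138055, and ssl_vip, the action, and policy names are hypothetical. Refer to the article for the tested expressions:

```
# Append "; Secure" after the Set-Cookie header value (illustrative only)
add rewrite action rw_act_cookie_secure insert_after http.RES.HEADER("Set-Cookie") "\"; Secure\""
add rewrite policy rw_pol_cookie_secure "http.RES.HEADER(\"Set-Cookie\").EXISTS" rw_act_cookie_secure
bind lb vserver ssl_vip -policyName rw_pol_cookie_secure -priority 100 -type RESPONSE
```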
Note that the articles I mention throughout the book can be found on the Citrix support site. The easiest way to get to them is https://www.citrix.com/support.
This choice usually comes down to the size of the server pool you manage. A couple of servers are easy to manage using the services approach, but as you start adding several of them, you should consider using ServiceGroups. ServiceGroups present the following benefits:

Parameters are set at the ServiceGroup level, so adding new servers or removing them is faster, since you only need to provide server and port details without repeating the parameters each time.

Now that you've learned how to choose the key options for our LB deployment, let's take a look at troubleshooting some of the common issues.
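As a quick sketch of the ServiceGroup workflow (the group name is illustrative; 205_vip and the server IPs are borrowed from examples later in this chapter):

```
# Create the group once with its parameters, then add members cheaply
add serviceGroup sg_web HTTP
bind serviceGroup sg_web 192.168.1.61 80
bind serviceGroup sg_web 192.168.1.63 80
bind lb vserver 205_vip sg_web
```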
You've set up load balancing for the first time and tried to access the web page, and your browser appears to hang. Here's how you go about troubleshooting such issues. Start by checking whether the VIP and services are up. If a service is down, running show service <servicename> will show you why that service is down.
Some examples of what you might see are:
Resolution: Add a MIP or SNIP. Also make sure that the IP you add is from the right subnet, using a subnet calculator if you have to.
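A hedged sketch of adding a SNIP (the IP and mask are placeholders; use an address from the subnet your servers live in):

```
# Add a subnet IP the NetScaler can use as the source for server-side traffic
add ns ip 192.168.1.5 255.255.255.0 -type SNIP
```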
Resolution: This will take involvement from you as well as your server teams. Start by looking at a trace. Tracing on the NetScaler is introduced in Chapter 9, Troubleshooting Tools. If the server itself is running perfectly, a blocking Firewall rule might be the problem here.
Also, be sure that the bound monitor is of the right type; the port might be UP, but you might need a monitor that runs specific queries to accurately report an unavailable service, especially for multi-tiered services. ECV (Extended Content Verification) monitors serve well here.
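A hedged sketch of an ECV monitor; the request URL, the expected response string, and svc_web are all placeholders for your environment:

```
# Probe a health-check page and require a specific string in the response
add lb monitor mon_http_ecv HTTP-ECV -send "GET /healthcheck" -recv "OK"
bind service svc_web -monitorName mon_http_ecv
```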
It's not uncommon to land in a situation as a NetScaler Administrator, where the NetScaler shows a monitor time out, but the Server logs will not show any problems. This is one reason why the recommended way of approaching such issues is to get simultaneous traces—on the Client, on the NetScaler, and on the Server.
We've looked at troubleshooting a VIP being down. However, the vServer could be UP but you might see other issues when accessing it; this section talks about a troubleshooting approach for such issues:
Set up a TCP VIP or an ANY VIP first and observe all the ports in use between the client and the server. The application's documentation is a great way to get that information too. You can start going up a layer once this characteristic of the application is understood.

One example I've encountered is when a customer was using a traffic load generator to test how a newly set up VIP would hold up. They noticed that the servers weren't getting much of the traffic. On deeper analysis, we realized that the load generator was pumping a lot of requests with a very small advertised window size (hence the _SW_ in the counter tcp_err_SW_init_pktdrop in the following screenshot). This kind of behavior closely resembles a Slow Read attack, which is what this protection was put in place for:
Once this was understood, the tool was tweaked to instead resemble regular User traffic and the throughput issue was corrected.
Performance issues usually manifest as pages taking a long time to load, or uploads to or downloads from a site timing out. To troubleshoot these:
Once you have the trace, look for the following:
The MSS will be shown on the SYN from the client and the SYN/ACK from the server right in the Info section, but you can also look this up in the Options section of the TCP frame:
The Options part in the TCP handshake packets will also show you vital information, such as whether Window scaling is enabled and what the scale factor (multiplier) is. If the application experiencing delays involves large transfers, you can try increasing the scale factor so that the receive windows are expanded to accept more data per acknowledgement.
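On the NetScaler side, window scaling is controlled through a TCP profile. A hedged sketch follows; the profile name is illustrative, and -WSVal is the scale factor exponent (the window is multiplied by 2^WSVal):

```
# Enable window scaling with a factor of 2^8 and apply it to the VIP
add ns tcpProfile tcp_prof_ws -WS ENABLED -WSVal 8
set lb vserver 205_vip -tcpProfileName tcp_prof_ws
```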
Occasional zero windows are not a serious problem, as long as the receiver is able to quickly empty the receive buffer and send out a notification that it has free buffers to accept more data. The problem is when the zero window situation persists long enough that the sender has to give up, or if timeouts are getting hit. Take the following screenshot for example:
Here, the SNIP has advertised a zero window to the server; it cannot accept any further data, and the server is obliged to wait. If it thinks it has waited long enough, it will even send a probe to see whether the NetScaler is ready to accept more data (you can find such probes using the Wireshark filter tcp.analysis.zero_window_probe). The NetScaler, on its part, waits for a packet from the client indicating that the client is ready to accept more data, or that it has processed the data it previously received. That confirmation arrives in the form of an ACK. Following this ACK, the NetScaler SNIP sends out a TCP window update, telling the server that it is ready to accept more data. The key is whether this recovery happens fast enough; if it doesn't, performance will drop.
Also, a high number of zero windows from the client can cause the NetScaler to reset the connection in order to protect its memory from saturation, as that kind of TCP pattern is characteristic of a known TCP attack (sockstress). This protection, by the way, is toggled using the command set ns tcpparam -limitedPersist ENABLED/DISABLED.
These issues are best diagnosed by a trace and manually calculating SEQ and ACK numbers to find out whether each receiver is receiving and ACKing what the sender sends and whether that ACK is reaching the sender. Some amount of retransmissions or DupAcks are inevitable on any busy production network; however, if you are seeing a high number of them in the same TCP stream, that is a cause for concern.
Also, if you are seeing ICMP messages indicating that the packets are too large, enable PMTUD in the list of NetScaler modes to avoid fragmentation, or drops due to unable-to-fragment issues. We discussed PMTUD in the first chapter.
A MaxClients parameter set to 1 tells the NetScaler not to send more requests to a service that is already processing one:

"Is the solution to immediately remove the MaxClient setting?" That depends. The value you configure here protects the servers from getting saturated, preventing extremely degraded performance or, worse, the server crashing due to the load. So a deeper understanding of what the server can handle is needed (working with the server vendor if necessary) to choose an appropriate value.
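For reference, the per-service setting looks like this; svc_web and the value 50 are illustrative, and the right number depends on what the server can actually sustain:

```
# Cap concurrent requests sent to this service
set service svc_web -maxClient 50
```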
When you are bringing a new service or set of services into an existing load balancing environment, you will want to verify that the NetScaler is distributing the load fairly evenly. There are multiple ways of doing this, including looking at the service hits. I find it helpful, especially for troubleshooting, to look at nsconmsg outputs to see how this is happening.
nsconmsg is covered in more detail in Chapter 9, Troubleshooting Tools.
In the following example, I have used the -j option to list the vServer name to help me narrow the output; using the -s ConLB command to set the debug level and the distrconmsg display option (-d), I am able to see how the distribution evolves every 7 seconds.
What I can see in the following screenshot is that the VIP 205_vip is using the least connections method for load balancing with SourceIP persistence, and that all the hits are persistent. It also has two services, 192.168.1.61:80 and 192.168.1.63:80, along with their respective hits. I started the second client with a delay, which is why 192.168.1.61 starts out as the only one taking load, given the persistence implication, before 192.168.1.63 starts to get some hits. The goal is to get this distribution to be as even as possible:
Persistence is the most common reason why you might see uneven distribution to the servers. This disparity coming from persistence can sometimes be much exaggerated when all clients come from behind a NAT device and consequently share a single IP. This is why cookie insertion makes an excellent persistence method.
Another reason is the use of a load- or response-based method that changes distribution based on server capacity. Here's how to see those persistence entries:
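A hedged sketch of listing the current persistence table, optionally scoped to a vServer (205_vip is the example VIP from this section):

```
# Show active persistence entries for this vServer
show lb persistentSessions 205_vip
```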
This helps when you are trying to determine which server a particular client is being served from.
If you are using cookie-based persistence, the client comes back with the LB cookie it was provided each time it places a request, so a persistence table is not necessary. Instead, use the show lb vserver output to identify which server the request will persist to:
Now, if you look at the header of a response and do a match, you can see that this response was served from 63_svc:
HTTP/1.1 200 OK
Content-Type: image/png
Last-Modified: Sat, 18 Apr 2015 11:05:01 GMT
Accept-Ranges: bytes
ETag: "b07dcc83c779d01:0"
Server: Microsoft-IIS/7.5
Date: Fri, 08 May 2015 13:28:01 GMT
Content-Length: 184946
Set-Cookie: NSC_205_wjq=ffffffffc3a01f2e45525d5f4f58455e445a4a423660
Note that the cookie name can be changed from the default NSC_vipname format. This ability was specifically added for applications that require the cookie to have a specific name; for example, Lync 2013 needed the cookie to be called MS-WSMAN:
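A hedged sketch of setting a custom persistence cookie name; lync_vip is a hypothetical vServer name, and the -cookieName parameter requires a firmware version that supports it:

```
# Use a custom cookie name instead of the NSC_vipname default
set lb vserver lync_vip -persistenceType COOKIEINSERT -cookieName MS-WSMAN
```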
You can also dive a little deeper into the LB VIP's performance using the -d oldconmsg option. This will give you a ton of information to work with:
nsconmsg output with –d oldconmsg
In the preceding screenshot, you can see the packets/sec, the weight of each service, the hits that each service is getting, the number of active transactions (Atr), the traffic handled in Mbps, and, critically in this particular test case, the response time (RspTime), which is 492.66 ms. We can see that the server is struggling a little, considering it's taking nearly half a second to respond to each request. Another, perhaps bigger, indicator of trouble is the surge queue that is building up: SQ (893).
There are also some very useful connection level details you can learn from this screenshot:
Established, as the name would imply, are the connections that are active and receiving traffic. OPEN established are the ones that are in the TIME_WAIT state awaiting closure. A NetScaler zombie cleanup process runs periodically and closes them to make room for newer connections. Any resets resulting from this type of closure will contain a window size of 9300.
If your users or application team reports that a VIP was unavailable for a brief period before automatically getting restored, here are a few possibilities that you need to look at:
In the preceding snippet, I can see the monitor failure that brought the service down. Notice that, unlike with the earlier examples of nsconmsg, I am using the -K option and a newnslog file location. This is because we are doing a postmortem and want to look at historical data; if you leave it out, you will see only live data. You can also get a more detailed look at monitor states using the following:
nsconmsg -K /var/nslog/newnslog -s ConMon=3 -d oldconmsg
Take a trace (nstrace is covered in Chapter 9, Troubleshooting Tools), as it provides more insight than logs. As always, try to get simultaneous traces on the NetScaler and the server to be able to effectively narrow this down.

Check /var/core/ for core dumps matching the timestamp of the issue and report them to tech support. A crash has the potential of looking like a brief outage, due to the failover that follows or, in the case of a standalone unit, because the unit reboots and starts serving traffic again. HA failovers are covered in Chapter 5, High Availability and Networking.