Chapter 10. Monitoring the Health of Zabbix

Zabbix's internal keys have existed since version 1.6, but only the newer versions offer a wider range of them. Interestingly, these keys can be described as the "monitors that monitor other monitors". In practice, we need to measure Zabbix's behavior so that we can make decisions about what to change. But just knowing the internal keys of Zabbix is not enough. We need to know what each of them means and what effect a change in any of the parameters related to the keys can have.

Zabbix's internal metrics play an important part in decisions about the performance tuning of an environment. These metrics can't be evaluated in isolation; we need to understand that they can be affected by other factors, such as the Zabbix database's settings.

The important thing here is to understand that there are no "magic settings" or "silver bullets" that can ensure the best performance in any environment. Specific metrics for each component should always be monitored and evaluated. Even if the environment shows no signs of trouble, we can't neglect this data collection, as the historical data will be valuable for later assessment. In this chapter, we are going to talk about these topics:

  • Zabbix queue
  • Server and proxy internal items
  • Database performance items

The Zabbix queue

Ever since Zabbix was born, one of its key performance indicators has been the item queue. If the queue is flowing and is not accumulating items, it means that Zabbix is performing well. Just like that? Almost! The queue is certainly one of Zabbix's performance indicators if our focus is only on data collection. A low queue means that Zabbix can perform its collections without difficulty, but a high queue alone is not enough to pinpoint a performance problem with Zabbix. It is, however, an indicator that something is not right, which could be a problem with Zabbix, a host, a host group, or even a part of the network. A high queue requires more effort and work from Zabbix, so it can be the cause of performance problems and not just their consequence. In my understanding, this is why it is important to keep the Zabbix queue under control. There is a specific internal item for monitoring the Zabbix queue.

That item is zabbix[queue,<from>,<to>]. This, in fact, is the only internal item directly related to the Zabbix queue. It has great importance as a performance indicator. The parameters it receives are from and to, which set the minimum and maximum delay, in seconds, of the queued items we want to count.

What do these parameters tell us on their own? Each item reports how many items are queued within the delay interval defined by its parameters.

For example, suppose we create the following items in Zabbix:

  • zabbix[queue,1,10]
  • zabbix[queue,11,30]
  • zabbix[queue,31,60]
  • zabbix[queue,61,300]
  • zabbix[queue,301,600]
  • zabbix[queue,600]

With these items, we reproduce the view offered by the Zabbix queue screen. The intervals are the same, and in addition we keep the historical values of each one.
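As a small illustration, the item keys listed above can be generated from a list of interval bounds. This is only a sketch: the function name queue_item_keys and the bounds list are hypothetical helpers, not part of Zabbix, and the keys still have to be created as items on the Zabbix server by the usual means.

```python
def queue_item_keys(bounds):
    """Build zabbix[queue,...] keys from (from, to) pairs given in seconds.

    A pair whose upper bound is None produces the open-ended form,
    for example zabbix[queue,600].
    """
    keys = []
    for low, high in bounds:
        if high is None:
            keys.append(f"zabbix[queue,{low}]")
        else:
            keys.append(f"zabbix[queue,{low},{high}]")
    return keys


# The same intervals used in the list above.
BOUNDS = [(1, 10), (11, 30), (31, 60), (61, 300), (301, 600), (600, None)]

if __name__ == "__main__":
    for key in queue_item_keys(BOUNDS):
        print(key)
```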

[Figure: The Zabbix queue]

Once the items exist, we can create triggers that generate alerts if the number of items queued for more than 10 minutes (600 seconds) exceeds 10 percent of all items. And what would this trigger look like? Simple! We need an item that holds the number of active items in Zabbix, such as zabbix[items]. This item returns the total number of active items in the environment (including unsupported ones).

The path from here to the trigger is quite simple and familiar to most of you. We use ({Zabbix server:zabbix[queue,600].last()}/{Zabbix server:zabbix[items].last()})>0.10. In this case, we get an alert when the number of items overdue by more than 10 minutes exceeds 10 percent of the total active items. Of course, this threshold should be adjusted for each environment.
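To make the arithmetic behind this trigger explicit, here is a minimal Python sketch of the same ratio check. It is not Zabbix code: the function name queue_ratio_exceeded and its arguments are hypothetical, and the two counts stand in for the last values of zabbix[queue,600] and zabbix[items].

```python
def queue_ratio_exceeded(overdue_items, total_items, threshold=0.10):
    """Return True when overdue items exceed the given share of all items.

    overdue_items: last value of the queue item (for example zabbix[queue,600]).
    total_items:   last value of zabbix[items].
    threshold:     fraction of the total that triggers the alert.
    """
    if total_items <= 0:
        # No active items: nothing to compare, and we avoid dividing by zero.
        return False
    return overdue_items / total_items > threshold


if __name__ == "__main__":
    # 150 overdue out of 1000 active items is 15 percent, above 10 percent.
    print(queue_ratio_exceeded(150, 1000))
    # The same check with a stricter 5 percent limit.
    print(queue_ratio_exceeded(40, 1000, threshold=0.05))
```

The threshold is a parameter because, as noted, the right value differs from one environment to another.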

And how can this trigger help with regard to performance? Well, we know that a high queue, whatever its cause, directly affects how busy the pollers are, and this can end up generating those unwanted graphs with gaps.

The tip here is to put controls on the Zabbix queue so that we know when it exceeds a certain level. Try to keep the limit at approximately 5 percent.

An important thing for Zabbix administrators to understand is what the Zabbix queue is and what it is not. Many think that there is a table where Zabbix stores the queue, but that's not how it works. The queue itself does not exist. What happens is that every item stores within itself (in the database and ConfigCache) some information that allows Zabbix to calculate the size of the queue. Examples of this information are as follows:

  • Delay: This is the interval set in the item between data collections. It determines how often Zabbix updates the value of an item.
  • LastClock: This is the timestamp of the last data collection for the item. It is the time when the Zabbix server received the last collected value for the item.

With this information recorded, the Zabbix GUI can do the calculations required to display the Zabbix queue screen. In other words, the queue does not really exist; it is a calculation based on the parameters and information about the items.
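The idea of a queue that is calculated and not stored can be sketched in a few lines of Python. This is a simplified illustration, not the actual Zabbix implementation: the Item class and the functions overdue_seconds and queue_size are hypothetical, and the sketch assumes that an item is overdue once the current time passes LastClock plus Delay, and that both interval bounds are inclusive.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    delay: int       # update interval in seconds
    last_clock: int  # Unix timestamp of the last collected value


def overdue_seconds(item, now):
    """Seconds elapsed past the expected next collection (0 if on time)."""
    return max(0, now - (item.last_clock + item.delay))


def queue_size(items, now, from_s, to_s=None):
    """Count items overdue by at least from_s and at most to_s seconds.

    With to_s left as None the interval is open-ended. Items that are
    not late at all are never counted.
    """
    count = 0
    for item in items:
        late = overdue_seconds(item, now)
        if late == 0 or late < from_s:
            continue
        if to_s is None or late <= to_s:
            count += 1
    return count


if __name__ == "__main__":
    now = 1065
    items = [
        Item("cpu", delay=60, last_clock=1000),   # 5 seconds late
        Item("disk", delay=60, last_clock=900),   # 105 seconds late
        Item("ping", delay=30, last_clock=300),   # 735 seconds late
    ]
    print(queue_size(items, now, 1, 10))
    print(queue_size(items, now, 61, 300))
    print(queue_size(items, now, 600))
```

Nothing here reads from a queue table: every count is derived on demand from each item's delay and last collection time, which is the point made above.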