Alerting on log data

With the data coming in, let's talk about alerting on it with triggers. A few things work somewhat differently from the thresholds and similar numeric comparisons that we've used in triggers so far.

If we have a log item that's collecting all lines and we want to alert on the lines containing some specific string, there are several trigger functions of potential use:

  • str(): This checks for a substring; for example, if we're collecting all values, this function could be used to alert on errors: str(error)
  • regexp(): Similar to the str() function, this allows us to specify a regular expression to match
  • iregexp(): This is a case-insensitive version of regexp()

These functions only work on a single line; it's not possible to match multiline log entries.

For these three functions, a second parameter is supported as well; in that case, it's either the number of seconds or the number of values to check. For example, str(error,600) would fire if there's an error substring in any of the values over the last 10 minutes.
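As a rough illustration of what these three functions and the optional time window check, here is a minimal Python sketch. It is not Zabbix code: the helper names are hypothetical, values are modelled as (timestamp, line) pairs, and Python's re module only stands in for the regular expression engine Zabbix uses, so pattern syntax may differ in the details.

```python
import re

def str_match(line, pattern):
    """Rough analogue of str(): case-sensitive substring test on one line."""
    return pattern in line

def regexp_match(line, pattern):
    """Rough analogue of regexp(): case-sensitive regular expression search."""
    return re.search(pattern, line) is not None

def iregexp_match(line, pattern):
    """Rough analogue of iregexp(): the same search, ignoring case."""
    return re.search(pattern, line, re.IGNORECASE) is not None

def any_in_window(values, now, seconds, predicate):
    """True if any line received in the last `seconds` satisfies `predicate`.

    `values` is a list of (timestamp_in_seconds, line) tuples.
    """
    return any(predicate(line) for ts, line in values if now - seconds < ts <= now)

if __name__ == "__main__":
    values = [(0, "all good"), (500, "disk error"), (1000, "all good")]
    contains_error = lambda line: str_match(line, "error")
    # Analogue of str(error,600), evaluated at two different moments
    print(any_in_window(values, 1000, 600, contains_error))
    print(any_in_window(values, 1200, 600, contains_error))
```

In this sketch, the first check sees the error line received at second 500, while the second check, made 200 seconds later, no longer has it inside the 600-second window.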

That seems fine, but there's an issue if we only send error lines to the server by filtering on the agent side. To see what the problem is, let's consider a normal trigger, like the one checking for CPU load exceeding some threshold. Assuming we have a threshold of 5, the trigger currently in the OK state, and values such as 0, 1, and 2 arriving, nothing happens; no events are generated. When the first value above 5 arrives, a PROBLEM event is generated and the trigger switches to the PROBLEM state. Any further values above 5 wouldn't generate any events; nothing would happen.

The problem is that log monitoring would work the same way. We would generate a PROBLEM event for the first error line, and then nothing. The trigger would stay in the PROBLEM state and nothing else would happen. The solution is fairly simple: there's a Multiple setting in the PROBLEM event generation mode option of the trigger properties.

Enabling this setting would make the mentioned CPU load trigger generate a new PROBLEM event for every value above the threshold of 5. That wouldn't be very useful in most cases, but it is useful for the log monitoring trigger. It's all good if we only receive error lines; a new PROBLEM event would be generated for each of them.

Note that even if we send both error and good lines, an error arriving after a good line would be picked up, but the errors directly following it would be ignored, which could be a problem as well.
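The difference between the two event generation modes can be sketched with a small simulation. This is only an illustrative model with hypothetical names, not how the Zabbix server is implemented: each incoming value is checked against a condition, and the function returns the values that would produce a PROBLEM event.

```python
def problem_events(values, is_problem, multiple=False):
    """Return the values that would generate a PROBLEM event.

    Single mode: an event is generated only on the OK -> PROBLEM transition.
    Multiple mode: every value matching the condition generates an event.
    """
    events = []
    state = "OK"
    for value in values:
        if is_problem(value):
            if state == "OK" or multiple:
                events.append(value)
            state = "PROBLEM"
        else:
            state = "OK"
    return events

if __name__ == "__main__":
    lines = ["error a", "error b", "all good", "error c", "error d"]
    has_error = lambda line: "error" in line
    print(problem_events(lines, has_error))                 # single mode
    print(problem_events(lines, has_error, multiple=True))  # multiple mode
```

In single mode, only "error a" and "error c" produce events in this model, because "error b" and "error d" arrive while the trigger is already in the PROBLEM state. In multiple mode, all four error lines produce an event.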

With this problem solved, we arrive at another one: once a trigger fires against an item that only receives error lines, this trigger never resolves; it always stays in the PROBLEM state. While that's not an issue in some cases, in others it's not desirable. There's an easy way to make such triggers time out by using a trigger function we're already familiar with, nodata(). If the item receives both error and normal lines, and we want it to time out 10 minutes after the last error arrived even if no normal lines arrive, the trigger expression could be constructed like this:

{host:item.str(error)}=1 and {host:item.nodata(10m)}=0

Here, we're using the nodata() function the other way around: even if the last entry contains errors, the trigger would switch to the OK state if there were no other values in the last 10 minutes.
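To make the logic of this expression easier to follow, here is a minimal Python sketch of how its two halves combine. The function names mirror the trigger functions but are hypothetical helpers, values are modelled as (timestamp, line) pairs, and the model ignores when the server actually re-evaluates the expression.

```python
def nodata(values, now, seconds):
    """Return 1 if no value was received in the last `seconds`, else 0."""
    received = any(now - seconds < ts <= now for ts, _ in values)
    return 0 if received else 1

def last_str(values, pattern):
    """Return 1 if the most recent value contains `pattern`, else 0."""
    return 1 if values and pattern in values[-1][1] else 0

def in_problem(values, now):
    """Model of: str(error)=1 and nodata(10m)=0."""
    return last_str(values, "error") == 1 and nodata(values, now, 600) == 0

if __name__ == "__main__":
    only_error = [(0, "error: disk full")]
    print(in_problem(only_error, 300))   # error is 5 minutes old
    print(in_problem(only_error, 700))   # error is older than 10 minutes
```

In this model, the trigger is in the PROBLEM state 5 minutes after the error, and it has timed out by the time the error is more than 10 minutes old, even though no normal line ever arrived.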

We also discussed triggers that time out in Chapter 6, Detecting Problems with Triggers, in the Triggers that time out section.

If the item receives error lines only, we could use an expression like the previous one, but we could also simplify it. In this case, just having any value is a problem situation, so we would use the reversed nodata() function again and alert on values being present:

{host:item.nodata(10m)}=0

Here, if we have any values in the last 10 minutes, that's it; it's a PROBLEM. If there aren't any values, the trigger switches to OK. This is somewhat less resource intensive as Zabbix doesn't have to evaluate the actual item value.

Yet another trigger function that we could use here is count(). It would allow us to fire an alert when there's a certain number of interesting strings, such as errors, during some period of time. For example, the following will alert if there are more than 10 errors in the last 10 minutes:

{host:item.count(10m,error,like)}>10
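The counting logic behind such an expression can be sketched as follows. This is a simplified Python model with a hypothetical helper name; it treats the like operator as a plain substring match and models values as (timestamp, line) pairs.

```python
def count_like(values, now, seconds, pattern):
    """Count values from the last `seconds` whose text contains `pattern`."""
    return sum(
        1
        for ts, line in values
        if now - seconds < ts <= now and pattern in line
    )

if __name__ == "__main__":
    # Eleven error lines, one every 30 seconds, plus one unrelated line
    values = [(i * 30, "error number %d" % i) for i in range(1, 12)]
    values.append((400, "service started"))
    now = 600
    print(count_like(values, now, 600, "error") > 10)
```

Note that the comparison is strict: in this model, exactly 10 matching lines in the window would not fire the trigger; the 11th one would.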

Another solution is to keep the problem open and close it by hand once we have checked it ourselves. This can be done by selecting the Allow manual close box in the trigger properties.

Yet another way, if the log file contains both error and OK lines, is to make use of the OK event generation option. We would then create a trigger that alerts us when there's, for example, an error in the log and recovers when the word OK appears in the log.

Let's try this with our first log file. Click on Triggers for A test host and add a trigger with the following settings:

  • Name: Warning on errors in logfile1
  • Severity: Warning
  • Problem expression: {A test host:log[/tmp/zabbix_logmon/logfile1].str(error)}=1
  • OK event generation: Recovery expression
  • Recovery expression: {A test host:log[/tmp/zabbix_logmon/logfile1].str(ok)}=1
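How the problem and recovery expressions of this trigger interact can be sketched with a small state machine. This is an illustrative Python model, not Zabbix internals: the function name is hypothetical, and it assumes that recovery only happens when the recovery expression is true while the problem expression is false.

```python
def trigger_states(lines, problem_pat="error", recovery_pat="ok"):
    """Return the trigger state after each incoming log line."""
    state = "OK"
    history = []
    for line in lines:
        problem = problem_pat in line    # models str(error)=1
        recovery = recovery_pat in line  # models str(ok)=1
        if state == "OK" and problem:
            state = "PROBLEM"
        elif state == "PROBLEM" and recovery and not problem:
            state = "OK"
        history.append(state)
    return history

if __name__ == "__main__":
    print(trigger_states(["startup", "error", "still failing", "ok", "error"]))
```

Because both patterns are plain substring matches, a line such as "link broken" would also count as a recovery in this model, since "broken" contains "ok". A more specific string or a regular expression may be a safer choice for real log files.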

Now, let's create some errors in our log file (make sure you wait long enough, like with the other tests):

echo "error" >> /tmp/zabbix_logmon/logfile1

Let's check the dashboard to see whether we get a warning; we should be able to see a problem for our A test host.

Now, let's fix this error by sending the word ok to our log file and see what happens:

echo "ok" >> /tmp/zabbix_logmon/logfile1

We can check in the dashboard that our error is gone but, to get more proof, let's go to Monitoring | Problems and select A test host in the Hosts selection box in our filter. We can now see that there was an issue at 13:04:42 and that the issue was resolved at 13:09:32; in the Actions column, we can see the actions that were taken.
