Sometimes triggers in Zabbix are too sensitive, and you get notifications all the time because of quick, repeated status changes; this is what we call flapping. An example is a swap file that grows and shrinks all the time: Zabbix sends a notification that there is not enough free space left, goes back to the OK state a few seconds later because there is enough space again, and then raises the alarm once more a few seconds after that. Another example of flapping is the CPU load going over and under the threshold every few seconds. Let's see how we can solve this.
For this recipe, we only need a Zabbix server with an agent installed on the Zabbix server itself or on some other host, and of course access with a super administrator account, like the one that comes standard with the installation.
({TRIGGER.VALUE}=0 & {MySQL:vfs.fs.size[/var/lib/mysql,pfree].last(0)}<10) | ({TRIGGER.VALUE}=1 & {MySQL:vfs.fs.size[/var/lib/mysql,pfree].last(0)}<30)
We made use of a new macro, {TRIGGER.VALUE}. Once a trigger is in the problem state, its value is 1, and once it is in the OK state, its value is 0. By making use of the OR operator (|), we can tell our trigger to change to the problem state if our volume has less than 10 percent free space, or to remain in the problem state if it was already in that state and the volume still has less than 30 percent free space left.
Our trigger will come back to the OK state once the free space on our volume reaches 30 percent or more. This is possible because the macro {TRIGGER.VALUE} always returns the current trigger value. The first condition in the expression defines when the problem starts; in our case, when there is less than 10 percent free space on the MySQL volume.
The second condition defines what keeps our trigger in the problem state; in our case, less than 30 percent free space.
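To make the effect of the two thresholds easier to see, here is a minimal Python sketch that mimics how the expression is evaluated. It is not Zabbix code; the function names, the sample values, and the default thresholds of 10 and 30 percent are assumptions chosen to match the expression in this recipe.

```python
def next_trigger_value(current_value, pfree, problem_below=10.0, recover_at=30.0):
    """Return the new trigger value (0 = OK, 1 = PROBLEM) for one evaluation.

    current_value plays the role of {TRIGGER.VALUE}; pfree is the last
    reported percentage of free space. Names and defaults are illustrative.
    """
    # First condition: we were OK and free space dropped below 10 percent.
    enters_problem = current_value == 0 and pfree < problem_below
    # Second condition: we were already in PROBLEM and are still below 30 percent.
    stays_in_problem = current_value == 1 and pfree < recover_at
    return 1 if (enters_problem or stays_in_problem) else 0


def simulate(samples, start_value=0):
    """Feed a series of pfree samples through the trigger and collect the states."""
    states = []
    value = start_value
    for pfree in samples:
        value = next_trigger_value(value, pfree)
        states.append(value)
    return states


if __name__ == "__main__":
    # Free space hovering around the 10 percent mark, then recovering.
    samples = [35, 12, 9, 11, 9, 25, 29, 31, 15]
    print(simulate(samples))
```

With these samples, a plain "less than 10 percent" trigger would switch state on almost every sample around the 10 percent mark, while the sketch enters the problem state once and leaves it only when free space reaches 30 percent.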
There are more ways to do smart monitoring. For example, we can make use of the fuzzytime() function to see if there is still contact with our proxy. Example: {Zabbix server:zabbix[proxy,<proxy name>,last access].fuzzytime(300)}=0. This will alarm us if there has been no contact for more than 300 seconds.
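The check performed by fuzzytime() can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the function returns 1 when the item's timestamp is within the given number of seconds of the server time and 0 otherwise; the function and parameter names below are hypothetical.

```python
import time


def fuzzytime(item_timestamp, sec, now=None):
    """Return 1 if item_timestamp is within `sec` seconds of `now`, else 0.

    Illustrative re-implementation only; `now` defaults to the local clock
    and stands in for the Zabbix server time.
    """
    if now is None:
        now = time.time()
    return 1 if abs(now - item_timestamp) <= sec else 0


def proxy_is_silent(last_access, now=None):
    """Mirror the trigger condition fuzzytime(300)=0 for a proxy's last access time."""
    return fuzzytime(last_access, 300, now) == 0
```

For example, proxy_is_silent(1000, now=1200) is False because the proxy was seen 200 seconds ago, while proxy_is_silent(1000, now=1301) is True.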
We can also do a time shift in Zabbix. This means that we can compare a value from an item with the value from yesterday. Example: {server:system.cpu.load.avg(1h)}/{server:system.cpu.load.avg(1h,1d)}>2
This expression compares the average load over the last hour with the average load over the same hour yesterday and gives a warning if today's load is more than twice as high.
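The time-shift comparison can be sketched as follows. This is a simplified Python model, not Zabbix code: the history is assumed to be a list of (timestamp, value) pairs, and the function names and the handling of empty windows are choices made for this illustration.

```python
HOUR = 3600
DAY = 86400


def avg(history, period, now, time_shift=0):
    """Average of values in the `period` seconds ending `time_shift` seconds before `now`.

    history is a list of (timestamp, value) pairs. Returns None when the
    window holds no values (an assumption made for this sketch).
    """
    end = now - time_shift
    start = end - period
    window = [value for ts, value in history if start < ts <= end]
    if not window:
        return None
    return sum(window) / len(window)


def load_more_than_doubled(history, now):
    """Mirror avg(1h)/avg(1h,1d) > 2 on the simplified history model."""
    today = avg(history, HOUR, now)
    yesterday = avg(history, HOUR, now, time_shift=DAY)
    if today is None or not yesterday:
        # No data, or a zero average yesterday: the ratio cannot be computed.
        return False
    return today / yesterday > 2
```

With a history where yesterday's hourly average was 1.0 and the last hour averaged 2.5, load_more_than_doubled() returns True; with 1.5 for the last hour, it returns False.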