Zabbix

2 Trigger expressions

Overview

The expressions used in triggers are very flexible. You can use them to create complex logical tests regarding monitored statistics.

A simple expression uses a function that is applied to the item with some parameters. The function returns a result that is compared to the threshold, using an operator and a constant.

The syntax of a simple useful expression is function(/host/key,parameter)<operator><constant>.

For example:

  min(/Zabbix server/net.if.in[eth0,bytes],5m)>100K

will trigger if the number of received bytes during the last five minutes was always over 100 kilobytes.
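
As a rough illustration of what this expression tests, the Python sketch below applies the same check to an in-memory list of samples. The function name min_over_threshold, the sample data and the handling of an empty window are assumptions made for the example, and 100K is taken as 100 * 1024. It is not how the server evaluates triggers internally.

```python
def min_over_threshold(samples, now, window_s, threshold):
    """Return True if every sample in the last window_s seconds exceeds threshold.

    samples is a list of (unix_timestamp, value) pairs (hypothetical layout).
    """
    recent = [value for ts, value in samples if now - window_s < ts <= now]
    if not recent:
        return False  # assumption: an empty window does not report a problem
    return min(recent) > threshold


# Five samples of received bytes, one per minute (made-up values).
samples = [(0, 150_000), (60, 200_000), (120, 110_000), (180, 130_000), (240, 170_000)]
print(min_over_threshold(samples, now=240, window_s=300, threshold=100 * 1024))
```

A single sample at or below the threshold inside the window is enough to make the check false, which is why min() is used rather than last().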

While the syntax is exactly the same, from the functional point of view there are two types of trigger expressions:

  • problem expression - defines the conditions of the problem
  • recovery expression (optional) - defines additional conditions of the problem resolution

When defining a problem expression alone, this expression will be used both as the problem threshold and the problem recovery threshold. As soon as the problem expression evaluates to TRUE, there is a problem. As soon as the problem expression evaluates to FALSE, the problem is resolved.

When both a problem expression and a supplemental recovery expression are defined, problem resolution becomes more complex: the problem expression must be FALSE and, in addition, the recovery expression must be TRUE. This is useful for creating hysteresis and avoiding trigger flapping.

1 Functions

Trigger functions allow you to reference collected values, the current time and other factors.

A complete list of supported functions is available.

2 Function parameters

Most numeric functions accept a number of seconds as a parameter (a time period).

You can use the # prefix to give a parameter a different meaning:

Function call   Meaning
sum(600)        Sum of all values within the last 600 seconds (10 minutes)
sum(#5)         Sum of the last 5 values

The last function treats #-prefixed values differently: last(#N) returns the Nth most recent value, counting from the newest to the oldest. For example, suppose the last 10 collected values, listed from oldest to newest, are (2, 3, 4, 9, 10, 11, 11, 16, 7, 0):

  • last(#2) returns the second most recent value, 7
  • last(#5) returns 11.
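
The # prefix can be pictured as counting values back from the newest one. The Python sketch below is only an illustration with hypothetical helper names (last_nth, sum_last_n); it assumes the values are held in a list ordered from oldest to newest, as in the example above.

```python
def last_nth(values_oldest_first, n=1):
    """Return the n-th most recent value; n=1 is the newest one."""
    if n < 1 or n > len(values_oldest_first):
        raise ValueError("n is outside the available values")
    return values_oldest_first[-n]


def sum_last_n(values_oldest_first, n):
    """Sum of the n most recent values, in the spirit of sum(#n)."""
    return sum(values_oldest_first[-n:])


values = [2, 3, 4, 9, 10, 11, 11, 16, 7, 0]
print(last_nth(values, 1))    # newest value
print(last_nth(values, 2))    # second most recent value
print(sum_last_n(values, 5))  # sum of the five most recent values
```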

The avg, count, last, min and max functions accept an additional parameter: time_shift. This parameter lets you reference data from a period in the past. For example, avg(1h,1d) returns the average of the values collected during a one-hour period that ended one day ago.
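
To make the shifted window concrete, the following Python sketch averages the samples in a period that ends shift_s seconds before now. The function avg_shifted and the (timestamp, value) sample layout are illustrative assumptions, not Zabbix internals.

```python
HOUR = 3600
DAY = 86400


def avg_shifted(samples, now, period_s, shift_s=0):
    """Average of values with timestamps in (now - shift_s - period_s, now - shift_s]."""
    end = now - shift_s
    start = end - period_s
    window = [value for ts, value in samples if start < ts <= end]
    if not window:
        return None  # assumption: no data in the window gives no result
    return sum(window) / len(window)


now = 1_000_000
samples = [
    (now - DAY - 1800, 2.0),  # inside the one-hour window that ended a day ago
    (now - DAY - 600, 4.0),   # inside the same shifted window
    (now - 1200, 10.0),       # inside the current one-hour window
]
print(avg_shifted(samples, now, HOUR))       # comparable to avg(1h)
print(avg_shifted(samples, now, HOUR, DAY))  # comparable to avg(1h,1d)
```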

Triggers evaluate history data only. If the required data is no longer in history but is available in trends, the trend data is not used. History must therefore be kept long enough to cover the periods used by your triggers.

You can use unit symbols in expressions, for example '5m' (minutes) instead of '300' seconds, '1d' (day) instead of '86400' seconds, or '1K' instead of '1024' bytes.
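
The conversion behind these symbols is plain multiplication. The Python sketch below shows one way to expand a suffixed constant. The helper expand_suffix is hypothetical, and the suffix tables assume the usual time suffixes (s, m, h, d, w) and 1024-based size suffixes (K, M, G, T). Note that suffixes are case-sensitive: m means minutes and M means mega.

```python
TIME_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
SIZE_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def expand_suffix(text):
    """Convert a constant such as '5m' or '1K' to a plain number."""
    units = {**TIME_UNITS, **SIZE_UNITS}
    if text and text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return float(text)  # no suffix: the value is used as written


for constant in ("5m", "1d", "1K", "100K", "300"):
    print(constant, expand_suffix(constant))
```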

3 Operators

The following operators are supported in trigger expressions (in descending priority of execution):

Priority  Operator  Definition
1         -         Unary minus
2         not       Logical NOT
3         *         Multiplication
          /         Division
4         +         Arithmetical plus
          -         Arithmetical minus
5         <         Less than. The operator is defined as:
                    A<B ⇔ (A<=B-0.000001)
          <=        Less than or equal to.
          >         More than. The operator is defined as:
                    A>B ⇔ (A>=B+0.000001)
          >=        More than or equal to.
6         =         Is equal. The operator is defined as:
                    A=B ⇔ (A>B-0.000001) and (A<B+0.000001)
          <>        Not equal. The operator is defined as:
                    A<>B ⇔ (A<=B-0.000001) or (A>=B+0.000001)
7         and       Logical AND
8         or        Logical OR

The not, and and or operators are case-sensitive and must be written in lowercase. They must be surrounded by spaces or parentheses.

All operators, except unary minus and logical NOT, have left-to-right associativity.

4 Trigger examples

Example 1

CPU load analysis: "Processor load is too high on www.zabbix.com"

{www.zabbix.com:system.cpu.load[all,avg1].last()}>5

'www.zabbix.com:system.cpu.load[all,avg1]' gives a short name of the monitored parameter. It refers to the host 'www.zabbix.com' and the key 'system.cpu.load[all,avg1]'. The 'last()' function refers to the most recent value. Finally, '>5' means that the trigger goes into the PROBLEM state whenever the most recent value of this key on this host is greater than 5.

Operators

The following operators are supported for triggers (in descending priority of execution):

Priority  Operator  Definition                                 Notes for unknown values              Force cast operand to float (1)
1         -         Unary minus                                -Unknown → Unknown                    Yes
2         not       Logical NOT                                not Unknown → Unknown                 Yes
3         *         Multiplication                             0 * Unknown → Unknown                 Yes
                                                               (yes, Unknown, not 0 - to not lose
                                                               Unknown in arithmetic operations)
                                                               1.2 * Unknown → Unknown
          /         Division                                   Unknown / 0 → error                   Yes
                                                               Unknown / 1.2 → Unknown
                                                               0.0 / Unknown → Unknown
4         +         Arithmetical plus                          1.2 + Unknown → Unknown               Yes
          -         Arithmetical minus                         1.2 - Unknown → Unknown               Yes
5         <         Less than. The operator is defined as:     1.2 < Unknown → Unknown               Yes
                    A<B ⇔ (A<B-0.000001)
          <=        Less than or equal to. The operator is     Unknown <= Unknown → Unknown          Yes
                    defined as:
                    A<=B ⇔ (A≤B+0.000001)
          >         More than. The operator is defined as:                                           Yes
                    A>B ⇔ (A>B+0.000001)
          >=        More than or equal to. The operator is                                           Yes
                    defined as:
                    A>=B ⇔ (A≥B-0.000001)
6         =         Is equal. The operator is defined as:                                            No (1)
                    A=B ⇔ (A≥B-0.000001) and (A≤B+0.000001)
          <>        Not equal. The operator is defined as:                                           No (1)
                    A<>B ⇔ (A<B-0.000001) or (A>B+0.000001)
7         and       Logical AND                                0 and Unknown → 0                     Yes
                                                               1 and Unknown → Unknown
                                                               Unknown and Unknown → Unknown
8         or        Logical OR                                 1 or Unknown → 1                      Yes
                                                               0 or Unknown → Unknown
                                                               Unknown or Unknown → Unknown

1 String operand is still cast to numeric if:

  • another operand is numeric
  • operator other than = or <> is used on an operand

(If the cast fails - numeric operand is cast to a string operand and both operands get compared as strings.)

not, and and or operators are case-sensitive and must be in lowercase. They also must be surrounded by spaces or parentheses.

All operators, except unary - and not, have left-to-right associativity. Unary - and not are non-associative (meaning -(-1) and not (not 1) should be used instead of --1 and not not 1).
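
The comparison definitions in the operator table above amount to comparing numbers with a tolerance of 0.000001. The Python sketch below restates them as functions. The names lt, le, gt, ge, eq and ne are illustrative only, and the sketch covers numeric operands, not string comparison or Unknown values.

```python
EPS = 0.000001  # tolerance taken from the operator definitions


def lt(a, b):
    return a < b - EPS


def le(a, b):
    return a <= b + EPS


def gt(a, b):
    return a > b + EPS


def ge(a, b):
    return a >= b - EPS


def eq(a, b):
    return b - EPS <= a <= b + EPS


def ne(a, b):
    return a < b - EPS or a > b + EPS


# Two values closer together than the tolerance count as equal.
print(eq(1.0, 1.0000005), lt(1.0, 1.0000005), ne(1.0, 1.0000005))
```

One consequence is that differences smaller than the tolerance are invisible to trigger expressions: the two values in the usage line are treated as equal, and neither is less than the other.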

Evaluation result:

  • <, <=, >, >=, =, <> operators shall yield '1' in the trigger expression if the specified relation is true and '0' if it is false. If at least one operand is Unknown the result is Unknown;
  • the and operator, for known operands, shall yield '1' if both of its operands compare unequal to '0'; otherwise, it yields '0'. If an operand is unknown, and yields '0' only if the other operand compares equal to '0'; otherwise, it yields 'Unknown';
  • the or operator, for known operands, shall yield '1' if either of its operands compares unequal to '0'; otherwise, it yields '0'. If an operand is unknown, or yields '1' only if the other operand compares unequal to '0'; otherwise, it yields 'Unknown';
  • The result of the logical negation operator not for a known operand is '0' if the value of its operand compares unequal to '0'; '1' if the value of its operand compares equal to '0'. For unknown operand not yields 'Unknown'.
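
The rules for and, or and not with Unknown operands form a three-valued logic. The Python sketch below is a minimal model of those rules, using None as a stand-in for Unknown. All function names are hypothetical, and the sketch does not cover arithmetic or comparison operators.

```python
UNKNOWN = None  # stand-in for an Unknown operand


def _is_zero(x):
    return x is not UNKNOWN and x == 0


def _is_nonzero(x):
    return x is not UNKNOWN and x != 0


def logical_and(a, b):
    if _is_zero(a) or _is_zero(b):
        return 0  # a known 0 decides the result, even next to Unknown
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 1


def logical_or(a, b):
    if _is_nonzero(a) or _is_nonzero(b):
        return 1  # a known non-zero value decides the result
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 0


def logical_not(a):
    if a is UNKNOWN:
        return UNKNOWN
    return 1 if a == 0 else 0


print(logical_and(0, UNKNOWN), logical_or(1, UNKNOWN), logical_not(UNKNOWN))
```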

Example 3

/etc/passwd has been changed

Use the 'diff()' function:

{www.zabbix.com:vfs.file.cksum[/etc/passwd].diff()}=1

The expression is true when the latest checksum of '/etc/passwd' differs from the previous one.

Similar expressions can be used to monitor changes in other important files, such as /etc/inetd.conf, /kernel, etc.

Example 4

Someone is downloading a large file from the Internet (or there is heavy traffic for an extended period)

Use the 'min()' function:

{www.zabbix.com:net.if.in[eth0,bytes].min(5m)}>100K

The expression is true when the number of bytes received on 'eth0' has stayed above 100 KB throughout the last 5 minutes.

Example 5

Both nodes of the SMTP cluster are down

Note that the expression uses data from two different hosts:

{smtp1.zabbix.com:net.tcp.service[smtp].last()}=0 and {smtp2.zabbix.com:net.tcp.service[smtp].last()}=0

The expression is true when both SMTP servers (smtp1.zabbix.com and smtp2.zabbix.com) are down.

Example 6

Zabbix agent needs to be upgraded

Use the 'str()' function:

{zabbix.zabbix.com:agent.version.str("beta8")}=1

The expression is true if the Zabbix agent version contains the text "beta8" (for example 1.0beta8).

Example 7

Server is unreachable

{zabbix.zabbix.com:icmpping.count(30m,0)}>5

The expression is true if the host "zabbix.zabbix.com" was unreachable more than 5 times in the last 30 minutes.

Example 8

No data for the last 3 minutes

Use the 'nodata()' function:

{zabbix.zabbix.com:tick.nodata(3m)}=1

In this example 'tick' is an item of type 'Zabbix trapper'. For this trigger to work, the item 'tick' must be defined, and the host must periodically send data for this item using 'zabbix_sender' or a similar tool.

The expression is true if no data is received within the last 180 seconds.

Example 9

CPU load is high at night

Use the 'time()' function:

{zabbix:system.cpu.load[all,avg1].min(5m)}>2 and {zabbix:system.cpu.load[all,avg1].time()}>000000 and {zabbix:system.cpu.load[all,avg1].time()}<060000

The expression is true if the CPU load has stayed above 2 for the last 5 minutes and the current time is between midnight and 06:00.

Example 10

Check whether the local time of the monitored host and the Zabbix server are synchronized

Use the 'fuzzytime()' function:

{MySQL_DB:system.localtime.fuzzytime(10)}=0

The expression is true if the local time on 'MySQL_DB' differs from the Zabbix server time by more than 10 seconds.

Example 11

Comparing the current CPU load with the load at the same time on the previous day (using the time_shift parameter).

{server:system.cpu.load.avg(1h)}/{server:system.cpu.load.avg(1h,1d)}>2

The expression is true if the average load over the last hour is more than twice the average load over the same period one day (24 hours) earlier.

Example 12

Using the value of another item as the trigger threshold:

{Template PfSense:hrStorageFree[{#SNMPVALUE}].last()}<{Template PfSense:hrStorageSize[{#SNMPVALUE}].last()}*0.1

The expression is true if free storage is below 10 percent.

5 Anti-flapping techniques

Sometimes you need different conditions for different states (PROBLEM/OK). For example, we may want a trigger that reports a problem when the server room temperature rises above 20C, the maximum at which the servers can operate safely, while the ideal operating temperature is no more than 15C. Such a trigger can be defined in Zabbix: it switches to the PROBLEM state when the temperature exceeds the acceptable maximum, and it does not return to the OK state until the temperature has dropped back to the ideal level.

To do this, we can define a trigger like the one in "Example 1". The trigger in "Example 2" applies the same anti-flapping technique to disk space.

Example 1

Temperature in the server room is too high

({TRIGGER.VALUE}=0 and {server:temp.last()}>20) or
       ({TRIGGER.VALUE}=1 and {server:temp.last()}>15)

Example 2

Free disk space is too low

Problem: free space is less than 10GB for the last 5 minutes

Recovery (OK): free space is more than 40GB for the last 10 minutes

({TRIGGER.VALUE}=0 and {server:vfs.fs.size[/,free].max(5m)}<10G) or
       ({TRIGGER.VALUE}=1 and {server:vfs.fs.size[/,free].min(10m)}<40G)

Note that the {TRIGGER.VALUE} macro returns the current trigger state (0 - OK, 1 - PROBLEM).

Example 12

Using the value of another item to get a trigger threshold:

last(/Template PfSense/hrStorageFree[{#SNMPVALUE}])<last(/Template PfSense/hrStorageSize[{#SNMPVALUE}])*0.1

The trigger will fire if the free storage drops below 10 percent.

Example 13

Using evaluation result to get the number of triggers over a threshold:

(last(/server1/system.cpu.load[all,avg1])>5) + (last(/server2/system.cpu.load[all,avg1])>5) + (last(/server3/system.cpu.load[all,avg1])>5)>=2

The trigger will fire if at least two of the items in the expression have a value over 5.
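
This works because each comparison evaluates to 1 or 0, so the results can be added up and the sum compared with a count. The Python sketch below mirrors that idea. The helper count_over and the load values are made up for the illustration.

```python
def count_over(values, threshold):
    """Number of values above threshold; each comparison contributes 1 or 0."""
    return sum(1 if value > threshold else 0 for value in values)


# Hypothetical latest CPU load per host.
loads = {"server1": 6.2, "server2": 4.1, "server3": 7.5}
fires = count_over(loads.values(), 5) >= 2
print(fires)
```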

Example 14

Comparing string values of two items - operands here are functions that return strings.

Problem: create an alert if Ubuntu version is different on different hosts

last(/NY Zabbix server/vfs.file.contents[/etc/os-release])<>last(/LA Zabbix server/vfs.file.contents[/etc/os-release])

Example 15

Comparing two string values - operands are:

  • a function that returns a string
  • a combination of macros and strings

Problem: detect changes in the DNS query

The item key is:

net.dns.record[8.8.8.8,{$WEBSITE_NAME},{$DNS_RESOURCE_RECORD_TYPE},2,1]

with macros defined as

{$WEBSITE_NAME} = example.com
       {$DNS_RESOURCE_RECORD_TYPE} = MX

and normally returns:

example.com           MX       0 mail.example.com

So our trigger expression to detect if the DNS query result deviated from the expected result is:

last(/Zabbix server/net.dns.record[8.8.8.8,{$WEBSITE_NAME},{$DNS_RESOURCE_RECORD_TYPE},2,1])<>"{$WEBSITE_NAME}           {$DNS_RESOURCE_RECORD_TYPE}       0 mail.{$WEBSITE_NAME}"

Notice the quotes around the second operand.

Example 16

Comparing two string values - operands are:

  • a function that returns a string
  • a string constant with special characters \ and "

Problem: detect if the /tmp/hello file content is equal to:

\" //hello ?\"

Option 1) write the string directly

last(/Zabbix server/vfs.file.contents[/tmp/hello])="\\\" //hello ?\\\""

Notice how \ and " characters are escaped when the string gets compared directly.
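
The escaping in Option 1 follows a simple rule: each backslash becomes \\ and each double quote becomes \", and the result is wrapped in double quotes. The Python sketch below applies that rule to the file content from this example. The helper name quote_constant is hypothetical and the sketch only covers these two special characters.

```python
def quote_constant(raw):
    """Escape backslashes first, then double quotes, and wrap in double quotes."""
    escaped = raw.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


content = r'\" //hello ?\"'  # the expected content of /tmp/hello
print(quote_constant(content))
```

Backslashes have to be escaped before quotes. Doing it in the other order would double the backslashes that were just added in front of the quotes.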

Option 2) use a macro

{$HELLO_MACRO} = \" //hello ?\"

in the expression:

last(/Zabbix server/vfs.file.contents[/tmp/hello])={$HELLO_MACRO}

Example 17

Comparing long-term periods.

Problem: Load of Exchange server increased by more than 10% last month

trendavg(/Exchange/system.cpu.load,1M:now/M)>1.1*trendavg(/Exchange/system.cpu.load,1M:now/M-1M)

You may also use the Event name field in trigger configuration to build a meaningful alert message, for example to receive something like

"Load of Exchange server increased by 24% in July (0.69) comparing to June (0.56)"

the event name must be defined as:

Load of {HOST.HOST} server increased by {{?100*trendavg(//system.cpu.load,1M:now/M)/trendavg(//system.cpu.load,1M:now/M-1M)}.fmtnum(0)}% in {{TIME}.fmttime(%B,-1M)} ({{?trendavg(//system.cpu.load,1M:now/M)}.fmtnum(2)}) comparing to {{TIME}.fmttime(%B,-2M)} ({{?trendavg(//system.cpu.load,1M:now/M-1M)}.fmtnum(2)})

It is also useful to allow manual closing in trigger configuration for this kind of problem.

Hysteresis

Sometimes an interval is needed between problem and recovery states, rather than a simple threshold. For example, if we want to define a trigger that reports a problem when server room temperature goes above 20°C and we want it to stay in the problem state until the temperature drops below 15°C, a simple trigger threshold at 20°C will not be enough.

Instead, we need to define a trigger expression for the problem event first (temperature above 20°C). Then we need to define an additional recovery condition (temperature below 15°C). This is done by defining an additional Recovery expression parameter when defining a trigger.

In this case, problem recovery will take place in two steps:

  • First, the problem expression (temperature above 20°C) will have to evaluate to FALSE
  • Second, the recovery expression (temperature below 15°C) will have to evaluate to TRUE

The recovery expression is evaluated only after the problem expression has evaluated to FALSE.

The recovery expression being TRUE alone does not resolve a problem if the problem expression is still TRUE!
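
The two-step recovery can be modelled as a small state machine. The Python sketch below is a simplified illustration of the rules described above, using the 20 and 15 degree thresholds from the text. The function names and the temperature series are assumptions, and real trigger processing involves more than this.

```python
def next_state(in_problem, problem_true, recovery_true=None):
    """Return True if the trigger is in the problem state after this evaluation."""
    if problem_true:
        return True   # problem expression TRUE: problem, whatever recovery says
    if not in_problem:
        return False  # nothing to recover from
    if recovery_true is None:
        return False  # no recovery expression: problem FALSE is enough
    return not recovery_true  # recover only once the recovery expression is TRUE


def simulate(temps, high=20, low=15):
    state = False
    history = []
    for temp in temps:
        state = next_state(state, temp > high, temp < low)
        history.append(state)
    return history


# Hypothetical temperature readings.
print(simulate([18, 21, 19, 16, 14, 13, 21]))
```

In this series the readings of 19 and 16 keep the trigger in the problem state even though they are no longer above 20, which is the hysteresis effect.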

Example 1

Temperature in server room is too high.

Problem expression:

last(/server/temp)>20

Recovery expression:

last(/server/temp)<=15

Example 2

Free disk space is too low.

Problem expression: it is less than 10GB for last 5 minutes

max(/server/vfs.fs.size[/,free],5m)<10G

Recovery expression: it is more than 40GB for last 10 minutes

min(/server/vfs.fs.size[/,free],10m)>40G

Expressions with unsupported items and unknown values

Versions before Zabbix 3.2 are very strict about unsupported items in a trigger expression. Any unsupported item in the expression immediately sets the trigger value to Unknown.

Since Zabbix 3.2 there is a more flexible approach to unsupported items by admitting unknown values into expression evaluation:

  • For the nodata() function, the values are not affected by whether an item is supported or unsupported. The function is evaluated even if it refers to an unsupported item.
  • Logical expressions with OR and AND can be evaluated to known values in two cases regardless of unknown operands:
    • "1 or Unsupported_item1.some_function() or Unsupported_item2.some_function() or ..." can be evaluated to '1' (True),
    • "0 and Unsupported_item1.some_function() and Unsupported_item2.some_function() and ..." can be evaluated to '0' (False).
      Zabbix tries to evaluate logical expressions taking unsupported items as Unknown values. In the two cases mentioned above a known value will be produced; in other cases trigger value will be Unknown.
  • If a function evaluation for supported item results in error, the function value is Unknown and it takes part in further expression evaluation.

Note that unknown values may "disappear" only in logical expressions as described above. In arithmetic expressions unknown values always lead to result Unknown (except division by 0).

If a trigger expression with several unsupported items evaluates to Unknown the error message in the frontend refers to the last unsupported item evaluated.