HashiCorp Nomad by HTTP
Overview
This template is designed to monitor HashiCorp Nomad by Zabbix. It works without any external scripts. Currently the template supports discovery of Nomad servers and clients.
Requirements
Zabbix version: 7.2 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
- Create a synthetic Nomad host. It should be one of the Nomad cluster members, a load-balancing service (if a cluster is used), or a single node in the selected Nomad region.
- Define the {$NOMAD.ENDPOINT.API.URL} macro value with the correct web protocol, host and port.
- Prepare an ACL token with the node:read, namespace:read-job, agent:read and management permissions applied. Define the {$NOMAD.TOKEN} macro value.
Refer to the vendor documentation about or if you have the HashiCorp Vault integration configured.
Additional information:
- The synthetic Nomad host is used only as an endpoint for discovering servers and clients (general cluster information). It is not monitored as a Nomad server or client, which prevents duplicate entities.
- If you're not using ACLs, skip the third setup step.
- Nomad server/client discovery is limited to a single region. If you're using a multi-region cluster, create one synthetic host per region.
- The Nomad server/client templates can also be used separately. Feel free to use them if you prefer manual host creation.
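Before assigning the macros, it can help to confirm that the endpoint URL and token are well formed. The following is a minimal Python sketch, assuming the standard Nomad HTTP API path /v1/nodes and the X-Nomad-Token header. The function name and the sample values are illustrative and are not part of the template.

```python
import urllib.request


def build_nodes_request(api_url: str, token: str = "") -> urllib.request.Request:
    """Build a GET request for the Nomad client node list.

    api_url plays the role of {$NOMAD.ENDPOINT.API.URL} and token the role
    of {$NOMAD.TOKEN}; both values used below are placeholders.
    """
    url = api_url.rstrip("/") + "/v1/nodes"
    request = urllib.request.Request(url, method="GET")
    if token:
        # Nomad reads the ACL token from the X-Nomad-Token header.
        request.add_header("X-Nomad-Token", token)
    return request


if __name__ == "__main__":
    req = build_nodes_request("http://localhost:4646", "example-token")
    print(req.full_url)
    # Sending the request needs a reachable Nomad agent, for example:
    # with urllib.request.urlopen(req, timeout=15) as resp:
    #     print(resp.status)
```

If ACLs are disabled, the token argument can be left empty and no header is added.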
Useful links
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.ENDPOINT.API.URL} | API endpoint URL for one of the Nomad cluster members. | http://localhost:4646 |
{$NOMAD.TOKEN} | Nomad authentication token. | <PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. | 15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used. | |
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. | 200 |
{$NOMAD.SERVER.NAME.MATCHES} | The filter to include HashiCorp Nomad servers by name. | .* |
{$NOMAD.SERVER.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad servers by name. | CHANGE_IF_NEEDED |
{$NOMAD.SERVER.DC.MATCHES} | The filter to include HashiCorp Nomad servers by datacenter belonging. | .* |
{$NOMAD.SERVER.DC.NOT_MATCHES} | The filter to exclude HashiCorp Nomad servers by datacenter belonging. | CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.NAME.MATCHES} | The filter to include HashiCorp Nomad clients by name. | .* |
{$NOMAD.CLIENT.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by name. | CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.DC.MATCHES} | The filter to include HashiCorp Nomad clients by datacenter belonging. | .* |
{$NOMAD.CLIENT.DC.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by datacenter belonging. | CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES} | The filter to include HashiCorp Nomad clients by scheduling eligibility. | .* |
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by scheduling eligibility. | CHANGE_IF_NEEDED |
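The MATCHES / NOT_MATCHES macro pairs work as include and exclude filters. The sketch below illustrates the idea in Python, assuming an entity is kept when its value matches the include pattern and does not match the exclude pattern. Python's re module only approximates the regular expression engine used by the monitoring system, and the function name and host names are hypothetical.

```python
import re


def passes_filter(value: str, matches: str = r".*",
                  not_matches: str = "CHANGE_IF_NEEDED") -> bool:
    """Return True when value is kept by a MATCHES / NOT_MATCHES macro pair."""
    included = re.search(matches, value) is not None
    excluded = re.search(not_matches, value) is not None
    return included and not excluded


if __name__ == "__main__":
    # Hypothetical server names; exclude anything containing "-test-".
    servers = ["nomad-server-1", "nomad-server-2", "nomad-test-1"]
    kept = [s for s in servers if passes_filter(s, not_matches=r"-test-")]
    print(kept)
```

With the defaults, every name passes unless it literally contains CHANGE_IF_NEEDED, which is why that placeholder is a safe exclude value.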
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Nomad clients get | Nomad clients data in raw format. | HTTP agent | nomad.client.nodes.get |
Client nodes API response | Client nodes API response message. | Dependent item | nomad.client.nodes.api.response |
Nomad servers get | Nomad servers data in raw format. | Script | nomad.server.nodes.get |
Server-related APIs response | Server-related ( | Dependent item | nomad.server.api.response |
Region | Current cluster region. | Dependent item | nomad.region |
Nomad servers count | Nomad servers count. | Dependent item | nomad.servers.count |
Nomad clients count | Nomad clients count. | Dependent item | nomad.clients.count |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad: Client nodes API connection has failed | Client nodes API connection has failed. | find(/HashiCorp Nomad by HTTP/nomad.client.nodes.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes |
HashiCorp Nomad: Server-related API connection has failed | Server-related API connection has failed. | find(/HashiCorp Nomad by HTTP/nomad.server.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes |
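The availability triggers fire when the latest API response value does not contain the expected success code. A rough Python equivalent of that check follows, assuming that find() with the "like" operator and no evaluation period performs a substring match on the most recent value. The function name and the response strings are hypothetical.

```python
def api_connection_failed(last_response: str, success_code: str = "200") -> bool:
    """Mimic find(...,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 on the latest value.

    find() returns 1 when the pattern is found; the trigger compares the
    result with 0, so it fires when the success code is absent.
    """
    return success_code not in last_response


if __name__ == "__main__":
    # Hypothetical response messages stored by the API response items.
    for response in ("HTTP/1.1 200 OK", "HTTP/1.1 403 Forbidden"):
        print(response, "->", "PROBLEM" if api_connection_failed(response) else "OK")
```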
LLD rule Clients discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Clients discovery | Client nodes discovery. | Dependent item | nomad.clients.discovery |
LLD rule Servers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Servers discovery | Server nodes discovery. | Dependent item | nomad.servers.discovery |
HashiCorp Nomad Client by HTTP
Overview
This template is designed to monitor HashiCorp Nomad clients by Zabbix. It works without any external scripts.
Requirements
Zabbix version: 7.2 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
- Enable telemetry in HashiCorp Nomad agent configuration file. Set the Prometheus metrics format.
Refer to the .
- Prepare an ACL token with the node:read and namespace:read-job permissions applied. Define the {$NOMAD.TOKEN} macro value.
Refer to the vendor documentation about or if you're using integration with HashiCorp Vault.
- Set the values for the {$NOMAD.CLIENT.API.SCHEME} and {$NOMAD.CLIENT.API.PORT} macros to define the common Nomad API web scheme and connection port.
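Once telemetry is enabled in Prometheus format, the agent exposes plain-text metrics (on a standard Nomad agent, under /v1/metrics?format=prometheus). The sketch below shows how a single value could be extracted from such text. It assumes sample lines carry no trailing timestamp, and the sample payload, label names and function name are illustrative rather than taken from the template.

```python
from typing import Optional

# Hypothetical excerpt of a Prometheus-format response.
SAMPLE = """\
# HELP nomad_client_allocated_cpu nomad_client_allocated_cpu
# TYPE nomad_client_allocated_cpu gauge
nomad_client_allocated_cpu{datacenter="dc1",node_class="none"} 500
nomad_client_unallocated_cpu{datacenter="dc1",node_class="none"} 3500
"""


def read_metric(prometheus_text: str, metric_name: str) -> Optional[float]:
    """Return the first sample value for metric_name, or None if absent."""
    for line in prometheus_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (no timestamp assumed).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            return float(value)
    return None


if __name__ == "__main__":
    print(read_metric(SAMPLE, "nomad_client_allocated_cpu"))
```

A missing metric returns None, which mirrors the note that some metrics may not be collected depending on the agent version and configuration.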
Additional information:
- You need to prepare an additional ACL token only if you wish to monitor Nomad clients as separate entities. If you're using clients discovery, the token is inherited from the master host linked to the HashiCorp Nomad by HTTP template.
- If you're not using ACLs, skip the second setup step.
- The Nomad clients use the default web scheme, HTTP, and the default API port, 4646. If you're using clients discovery and need to redefine the macros for a particular host created from a prototype, use context macros such as {$NOMAD.CLIENT.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.CLIENT.API.PORT:NECESSARY.IP} at the master host or template level.
- Some metrics may not be collected depending on your HashiCorp Nomad agent version and configuration.
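The context macro override described in the list above can be pictured as a lookup that prefers a context-specific value and falls back to the plain macro. This is a simplified Python sketch covering exact context matches only; the data structure, function name and IP address are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

MacroKey = Tuple[str, Optional[str]]  # (macro name, context or None)


def resolve_macro(macros: Dict[MacroKey, str], name: str, context: str) -> str:
    """Prefer the value defined for this context, else the context-free default."""
    if (name, context) in macros:
        return macros[(name, context)]
    return macros[(name, None)]


if __name__ == "__main__":
    macros = {
        ("{$NOMAD.CLIENT.API.PORT}", None): "4646",        # template default
        ("{$NOMAD.CLIENT.API.PORT}", "10.0.0.5"): "4747",  # hypothetical override
    }
    print(resolve_macro(macros, "{$NOMAD.CLIENT.API.PORT}", "10.0.0.5"))
    print(resolve_macro(macros, "{$NOMAD.CLIENT.API.PORT}", "10.0.0.6"))
```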
Useful links:
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.CLIENT.API.SCHEME} | Nomad client API scheme. | http |
{$NOMAD.CLIENT.API.PORT} | Nomad client API port. | 4646 |
{$NOMAD.TOKEN} | Nomad authentication token. | <PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. | 15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used. | |
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. | 200 |
{$NOMAD.CLIENT.RPC.PORT} | Nomad RPC service port. | 4647 |
{$NOMAD.CLIENT.SERF.PORT} | Nomad serf service port. | 4648 |
{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors. | 90 |
{$NOMAD.DISK.NAME.MATCHES} | The filter to include HashiCorp Nomad client disks by name. | .* |
{$NOMAD.DISK.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client disks by name. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.NAME.MATCHES} | The filter to include HashiCorp Nomad client jobs by name. | .* |
{$NOMAD.JOB.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by name. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.NAMESPACE.MATCHES} | The filter to include HashiCorp Nomad client jobs by namespace. | .* |
{$NOMAD.JOB.NAMESPACE.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by namespace. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.TYPE.MATCHES} | The filter to include HashiCorp Nomad client jobs by type. | .* |
{$NOMAD.JOB.TYPE.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by type. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.TASK.GROUP.MATCHES} | The filter to include HashiCorp Nomad client jobs by task group belonging. | .* |
{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by task group belonging. | CHANGE_IF_NEEDED |
{$NOMAD.DRIVER.NAME.MATCHES} | The filter to include HashiCorp Nomad client drivers by name. | .* |
{$NOMAD.DRIVER.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client drivers by name. | CHANGE_IF_NEEDED |
{$NOMAD.DRIVER.DETECT.MATCHES} | The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: | .* |
{$NOMAD.DRIVER.DETECT.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: | CHANGE_IF_NEEDED |
{$NOMAD.CPU.UTIL.MIN} | CPU utilization threshold. Measured as a percentage. | 90 |
{$NOMAD.RAM.AVAIL.MIN} | Available memory threshold. Measured as a percentage. | 5 |
{$NOMAD.INODES.FREE.MIN.WARN} | Warning threshold of the filesystem metadata utilization. Measured as a percentage. | 20 |
{$NOMAD.INODES.FREE.MIN.CRIT} | Critical threshold of the filesystem metadata utilization. Measured as a percentage. | 10 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Telemetry get | Telemetry data in raw format. | HTTP agent | nomad.client.data.get |
Metrics | Nomad client metrics in raw format. | Dependent item | nomad.client.metrics.get |
Monitoring API response | Monitoring API response message. | Dependent item | nomad.client.data.api.response |
Service [rpc] state | Current [rpc] service state. | Simple check | net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}] |
Service [serf] state | Current [serf] service state. | Simple check | net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}] |
CPU allocated | Total amount of CPU shares the scheduler has allocated to tasks. | Dependent item | nomad.client.allocated.cpu |
CPU unallocated | Total amount of CPU shares free for the scheduler to allocate to tasks. | Dependent item | nomad.client.unallocated.cpu |
Memory allocated | Total amount of memory the scheduler has allocated to tasks. | Dependent item | nomad.client.allocated.memory |
Memory unallocated | Total amount of memory free for the scheduler to allocate to tasks. | Dependent item | nomad.client.unallocated.memory |
Disk allocated | Total amount of disk space the scheduler has allocated to tasks. | Dependent item | nomad.client.allocated.disk |
Disk unallocated | Total amount of disk space free for the scheduler to allocate to tasks. | Dependent item | nomad.client.unallocated.disk |
Allocations blocked | Number of allocations waiting for previous versions. | Dependent item | nomad.client.allocations.blocked |
Allocations migrating | Number of allocations migrating data from previous versions. | Dependent item | nomad.client.allocations.migrating |
Allocations pending | Number of allocations pending (received by the client but not yet running). | Dependent item | nomad.client.allocations.pending |
Allocations starting | Number of allocations starting. | Dependent item | nomad.client.allocations.start |
Allocations running | Number of allocations running. | Dependent item | nomad.client.allocations.running |
Allocations terminal | Number of allocations terminal. | Dependent item | nomad.client.allocations.terminal |
Allocations failed, rate | Number of allocations failed. | Dependent item | nomad.client.allocations.failed |
Allocations completed, rate | Number of allocations completed. | Dependent item | nomad.client.allocations.complete |
Allocations restarted, rate | Number of allocations restarted. | Dependent item | nomad.client.allocations.restart |
Allocations OOM killed | Number of allocations OOM killed. | Dependent item | nomad.client.allocations.oom_killed |
CPU idle utilization | CPU utilization in idle state. | Dependent item | nomad.client.cpu.idle |
CPU system utilization | CPU utilization in system space. | Dependent item | nomad.client.cpu.system |
CPU total utilization | Total CPU utilization. | Dependent item | nomad.client.cpu.total |
CPU user utilization | CPU utilization in user space. | Dependent item | nomad.client.cpu.user |
Memory available | Total amount of memory available to processes which includes free and cached memory. | Dependent item | nomad.client.memory.available |
Memory free | Amount of memory which is free. | Dependent item | nomad.client.memory.free |
Memory size | Total amount of physical memory on the node. | Dependent item | nomad.client.memory.total |
Memory used | Amount of memory used by processes. | Dependent item | nomad.client.memory.used |
Uptime | Uptime of the host running the Nomad client. | Dependent item | nomad.client.uptime |
Node info get | Node info data in raw format. | HTTP agent | nomad.client.node.info.get |
Nomad client version | Nomad client version. | Dependent item | nomad.client.version |
Nodes API response | Nodes API response message. | Dependent item | nomad.client.node.info.api.response |
Allocated jobs get | Allocated jobs data in raw format. | HTTP agent | nomad.client.job.allocs.get |
Allocations API response | Allocations API response message. | Dependent item | nomad.client.job.allocs.api.response |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Monitoring API connection has failed | Monitoring API connection has failed. | find(/HashiCorp Nomad Client by HTTP/nomad.client.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes |
HashiCorp Nomad Client: Service [rpc] is down | Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}. | last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]) = 0 | Average | Manual close: Yes |
HashiCorp Nomad Client: Service [serf] is down | Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}. | last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]) = 0 | Average | Manual close: Yes |
HashiCorp Nomad Client: OOM killed allocations found | OOM killed allocations found. | last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0 | Warning | Manual close: Yes |
HashiCorp Nomad Client: High CPU utilization | CPU utilization is too high. The system might be slow to respond. | min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN} | Average | |
HashiCorp Nomad Client: High memory utilization | RAM utilization is too high. The system might be slow to respond. | (min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN} | Average | |
HashiCorp Nomad Client: The host has been restarted | The host uptime is less than 10 minutes. | last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m | Warning | Manual close: Yes |
HashiCorp Nomad Client: Nomad client version has changed | Nomad client version has changed. | change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0 | Info | Manual close: Yes |
HashiCorp Nomad Client: Nodes API connection has failed | Nodes API connection has failed. | find(/HashiCorp Nomad Client by HTTP/nomad.client.node.info.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes Depends on: |
HashiCorp Nomad Client: Allocations API connection has failed | Allocations API connection has failed. | find(/HashiCorp Nomad Client by HTTP/nomad.client.job.allocs.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes Depends on: |
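The High memory utilization expression in the table above is a percentage comparison: the lowest available-memory reading over 10 minutes, divided by total memory, is compared with {$NOMAD.RAM.AVAIL.MIN}. A small Python sketch of that arithmetic follows; the function name and sample byte counts are made up for illustration.

```python
from typing import Sequence


def high_memory_utilization(available_samples: Sequence[float],
                            total_bytes: float,
                            avail_min_percent: float = 5.0) -> bool:
    """Mirror (min(available, 10m) / last(total)) * 100 <= {$NOMAD.RAM.AVAIL.MIN}."""
    available_percent = min(available_samples) / total_bytes * 100
    return available_percent <= avail_min_percent


if __name__ == "__main__":
    total = 8_000_000_000  # hypothetical node with 8 GB of memory
    # Lowest reading is 300 MB, i.e. 3.75 % of total, so the trigger fires.
    print(high_memory_utilization([400_000_000, 300_000_000], total))
    # Lowest reading is 1 GB, i.e. 12.5 % of total, so it does not fire.
    print(high_memory_utilization([2_000_000_000, 1_000_000_000], total))
```

Because min() is used over the window, a single low reading within the last 10 minutes is enough to satisfy the condition.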
LLD rule Drivers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Drivers discovery | Client drivers discovery. | Dependent item | nomad.client.drivers.discovery |
Item prototypes for Drivers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Driver [{#DRIVER.NAME}] state | Driver [{#DRIVER.NAME}] state. | Dependent item | nomad.client.driver.state["{#DRIVER.NAME}"] |
Driver [{#DRIVER.NAME}] detection state | Driver [{#DRIVER.NAME}] detection state. | Dependent item | nomad.client.driver.detected["{#DRIVER.NAME}"] |
Trigger prototypes for Drivers discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] is in unhealthy state | The [{#DRIVER.NAME}] driver detected, but its state is unhealthy. | last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1 | Warning | Manual close: Yes |
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state has changed | The [{#DRIVER.NAME}] driver detection state has changed. | change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0 | Info | Manual close: Yes |
LLD rule Physical disks discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Physical disks discovery | Physical disks discovery. | Dependent item | nomad.client.disk.discovery |
Item prototypes for Physical disks discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Disk ["{#DEV.NAME}"] space available | Amount of space which is available on ["{#DEV.NAME}"] disk. | Dependent item | nomad.client.disk.available["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] inodes utilization | Disk space consumed by the inodes on ["{#DEV.NAME}"] disk. | Dependent item | nomad.client.disk.inodes_percent["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] size | Total size of the ["{#DEV.NAME}"] device. | Dependent item | nomad.client.disk.size["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] space utilization | Percentage of disk ["{#DEV.NAME}"] space used. | Dependent item | nomad.client.disk.used_percent["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] space used | Amount of disk ["{#DEV.NAME}"] space which has been used. | Dependent item | nomad.client.disk.used["{#DEV.NAME}"] |
Trigger prototypes for Physical disks discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device | It may become impossible to write to a disk if there are no index nodes left. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"} | Warning | Manual close: Yes Depends on: |
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device | It may become impossible to write to a disk if there are no index nodes left. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.CRIT:"{#DEV.NAME}"} | Average | Manual close: Yes |
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization | High disk [{#DEV.NAME}] utilization. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"} | Warning | Manual close: Yes Depends on: |
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization | High disk [{#DEV.NAME}] utilization. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"} | Average | Manual close: Yes |
LLD rule Allocated jobs discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Allocated jobs discovery | Allocated jobs discovery. | Dependent item | nomad.client.alloc.discovery |
Item prototypes for Allocated jobs discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Job ["{#JOB.NAME}"] CPU allocated | Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores. | Dependent item | nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU system utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space. | Dependent item | nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU user utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space. | Dependent item | nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU total utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores. | Dependent item | nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU throttled periods time | Total number of CPU periods that the ["{#JOB.NAME}"] job was throttled. | Dependent item | nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU throttled time | Total time that the ["{#JOB.NAME}"] job was throttled. | Dependent item | nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU ticks | CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval. | Dependent item | nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory allocated | Amount of memory allocated by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory cached | Amount of memory cached by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory used | Total amount of memory used by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory swapped | Amount of memory swapped by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
HashiCorp Nomad Server by HTTP
Overview
This template is designed to monitor HashiCorp Nomad servers by Zabbix. It works without any external scripts.
Requirements
Zabbix version: 7.2 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
- Enable telemetry in HashiCorp Nomad agent configuration file. Set the Prometheus metrics format.
Refer to the .
- Set the values for the {$NOMAD.SERVER.API.SCHEME} and {$NOMAD.SERVER.API.PORT} macros to define the common Nomad API web scheme and connection port.
Additional information:
- The Nomad servers use the default web scheme, HTTP, and the default API port, 4646. If you're using servers discovery and need to redefine the macros for a particular host created from a prototype, use context macros such as {$NOMAD.SERVER.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.SERVER.API.PORT:NECESSARY.IP} at the master host or template level.
- Some metrics may not be collected depending on your HashiCorp Nomad agent version, configuration and cluster role.
- Don't forget to define the {$NOMAD.REDUNDANCY.MIN} macro value based on the number of nodes in your cluster, so that the failure tolerance triggers are configured correctly.
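One way to reason about a value for {$NOMAD.REDUNDANCY.MIN} is the usual Raft quorum arithmetic: a cluster of n servers needs a majority to stay available, so it tolerates n minus quorum failures. The Python sketch below shows that calculation. How the template's triggers consume the macro is not described here, so treat this only as a guide for picking a number; the function name is hypothetical.

```python
def failure_tolerance(server_count: int) -> int:
    """Return how many servers can fail while a Raft majority remains."""
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    quorum = server_count // 2 + 1  # smallest majority
    return server_count - quorum


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(f"{n} servers -> tolerates {failure_tolerance(n)} failure(s)")
```

For a 3-node cluster the result is 1, which matches the macro's default value.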
Useful links:
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.SERVER.API.SCHEME} | Nomad server API scheme. | http |
{$NOMAD.SERVER.API.PORT} | Nomad server API port. | 4646 |
{$NOMAD.TOKEN} | Nomad authentication token. | <PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. | 15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used. | |
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. | 200 |
{$NOMAD.SERVER.RPC.PORT} | Nomad RPC service port. | 4647 |
{$NOMAD.SERVER.SERF.PORT} | Nomad serf service port. | 4648 |
{$NOMAD.REDUNDANCY.MIN} | Number of redundant servers needed to keep the cluster safe. Default value: 1 for a 3-node cluster. Change if needed. | 1 |
{$NOMAD.OPEN.FDS.MAX} | Maximum percentage of used file descriptors. | 90 |
{$NOMAD.SERVER.LEADER.LATENCY} | Leader last contact latency threshold. | 0.3s |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Telemetry get | Telemetry data in raw format. | HTTP agent | nomad.server.data.get |
Metrics | Nomad server metrics in raw format. | Dependent item | nomad.server.metrics.get |
Monitoring API response | Monitoring API response message. | Dependent item | nomad.server.data.api.response |
Internal stats get | Internal stats data in raw format. | HTTP agent | nomad.server.stats.get |
Internal stats API response | Internal stats API response message. | Dependent item | nomad.server.stats.api.response |
Nomad server version | Nomad server version. | Dependent item | nomad.server.version |
Nomad raft version | Nomad raft version. | Dependent item | nomad.raft.version |
Raft peers | Current cluster raft peers amount. | Dependent item | nomad.server.raft.peers |
Cluster role | Current role in the cluster. | Dependent item | nomad.server.raft.cluster_role |
CPU time, rate | Total user and system CPU time spent in seconds. | Dependent item | nomad.server.cpu.time |
Memory used | Memory utilization in bytes. | Dependent item | nomad.server.runtime.alloc_bytes |
Virtual memory size | Virtual memory size in bytes. | Dependent item | nomad.server.virtual_memory_bytes |
Resident memory size | Resident memory size in bytes. | Dependent item | nomad.server.resident_memory_bytes |
Heap objects | Number of objects on the heap. General memory pressure indicator. | Dependent item | nomad.server.runtime.heap_objects |
Open file descriptors | Number of open file descriptors. | Dependent item | nomad.server.process_open_fds |
Open file descriptors, max | Maximum number of open file descriptors. | Dependent item | nomad.server.process_max_fds |
Goroutines | Number of goroutines and general load pressure indicator. | Dependent item | nomad.server.runtime.num_goroutines |
Evaluations pending | Evaluations that are pending until an existing evaluation for the same job completes. | Dependent item | nomad.server.broker.total_pending |
Evaluations ready | Number of evaluations ready to be processed. | Dependent item | nomad.server.broker.total_ready |
Evaluations unacked | Evaluations dispatched for processing but incomplete. | Dependent item | nomad.server.broker.total_unacked |
CPU shares for blocked evaluations | Amount of CPU shares requested by blocked evals. | Dependent item | nomad.server.blocked_evals.cpu |
Memory shares by blocked evaluations | Amount of memory requested by blocked evals. | Dependent item | nomad.server.blocked_evals.memory |
CPU shares for blocked job evaluations | Amount of CPU shares requested by blocked evals of a job. | Dependent item | nomad.server.blocked_evals.job.cpu |
Memory shares for blocked job evaluations | Amount of memory requested by blocked evals of a job. | Dependent item | nomad.server.blocked_evals.job.memory |
Evaluations blocked | Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits). | Dependent item | nomad.server.blocked_evals.total_blocked |
Evaluations escaped | Count of evals that have escaped computed node classes. This indicates a scheduler optimization was skipped and is not usually a source of concern. | Dependent item | nomad.server.blocked_evals.total_escaped |
Evaluations waiting | Count of evals waiting to be enqueued. | Dependent item | nomad.server.broker.total_waiting |
Evaluations blocked due to quota limit | Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked). | Dependent item | nomad.server.blocked_evals.total_quota_limit |
Evaluations enqueue time | Average time elapsed with evaluations waiting to be enqueued. | Dependent item | nomad.server.broker.eval_waiting |
RPC evaluation acknowledgement time | Time elapsed for Eval.Ack RPC call. | Dependent item | nomad.server.eval.ack |
RPC job summary time | Time elapsed for Job.Summary RPC call. | Dependent item | nomad.server.job_summary.get_job_summary |
Heartbeats active | Number of active heartbeat timers. Each timer represents a Nomad client connection. | Dependent item | nomad.server.heartbeat.active |
RPC requests, rate | Number of RPC requests being handled. | Dependent item | nomad.server.rpc.request |
RPC error requests, rate | Number of RPC requests being handled that result in an error. | Dependent item | nomad.server.rpc.request_error |
RPC queries, rate | Number of RPC queries. | Dependent item | nomad.server.rpc.query |
RPC job allocations time | Time elapsed for Job.Allocations RPC call. | Dependent item | nomad.server.job.allocations |
RPC job evaluations time | Time elapsed for Job.Evaluations RPC call. | Dependent item | nomad.server.job.evaluations |
RPC get job time | Time elapsed for Job.GetJob RPC call. | Dependent item | nomad.server.job.get_job |
Plan apply time | Time elapsed to apply a plan. | Dependent item | nomad.server.plan.apply |
Plan evaluate time | Time elapsed to evaluate a plan. | Dependent item | nomad.server.plan.evaluate |
RPC plan submit time | Time elapsed for Plan.Submit RPC call. | Dependent item | nomad.server.plan.submit |
Plan raft index processing time | Time elapsed that planner waits for the raft index of the plan to be processed. | Dependent item | nomad.server.plan.wait_for_index |
RPC list time | Time elapsed for Node.List RPC call. | Dependent item | nomad.server.client.list |
RPC update allocations time | Time elapsed for Node.UpdateAlloc RPC call. | Dependent item | nomad.server.client.update_alloc |
RPC update status time | Time elapsed for Node.UpdateStatus RPC call. | Dependent item | nomad.server.client.update_status |
RPC get client allocs time | Time elapsed for Node.GetClientAllocs RPC call. | Dependent item | nomad.server.client.get_client_allocs |
RPC eval dequeue time | Time elapsed for Eval.Dequeue RPC call. | Dependent item | nomad.server.client.dequeue |
Vault token last renewal | Time since last successful Vault token renewal. | Dependent item | nomad.server.vault.token_last_renewal |
Vault token next renewal | Time until next Vault token renewal attempt. | Dependent item | nomad.server.vault.token_next_renewal |
Vault token TTL | Time to live for Vault token. | Dependent item | nomad.server.vault.token_ttl |
Vault tokens revoked | Count of revoked tokens. | Dependent item | nomad.server.vault.distributed_tokens_revoked |
Jobs dead | Number of dead jobs. | Dependent item | nomad.server.job_status.dead |
Jobs pending | Number of pending jobs. | Dependent item | nomad.server.job_status.pending |
Jobs running | Number of running jobs. | Dependent item | nomad.server.job_status.running |
Job allocations completed | Number of complete allocations for a job. | Dependent item | nomad.server.job_summary.complete |
Job allocations failed | Number of failed allocations for a job. | Dependent item | nomad.server.job_summary.failed |
Job allocations lost | Number of lost allocations for a job. | Dependent item | nomad.server.job_summary.lost |
Job allocations unknown | Number of unknown allocations for a job. | Dependent item | nomad.server.job_summary.unknown |
Job allocations queued | Number of queued allocations for a job. | Dependent item | nomad.server.job_summary.queued |
Job allocations running | Number of running allocations for a job. | Dependent item | nomad.server.job_summary.running |
Job allocations starting | Number of starting allocations for a job. | Dependent item | nomad.server.job_summary.starting |
Gossip time | Time elapsed to broadcast gossip messages. | Dependent item | nomad.server.memberlist.gossip |
|
Leader barrier time | Time elapsed to establish a raft barrier during leader transition. |
Dependent item | nomad.server.leader.barrier Preprocessing
|
Reconcile peer time | Time elapsed to reconcile a serf peer with state store. |
Dependent item | nomad.server.leader.reconcile_member Preprocessing
|
Total reconcile time | Time elapsed to reconcile all serf peers with state store. |
Dependent item | nomad.server.leader.reconcile Preprocessing
|
Leader last contact | Time since last contact to leader. General indicator of Raft latency. |
Dependent item | nomad.server.raft.leader.lastContact Preprocessing
|
Plan queue | Count of evals in the plan queue. |
Dependent item | nomad.server.plan.queue_depth Preprocessing
|
Worker evaluation create time | Time elapsed for worker to create an eval. |
Dependent item | nomad.server.worker.create_eval Preprocessing
|
Worker evaluation dequeue time | Time elapsed for worker to dequeue an eval. |
Dependent item | nomad.server.worker.dequeue_eval Preprocessing
|
Worker invoke scheduler time | Time elapsed for worker to invoke the scheduler. |
Dependent item | nomad.server.worker.invoke_scheduler_service Preprocessing
|
Worker acknowledgement send time | Time elapsed for worker to send acknowledgement. |
Dependent item | nomad.server.worker.send_ack Preprocessing
|
Worker submit plan time | Time elapsed for worker to submit plan. |
Dependent item | nomad.server.worker.submit_plan Preprocessing
|
Worker update evaluation time | Time elapsed for worker to submit updated eval. |
Dependent item | nomad.server.worker.update_eval Preprocessing
|
Worker log replication time | Time the worker waits for the raft index of the eval to be processed. |
Dependent item | nomad.server.worker.wait_for_index Preprocessing
|
Raft calls blocked, rate | Count of blocking raft API calls. |
Dependent item | nomad.server.raft.barrier Preprocessing
|
Raft commit logs enqueued | Count of logs enqueued. |
Dependent item | nomad.server.raft.commit_num_logs Preprocessing
|
Raft transactions, rate | Number of Raft transactions. |
Dependent item | nomad.server.raft.apply Preprocessing
|
Raft commit time | Time elapsed to commit writes. |
Dependent item | nomad.server.raft.commit_time Preprocessing
|
Raft transaction commit time | Raft transaction commit time. |
Dependent item | nomad.server.raft.replication.appendEntries Preprocessing
|
FSM apply time | Time elapsed to apply write to FSM. |
Dependent item | nomad.server.raft.fsm.apply Preprocessing
|
FSM enqueue time | Time elapsed to enqueue write to FSM. |
Dependent item | nomad.server.raft.fsm.enqueue Preprocessing
|
FSM autopilot time | Time elapsed to apply Autopilot raft entry. |
Dependent item | nomad.server.raft.fsm.autopilot Preprocessing
|
FSM register node time | Time elapsed to apply RegisterNode raft entry. |
Dependent item | nomad.server.raft.fsm.register_node Preprocessing
|
FSM index | Current index applied to FSM. |
Dependent item | nomad.server.raft.applied_index Preprocessing
|
Raft last index | Most recent index seen. |
Dependent item | nomad.server.raft.last_index Preprocessing
|
Dispatch log time | Time elapsed to write log, mark in flight, and start replication. |
Dependent item | nomad.server.raft.leader.dispatch_log Preprocessing
|
Logs dispatched | Count of logs dispatched. |
Dependent item | nomad.server.raft.leader.dispatch_num_logs Preprocessing
|
Heartbeat fails | Count of heartbeat timeouts that caused the server to start an election. |
Dependent item | nomad.server.raft.transition.heartbeat_timeout Preprocessing
|
Objects freed, rate | Count of objects freed from heap by Go runtime GC. |
Dependent item | nomad.server.runtime.free_count Preprocessing
|
GC pause time | Go runtime GC pause times. |
Dependent item | nomad.server.runtime.gc_pause_ns Preprocessing
|
GC metadata size | Go runtime GC metadata size in bytes. |
Dependent item | nomad.server.runtime.sys_bytes Preprocessing
|
GC runs | Count of Go runtime GC runs. |
Dependent item | nomad.server.runtime.total_gc_runs Preprocessing
|
Memberlist events | Count of memberlist events received. |
Dependent item | nomad.server.serf.queue.event Preprocessing
|
Memberlist changes | Count of memberlist changes. |
Dependent item | nomad.server.serf.queue.intent Preprocessing
|
Memberlist queries | Count of memberlist queries. |
Dependent item | nomad.server.serf.queue.queries Preprocessing
|
Snapshot index | Current snapshot index. |
Dependent item | nomad.server.state.snapshot.index Preprocessing
|
Services ready to schedule | Count of service evals ready to be scheduled. |
Dependent item | nomad.server.broker.service_ready Preprocessing
|
Services unacknowledged | Count of unacknowledged service evals. |
Dependent item | nomad.server.broker.service_unacked Preprocessing
|
System evaluations ready to schedule | Count of system evals ready to be scheduled. |
Dependent item | nomad.server.broker.system_ready Preprocessing
|
System evaluations unacknowledged | Count of unacknowledged system evals. |
Dependent item | nomad.server.broker.system_unacked Preprocessing
|
BoltDB free pages | Number of BoltDB free pages. |
Dependent item | nomad.server.raft.boltdb.num_free_pages Preprocessing
|
BoltDB pending pages | Number of BoltDB pending pages. |
Dependent item | nomad.server.raft.boltdb.num_pending_pages Preprocessing
|
BoltDB free page bytes | Number of free page bytes. |
Dependent item | nomad.server.raft.boltdb.free_page_bytes Preprocessing
|
BoltDB freelist bytes | Number of freelist bytes. |
Dependent item | nomad.server.raft.boltdb.freelist_bytes Preprocessing
|
BoltDB read transactions, rate | Count of total read transactions. |
Dependent item | nomad.server.raft.boltdb.total_read_txn Preprocessing
|
BoltDB open read transactions | Number of current open read transactions. |
Dependent item | nomad.server.raft.boltdb.open_read_txn Preprocessing
|
BoltDB pages in use | Number of pages in use. |
Dependent item | nomad.server.raft.boltdb.txstats.page_count Preprocessing
|
BoltDB page allocations, rate | Number of page allocations. |
Dependent item | nomad.server.raft.boltdb.txstats.page_alloc Preprocessing
|
BoltDB cursors | Count of total database cursors. |
Dependent item | nomad.server.raft.boltdb.txstats.cursor_count Preprocessing
|
BoltDB nodes, rate | Count of total database nodes. |
Dependent item | nomad.server.raft.boltdb.txstats.node_count Preprocessing
|
BoltDB node dereferences, rate | Count of total database node dereferences. |
Dependent item | nomad.server.raft.boltdb.txstats.node_deref Preprocessing
|
BoltDB rebalance operations, rate | Count of total rebalance operations. |
Dependent item | nomad.server.raft.boltdb.txstats.rebalance Preprocessing
|
BoltDB split operations, rate | Count of total split operations. |
Dependent item | nomad.server.raft.boltdb.txstats.split Preprocessing
|
BoltDB spill operations, rate | Count of total spill operations. |
Dependent item | nomad.server.raft.boltdb.txstats.spill Preprocessing
|
BoltDB write operations, rate | Count of total write operations. |
Dependent item | nomad.server.raft.boltdb.txstats.write Preprocessing
|
BoltDB rebalance time | Sample of rebalance operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.rebalance_time Preprocessing
|
BoltDB spill time | Sample of spill operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.spill_time Preprocessing
|
BoltDB write time | Sample of write operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.write_time Preprocessing
|
Service [rpc] state | Current [rpc] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}] Preprocessing
|
Service [serf] state | Current [serf] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}] Preprocessing
|
Namespace list time | Time elapsed for Namespace.ListNamespaces. |
Dependent item | nomad.server.namespace.list_namespace Preprocessing
|
Autopilot state | Current autopilot state. |
Dependent item | nomad.server.autopilot.state Preprocessing
|
Autopilot failure tolerance | The number of redundant healthy servers that can fail without causing an outage. |
Dependent item | nomad.server.autopilot.failure_tolerance Preprocessing
|
FSM allocation client update time | Time elapsed to apply AllocClientUpdate raft entry. |
Dependent item | nomad.server.alloc_client_update Preprocessing
|
FSM apply plan results time | Time elapsed to apply ApplyPlanResults raft entry. |
Dependent item | nomad.server.fsm.apply_plan_results Preprocessing
|
FSM update evaluation time | Time elapsed to apply UpdateEval raft entry. |
Dependent item | nomad.server.fsm.update_eval Preprocessing
|
FSM job registration time | Time elapsed to apply RegisterJob raft entry. |
Dependent item | nomad.server.fsm.register_job Preprocessing
|
Allocation reschedule attempts | Count of attempts to reschedule an allocation. |
Dependent item | nomad.server.scheduler.allocs.rescheduled.attempted Preprocessing
|
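Most items in the table above are dependent items, meaning each value is extracted from a payload collected once by a master item. The sketch below illustrates that idea only; it is not the template's actual preprocessing. It assumes the master item holds a JSON document with a `Gauges` array of `Name`/`Value` objects, and both the sample payload and the helper name `gauge_value` are hypothetical.

```python
import json
from typing import Optional


def gauge_value(metrics_json: str, name: str) -> Optional[float]:
    """Return the value of the named gauge, or None if it is absent.

    Assumes a payload shaped like {"Gauges": [{"Name": ..., "Value": ...}]}.
    """
    payload = json.loads(metrics_json)
    for gauge in payload.get("Gauges") or []:
        if gauge.get("Name") == name:
            return float(gauge["Value"])
    return None


# Hypothetical sample payload, not captured from a real server.
sample = json.dumps(
    {"Gauges": [{"Name": "nomad.nomad.heartbeat.active", "Value": 3}]}
)

print(gauge_value(sample, "nomad.nomad.heartbeat.active"))  # 3.0
print(gauge_value(sample, "nomad.nomad.does_not_exist"))    # None
```

Returning `None` for a missing metric mirrors the case where a dependent item receives no value, which is why several triggers below also check `nodata(...)`.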
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Server: Monitoring API connection has failed | Monitoring API connection has failed. |
find(/HashiCorp Nomad Server by HTTP/nomad.server.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Internal stats API connection has failed | Internal stats API connection has failed. |
find(/HashiCorp Nomad Server by HTTP/nomad.server.stats.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 |
Average | Manual close: Yes Depends on:
|
HashiCorp Nomad Server: Nomad server version has changed | Nomad server version has changed. |
change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0 |
Info | Manual close: Yes |
HashiCorp Nomad Server: Cluster role has changed | Cluster role has changed. |
change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0 |
Info | Manual close: Yes |
HashiCorp Nomad Server: Current number of open files is too high | Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue. |
min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX} |
Warning | |
HashiCorp Nomad Server: Dead jobs found | Jobs with the |
last(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead) > 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead,5m) = 0 |
Warning | Manual close: Yes |
HashiCorp Nomad Server: Leader last contact timeout exceeded | The nomad.raft.leader.lastContact metric is a general indicator of Raft latency which can be used to observe how Raft timing is performing and guide infrastructure provisioning. |
min(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) >= {$NOMAD.SERVER.LEADER.LATENCY} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) = 0 |
Warning | |
HashiCorp Nomad Server: Service [rpc] is down | Cannot establish a connection to the [rpc] service port {$NOMAD.SERVER.RPC.PORT}. |
last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Service [serf] is down | Cannot establish a connection to the [serf] service port {$NOMAD.SERVER.SERF.PORT}. |
last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Autopilot is unhealthy | The autopilot is in an unhealthy state. The probability of a successful failover is extremely low. |
last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0 |
Average | Manual close: Yes |
HashiCorp Nomad Server: Autopilot redundancy is low | The autopilot redundancy is low. |
last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance,5m) = 0 |
Warning | Manual close: Yes |
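The trigger expressions above combine a few simple checks. As a rough illustration, the sketch below mimics two of them in plain Python: the open files trigger (`min(...,5m)/last(...)*100>{$NOMAD.OPEN.FDS.MAX}`) and the dead jobs trigger (`last(...) > 0 and nodata(...,5m) = 0`). The function names and sample numbers are hypothetical, and the monitoring server evaluates the real expressions itself; this only shows the arithmetic.

```python
from typing import Optional, Sequence


def open_fds_too_high(
    open_fds_5m: Sequence[float], max_fds: float, threshold_pct: float
) -> bool:
    """Fire only if even the lowest reading in the window exceeds the threshold."""
    if not open_fds_5m or max_fds <= 0:
        return False  # no data or an invalid limit: do not fire
    usage_pct = min(open_fds_5m) / max_fds * 100
    return usage_pct > threshold_pct


def dead_jobs_found(last_dead: Optional[float], has_recent_data: bool) -> bool:
    """Fire when the latest value is positive and data arrived in the window."""
    return has_recent_data and last_dead is not None and last_dead > 0


# Hypothetical readings over a 5-minute window against a limit of 1000.
print(open_fds_too_high([950, 960, 970], 1000, 90))  # True
print(open_fds_too_high([950, 800, 970], 1000, 90))  # False
print(dead_jobs_found(2, has_recent_data=True))      # True
print(dead_jobs_found(2, has_recent_data=False))     # False
```

Using `min` over the window means a single spike does not fire the trigger; usage has to stay above the threshold for the whole 5 minutes.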
Feedback
Please report any issues with the template at
You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.