[role="xpack"]
[[task-manager-health-monitoring]]
=== Task Manager health monitoring
++++
<titleabbrev>Health monitoring</titleabbrev>
++++
experimental[]

The Task Manager has an internal monitoring mechanism to keep track of a variety of metrics, which can be consumed with either the health monitoring API or the {kib} server log.

The health monitoring API provides a reliable endpoint that can be monitored.
Consuming this endpoint doesn't cause additional load, but rather returns the latest health checks made by the system. This design enables consumption by external monitoring services at a regular cadence without additional load to the system.

Each {kib} instance exposes its own endpoint at:

[source,sh]
--------------------------------------------------
$ curl -X GET api/task_manager/_health
--------------------------------------------------
// KIBANA

Monitoring the `_health` endpoint of each {kib} instance in the cluster is the recommended method of ensuring confidence in mission-critical services such as Alerting and Actions.
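
For example, the following sketch polls one instance and prints the overall `status` field. The hostname, port, credential placeholders, and the use of `jq` are illustrative assumptions and will differ per deployment:

[source,sh]
--------------------------------------------------
# Poll the Task Manager health endpoint of a single Kibana instance
# (host, port, and credentials are placeholders) and print the overall status.
curl -s -u "$KIBANA_USER:$KIBANA_PASSWORD" \
  "http://localhost:5601/api/task_manager/_health" | jq -r '.status'
--------------------------------------------------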
[float]
[[task-manager-configuring-health-monitoring]]
==== Configuring the monitored health statistics

The health monitoring API monitors the performance of Task Manager out of the box. However, certain performance considerations are deployment-specific and you can configure them.

A health threshold is the threshold for failed task executions. Once a task exceeds this threshold, a status of `warn` or `error` is set on the task type execution. To configure a health threshold, use the <<task-manager-health-settings,`xpack.task_manager.monitored_task_execution_thresholds`>> setting. You can apply this setting to all task types in the system, or to a custom task type.

By default, this setting marks the health of every task type as `warning` when it exceeds 80% failed executions, and as `error` at 90%.

Set this value to a number between 0 and 100. The threshold is hit when the value *exceeds* this number.
To avoid a status of `error`, set the threshold at 100. To hit `error` the moment any task fails, set the threshold to 0.

Create a custom configuration to set lower thresholds for task types you consider critical, such as alerting tasks that you want to detect sooner in an external monitoring service.

[source,yml]
----
xpack.task_manager.monitored_task_execution_thresholds:
  default: # <1>
    error_threshold: 70
    warn_threshold: 50
  custom:
    "alerting:.index-threshold": # <2>
      error_threshold: 50
      warn_threshold: 0
----
<1> A default configuration that sets the system-wide `warn` threshold at a 50% failure rate, and the `error` threshold at a 70% failure rate.
<2> A custom configuration for the `alerting:.index-threshold` task type that sets its `warn` threshold at 0% (which sets a `warn` status the moment any task of that type fails), and its `error` threshold at a 50% failure rate.
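
To see how a task type is tracking against these thresholds, you can inspect the execution results reported for it by the health API. A minimal sketch, assuming `jq` is available and that per-type results are exposed under `stats.runtime.value.execution.result_frequency_percent_as_number` (the exact field path may vary between versions):

[source,sh]
--------------------------------------------------
# Show the execution result percentages and resulting status reported
# for the alerting:.index-threshold task type (field path is an assumption).
curl -s "http://localhost:5601/api/task_manager/_health" | \
  jq '.stats.runtime.value.execution.result_frequency_percent_as_number["alerting:.index-threshold"]'
--------------------------------------------------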
[float]
[[task-manager-consuming-health-stats]]
==== Consuming health stats

The health API is best consumed via the `/api/task_manager/_health` endpoint.

Additionally, there are two ways to consume these metrics:

*Debug logging*

The metrics are logged in the {kib} `DEBUG` logger at a regular cadence.
To enable Task Manager debug logging in your {kib} instance, add the following to your `kibana.yml`:

[source,yml]
----
logging:
  loggers:
    - context: plugins.taskManager
      appenders: [console]
      level: debug
----
These stats are logged based on the number of milliseconds set in your <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting, which could add substantial noise to your logs. Only enable this level of logging temporarily.

*Automatic logging*

By default, the health API runs at a regular cadence, and each time it runs, it attempts to self-evaluate its performance. If this self-evaluation yields a potential problem,
a message is logged to the {kib} server log. In addition, the health API looks at how long tasks have waited to start (from when they were scheduled to start). If this number exceeds a configurable threshold (<<task-manager-settings,`xpack.task_manager.monitored_stats_health_verbose_log.warn_delayed_task_start_in_seconds`>>), the same message is logged to the {kib} server log.

This message looks like:

[source,log]
----
Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging
----
If this message appears, set <<task-manager-settings,`xpack.task_manager.monitored_stats_health_verbose_log.enabled`>> to `true` in your `kibana.yml`. This will start logging the health metrics at either a `warn` or `error` log level, depending on the detected severity of the potential problem.

[float]
[[making-sense-of-task-manager-health-stats]]
==== Making sense of Task Manager health stats

The health monitoring API exposes four sections: `configuration`, `workload`, `runtime`, and `capacity_estimation`:

[cols="2"]
|===
a| Configuration
| This section summarizes the current configuration of Task Manager. This includes dynamic configurations that change over time, such as `poll_interval` and `max_workers`, which can adjust in reaction to changing load on the system.
a| Workload
| This section summarizes the workload across the cluster, including the tasks in the system, their types, and current status.
a| Runtime
| This section tracks the execution performance of Task Manager, including task _drift_, worker _load_, and execution stats broken down by type, such as duration and execution results.
a| Capacity Estimation
| This section provides a rough estimate of the sufficiency of Task Manager's capacity. As the name suggests, these are estimates based on historical data and should not be used as predictions. Use these estimations when following the Task Manager <<task-manager-scaling-guidance>>.
|===
Each section has a `timestamp` and a `status` that indicate when the last update to this section took place and whether the health of this section was evaluated as `OK`, `Warning`, or `Error`.

The root `status` indicates the `status` of the system overall.

The Runtime `status` indicates whether task executions have exceeded any of the <<task-manager-configuring-health-monitoring,configured health thresholds>>. An `OK` status means none of the thresholds have been exceeded. A `Warning` status means that at least one warning threshold has been exceeded. An `Error` status means that at least one error threshold has been exceeded.

[IMPORTANT]
==============================================
Some tasks (such as <<action-types,connectors>>) will incorrectly report their status as successful even if the task failed.
The runtime and workload blocks return data about successes and failures and do not take this into consideration.
To get a better sense of action failures, refer to the <<event-log-index,Event log index>>, which provides more accurate context into failures and successes.
==============================================
The Capacity Estimation `status` indicates the sufficiency of the observed capacity. An `OK` status means capacity is sufficient. A `Warning` status means that capacity is sufficient for the scheduled recurring tasks, but non-recurring tasks often cause the cluster to exceed capacity. An `Error` status means that there is insufficient capacity across all types of tasks.

By monitoring the `status` of the system overall, and the `status` of specific task types of interest, you can evaluate the health of the {kib} Task Management system.
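
For example, a quick way to see which section is driving a degraded overall `status`, assuming `jq` is available and that each section under `stats` reports its own `status` (as described above):

[source,sh]
--------------------------------------------------
# Print the overall status next to the status of each monitored section
# (configuration, workload, runtime, capacity_estimation).
curl -s "http://localhost:5601/api/task_manager/_health" | \
  jq '{overall: .status, sections: (.stats | with_entries(.value |= .status))}'
--------------------------------------------------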