[Alerting][Docs] Adds Alerting & Task Manager Scalability Guidance & Health Monitoring (#91171) (#93604)

Documentation for scaling Kibana alerting, what configurations can change, what impacts they have, etc.
Scaling Alerting relies heavily on scaling Task Manager, so these docs also cover Task Manager health monitoring and scaling.

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Gidi Meir Morris 2021-03-04 20:42:57 +00:00 committed by GitHub
parent af60da210a
commit 1d032ef590
18 changed files with 1232 additions and 287 deletions

View file

@ -0,0 +1,133 @@
[[task-manager-api-health]]
=== Get Task Manager health API
++++
<titleabbrev>Get Task Manager health</titleabbrev>
++++
Retrieve the health status of the {kib} Task Manager.
[[task-manager-api-health-request]]
==== Request
`GET <kibana host>:<port>/api/task_manager/_health`
[[task-manager-api-health-codes]]
==== Response code
`200`::
Indicates a successful call.
[[task-manager-api-health-example]]
==== Example
Retrieve the health status of the {kib} Task Manager:
[source,sh]
--------------------------------------------------
$ curl -X GET api/task_manager/_health
--------------------------------------------------
// KIBANA
The API returns the following:
[source,sh]
--------------------------------------------------
{
"id": "15415ecf-cdb0-4fef-950a-f824bd277fe4",
"timestamp": "2021-02-16T11:38:10.077Z",
"status": "OK",
"last_update": "2021-02-16T11:38:09.934Z",
"stats": {
"configuration": {
"timestamp": "2021-02-16T11:29:05.055Z",
"value": {
"request_capacity": 1000,
"max_poll_inactivity_cycles": 10,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_running_average_window": 50,
"monitored_task_execution_thresholds": {
"default": {
"error_threshold": 90,
"warn_threshold": 80
},
"custom": {}
},
"poll_interval": 3000,
"max_workers": 10
},
"status": "OK"
},
"runtime": {
"timestamp": "2021-02-16T11:38:09.934Z",
"value": {
"polling": {
"last_successful_poll": "2021-02-16T11:38:09.934Z",
"last_polling_delay": "2021-02-16T11:29:05.053Z",
"duration": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"claim_conflicts": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"claim_mismatches": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"result_frequency_percent_as_number": {
"Failed": 0,
"NoAvailableWorkers": 0,
"NoTasksClaimed": 0,
"RanOutOfCapacity": 0,
"RunningAtCapacity": 0,
"PoolFilled": 0
}
},
"drift": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"load": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"execution": {
"duration": {},
"result_frequency_percent_as_number": {}
}
},
"status": "OK"
},
"workload": {
"timestamp": "2021-02-16T11:38:05.826Z",
"value": {
"count": 26,
"task_types": {},
"schedule": [],
"overdue": 0,
"estimated_schedule_density": []
},
"status": "OK"
}
}
}
--------------------------------------------------
The health API response is described in <<making-sense-of-task-manager-health-stats>>.
The health monitoring API exposes three sections:
* `configuration` is described in detail under <<task-manager-health-evaluate-the-configuration>>
* `workload` is described in detail under <<task-manager-health-evaluate-the-workload>>
* `runtime` is described in detail under <<task-manager-health-evaluate-the-runtime>>

View file

@ -527,6 +527,7 @@ routes, etc.
|{kib-repo}blob/{branch}/x-pack/plugins/task_manager/README.md[taskManager]
|The task manager is a generic system for running background tasks.
Documentation: https://www.elastic.co/guide/en/kibana/master/task-manager-production-considerations.html
|{kib-repo}blob/{branch}/x-pack/plugins/telemetry_collection_xpack/README.md[telemetryCollectionXpack]

View file

@ -28,5 +28,18 @@ Task Manager runs background tasks by polling for work on an interval. You can
| `xpack.task_manager.max_workers`
| The maximum number of tasks that this Kibana instance will run simultaneously. Defaults to 10.
Starting in 8.0, it will not be possible to set the value greater than 100.
|===
[float]
[[task-manager-health-settings]]
==== Task Manager Health settings
Settings that configure the <<task-manager-health-monitoring>> endpoint.
[cols="2*<"]
|===
| `xpack.task_manager.`
`monitored_task_execution_thresholds`
| Configures the threshold of failed task executions at which point the `warn` or `error` health status is set under each task type execution status (under `stats.runtime.value.execution.result_frequency_percent_as_number[${task type}].status`). This setting allows configuration of both the default level and a custom task type specific level. By default, this setting is configured to mark the health of every task type as `warning` when it exceeds 80% failed executions, and as `error` at 90%. Custom configurations allow you to reduce this threshold to catch failures sooner for task types that you might consider critical, such as alerting tasks. This value can be set to any number between 0 and 100, and a threshold is hit when the value *exceeds* this number. This means that you can avoid setting the status to `error` by setting the threshold at 100, or hit `error` the moment any task fails by setting the threshold to 0 (as it will exceed 0 once a single failure occurs).
|===
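For example, a `kibana.yml` sketch along the following lines (the threshold values are illustrative, not recommendations) keeps the documented defaults and adds a stricter custom threshold for a single task type:

[source,yml]
----
xpack.task_manager.monitored_task_execution_thresholds:
  default:                         # applies to every task type without a custom entry
    error_threshold: 90
    warn_threshold: 80
  custom:
    "alerting:.index-threshold":   # a task type treated as critical in this example
      error_threshold: 50          # report `error` once more than 50% of executions fail
      warn_threshold: 0            # report `warning` as soon as any execution fails
----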

View file

@ -683,5 +683,5 @@ include::{kib-repo-dir}/settings/reporting-settings.asciidoc[]
include::secure-settings.asciidoc[]
include::{kib-repo-dir}/settings/security-settings.asciidoc[]
include::{kib-repo-dir}/settings/spaces-settings.asciidoc[]
include::{kib-repo-dir}/settings/telemetry-settings.asciidoc[]
include::{kib-repo-dir}/settings/task-manager-settings.asciidoc[]
include::{kib-repo-dir}/settings/telemetry-settings.asciidoc[]

View file

@ -164,6 +164,14 @@ If you are using an *on-premises* Elastic Stack deployment with <<using-kibana-w
* You must enable Transport Layer Security (TLS) for communication <<configuring-tls-kib-es, between {es} and {kib}>>. {kib} alerting uses <<api-keys, API keys>> to secure background alert checks and actions, and API keys require {ref}/configuring-tls.html#tls-http[TLS on the HTTP interface]. A proxy will not suffice.
[float]
[[alerting-setup-production]]
== Production considerations and scaling guidance
When relying on alerts and actions as mission critical services, make sure you follow the <<alerting-production-considerations,Alerting production considerations>>.
See <<alerting-scaling-guidance>> for more information on the scalability of {kib} alerting.
[float]
[[alerting-security]]
== Security

View file

@ -1,35 +0,0 @@
[role="xpack"]
[[alerting-production-considerations]]
== Production considerations
{kib} alerting runs both alert checks and actions as persistent background tasks managed by the Kibana Task Manager. This has two major benefits:
* *Persistence*: all task state and scheduling is stored in {es}, so if you restart {kib}, alerts and actions will pick up where they left off. Task definitions for alerts and actions are stored in the index specified by <<task-manager-settings, `xpack.task_manager.index`>>. The default is `.kibana_task_manager`. You must have at least one replica of this index for production deployments. If you lose this index, all scheduled alerts and actions are lost.
* *Scaling*: multiple {kib} instances can read from and update the same task queue in {es}, allowing the alerting and action load to be distributed across instances. In cases where a {kib} instance no longer has capacity to run alert checks or actions, capacity can be increased by adding additional {kib} instances.
[float]
=== Running background alert checks and actions
{kib} background tasks are managed by:
* Polling an {es} task index for overdue tasks at 3 second intervals. You can change this interval using the <<task-manager-settings, `xpack.task_manager.poll_interval`>> setting.
* Tasks are then claimed by updating them in the {es} index, using optimistic concurrency control to prevent conflicts. Each {kib} instance can run a maximum of 10 concurrent tasks, so a maximum of 10 tasks are claimed each interval.
* Tasks are run on the {kib} server.
* In the case of alerts which are recurring background checks, upon completion the task is scheduled again according to the <<defining-alerts-general-details, check interval>>.
[IMPORTANT]
==============================================
Because by default tasks are polled at 3 second intervals and only 10 tasks can run concurrently per {kib} instance, it is possible for alert and action tasks to be run late. This can happen if:
* Alerts use a small *check interval*. The lowest interval possible is 3 seconds, though intervals of 30 seconds or higher are recommended.
* Many alerts or actions must be *run at once*. In this case pending tasks will queue in {es}, and be pulled 10 at a time from the queue at 3 second intervals.
* *Long running tasks* occupy slots for an extended time, leaving fewer slots for other tasks.
For details on the settings that can influence the performance and throughput of Task Manager, see <<task-manager-settings,`Task Manager Settings`>>.
==============================================
[float]
=== Deployment considerations
{es} and {kib} instances use the system clock to determine the current time. To ensure schedules are triggered when expected, you should synchronize the clocks of all nodes in the cluster using a time service such as http://www.ntp.org/[Network Time Protocol].

View file

@ -0,0 +1,55 @@
[role="xpack"]
[[alerting-troubleshooting]]
== Alerting Troubleshooting
This page describes how to resolve common problems you might encounter with Alerting.
If your problem isn't described here, please review open issues in the following GitHub repositories:
* https://github.com/elastic/kibana/issues[kibana] (https://github.com/elastic/kibana/issues?q=is%3Aopen+is%3Aissue+label%3AFeature%3AAlerting[Alerting issues])
Have a question? Contact us in the https://discuss.elastic.co/[discuss forum].
[float]
[[alerts-small-check-interval-run-late]]
=== Alerts with small check intervals run late
*Problem*:
Alerts with a small check interval, such as every two seconds, run later than scheduled.
*Resolution*:
Alerts run as background tasks at a cadence defined by their *check interval*.
When an alert *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the alert will run late.
Either tweak the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the alerts in question.
For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
[float]
[[scheduled-alerts-run-late]]
=== Alerts run late
*Problem*:
Scheduled alerts run at an inconsistent cadence, often running late.
Actions run long after the status of an alert changes, sending a notification of the change too late.
*Solution*:
Alerts and actions run as background tasks by each {kib} instance at a default rate of ten tasks every three seconds.
If many alerts or actions are scheduled to run at the same time, pending tasks will queue in {es}. Each {kib} instance then polls for pending tasks at a rate of up to ten tasks at a time, at three second intervals. Because alerts and actions are backed by tasks, it is possible for pending tasks in the queue to exceed this capacity and run late.
For details on diagnosing the underlying causes of such delays, see <<task-manager-health-tasks-run-late>>.
Alerting and action tasks are identified by their type.
* Alert tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<alert-type-index-threshold, index threshold stack alert>>.
* Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.
When diagnosing issues related to Alerting, focus on the tasks that begin with `alerting:` and `actions:`.
For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.

View file

@ -2,4 +2,4 @@ include::alerting-getting-started.asciidoc[]
include::defining-alerts.asciidoc[]
include::action-types.asciidoc[]
include::alert-types.asciidoc[]
include::alerting-production-considerations.asciidoc[]
include::alerting-troubleshooting.asciidoc[]

View file

@ -13,6 +13,8 @@ include::monitoring/monitoring-kibana.asciidoc[leveloffset=+2]
include::security/securing-kibana.asciidoc[]
include::production-considerations/index.asciidoc[]
include::discover.asciidoc[]
include::dashboard/dashboard.asciidoc[]

View file

@ -0,0 +1,51 @@
[role="xpack"]
[[alerting-production-considerations]]
== Alerting production considerations
++++
<titleabbrev>Alerting</titleabbrev>
++++
Alerting runs both alert checks and actions as persistent background tasks managed by the Task Manager.
When relying on alerts and actions as mission critical services, make sure you follow the <<task-manager-production-considerations, production considerations>> for Task Manager.
[float]
[[alerting-background-tasks]]
=== Running background alert checks and actions
{kib} uses background tasks to run alerts and actions, distributed across all {kib} instances in the cluster.
By default, each {kib} instance polls for work at three second intervals, and can run a maximum of ten concurrent tasks.
These tasks are then run on the {kib} server.
Alerts are recurring background tasks which are rescheduled according to the <<defining-alerts-general-details, check interval>> on completion.
Actions are non-recurring background tasks which are deleted on completion.
For more details on Task Manager, see <<task-manager-background-tasks>>.
[IMPORTANT]
==============================================
Alert and action tasks can run late or at an inconsistent schedule.
This is typically a symptom of the specific usage of the cluster in question.
You can address such issues by tweaking the <<task-manager-settings,Task Manager settings>> or scaling the deployment to better suit your use case.
For detailed guidance, see <<alerting-troubleshooting, Alerting Troubleshooting>>.
==============================================
[float]
[[alerting-scaling-guidance]]
=== Scaling Guidance
As alerts and actions leverage background tasks to perform the majority of work, scaling Alerting is possible by following the <<task-manager-scaling-guidance,Task Manager Scaling Guidance>>.
When estimating the required task throughput, keep the following in mind:
* Each alert uses a single recurring task that is scheduled to run at the cadence defined by its <<defining-alerts-general-details,check interval>>.
* Each action uses a single task. However, because <<alerting-concepts-suppressing-duplicate-notifications,actions are taken per instance>>, alerts can generate a large number of non-recurring tasks.
It is difficult to predict how much throughput is needed to ensure all alerts and actions are executed at consistent schedules.
By counting alerts as recurring tasks and actions as non-recurring tasks, a rough throughput <<task-manager-rough-throughput-estimation,can be estimated>> as a _tasks per minute_ measurement.
Predicting the buffer required to account for actions depends heavily on the alert types you use, the number of alert instances they might detect, and the number of actions you might choose to assign to action groups. With that in mind, regularly <<task-manager-health-monitoring,monitor the health>> of your Task Manager instances.

View file

@ -0,0 +1,5 @@
include::production.asciidoc[]
include::alerting-production-considerations.asciidoc[]
include::task-manager-production-considerations.asciidoc[]
include::task-manager-health-monitoring.asciidoc[]
include::task-manager-troubleshooting.asciidoc[]

View file

@ -1,5 +1,9 @@
[[production]]
== Use {kib} in a production environment
= Use {kib} in a production environment
++++
<titleabbrev>Production considerations</titleabbrev>
++++
* <<configuring-kibana-shield>>
* <<csp-strict-mode>>

View file

@ -0,0 +1,99 @@
[role="xpack"]
[[task-manager-health-monitoring]]
=== Task Manager health monitoring
++++
<titleabbrev>Health monitoring</titleabbrev>
++++
The Task Manager has an internal monitoring mechanism to keep track of a variety of metrics, which can be consumed with either the health monitoring API or the {kib} server log.
The health monitoring API provides a reliable endpoint that can be monitored.
Consuming this endpoint doesn't cause additional load, but rather returns the latest health checks made by the system. This design enables consumption by external monitoring services at a regular cadence without additional load to the system.
Each {kib} instance exposes its own endpoint at:
[source,sh]
--------------------------------------------------
$ curl -X GET api/task_manager/_health
--------------------------------------------------
// KIBANA
Monitoring the `_health` endpoint of each {kib} instance in the cluster is the recommended method of ensuring confidence in mission critical services such as Alerting and Actions.
[float]
[[task-manager-configuring-health-monitoring]]
==== Configuring the monitored health statistics
The health monitoring API monitors the performance of Task Manager out of the box. However, certain performance considerations are deployment specific and you can configure them.
A health threshold is the threshold for failed task executions. Once a task type exceeds this threshold, a status of `warn` or `error` is set on the task type execution. To configure a health threshold, use the <<task-manager-health-settings,`xpack.task_manager.monitored_task_execution_thresholds`>> setting. You can apply this setting to all task types in the system, or to a custom task type.
By default, this setting marks the health of every task type as `warning` when it exceeds 80% failed executions, and as `error` at 90%.
Set this value to a number between 0 and 100. The threshold is hit when the value *exceeds* this number.
To avoid a status of `error`, set the threshold at 100. To hit `error` the moment any task fails, set the threshold to 0.
Create a custom configuration to set lower thresholds for task types you consider critical, such as alerting tasks that you want to detect sooner in an external monitoring service.
[source,yml]
----
xpack.task_manager.monitored_task_execution_thresholds:
default: # <1>
error_threshold: 70
warn_threshold: 50
custom:
"alerting:.index-threshold": # <2>
error_threshold: 50
warn_threshold: 0
----
<1> A default configuration that sets the system-wide `warn` threshold at a 50% failure rate, and `error` at 70% failure rate.
<2> A custom configuration for the `alerting:.index-threshold` task type that sets a `warn` threshold at 0% (which sets a `warn` status the moment any task of that type fails), and an `error` threshold at a 50% failure rate.
[float]
[[task-manager-consuming-health-stats]]
==== Consuming health stats
The health API is best consumed via the `/api/task_manager/_health` endpoint.
Additionally, the metrics are logged in the {kib} `DEBUG` logger at a regular cadence.
To enable Task Manager DEBUG logging in your {kib} instance, add the following to your `kibana.yml`:
[source,yml]
----
logging:
loggers:
- context: plugins.taskManager
appenders: [console]
level: debug
----
These stats are logged based on the number of milliseconds set in your <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting, which means it could add substantial noise to your logs. Only enable this level of logging temporarily.
[float]
[[making-sense-of-task-manager-health-stats]]
==== Making sense of Task Manager health stats
The health monitoring API exposes three sections: `configuration`, `workload` and `runtime`:
[cols="2"]
|===
a| Configuration
| This section summarizes the current configuration of Task Manager. This includes dynamic configurations that change over time, such as `poll_interval` and `max_workers`, which can adjust in reaction to changing load on the system.
a| Workload
| This section summarizes the work load across the cluster, including the tasks in the system, their types, and current status.
a| Runtime
| This section tracks execution performance of Task Manager, tracking task _drift_, worker _load_, and execution stats broken down by type, including duration and execution results.
|===
Each section has a `timestamp` and a `status` that indicates when the last update to this section took place and whether the health of this section was evaluated as `OK`, `Warning` or `Error`.
The root `status` indicates the `status` of the system overall.
By monitoring the `status` of the system overall, and the `status` of specific task types of interest, you can evaluate the health of the {kib} Task Management system.

View file

@ -0,0 +1,143 @@
[role="xpack"]
[[task-manager-production-considerations]]
== Task Manager
{kib} Task Manager is leveraged by features such as Alerting, Actions, and Reporting to run mission critical work as persistent background tasks.
These background tasks distribute work across multiple {kib} instances.
This has three major benefits:
* *Persistence*: All task state and scheduling is stored in {es}, so if you restart {kib}, tasks will pick up where they left off.
* *Scaling*: Multiple {kib} instances can read from and update the same task queue in {es}, allowing the work load to be distributed across instances. If a {kib} instance no longer has capacity to run tasks, you can increase capacity by adding additional {kib} instances.
* *Load Balancing*: Task Manager is equipped with a reactive self-healing mechanism, which allows it to reduce the amount of work it executes in reaction to an increased load related error rate in {es}. Additionally, when Task Manager experiences an increase in recurring tasks, it attempts to space out the work to better balance the load.
[IMPORTANT]
==============================================
Task definitions for alerts and actions are stored in the index specified by <<task-manager-settings, `xpack.task_manager.index`>>.
The default is `.kibana_task_manager`.
You must have at least one replica of this index for production deployments.
If you lose this index, all scheduled alerts and actions are lost.
==============================================
[float]
[[task-manager-background-tasks]]
=== Running background tasks
{kib} background tasks are managed as follows:
* An {es} task index is polled for overdue tasks at 3-second intervals. You can change this interval using the <<task-manager-settings, `xpack.task_manager.poll_interval`>> setting.
* Tasks are claimed by updating them in the {es} index, using optimistic concurrency control to prevent conflicts. Each {kib} instance can run a maximum of 10 concurrent tasks, so a maximum of 10 tasks are claimed each interval.
* Tasks are run on the {kib} server.
* Task Manager ensures that tasks:
** Are only executed once
** Are retried when they fail (if configured to do so)
** Are rescheduled to run again at a future point in time (if configured to do so)
[IMPORTANT]
==============================================
It is possible for tasks to run late or at an inconsistent schedule.
This is usually a symptom of the specific usage or scaling strategy of the cluster in question.
To address these issues, tweak the {kib} Task Manager settings or the cluster scaling strategy to better suit the unique use case.
For details on the settings that can influence the performance and throughput of Task Manager, see <<task-manager-settings-kb, Task Manager Settings>>.
For detailed troubleshooting guidance, see <<task-manager-troubleshooting>>.
==============================================
[float]
=== Deployment considerations
{es} and {kib} instances use the system clock to determine the current time. To ensure schedules are triggered when expected, synchronize the clocks of all nodes in the cluster using a time service such as http://www.ntp.org/[Network Time Protocol].
[float]
[[task-manager-scaling-guidance]]
=== Scaling guidance
How you deploy {kib} largely depends on your use case. Predicting the throughput a deployment might require to support Task Management is difficult, because features can schedule an unpredictable number of tasks at a variety of scheduled cadences.
However, there is a relatively straightforward method you can follow to produce a rough estimate based on your expected usage.
[float]
[[task-manager-default-scaling]]
==== Default scale
By default, {kib} polls for tasks at a rate of 10 tasks every 3 seconds.
This means that you can expect a single {kib} instance to support up to 200 _tasks per minute_ (`200/tpm`).
In practice, a {kib} instance will only achieve the upper bound of `200/tpm` if the duration of task execution is below the polling rate of 3 seconds. For the most part, the duration of tasks is below that threshold, but it can vary greatly as {es} and {kib} usage grow and task complexity increases (such as alerts executing heavy queries across large datasets).
By <<task-manager-health-evaluate-the-workload,evaluating the workload>>, you can make a rough estimate as to the required throughput as a _tasks per minute_ measurement.
For example, suppose your current workload reveals a required throughput of `440/tpm`. You can address this scale by provisioning 3 {kib} instances, with an upper throughput of `600/tpm`. This scale would provide approximately 25% additional capacity to handle ad-hoc non-recurring tasks and potential growth in recurring tasks.
It is highly recommended that you maintain at least 20% additional capacity, beyond your expected workload, as spikes in ad-hoc tasks are possible at times of high activity (such as a spike in actions in response to an active alert).
For details on monitoring the health of {kib} Task Manager, follow the guidance in <<task-manager-health-monitoring>>.
[float]
[[task-manager-scaling-horizontally]]
==== Scaling horizontally
At times, the sustainable approach might be to expand the throughput of your cluster by provisioning additional {kib} instances.
By default, each additional {kib} instance will add an additional 10 tasks that your cluster can run concurrently, but you can also scale each {kib} instance vertically, if your diagnosis indicates that they can handle the additional workload.
[float]
[[task-manager-scaling-vertically]]
==== Scaling vertically
Other times, it might be preferable to increase the throughput of individual {kib} instances.
Tweak the *Max Workers* via the <<task-manager-settings,`xpack.task_manager.max_workers`>> setting, which allows each {kib} instance to pull a higher number of tasks per interval. This could impact the performance of each {kib} instance as the workload will be higher.
Tweak the *Poll Interval* via the <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting, which allows each {kib} instance to pull scheduled tasks at a higher rate. This could impact the performance of the {es} cluster as the workload will be higher.
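As a rough sketch of what vertical scaling might look like in `kibana.yml` (the values shown are illustrative only and should be validated against your own workload):

[source,yml]
----
# Illustrative values only: raising max_workers increases the load on this Kibana
# instance, and lowering poll_interval increases the query load on Elasticsearch.
xpack.task_manager.max_workers: 20     # default is 10
xpack.task_manager.poll_interval: 2000 # default is 3000 milliseconds
----

Apply one change at a time and re-check the <<task-manager-health-monitoring,health monitoring>> stats before tweaking further.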
[float]
[[task-manager-choosing-scaling-strategy]]
==== Choosing a scaling strategy
Each scaling strategy comes with its own considerations, and the appropriate strategy largely depends on your use case.
Scaling {kib} instances vertically causes higher resource usage in each {kib} instance, as it will perform more concurrent work.
Scaling {kib} instances horizontally requires a higher degree of coordination, which can impact overall performance.
A recommended strategy is to follow these steps:
1. Produce a <<task-manager-rough-throughput-estimation,rough throughput estimate>> as a guide to provisioning as many {kib} instances as needed. Include any growth in tasks that you expect to experience in the near future, and a buffer to better address ad-hoc tasks.
2. After provisioning a deployment, assess whether the provisioned {kib} instances achieve the required throughput by evaluating the <<task-manager-health-monitoring>> as described in <<task-manager-theory-insufficient-throughput, Insufficient throughput to handle the scheduled workload>>.
3. If the throughput is insufficient, and {kib} instances exhibit low resource usage, incrementally scale vertically while <<kibana-page,monitoring>> the impact of these changes.
4. If the throughput is insufficient, and {kib} instances are exhibiting high resource usage, incrementally scale horizontally by provisioning new {kib} instances and reassess.
Task Manager, like the rest of the Elastic Stack, is designed to scale horizontally. Take advantage of this ability to ensure mission critical services, such as Alerting and Reporting, always have the capacity they need.
Scaling horizontally requires a higher degree of coordination between {kib} instances. One way Task Manager coordinates with other instances is by delaying its polling schedule to avoid conflicts with other instances.
By using <<task-manager-health-monitoring, health monitoring>> to evaluate the <<task-manager-health-evaluate-the-runtime,date of the `last_polling_delay`>> across a deployment, you can estimate the frequency at which Task Manager resets its delay mechanism.
A higher frequency suggests {kib} instances conflict at a high rate, which you can address by scaling vertically rather than horizontally, reducing the required coordination.
[float]
[[task-manager-rough-throughput-estimation]]
==== Rough throughput estimation
Predicting the required throughput a deployment might need to support Task Management is difficult, as features can schedule an unpredictable number of tasks at a variety of scheduled cadences.
However, a rough lower bound can be estimated, which is then used as a guide.
Throughput is best thought of as a measurement in tasks per minute.
A default {kib} instance can support up to `200/tpm`.
Given a deployment of 100 recurring tasks, estimating the required throughput depends on the scheduled cadence.
Suppose you expect to run 50 tasks at a cadence of `10s` and the other 50 tasks at `20m`. In addition, you expect a couple dozen non-recurring tasks every minute.
A non-recurring task requires a single execution, which means that, even if all 100 tasks were non-recurring, a single {kib} instance could execute them in less than a minute, using only half of its capacity. As these tasks are only executed once, the {kib} instance will sit idle once all tasks are executed.
For that reason, don't include non-recurring tasks in your _tasks per minute_ calculation. Instead, include a buffer in the final _lower bound_ to account for the cost of ad-hoc non-recurring tasks.
A recurring task requires as many executions as its cadence can fit in a minute. A recurring task with a `10s` schedule will require `6/tpm`, as it will execute 6 times per minute. A recurring task with a `20m` schedule only executes 3 times per hour and only requires a throughput of `0.05/tpm`, a number so small that it is difficult to take into account.
For this reason, we recommend grouping tasks by _tasks per minute_ and _tasks per hour_, as demonstrated in <<task-manager-health-evaluate-the-workload,Evaluate your workload>>, averaging the _per hour_ measurement across all minutes.
Given the predicted workload, you can estimate a lower bound throughput of `340/tpm` (`6/tpm` * 50 + `3/tph` * 50 + 20% buffer).
As a default, a {kib} instance provides a throughput of `200/tpm`. A good starting point for your deployment is to provision 2 {kib} instances. You could then monitor their performance and reassess as the required throughput becomes clearer.
Although this is a _rough_ estimate, the _tasks per minute_ measurement provides the lower bound needed to execute tasks on time.
Once you calculate the rough _tasks per minute_ estimate, add a 20% buffer for non-recurring tasks. How much of a buffer is required largely depends on your use case, so <<task-manager-health-evaluate-the-workload,evaluate your workload>> as it grows to ensure enough of a buffer is provisioned.

View file

@ -0,0 +1,708 @@
[role="xpack"]
[[task-manager-troubleshooting]]
=== Task Manager troubleshooting
++++
<titleabbrev>Troubleshooting</titleabbrev>
++++
Task Manager is used by a wide range of services in {kib}, such as <<alerting-production-considerations, Alerting>>, Reporting, and Telemetry.
Unexpected behavior in these services might be a downstream issue originating in Task Manager.
This page describes how to resolve common problems you might encounter with Task Manager.
If your problem isn't described here, please review open issues in the following GitHub repositories:
* https://github.com/elastic/kibana/issues[{kib}] (https://github.com/elastic/kibana/issues?q=is%3Aopen+is%3Aissue+label%3A%22Feature%3ATask+Manager%22[Task Manager issues])
Have a question? Contact us in the https://discuss.elastic.co/[discuss forum].
[float]
[[task-manager-health-scheduled-tasks-small-schedule-interval-run-late]]
==== Tasks with small schedule intervals run late
*Problem*:
Tasks are scheduled to run every 2 seconds, but seem to be running late.
*Solution*:
Task Manager polls for tasks at the cadence specified by the <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting, which is 3 seconds by default. This means that a task could run late if it uses a schedule that is smaller than this setting.
You can adjust the <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting. However, this will add additional load to both {kib} and {es} instances in the cluster, as they will perform more queries.
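For example, a minimal `kibana.yml` sketch (the interval shown is illustrative, not a recommendation) that polls more frequently to better accommodate tasks on a 2 second schedule:

[source,yml]
----
# Illustrative only: polling every 2 seconds lets a task with a 2s schedule run
# closer to its cadence, at the cost of more frequent queries against Elasticsearch.
xpack.task_manager.poll_interval: 2000
----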
[float]
[[task-manager-health-tasks-run-late]]
==== Tasks run late
*Problem*:
The most common symptom of an underlying problem in Task Manager is that tasks appear to run late.
For instance, recurring tasks might run at an inconsistent cadence, or long after their scheduled time.
*Solution*:
By default, {kib} polls for tasks at a rate of 10 tasks every 3 seconds.
If many tasks are scheduled to run at the same time, pending tasks will queue in {es}. Each {kib} instance then polls for pending tasks at a rate of up to 10 tasks at a time, at 3 second intervals. It is possible for pending tasks in the queue to exceed this capacity and run late as a result.
This type of delay is known as _drift_. The root cause for drift depends on the specific usage, and there are no hard and fast rules for addressing drift.
For example:
* If drift is caused by *an excess of concurrent tasks* relative to the available capacity of {kib} instances in the cluster, expand the throughput of the cluster.
* If drift is caused by *long running tasks* that overrun their scheduled cadence, reconfigure the tasks in question.
Refer to <<task-manager-diagnosing-root-cause>> for step-by-step instructions on identifying the correct resolution.
_Drift_ is often addressed by adjusting the scaling of the deployment to better suit your usage.
For details on scaling Task Manager, see <<task-manager-scaling-guidance>>.
[[task-manager-diagnosing-root-cause]]
==== Diagnose a root cause for drift
The following guide helps you identify a root cause for _drift_ by making sense of the output from the <<task-manager-health-monitoring>> endpoint.
By analyzing the different sections of the output, you can evaluate different theories that explain the drift in a deployment.
* <<task-manager-health-evaluate-the-configuration,Evaluate the Configuration>>
** <<task-manager-theory-reduced-polling-rate,{kib} is configured to poll for tasks at a reduced rate>>
* <<task-manager-health-evaluate-the-runtime,Evaluate the Runtime>>
** <<task-manager-theory-actual-polling-frequently,{kib} is not actually polling as frequently as it should>>
** <<task-manager-theory-insufficient-throughput,{kib} is polling as frequently as it should, but that isn't often enough to keep up with the workload>>
** <<task-manager-theory-long-running-tasks,Tasks run for too long, overrunning their schedule>>
** <<task-manager-theory-high-fail-rate,Tasks take multiple attempts to succeed>>
* <<task-manager-health-evaluate-the-workload,Evaluate the Workload>>
Retrieve the latest monitored health stats of a {kib} instance's Task Manager:
[source,sh]
--------------------------------------------------
$ curl -X GET api/task_manager/_health
--------------------------------------------------
// KIBANA
The API returns the following:
[source,json]
--------------------------------------------------
{
"id": "15415ecf-cdb0-4fef-950a-f824bd277fe4",
"timestamp": "2021-02-16T11:38:10.077Z",
"status": "OK",
"last_update": "2021-02-16T11:38:09.934Z",
"stats": {
"configuration": {
"timestamp": "2021-02-16T11:29:05.055Z",
"value": {
"request_capacity": 1000,
"max_poll_inactivity_cycles": 10,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_running_average_window": 50,
"monitored_task_execution_thresholds": {
"default": {
"error_threshold": 90,
"warn_threshold": 80
},
"custom": {}
},
"poll_interval": 3000,
"max_workers": 10
},
"status": "OK"
},
"runtime": {
"timestamp": "2021-02-16T11:38:09.934Z",
"value": {
"polling": {
"last_successful_poll": "2021-02-16T11:38:09.934Z",
"last_polling_delay": "2021-02-16T11:29:05.053Z",
"duration": {
"p50": 13,
"p90": 128,
"p95": 143,
"p99": 168
},
"claim_conflicts": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"claim_mismatches": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"result_frequency_percent_as_number": {
"Failed": 0,
"NoAvailableWorkers": 0,
"NoTasksClaimed": 80,
"RanOutOfCapacity": 0,
"RunningAtCapacity": 0,
"PoolFilled": 20
}
},
"drift": {
"p50": 99,
"p90": 1245,
"p95": 1845,
"p99": 2878
},
"load": {
"p50": 0,
"p90": 0,
"p95": 10,
"p99": 20
},
"execution": {
"duration": {
"alerting:.index-threshold": {
"p50": 95,
"p90": 1725,
"p95": 2761,
"p99": 2761
},
"alerting:xpack.uptime.alerts.monitorStatus": {
"p50": 149,
"p90": 1071,
"p95": 1171,
"p99": 1171
},
"actions:.index": {
"p50": 166,
"p90": 166,
"p95": 166,
"p99": 166
}
},
"result_frequency_percent_as_number": {
"alerting:.index-threshold": {
"Success": 100,
"RetryScheduled": 0,
"Failed": 0,
"status": "OK"
},
"alerting:xpack.uptime.alerts.monitorStatus": {
"Success": 100,
"RetryScheduled": 0,
"Failed": 0,
"status": "OK"
},
"actions:.index": {
"Success": 10,
"RetryScheduled": 0,
"Failed": 90,
"status": "error"
}
}
}
},
"status": "OK"
},
"workload": {
"timestamp": "2021-02-16T11:38:05.826Z",
"value": {
"count": 26,
"task_types": {
"alerting:.index-threshold": {
"count": 2,
"status": {
"idle": 2
}
},
"actions:.index": {
"count": 14,
"status": {
"idle": 2,
"running": 2,
"failed": 10
}
},
"alerting:xpack.uptime.alerts.monitorStatus": {
"count": 10,
"status": {
"idle": 10
}
},
},
"schedule": [
["10s", 2],
["1m", 2],
["60s", 2],
["5m", 2],
["60m", 4]
],
"overdue": 0,
"estimated_schedule_density": [0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 3, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0]
},
"status": "OK"
}
}
}
--------------------------------------------------
[[task-manager-health-evaluate-the-configuration]]
===== Evaluate the Configuration
[[task-manager-theory-reduced-polling-rate]]
*Theory*:
{kib} is configured to poll for tasks at a reduced rate.
*Diagnosis*:
Evaluating the health stats, you can see the following output under `stats.configuration.value`:
[source,json]
--------------------------------------------------
{
"request_capacity": 1000,
"max_poll_inactivity_cycles": 10,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_running_average_window": 50,
"monitored_task_execution_thresholds": {
"default": {
"error_threshold": 90,
"warn_threshold": 80
},
"custom": {}
},
"poll_interval": 3000, # <1>
"max_workers": 10 # <2>
}
--------------------------------------------------
<1> `poll_interval` is set to the default value of 3000 milliseconds
<2> `max_workers` is set to the default value of 10 workers
You can infer from this output that the {kib} instance polls for work every 3 seconds and can run 10 concurrent tasks.
Now suppose the output under `stats.configuration.value` is the following:
[source,json]
--------------------------------------------------
{
"request_capacity": 1000,
"max_poll_inactivity_cycles": 10,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_running_average_window": 50,
"monitored_task_execution_thresholds": {
"default": {
"error_threshold": 90,
"warn_threshold": 80
},
"custom": {}
},
"poll_interval": 60000, # <1>
"max_workers": 1 # <2>
}
--------------------------------------------------
<1> `poll_interval` is set to 60000 milliseconds, far higher than the default
<2> `max_workers` is set to 1 worker, far lower than the default
You can infer from this output that the {kib} instance only polls for work once a minute and only picks up one task at a time. This throughput is unlikely to support mission critical services, such as Alerting or Reporting, and tasks will usually run late.
There are two possible reasons for such a configuration:
* These settings have been configured manually, which can be resolved by reconfiguring these settings.
For details, see <<task-manager-settings-kb, Task Manager Settings>>.
* {kib} has reduced its own throughput in reaction to excessive load on the {es} cluster.
+
Task Manager is equipped with a reactive self-healing mechanism in response to an increase in load related errors in {es}. This mechanism will increase the `poll_interval` setting (reducing the rate at which it queries {es}), and decrease the `max_workers` (reducing the amount of operations it executes against {es}). Once the error rate reduces, these settings are incrementally dialed up again, returning them to the configured settings.
+
This scenario can be identified by searching the {kib} Server Log for messages such as:
+
[source, txt]
--------------------------------------------------
Max workers configuration is temporarily reduced after Elasticsearch returned 25 "too many request" error(s).
--------------------------------------------------
+
Deeper investigation into the high error rate experienced by the {es} cluster is required.
[[task-manager-health-evaluate-the-runtime]]
===== Evaluate the Runtime
[[task-manager-theory-actual-polling-frequently]]
*Theory*:
{kib} is not polling as frequently as it should
*Diagnosis*:
Evaluating the health stats, you see the following output under `stats.runtime.value.polling`:
[source,json]
--------------------------------------------------
{
"last_successful_poll": "2021-02-16T11:38:09.934Z", # <1>
"last_polling_delay": "2021-02-14T11:29:05.053Z",
"duration": { # <2>
"p50": 13,
"p90": 128,
"p95": 143,
"p99": 168
},
"claim_conflicts": { # <3>
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 2
},
"claim_mismatches": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
"result_frequency_percent_as_number": { # <4>
"Failed": 0,
"NoAvailableWorkers": 0,
"NoTasksClaimed": 80,
"RanOutOfCapacity": 0,
"RunningAtCapacity": 0,
"PoolFilled": 20
}
}
--------------------------------------------------
<1> Ensure the last successful polling cycle was completed no more than a couple of multiples of `poll_interval` in the past.
<2> Ensure the duration of polling cycles is usually below 100ms. Longer durations are possible, but unexpected.
<3> Ensure {kib} instances in the cluster are not encountering a high rate of version conflicts.
<4> Ensure the majority of polling cycles result in positive outcomes, such as `RunningAtCapacity` or `PoolFilled`.
You can infer from this output that the {kib} instance is polling regularly.
This assessment is based on the following:
* Comparing the `last_successful_poll` to the `timestamp` (value of `2021-02-16T11:38:10.077Z`) at the root, where you can see the last polling cycle took place 1 second before the monitoring stats were exposed by the health monitoring API.
* Comparing the `last_polling_delay` to the `timestamp` (value of `2021-02-16T11:38:10.077Z`) at the root, where you can see the last polling cycle delay took place 2 days ago, suggesting {kib} instances are not conflicting often.
* The `p50` of the `duration` shows that at least 50% of polling cycles take, at most, 13 milliseconds to complete.
* Evaluating the `result_frequency_percent_as_number`:
** 80% of the polling cycles completed without claiming any tasks (suggesting that there aren't any overdue tasks).
** 20% completed with Task Manager claiming tasks that were then executed.
** None of the polling cycles ended up occupying all of the available workers, as `RunningAtCapacity` has a frequency of 0%, suggesting there is enough capacity in Task Manager to handle the workload.
All of these stats are tracked as a running average, which means that they give a snapshot of a period of time (by default {kib} tracks up to 50 cycles), rather than giving a complete history.
Suppose the output under `stats.runtime.value.polling.result_frequency_percent_as_number` was the following:
[source,json]
--------------------------------------------------
{
"Failed": 30, # <1>
"NoAvailableWorkers": 20, # <2>
"NoTasksClaimed": 10,
"RanOutOfCapacity": 10, # <3>
"RunningAtCapacity": 10, # <4>
"PoolFilled": 20
}
--------------------------------------------------
<1> 30% of polling cycles failed, which is a high rate.
<2> 20% of polling cycles are skipped as Task Manager has no capacity left to run tasks.
<3> 10% of polling cycles result in Task Manager claiming more tasks than it has capacity to run.
<4> 10% of polling cycles result in Task Manager claiming precisely as many tasks as it has capacity to run.
You can infer from this output that Task Manager is not healthy, as the failure rate is high, and Task Manager is fetching tasks it has no capacity to run.
Analyzing the {kib} Server Log should reveal the underlying issue causing the high error rate and capacity issues.
The high `NoAvailableWorkers` rate of 20% suggests that there are many tasks running for durations longer than the `poll_interval`.
For details on analyzing long task execution durations, see the <<task-manager-theory-long-running-tasks,long running tasks>> theory.
[[task-manager-theory-insufficient-throughput]]
*Theory*:
{kib} is polling as frequently as it should, but that isn't often enough to keep up with the workload
*Diagnosis*:
Evaluating the health stats, you can see the following output of `drift` and `load` under `stats.runtime.value`:
[source,json]
--------------------------------------------------
{
"drift": { # <1>
"p50": 99,
"p90": 1245,
"p95": 1845,
"p99": 2878
},
"load": { # <2>
"p50": 0,
"p90": 0,
"p95": 10,
"p99": 20
}
}
--------------------------------------------------
<1> `drift` shows us that at least 95% of tasks are running within 2 seconds of their scheduled time.
<2> `load` shows us that Task Manager is idle at least 90% of the time, and never uses more than 20% of its available workers.
You can infer from these stats that this {kib} has plenty of capacity, and any delays you might be experiencing are unlikely to be addressed by expanding the throughput.
Suppose the output of `drift` and `load` was the following:
[source,json]
--------------------------------------------------
{
"drift": { # <1>
"p50": 2999,
"p90": 3845,
"p95": 3845.75,
"p99": 4078
},
"load": { # <2>
"p50": 80,
"p90": 100,
"p95": 100,
"p99": 100
}
}
--------------------------------------------------
<1> `drift` shows us that all tasks are running 3 to 4 seconds after their scheduled time.
<2> `load` shows us that at least half of the time Task Manager is running at a load of 80%.
You can infer from these stats that this {kib} is using most of its capacity, but seems to keep up with the work most of the time.
This assessment is based on the following:
* The `p90` of `load` is at 100%, and `p50` is also quite high at 80%. This means that there is little to no room for maneuvering, and a spike of work might cause Task Manager to exceed its capacity.
* Tasks run soon after their scheduled time, which is to be expected. A `poll_interval` of `3000` milliseconds would often experience a consistent drift of somewhere between `0` and `3000` milliseconds. A `p50 drift` of `2999` suggests that there is room for improvement, and you could benefit from a higher throughput.
For details on achieving higher throughput by adjusting your scaling strategy, see <<task-manager-scaling-guidance>>.
[[task-manager-theory-long-running-tasks]]
*Theory*:
Tasks run for too long, overrunning their schedule
*Diagnosis*:
The <<task-manager-theory-insufficient-throughput,Insufficient throughput to handle the scheduled workload>> theory analyzed a hypothetical scenario where both drift and load were unusually high.
Suppose an alternate scenario, where `drift` is high, but `load` is not, such as the following:
[source,json]
--------------------------------------------------
{
"drift": { # <1>
"p50": 9799,
"p90": 83845,
"p95": 90328,
"p99": 123845
},
"load": { # <2>
"p50": 40,
"p90": 75,
"p95": 80,
"p99": 100
}
}
--------------------------------------------------
<1> `drift` shows that most (if not all) tasks are running at least 32 seconds too late.
<2> `load` shows that, for the most part, you have capacity to run more concurrent tasks.
In the preceding scenario, the tasks are running far too late, but you have sufficient capacity to run more concurrent tasks.
A high capacity allows {kib} to run multiple different tasks concurrently. If a task is already running when its next scheduled run is due, {kib} will avoid running it a second time, and instead wait for the first execution to complete.
If a task takes longer to execute than the cadence of its schedule, then that task will always overrun and experience a high drift. For example, suppose a task is scheduled to execute every 3 seconds, but takes 6 seconds to complete. It will consistently suffer from a drift of, at least, 3 seconds.
Evaluating the health stats in this hypothetical scenario, you see the following output under `stats.runtime.value.execution.duration`:
[source,json]
--------------------------------------------------
{
"alerting:.index-threshold": { # <1>
"p50": 95,
"p90": 1725,
"p95": 2761,
"p99": 2761
},
"alerting:.es-query": { # <2>
"p50": 7149,
"p90": 40071,
"p95": 45282,
"p99": 121845
},
"actions:.index": {
"p50": 166,
"p90": 166,
"p95": 166,
"p99": 166
}
}
--------------------------------------------------
<1> 50% of the tasks backing index threshold alerts complete in less than 100 milliseconds.
<2> 50% of the tasks backing Elasticsearch query alerts complete in 7 seconds, but at least 10% take longer than 40 seconds.
You can infer from these stats that the high drift the Task Manager is experiencing is most likely due to Elasticsearch query alerts that are running for a long time.
Resolving this issue is context dependent and changes from case to case.
In the preceding example, this would be resolved by modifying the queries in these alerts to make them faster, or improving the {es} throughput to speed up the existing query.
[[task-manager-theory-high-fail-rate]]
*Theory*:
Tasks take multiple attempts to succeed
*Diagnosis*:
A high error rate could cause a task to appear to run late, when in fact it runs on time, but experiences a high failure rate.
Evaluating the preceding health stats, you see the following output under `stats.runtime.value.execution.result_frequency_percent_as_number`:
[source,json]
--------------------------------------------------
{
"alerting:.index-threshold": { # <1>
"Success": 100,
"RetryScheduled": 0,
"Failed": 0,
"status": "OK"
},
"alerting:xpack.uptime.alerts.monitorStatus": {
"Success": 100,
"RetryScheduled": 0,
"Failed": 0,
"status": "OK"
},
"actions:.index": { # <2>
"Success": 8,
"RetryScheduled": 0,
"Failed": 92,
"status": "error" # <3>
}
}
--------------------------------------------------
<1> 100% of the tasks backing index threshold alerts successfully complete.
<2> 92% of the tasks backing ES index actions fail to complete.
<3> The tasks backing ES index actions have exceeded the default `monitored_task_execution_thresholds` _error_ configuration.
You can infer from these stats that most `actions:.index` tasks, which back the ES Index {kib} action, fail.
Resolving that would require deeper investigation into the {kib} Server Log, where the exact errors are logged, and addressing these specific errors.
[[task-manager-health-evaluate-the-workload]]
===== Evaluate the Workload
Predicting the required throughput a deployment might need to support Task Manager is difficult, as features can schedule an unpredictable number of tasks at a variety of scheduled cadences.
<<task-manager-health-monitoring>> provides statistics that make it easier to monitor the adequacy of the existing throughput.
By evaluating the workload, the required throughput can be estimated, which is used when following the Task Manager <<task-manager-scaling-guidance>>.
Evaluating the preceding health stats, you see the following output under `stats.workload.value`:
[source,json]
--------------------------------------------------
{
"count": 26, # <1>
"task_types": {
"alerting:.index-threshold": {
"count": 2, # <2>
"status": {
"idle": 2
}
},
"actions:.index": {
"count": 14,
"status": {
"idle": 2,
"running": 2,
"failed": 10 # <3>
}
},
"alerting:xpack.uptime.alerts.monitorStatus": {
"count": 10,
"status": {
"idle": 10
}
}
},
"schedule": [ # <4>
["10s", 2],
["1m", 2],
["90s", 2],
["5m", 8]
],
"overdue": 0, # <5>
"estimated_schedule_density": [ # <6>
0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
0, 3, 0, 0, 0, 1, 0, 1, 0, 1,
0, 0, 0, 1, 0, 0, 1, 1, 1, 0
]
}
--------------------------------------------------
<1> There are 26 tasks in the system, including regular tasks, recurring tasks, and failed tasks.
<2> There are 2 `idle` index threshold alert tasks, meaning they are scheduled to run at some point in the future.
<3> Of the 14 tasks backing the ES index action, 10 have failed and 2 are running.
<4> A histogram of all scheduled recurring tasks shows that 2 tasks are scheduled to run every 10 seconds, 2 tasks are scheduled to run once a minute, and so on.
<5> There are no tasks overdue, which means that all tasks that *should* have run by now *have* run.
<6> This histogram shows the tasks scheduled to run throughout the upcoming 20 polling cycles. The histogram represents the entire deployment, rather than just this {kib} instance.
The `workload` section summarizes the work load across the cluster, listing the tasks in the system, their types, schedules, and current status.
You can infer from these stats that a default deployment should suffice.
This assessment is based on the following:
* The estimated schedule density is low.
* There aren't many tasks in the system relative to the default capacity.
Suppose the output of `stats.workload.value` looked something like this:
[source,json]
--------------------------------------------------
{
"count": 2191, # <1>
"task_types": {
"alerting:.index-threshold": {
"count": 202,
"status": {
"idle": 183,
"claiming": 2,
"running": 19
}
},
"alerting:.es-query": {
"count": 225,
"status": {
"idle": 225,
}
},
"actions:.index": {
"count": 89,
"status": {
"idle": 24,
"running": 2,
"failed": 63
}
},
"alerting:xpack.uptime.alerts.monitorStatus": {
"count": 87,
"status": {
"idle": 74,
"running": 13
}
}
},
"schedule": [ # <2>
["10s", 38],
["1m", 101],
["90s", 55],
["5m", 89],
["20m", 62],
["60m", 106],
["1d", 61]
],
"overdue": 0, # <5>
"estimated_schedule_density": [ # <3>
10, 1, 0, 10, 0, 20, 0, 1, 0, 1,
9, 0, 3, 10, 0, 0, 10, 10, 7, 0,
0, 31, 0, 12, 16, 31, 0, 10, 0, 10,
3, 22, 0, 10, 0, 2, 10, 10, 1, 0
]
}
--------------------------------------------------
<1> There are 2,191 tasks in the system.
<2> The scheduled tasks are distributed across a variety of cadences.
<3> The schedule density shows that you expect to exceed the default 10 concurrent tasks.
You can infer several important attributes of your workload from this output:
* There are many tasks in your system and ensuring these tasks run on their scheduled cadence will require attention to the Task Manager throughput.
* Assessing the high frequency tasks (tasks that recur at a cadence of a couple of minutes or less), you must support a throughput of approximately 400 tasks per minute (38 every 10 seconds + 101 every minute + 55 every 90 seconds).
* Assessing the medium frequency tasks (tasks that recur at a cadence of an hour or less), you must support an additional throughput of over 2000 tasks per hour (89 every 5 minutes + 62 every 20 minutes + 106 each hour). You can average the needed throughput for the hour by counting these tasks as an additional 30 to 40 tasks per minute.
* Assessing the estimated schedule density, there are cycles that are due to run upwards of 31 tasks concurrently, and alongside these cycles, there are empty cycles. You can expect Task Manager to load balance these tasks throughout the empty cycles, but this won't leave much capacity to handle spikes in fresh tasks that might be scheduled in the future.
These rough calculations give you a lower bound to the required throughput, which is _at least_ 440 tasks per minute to ensure recurring tasks are executed at their scheduled time. This throughput doesn't account for nonrecurring tasks that might have been scheduled, nor does it account for tasks (recurring or otherwise) that might be scheduled in the future.
Given these inferred attributes, it would be safe to assume that a single {kib} instance with default settings **would not** provide the required throughput. It is possible that scaling horizontally by adding a couple more {kib} instances will.
For details on scaling Task Manager, see <<task-manager-scaling-guidance>>.

View file

@ -56,6 +56,4 @@ include::{kib-repo-dir}/setup/access.asciidoc[]
include::{kib-repo-dir}/setup/connect-to-elasticsearch.asciidoc[]
include::{kib-repo-dir}/setup/production.asciidoc[]
include::{kib-repo-dir}/setup/upgrade.asciidoc[]

View file

@ -1,6 +1,7 @@
# Kibana task manager
The task manager is a generic system for running background tasks.
Documentation: https://www.elastic.co/guide/en/kibana/master/task-manager-production-considerations.html
It supports:
- Single-run and recurring tasks
@ -495,11 +496,9 @@ Our current model, then, is this:
## Limitations in v1.0
In v1, the system only understands 1 minute increments (e.g. '1m', '7m'). Tasks which need something more robust will need to specify their own "runAt" in their run method's return value.
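As a hedged illustration of that escape hatch, a task runner can return its own `runAt` to reschedule itself at sub-minute precision. The sketch below follows the task runner shape described in this README, but the task name, the `pollExternalService` helper, and the `taskManager` contract are stand-ins declared as placeholders:
```ts
// Sketch only: a task that reschedules itself 30 seconds after each run by
// returning `runAt`, bypassing the 1 minute scheduling granularity.
// `taskManager` stands in for the setup contract handed to a plugin, and
// `pollExternalService` is a hypothetical helper.
declare const taskManager: {
  registerTaskDefinitions(definitions: Record<string, unknown>): void;
};
declare function pollExternalService(state: unknown): Promise<unknown>;

taskManager.registerTaskDefinitions({
  pollExternalService: {
    title: 'Poll an external service',
    createTaskRunner: ({ taskInstance }: { taskInstance: { state: unknown } }) => ({
      async run() {
        const results = await pollExternalService(taskInstance.state);
        return {
          state: { lastResults: results },
          runAt: new Date(Date.now() + 30 * 1000), // next run in 30 seconds
        };
      },
    }),
  },
});
```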
There is only a rudimentary mechanism for coordinating tasks and handling expired tasks. Tasks are considered expired if their runAt has arrived, and their status is still 'running'.
There is no task history. Each run overwrites the previous run's state. One-time tasks are removed from the index upon completion regardless of success / failure.
There is no task history. Each run overwrites the previous run's state. One-time tasks are removed from the index upon completion.
The task manager's public API is create / delete / list. Updates aren't directly supported, and listing should be scoped so that users only see their own tasks.
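A rough sketch of that create / delete / list surface as another plugin might use it is shown below; the contract is abbreviated and the exact field names should be treated as assumptions rather than the real signatures:
```ts
// Sketch only: the create / delete / list surface as another plugin might use it.
// The contract below is abbreviated; treat the exact field names as assumptions.
declare const taskManager: {
  schedule(task: { taskType: string; params: object; state: object }): Promise<{ id: string }>;
  fetch(opts: object): Promise<{ docs: unknown[] }>;
  remove(id: string): Promise<void>;
};

async function example() {
  // create
  const { id } = await taskManager.schedule({
    taskType: 'pollExternalService',
    params: { serviceUrl: 'https://example.com' },
    state: {},
  });

  // list -- in real usage this should be scoped so users only see their own tasks
  const { docs } = await taskManager.fetch({
    query: { term: { 'task.taskType': 'pollExternalService' } },
  });
  console.log(`found ${docs.length} tasks`);

  // delete
  await taskManager.remove(id);
}
```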
@ -522,4 +521,5 @@ The task manager's public API is create / delete / list. Updates aren't directly
Task Manager exposes runtime statistics which enable basic observability into its inner workings and make it possible to monitor the system from external services.
Learn More: [./MONITORING](./MONITORING.MD)
Public Documentation: https://www.elastic.co/guide/en/kibana/master/task-manager-health-monitoring.html
Developer Documentation: [./MONITORING](./MONITORING.MD)

View file

@ -32,21 +32,8 @@ xpack.task_manager.monitored_task_execution_thresholds:
```
## Consuming Health Stats
Task Manager exposes a `/api/task_manager/_health` API which returns the _latest_ stats.
Calling this API is designed to be fast and doesn't actually perform any checks; rather, it returns the latest stats already collected by the system, and it is designed in such a way that you could call it from an external service on a regular basis without worrying that you'll be adding substantial load to the system.
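As a hedged example, an external watchdog could poll the endpoint on a schedule and alert when the overall `status` is not `OK`. The Kibana URL, credentials, and `notify` implementation below are placeholders, and a runtime with a global `fetch` (for example Node 18+) is assumed:
```ts
// Sketch of an external watchdog polling the health endpoint once a minute.
// KIBANA_URL and AUTH_HEADER are placeholders you would supply.
const KIBANA_URL = 'http://localhost:5601'; // placeholder
const AUTH_HEADER = 'Basic <base64 credentials>'; // placeholder

async function notify(message: string): Promise<void> {
  console.error(message); // placeholder: send to your paging / chat system instead
}

async function checkTaskManagerHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`, {
    headers: { Authorization: AUTH_HEADER },
  });
  const health = await res.json();

  // `status` rolls up all sections; `last_update` shows how fresh the stats are.
  if (health.status !== 'OK') {
    await notify(`Task Manager status is ${health.status} (last update: ${health.last_update})`);
  }
}

setInterval(checkTaskManagerHealth, 60_000);
```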
Additionally, the metrics are logged to Task Manager's `DEBUG` logger at a regular cadence (dictated by the polling interval).
If you wish to enable DEBUG logging in your Kibana instance, you will need to add the following to your `kibana.yml`:
```
logging:
loggers:
- context: plugins.taskManager
appenders: [console]
level: debug
```
Please bear in mind that these stats are logged as often as your `poll_interval` configuration, which means they could add substantial noise to your logs.
We would recommend only enabling this level of logging temporarily.
Public Documentation: https://www.elastic.co/guide/en/kibana/master/task-manager-health-monitoring.html#task-manager-consuming-health-stats
### Understanding the Exposed Stats
@ -60,6 +47,8 @@ An `OK` status will only be displayed when all sections are marked as `OK`.
The root `timestamp` is the time at which the summary was exposed (either to the DEBUG logger or the HTTP API) and the `last_update` is the last time any one of the sections was updated.
Follow this step-by-step guide to make sense of the stats: https://www.elastic.co/guide/en/kibana/master/task-manager-troubleshooting.html#task-manager-diagnosing-root-cause
#### The Configuration Section
The `configuration` section summarizes Task Manager's current configuration, including dynamic configurations which change over time, such as `poll_interval` and `max_workers`, which adjust in reaction to changing load on the system.
@ -85,232 +74,3 @@ These include:
- The `Success | Retry | Failure ratio` by task type. This is different from the workload stats, which tell you what's in the queue but can't keep track of retries or of non-recurring tasks, as they're wiped off the index when completed.
These are "Hot" stats which are updated reactively as Tasks are executed and interacted with.
### Example Stats
For example, if you _curl_ the `/api/task_manager/_health` endpoint, you might get these stats:
```
{
/* the time these stats were returned by the api */
"timestamp": "2020-10-05T18:26:11.346Z",
/* the overall status of the system */
"status": "OK",
/* last time any stat was updated in this output */
"last_update": "2020-10-05T17:57:55.411Z",
"stats": {
"configuration": { /* current configuration of TM */
"timestamp": "2020-10-05T17:56:06.507Z",
"status": "OK",
"value": {
"max_workers": 10,
"poll_interval": 3000,
"request_capacity": 1000,
"max_poll_inactivity_cycles": 10,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_running_average_window": 50
}
},
"workload": { /* The workload of this deployment */
"timestamp": "2020-10-05T17:57:06.534Z",
"status": "OK",
"value": {
"count": 6, /* count of tasks in the system */
"task_types": { /* what tasks are there and what status are they in */
"actions_telemetry": {
"count": 1,
"status": {
"idle": 1
}
},
"alerting_telemetry": {
"count": 1,
"status": {
"idle": 1
}
},
"apm-telemetry-task": {
"count": 1,
"status": {
"idle": 1
}
},
"endpoint:user-artifact-packager": {
"count": 1,
"status": {
"idle": 1
}
},
"lens_telemetry": {
"count": 1,
"status": {
"idle": 1
}
},
"session_cleanup": {
"count": 1,
"status": {
"idle": 1
}
}
},
/* Frequency of recurring task schedules */
"schedule": [
["60s", 1], /* 1 task, every 60s */
["3600s", 3], /* 3 tasks every hour */
["720m", 1]
],
/* There are no overdue tasks in this system at the moment */
"overdue": 0,
/* This is the schedule density: it shows a histogram of all the polling intervals in the next minute (or, if
pollInterval is configured unusually high, it will show a minimum of 2 refresh intervals into the future and a maximum of 50 buckets).
Here we see that on the 3rd polling interval from *now* (which is ~9 seconds from now, as pollInterval is `3s`) there is one task due to run.
We also see that there are 5 due two intervals later, which is fine as max_workers is `10`
*/
"estimated_schedule_density": [0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
},
"runtime": {
"timestamp": "2020-10-05T17:57:55.411Z",
"status": "OK",
"value": {
"polling": {
/* When was the last polling cycle? */
"last_successful_poll": "2020-10-05T17:57:55.411Z",
/* When was the last time Task Manager adjusted its polling delay? */
"last_polling_delay": "2020-10-05T17:57:55.411Z",
/* Running average of polling duration measuring the time from the scheduled polling cycle
start until all claimed tasks are marked as running */
"duration": {
"p50": 4,
"p90": 12,
"p95": 12,
"p99": 12
},
/* Running average of number of version clashes caused by the markAvailableTasksAsClaimed stage
of the polling cycle */
"claim_conflicts": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
/* Running average of mismatch between the number of tasks updated by the markAvailableTasksAsClaimed stage
of the polling cycle and the number of docs found by the sweepForClaimedTasks stage */
"claim_mismatches": {
"p50": 0,
"p90": 0,
"p95": 0,
"p99": 0
},
/* What is the frequency of polling cycle results?
Here we see 94% of "NoTasksClaimed" and 6% "PoolFilled" */
"result_frequency_percent_as_number": {
/* This tells us that the polling cycle didn't claim any new tasks */
"NoTasksClaimed": 94,
/* This is a legacy result we are renaming in 8.0.0 -
it tells us when a polling cycle resulted in claiming more tasks
than we had workers for, but the name doesn't make much sense outside of the context of the code */
"RanOutOfCapacity": 0,
/* This is a legacy result we are renaming in 8.0.0 -
it tells us when a polling cycle resulted in tasks being claimed but fewer than the available workers */
"PoolFilled": 6,
/* This tells us when a polling cycle resulted in no tasks being claimed due to there being no available workers */
"NoAvailableWorkers": 0,
/* This tells us when a polling cycle resulted in tasks being claimed at 100% capacity of the available workers */
"RunningAtCapacity": 0,
/* This tells us when the poller failed to claim */
"Failed": 0
}
},
/* on average, 50% of the tasks in this deployment run at most 1.7s after their scheduled time */
"drift": {
"p50": 1720,
"p90": 2274,
"p95": 2574,
"p99": 3221
},
/* on average, 50% of the task polling cycles in this deployment result in at most 25% of workers being in use.
We track this in percentages rather than absolute counts, as max_workers can change over time in response
to changing circumstances. */
"load": {
"p50": 25,
"p90": 80,
"p95": 100,
"p99": 100
},
"execution": {
"duration": {
/* on average, the `endpoint:user-artifact-packager` tasks take 15ms to run */
"endpoint:user-artifact-packager": {
"mean": 15,
"median": 14.5
},
"session_cleanup": {
"mean": 28,
"median": 28
},
"lens_telemetry": {
"mean": 100,
"median": 100
},
"actions_telemetry": {
"mean": 135,
"median": 135
},
"alerting_telemetry": {
"mean": 197,
"median": 197
},
"apm-telemetry-task": {
"mean": 1347,
"median": 1347
}
},
"result_frequency_percent_as_number": {
/* and 100% of `endpoint:user-artifact-packager` runs have completed successfully (within the running average window,
which is the past 50 runs by default, configurable by `monitored_stats_running_average_window`) */
"endpoint:user-artifact-packager": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"session_cleanup": {
/* `error` status as 90% of results are `Failed` */
"status": "error",
"Success": 5,
"RetryScheduled": 5,
"Failed": 90
},
"lens_telemetry": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"actions_telemetry": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"alerting_telemetry": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"apm-telemetry-task": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
}
}
}
}
}
}
}
```