[Alerting][Docs] Support enablement documentation. (#101457) (#103537)

* [Alerting][Docs] Support enablement documentation.

* additional docs

* fixed links

* Apply suggestions from code review

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* fixed common issues

* Apply suggestions from code review

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* fixed due to comments

* fixed TM health api page

* fixed TM health api page 2

* Apply suggestions from code review

Co-authored-by: ymao1 <ying.mao@elastic.co>
Co-authored-by: Mike Côté <mikecote@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Mike Côté <mikecote@users.noreply.github.com>
Co-authored-by: ymao1 <ying.mao@elastic.co>

* fixed due to the comments

* fixed due to the comments

* fixed experimental flag

* fixed due to the comments

* Apply suggestions from code review

Co-authored-by: ymao1 <ying.mao@elastic.co>

* Update docs/user/alerting/alerting-troubleshooting.asciidoc

Co-authored-by: ymao1 <ying.mao@elastic.co>

* fixed due to the comments

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: ymao1 <ying.mao@elastic.co>
Co-authored-by: Mike Côté <mikecote@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: ymao1 <ying.mao@elastic.co>
Co-authored-by: Mike Côté <mikecote@users.noreply.github.com>
Yuliia Naumenko 2021-06-28 12:05:26 -07:00 committed by GitHub
parent 98c5543866
commit 02e4166405
13 changed files with 777 additions and 242 deletions


@ -1,5 +1,5 @@
[[task-manager-api-health]]
=== Get Task Manager health API
== Task Manager health API
++++
<titleabbrev>Get Task Manager health</titleabbrev>
++++
@ -7,18 +7,18 @@
Retrieve the health status of the {kib} Task Manager.
[[task-manager-api-health-request]]
==== Request
=== Request
`GET <kibana host>:<port>/api/task_manager/_health`
[[task-manager-api-health-codes]]
==== Response code
=== Response code
`200`::
Indicates a successful call.
[[task-manager-api-health-example]]
==== Example
=== Example
Retrieve the health status of the {kib} Task Manager:
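A minimal example request, assuming {kib} is running locally on the default port (host, port, and credentials are illustrative):

[source, txt]
--------------------------------------------------
curl -X GET 'http://localhost:5601/api/task_manager/_health'
--------------------------------------------------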


@ -1,263 +1,201 @@
[role="xpack"]
[[alerting-troubleshooting]]
== Alerting Troubleshooting
== Troubleshooting
++++
<titleabbrev>Troubleshooting</titleabbrev>
++++
This page describes how to resolve common problems you might encounter with Alerting.
If your problem isn't described here, please review open issues in the following GitHub repositories:
* https://github.com/elastic/kibana/issues[kibana] (https://github.com/elastic/kibana/issues?q=is%3Aopen+is%3Aissue+label%3AFeature%3AAlerting[Alerting issues])
Have a question? Contact us in the https://discuss.elastic.co/[discuss forum].
The Alerting framework provides many options for diagnosing problems with Rules and Connectors.
[float]
[[rule-cannot-decrypt-api-key]]
=== Rule cannot decrypt apiKey
[[alerting-kibana-log]]
=== Check the {kib} log
*Problem*:
Rules and connectors log to the Kibana logger with tags of [alerting] and [actions], respectively. Generally, the messages are warnings and errors. In some cases, the error might be a false positive, for example, when a connector is deleted and a rule is running.
The rule fails to execute and has an `Unable to decrypt attribute "apiKey"` error.
*Solution*:
This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used during rule execution. Depending on the scenario, there are different ways to solve this problem:
[cols="2*<"]
|===
| If the value in `xpack.encryptedSavedObjects.encryptionKey` was manually changed, and the previous encryption key is still known.
| Ensure any previous encryption key is included in the keys used for <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only>>.
| If another {kib} instance with a different encryption key connects to the cluster.
| The other {kib} instance might be trying to run the rule using a different encryption key than what the rule was created with. Ensure the encryption keys among all the {kib} instances are the same, and setting <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only keys>> for previously used encryption keys.
| If other scenarios don't apply.
| Generate a new API key for the rule by disabling then enabling the rule.
|===
[float]
[[rules-small-check-interval-run-late]]
=== Rules with small check intervals run late
*Problem*:
Rules with a small check interval, such as every two seconds, run later than scheduled.
*Resolution*:
Rules run as background tasks at a cadence defined by their *check interval*.
When a Rule *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the rule will run late.
Either tweak the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the rules in question.
For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
[float]
[[scheduled-rules-run-late]]
=== Rules run late
*Problem*:
Scheduled rules run at an inconsistent cadence, often running late.
Actions run long after the status of a rule changes, sending a notification of the change too late.
*Solution*:
Rules and actions run as background tasks by each {kib} instance at a default rate of ten tasks every three seconds.
If many rules or actions are scheduled to run at the same time, pending tasks will queue in {es}. Each {kib} instance then polls for pending tasks at a rate of up to ten tasks at a time, at three second intervals. Because rules and actions are backed by tasks, it is possible for pending tasks in the queue to exceed this capacity and run late.
For details on diagnosing the underlying causes of such delays, see <<task-manager-health-tasks-run-late>>.
Alerting and action tasks are identified by their type.
* Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>.
* Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.
When diagnosing issues related to Alerting, focus on the tasks that begin with `alerting:` and `actions:`.
For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.
[float]
[[connector-tls-settings]]
=== Connectors have TLS errors when executing actions
*Problem*:
When executing actions, a connector gets a TLS socket error when connecting to
the server.
*Resolution*:
Configuration options are available to specialize connections to TLS servers,
including ignoring server certificate validation, and providing certificate
authority data to verify servers using custom certificates. For more details,
see <<action-settings,Action settings>>.
[float]
[[rules-long-execution-time]]
=== Identify long-running rules
The following query can help you identify rules that are taking a long time to execute and might impact the overall health of your deployment.
[IMPORTANT]
==============================================
By default, only users with a `superuser` role can query the {kib} event log because it is a system index. To enable additional users to execute this query, assign `read` privileges to the `.kibana-event-log*` index.
==============================================
Query for a list of rule ids, bucketed by their execution times:
[source,console]
[source, txt]
--------------------------------------------------
server log [11:39:40.389] [error][alerting][alerting][plugins][plugins] Executing Alert "5b6237b0-c6f6-11eb-b0ff-a1a0cbcf29b6" has resulted in Error: Saved object [action/fdbc8610-c6f5-11eb-b0ff-a1a0cbcf29b6] not found
--------------------------------------------------
Some of the resources, such as saved objects and API keys, may no longer be available or valid, yielding error messages about those missing resources.
[float]
[[alerting-kibana-version]]
=== Use the debugging tools
The following debugging tools are available:
* {kib} versions 7.10 and above
have a <<testing-connectors,Test connector>> UI.
* {kib} versions 7.11 and above
include improved Webhook error messages,
better overall debug logging for actions and connectors,
and Task Manager <<task-manager-diagnosing-root-cause,diagnostics endpoints>>.
[float]
[[alerting-managment-detail]]
=== Use the rules and connectors list to view current state and find issues
*Rules and Connectors* in *Stack Management* lists the rules and connectors available in the space you're currently in. When you click a rule name, you are navigated to the <<rule-details,details page>> for the rule, where you can see the currently active alerts.
The start date on this page indicates when a rule was triggered, and for which alerts. In addition, the duration of the condition indicates how long the instance has been active.
[role="screenshot"]
image::images/rule-details-alerts-inactive.png[Alerting management details]
[float]
[[alerting-index-threshold-chart]]
=== Preview the index threshold rule chart
When creating or editing an index threshold rule, you see a graph of the data the rule will operate against, from some date in the past until now, updated every 5 seconds.
[role="screenshot"]
image::images/index-threshold-chart.png[Index Threshold chart]
The end date is related to the rule check interval (roughly 30 intervals' worth of time). You can use this view to check whether the rule is getting the data you expect, and visually compare it to the threshold value (a horizontal line in the graph). If the graph does not contain any lines except for the threshold line, then the rule has an issue; for example, no data is available for the specified index and fields, or there is a permission error.
Diagnosing these issues can be difficult, but the {kib} log often contains messages for the error conditions.
[float]
[[alerting-rest-api]]
=== Use the REST APIs
There is a rich set of HTTP endpoints to introspect and manage rules and connectors.
One of the HTTP endpoints available for actions is the POST <<execute-connector-api,_execute API>>. You can use it to test an action. For instance, if you have a server log action created, you can execute it by curling the endpoint:
[source, txt]
--------------------------------------------------
curl -X POST -k \
-H 'kbn-xsrf: foo' \
-H 'content-type: application/json' \
'<kibana host>:<port>/api/actions/connector/a692dc89-15b9-4a3c-9e47-9fb6872e49ce/_execute' \
-d '{"params":{"subject":"hallo","message":"hallo!","to":["me@example.com"]}}'
--------------------------------------------------
experimental[] In addition, there is a command-line client that uses legacy Rules and Connectors APIs, which can be easier to use, but must be updated for the new APIs.
CLI tools to list, create, edit, and delete alerts (rules) and actions (connectors) are available in https://github.com/pmuellr/kbn-action[kbn-action], which you can install as follows:
[source, txt]
--------------------------------------------------
npm install -g pmuellr/kbn-action
--------------------------------------------------
The equivalent of the REST POST _execute API command in `kbn-action` is:
[source, txt]
--------------------------------------------------
kbn-action execute a692dc89-15b9-4a3c-9e47-9fb6872e49ce {"params":{"subject":"hallo","message":"hallo!","to":["me@example.com"]}}
--------------------------------------------------
The result of this HTTP request (printed to stdout by https://github.com/pmuellr/kbn-action[kbn-action]) is the data returned by the action execution, along with any error messages that were encountered.
[float]
[[alerting-error-banners]]
=== Look for error banners
The Rule Management and Rule Details pages display an error banner that helps you identify errors in your rules:
[role="screenshot"]
image::images/rules-management-health.png[Rule management page with the errors banner]
[role="screenshot"]
image::images/rules-details-health.png[Rule details page with the errors banner]
[float]
[[task-manager-diagnostics]]
=== Task Manager diagnostics
Under the hood, Rules and Connectors uses a plugin called Task Manager, which handles the scheduling, execution, and error handling of the tasks.
This means that failure cases in Rules or Connectors will, at times, be revealed by the Task Manager mechanism, rather than the Rules mechanism.
Task Manager provides a visible status that can be used to diagnose issues, and is documented in detail in <<task-manager-health-monitoring,health monitoring>> and <<task-manager-troubleshooting,troubleshooting>>.
Task Manager uses the `.kibana_task_manager` index, an internal index that contains all the saved objects that represent the tasks in the system.
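For example, a hedged sketch of a Console query that counts tasks by type in that index (the `task.taskType` field is visible in the task document shown later in this section):

[source, txt]
--------------------------------------------------
GET .kibana_task_manager/_search
{
  "size": 0,
  "aggs": {
    "task_types": {
      "terms": { "field": "task.taskType" }
    }
  }
}
--------------------------------------------------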
[float]
==== Getting from a Rule to its Task
When a rule is created, a task is created and scheduled to run at the specified interval. For example, when a rule is configured to check every 5 minutes, the underlying task is expected to run every 5 minutes. In practice, after each run, the task is scheduled to run again in 5 minutes, rather than being scheduled to run every 5 minutes indefinitely.
If you use the <<alerting-apis,Alerting REST APIs>> to fetch the underlying rule, you'll get an object like so:
[source, txt]
--------------------------------------------------
GET /.kibana-event-log*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "now-1d", <1>
"lte": "now"
}
}
},
{
"term": {
"event.action": {
"value": "execute"
}
}
},
{
"term": {
"event.provider": {
"value": "alerting" <2>
}
}
}
]
}
"id": "0a037d60-6b62-11eb-9e0d-85d233e3ee35",
"notify_when": "onActionGroupChange",
"params": {
"aggType": "avg",
},
"runtime_mappings": { <3>
"event.duration_in_seconds": {
"type": "double",
"script": {
"source": "emit(doc['event.duration'].value / 1E9)"
}
}
"consumer": "alerts",
"rule_type_id": "test.rule.type",
"schedule": {
"interval": "1m"
},
"aggs": {
"ruleIdsByExecutionDuration": {
"histogram": {
"field": "event.duration_in_seconds",
"min_doc_count": 1,
"interval": 1 <4>
},
"aggs": {
"ruleId": {
"nested": {
"path": "kibana.saved_objects"
},
"aggs": {
"ruleId": {
"terms": {
"field": "kibana.saved_objects.id",
"size": 10 <5>
}
}
}
}
}
}
"actions": [],
"tags": [],
"name": "test rule",
"enabled": true,
"throttle": null,
"api_key_owner": "elastic",
"created_by": "elastic",
"updated_by": "elastic",
"mute_all": false,
"muted_alert_ids": [],
"updated_at": "2021-02-10T05:37:19.086Z",
"created_at": "2021-02-10T05:37:19.086Z",
"scheduled_task_id": "31563950-b14b-11eb-9a7c-9df284da9f99",
"execution_status": {
"last_execution_date": "2021-02-10T17:55:14.262Z",
"status": "ok"
}
}
--------------------------------------------------
// TEST
<1> This queries for rules executed in the last day. Update the values of `lte` and `gte` to query over a different time range.
<2> Use `event.provider: actions` to query for long-running action executions.
<3> Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
<4> This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
This query returns the following:
[source,json]
The field you're looking for is `scheduled_task_id`, which contains the `_id` of the Task Manager task. If you then go to Console and run the following query, you'll get the underlying task:
[source, txt]
--------------------------------------------------
GET .kibana_task_manager/_doc/task:31563950-b14b-11eb-9a7c-9df284da9f99
{
"took" : 322,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 326,
"relation" : "eq"
"_index" : ".kibana_task_manager_8.0.0_001",
"_id" : "task:31563950-b14b-11eb-9a7c-9df284da9f99",
"_version" : 838,
"_seq_no" : 8791,
"_primary_term" : 1,
"found" : true,
"_source" : {
"migrationVersion" : {
"task" : "7.6.0"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"ruleIdsByExecutionDuration" : {
"buckets" : [
{
"key" : 0.0, <1>
"doc_count" : 320,
"ruleId" : {
"doc_count" : 320,
"ruleId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
"doc_count" : 140
},
{
"key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
"doc_count" : 130
},
{
"key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
"doc_count" : 50
}
]
}
}
},
{
"key" : 30.0, <2>
"doc_count" : 6,
"ruleId" : {
"doc_count" : 6,
"ruleId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
"doc_count" : 6
}
]
}
}
}
]
}
"task" : {
"schedule" : {
"interval" : "5s"
},
"taskType" : "alerting:.index-threshold",
"retryAt" : null,
"runAt" : "2021-05-10T05:18:02.704Z",
"scope" : [
"alerting"
],
"startedAt" : null,
"state" : """{"alertInstances":{},"previousStartedAt":"2021-05-10T05:17:45.671Z"}""",
"params" : """{"alertId":"30d856c0-b14b-11eb-9a7c-9df284da9f99","spaceId":"default"}""",
"ownerId" : null,
"scheduledAt" : "2021-05-10T04:50:07.333Z",
"attempts" : 0,
"status" : "idle"
},
"references" : [ ],
"updated_at" : "2021-05-10T05:17:58.000Z",
"coreMigrationVersion" : "8.0.0",
"type" : "task"
}
}
--------------------------------------------------
<1> Most rule execution durations fall within the first bucket (0 - 1 seconds).
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to execute.
Use the <<get-rule-api,Get Rule API>> to retrieve additional information about rules that take a long time to execute.
What you see above is the task that backs the rule; for the rule to work, this task must be in a healthy state. The health information is available via the <<task-manager-api-health, health API>> or via verbose logs if debug logging is enabled.
When diagnosing the health state of the task, you will most likely be interested in the following fields:
`status`:: The current status of the task. Is Task Manager currently running the task? Is the task idle, waiting for its next run? Or has Task Manager tried to run it and failed?
`runAt`:: When the task is scheduled to run next. If this is in the past and the status is `idle`, Task Manager has fallen behind or isn't running. If it's in the past but the status is `running`, Task Manager has picked it up and is working on it, which is considered healthy.
`retryAt`:: Another time field, like `runAt`. If this field is populated, Task Manager is currently running the task. If the task doesn't complete (and isn't marked as failed), Task Manager will give it another attempt at the time specified under `retryAt`.
Investigating the underlying task can help you gauge whether the problem you're seeing is rooted in the rule not running at all, whether it's running and failing, or whether it's running but exhibiting behavior that differs from what you expected (at which point you should focus on the rule itself, rather than the task).
In addition to the above methods, the following pages describe commonly used approaches and common issues:
* <<alerting-common-issues, Alerting common issues>>
* <<event-log-index, Querying Event log index>>
* <<testing-connectors, Testing connectors using the Connectors UI and the `kbn-action` tool>>
include::troubleshooting/alerting-common-issues.asciidoc[]
include::troubleshooting/event-log-index.asciidoc[]
include::troubleshooting/testing-connectors.asciidoc[]



@ -0,0 +1,253 @@
[role="xpack"]
[[alerting-common-issues]]
=== Common Issues
This page describes how to resolve common problems you might encounter with Alerting.
[float]
[[rules-small-check-interval-run-late]]
==== Rules with small check intervals run late
*Problem*
Rules with a small check interval, such as every two seconds, run later than scheduled.
*Solution*
Rules run as background tasks at a cadence defined by their *check interval*.
When a Rule *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the rule will run late.
Either tweak the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the rules in question.
For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
[float]
[[scheduled-rules-run-late]]
==== Rules run at an inconsistent cadence
*Problem*
Scheduled rules run at an inconsistent cadence, often running late.
Actions run long after the status of a rule changes, sending a notification of the change too late.
*Solution*
Rules and actions run as background tasks by each {kib} instance at a default rate of ten tasks every three seconds.
When diagnosing issues related to Alerting, focus on the tasks that begin with `alerting:` and `actions:`.
* Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>.
* Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.
For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.
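A hedged sketch of a Console query that lists only the alerting-backed tasks in the internal Task Manager index (assuming you have access to `.kibana_task_manager`):

[source, txt]
--------------------------------------------------
GET .kibana_task_manager/_search
{
  "query": {
    "prefix": {
      "task.taskType": "alerting:"
    }
  }
}
--------------------------------------------------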
[float]
[[connector-tls-settings]]
==== Connectors have TLS errors when executing actions
*Problem*
When executing actions, a connector gets a TLS socket error when connecting to
the server.
*Solution*
Configuration options are available to specialize connections to TLS servers,
including ignoring server certificate validation, and providing certificate
authority data to verify servers using custom certificates. For more details,
see <<action-settings,Action settings>>.
[float]
[[rules-long-execution-time]]
==== Rules take a long time to run
*Problem*
Rules are taking a long time to execute and are impacting the overall health of your deployment.
[IMPORTANT]
==============================================
By default, only users with a `superuser` role can query the {kib} event log because it is a system index. To enable additional users to execute this query, assign `read` privileges to the `.kibana-event-log*` index.
==============================================
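A sketch of granting that privilege with the {es} security API (the role name is illustrative):

[source, txt]
--------------------------------------------------
POST /_security/role/kibana_event_log_reader
{
  "indices": [
    {
      "names": [ ".kibana-event-log*" ],
      "privileges": [ "read" ]
    }
  ]
}
--------------------------------------------------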
*Solution*
Query for a list of rule ids, bucketed by their execution times:
[source,console]
--------------------------------------------------
GET /.kibana-event-log*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "now-1d", <1>
"lte": "now"
}
}
},
{
"term": {
"event.action": {
"value": "execute"
}
}
},
{
"term": {
"event.provider": {
"value": "alerting" <2>
}
}
}
]
}
},
"runtime_mappings": { <3>
"event.duration_in_seconds": {
"type": "double",
"script": {
"source": "emit(doc['event.duration'].value / 1E9)"
}
}
},
"aggs": {
"ruleIdsByExecutionDuration": {
"histogram": {
"field": "event.duration_in_seconds",
"min_doc_count": 1,
"interval": 1 <4>
},
"aggs": {
"ruleId": {
"nested": {
"path": "kibana.saved_objects"
},
"aggs": {
"ruleId": {
"terms": {
"field": "kibana.saved_objects.id",
"size": 10 <5>
}
}
}
}
}
}
}
}
--------------------------------------------------
// TEST
<1> This queries for rules executed in the last day. Update the values of `lte` and `gte` to query over a different time range.
<2> Use `event.provider: actions` to query for long-running action executions.
<3> Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
<4> This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
This query returns the following:
[source,json]
--------------------------------------------------
{
"took" : 322,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 326,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"ruleIdsByExecutionDuration" : {
"buckets" : [
{
"key" : 0.0, <1>
"doc_count" : 320,
"ruleId" : {
"doc_count" : 320,
"ruleId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
"doc_count" : 140
},
{
"key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
"doc_count" : 130
},
{
"key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
"doc_count" : 50
}
]
}
}
},
{
"key" : 30.0, <2>
"doc_count" : 6,
"ruleId" : {
"doc_count" : 6,
"ruleId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
"doc_count" : 6
}
]
}
}
}
]
}
}
}
--------------------------------------------------
<1> Most rule execution durations fall within the first bucket (0 - 1 seconds).
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to execute.
Use the <<get-rule-api,Get Rule API>> to retrieve additional information about rules that take a long time to execute.
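For example, a minimal sketch of fetching one of the rule ids from the buckets above (the path follows the <<get-rule-api,Get Rule API>>; host and port are illustrative):

[source, txt]
--------------------------------------------------
curl -X GET '<kibana host>:<port>/api/alerting/rule/41893910-6bca-11eb-9e0d-85d233e3ee35'
--------------------------------------------------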
[float]
[[rule-cannot-decrypt-api-key]]
==== Rule cannot decrypt apiKey
*Problem*:
The rule fails to execute and has an `Unable to decrypt attribute "apiKey"` error.
*Solution*:
This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used during rule execution. Depending on the scenario, there are different ways to solve this problem:
[cols="2*<"]
|===
| If the value in `xpack.encryptedSavedObjects.encryptionKey` was manually changed, and the previous encryption key is still known.
| Ensure any previous encryption key is included in the keys used for <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only>>.
| If another {kib} instance with a different encryption key connects to the cluster.
| The other {kib} instance might be trying to run the rule using a different encryption key than what the rule was created with. Ensure the encryption keys among all the {kib} instances are the same, and setting <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only keys>> for previously used encryption keys.
| If other scenarios don't apply.
| Generate a new API key for the rule by disabling then enabling the rule.
|===
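For the first two scenarios, a sketch of the relevant `kibana.yml` settings (key values are illustrative placeholders):

[source, yml]
--------------------------------------------------
xpack.encryptedSavedObjects:
  encryptionKey: "<current key: an arbitrary string of at least 32 characters>"
  keyRotation:
    decryptionOnlyKeys:
      - "<previous key that existing rules may still be encrypted with>"
--------------------------------------------------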


@ -0,0 +1,201 @@
[role="xpack"]
[[event-log-index]]
=== Event log index
Use the event log index to determine:
* Whether a rule successfully ran but its associated actions did not
* Whether a rule was ever activated
* Additional information about rule execution errors
* Duration times for rule and action executions
[float]
==== Example Event Log Queries
Event log query to look at all events related to a specific rule id:
[source, txt]
--------------------------------------------------
GET /.kibana-event-log*/_search
{
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
],
"query": {
"bool": {
"filter": [
{
"term": {
"event.provider": {
"value": "alerting"
}
}
},
// optionally filter by specific action event
{
"term": {
"event.action": "active-instance"
| "execute-action"
| "new-instance"
| "recovered-instance"
| "execute"
}
},
// filter by specific rule id
{
"nested": {
"path": "kibana.saved_objects",
"query": {
"bool": {
"filter": [
{
"term": {
"kibana.saved_objects.id": {
"value": "b541b690-bfc4-11eb-bf08-05a30cefd1fc"
}
}
},
{
"term": {
"kibana.saved_objects.type": "alert"
}
}
]
}
}
}
}
]
}
}
}
--------------------------------------------------
Event log query to look at all events related to executing a rule or action. These events include duration.
[source, txt]
--------------------------------------------------
GET /.kibana-event-log*/_search
{
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
],
"query": {
"bool": {
"filter": [
{
"term": {
"event.action": {
"value": "execute"
}
}
},
// optionally filter by specific rule or action id
{
"nested": {
"path": "kibana.saved_objects",
"query": {
"bool": {
"filter": [
{
"term": {
"kibana.saved_objects.id": {
"value": "b541b690-bfc4-11eb-bf08-05a30cefd1fc"
}
}
}
]
}
}
}
}
]
}
}
}
--------------------------------------------------
To look at errors, check the event documents themselves. When an action encounters an error, the event includes an `error.message` property with a message from the action executor that might provide more detail on why the action failed. For example:
[source, txt]
--------------------------------------------------
{
"event": {
"provider": "actions",
"action": "execute",
"start": "2020-03-31T04:27:30.392Z",
"end": "2020-03-31T04:27:30.393Z",
"duration": 1000000
},
"kibana": {
"namespace": "default",
"saved_objects": [
{
"type": "action",
"id": "7a6fd3c6-72b9-44a0-8767-0432b3c70910"
}
],
},
"message": "action executed: .server-log:7a6fd3c6-72b9-44a0-8767-0432b3c70910: server-log",
"@timestamp": "2020-03-31T04:27:30.393Z",
}
--------------------------------------------------
An error from a rule execution looks similar:
[source, txt]
--------------------------------------------------
{
"event": {
"provider": "alerting",
"start": "2020-03-31T04:27:30.392Z",
"end": "2020-03-31T04:27:30.393Z",
"duration": 1000000
},
"kibana": {
"namespace": "default",
"saved_objects": [
{
"rel" : "primary",
"type" : "alert",
"id" : "30d856c0-b14b-11eb-9a7c-9df284da9f99"
}
],
},
"message": "alert executed: .index-threshold:30d856c0-b14b-11eb-9a7c-9df284da9f99: 'test'",
"error" : {
"message" : "Saved object [action/ef0e2530-b14a-11eb-9a7c-9df284da9f99] not found"
},
}
--------------------------------------------------
You can also query the event log for failures by targeting `event.outcome`, which should return more specific details about rules that failed:
[source, txt]
--------------------------------------------------
GET .kibana-event-log-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "event.outcome": "failure" }}
]
}
}
}
--------------------------------------------------
Here's an example of what failed credentials from the Google SMTP service might look like in the response:
[source, txt]
--------------------------------------------------
"error" : {
"message" : """error sending email: Invalid login: 535-5.7.8 Username and Password not accepted. Learn more at
535 5.7.8 https://support.google.com/mail/?p=BadCredentials e207sm3359731pfh.171 - gsmtp"""
},
--------------------------------------------------


@ -0,0 +1,72 @@
[role="xpack"]
[[testing-connectors]]
=== Test connectors
Using the {kib} *Management* UI, you can test a newly created connector by navigating to the *Test* tab of the connector edit flyout, or by clicking the *Save & test* button on the create flyout:
[role="screenshot"]
image::user/alerting/images/connector-save-and-test.png[Connector create flyout with the Save & test button]
You can also open the edit flyout for an existing connector and use its *Test* tab directly:
[role="screenshot"]
image::user/alerting/images/email-connector-test.png[Email connector Test tab]
[role="screenshot"]
image::user/alerting/images/teams-connector-test.png[Teams connector Test tab]
[float]
==== experimental[] Troubleshooting connectors with the `kbn-action` tool
This example executes an email action via https://github.com/pmuellr/kbn-action[kbn-action], using a cloud deployment of the stack:
[source]
--------------------------------------------------
$ npm -g install pmuellr/kbn-action
$ export KBN_URLBASE=https://elastic:<password>@<cloud-host>.us-east-1.aws.found.io:9243
$ kbn-action ls
[
{
"id": "a692dc89-15b9-4a3c-9e47-9fb6872e49ce",
"actionTypeId": ".email",
"name": "gmail",
"config": {
"from": "test@gmail.com",
"host": "smtp.gmail.com",
"port": 465,
"secure": true,
"service": null
},
"isPreconfigured": false,
"referencedByCount": 0
}
]
--------------------------------------------------
and then execute this:
[source]
--------------------------------------------------
$ kbn-action execute a692dc89-15b9-4a3c-9e47-9fb6872e49ce '{subject: "hallo", message: "hallo!", to:["test@yahoo.com"]}'
{
"status": "ok",
"data": {
"accepted": [
"test@yahoo.com"
],
"rejected": [],
"envelopeTime": 100,
"messageTime": 955,
"messageSize": 521,
"response": "250 2.0.0 OK 1593144408 r5sm8625873qtc.20 - gsmtp",
"envelope": {
"from": "test@gmail.com",
"to": [
"test@yahoo.com"
]
},
"messageId": "<cf9fec58-600f-64fb-5f66-6e55985b935d@gmail.com>"
},
"actionId": "a692dc89-15b9-4a3c-9e47-9fb6872e49ce"
}
--------------------------------------------------


@ -105,4 +105,5 @@ include::{kib-repo-dir}/api/actions-and-connectors.asciidoc[]
include::{kib-repo-dir}/api/dashboard-api.asciidoc[]
include::{kib-repo-dir}/api/logstash-configuration-management.asciidoc[]
include::{kib-repo-dir}/api/url-shortening.asciidoc[]
include::{kib-repo-dir}/api/task-manager/health.asciidoc[]
include::{kib-repo-dir}/api/upgrade-assistant.asciidoc[]


@ -955,3 +955,73 @@ Tasks are not running, and the server logs contain the following error message:
Inline scripts are a hard requirement for Task Manager to function.
To enable inline scripting, see the Elasticsearch documentation for {ref}/modules-scripting-security.html#allowed-script-types-setting[configuring allowed script types setting].
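A sketch of the corresponding `elasticsearch.yml` setting (per the linked {es} documentation; adjust it to your security requirements):

[source, yml]
--------------------------------------------------
script.allowed_types: inline
--------------------------------------------------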
[float]
[[task-runat-is-in-the-past]]
==== What do I do if a task's `runAt` is in the past?
*Problem*:
The task's `runAt` property is in the past.
*Solution*:
Wait a bit before declaring it a lost cause, as Task Manager might just be falling behind on its work.
Take a look at the {kib} log and see what you can find that relates to Task Manager.
In a healthy environment, you should see a log line indicating that Task Manager started successfully when {kib} started:
[source, txt]
--------------------------------------------------
server log [12:41:33.672] [info][plugins][taskManager][taskManager] TaskManager is identified by the Kibana UUID: 5b2de169-2785-441b-ae8c-186a1936b17d
--------------------------------------------------
If you see that message and no other errors that relate to Task Manager, it's most likely that Task Manager is running fine and simply hasn't had the chance to pick the task up yet.
If, on the other hand, the `runAt` is severely overdue, it's worth looking for other Task Manager or Alerting related errors, as something else may have gone wrong.
It's also worth looking at the `status` field: the task might have failed, which would explain why it hasn't been picked up, or it might be `running`, which means the task might simply be a very long-running one.
[float]
[[task-marked-failed]]
==== What do I do if the Task is marked as failed?
*Problem*:
Tasks marked as failed.
*Solution*:
Broadly speaking, the Alerting framework is meant to gracefully handle cases where a task fails by rescheduling a fresh run in the future. If this fails to happen, something has gone wrong in the underlying implementation, which isn't expected.
Ideally, try to find any log lines that relate to this rule and its task, and use these to help investigate further.
[float]
[[task-manager-kibana-log]]
==== Task Manager Kibana Log
Task Manager writes log lines to the {kib} log on certain occasions. Below are some common log lines and what they mean.
Task Manager has run out of Available Workers:
[source, txt]
--------------------------------------------------
server log [12:41:33.672] [info][plugins][taskManager][taskManager] [Task Ownership]: Task Manager has skipped Claiming Ownership of available tasks at it has ran out Available Workers.
--------------------------------------------------
This log message tells us that Task Manager is not managing to keep up with the amount of work it has been tasked with completing. This might mean that rules are not running at the expected frequency (for example, instead of running every 5 minutes, a rule runs every 7-8 minutes).
By default, Task Manager is limited to 10 tasks, which can be bumped up by setting a higher number in the `kibana.yml` file using the `xpack.task_manager.max_workers` configuration. Keep in mind that a higher number of tasks running at any given time means more load on both {kib} and {es}, so only change this setting if increasing load in your environment makes sense.
Another approach is to tell workers to run at a higher rate, rather than adding more of them, which is configured using `xpack.task_manager.poll_interval`. This value dictates how often Task Manager checks whether there is more work to be done, in milliseconds (by default it is 3000, which means an interval of 3 seconds).
Before changing either of these numbers, it's highly recommended to investigate why Task Manager can't keep up. Is there an unusually high number of rules in the system? Are rules failing often, forcing Task Manager to re-run them constantly? Is {kib} under heavy load? There could be a variety of issues, none of which should be solved by simply changing these configurations.
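A sketch of what these settings might look like in `kibana.yml` (values are illustrative; tune them for your environment):

[source, yml]
--------------------------------------------------
xpack.task_manager.max_workers: 20
xpack.task_manager.poll_interval: 3000
--------------------------------------------------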
Task TaskType failed in attempt to run:
[source, txt]
--------------------------------------------------
server log [12:41:33.672] [info][plugins][taskManager][taskManager] Task TaskType "alerting:example.always-firing" failed in attempt to run: Unable to load resource /api/something
--------------------------------------------------
This log message tells us that when Task Manager ran one of our rules, its task errored and, as a result, failed. In this case we can tell that the rule that failed was of type `alerting:example.always-firing` and that the reason it failed was `Unable to load resource /api/something`. This is a contrived example, but broadly, if you see a message with this kind of format, it tells you a lot about where the problem might be.
For example, in this case, we'd expect to see a corresponding log line from the Alerting framework itself, saying that the rule failed. Look in the {kib} log for a line similar to the one below (probably shortly before the Task Manager log line):
Executing Alert "27559295-44e4-4983-aa1b-94fe043ab4f9" has resulted in Error: Unable to load resource /api/something
This would confirm that the error did in fact happen in the rule itself (rather than in Task Manager) and would help pinpoint the specific ID of the rule that failed: `27559295-44e4-4983-aa1b-94fe043ab4f9`.
We can now use that ID to find out more about the rule by using the HTTP endpoint to retrieve the rule's configuration and current state, which helps investigate what might have caused the issue.
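For example, a minimal sketch of retrieving that rule via the <<get-rule-api,Get Rule API>> (host and port are illustrative):

[source, txt]
--------------------------------------------------
curl -X GET '<kibana host>:<port>/api/alerting/rule/27559295-44e4-4983-aa1b-94fe043ab4f9'
--------------------------------------------------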