[Alerting][Docs] Adding query to identify long running rules to docs (#98773)

* Adding query to identify long running rules to docs

* Wording suggestsion from PR review

* Adding event.provider to query. Allowing copy to console

* Adding note for system privileges

* Adding runtime field to query

* Removing extra dollar sign

* PR fixes
This commit is contained in:
ymao1 2021-05-03 19:18:13 -04:00 committed by GitHub
parent 4d3499822d
commit 582c6c7ae3
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23

View file

@ -69,3 +69,167 @@ Configuration options are available to specialize connections to TLS servers,
including ignoring server certificate validation, and providing certificate
authority data to verify servers using custom certificates. For more details,
see <<action-settings,Action settings>>.
[float]
[[rules-long-execution-time]]
=== Identify long-running rules
The following query can help you identify rules that are taking a long time to execute and might impact the overall health of your deployment.
[IMPORTANT]
==============================================
By default, only users with a `superuser` role can query the {kib} event log because it is a system index. To enable additional users to execute this query, assign `read` privileges to the `.kibana-event-log*` index.
==============================================
Query for a list of rule ids, bucketed by their execution times:
[source,console]
--------------------------------------------------
GET /.kibana-event-log*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "now-1d", <1>
"lte": "now"
}
}
},
{
"term": {
"event.action": {
"value": "execute"
}
}
},
{
"term": {
"event.provider": {
"value": "alerting" <2>
}
}
}
]
}
},
"runtime_mappings": { <3>
"event.duration_in_seconds": {
"type": "double",
"script": {
"source": "emit(doc['event.duration'].value / 1E9)"
}
}
},
"aggs": {
"ruleIdsByExecutionDuration": {
"histogram": {
"field": "event.duration_in_seconds",
"min_doc_count": 1,
"interval": 1 <4>
},
"aggs": {
"ruleId": {
"nested": {
"path": "kibana.saved_objects"
},
"aggs": {
"ruleId": {
"terms": {
"field": "kibana.saved_objects.id",
"size": 10 <5>
}
}
}
}
}
}
}
}
--------------------------------------------------
// TEST
<1> This queries for rules executed in the last day. Update the values of `lte` and `gte` to query over a different time range.
<2> Use `event.provider: actions` to query for long-running action executions.
<3> Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
<4> This interval buckets the event.duration_in_seconds runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
This query returns the following:
[source,json]
--------------------------------------------------
{
"took" : 322,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 326,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"ruleIdsByExecutionDuration" : {
"buckets" : [
{
"key" : 0.0, <1>
"doc_count" : 320,
"ruleId" : {
"doc_count" : 320,
"ruleId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
"doc_count" : 140
},
{
"key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
"doc_count" : 130
},
{
"key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
"doc_count" : 50
}
]
}
}
},
{
"key" : 30.0, <2>
"doc_count" : 6,
"ruleId" : {
"doc_count" : 6,
"ruleId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
"doc_count" : 6
}
]
}
}
}
]
}
}
}
--------------------------------------------------
<1> Most rule execution durations fall within the first bucket (0 - 1 seconds).
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to execute.
Use the <<get-rule-api,Get Rule API>> to retrieve additional information about rules that take a long time to execute.