**Related to:** https://github.com/elastic/kibana/pull/94143
## Summary
This PR adds new fields to the schema (`EventSchema`, `IEvent`):
- standard ECS fields: `error.*`, `event.*`, `log.level`, `log.logger`, `rule.*`
- custom field set `kibana.detection_engine`
We need these fields on the Detections side to implement the detection rule execution log. See the related proposal (https://github.com/elastic/kibana/pull/94143) for more details.
This PR also bumps the ECS version used in Event Log from `1.6.0` to the current `1.8.0`. The two versions are identical in terms of the fields used in Event Log, so the version bump caused no schema changes.
When an exception is thrown in `TaskRunner.markTaskAsRunning()`, the task's execution fails, but this happens before we update the SO, which means that the failure does not count towards the `attempts` on the task. Task Manager will keep trying to run this task forever.
This PR increments `attempts` when a failure occurs during `TaskRunner.markTaskAsRunning()` to ensure such a task doesn't keep running indefinitely.
Note that this fix will not affect `scheduled` tasks, as they are designed to _ignore_ their `attempts` and run forever. Such a task will continue to consume Task Manager resources until canceled, but its failures will be logged and can be identified when needed.
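The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Task Manager's actual code: the `TaskDoc` shape, `MAX_ATTEMPTS`, and `tryMarkAsRunning` are hypothetical names introduced only for this example.

```typescript
// Hypothetical, simplified task shape (not Task Manager's real type).
interface TaskDoc {
  id: string;
  attempts: number;
  status: 'idle' | 'running' | 'failed';
  schedule?: { interval: string }; // present only on recurring tasks
}

const MAX_ATTEMPTS = 3; // assumed limit for this sketch

// Stand-in for the step that may throw before the saved object is updated.
type MarkAsRunning = (task: TaskDoc) => TaskDoc;

function tryMarkAsRunning(task: TaskDoc, markAsRunning: MarkAsRunning): TaskDoc {
  try {
    return markAsRunning(task);
  } catch {
    // Count the failure so that a one-off task eventually stops being retried.
    const attempts = task.attempts + 1;
    // Recurring tasks ignore `attempts` and are always returned to `idle`.
    const exhausted = task.schedule === undefined && attempts >= MAX_ATTEMPTS;
    return { ...task, attempts, status: exhausted ? 'failed' : 'idle' };
  }
}
```

In this sketch a one-off task is marked `failed` once its counted attempts reach the limit, while a task with a `schedule` keeps being retried.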
* chore(NA): create new x-pack cigroups and rebalancing them all
* chore(NA): better cigroups balancing
* chore(NA): push rollup tests back into ciGroup1
* chore(NA): move some functional ml tests from cigroup3 into cigroup13
* chore(NA): move some more tests into ciGroup13
* chore(NA): use a single top level describe at x-pack/test/functional/apps/ml
* chore(NA): move settings into ciGroup13
* temporary test for es snapshots env
* Revert "temporary test for es snapshots env"
This reverts commit 789ebe7b9c.
* docs(NA): add missing documentation on the function tests describe split
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
This PR introduces a `pollingDelay` which is applied to the polling interval whenever the average percentage of tasks experiencing a version conflict is higher than a preconfigured threshold (defaults to 80%).
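As a rough sketch of that rule: the function and parameter names below are hypothetical, and the example assumes the delay is simply added to the interval, which the description does not spell out.

```typescript
const CONFLICT_THRESHOLD_PERCENT = 80; // default threshold mentioned above

// Plain average of the recent per-cycle conflict percentages.
function averagePercent(samples: number[]): number {
  if (samples.length === 0) return 0;
  return samples.reduce((sum, sample) => sum + sample, 0) / samples.length;
}

// Returns how long to wait (in ms) before the next polling cycle.
// Assumption: the delay is added on top of the regular interval.
function nextPollDelay(
  pollIntervalMs: number,
  conflictPercentages: number[],
  pollingDelayMs: number
): number {
  const average = averagePercent(conflictPercentages);
  return average > CONFLICT_THRESHOLD_PERCENT
    ? pollIntervalMs + pollingDelayMs
    : pollIntervalMs;
}
```

The delay only applies while the average stays strictly above the threshold, so polling returns to the regular interval once conflicts subside.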
Adds additional polling stats to Task Manager monitoring:
- **duration**: Running average of polling duration measuring the time from the scheduled polling cycle start until all claimed tasks are marked as running
- **claim_conflicts**: Running average of number of version clashes caused by the markAvailableTasksAsClaimed stage of the polling cycle
- **claim_mismatches**: Running average of mismatch between the number of tasks updated by the markAvailableTasksAsClaimed stage of the polling cycle and the number of docs found by the sweepForClaimedTasks stage
- **load**: Running average of the percentage of workers in use at the end of each polling cycle
Added the following values to the Polling stats:
- **NoAvailableWorkers**: This tells us when a polling cycle resulted in no tasks being claimed due to there being no available workers
- **RunningAtCapacity**: This tells us when a polling cycle resulted in tasks being claimed at 100% capacity of the available workers
- **Failed**: This tells us when the poller failed to claim
resolves https://github.com/elastic/kibana/issues/55634
resolves https://github.com/elastic/kibana/issues/65746
Buffers event docs being written for a fixed interval / buffer size,
and indexes those docs via a bulk ES call.
The buffers are now also flushed at plugin stop() time, which
we couldn't do before because the single index calls were
run via `setImmediate()`.
This is a redo of PR https://github.com/elastic/kibana/pull/80941 which
had to be reverted.
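The buffering approach can be sketched as follows. This is only an illustration under stated assumptions: `BufferedEventWriter` and `BulkIndexer` are hypothetical names, and the injected `bulkIndex` callback stands in for the bulk ES call rather than using a real client.

```typescript
type EventDoc = Record<string, unknown>;
// Stand-in for a function that sends one bulk request to Elasticsearch.
type BulkIndexer = (docs: EventDoc[]) => Promise<void>;

class BufferedEventWriter {
  private buffer: EventDoc[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly bulkIndex: BulkIndexer;
  private readonly maxBufferSize: number;

  constructor(bulkIndex: BulkIndexer, maxBufferSize: number, flushIntervalMs: number) {
    this.bulkIndex = bulkIndex;
    this.maxBufferSize = maxBufferSize;
    // Flush on a fixed interval, whatever the buffer currently holds.
    this.timer = setInterval(() => this.flushInBackground(), flushIntervalMs);
  }

  write(doc: EventDoc): void {
    this.buffer.push(doc);
    // Flush early once the buffer reaches its size limit.
    if (this.buffer.length >= this.maxBufferSize) this.flushInBackground();
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const docs = this.buffer;
    this.buffer = []; // swap first so new writes land in a fresh buffer
    await this.bulkIndex(docs);
  }

  // Called at plugin stop: cancel the timer and write out what is left.
  async stop(): Promise<void> {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
    await this.flush();
  }

  private flushInBackground(): void {
    this.flush().catch(() => {
      /* a real implementation would log the failure */
    });
  }
}
```

Because the writer owns the pending docs, `stop()` can await a final flush, which fire-and-forget writes scheduled via `setImmediate()` did not allow.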
This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of https://github.com/elastic/kibana/issues/39349).
In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
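A minimal sketch of the intended `retryAt` calculation is shown below. The `computeRetryAt` and `intervalToMs` helpers are hypothetical, and the interval format (a number followed by `s` or `m`) is an assumption made for the example.

```typescript
// Hypothetical, simplified task shape: only recurring tasks carry a schedule.
interface TaskInstance {
  schedule?: { interval: string }; // assumed format, e.g. '30s' or '5m'
}

// Converts an interval such as '5m' into milliseconds (assumed format).
function intervalToMs(interval: string): number {
  const amount = parseInt(interval, 10);
  if (!Number.isFinite(amount)) throw new Error(`Invalid interval: ${interval}`);
  switch (interval.slice(-1)) {
    case 's':
      return amount * 1000;
    case 'm':
      return amount * 60 * 1000;
    default:
      throw new Error(`Unsupported interval unit in: ${interval}`);
  }
}

// Recurring tasks retry on their next scheduled run;
// one-off tasks fall back to the task type's timeout.
function computeRetryAt(task: TaskInstance, startedAt: Date, timeoutMs: number): Date {
  const delayMs = task.schedule ? intervalToMs(task.schedule.interval) : timeoutMs;
  return new Date(startedAt.getTime() + delayMs);
}
```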
This PR adds an internal monitoring mechanism to Task Manager which keeps track of a variety of metrics, and a health API endpoint which makes the monitored statistics accessible.
* wip
* Adding updateFieldsAndMarkAsFailed function
* Updating UBQ
* Only updating retryAt if marking as claiming
* Updating query
* Updating query to only fail one time tasks that have exceeded max attempts
* Fixing tests
* Fixing tests
* Handling claiming tasks by id
* Removing unused function
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Fixes flaky tests in Task Manager and Alerting.
The fix in #73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over.
This prevents that edge case, which was affecting both TM in general and Alerting specifically.
This PR addresses two issues which caused several tests to be flaky in TM.
When `runNow` was introduced to TM we added a pinned query which returned specific tasks by ID.
This query does not have the filter applied to it, which causes tasks to be returned when they're already marked as `running`. We didn't handle these correctly, which caused flakiness in the tests.
This didn't cause broken behaviour, but it did cause behaviour that was hard to reason about. We now handle these tasks correctly.
It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to become visible (we don't use `refresh:true` on purpose), so we now wait for the index to refresh to make sure the task is updated in time for the next stage of the test.
* mark legacy ES client types as deprecated
* expose es client to plugins and update mocks
* ElasticSearchClientMock --> ElasticsearchClientMock
* expose es client mocks
* expose es client via RequestHandlerContext
* convert test/plugin_functional/config into ts
* convert top_nav test into ts
* add an integration test for the es client
* update comments to refer to the new es client
* fix import paths. do not use extensions
temp
* update docs
* fix other refs
* add test for a custom client
* fix context
* add test for scoped client
* update docs
resolves https://github.com/elastic/kibana/issues/70086
Configures the saved object client for the event log to access the recently
hidden action and alert saved objects.
We didn't have tests for action/alert event log activity, so added some now.
Also found a buglet that was preventing access to event log data from actions
and alerts in non-default spaces.
Creating events in parallel may be causing slight flakiness, so this change staggers creation to ensure this doesn't happen.
In addition it turned out the `event.end` field was missing in certain cases, causing the test that sorts by `end` to fail.
resolves https://github.com/elastic/kibana/issues/62668
Adds a property named `rel` to the nested saved objects in the event
documents, whose value should either be left unset or set to `primary`.
The query-by-saved-object function changes to only match event documents
with that saved object if it has the `rel: primary` value.
This is used to restrict searches for alerting's executeAction event document
to the alert saved object, and not the action saved object (this
document has an alert and action saved object). The alert saved object
has the `rel: primary` field set, and the action does not. Previously,
those documents were returned with a query of the action saved object.
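The matching rule can be illustrated with an in-memory predicate. The document shape and the `matchesSavedObject` helper are hypothetical simplifications; the real implementation queries Elasticsearch rather than filtering in memory.

```typescript
// Hypothetical, simplified shapes for this illustration.
interface SavedObjectRef {
  type: string;
  id: string;
  rel?: 'primary'; // either unset or 'primary'
}

interface EventDoc {
  action: string;
  savedObjects: SavedObjectRef[];
}

// A document matches only when the requested saved object is marked primary.
function matchesSavedObject(doc: EventDoc, type: string, id: string): boolean {
  return doc.savedObjects.some(
    (so) => so.rel === 'primary' && so.type === type && so.id === id
  );
}

// An executeAction event references both an alert and an action.
const executeActionEvent: EventDoc = {
  action: 'execute-action',
  savedObjects: [
    { type: 'alert', id: '1', rel: 'primary' },
    { type: 'action', id: '2' },
  ],
};
```

With this rule, querying by the alert returns the event while querying by the action does not.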
Completes the migration of all Alerting Services plugins onto the Kibana Platform.
It includes:
1. Actions plugin
2. Alerting plugin
3. Task Manager plugin
4. Triggers UI plugin
It also touches the Uptime and Siem plugins, as their use of the Task Manager relied on some of the legacy lifecycle to work (registering AlertTypes and Telemetry tasks after the Start stage has already begun). The fix was simply to move these registrations to the Setup stage.
resolves https://github.com/elastic/kibana/issues/64275
Changes the fields used to query the event log by time range to use the
`@timestamp` field.
Also allow `@timestamp` as a sort option, and make it the default sort option.
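A sketch of how such a search body might be assembled is shown below. The `FindOptions` shape, the `buildTimeRangeSearch` helper, and the ascending default order are assumptions made for illustration; only the use of `@timestamp` for the range and as the default sort comes from the description.

```typescript
type SortOrder = 'asc' | 'desc';

// Hypothetical options for this sketch.
interface FindOptions {
  start?: string; // ISO date, inclusive lower bound
  end?: string; // ISO date, inclusive upper bound
  sortField?: string;
  sortOrder?: SortOrder;
}

function buildTimeRangeSearch(options: FindOptions = {}) {
  // The time range is always applied to `@timestamp`.
  const range: { gte?: string; lte?: string } = {};
  if (options.start !== undefined) range.gte = options.start;
  if (options.end !== undefined) range.lte = options.end;

  // `@timestamp` is the default sort field; ascending order is an assumption.
  const sortField = options.sortField ?? '@timestamp';
  const sortOrder: SortOrder = options.sortOrder ?? 'asc';

  return {
    query: { range: { '@timestamp': range } },
    sort: [{ [sortField]: { order: sortOrder } }],
  };
}
```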
* Added server api tests for event log service
* fixed tests
* fixed type check issue
* Fixed failing tests
* fixed jest tests
* Fixed due to comments
* Removed flakiness in tests
* fixed type check error
* Fixed func test
Adds a namespace attribute to the saved object entry within the Event Log so that each Saved Object can have its own namespace. This change also removes the existing kibana.namespace field.
As Event Log is not yet in use, this does not include a migration.
Enables access to the Alert State, which allows us to see which current Alert Instances are active.
This includes:
1. Addition of a `get` api on Task Manager
2. Typing and validation on Serialisation & Deserialisation of the State of an Alert's underlying Task
3. Addition of the `getAlertState` api on AlertsClient
As of Elasticsearch 8.0.0 it will no longer be possible to use the _id field on documents.
This PR removes the usage that Task Manager makes of this field and switches to pinned queries to achieve a similar effect.
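For illustration, an Elasticsearch pinned query wraps an "organic" query and promotes the listed document IDs ahead of the organic results. The `pinTasksById` helper below is a hypothetical name, and the organic clause is only a placeholder, not Task Manager's actual query.

```typescript
type QueryClause = Record<string, unknown>;

// Wraps an organic query so that the given document IDs are ranked first.
function pinTasksById(ids: string[], organic: QueryClause): QueryClause {
  return { pinned: { ids, organic } };
}

// Placeholder organic query; the real claiming query is more involved.
const exampleQuery = pinTasksById(['task-1', 'task-2'], { match_all: {} });
```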
Migrates the existing TaskManager plugin from Legacy to Kibana Platform.
We retain the Legacy API to prevent a breaking change, but under the hood, the legacy plugin is now using the Kibana Platform plugin.
Another reason we retain the Legacy plugin is to support several features that the Platform team has yet to migrate to Kibana Platform (mapping, SO schema and migrations).
This moves the interval field under a generic schedule object field in preparation for the introduction of richer scheduling options (such as cron).
It includes a migration for existing tasks, and we've ensured no existing Task Type Definitions exist in Kibana that rely on Interval.
This includes support for the deprecated interval field (which gets mapped to schedule) but that support will be removed in 8.0.0, as it's a breaking change.
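The mapping from the deprecated field to the new one might look like the sketch below. The `TaskAttributes` shape and `migrateIntervalToSchedule` are hypothetical, and letting an existing `schedule` take precedence is an assumption of this example.

```typescript
// Hypothetical, simplified task attributes.
interface TaskAttributes {
  taskType: string;
  interval?: string; // deprecated
  schedule?: { interval: string };
}

// Moves a top-level `interval` under `schedule`, dropping the old field.
function migrateIntervalToSchedule(task: TaskAttributes): TaskAttributes {
  const { interval, ...rest } = task;
  if (interval === undefined) return rest;
  // Assumption: an existing `schedule` wins over the deprecated field.
  return { ...rest, schedule: rest.schedule ?? { interval } };
}
```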
Adds a `runNow` api to Task Manager, allowing us to force the refresh of a recurring task.
This PR includes a couple of sustainability changes as well as the feature itself.
1. **Declarative query composition.** At the moment the queries in the TaskStore are huge JSON objects that are hard to maintain and understand. This PR introduces a pattern where the different parts of the query are composed out of type-checked functions, making it easier to maintain and to construct dynamically as needs change. _This was included in this PR as the **markAvailableTasksAsClaimed** query needs different query clauses depending on whether there are specific Tasks we wish to claim first._
2. **Refactoring of the Task Poller.** As the `runNow` api is introduced we find Task Manager's lifecycle in a weird state where it has both a _pull_ model, where timeouts & callbacks interact without having to respond to any external requests, and a _push_ model where requests are made to the new `runNow` api. Balancing these two proved error prone, hard to maintain and had the potential for _lossy_ behaviour where requests are dropped accidentally. To address this, TaskPoller has been refactored using Rxjs observables, remodelling the existing _pull_ mechanism as a _push_ mechanism so Task Manager can _respond_ to both _polling_ calls and _runNow_ in a similar fashion.
And of course the main feature of this PR:
3. **runNow api.** An api on TaskManager that takes a _task ID_ and attempts to run the task. The call returns a promise which resolves once the task has completed successfully, or rejects if the run resulted in an error.
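The declarative query composition from item 1 can be sketched as below. All helper names and field names here (`mustBeAllOf`, `shouldBeOneOf`, `task.status`, `task.runAt`, `task.id`) are hypothetical and chosen for illustration; they are not necessarily the ones used in the TaskStore.

```typescript
type Clause = Record<string, unknown>;

// Small typed building blocks instead of one large JSON literal.
const mustBeAllOf = (...clauses: Clause[]): Clause => ({ bool: { must: clauses } });
const shouldBeOneOf = (...clauses: Clause[]): Clause => ({ bool: { should: clauses } });
const termEquals = (field: string, value: string): Clause => ({ term: { [field]: value } });
const termsIn = (field: string, values: string[]): Clause => ({ terms: { [field]: values } });
const rangeLte = (field: string, value: string): Clause => ({ range: { [field]: { lte: value } } });

// The claim query gains an extra clause only when specific tasks are requested.
function buildClaimQuery(now: string, claimFirstIds: string[] = []): Clause {
  const idleAndDue = mustBeAllOf(
    termEquals('task.status', 'idle'),
    rangeLte('task.runAt', now)
  );
  return claimFirstIds.length === 0
    ? idleAndDue
    : shouldBeOneOf(termsIn('task.id', claimFirstIds), idleAndDue);
}
```

Composing clauses this way lets the query vary with its inputs while each piece stays small and type-checked.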
This PR adds a test that ensures Task Manager is capable of picking up new tasks in parallel with a long-running task that might otherwise hold up task execution.
This doesn't add functionality - just a missing test case.