Commit graph

92 commits

Author SHA1 Message Date
Tomas Della Vedova 238791b942
ES client : use the new type definitions (#83808)
* Use client from branch

* Get type checking working in core

* Fix types in other plugins

* Update client types + remove type errors from core

* migrate Task Manager Elasticsearch typing from legacy library to client library

* use SortOrder instead of string in alerts

* Update client types + fix core type issues

* fix maps ts errors

* Update Lens types

* Convert Search Profiler body from a string to an object to conform to SearchRequest type.

* Fix SOT types

* Fix/mute Security/Spaces plugins type errors.

* Fix bootstrap types

* Fix painless_lab

* corrected es typing in Event Log

* Use new types from client for inferred search responses

* Latest type defs

* Integrate latest type defs for APM/UX

* fix core errors

* fix telemetry errors

* fix canvas errors

* fix data_enhanced errors

* fix event_log errors

* mute lens errors

* fix or mute maps errors

* fix reporting errors

* fix security errors

* mute errors in task_manager

* fix errors in telemetry_collection_xpack

* fix errors in data plugins

* fix errors in alerts

* mute errors in index_management

* fix task_manager errors

* mute or fix lens errors

* fix upgrade_assistant errors

* fix or mute errors in index_lifecycle_management

* fix discover errors

* fix core tests

* ML changes

* fix core type errors

* mute error in kbn-es-archiver

* fix error in data plugin

* fix error in telemetry plugin

* fix error in discover

* fix discover errors

* fix errors in task_manager

* fix security errors

* fix wrong conflict resolution

* address errors with upstream code

* update deps to the last commit

* remove outdated comments

* fix core errors

* fix errors after update

* adding more expect errors to ML

* pull the latest changes

* fix core errors

* fix errors in infra plugin

* fix errors in uptime plugin

* fix errors in ml

* fix errors in xpack telemetry

* fix or mute errors in transform

* fix errors in upgrade assistant

* fix or mute fleet errors

* start fixing apm errors

* fix errors in osquery

* fix telemetry tests

* core cleanup

* fix asMutableArray imports

* cleanup

* data_enhanced cleanup

* cleanup events_log

* cleanup

* fix error in kbn-es-archiver

* fix errors in kbn-es-archiver

* fix errors in kbn-es-archiver

* fix ES typings for Hit

* fix SO

* fix actions plugin

* fix fleet

* fix maps

* fix stack_alerts

* fix eslint problems

* fix event_log unit tests

* fix failures in data_enhanced tests

* fix test failure in kbn-es-archiver

* fix test failures in index_pattern_management

* fixing ML test

* remove outdated comment in kbn-es-archiver

* fix error type in ml

* fix eslint errors in osquery plugin

* fix runtime error in infra plugin

* revert changes to event_log cluster exist check

* fix eslint error in osquery

* fixing ML endpoint argument types

* fix types

* Update api-extractor docs

* attempt fix for ese test

* Fix lint error

* Fix types for ts refs

* Fix data_enhanced unit test

* fix lens types

* generate docs

* Fix a number of type issues in monitoring and ml

* fix triggers_actions_ui

* Fix ILM functional test

* Put search.d.ts typings back

* fix data plugin

* Update typings in typings/elasticsearch

* Update snapshots

* mute errors in task_manager

* mute fleet errors

* lens. remove unnecessary ts-expect-errors

* fix errors in stack_alerts

* mute errors in osquery

* fix errors in security_solution

* fix errors in lists

* fix errors in cases

* mute errors in search_examples

* use KibanaClient to enforce promise-based API

* fix errors in test/ folder

* update comment

* fix errors in x-pack/test folder

* fix errors in ml plugin

* fix optional fields in ml api_integration tests

* fix another casting problem in ml tests

* fix another ml test failure

* fix fleet problem after conflict resolution

* rollback changes in security_solution. trying to fix test

* Update type for discover rows

* uncomment runtime_mappings as its outdated

* address comments from Wylie

* remove eslint error due to any

* mute error due to incompatibility

* Apply suggestions from code review

Co-authored-by: John Schulz <github.com@jfsiii.org>

* fix type error in lens tests

* Update x-pack/plugins/upgrade_assistant/server/lib/reindexing/reindex_service.ts

Co-authored-by: Alison Goryachev <alisonmllr20@gmail.com>

* Update x-pack/plugins/upgrade_assistant/server/lib/reindexing/reindex_service.test.ts

Co-authored-by: Alison Goryachev <alisonmllr20@gmail.com>

* update deps

* fix errors in core types

* fix errors for the new elastic/elasticsearch version

* remove unused type

* remove unnecessary manual type cast and put optional chaining back

* ML: mute Datafeed is missing indices_options

* Apply suggestions from code review

Co-authored-by: Josh Dover <1813008+joshdover@users.noreply.github.com>

* use canary package instead of git commit

Co-authored-by: Josh Dover <me@joshdover.com>
Co-authored-by: Josh Dover <1813008+joshdover@users.noreply.github.com>
Co-authored-by: Gidi Meir Morris <github@gidi.io>
Co-authored-by: Nathan Reese <reese.nathan@gmail.com>
Co-authored-by: Wylie Conlon <wylieconlon@gmail.com>
Co-authored-by: CJ Cenizal <cj@cenizal.com>
Co-authored-by: Aleh Zasypkin <aleh.zasypkin@gmail.com>
Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co>
Co-authored-by: restrry <restrry@gmail.com>
Co-authored-by: James Gowdy <jgowdy@elastic.co>
Co-authored-by: John Schulz <github.com@jfsiii.org>
Co-authored-by: Alison Goryachev <alisonmllr20@gmail.com>
2021-03-25 04:47:16 -04:00
Mikhail Shustov ee84e0b0b7
Merge tsconfig and x-pack/tsconfig files (#94519)
* merge all the typings at root level

* merge x-pack/tsconfig into tsconfig.json

* fix tsconfig after changes in master

* remove unnecessary typings

* update paths to the global typings

* update paths to the global elasticsearch typings

* fix import

* fix path to typings/elasticsearch in fleet plugin

* remove file deleted from master

* fix lint errors
2021-03-16 15:13:49 +01:00
Gidi Meir Morris 79134b3b6d
[Alerting][Docs] Adds Alerting & Task Manager Scalability Guidance & Health Monitoring (#91171)
Documentation for scaling Kibana alerting, what configurations can change, what impacts they have, etc.
Scaling Alerting relies heavily on scaling Task Manager, so these docs also cover Task Manager health monitoring and scaling.
2021-03-04 14:11:53 +00:00
Alejandro Fernández Haro 5342877a32
[HTTP] Apply the same behaviour to all 500 errors (except from custom responses) (#85541)
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
2021-02-18 17:31:18 +00:00
Gidi Meir Morris 619db36591
[Task manager] Adds support for limited concurrency tasks (#90365)
Adds support for limited concurrency on a Task Type.
2021-02-11 14:46:14 +00:00
Dario Gieselaar d0900f844d
Limit cardinality of transaction.name (#90955) 2021-02-10 21:18:41 +01:00
Pierre Gayvallet 3b3327dbc3
Migrate most plugins to synchronous lifecycle (#89562)
* first pass

* migrate more plugins

* migrate yet more plugins

* more oss plugins

* fix test file

* change Plugin signature on the client-side too

* fix test types

* migrate OSS client-side plugins

* migrate OSS client-side test plugins

* migrate xpack client-side plugins

* revert fix attempt on fleet plugin

* fix presentation start signature

* fix yet another signature

* add warnings for server-side async plugins in dev mode

* remove unused import

* fix isPromise

* Add client-side deprecations

* update migration examples

* update generated doc

* fix xpack unit tests

* nit

* (will be reverted) explicitly await for license to be ready in the auth hook

* Revert "(will be reverted) explicitly await for license to be ready in the auth hook"

This reverts commit fdf73feb

* restore await on on promise contracts

* Revert "(will be reverted) explicitly await for license to be ready in the auth hook"

This reverts commit fdf73feb

* Revert "restore await on on promise contracts"

This reverts commit c5f2fe51

* add delay before starting tests in FTR

* update deprecation ts doc

* add explicit contract for monitoring setup

* migrate monitoring plugin to sync

* change plugin timeout to 10sec

* use delay instead of silence
2021-02-08 10:19:54 +01:00
Brandon Kobel 4584a8b570
Elastic License 2.0 (#90099)
* Updating everything except the license headers themselves

* Applying ESLint rules

* Manually replacing the stragglers
2021-02-03 18:12:39 -08:00
Liza Katz 7fbcf68d73
[Search Sessions] Save all sessions, with persisted flag (#89570)
* [data.search] Add search session methods to search service contract

* Fix types

* Fix tests and switch to cancel

* Update docs

* Fix types/tests

* Fix tests

* Update status of SO before cancelling search requests

* Add API integration test

* Fix types

* Update expiration route to use config defaultExpiration

* Fix test

* Update docs

* New logic for extend

* Remove declare module

* Search Sessions: Unskip Flaky Functional Test

* Review feedback

* fix ts

* Save all search sessions and then manage them based on their persisted state

* Get default search session expiration from config

* randomize sleep time

* fix test

* Remove test that is no longer valid

* fix test

* Make sure we poll, and dont persist, searches not in the context of a session

* Added keepalive unit tests

* fix ts

* code review @lukasolson

* ts

* More tests, rename onScreenTimeout to completedTimeout

* lint

* lint

* Delete async searches

* Support saved object pagination
Fix get search status tests

* better PersistedSearchSessionSavedObjectAttributes ts

* test titles

* Fix undefined bug

* Remove runAt from monitoring task
Increase testing trackingInterval (caused bug)

* support workload histograms that take into account overdue tasks

* Update touched when changing session status to complete \ error

* removed test

* Updated management test data

* Rename configs

* delete tap first
add comments

* Use DataRequestHandlerContext in maps

* ts

* Fixed ts

Co-authored-by: Lukas Olson <olson.lukas@gmail.com>
Co-authored-by: Timothy Sullivan <tsullivan@elastic.co>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Anton Dosov <anton.dosov@elastic.co>
Co-authored-by: Gidi Meir Morris <github@gidi.io>
2021-02-03 23:15:41 +02:00
Gidi Meir Morris f3fba95955
[Task Manager] ignore version conflicts that exceed max_docs in the claiming process (#89415)
This is a first step in attempting to address the overzealous shifting we've identified in TM.

It [turns out](https://github.com/elastic/elasticsearch/issues/63671) `version_conflicts` don't always count against `max_docs`, so in this PR we correct the `version_conflicts` returned by updateByQuery in TaskManager to only count the conflicts that _may_ have counted against `max_docs`.
This correction isn't necessarily accurate, but it will ensure we don't shift if we are in fact managing to claim tasks.
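The correction described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `correctVersionConflicts`, the `UpdateByQueryCounts` shape, and the exact capping formula are assumptions.

```typescript
// Hypothetical subset of an updateByQuery result (assumed shape).
interface UpdateByQueryCounts {
  updated: number;
  version_conflicts: number;
}

// Count only the conflicts that may have occupied a slot in `max_docs`.
// If `updated + version_conflicts` exceeds `maxDocs`, the excess conflicts
// cannot have counted against the limit, so they are capped at the slots
// that were left over after the successful updates.
function correctVersionConflicts(
  counts: UpdateByQueryCounts,
  maxDocs: number
): number {
  const remainingSlots = Math.max(0, maxDocs - counts.updated);
  return Math.min(counts.version_conflicts, remainingSlots);
}
```

With this approach, a cycle that claims every task it asked for reports zero conflicts, so it would not trigger a polling shift.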
2021-01-28 10:14:28 +00:00
Gidi Meir Morris f6837a1f66
made unit test more reliable (#89094)
Made unit test more reliable by using resolving promises rather than timed `await`s that could be flaky when the node event loop is overwhelmed.
2021-01-25 15:57:50 +00:00
Jonathan Budzenski 933d1b1471 skip "run cancels expired tasks prior to running new tasks" 2021-01-21 12:10:59 -06:00
Gidi Meir Morris c89f1f18d3
[Task Manager] Increment task attempts when they fail during markTaskAsRunning (#88669)
When something causes an exception in `TaskRunner.markTaskAsRunning()`, its execution fails, but this happens before we update the SO, which means that the failure does not count towards the `attempts` on the task. Task Manager will continue to try running this task forever.

This PR increments the `attempts` when a failure occurs during `TaskRunner.markTaskAsRunning()` to ensure such a task doesn't continue to run to infinity.
Note that this fix will not affect `scheduled` tasks, as they are designed to _ignore_ their `attempts` and run forever. In such a case the task will continue to consume Task Manager resources until canceled, but these failures will be logged and can be identified when needed.
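A simplified, synchronous sketch of the idea: the attempt is recorded whether or not the preparation step throws. The `Task` shape, the `prepare` callback, and `hasAttemptsLeft` are hypothetical stand-ins, not Task Manager's real types.

```typescript
// Hypothetical, simplified task document.
interface Task {
  id: string;
  attempts: number;
  status: 'idle' | 'running';
}

// `prepare` stands in for any work that may throw before the task
// document is updated.
function markTaskAsRunning(task: Task, prepare: () => void): Task {
  try {
    prepare();
    return { ...task, status: 'running', attempts: task.attempts + 1 };
  } catch {
    // The failure still counts as an attempt, so a non-recurring task
    // eventually exhausts its attempts instead of retrying forever.
    return { ...task, status: 'idle', attempts: task.attempts + 1 };
  }
}

// A non-recurring task may be retried only while attempts remain.
function hasAttemptsLeft(task: Task, maxAttempts: number): boolean {
  return task.attempts < maxAttempts;
}
```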
2021-01-21 14:04:42 +00:00
Gidi Meir Morris b3bec0d6ef
[Task Manager] Cleans up polling shift mechanism (#88210)
Cleanup work
1. Replaced naive initialisation of `last_polling_delay`
2. Changed values in `delayOnClaimConflicts` unit tests to make them less confusing (it was easy to mistake the worker count for the percentage of workers)
3. Added comment explaining the usage of modulo
2021-01-21 14:03:26 +00:00
Gidi Meir Morris e21defa448
[Task Manager] Reject invalid Timeout values in Task Type Definitions (#88602)
This PR adds the following:
1. We now validate the interval passed to `timeout` when a task type definition is registered.
2. Replaces usage of `Joi` with `schema-type`
2021-01-20 17:23:02 +00:00
Gidi Meir Morris 4878554cc9
[Task Manager] cancel expired tasks as part of the available workers check (#88483)
When a task expires, it continues to reside in the queue until `TaskPool.cancelExpiredTasks()` is called. We call this in `TaskPool.run()`, but `run` won't get called if there is no capacity, because we gate the poller on `TaskPool.availableWorkers()`. This means that if you have as many expired tasks as you have workers, your poller will continually restart but the queue will remain full, and Task Manager is then incapable of taking on any more work. This is what caused `[Task Poller Monitor]: Observable Monitor: Hung Observable...`
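The fix can be pictured with a stripped-down pool. The class below is a hypothetical stand-in for the real `TaskPool` (the `RunningTask` shape and the injectable clock are assumptions); it only shows expired tasks being cancelled as part of the capacity check.

```typescript
// Hypothetical, minimal view of a task held by the pool.
interface RunningTask {
  id: string;
  expiresAt: number; // epoch milliseconds
  cancel(): void;
}

class TaskPool {
  private running = new Map<string, RunningTask>();

  constructor(
    private readonly maxWorkers: number,
    private readonly now: () => number = () => Date.now()
  ) {}

  add(task: RunningTask): void {
    this.running.set(task.id, task);
  }

  cancelExpiredTasks(): void {
    const currentTime = this.now();
    this.running.forEach((task, id) => {
      if (task.expiresAt <= currentTime) {
        task.cancel();
        this.running.delete(id);
      }
    });
  }

  // Expired tasks are cleared before capacity is computed, so a pool
  // full of expired tasks no longer reports zero available workers.
  availableWorkers(): number {
    this.cancelExpiredTasks();
    return this.maxWorkers - this.running.size;
  }
}
```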
2021-01-20 17:22:16 +00:00
Gidi Meir Morris 5e4402c374
[Alerting] Shift polling interval by random amount when Task Manager experiences consistent claim version conflicts (#88020)
This PR introduces a `pollingDelay` which is applied to the polling interval whenever the average percentage of tasks experiencing a version conflict is higher than a preconfigured threshold (defaults to 80%).
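A rough sketch of the mechanism, assuming the delay is drawn uniformly from zero up to the polling interval. The function signature, the sample window, and the range of the random delay are illustrative assumptions; only the 80% default comes from the description above.

```typescript
// Returns a delay (ms) to add to the next polling cycle.
// `conflictPercentages` holds recent per-cycle percentages (0-100) of
// tasks that hit a version conflict while being claimed.
function pollingDelay(
  conflictPercentages: number[],
  pollIntervalMs: number,
  thresholdPercent: number = 80,
  random: () => number = Math.random
): number {
  if (conflictPercentages.length === 0) {
    return 0;
  }
  const total = conflictPercentages.reduce((sum, value) => sum + value, 0);
  const average = total / conflictPercentages.length;
  if (average <= thresholdPercent) {
    return 0;
  }
  // Shift by a random amount so that competing Kibana instances stop
  // polling in lockstep (assumed range: [0, pollIntervalMs)).
  return Math.floor(random() * pollIntervalMs);
}
```

Injecting `random` keeps the function deterministic under test.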
2021-01-12 23:34:07 +00:00
Gidi Meir Morris f384c484b7
[Task Manager] adds additional polling stats to Task Manager monitoring (#87766)
Adds additional polling stats to Task Manager monitoring:

- **duration**: Running average of polling duration measuring the time from the scheduled polling cycle start until all claimed tasks are marked as running
- **claim_conflicts**: Running average of number of version clashes caused by the markAvailableTasksAsClaimed stage of the polling cycle
- **claim_mismatches**: Running average of mismatch between the number of tasks updated by the markAvailableTasksAsClaimed stage of the polling cycle and the number of docs found by the sweepForClaimedTasks stage
- **load**: Running average of the percentage of workers in use at the end of each polling cycle.
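Each of these stats is a running average. One simple way to compute one is a mean over a sliding window of recent samples; the helper below is a sketch under that assumption (the name `createRunningAverage` and the windowing strategy are not taken from the PR).

```typescript
// Returns a function that accepts a new sample and yields the mean of
// the most recent `windowSize` samples.
function createRunningAverage(windowSize: number): (value: number) => number {
  const samples: number[] = [];
  return (value: number): number => {
    samples.push(value);
    if (samples.length > windowSize) {
      samples.shift(); // drop the oldest sample
    }
    const total = samples.reduce((sum, sample) => sum + sample, 0);
    return total / samples.length;
  };
}
```

For example, a `load` stat could feed the percentage of busy workers into such a function at the end of every polling cycle.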
2021-01-11 18:32:24 +00:00
Liza Katz 3eeec0f571
[Search] Search Sessions Monitoring Task (#85253)
* Monitor ids

* import fix

* solve circular dep

* eslint

* mock circular dep

* max retries test

* mock circular dep

* test

* jest <(-:C

* jestttttt

* [data.search] Move search method inside session service and add tests

* merge

* Move background session service to data_enhanced plugin

* Better logs
Save IDs only in monitoring loop

* Fix types

* Space aware session service

* ts

* initial

* initial

* Fix session service saving

* merge fix

* stable stringify

* INMEM_MAX_SESSIONS

* INMEM_MAX_SESSIONS

* use the status API

* Move task scheduling behind a feature flag

* Update x-pack/plugins/data_enhanced/server/search/session/session_service.ts

Co-authored-by: Anton Dosov <dosantappdev@gmail.com>

* Add unit tests

* Update x-pack/plugins/data_enhanced/server/search/session/session_service.ts

Co-authored-by: Anton Dosov <dosantappdev@gmail.com>

* Use setTimeout to schedule monitoring steps

* Update request_utils.ts

* settimeout

* tiny cleanup

* Core review + use client.asyncSearch.status

* update ts

* fix unit test

* code review fixes

* Save individual search errors on SO

* Don't re-fetch completed or errored searches

* Rename Background Sessions to Search Sessions (with a send to background action)

* doc

* doc

* jest fun

* rename rfc

* translations

* merge fix

* merge fix

* code review

* update so name in features

* Move deleteTaskIfItExists to task manager

* task_manager to ts project

* Move deleteTaskIfItExists to public contract

* mock

* use task store

* ts

* code review

* code review + jest

* Alerting code review

Co-authored-by: Lukas Olson <olson.lukas@gmail.com>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Anton Dosov <dosantappdev@gmail.com>
Co-authored-by: restrry <restrry@gmail.com>
2021-01-11 16:36:38 +02:00
Mikhail Shustov 1b6f737546
task_manager to ts project (#87646) 2021-01-07 19:27:18 +01:00
Gidi Meir Morris e0db4a3f0b
[Task Manager] adds more granular polling results to monitoring stats (#87494)
Added the following values to the Polling stats:

- **NoAvailableWorkers**: This tells us when a polling cycle resulted in no tasks being claimed due to there being no available workers 
- **RunningAtCapacity**: This tells us when a polling cycle resulted in tasks being claimed at 100% capacity of the available workers
- **Failed**: This tells us when the poller failed to claim
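As an illustration, a polling cycle could be classified into these results as follows. The three result names come from the list above; the `Other` member and the `classifyPollingCycle` function are hypothetical.

```typescript
enum PollingResult {
  NoAvailableWorkers = 'NoAvailableWorkers',
  RunningAtCapacity = 'RunningAtCapacity',
  Failed = 'Failed',
  Other = 'Other', // hypothetical catch-all for the remaining outcomes
}

// `claimed` is the number of tasks claimed in the cycle, or the error
// raised while trying to claim.
function classifyPollingCycle(
  availableWorkers: number,
  claimed: number | Error
): PollingResult {
  if (claimed instanceof Error) {
    return PollingResult.Failed;
  }
  if (availableWorkers <= 0) {
    return PollingResult.NoAvailableWorkers;
  }
  if (claimed >= availableWorkers) {
    return PollingResult.RunningAtCapacity;
  }
  return PollingResult.Other;
}
```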
2021-01-06 18:00:52 +00:00
Rudolf Meijering 89bd0fbf1e
Resilient saved object migration algorithm (#78413)
* Initial structure of migration state-action machine

* Fix type import

* Retries with exponential back off

* Use discriminated union for state type

* Either type for actions

* Test exponential retries

* TaskEither types for actions

* Fetch indices instead of aliases so we can collect all index state in one request

* Log document id if transform fails

* WIP: Legacy pre-migrations

* UPDATE_TARGET_MAPPINGS

* WIP OUTDATED_DOCUMENTS_TRANSFORM

* Narrow res types depending on control state

* OUTDATED_DOCUMENTS_TRANSFORM

* Use .kibana instead of .kibana_current

* rename control states TARGET_DOCUMENTS* -> OUTDATED_DOCUMENTS*

* WIP MARK_VERSION_INDEX_READY

* Fix and expand INIT -> * transition tests

* Add alias/index name helper functions

* Add feature flag for enabling v2 migrations

* split state_action_machine, reindex legacy indices

* Don't use a scroll search for migrating outdated documents

* model: test control state progressions

* Action integration tests

* Fix existing tests and type errors

* snapshot_in_progress_exception can only happen when closing/deleting an index

* Retry steps up to 10 times

* Update api.md documentation files

* Further actions integration tests

* Action unit tests

* Fix actions integration tests

* Rename actions to be more domain-specific

* Apply suggestions from code review

Co-authored-by: Josh Dover <me@joshdover.com>

* Review feedback: polish and flesh out inline comments

* Fix unhandled rejections in actions unit tests

* model: only delay retryable_es_client_error, reset for other left responses

* Actions unit tests

* More inline comments

* Actions: Group index settings under 'index' key

* bulkIndex -> bulkOverwriteTransformedDocuments to be more domain specific

* state_action_machine tests, fix and add additional tests

* Action integration tests: updateAndPickupMappings, searchForOutdatedDocuments

* oops: uncomment commented out code

* actions integration tests: rejection for createIndex

* update state properties: clearer names, mark all as readonly

* add state properties currentAlias, versionAlias, legacyIndex and test for invalid version scheme in index names

* Use CONSTANTS for constants :D

* Actions: Clarify behaviour and impact of acknowledged: false responses

* Use consistent vocabulary for action responses

* KibanaMigrator test for migrationsV2

* KibanaMigrator test for FATAL state and action exceptions in v2 migrations

* Fix ts error in test

* Refactor: split index file up into a file per model, next, types

* next: use partial application so we don't generate a nextActionMap on every call

* move logic from index.ts to migrations_state_action_machine.ts and test

* add test

* use `Root` to allow specifying oss mode

* Add fix and todo tests for reindexing with preMigrationScript

* Dump execution log of state transitions and responses if we hit FATAL

* add 7.3 xpack tests

* add 100k test data

* Reindex instead of cloning for migrations

* Skip 100k x-pack integration test

* MARK_VERSION_INDEX_READY_CONFLICT for dealing with different versions migrating in parallel

* Track elapsed time

* Fix tests

* Model: make exhaustiveness checks more explicit

* actions integration tests: add additional tests from CR

* migrations_state_action_machine fix flaky test

* Fix flaky integration test

* Reserve FATAL termination only for situations which we never can recover from such as later version already migrated the index

* Handle incompatible_mapping_exception caused by another instance

* Cleanup logging

* Fix/stabilize integration tests

* Add REINDEX_SOURCE_TO_TARGET_VERIFY step

* Strip tests archives of */.DS_Store and __MAC_OSX

* Task manager migrations: remove invalid kibana property when converting legacy indices

* Add disabled mappings for removed field in map saved object type

* verifyReindex action: use count API

* REINDEX_BLOCK_* to prevent lost deletes (needs tests)

* Split out 100k docs integration test so that it has its own kibana process

* REINDEX_BLOCK_* action tests

* REINDEX_BLOCK_* model tests

* Include original error message when migration_state_machine throws

* Address some CR nits

* Fix TS errors

* Fix bugs

* Reindex then clone to prevent lost deletes

* Fix tests

Co-authored-by: Josh Dover <me@joshdover.com>
Co-authored-by: pgayvallet <pierre.gayvallet@elastic.co>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
2020-12-15 21:40:02 +01:00
Tyler Smalley 504c8739de
test:jest improvements to better support our monorepo (#84848)
Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
2020-12-14 14:07:50 -08:00
Patrick Mueller 1f774bb2e6
[task manager] provide warning when setting max_workers greater than limit (#85574)
resolves https://github.com/elastic/kibana/issues/56573

In this PR we create a new task manager limit on the config property
`xpack.task_manager.max_workers` of 100, but only log a deprecation
warning if that property exceeds the limit.  We'll enforce the limit
in 8.0.

The rationale is that it's unlikely to be useful to run with
more than some number of workers, due to the amount of simultaneous
work that would end up happening.  In practice, too many workers can
slow things down more than speed them up.

We're setting the limit to 100 for now, but may increase / decrease it
based on further research.
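The warn-but-don't-enforce behaviour might look like the sketch below. The limit of 100 and the setting name come from the description above; the function name, the message wording, and the `warn` callback are assumptions.

```typescript
const MAX_WORKERS_LIMIT = 100;

// Logs a deprecation warning when the configured value exceeds the limit,
// but returns the value unchanged: the limit is not enforced yet.
function checkMaxWorkers(
  maxWorkers: number,
  warn: (message: string) => void
): number {
  if (maxWorkers > MAX_WORKERS_LIMIT) {
    warn(
      `xpack.task_manager.max_workers (${maxWorkers}) is greater than ` +
        `${MAX_WORKERS_LIMIT}; values above the limit will not be supported in 8.0.`
    );
  }
  return maxWorkers;
}
```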
2020-12-14 16:38:05 -05:00
Tyler Smalley b593781009
Jest multi-project configuration (#77894)
Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
2020-12-02 11:42:23 -08:00
ymao1 cbc61afcce
[Task Manager] Skip removed task types when claiming tasks (#84273)
* Checking if task type is in registered list

* Loading esArchiver data with removed task type for testing

* PR fixes
2020-12-02 11:49:24 -05:00
Brandon Kobel 58297fa131
Deprecate xpack.task_manager.index setting (#84155)
* Deprecate `xpack.task_manager.index` setting

* Updating developer docs about configuring task manager settings
2020-11-25 06:36:24 -08:00
Mikhail Shustov 5ec6fe315f
[DX] Bump TS version to v4.1 (#83397)
* bump version to 4.1.1-rc

* fix code to run kbn bootstrap

* fix errors

* DO NOT MERGE. mute errors and ping teams to fix them

* Address EuiSelectableProps configuration in discover sidebar

* use explicit type for EuiSelectable

* update to ts v4.1.2

* fix ts error in EuiSelectable

* update docs

* update prettier with ts version support

* Revert "update prettier with ts version support"

This reverts commit 3de48db3ec.

* address another new problem

Co-authored-by: Chandler Prall <chandler.prall@gmail.com>
2020-11-24 16:04:33 +01:00
Mikhail Shustov 95861a0fb0
[DX] Prettier v2.2 (#83899)
* update prettier with ts version support

* mute type-error

* run prettier on codebase

* fix examples

* fix errors after master merged
2020-11-23 13:17:05 +01:00
Gidi Meir Morris 63cb5aee4e
ensure workload agg doesnt run until next interval when it fails (#83632)
Ensures the WorkloadAggregator doesn't retry immediately after errors, and instead retries on the next interval.
2020-11-20 09:23:08 +00:00
Gidi Meir Morris 3b0215c26b
[Task Manager] Ensures retries are inferred from the schedule of recurring tasks (#83682)
This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of https://github.com/elastic/kibana/issues/39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
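A minimal sketch of the rule, assuming intervals are strings such as `30s`, `5m`, `1h`, or `1d`. The helper names `intervalToMs` and `getRetryAt`, and the interval grammar, are assumptions for illustration.

```typescript
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Parses intervals such as "30s" or "5m" into milliseconds.
function intervalToMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (!match) {
    throw new Error(`Invalid interval: ${interval}`);
  }
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Recurring tasks retry on their next scheduled run; one-off tasks fall
// back to the task type's timeout.
function getRetryAt(
  startedAt: Date,
  timeout: string,
  schedule?: { interval: string }
): Date {
  const delayMs = schedule
    ? intervalToMs(schedule.interval)
    : intervalToMs(timeout);
  return new Date(startedAt.getTime() + delayMs);
}
```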
2020-11-19 14:37:28 +00:00
Dario Gieselaar afbf1a983a
[APM] Errors table for service overview (#83065) 2020-11-12 15:49:22 +01:00
Nathan L Smith bc2da67608
Move Elasticsearch type definitions out of APM (#83081)
...and into x-pack.

Also remove `PromiseReturnType` from APM and use the copy in observability everywhere.

All of the additional changes to APM imports are just automatic sorting.

This makes doing #77720 a little easier and removes some implicit circular dependencies for #80508.

Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co>
2020-11-11 16:23:34 -06:00
Gidi Meir Morris 51acfb9795
[Task Manager] Changed alerts schedule logic to use Task Manager internals (#80149)
* spiked intervals in alerts

* ensure scheduled tasks dont get wiped

* Fixed type checks and unit tests

* Added simple test, which only covers the successful case where the edit happened right after the task completed its previous execution

* fixed jest

* fallback to existing task schedule when possible

* added missing test

* Added support for day and hour schedule interval values

* added docs for new schedule run result

* fixed doc

* added UnrecoverableError support for task runners and plugged it into alerting where needed

* typo

Co-authored-by: Yuliia Naumenko <yuliia.naumenko@elastic.com>
2020-11-02 09:49:55 -08:00
Mike Côté 84b23b6d7c
Move task manager README.md to root of plugin (#82012)
* Move task manager README.md to root of plugin

* Fix failing test, update task manager plugin description in docs
2020-10-29 12:39:20 -04:00
Gidi Meir Morris 66d79ea2bf
Reactively disable Task Manager lifecycle when core services become unavailable (#81779)
Plugs the Task Manager polling lifecycle into the Kibana Services Status streams in order to ensure we reactively start and stop polling whenever the Elasticsearch or SavedObjects services switch between `available` and `unavailable`.

This will prevent Task Manager from polling whenever these services switch to an `unavailable` state.
2020-10-29 11:24:10 +00:00
Mikhail Shustov 2782204cc1
Get rid of global types (#81739)
* move global typings to packages/kbn-utility-types

* update all imports

* add tests

* mute error

* update docs

* ok

* rename kbn-utility-types/test --> kbn-utility-types/jest
2020-10-28 11:03:04 +01:00
Gidi Meir Morris 5dfa45d666
[Task Manager] adds basic observability into Task Manager's runtime operations (#77868)
This PR adds an internal monitoring mechanism in Task Manager which keeps track of a variety of metrics, and a health API endpoint which makes the monitored statistics accessible.
2020-10-27 15:58:04 +00:00
ymao1 e6ab812891
[Task Manager] Mark task as failed if maxAttempts has been met. (#80681)
* wip

* Adding updateFieldsAndMarkAsFailed function

* Updating UBQ

* Only updating retryAt if marking as claiming

* Updating query

* Updating query to only fail one time tasks that have exceeded max attempts

* Fixing tests

* Fixing tests

* Handling claiming tasks by id

* Removing unused function

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
2020-10-27 07:40:44 -04:00
Patrick Mueller 069e842b87
[task manager] do not sort tasks to be claimed by score if no pinned tasks (#80692)
resolves: https://github.com/elastic/kibana/issues/80371

Previously, when claiming tasks, we were always sorting the tasks to claim by
the score and then by the time they should be run.  We sort by score to
capture `runNow()` tasks, also referred to internally as "pinned" tasks
in the update by query.

The change in this PR is to only sort by score if there are pinned tasks, and
to not sort by score at all if there aren't any.
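The conditional sort can be sketched as below. The field name `task.runAt` and the `buildClaimSort` helper are assumptions; only the rule (sort by score only when there are pinned tasks) comes from the description above.

```typescript
type SortClause = '_score' | { [field: string]: 'asc' | 'desc' };

// Builds the sort for the claim query. Sorting by score is only useful
// when pinned (`runNow()`) tasks need to be floated to the top.
function buildClaimSort(pinnedTaskIds: string[]): SortClause[] {
  const byRunAt: SortClause = { 'task.runAt': 'asc' };
  return pinnedTaskIds.length > 0 ? ['_score', byRunAt] : [byRunAt];
}
```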
2020-10-22 11:09:56 -04:00
Gidi Meir Morris 5460ad741c
[Task Manager] Cleans up legacy plugin structure (#80381)
This PR addresses a list of legacy code debt the plugin has incurred over the past year due to extensive changes in its internals and the adoption of the Kibana Platform.

It includes:
1. The `TaskManager` class has been split into several independent components: `TaskTypeDictionary`,  `TaskPollingLifecycle`,  `TaskScheduling`,  `Middleware`. This has made it easier to understand the roles of the different parts and makes it easier to plug them into the observability work.
2. The exposed `mocks` have been corrected to correctly express the Kibana Platform api
3. The lifecycle has been corrected to remove the need for intermediary streams/promises, which were needed when we first introduced the `setup`/`start` lifecycle to support legacy.
4. The Logger mocks have been replaced with the platform's `coreMocks` implementation
5. The integration tests now test the plugin's actual public api (instead of the internals).
6. The Legacy Elasticsearch client has been replaced with the typed client in response to the deprecation notice.
7. Typing has been narrowed to prevent the `type` field from conflicting with the key in the `TaskDictionary`. This could have caused the displayed `type` on a task to differ from the `type` used in the Dictionary itself (this broke a test during refactoring and could have caused a bug in production code if left).
2020-10-20 13:00:13 +01:00
Mike Côté e0bb8605b4
Apply back pressure in Task Manager whenever Elasticsearch responds with a 429 (#75666)
* Make task manager maxWorkers and pollInterval observables (#75293)

* WIP step 1

* WIP step 2

* Cleanup

* Make maxWorkers an observable for the task pool

* Cleanup

* Fix test failures

* Use BehaviorSubject

* Add some tests

* Make the task manager store emit error events (#75679)

* Add errors$ observable to the task store

* Add unit tests

* Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (#77096)

* WIP

* Cleanup

* Add error count to message

* Reset observable values on stop

* Add comments

* Fix issues when changing configurations

* Cleanup code

* Cleanup pt2

* Some renames

* Fix typecheck

* Use observables to manage throughput

* Rename class

* Switch to createManagedConfiguration

* Add some comments

* Start unit tests

* Add logs

* Fix log level

* Attempt at adding integration tests

* Fix test failures

* Fix timer

* Revert "Fix timer"

This reverts commit 0817e5e6a5.

* Use Symbol

* Fix merge scan

* replace startsWith with a timer that is scheduled to 0

* typo

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Gidi Meir Morris <github@gidi.io>
2020-10-13 09:32:49 -04:00
Tyler Smalley 7211f78ce1
Bumps Jest related packages (#78720)
Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
2020-10-01 14:38:51 -07:00
Gidi Meir Morris 8f54c50363
filter invalid SOs from the search results in Task Manager (#76891)
Filters out invalid SOs from search results to prevent a never ending loop and spamming of logs in Task Manager.
2020-09-14 17:51:19 +01:00
Pierre Gayvallet eee139295d
Migrate data folder creation from legacy to KP (#75527)
* rename uuid service to environment service

* adapt resolve_uuid to directly use the configurations

* move data folder creation to core

* update generated doc

* fix types

* fix monitoring tests

* move instanceUuid to plugin initializer context

* update generated doc
2020-08-26 21:40:03 +02:00
Gidi Meir Morris 5308cc7100
[Task Manager] Monitors the Task Manager Poller and automatically recovers from failure (#75420)
Introduces a monitor around the Task Manager poller which pipes through all values emitted by the poller and recovers from poller failures or stalls.
This monitor does the following:
1. Catches the poller thrown errors and recovers by proxying the error to a handler and continues listening to the poller.
2. Reacts to the poller `error` (caused by uncaught errors) and `completion` events, by starting a new poller and piping its event through to any previous subscribers (in our case, Task Manager itself).
3. Tracks the rate at which the poller emits events (these can be both work events and `No Task` events, so polling and finding no work still counts as an emitted event), times out when the gap between events gets too long (suggesting the poller has hung), and replaces the poller with a new one.

We're not aware of any clear cases where Task Manager should actually get restarted by the monitor - this is definitely an error case and we have addressed all known cases.
The goal of introducing this monitor is as an insurance policy in case an unexpected error case breaks the poller in a long running production environment.
2020-08-20 21:26:56 +01:00
Gidi Meir Morris 773883f6a4
[Task Manager] time out work when it overruns in poller (#74980)
If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits forever and no longer polls for fresh work.
This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
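A common way to implement such a timeout is to race the work against a timer. The `withTimeout` helper below is a generic sketch of that technique, not the poller's actual code.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`work timed out after ${ms}ms`)),
      ms
    );
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

Note that the underlying work is not cancelled by the race; the caller simply stops waiting for it, which is enough to let a poller move on to its next cycle.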
2020-08-18 17:32:59 +01:00
Gidi Meir Morris fcb1a2848a
[Task Manager] Handles case where buffer receives multiple entities with the same ID (#74943)
Handles the case where two operations for the same entity make it into a single batched bulk operation, and avoids the clashing ID issue (which could cause the poller to hang and stop polling for work).
2020-08-17 13:19:04 +01:00
Gidi Meir Morris eb03295f85
[Task manager] Prevents edge case where already running tasks are rescheduled every polling interval (#74606)
Fixes flaky tests in Task Manager and Alerting.

The fix in #73244 was correct, but it missed an edge case which causes the already running task to be rescheduled over and over.

This prevents that edge case, which was affecting both TM in general and Alerting specifically.
2020-08-13 12:20:38 +01:00
Gidi Meir Morris 5c770e5930
[Task Manager] Correctly handle running tasks when calling RunNow and reduce flakiness in related tests (#73244)
This PR addresses two issues which caused several tests to be flaky in TM.

When `runNow` was introduced to TM we added a pinned query which returned specific tasks by ID.
This query does not have the filter applied to it, which causes tasks to be returned when they're already marked as `running`, but we didn't handle these correctly, which caused flakiness in the tests.
This didn't cause broken behaviour, but it did cause behaviour that was hard to reason about. We now handle them correctly.

It seems that sometimes, especially if the ES queue is overworked, it can take some time for the update to the underlying task to become visible (we deliberately don't use `refresh:true`), so we now wait for the index to refresh to make sure the task is updated in time for the next stage of the test.
2020-08-05 17:35:38 +01:00