[Task Manager] Release tasks when encountering a validation failure after run #159964

mikecote · 2023-06-19T16:40:53Z

We should add some logic to Task Manager whenever a task finishes running but returns an invalid state. With #159048, Task Manager will fail to update the task after a run and not put it back into the queue by re-using logic from #159302. We should at least log the validation error and put the task back into the queue with the previous valid state.

elasticmachine · 2023-06-19T16:40:57Z

Pinging @elastic/response-ops (Team:ResponseOps)

…ate objects (#159048) Part of #155764. In this PR, I'm modifying task manager to allow task types to report a versioned schema for the `state` object. When defining `stateSchemaByVersion`, the following will happen: - The `state` returned from the task runner will get validated against the latest version and throw an error if ever it is invalid (to capture mismatches at development and testing time) - When task manager reads a task, it will migrate the task state to the latest version (if necessary) and validate against the latest schema, dropping any unknown fields (in the scenario of a downgrade). By default, Task Manager will validate the state on write once a versioned schema is provided, however the following config must be enabled for errors to be thrown on read: `xpack.task_manager.allow_reading_invalid_state: true`. We plan to enable this in serverless by default but be cautious on existing deployments and wait for telemetry to show no issues. I've onboarded the `alerts_invalidate_api_keys` task type which can be used as an example to onboard others. See [this commit](214bae3). ### How to configure a task type to version and validate The structure is defined as: ``` taskManager.registerTaskDefinitions({ ... stateSchemaByVersion: { 1: { // All existing tasks don't have a version so will get `up` migrated to 1 up: (state: Record<string, unknown>) => ({ runs: state.runs || 0, total_invalidated: state.total_invalidated || 0, }), schema: schema.object({ runs: schema.number(), total_invalidated: schema.number(), }), }, }, ... }); ``` However, look at [this commit](214bae3) for an example that you can leverage type safety from the schema. ### Follow up issues - Onboard non-alerting task types to have a versioned state schema (#159342) - Onboard alerting task types to have a versioned state schema for the framework fields (#159343) - Onboard alerting task types to have a versioned rule and alert state schema within the task state (#159344) - Telemetry on the validation failures (#159345) - Remove feature flag so `allow_reading_invalid_state` is always `false` (#159346) - Force validation on all tasks using state by removing the exemption code (#159347) - Release tasks when encountering a validation failure after run (#159964) ### To Verify NOTE: I have the following verification scenarios in a jest integration test as well => https://github.com/elastic/kibana/pull/159048/files#diff-5f06228df58fa74d5a0f2722c30f1f4bee2ee9df7a14e0700b9aa9bc3864a858. You will need to log the state when the task runs to observe what the task runner receives in different scenarios. ``` diff --git a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts index 1e624bcd807..4aa4c2c7805 100644 --- a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts +++ b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts @@ -140,6 +140,7 @@ function taskRunner( ) { return ({ taskInstance }: RunContext) => { const state = taskInstance.state as LatestTaskStateSchema; + console.log('*** Running task with the following state:', JSON.stringify(state)); return { async run() { let totalInvalidated = 0; ``` #### Scenario 1: Adding an unknown field to the task saved-object gets dropped 1. Startup a fresh Kibana instance 2. Make the following call to Elasticsearch (I used postman). This call adds an unknown property (`foo`) to the task state and makes the task run right away. ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z", "state": "{\"runs\":1,\"total_invalidated\":0,\"foo\":true}" } } } ``` 3. Observe the task run log message, with state not containing `foo`. #### Scenario 2: Task running returning an unknown property causes the task to fail to update 1. Apply the following changes to the code (and ignore TypeScript issues) ``` diff --git a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts index 1e624bcd807..b15d4a4f478 100644 --- a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts +++ b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts @@ -183,6 +183,7 @@ function taskRunner( const updatedState: LatestTaskStateSchema = { runs: (state.runs || 0) + 1, + foo: true, total_invalidated: totalInvalidated, }; return { ``` 2. Make the task run right away by calling Elasticsearch with the following ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z" } } } ``` 3. Notice the validation errors logged as debug ``` [ERROR][plugins.taskManager] Task alerts_invalidate_api_keys "Alerts-alerts_invalidate_api_keys" failed: Error: [foo]: definition for this key is missing ``` #### Scenario 3: Task state gets migrated 1. Apply the following code change ``` diff --git a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts index 1e624bcd807..338f21bed5b 100644 --- a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts +++ b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts @@ -41,6 +41,18 @@ const stateSchemaByVersion = { total_invalidated: schema.number(), }), }, + 2: { + up: (state: Record<string, unknown>) => ({ + runs: state.runs, + total_invalidated: state.total_invalidated, + foo: true, + }), + schema: schema.object({ + runs: schema.number(), + total_invalidated: schema.number(), + foo: schema.boolean(), + }), + }, }; const latestSchema = stateSchemaByVersion[1].schema; ``` 2. Make the task run right away by calling Elasticsearch with the following ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z" } } } ``` 3. Observe the state now contains `foo` property when the task runs. #### Scenario 4: Reading invalid state causes debug logs 1. Run the following request to Elasticsearch ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z", "state": "{}" } } } ``` 2. Observe the Kibana debug log mentioning the validation failure while letting the task through ``` [DEBUG][plugins.taskManager] [alerts_invalidate_api_keys][Alerts-alerts_invalidate_api_keys] Failed to validate the task's state. Allowing read operation to proceed because allow_reading_invalid_state is true. Error: [runs]: expected value of type [number] but got [undefined] ``` #### Scenario 5: Reading invalid state when setting `allow_reading_invalid_state: false` causes tasks to fail to run 1. Set `xpack.task_manager.allow_reading_invalid_state: false` in your kibana.yml settings 2. Run the following request to Elasticsearch ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z", "state": "{}" } } } ``` 3. Observe the Kibana error log mentioning the validation failure ``` [ERROR][plugins.taskManager] Failed to poll for work: Error: [runs]: expected value of type [number] but got [undefined] ``` NOTE: While corrupting the task directly is rare, we plan to re-queue the tasks that failed to read, leveraging work from #159302 in a future PR (hence why the yml config is enabled by default, allowing invalid reads). --------- Co-authored-by: kibanamachine <[email protected]> Co-authored-by: Ying Mao <[email protected]>

…ate objects (elastic#159048) Part of elastic#155764. In this PR, I'm modifying task manager to allow task types to report a versioned schema for the `state` object. When defining `stateSchemaByVersion`, the following will happen: - The `state` returned from the task runner will get validated against the latest version and throw an error if ever it is invalid (to capture mismatches at development and testing time) - When task manager reads a task, it will migrate the task state to the latest version (if necessary) and validate against the latest schema, dropping any unknown fields (in the scenario of a downgrade). By default, Task Manager will validate the state on write once a versioned schema is provided, however the following config must be enabled for errors to be thrown on read: `xpack.task_manager.allow_reading_invalid_state: true`. We plan to enable this in serverless by default but be cautious on existing deployments and wait for telemetry to show no issues. I've onboarded the `alerts_invalidate_api_keys` task type which can be used as an example to onboard others. See [this commit](elastic@214bae3). ### How to configure a task type to version and validate The structure is defined as: ``` taskManager.registerTaskDefinitions({ ... stateSchemaByVersion: { 1: { // All existing tasks don't have a version so will get `up` migrated to 1 up: (state: Record<string, unknown>) => ({ runs: state.runs || 0, total_invalidated: state.total_invalidated || 0, }), schema: schema.object({ runs: schema.number(), total_invalidated: schema.number(), }), }, }, ... }); ``` However, look at [this commit](elastic@214bae3) for an example that you can leverage type safety from the schema. ### Follow up issues - Onboard non-alerting task types to have a versioned state schema (elastic#159342) - Onboard alerting task types to have a versioned state schema for the framework fields (elastic#159343) - Onboard alerting task types to have a versioned rule and alert state schema within the task state (elastic#159344) - Telemetry on the validation failures (elastic#159345) - Remove feature flag so `allow_reading_invalid_state` is always `false` (elastic#159346) - Force validation on all tasks using state by removing the exemption code (elastic#159347) - Release tasks when encountering a validation failure after run (elastic#159964) ### To Verify NOTE: I have the following verification scenarios in a jest integration test as well => https://github.com/elastic/kibana/pull/159048/files#diff-5f06228df58fa74d5a0f2722c30f1f4bee2ee9df7a14e0700b9aa9bc3864a858. You will need to log the state when the task runs to observe what the task runner receives in different scenarios. ``` diff --git a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts index 1e624bcd807..4aa4c2c7805 100644 --- a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts +++ b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts @@ -140,6 +140,7 @@ function taskRunner( ) { return ({ taskInstance }: RunContext) => { const state = taskInstance.state as LatestTaskStateSchema; + console.log('*** Running task with the following state:', JSON.stringify(state)); return { async run() { let totalInvalidated = 0; ``` #### Scenario 1: Adding an unknown field to the task saved-object gets dropped 1. Startup a fresh Kibana instance 2. Make the following call to Elasticsearch (I used postman). This call adds an unknown property (`foo`) to the task state and makes the task run right away. ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z", "state": "{\"runs\":1,\"total_invalidated\":0,\"foo\":true}" } } } ``` 3. Observe the task run log message, with state not containing `foo`. #### Scenario 2: Task running returning an unknown property causes the task to fail to update 1. Apply the following changes to the code (and ignore TypeScript issues) ``` diff --git a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts index 1e624bcd807..b15d4a4f478 100644 --- a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts +++ b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts @@ -183,6 +183,7 @@ function taskRunner( const updatedState: LatestTaskStateSchema = { runs: (state.runs || 0) + 1, + foo: true, total_invalidated: totalInvalidated, }; return { ``` 2. Make the task run right away by calling Elasticsearch with the following ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z" } } } ``` 3. Notice the validation errors logged as debug ``` [ERROR][plugins.taskManager] Task alerts_invalidate_api_keys "Alerts-alerts_invalidate_api_keys" failed: Error: [foo]: definition for this key is missing ``` #### Scenario 3: Task state gets migrated 1. Apply the following code change ``` diff --git a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts index 1e624bcd807..338f21bed5b 100644 --- a/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts +++ b/x-pack/plugins/alerting/server/invalidate_pending_api_keys/task.ts @@ -41,6 +41,18 @@ const stateSchemaByVersion = { total_invalidated: schema.number(), }), }, + 2: { + up: (state: Record<string, unknown>) => ({ + runs: state.runs, + total_invalidated: state.total_invalidated, + foo: true, + }), + schema: schema.object({ + runs: schema.number(), + total_invalidated: schema.number(), + foo: schema.boolean(), + }), + }, }; const latestSchema = stateSchemaByVersion[1].schema; ``` 2. Make the task run right away by calling Elasticsearch with the following ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z" } } } ``` 3. Observe the state now contains `foo` property when the task runs. #### Scenario 4: Reading invalid state causes debug logs 1. Run the following request to Elasticsearch ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z", "state": "{}" } } } ``` 2. Observe the Kibana debug log mentioning the validation failure while letting the task through ``` [DEBUG][plugins.taskManager] [alerts_invalidate_api_keys][Alerts-alerts_invalidate_api_keys] Failed to validate the task's state. Allowing read operation to proceed because allow_reading_invalid_state is true. Error: [runs]: expected value of type [number] but got [undefined] ``` #### Scenario 5: Reading invalid state when setting `allow_reading_invalid_state: false` causes tasks to fail to run 1. Set `xpack.task_manager.allow_reading_invalid_state: false` in your kibana.yml settings 2. Run the following request to Elasticsearch ``` POST http://kibana_system:changeme@localhost:9200/.kibana_task_manager/_update/task:Alerts-alerts_invalidate_api_keys { "doc": { "task": { "runAt": "2023-06-08T00:00:00.000Z", "state": "{}" } } } ``` 3. Observe the Kibana error log mentioning the validation failure ``` [ERROR][plugins.taskManager] Failed to poll for work: Error: [runs]: expected value of type [number] but got [undefined] ``` NOTE: While corrupting the task directly is rare, we plan to re-queue the tasks that failed to read, leveraging work from elastic#159302 in a future PR (hence why the yml config is enabled by default, allowing invalid reads). --------- Co-authored-by: kibanamachine <[email protected]> Co-authored-by: Ying Mao <[email protected]> (cherry picked from commit 40c2afd)

mikecote · 2023-08-14T12:28:19Z

Did a test of this and the task gets re-queued no problem. Closing..

mikecote added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jun 19, 2023

mikecote self-assigned this Jun 19, 2023

mikecote mentioned this issue Jun 19, 2023

Add support to Task Manager for validating and versioning the task state objects #159048

Merged

mikecote mentioned this issue Jul 27, 2023

Version task-manager task state #155764

Open

mikecote closed this as completed Aug 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Task Manager] Release tasks when encountering a validation failure after run #159964

[Task Manager] Release tasks when encountering a validation failure after run #159964

mikecote commented Jun 19, 2023

elasticmachine commented Jun 19, 2023

mikecote commented Aug 14, 2023

[Task Manager] Release tasks when encountering a validation failure after run #159964

[Task Manager] Release tasks when encountering a validation failure after run #159964

Comments

mikecote commented Jun 19, 2023

elasticmachine commented Jun 19, 2023

mikecote commented Aug 14, 2023