Feat: improve workflow builtin actions #3105

didier-wenzek · 2024-09-04T08:58:45Z

This is an alternative PR to improve the way builtin operations are customized.

For the first attempt, see feat: workflow builtin actions honoring on_success handlers #3070
For the motivations for a second proposal, see feat: workflow builtin actions honoring on_success handlers #3070 (comment)
For a description of this proposal, see: allow users to add workflow states after a built-in executing action #3014 (comment)

Proposed changes

Improve the definition of operation workflows, so one can trigger a builtin operation the same way as for a sub operation. The aim is to make things easier to understand with less concepts, more consistent (while a single way to trigger an operation) and more flexible (ability to define specific state on success and on error).

[scheduled]
  operation = "builtin:software_update"
  on_exec = "executing"

[executing]
  action = "await_operation_completion"
  on_success = "postprocess"
  on_error = "rollback"

Introduce OperationAction::BuiltinOperation
Equip the BuiltIn action with on_success, on_error and on_exec handlers.
Let the user provides its own steps for the executing, successful and failed state.
Update the documentation - introducing builtin:<operation>
Make sure the deprecated builtin keyword can still be used for backward compatibility
Ensure a builtin:operation can only be invoked from the same operation
~~Let the user execute any builtin operation from any workflow~~
~~Ability to build the builtin operation input from the calling operation as for sub operations~~
~~Ability to inject excerpts of the builtin operation output into the calling operation as for sub operations~~

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#3014

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

codecov · 2024-09-04T09:09:58Z

Codecov Report

Attention: Patch coverage is 42.60870% with 132 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
crates/core/tedge_api/src/workflow/toml_config.rs	12.50%	61 Missing and 2 partials ⚠️
crates/core/tedge_api/src/workflow/mod.rs	38.98%	36 Missing ⚠️
.../core/tedge_agent/src/operation_workflows/actor.rs	56.25%	6 Missing and 8 partials ⚠️
crates/core/tedge_api/src/workflow/state.rs	59.37%	13 Missing ⚠️
crates/core/tedge_api/src/workflow/supervisor.rs	64.28%	4 Missing and 1 partial ⚠️
crates/core/tedge_api/src/workflow/script.rs	92.85%	1 Missing ⚠️

Additional details and impacted files

📢 Thoughts on this report? Let us know!

didier-wenzek · 2024-09-04T15:12:10Z

crates/core/tedge_agent/src/operation_workflows/actor.rs

-                } else {
-                    log_file.log_info("=> sub-operation not yet launched").await;


This log entry has been removed, because confusing when the sub-operation is actually a builtin operation.

didier-wenzek · 2024-09-04T15:16:18Z

crates/core/tedge_agent/src/operation_workflows/actor.rs

+            // As not finalized, the builtin state is sent back
+            // to the builtin operation actor for further processing.
+            let builtin_state = new_state.clone();
+            Ok(self.builtin_command_dispatcher.send(builtin_state).await?)


This can be remove in the future. Indeed, all the builtin-operation actors are currently dealing with 2 steps while things are actually done in a single step (the schedule state for restart and software update; the executing state for log_upload, config_snapshot and config_update).

The config actor still does it in two steps, where the tedgeUrl generation happens in the scheduled state and the rest in executing, although I can't remember why we chose to do it that way.

didier-wenzek · 2024-09-04T15:20:16Z

crates/core/tedge_agent/src/operation_workflows/tests.rs

-        .with_status(CommandStatus::Executing);
+        .with_status(CommandStatus::Successful);
    software_box.send(software_list_response.into()).await?;

    mqtt_box
        .assert_received([MqttMessage::new(
            &Topic::new_unchecked("te/device/main///cmd/software_list/1234"),
-            r#"{"status":"executing"}"#,
+            r#"{"status":"successful"}"#,


3 tests has been updated, watching for a successful status and no more an executing status.

Indeed, publishing the executing state is done by the workflow engine according to the workflow definition and no more by the internal actor.

didier-wenzek · 2024-09-04T15:22:40Z

crates/core/tedge_api/src/workflow/mod.rs

+    /// Trigger a built-in operation
+    ///
+    /// ```toml
+    /// operation = "<builtin:operation-name>"
+    /// on_exec = "<state>"
+    /// ```
+    BuiltInOperation(OperationName, ExecHandlers),


As a first step, I introduced a specific BuiltInOperation case along the existing Operation case.

I will consider to the merge the two.

We decided to go against this idea: a builtin operation can only be invoked from the same operation.

github-actions · 2024-09-04T15:33:07Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %	⏱️ Duration
505	0	2	505	100	1h33m19.753409999s

reubenmiller

@didier-wenzek I've played around with the proposal and it works well. I've been able to create a custom log_upload handler with a conditional pre-processing step and everything worked as expected.

So I'm happy with the direction of the PR.

docs/src/references/agent/operation-workflow.md

albinsuresh · 2024-09-06T10:39:11Z

docs/src/references/agent/operation-workflow.md

+    i.e. there is no `te/+/+/+/+/cmd/builtin:<operation-name>/+` command topic.
+  - The builtin versions are only internal triggered by the agent from a customized workflow. 
+- On a fresh install of %%te%%, the `<operation-name>` is simply a copy of `builtin:<operation-name>`.
+  - This copy is not materialized on the file system, but created in memory by the agent when no customized version is found.


Materializing them as done for devie_profile would be a good step for users to discover and extend those built-in operation workflows.

Good idea. I will push a follow up PR for that.

albinsuresh · 2024-09-06T10:40:51Z

docs/src/references/agent/operation-workflow.md

+  - By convention, this file is named after the operation name, as in `/etc/tedge/operations/software_update.toml`.
+    However, this is not mandatory: the operation is determined by the `operation` property, e.g. `operation = "software_update"`.


Unrelated to this PR, but I've always wondered why we didn't tie it to the workflow name, as that would have easily prevented conflicting workflow definitions as well.

Why not. In practice, this what we do. I will push a follow up PR to discuss this.

albinsuresh · 2024-09-06T10:44:00Z

docs/src/references/agent/operation-workflow.md

+- A user-provided workflow can invoke any builtin operations as well as the custom versions.
+  - `operation = "<operation-name>"` triggers the custom-version of the operation workflow.
+  - `operation = "builtin:<operation-name>"` triggers the operation ignoring any customization.


A user might wonder when/why they should use one over the other. So, we might wanna highlight that only the former can be used from workflows of other operations and the latter is limited to the workflow of the same type only.

This is now clearer with 6b37ae3

docs/src/references/agent/operation-workflow.md

albinsuresh

LGTM. The feature is working as expected. I just have one remark regarding the decision to deprecate the old builtin action syntax, which might still be simpler for the end-user.

crates/core/tedge_api/src/workflow/mod.rs

albinsuresh · 2024-09-10T12:06:08Z

crates/core/tedge_api/src/workflow/mod.rs

+    pub fn adapt_builtin_request(&self, command_state: GenericCommandState) -> GenericCommandState {
+        match self {
+            OperationAction::BuiltInOperation(_, _) => {
+                command_state.update(GenericStateUpdate::scheduled())


One might wonder why not GenericStateUpdate::init_payload() here, although at the impl level it makes no difference (since the builtin actors react to both init and scheduled messages the same way).

The status of the builtin request could be set to init, yes.

However, splitting hairs:

GenericStateUpdate::init_payload() cannot be used here as this is an empty payload that still needs to be populated with the actual user request.

All the builtin commands do no nothing on the init state, simply proceeding to executing.

To simplify this, we can refactor all the operation actors to have a single do-it action.

albinsuresh · 2024-09-10T12:18:38Z

crates/core/tedge_agent/src/operation_workflows/actor.rs

+            // As not finalized, the builtin state is sent back
+            // to the builtin operation actor for further processing.
+            let builtin_state = new_state.clone();
+            Ok(self.builtin_command_dispatcher.send(builtin_state).await?)


The config actor still does it in two steps, where the tedgeUrl generation happens in the scheduled state and the rest in executing, although I can't remember why we chose to do it that way.

albinsuresh · 2024-09-10T12:49:22Z

tests/RobotFramework/tests/cumulocity/log/workflow/log_upload.toml

+  on_success = "postprocess"      # on_success & on_error can be customized for builtin actions
+
+[postprocess]
+  script = "/usr/bin/log_upload.sh ${.payload.status}"


Although I understand that we want to deprecate action = "builtin" in favour of the newly proposed operation = "builtin:<operation_name>" style, seeing the simplicity of this syntax:

[executing] action = "builtin" on_success = "postprocess" # on_success & on_error can be customized for builtin actions [postprocess] script = "/usr/bin/log_upload.sh ${.payload.status}" on_success = "successful"

...makes me wonder if this is still a better choice, during this API exploration phase, as the new version contains the additional boilerplate (the await_operation state), as follows:

[executing] operation = "builtin:software_update" # trigger the built-in behavior for software update on_exec = "await_operation" [await_operation] action = "await-operation-completion" # awaiting the builtin operation to complete on_success = "postprocess" [postprocess] script = "/usr/bin/log_upload.sh ${.payload.status}" on_success = "successful"

From the user's perspective, the await_operation state addition doesn't add much value. And since we have plans to get rid of the await states in the future anyway, I'm wondering if it's better not to deprecate the old syntax to keep things simpler for now.

Since even the decision to overload the operation keyword for builtin operations might also get revisited while we introduce more granular builtin actions like download, file-copy etc, I'm wondering if we can keep the simpler action = "builtin" syntax itself for another release, and introduce the operation = "builtin:<operation_name>" style only once we're fully confident that it is the way to go?

For this PR, nothing changes except hiding the newly added operation syntax from the toml_config level and keeping the newly added await handlers for the old builtin action. With that, the user still gets the new capability to add post-processing steps to a builtin action with minimal config changes as well. WDYT?

I don't fully follow you.

action = "builtin" is not deprecated yet: this is still working, but discouraged in the documentation.

operation = "builtin:<operation_name> is really nicer from a user perspective. This is a personal point of view, but with some evidence: the documentation has been far simpler to write and there is less traps in using it.

We are now in a position to make implicit the action = "await-operation-completion", removing some burden to the user.

I see smaller-grain actions like download, file-copy ... as a nice idea we really have to work on. But, I don't see any those as full-fledged operations (with an MQTT topic, capabilities, retain messages, states ...). Hence, no risk of confusion.

What I meant was, since action = "builtin" is still supported (although to-be deprecated), there are currently two ways for the user to attach post-processing logic to builtin operation workflows:

The recommended approach with operation = "builtin:<operation> as follows:

[executing] operation = "builtin:log_upload" on_exec = "await_operation" [await_operation] action = "await-operation-completion" on_success = "postprocess" [postprocess] script = "/usr/bin/log_upload.sh ${.payload.status}" on_success = "successful"

The discouraged/legacy approach of action = "builtin", with the newly supported await handlers:

[executing] action = "builtin" on_success = "postprocess" [postprocess] script = "/usr/bin/log_upload.sh ${.payload.status}" on_success = "successful"

From the external user's perspective, the second one with just 2 states looks much simpler, as the additional await_operation state in the first approach is just boilerplate which adds no value to the user (unless I'm missing something here). The only additional thing that they get by using operation is the input and input_script support, which is also less useful here IMO, as there is little scope for input manipulation when the main operation and the builtin operation are of the same type (since the same input can be passed as-is).

Since both syntaxes are providing the same functionality and the latter being minimal, I was just wondering why even expose the new operation = "builtin:<operation> API, when the existing action = "builtin" API (with the newly added support for await actions) does the same job with less boilerplate. Or is there some additional benefit with the operation approach that I'm missing?

PS: Talking purely about the external facing TOML level API/syntax here.

From the external user's perspective, the second one with just 2 states looks much simpler, as the additional await_operation state in the first approach is just boilerplate which adds no value to the user (unless I'm missing something here).

Yes, you are missing something. 3 steps are already required in the second case. Indeed, the scheduled step must be invoked as well as the executing step.

The only additional thing that they get by using operation is the input and input_script support, which is also less useful here IMO, as there is little scope for input manipulation when the main operation and the builtin operation are of the same type (since the same input can be passed as-is).

This was the plan when starting this PR, but has been abandoned meantime. As a builtin:<operation> can only apply to the current operation there is no point to tune the input.

Or is there some additional benefit with the operation approach that I'm missing?

Using action = "builtin" is really error prone unless you stick to a strict template.

didier-wenzek · 2024-09-10T15:40:03Z

Oops, I have to rename this commit f27bf17

Bravo555

One question about the tests and small nits, overall LGTM.

Bravo555 · 2024-09-10T16:41:45Z

docs/src/references/agent/operation-workflow.md


 ```toml
-operation = "software_update"       # an operation for which tedge-agent provides an implementation
+["init"]


nit: unnecessary quotes

Bravo555 · 2024-09-10T17:01:08Z

crates/core/tedge_api/src/workflow/toml_config.rs

+            TomlOperationAction::Operation(operation) => match operation.strip_prefix("builtin:") {
+                None => {
+                    let handlers = TryInto::<ExecHandlers>::try_into((input.handlers, defaults))?;
+                    let input_script = input.input_script;


suggestion: reads shorter + jump to definition in IDE actually jumps to our try_from definition, instead of a stub try_into impl in stdlib

Suggested change

let input_script = input.input_script;

let handlers = ExecHandlers::try_from((input.handlers, defaults))?;

Indeed clearer!

Fixed by 7ad7db4

Bravo555 · 2024-09-10T17:01:50Z

crates/core/tedge_api/src/workflow/toml_config.rs

+                }
+                Some(builtin_operation_name) => {
+                    let handlers = TryInto::<ExecHandlers>::try_into((input.handlers, defaults))?;
+                    Ok(OperationAction::BuiltInOperation(


Suggested change

Ok(OperationAction::BuiltInOperation(

let handlers = ExecHandlers::try_from((input.handlers, defaults))?;

Fixed by 7ad7db4

Bravo555 · 2024-09-10T17:04:25Z

crates/core/tedge_api/src/workflow/toml_config.rs

@@ -173,6 +181,17 @@ impl TryFrom<(TomlOperationState, DefaultHandlers)> for OperationAction {
                        handlers, cmd_output,
                    ))
                }
+                "builtin" => {
+                    let exec_handlers = TryInto::<ExecHandlers>::try_into((
+                        input.handlers.clone(),


Suggested change

input.handlers.clone(),

let exec_handlers = ExecHandlers::try_from((

Fixed by 7ad7db4

Bravo555 · 2024-09-10T17:04:30Z

crates/core/tedge_api/src/workflow/toml_config.rs

+                        ExecHandlers::builtin_default(),
+                    ))?;
+                    let await_handlers = TryInto::<AwaitHandlers>::try_into((
+                        input.handlers,


Suggested change

input.handlers,

let await_handlers = AwaitHandlers::try_from((

Fixed by 7ad7db4

Bravo555 · 2024-09-10T17:13:56Z

tests/RobotFramework/tests/tedge_agent/workflows/custom_operation.robot

+    Should Contain      ${software_list[0]}    "tedge"
+    Should Contain      ${software_list[0]}    "postprocess"
+    Should Contain      ${software_list[0]}    "done"
+    Execute Command     tedge mqtt pub --retain te/device/main///cmd/software_list/robot-456 ''


question: The agent is supposed to clean-up the command, shouldn't we check that it does instead of manually cleaning in the test?

No the agent do nothing once the command is successful or failed. This is the requester who is supposed to clean-up the command, once processed the final outcome.

This commit introduces no changes to the user. This is only a preparation step to replace builtin actions, which usage was very restricted and specific, with builtin operations that can be used as any other operations. Signed-off-by: Didier Wenzek <[email protected]>

Signed-off-by: Didier Wenzek <[email protected]>

The new name is more appropriate, now that these handlers are used not only for background scripts but also to trigger a sub operation or a builtin action. Signed-off-by: Didier Wenzek <[email protected]>

Signed-off-by: Didier Wenzek <[email protected]>

didier-wenzek requested review from reubenmiller, Bravo555, albinsuresh, jarhodes314, rina23q, gligorisaev and a team as code owners September 4, 2024 08:58

didier-wenzek had a problem deploying to Test Pull Request September 4, 2024 08:58 — with GitHub Actions Failure

didier-wenzek marked this pull request as draft September 4, 2024 08:59

didier-wenzek had a problem deploying to Test Pull Request September 4, 2024 10:14 — with GitHub Actions Failure

didier-wenzek force-pushed the feat/improve-workflow-builtin-actions-v2 branch from 4b3f302 to b5f9fbe Compare September 4, 2024 15:10

didier-wenzek temporarily deployed to Test Pull Request September 4, 2024 15:10 — with GitHub Actions Inactive

didier-wenzek temporarily deployed to Test Auto September 4, 2024 15:15 — with GitHub Actions Inactive

didier-wenzek commented Sep 4, 2024

View reviewed changes

didier-wenzek marked this pull request as ready for review September 4, 2024 15:46

reubenmiller reviewed Sep 6, 2024

View reviewed changes

docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved