[Core feature] Workflow level throttling #5125

Open
a4501150 opened this issue Mar 27, 2024 · 12 comments
Labels: backlogged (For internal use. Reserved for contributor team workflow.), enhancement (New feature or request)

Comments


a4501150 commented Mar 27, 2024

Motivation: Why do you think this is important?

Currently, Flyte does not support workflow-level throttling that limits how many executions of the same workflow can run concurrently under a Flyte project / domain ("same workflow" means the same binary, identified by workflow name).

In our use case, we use Flyte for both workflow scheduling and orchestration. One of our workflows hits our backend API heavily, and we want to limit the resource pressure on that API. Although we could implement throttling in the backend API itself, the API is also used synchronously, and we do not want to affect the current synchronous API performance and behavior, so implementing throttling on the workflow side is the best choice.

Compared to other workflow engines, this is a major feature that Flyte lacks.

Goal: What should the final outcome look like, ideally?

For instance, when we trigger a workflow by sending a CreateExecution request with a launch plan (preferred), or when we register a workflow, we should be able to configure an extra parameter, e.g. max_concurrent_execution of type long, which limits how many executions of that workflow can run at the same time.

For example, let's say we have a workflow A with a task A. We create a launch plan for this workflow with max_concurrent_execution set to 5 (a sketch of what this could look like follows the list below).

  • We trigger concurrent executions by sending 100 CreateExecution requests with the same launch plan to the Flyte gRPC API
  • The first 5 workflow executions should start running
  • The remaining 95 executions should be put in a queue in the Flyte backend, in status QUEUED
  • The ultimate goal is that at most max_concurrent_execution executions of the workflow are in status RUNNING at any given time
  • The execution ID should always be returned
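
Purely as an illustration of the shape such a knob could take, here is a minimal Python sketch; max_concurrent_execution does not exist in Flyte today, and the class and field names below are invented for this example:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchPlanThrottlingSpec:
    """Hypothetical addition to the launch plan (or CreateExecution) spec."""

    # Maximum number of executions of this launch plan allowed in RUNNING at once.
    # None would mean unlimited, i.e. today's behavior.
    max_concurrent_execution: Optional[int] = None


# The launch plan for workflow A described above: at most 5 concurrent executions.
workflow_a_throttling = LaunchPlanThrottlingSpec(max_concurrent_execution=5)
```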

Describe alternatives you've considered

We are currently using a hacky workaround: a resource-lock task that achieves this functionality:

  • A new task runs a for loop with Thread.sleep() inside it, and in each iteration queries the Flyte endpoint for the top max_concurrent_execution running executions sorted by execution creation time.
  • If the count of returned running executions is >= max_concurrent_execution, and the execution ID obtained by the task via the FLYTE_INTERNAL_EXECUTION_ID environment variable is not one of the returned top X executions, then the task just keeps waiting in the for loop with Thread.sleep().

This effectively limits the concurrent executions of a given launch plan (workflow) by making the throttling task the first task in the workflow DAG.

However, this creates many semi-dangling resource-lock tasks when the number of concurrent execution requests is large and max_concurrent_execution is relatively small.
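
For reference, a simplified Python sketch of this workaround, assuming flytekit's FlyteRemote is used for the query; the project/domain values are placeholders, attribute names such as is_done and closure.created_at may differ between flytekit versions, and in practice the query would also be filtered to executions of the same launch plan:

```python
import os
import time

from flytekit import task
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

MAX_CONCURRENT_EXECUTION = 5  # the same knob as in the feature request
POLL_INTERVAL_SECONDS = 30


@task
def resource_lock() -> None:
    """First task in the DAG: block until this execution is one of the
    MAX_CONCURRENT_EXECUTION oldest executions that are still running."""
    my_execution_id = os.environ["FLYTE_INTERNAL_EXECUTION_ID"]
    remote = FlyteRemote(Config.auto(), default_project="my-project", default_domain="development")

    while True:
        # Fetch recent executions and keep only the ones that are still running.
        executions = remote.recent_executions(project="my-project", domain="development", limit=100)
        running = [e for e in executions if not e.is_done]
        running.sort(key=lambda e: e.closure.created_at)  # oldest first

        allowed = {e.id.name for e in running[:MAX_CONCURRENT_EXECUTION]}
        if my_execution_id in allowed:
            return  # our slot is free: let the rest of the workflow proceed
        time.sleep(POLL_INTERVAL_SECONDS)
```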

Another alternative solution

Task-level throttling would allow even finer-grained control; however, this may be harder to implement than workflow-level throttling.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
a4501150 added the enhancement (New feature or request) and untriaged (This issue has not yet been looked at by the Maintainers) labels Mar 27, 2024

welcome bot commented Mar 27, 2024

Thank you for opening your first issue here! 🛠


dosubot bot commented Mar 27, 2024

Hey there, @a4501150! Great to see you bringing up another interesting point for discussion. How have you been?

Thank you for your detailed feature request. Workflow level throttling is indeed a valuable feature that can help manage resource utilization and prevent overloading of backend APIs.

Based on the current Flyte architecture, it is feasible to implement this feature by extending the ExecutionManager and related components. Here's a high-level approach:

  1. Track Active Executions: Extend the ExecutionManager to include a mechanism for tracking the number of currently active executions for each project and domain. This could involve maintaining a counter or a more complex data structure within the ExecutionManager or leveraging an external datastore.

  2. Check Limits Before Execution: Modify the execution creation process to check the current number of active executions against a predefined limit before allowing a new execution to start. If the limit is reached, the request to start a new execution could be rejected or queued until the number of active executions decreases.

  3. Configuration for Limits: Introduce configuration options within the FlyteAdmin configuration to define the limits for concurrent executions at the project and domain levels. These limits could be specified globally, per project, or per domain.

  4. Update Metrics: Utilize the existing metrics infrastructure to monitor and report on the number of active executions and any throttling events. This would be valuable for administrators to understand the system's behavior and adjust limits as necessary.

  5. User Feedback: Enhance the API responses to include information about throttling, such as rejection messages or estimated wait times, to provide feedback to users when their executions are being limited.

The current state management in Flyte, as represented in the NodeExecution and NodeExecutionClosure structures, can be leveraged to monitor and control the concurrency of executions. Additional logic can be introduced in the state transition functions or by introducing new functions and states that specifically handle throttling conditions. For example, a new state like QUEUED could be introduced for executions that are ready to run but are being held due to throttling constraints.
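
FlyteAdmin is written in Go, so the snippet below is only a language-agnostic illustration (in Python) of the gating logic described in steps 1–2 and the QUEUED state above; all names and data structures are invented for this sketch and are not part of the existing codebase:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Phase(Enum):
    QUEUED = "QUEUED"    # accepted, but held back by the concurrency limit
    RUNNING = "RUNNING"


@dataclass
class ThrottleState:
    limit: int                                      # configured max concurrent executions
    running: int = 0                                # executions currently in RUNNING
    queue: List[str] = field(default_factory=list)  # execution IDs waiting for a slot


def create_execution(states: Dict[str, ThrottleState], launch_plan: str, execution_id: str) -> Phase:
    """Gate applied when a CreateExecution request arrives: the request is always
    accepted (and the execution ID returned), but the execution only starts if the
    launch plan is below its concurrency limit."""
    state = states[launch_plan]
    if state.running < state.limit:
        state.running += 1
        return Phase.RUNNING
    state.queue.append(execution_id)
    return Phase.QUEUED


def on_execution_finished(states: Dict[str, ThrottleState], launch_plan: str) -> List[str]:
    """Called when an execution terminates: free the slot and promote queued executions."""
    state = states[launch_plan]
    state.running -= 1
    promoted: List[str] = []
    while state.queue and state.running < state.limit:
        promoted.append(state.queue.pop(0))
        state.running += 1
    return promoted
```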

Please note that this is a high-level approach and the actual implementation would require a detailed design and thorough testing to ensure it meets the requirements and does not introduce any regressions.

For more details on the current execution management in Flyte, you can refer to the ExecutionManager code in the Flyte repository.

I hope this provides a good starting point for implementing workflow level throttling in Flyte. If you have any further questions or need more details, feel free to ask.


eapolinario self-assigned this Mar 28, 2024
Contributor

kumare3 commented Mar 29, 2024

I think this is related to #267.

Contributor

kumare3 commented Mar 29, 2024

Also, I think this is a great feature to add. Would Spotify folks be willing to collaborate on this?

In case concurrency limits are specified, the execution is created in a Planned state and has a description field that explains why it is not running right now.
Then Admin has a reconciliation loop that looks at all Planned workflows to ensure they meet the constraint before launching them.
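
A rough Python sketch of that reconciliation loop, for illustration only; FlyteAdmin is written in Go, and the callables below are stand-ins for whatever the real implementation would use (database queries, the launch path, and the configured limit):

```python
import time
from typing import Callable, Iterable


def reconcile_planned_executions(
    list_planned: Callable[[], Iterable[str]],  # execution IDs in Planned state, oldest first
    count_running: Callable[[str], int],        # running executions for that execution's launch plan
    limit_for: Callable[[str], int],            # configured max_concurrent_execution for that launch plan
    launch: Callable[[str], None],              # transition Planned -> Running
    interval_seconds: float = 10.0,
) -> None:
    """Periodically scan Planned executions and launch the ones whose launch plan
    is still below its concurrency limit."""
    while True:
        for execution_id in list_planned():
            if count_running(execution_id) < limit_for(execution_id):
                launch(execution_id)
        time.sleep(interval_seconds)
```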

@a4501150
Author

Hey @kumare3, thanks for the reply! I think we will be able to collaborate on this; it needs some coordination with the team for planning.

In case concurrency limits are specified, the execution is created in a Planned state and has a description field that explains why it is not running right now.
Then Admin has a reconciliation loop that looks at all Planned workflows to ensure they meet the constraint before launching them.

This sounds like a great starter solution!

eapolinario added the backlogged (For internal use. Reserved for contributor team workflow.) label and removed the untriaged (This issue has not yet been looked at by the Maintainers) label Apr 4, 2024
@sshardool
Contributor

Wondering if this is fixed or if someone is looking into it, @a4501150?
If not, we can contribute this.

Contributor

kumare3 commented Jun 1, 2024

I have not seen any contributions. @a4501150 / @RRap0so, I think Spotify wants to work on this?

Author

a4501150 commented Jun 1, 2024

Hey, sorry for the late reply, @sshardool @kumare3. I don't think Spotify has the capacity to work on this at the moment. It would be great if @sshardool could help with this.

Cc @andresgomezfrr @RRap0so to double-confirm, so the work doesn't get duplicated.

@sshardool
Contributor

Thanks, folks. We are evaluating this and should be able to actively engage on it in the next couple of weeks.

@davidmirror-ops
Contributor

@sshardool do you know if this is in progress? A few other users are hitting the same limitation. Thanks!

@sshardool
Contributor

Hello @davidmirror-ops - unfortunately we have not been able to take this up yet, and it will likely take us some time to get to it. If someone else is willing to contribute, I would be happy to review; otherwise, I will update here once we start work on this.

Contributor

kumare3 commented Sep 11, 2024

This is something we are thinking of working on with the LinkedIn team @sshardool
