Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to disable a running DAG only after after it's in a finished state #22006

Open
1 of 2 tasks
SangwanP opened this issue Mar 4, 2022 · 28 comments
Open
1 of 2 tasks
Labels
area:Scheduler including HA (high availability) scheduler kind:feature Feature Requests

Comments

@SangwanP
Copy link
Contributor

SangwanP commented Mar 4, 2022

Description

I'd like the ability to "drain" a DAG from the UI. This would mean:

  • If a DAG is in a running state - disable the DAG after the run reaches a finished state (success/failed)
  • If a DAG is in an idle state - disable immediately (same as what disable does right now)

Use case/motivation

Currently, if you disable a DAG, it stops the DAG midway - the running tasks are completed, and queued tasks remain on the queue. I would like a way to disable the DAG, but if there is an active Dag Run it should continue running until it reaches a finished state.

For us, this would be especially helpful when upgrading our Airflow images/environments. To my understanding, Airflow doesn't have a failure-free zero-downtime upgrade (please point me to documentation if its possible). When upgrading, we don't want to fail/kill running DAGs, so we let them reach a finished state. Once none of the DAGs are in a running state, we perform the upgrades. Precisely timing the DAG disable operation such that a subsequent run doesn't start, is not a practical/viable solution.
For upgrades in the past, we have changed all the DAG schedules to None, which allows a running DAG to complete but doesn't start a subsequent run. Once upgrades are done, we update the DAG schedules back to their original value.
We would like to move away from this solution, and have some built-in functionality to deal with such a case.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@SangwanP SangwanP added the kind:feature Feature Requests label Mar 4, 2022
@potiuk
Copy link
Member

potiuk commented Mar 7, 2022

Good idea. Would you like to work on maybe ? this is the best way to make sure it happens :)

@potiuk potiuk added the area:UI Related to UI/UX. For Frontend Developers. label Mar 7, 2022
@bbovenzi bbovenzi added the area:webserver Webserver related Issues label Mar 7, 2022
@bbovenzi
Copy link
Contributor

bbovenzi commented Mar 7, 2022

I wonder if this should be added as a configuration for pausing/unpausing DAGs? Having a separate button for this could be confusing.

@eladkal
Copy link
Contributor

eladkal commented Mar 7, 2022

The use case as described is "Don't schedule new runs but let the current running dags to finish" - This sounds more like a scheduler feature rather than UI.

@potiuk
Copy link
Member

potiuk commented Mar 7, 2022

Very much so @eladkal

@bbovenzi bbovenzi added area:Scheduler including HA (high availability) scheduler and removed area:webserver Webserver related Issues area:UI Related to UI/UX. For Frontend Developers. labels Mar 7, 2022
@SangwanP
Copy link
Contributor Author

SangwanP commented Mar 8, 2022

As a user, I'd like to have functionality in the UI that allows me to either "disable and stop the DAG immediately", or "disable DAG once finished". How to implement such a feature in the UI, such that it's not confusing is of course up for discussion - button, switch, etc.
The underlying functionality to perform the "drain" will likely be need to be developed in the scheduler. So, this feature I think, falls under both scheduler, and UI development areas.

Also, @potiuk - I can try to pick this task up in a couple of weeks. Let's leave it unassigned for now, and I can let you know once I am ready to pick it up.

@potiuk
Copy link
Member

potiuk commented Mar 8, 2022

Sure!

@eladkal
Copy link
Contributor

eladkal commented Mar 8, 2022

I think this involves some changes to the DB schema as probably it means that is_paused is no longer just bool of True / False but actually it has several states.

@bbovenzi
Copy link
Contributor

bbovenzi commented Mar 8, 2022

I think this involves some changes to the DB schema as probably it means that is_paused is no longer just bool of True / False but actually it has several states.

This is why I am partial to it being a config for the scheduler because then we can't represent is_paused as an on/off switch.

@potiuk
Copy link
Member

potiuk commented Mar 8, 2022

I think this involves some changes to the DB schema as probably it means that is_paused is no longer just bool of True / False but actually it has several states.

Maybe a chance to solve the "Activated" "Paused" conundrum: #14459

@potiuk
Copy link
Member

potiuk commented Mar 8, 2022

image

@pateash
Copy link
Contributor

pateash commented Mar 20, 2024

I am happy to check this out @eladkal @potiuk ,

OPTION(1):
changing "is_paused" from boolean to string( paused/unpaused/drain ),
it could be a good idea in to add an additional column is_paused_state in DAG, which will allow users to keep using is_paused, but both can't be used together. ( Something in these lines )

OPTION(2):
changing "is_paused" from boolean to string( paused/unpaused/drain ) could be a good idea in DAGModel,
without changing it for the user in DAG() object.

Happy to hear out other suggestions.

@pateash
Copy link
Contributor

pateash commented Mar 20, 2024

Exploring Option(2)
paused: states ( PAUSED, DRAIN) ~ not UNPAUSED
unpaused: state( UNPAUSED)

@pateash
Copy link
Contributor

pateash commented Mar 20, 2024

Can We change schema of DAG(),
is_paused should be having same datatype in both DagModel and Dag,
changing it from boolean to string will also change openAPI spec and so on.

cc. @potiuk @eladkal @bbovenzi

@bbovenzi
Copy link
Contributor

bbovenzi commented Mar 20, 2024

Instead of messing with the state, I wonder if we could have an advanced version of the dag pause action. That will pause a dag after the end of the run.

We already have a toggle in the manual DAG trigger to unpause as part of triggering a run.
And we already have a pause after so many failures option: #36935

@pateash
Copy link
Contributor

pateash commented Apr 27, 2024

thanks @bbovenzi,
I have found a way to still complete the DAGRun, even if the dag is paused.

Do we want it to be default behaviour or we should parameterize it with an argument in dag like 'drain_on_pause=True/False'?
cc. @eladkal @potiuk @SangwanP

@eladkal
Copy link
Contributor

eladkal commented Apr 27, 2024

We need to be backward compatible but this functionality feel less about the dag code. I don't think it should be set on the code level... Pause/un-pause are action you take on the dag regardless of it's code.

@pateash
Copy link
Contributor

pateash commented Apr 27, 2024

yes @eladkal ,
the only difference will be that,
earlier if we pause a dag and there are some queued/scheduled tasks they will remain in that state forever,
after this change all of the tasks of a dag will be completed. ( ofcourse no further dagruns will be scheduled )

But, I thinks its better if we only do this when users wants it (when setting drain_on_pause=True ) ,
there might be some users who are relying on current behaviour right now.

Technical Details

  • There is a query in schedule_job_runner.py where we only execute tasks for the dags which are un_paused, and I am proposing removing the condition
image

@eladkal
Copy link
Contributor

eladkal commented Apr 28, 2024

I am OK with having some code settings that can help limit the pausing options on specific dag (should DAG author wants it) but as a general concept I don't think it should be government by code (DAG author code).

Today you can pause/unpause at will. We should have the same thing with the new option. Effectively this means that the UI toggle should be changed from 2 mode (paused, unpaused) to 3 mode (paused, unpaused, drain tasks -> paused) where drain eventually is switched to paused when the drain is completeled.

WDYT?

@pateash
Copy link
Contributor

pateash commented Apr 28, 2024

I am OK with having some code settings that can help limit the pausing options on specific dag (should DAG author wants it) but as a general concept I don't think it should be government by code (DAG author code).

Today you can pause/unpause at will. We should have the same thing with the new option. Effectively this means that the UI toggle should be changed from 2 mode (paused, unpaused) to 3 mode (paused, unpaused, drain tasks -> paused) where drain eventually is switched to paused when the drain is completeled.

WDYT?

OPTION 1-

We can change type of paused from Boolean to String, but this will also change behaviour of is_paused_upon_creation,
although we might add is_paused_with_drain_upon_creation as well.

OPTION 2 -

Adding another check or UI element and bind with drain_on_pause, so users can handle it from UI as well.
and keeping the paused to boolean with on and off switch.

Any thoughts?

@eladkal
Copy link
Contributor

eladkal commented May 1, 2024

I prefer option 1.
We need to remember that pausing DAG can also be done from API/CLI and not just the UI.
We can simply deprecate is_paused_upon_creation and replace it with new parameter to represent all possible options.
similar to what we did with K8s delete pod option #30718 where we deprecated boolean option is_delete_operator_pod(boolean) in favor of on_finish_action (Enum)

@jscheffl
Copy link
Contributor

jscheffl commented May 1, 2024

I feel linke when going into details of database, DAG model, REST API, CLI... UI Experience this is a very complex thing. I agree that this option is useful.
On the other hand scheduling logic today is already quite complex with a lot of queries. We should not risk draining performance with too many options.
So taking this as a "valid use case" compared to the complexity added (also with risk of regression in scheduler performance) I'd rather opt for a global option. If the on/of must be separately evaluated per DAG, some queries in Scheduler will get really complex.

I'd propose to have a global config which changes the behavior. Either the on/of is just draining DAGs (means: Do not start new DAG runs, still schedule all tasks irrespective of DAG state) or stop scheduling tasks and DAGs.

I don't see a strong use case that in a single setup both use cases have a hard requirement to be supported in parallel.

@eladkal
Copy link
Contributor

eladkal commented May 1, 2024

@jscheffl The way I imagine it is that the drain status is temporary (for the duration of the drain). once drain is finished it should be switched to pause. So at any given time the amount of tasks that are in this unique state should be relatively small

@SangwanP
Copy link
Contributor Author

SangwanP commented May 1, 2024

I have an alternative solution; someone with a deeper knowledge of the code base can evaluate it:
If one is able to 'disable' the schedule on a DAG, the 'drain' functionality can be achieved without having to make changes to the is_paused field.
There's already a configuration to globally disable schedules (AIRFLOW__SCHEDULER__USE_JOB_SCHEDULE), it'd be nice if this can be extended to be configurable on a per DAG basis. This field can then be set from the UI (how to do that can be discussed).

@shubham22
Copy link

I agree with @jscheffl here that this use case doesn't warrant adding a new state with paused/unpaused/drain for DAGs as it will complicate and slow down scheduling (and rescheduling) logic. We could do this as UI-only feature, but I am personally not convinced that we need to handle this edge case. If we really really need to, I think it is better to provide single and consistent behaviour for pausing DAGs, which considers drainage, unlike today.

If one is able to 'disable' the schedule on a DAG, the 'drain' functionality can be achieved without having to make changes to the is_paused field.

@SangwanP - the use case that you started with was primarily focused on version upgrades and in that scenario, you only need this at the environment level, not at a DAG level. Moreover, in my experience, if you're doing an in-place version upgrade, any tasks that are scheduled, but not yet finished will be run again once version is upgraded. I don't think it is required for you to set all schedules to none and re-set them.

@pateash
Copy link
Contributor

pateash commented May 5, 2024

@eladkal what do we think?

  1. I think as suggested by @shubham22,
    we can make this as consistent( allowing drainage by default )
  2. Or by adding an argument to DAG() to set this on dag level - which will make sure it doesn't affect existing users
  3. additionally, by also adding an ENV variable to enable this by default.

@eladkal
Copy link
Contributor

eladkal commented May 6, 2024

We could do this as UI-only feature

I'll refer to my earlier comment

The use case as described is "Don't schedule new runs but let the current running dags to finish" - This sounds more like a scheduler feature rather than UI.

#22006 (comment)


I am very happy to proceed with what was suggested. We can always improve further according to users feedback

@pateash
Copy link
Contributor

pateash commented May 31, 2024

I am looking into this flow,
I have the backend changes figured out, but I am not able to create 2 popups.
I might need help from a UI wizard.
cc. @eladkal @bbovenzi

image

@pateash
Copy link
Contributor

pateash commented May 31, 2024

Also @eladkal please assign this to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler kind:feature Feature Requests
Projects
None yet
Development

No branches or pull requests

7 participants