Introduce RESTARTING state #16681

Merged
merged 8 commits on Jul 31, 2021

Conversation

yuqian90
Contributor

@yuqian90 yuqian90 commented Jun 27, 2021

closes: #16680

This PR makes sure that when a user clears a running task, the task does not fail. Instead it is killed and retried gracefully.

This is done by introducing a new State called RESTARTING. As the name suggests, a TaskInstance is set to this state when it is cleared while running. Most places handle RESTARTING the same way SHUTDOWN is handled, except in TaskInstance.is_eligible_to_retry, where a RESTARTING task is always treated as eligible for retry.
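
As a rough illustration of the intended behaviour, here is a simplified sketch (the signature and body are made up for illustration, not the actual diff; it assumes the new State.RESTARTING value): clearing a running TaskInstance moves it to RESTARTING instead of failing it, while cleared non-running tasks are reset as before.

    # Simplified sketch only, not the actual clear_task_instances() in this PR.
    from airflow.utils.state import State

    def clear_task_instances(tis, session):
        """Reset cleared task instances; running ones go to RESTARTING."""
        for ti in tis:
            if ti.state == State.RUNNING:
                # The task is still running: ask it to stop and let the
                # scheduler retry it, instead of marking it failed.
                ti.state = State.RESTARTING
            else:
                # Non-running tasks are simply reset to no state.
                ti.state = State.NONE
        session.flush()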

@boring-cyborg boring-cyborg bot added the area:Scheduler, area:UI, and area:webserver labels Jun 27, 2021
@ashb
Member

ashb commented Jun 28, 2021

Is a new state strictly required for this? Each state we add needs to be shown in the UI and increases the number of places where we need to handle it.

airflow/models/taskinstance.py (review comment: outdated, resolved)
@yuqian90
Contributor Author

Is a new state strictly required for this? Each state we add needs to be shown in the UI and increases the number of places where we need to handle it.

I think another way to implement this is to make clear_task_instances increase the max_tries of the running TaskInstance by 1. However, when I tried to do that I realized it's not that easy. There are more places that need to be changed to do the right thing. One example is is_eligible_to_retry(). It first checks if self.task.retries is zero. If it is, it short-circuits and returns False. But self.task.retries isn't something that can be incremented by clear_task_instances because it's not stored in the db.

    def is_eligible_to_retry(self):
        """Is task instance is eligible for retry"""
        return self.task.retries and self.try_number <= self.max_tries
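
With a dedicated state, the retry check can instead special-case RESTARTING before the retries short-circuit. Roughly along these lines (a sketch of the idea, not necessarily the exact code that ends up in this PR; State here is airflow.utils.state.State):

    def is_eligible_to_retry(self):
        """Is task instance eligible for retry"""
        if self.state == State.RESTARTING:
            # A task cleared while it was running is always eligible for
            # retry, even if the task was defined with retries == 0.
            return True
        return self.task.retries and self.try_number <= self.max_tries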

@yuqian90 yuqian90 changed the title from "Introduce CLEARED_WHEN_RUNNING state" to "Introduce RESTARTING state" Jun 29, 2021
@yuqian90
Contributor Author

yuqian90 commented Jul 1, 2021

Ready for review. The failing "Helm Chart; KubernetesExecutor (pull_request)" check looks unrelated to this PR.

@yuqian90 yuqian90 requested a review from ashb July 1, 2021 13:29
@potiuk
Member

potiuk commented Jul 1, 2021

Ready for review. The failing "Helm Chart; KubernetesExecutor (pull_request)" check looks unrelated to this PR.

Yep hopefully already fixed by #16750

@uranusjr

This comment has been minimized.

@uranusjr uranusjr closed this Jul 2, 2021
@uranusjr uranusjr reopened this Jul 2, 2021
@yuqian90 yuqian90 added this to the Airflow 2.2 milestone Jul 6, 2021
airflow/models/dagrun.py (review comment: outdated, resolved)
Member

@ashb ashb left a comment

Code and name look good now, though this is probably worth documenting somewhere so users know about this behaviour.

(And you'll have to resolve conflicts obviously.)

@yuqian90
Contributor Author

yuqian90 commented Jul 7, 2021

Code and name look good now, though this is probably worth documenting somewhere so users know about this behaviour.

(And you'll have to resolve conflicts obviously.)

Thanks. That's done. I added a paragraph in UPDATING.md about this new behaviour. Please see if that helps.

@ashb
Member

ashb commented Jul 7, 2021

I was also thinking of somewhere in the published docs tree.

Ah https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#task-instances

That diagram needs updating :)

@yuqian90
Contributor Author

Thanks everyone for the suggestions in #6762. @kaxil @potiuk @mik-laj @kaverisharma09 I re-created the diagram with draw.io and here's a draft. Some modifications I made:

  • Added "restarting" state introduced in Introduce RESTARTING state #16681
  • Named the states according to their definitions in state.py
  • Added a choice node to represent the alternative state transition
  • Moved the standard lifecycle to the top of the graph. I.e. "none, to scheduled, to queued, to running, and finally to success" is illustrated at the top of the diagram.

Please let me know what you think. I'll include the diagram in my PR and upload to https://cwiki.apache.org/confluence/display/AIRFLOW/File+lists if you think it works.

[task_states diagram]
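
For reference, the new value itself would presumably sit next to the existing ones in airflow/utils/state.py, roughly like this (a sketch; the exact comments and ordering may differ):

    # Sketch of airflow/utils/state.py with the new value added.
    class State:
        ...
        RUNNING = "running"
        SHUTDOWN = "shutdown"      # External request to shut down the task
        RESTARTING = "restarting"  # Task cleared while running; kill and retry it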

@kaxil
Member

kaxil commented Jul 14, 2021

Thanks everyone for the suggestions in #6762. @kaxil @potiuk @mik-laj @kaverisharma09 I re-created the diagram with draw.io and here's a draft. Some modifications I made:

  • Added "restarting" state introduced in Introduce RESTARTING state #16681
  • Named the states according to their definitions in state.py
  • Added a choice node to represent the alternative state transition
  • Moved the standard lifecycle to the top of the graph. I.e. "none, to scheduled, to queued, to running, and finally to success" is illustrated at the top of the diagram.

Please let me know what you think. I'll include the diagram in my PR and upload to https://cwiki.apache.org/confluence/display/AIRFLOW/File+lists if you think it works.

[task_states diagram]

Sounds good

@potiuk
Member

potiuk commented Jul 14, 2021

Looks good to me :)

@yuqian90 yuqian90 force-pushed the clear_running branch 2 times, most recently from 65f929e to 7f2d2e0 on July 20, 2021 06:08
@potiuk
Member

potiuk commented Jul 26, 2021

Hey @ashb - any more comments? I think I would love to merge that one :)

@github-actions github-actions bot added the "full tests needed" label Jul 26, 2021
@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk
Member

potiuk commented Jul 26, 2021

Hey @yuqian90, can you please rebase? We had some master issues recently, but they should be solved already :)

airflow/utils/state.py (review comment: outdated, resolved)
UPDATING.md (review comment: outdated, resolved)
Development

Successfully merging this pull request may close these issues.

Handle "cleared when running" more gracefully