
execution_timeout seems to be ignored in EcsRunTaskOperator when using deferrable #32580

Open
tim-x-y-z opened this issue Jul 13, 2023 · 8 comments
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues

Comments

@tim-x-y-z

tim-x-y-z commented Jul 13, 2023

Apache Airflow version

2.6.3

What happened

It seems that with the new AWS provider package, when using the deferrable keyword in the EcsRunTaskOperator, the execution_timeout is ignored and the task is killed by another timeout: the trigger timeout, which appears to be timeout=timedelta(seconds=self.waiter_max_attempts * self.waiter_delay + 60).

Also, it seems that when the trigger hits that timeout, the task returns "success" even though it hasn't finished.
It doesn't seem to kill the ECS task either.
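To make the timeout math concrete, here is a minimal sketch of the two deadlines involved (the values are illustrative, not necessarily the provider defaults in every version):

```python
from datetime import timedelta

# Illustrative waiter settings (the defaults discussed later in this
# thread are 100 attempts with a 6-second delay).
waiter_delay = 6           # seconds between ECS status polls
waiter_max_attempts = 100  # polls before the waiter gives up

# The trigger-side deadline quoted above: derived purely from the
# waiter settings, independent of execution_timeout.
trigger_timeout = timedelta(seconds=waiter_max_attempts * waiter_delay + 60)

execution_timeout = timedelta(hours=2)  # what the user set on the operator

# The deferred task ends at whichever deadline fires first, so a long
# execution_timeout is silently capped at ~11 minutes here.
print(trigger_timeout, execution_timeout)  # 0:11:00 2:00:00
```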

What you think should happen instead

The execution_timeout should be used as the trigger timeout, or at least a warning should be emitted if that timeout is overridden or is smaller.

How to reproduce

Run an EcsRunTaskOperator task in deferrable mode with a large execution_timeout and a small waiter_max_attempts. The task terminates when the trigger times out, before the execution_timeout is up.
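A minimal reproduction sketch, assuming an existing ECS cluster and task definition (the DAG id, cluster, and task_definition are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG("ecs_timeout_repro", start_date=datetime(2023, 7, 1), schedule=None):
    # Large execution_timeout, small waiter budget: the trigger's own
    # deadline (waiter_max_attempts * waiter_delay + 60s, i.e. ~90s here)
    # fires long before the 2-hour execution_timeout.
    run_task = EcsRunTaskOperator(
        task_id="long_running_task",
        cluster="my-cluster",           # placeholder
        task_definition="my-task-def",  # placeholder
        launch_type="FARGATE",
        overrides={},
        deferrable=True,
        execution_timeout=timedelta(hours=2),
        waiter_delay=6,
        waiter_max_attempts=5,
    )
```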

Operating System

Linux (Ubuntu)

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@tim-x-y-z tim-x-y-z added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jul 13, 2023
@nathadfield nathadfield added provider:amazon-aws AWS/Amazon - related issues area:providers and removed area:core labels Jul 13, 2023
@nathadfield
Collaborator

#31881 was the PR that introduced this. @vandonr-amz are you able to determine whether this is indeed a bug?

@vandonr-amz
Contributor

Hmm, that's interesting, because execution_timeout is a property of the base operator (i.e. of all operators) and is usually handled "automatically", but I don't think that plays well with deferrable...
Some operators have included it manually in the defer() call, with varying logic; others don't consider it at all.
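(For illustration, a hedged sketch of that manual pattern, not the actual code of any particular operator: BaseOperator.defer() accepts an optional timeout, which the triggerer uses to fail the task once it elapses.)

```python
from airflow.models.baseoperator import BaseOperator


class MyDeferrableOperator(BaseOperator):
    """Sketch: forward the operator-level execution_timeout into defer()."""

    def execute(self, context):
        trigger = ...  # placeholder: build the operator's trigger here
        self.defer(
            trigger=trigger,
            method_name="execute_complete",
            # May be None, in which case only the trigger's own waiter
            # settings bound how long the deferral can last.
            timeout=self.execution_timeout,
        )

    def execute_complete(self, context, event=None):
        return event
```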

I'm also confused about what the expected behavior is in this report. Setting a large execution_timeout is not going to make the trigger run longer than waiter_max_attempts allows.

I actually see 3 bugs being reported:

  • tasks are successful when triggers time out (probably not what we want?)
  • we hit the trigger timeout even though the waiter should end on max_attempts by itself 60 seconds earlier?
  • execution_timeout is not handled properly on defer

vandonr-amz added a commit to aws-mwaa/upstream-to-airflow that referenced this issue Jul 13, 2023
… of failing

reported by user in issue apache#32580
the issue is about something else, but the user mentioned this as a "bonus bug"
@vandonr-amz
Contributor

After testing it, I can confirm that there was a bug where success was returned after max_attempts had been reached.

However, setting a short execution_timeout works well and returns an error even when the task is in the triggerer.

Setting a long execution_timeout works well too, but it doesn't "override" the max attempts. And I don't think it should, except maybe when max_attempts is not specified but execution_timeout is? But that's complex behavior, and I'm not sure it'd be a good idea.

Maybe another way to see this is that the defaults set for the ECS operator are too short, and we should increase them so that fewer people run into timeouts "out of the box".
We currently allow 100 pings with a 6-second delay by default, so any task taking more than 10 minutes needs to override those params.
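For longer-running tasks, the workaround today is to widen the waiter explicitly; a sketch with placeholder values:

```python
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Default budget discussed above: 100 attempts * 6s = ~10 minutes.
# Widening the waiter raises the trigger-side deadline accordingly.
run_task = EcsRunTaskOperator(
    task_id="long_ecs_task",
    cluster="my-cluster",           # placeholder
    task_definition="my-task-def",  # placeholder
    overrides={},
    deferrable=True,
    waiter_delay=30,                # poll ECS every 30 seconds
    waiter_max_attempts=240,        # 240 * 30s = 2 hours
)
```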

@vincbeck vincbeck removed the needs-triage label for new issues that we didn't triage yet label Jul 18, 2023
@github-actions

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 18, 2023
@tim-x-y-z
Author

tim-x-y-z commented Aug 18, 2023

It seems PR #32589 has fixed the task incorrectly returning success; however, it doesn't seem to have solved the "killing of the task" part.
If the execution_timeout is exceeded, the ECS task doesn't seem to be killed. I would expect that if the task fails, the ECS task gets killed.

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 18, 2023
@vandonr-amz
Contributor

Ah OK, I missed the "It seems this doesn't kill the task either" line. Several issues in one here!
Yeah, it's a known issue that the on_kill of operators doesn't get called in deferrable mode :/
Maybe it'd be worth opening a new, more general issue about this. We could try to fix it in every operator by piecing code together to replicate the expected behavior, but that's a suboptimal solution.
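(For context, a hedged sketch of the kind of remote cleanup on_kill would normally perform for an ECS task; the identifiers are placeholders, and the point of the bug is that this hook is never invoked while the task is deferred.)

```python
import boto3


class EcsTaskCleanup:
    """Sketch of the cleanup an operator's on_kill would perform."""

    def __init__(self, cluster: str, task_arn: str, region_name: str = "us-east-1"):
        self.cluster = cluster    # placeholder cluster name
        self.task_arn = task_arn  # ARN captured when the ECS task started
        self.client = boto3.client("ecs", region_name=region_name)

    def on_kill(self) -> None:
        # StopTask sends SIGTERM to the containers, then SIGKILL after
        # the task's configured stop timeout.
        self.client.stop_task(
            cluster=self.cluster,
            task=self.task_arn,
            reason="Task killed by Airflow",
        )
```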

@vandonr-amz
Contributor

Using the search, it seems like it was reported a looong time ago in #19929 (though bundled with another issue at the same time).

@thesuperzapper
Contributor

Here is the generic issue for the fact that on_kill is not supported when using deferrable operators:

There are also some workarounds described in that issue for provider/operator developers before it's fully fixed.
