
EmrContainerOperator in Async mode doesn't respect default "infinite" polling number #40483

Closed · akomisarek opened this issue Jun 28, 2024 · 3 comments · Fixed by #41008
Labels
area:providers, good first issue, kind:bug, provider:amazon-aws

Comments

@akomisarek

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon[aiobotocore]==8.24.0

Apache Airflow version

2.7.3

Operating System

"Debian GNU/Linux 11 (bullseye)"

Deployment

Official Apache Airflow Helm Chart

Deployment details

Deployment to EKS

What happened

An EMR on EKS job timed out unexpectedly with the following error when EmrContainerOperator was used in deferred mode:

  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/utils/waiter_with_logging.py", line 133, in async_wait
    raise AirflowException("Waiter error: max attempts reached")
airflow.exceptions.AirflowException: Waiter error: max attempts reached

even though no max_attempts value was provided.

What you think should happen instead

The job should be polled until it reaches a FAILED or SUCCESSFUL state.

How to reproduce

Trigger a long-running job (over 5 hours) using EmrContainerOperator in async/deferred mode, for example with a DAG like the sketch below.
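
For reference, a minimal DAG along these lines should trigger the behaviour (a sketch only: the virtual cluster ID, execution role ARN, and S3 entry point are placeholders, and the job just needs to run longer than roughly 5 hours):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrContainerOperator

    # Placeholder Spark job; anything that runs for 5+ hours will do.
    JOB_DRIVER = {
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/long_running_job.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=2",
        }
    }

    with DAG(
        dag_id="emr_eks_long_running_job",
        start_date=datetime(2024, 6, 1),
        schedule=None,
        catchup=False,
    ):
        # max_polling_attempts is left at its default (None), so according to the
        # docs the operator should poll until the job leaves the pending,
        # submitted, or running state.
        EmrContainerOperator(
            task_id="run_long_job",
            name="long-running-job",
            virtual_cluster_id="abcdefghij1234567890",  # placeholder
            execution_role_arn="arn:aws:iam::111122223333:role/emr-eks-job-role",  # placeholder
            release_label="emr-6.15.0-latest",
            job_driver=JOB_DRIVER,
            deferrable=True,
        )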

Anything else

I believe it's caused by the defaults defined here:

waiter_delay: int = 30,
waiter_max_attempts: int = 600,
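
With those defaults the async waiter gives up after 600 attempts × 30 s ≈ 5 hours, which matches the roughly five-hour timeout observed above. As a stopgap, explicitly raising the limit on the operator should avoid the premature failure; this is only a sketch, and it assumes the operator forwards an explicit max_polling_attempts to the trigger's waiter_max_attempts:

    # Fragment extending the reproduction sketch above (not a full task definition).
    EmrContainerOperator(
        task_id="run_long_job",
        # ... same required arguments as in the reproduction sketch ...
        poll_interval=60,            # seconds between waiter attempts
        max_polling_attempts=1440,   # 1440 * 60 s = 24 hours before giving up
        deferrable=True,
    )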

This contradicts the documentation: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/emr/index.html#airflow.providers.amazon.aws.operators.emr.EmrContainerOperator

which states:

max_polling_attempts (int | None) – Maximum number of times to wait for the job run to finish. Defaults to None, which will poll until the job is not in a pending, submitted, or running state.

That does not seem to be the case, hence this issue.
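
One possible direction for a fix (a sketch only, not necessarily what #41008 actually does) would be for the deferred path to translate the documented default of None into an effectively unbounded number of waiter attempts, for example:

    import sys

    def resolve_waiter_max_attempts(max_polling_attempts: int | None) -> int:
        """Hypothetical helper: map the documented default of None to an
        effectively unbounded wait instead of the trigger's hard-coded 600."""
        return max_polling_attempts if max_polling_attempts is not None else sys.maxsize

    # With the default of None, the waiter would no longer stop after 600 attempts.
    print(resolve_waiter_max_attempts(None))  # sys.maxsize
    print(resolve_waiter_max_attempts(600))   # 600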

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@akomisarek added the area:providers, kind:bug, and needs-triage labels on Jun 28, 2024
@eladkal added the provider:amazon-aws and good first issue labels and removed the needs-triage label on Jun 28, 2024
@STAR-173

Hey @akomisarek, are you working on this issue, or is it open for a PR?

@akomisarek
Author

Hi @STAR-173, no, I haven't started working on this yet and would only be able to pick it up on Tuesday/Wednesday, so if you can work on it earlier, feel free to pick it up. Thanks! :)

@akomisarek
Author

Sorry, I should probably add a follow-up comment: for the time being I have dropped the idea of the async operator, as I also hit this problem:
#36090

So for now I have moved back to the sync approach and noticed it actually works better for me, since it prints out the logs, which is often useful. So I won't be contributing for the time being :(
