
[AIRFLOW-5218] less polling for AWS Batch status #5825

Merged: kaxil merged 1 commit into apache:master from dazza-codes:patch-1 on Aug 23, 2019

Conversation

@dazza-codes (Contributor) commented on Aug 15, 2019

Jira

  • https://issues.apache.org/jira/browse/AIRFLOW-5218

Description

  • Here are some details about my PR (no UI changes):
    • a small increase in the backoff factor avoids excessive polling
    • fewer, jittered polls help stay under the AWS API throttle limits for highly concurrent tasks
    • correct polling-loop details
    • correct log messages, with more informative status details
    • a sketch of the backoff approach follows this list
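A minimal sketch of the backoff-with-jitter polling approach, assuming boto3; the function name poll_job_status, the quadratic backoff factor, and the jitter range are illustrative, not this PR's actual code:

```python
import random
import time

import boto3


def poll_job_status(job_id: str, max_retries: int = 10) -> str:
    """Poll an AWS Batch job until it reaches a terminal state.

    Illustrative sketch: the backoff factor and jitter range here only
    approximate the behavior this PR describes, not its exact code.
    """
    client = boto3.client("batch")
    for tries in range(max_retries):
        response = client.describe_jobs(jobs=[job_id])
        status = response["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            return status
        if tries < max_retries - 1:
            # Growing backoff plus jitter spreads concurrent pollers out,
            # keeping DescribeJobs calls under the AWS API throttle limits.
            time.sleep((1 + tries) ** 2 + random.uniform(0, 3))
    raise RuntimeError(f"AWS Batch job {job_id} did not reach a terminal state")
```

With many tasks polling at once, the jitter alone keeps their DescribeJobs calls from landing in lock-step.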

See also:

  • AWS batch spin-up time:
    • https://forums.aws.amazon.com/thread.jspa?messageID=897734
    • Depends on various scheduling intervals:
      • job scheduling interval
      • compute-environment scaling schedule interval
      • API calls to container registry (and container pulls)
    • bottom line: spin-up times are inconsistent, anywhere from ~10 seconds to ~10 minutes

Here are some quotes from the forum thread linked above (the comments span roughly 2017 through 2019, and improvements were released over that period). AWS Batch support says:

For example, if you submit a hundred jobs to AWS Batch, the Scheduler will transition all of these from SUBMITTED to RUNNABLE or PENDING in about a minute. RUNNABLE jobs should transition to STARTING and RUNNING fairly quickly assuming you have sufficient resources in your compute environment.

AWS Batch support notes that improvements are released over time:

For example, when we launched the service, we would schedule jobs every 1 minute. We now perform these operations every 10 seconds. It is however important to differentiate that this delay is seen only if there are no Submitted jobs in the JobQueue. Batch continues to transition and schedule jobs with no delay until the JobQueue has no Submitted or Runnable jobs. Thus, AWS Batch may take longer to schedule an individual job when Jobs are submitted infrequently. At scale, Batch can schedule jobs far more quickly.

But when hundreds of jobs require the compute environment to scale, the delays can be longer:

Further, it is important to note that the AWS Batch resource scaling decisions occur on a different frequency. Upon receiving your first job submission, AWS Batch will launch an initial set of compute resources. After this point Batch re-evaluates resource needs approximately every 10 minutes. By making scaling decisions less frequently, we avoid scenarios where AWS Batch would scale up too quickly and complete all RUNNABLE jobs, leaving a large number of unused instances with partially consumed billing hours.

Tests

  • My PR does not need testing for this extremely good reason:
    • there are tests on the AWS BatchOperator already
    • this PR is an implementation detail in a private method
    • the change does not impact any public API

Commits

  • My commits all reference Jira issues in their subject lines
    • just one commit

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • no new functionality
    • logs are more explicit than before

Code Quality

  • Passes flake8

@dazza-codes force-pushed the patch-1 branch 2 times, most recently from 88d032c to 6f06246, on Aug 15, 2019 04:42
@mik-laj added the provider:amazon-aws (AWS/Amazon - related issues) label on Aug 15, 2019
@dazza-codes force-pushed the patch-1 branch 2 times, most recently from 6945fad to dd75d37, on Aug 15, 2019 18:02
@dazza-codes (Contributor, Author) commented:

Most of the CI builds are OK, but one of them timed out and I don't have access to restart it.

Review comments on airflow/contrib/operators/awsbatch_operator.py (outdated, resolved)
@dazza-codes (Contributor, Author) commented:

Review requests addressed

  • leaving one conversation to @ashb to consider/resolve
  • considered applying black formatting, but skipped it (too many changes)

https://issues.apache.org/jira/browse/AIRFLOW-5218
- avoid the AWS API throttle limits for highly concurrent tasks
- a small increase in the backoff factor could avoid excessive polling
- random sleep before polling to allow the batch task to spin-up
  - the random sleep helps to avoid API throttling
- revise the retry logic slightly to avoid unnecessary pause
  when there are no more retries required
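As a hypothetical illustration of the last two commit bullets (the callable check and its signature are assumptions, not the operator's API), here is a retry loop with a randomized initial sleep that skips the backoff pause once no retries remain:

```python
import random
import time


def run_with_retries(check, max_retries: int = 5) -> bool:
    """Sketch of the commit's retry-loop tweak; illustrative only."""
    # Randomized initial sleep: gives the Batch job time to spin up and
    # staggers highly concurrent tasks so their first polls do not hit
    # the AWS API at the same moment.
    time.sleep(random.uniform(1, 10))
    for tries in range(max_retries):
        if check():
            return True
        if tries < max_retries - 1:
            # Pause only when another attempt will follow, so exhausting
            # the retries does not incur one final, useless backoff sleep.
            time.sleep((1 + tries) ** 2 + random.uniform(0, 3))
    return False


# Example: a status check that succeeds on the third poll.
attempts = iter([False, False, True])
print(run_with_retries(lambda: next(attempts)))  # True
```

Skipping the final sleep matters because a job that exhausts its retries should fail immediately rather than pause for one more full backoff interval first.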
@kaxil kaxil merged commit fc972fb into apache:master Aug 23, 2019
@dazza-codes dazza-codes deleted the patch-1 branch August 23, 2019 17:08
Jerryguo pushed a commit to Jerryguo/airflow that referenced this pull request Sep 2, 2019
ashb pushed a commit that referenced this pull request Oct 11, 2019
(cherry picked from commit fc972fb)
adityav pushed a commit to adityav/airflow that referenced this pull request Oct 14, 2019
(cherry picked from commit fc972fb)
Labels: provider:amazon-aws (AWS/Amazon - related issues)

5 participants