
[AIRFLOW-5218] less polling for AWS Batch status #5825

Merged: kaxil merged 1 commit into apache:master from dazza-codes:patch-1 on Aug 23, 2019

Conversation

@dazza-codes (Contributor) commented on Aug 15, 2019

Jira

  • https://issues.apache.org/jira/browse/AIRFLOW-5218

Description

  • Here are some details about my PR (no UI changes):
    • a small increase in the backoff factor avoids excessive polling
    • fewer, jittered polls help stay under the AWS API throttle limits for highly concurrent tasks
    • correct polling-loop details
    • correct log messages, with more informative status details
    • a sketch of the backoff approach follows this list
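A minimal sketch of the backoff-with-jitter polling approach, assuming boto3; the function name poll_job_status, the quadratic backoff factor, and the jitter range are illustrative, not this PR's actual code:

```python
import random
import time

import boto3


def poll_job_status(job_id: str, max_retries: int = 10) -> str:
    """Poll an AWS Batch job until it reaches a terminal state.

    Illustrative sketch: the backoff factor and jitter range here only
    approximate the behavior this PR describes, not its exact code.
    """
    client = boto3.client("batch")
    for tries in range(max_retries):
        response = client.describe_jobs(jobs=[job_id])
        status = response["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            return status
        if tries < max_retries - 1:
            # Growing backoff plus jitter spreads concurrent pollers out,
            # keeping DescribeJobs calls under the AWS API throttle limits.
            time.sleep((1 + tries) ** 2 + random.uniform(0, 3))
    raise RuntimeError(f"AWS Batch job {job_id} did not reach a terminal state")
```

With many tasks polling at once, the jitter alone keeps their DescribeJobs calls from landing in lock-step.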

See also:

  • AWS batch spin-up time:
    • https://forums.aws.amazon.com/thread.jspa?messageID=897734
    • Depends on various scheduling intervals:
      • job scheduling interval
      • compute-environment scaling schedule interval
      • API calls to container registry (and container pulls)
    • bottom line: spin-up times are inconsistent, anywhere from ~10 seconds to ~10 minutes

Here are some quotes from the forum thread linked above (the comments span roughly 2017 through 2019, and improvements were released over that period). AWS Batch support says:

For example, if you submit a hundred jobs to AWS Batch, the Scheduler will transition all of these from SUBMITTED to RUNNABLE or PENDING in about a minute. RUNNABLE jobs should transition to STARTING and RUNNING fairly quickly assuming you have sufficient resources in your compute environment.

AWS Batch support notes that improvements are released over time:

For example, when we launched the service, we would schedule jobs every 1 minute. We now perform these operations every 10 seconds. It is however important to differentiate that this delay is seen only if there are no Submitted jobs in the JobQueue. Batch continues to transition and schedule jobs with no delay until the JobQueue has no Submitted or Runnable jobs. Thus, AWS Batch may take longer to schedule an individual job when Jobs are submitted infrequently. At scale, Batch can schedule jobs far more quickly.

But when hundreds of jobs require the compute environment to scale, the delays can be longer:

Further, it is important to note that the AWS Batch resource scaling decisions occur on a different frequency. Upon receiving your first job submission, AWS Batch will launch an initial set of compute resources. After this point Batch re-evaluates resource needs approximately every 10 minutes. By making scaling decisions less frequently, we avoid scenarios where AWS Batch would scale up too quickly and complete all RUNNABLE jobs, leaving a large number of unused instances with partially consumed billing hours.

Tests

  • My PR does not need testing for this extremely good reason:
    • there are tests on the AWS BatchOperator already
    • this PR is an implementation detail in a private method
    • the change does not impact any public API

Commits

  • My commits all reference Jira issues in their subject lines
    • just one commit

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • no new functionality
    • logs are more explicit than before

Code Quality

  • Passes flake8

@dazza-codes force-pushed the patch-1 branch 2 times, most recently from 88d032c to 6f06246, on Aug 15, 2019 04:42
@mik-laj added the provider:amazon-aws (AWS/Amazon - related issues) label on Aug 15, 2019
@dazza-codes force-pushed the patch-1 branch 2 times, most recently from 6945fad to dd75d37, on Aug 15, 2019 18:02
@dazza-codes (Contributor, Author) commented:

Most of the CI builds are OK, but one of them timed out and I don't have access to restart it.

Review comments on airflow/contrib/operators/awsbatch_operator.py (outdated, resolved)
@dazza-codes (Contributor, Author) commented:

Review requests addressed

  • leaving one conversation to @ashb to consider/resolve
  • considered applying black formatting, but skipped it (too many changes)

https://issues.apache.org/jira/browse/AIRFLOW-5218
- avoid the AWS API throttle limits for highly concurrent tasks
- a small increase in the backoff factor could avoid excessive polling
- random sleep before polling to allow the batch task to spin-up
  - the random sleep helps to avoid API throttling
- revise the retry logic slightly to avoid unnecessary pause
  when there are no more retries required
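As a hypothetical illustration of the last two commit bullets (the callable check and its signature are assumptions, not the operator's API), here is a retry loop with a randomized initial sleep that skips the backoff pause once no retries remain:

```python
import random
import time


def run_with_retries(check, max_retries: int = 5) -> bool:
    """Sketch of the commit's retry-loop tweak; illustrative only."""
    # Randomized initial sleep: gives the Batch job time to spin up and
    # staggers highly concurrent tasks so their first polls do not hit
    # the AWS API at the same moment.
    time.sleep(random.uniform(1, 10))
    for tries in range(max_retries):
        if check():
            return True
        if tries < max_retries - 1:
            # Pause only when another attempt will follow, so exhausting
            # the retries does not incur one final, useless backoff sleep.
            time.sleep((1 + tries) ** 2 + random.uniform(0, 3))
    return False


# Example: a status check that succeeds on the third poll.
attempts = iter([False, False, True])
print(run_with_retries(lambda: next(attempts)))  # True
```

Skipping the final sleep matters because a job that exhausts its retries should fail immediately rather than pause for one more full backoff interval first.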
@kaxil kaxil merged commit fc972fb into apache:master Aug 23, 2019
@dazza-codes dazza-codes deleted the patch-1 branch August 23, 2019 17:08
Jerryguo pushed a commit to Jerryguo/airflow that referenced this pull request Sep 2, 2019
ashb pushed a commit that referenced this pull request Oct 11, 2019
(cherry picked from commit fc972fb)
adityav pushed a commit to adityav/airflow that referenced this pull request Oct 14, 2019
(cherry picked from commit fc972fb)
Labels: provider:amazon-aws (AWS/Amazon - related issues)

5 participants