Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not fail requeued TIs #23846

Merged
merged 9 commits into from
Jun 28, 2022
Merged

Do not fail requeued TIs #23846

merged 9 commits into from
Jun 28, 2022

Conversation

tanelk
Copy link
Contributor

@tanelk tanelk commented May 21, 2022

By the time time we process executor events we might have requeued the TI - we should not mark the TI failed in this case.

One way this can happen is deferable operator with quick trigger.

closes: #23824
related: #21316


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragement file, named {pr_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label May 21, 2022
@tanelk
Copy link
Contributor Author

tanelk commented Jun 3, 2022

I ran a test DAG made by @wolfier for 100 runs per DAG (1000 total).
On main branch I had 7 failed runs (all caused by this bug).
On this branch there were none.

@potiuk you said I can ping you for review.
It would be nice to get this quite hard to debug bug fixed.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@potiuk
Copy link
Member

potiuk commented Jun 3, 2022

But need others confirmation too

@potiuk potiuk added this to the Airflow 2.3.3 milestone Jun 3, 2022
@github-actions
Copy link

github-actions bot commented Jun 3, 2022

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 3, 2022
@potiuk
Copy link
Member

potiuk commented Jun 6, 2022

Anyone?

@ashb
Copy link
Member

ashb commented Jun 6, 2022

Taking a look.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way try_number is handled is complex (bad. Internally to TaskInstance class) so could you also test that behaviour is right too?

LGTM once that is added.

@tanelk tanelk requested review from ashb and potiuk June 8, 2022 08:58
@potiuk
Copy link
Member

potiuk commented Jun 19, 2022

@tanelk can you please rebase and make sure what @ashb requested is addressed?

Comment on lines +404 to +416
# ti is queued with another try number - do not fail it
ti1.state = State.QUEUED
ti1.queued_by_job_id = 1
ti1.try_number = 2
session.merge(ti1)
session.commit()

executor.event_buffer[ti1.key.with_try_number(1)] = State.SUCCESS, None

self.scheduler_job._process_executor_events(session=session)
ti1.refresh_from_db(session=session)
assert ti1.state == State.QUEUED
self.scheduler_job.executor.callback_sink.send.assert_not_called()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased a this should check try number handling

@jedcunningham jedcunningham merged commit 66ffe39 into apache:main Jun 28, 2022
ephraimbuddy pushed a commit that referenced this pull request Jun 29, 2022
(cherry picked from commit 66ffe39)
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Jun 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition between Triggerer and Scheduler
6 participants