Stop creating duplicate Dag File Processors #13662

Merged
merged 2 commits into master from stop-duplicate-processors
Jan 15, 2021

Conversation

kaxil
Member

@kaxil kaxil commented Jan 14, 2021

When a dag file is being executed by a Dag File Processor and multiple callbacks are
created for it, either via zombies or executor events, the dag file is added to
the _file_path_queue and the manager launches a new process to
process it, which it should not, since the dag file is already being
processed. This eventually bypasses _parallelism, especially when some dag
files take a long time to process: because self._processors is just a dict
keyed by file path, multiple processors for the same file count as one, so
the parallelism limit is not enforced.
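
To make the failure mode concrete, here is a minimal sketch (illustrative only, not the actual DagFileProcessorManager code; launch is a hypothetical stand-in for forking a processor) of how a dict keyed by file path undercounts live processors:

# Illustrative sketch only -- not the real manager loop.
_parallelism = 2
_processors = {}  # file path -> processor; duplicate keys collapse onto one entry
_file_path_queue = ["a.py", "a.py", "b.py", "c.py"]  # "a.py" enqueued twice via callbacks

def launch(file_path):
    return object()  # placeholder for spawning a real processor process

launched = 0
while _file_path_queue and len(_processors) < _parallelism:
    path = _file_path_queue.pop(0)
    _processors[path] = launch(path)  # the second "a.py" silently overwrites the first
    launched += 1

assert launched == 3 and len(_processors) == 2  # three processes launched despite a cap of two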

This addresses the same issue as #11875,
but, unlike that PR, it does not exclude file paths that were recently processed or that
have reached the run limit (which is only used in tests) when Callbacks are sent by the
Agent. This is by design, as the execution of Callbacks is critical. There is
one caveat to avoid duplicate processors: if a processor already exists for a
file path, that file path is removed from the queue. The processor for the
file path carrying the callback will still run, because the file path is added again on the
next loop.
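
A minimal sketch of that dedup step, assuming the queue-filtering shape described above (illustrative, not the PR's exact diff):

# Drop queued paths that already have a live processor; such a path is
# re-enqueued on the next manager loop, so its callback still runs.
_processors = {"a.py": object()}        # "a.py" is currently being processed
_file_path_queue = ["a.py", "b.py"]
_file_path_queue = [p for p in _file_path_queue if p not in _processors]
assert _file_path_queue == ["b.py"]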

Tests are added to verify this behavior.

closes #13047, #11875



@kaxil kaxil added this to the Airflow 2.0.1 milestone Jan 14, 2021
@kaxil kaxil requested a review from ashb January 14, 2021 03:10
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jan 14, 2021

# file_1 and file_2 already have live processors; only file_3 remains queued
assert file_1 in manager._processors.keys()
assert file_2 in manager._processors.keys()
assert [file_3] == manager._file_path_queue
Member Author

Check processor_factory is only called once
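
For context, a self-contained sketch of how such a check might look with unittest.mock (hypothetical, not the actual Airflow test; processors and queue stand in for the manager's internals):

from unittest import mock

# Hypothetical check: the factory is invoked once per file path, even when
# the same path is enqueued twice via callbacks.
processor_factory = mock.MagicMock()
processors = {}
queue = ["a.py", "a.py"]

while queue:
    path = queue.pop(0)
    if path in processors:
        continue  # duplicate entry: skip instead of spawning again
    processors[path] = processor_factory(path)

assert processor_factory.call_count == 1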

Member Author

@ashb Updated in b7fd374

@github-actions

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

When a dag file is being executed by a Dag File Processor and multiple callbacks are
created for it, either via zombies or executor events, the dag file is added to
the _file_path_queue and the manager launches a new process to
process it, which it should not, since the dag file is already being
processing. This eventually bypasses _parallelism, especially when
some dag files take a long time to process.

This addresses the same issue as apache#11875,
but does not exclude file paths that were recently processed or that
have reached the run limit (which is only used in tests) when Callbacks are sent by the
Agent. This is by design, as the execution of Callbacks is critical. There is
one caveat to avoid duplicate processors: if a processor already exists,
the file path is moved from the beginning of the queue to the end instead of
being removed. This means the processor for the file path carrying the
callback still runs before other file paths are added.
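
An illustrative sketch of that earlier queue rotation (not the committed code; queue stands in for _file_path_queue):

queue = ["a.py", "b.py"]  # "a.py" already has a live processor
path = queue.pop(0)       # take the duplicate off the front...
queue.append(path)        # ...and park it at the back instead of dropping it
assert queue == ["b.py", "a.py"]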

Tests are added to verify this behavior.

closes apache#13047
@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jan 15, 2021
@github-actions

The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest master at your convenience, or amend the last commit of the PR and push it with --force-with-lease.

@kaxil kaxil merged commit 32f5953 into apache:master Jan 15, 2021
@kaxil kaxil deleted the stop-duplicate-processors branch January 15, 2021 16:40
kaxil added a commit to astronomer/airflow that referenced this pull request Jan 15, 2021
(cherry picked from commit 32f5953)
kaxil added a commit that referenced this pull request Jan 21, 2021
(cherry picked from commit 32f5953)
Labels
area:Scheduler including HA (high availability) scheduler
full tests needed We need to run full set of tests for this PR to merge

Successfully merging this pull request may close these issues.

Occasional "KeyError" in dag_processing