Add different modes to sort dag files for parsing #15046

Merged
merged 1 commit into from
Mar 29, 2021

@kaxil kaxil commented Mar 27, 2021

This commit lets users choose one of the following modes, which the scheduler uses to list and sort the DAG files and decide the parsing order:

  • modified_time: Sort by the modified time of the files. This is useful at large scale to parse recently modified DAGs first.
  • random_seeded_by_host: Sort randomly across multiple schedulers, but in the same order on the same host. This is useful when running the scheduler in HA mode, where each scheduler can parse different DAG files.
  • alphabetical: Sort by filename.
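
For readers skimming the thread, the three modes can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: `sort_file_paths`, its `hostname` and `get_mtime` parameters, and the CRC32-based seeding are all assumptions made for the example. Only the mode names come from the description above, and "most recently modified first" is inferred from it.

```python
import os
import random
import zlib
from typing import Callable, Iterable, List

# Mode names come from the PR description; everything else is illustrative.
SORT_MODES = ("modified_time", "random_seeded_by_host", "alphabetical")


def sort_file_paths(
    file_paths: Iterable[str],
    mode: str,
    hostname: str,
    get_mtime: Callable[[str], float] = os.path.getmtime,
) -> List[str]:
    """Hypothetical helper: return file_paths in the order implied by mode."""
    paths = list(file_paths)
    if mode == "alphabetical":
        return sorted(paths)
    if mode == "modified_time":
        # Most recently modified files come first.
        return sorted(paths, key=get_mtime, reverse=True)
    if mode == "random_seeded_by_host":
        # Seed from the hostname (an assumed scheme) so that one host always
        # shuffles the same way, while other hosts usually get another order.
        rng = random.Random(zlib.crc32(hostname.encode("utf-8")))
        paths = sorted(paths)  # fixed starting point before shuffling
        rng.shuffle(paths)
        return paths
    raise ValueError(f"Unknown sort mode: {mode!r}")
```

Passing `get_mtime` explicitly keeps the sketch testable without touching the filesystem; by default it falls back to `os.path.getmtime`.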

@kaxil kaxil requested a review from ashb March 27, 2021 02:52
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Mar 27, 2021
# Sort the file paths by the parsing order mode
list_mode = conf.get("scheduler", "file_parsing_sort_mode", fallback="modified_time")

if list_mode not in FILE_PARSER_MODES:
Member

This should be in configuration.py's validate method

Member Author

Good point

Member Author

Done in ade6223
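
The reviewer's point, checking the option once when the configuration is validated rather than at the point of use, might look roughly like the sketch below. It is built on the standard library's `configparser` instead of Airflow's own configuration class, so `SketchConfig`, `ConfigError`, and the wording of the message are stand-ins invented for illustration; only the option name, the `modified_time` fallback, and the three mode names come from this PR.

```python
import configparser

FILE_PARSER_MODES = {"modified_time", "random_seeded_by_host", "alphabetical"}


class ConfigError(Exception):
    """Stand-in for AirflowConfigException (hypothetical)."""


class SketchConfig(configparser.ConfigParser):
    """Hypothetical config object that is validated once after loading."""

    def validate(self) -> None:
        # Every option check lives here, so callers can trust the values.
        self._check_file_parsing_sort_mode()

    def _check_file_parsing_sort_mode(self) -> None:
        mode = self.get("scheduler", "file_parsing_sort_mode", fallback="modified_time")
        if mode not in FILE_PARSER_MODES:
            raise ConfigError(
                f"[scheduler] file_parsing_sort_mode is {mode!r}; "
                f"expected one of {sorted(FILE_PARSER_MODES)}"
            )


cfg = SketchConfig()
cfg.read_string("[scheduler]\nfile_parsing_sort_mode = alphabetical\n")
cfg.validate()  # passes silently for a known mode
```

With the check centralized like this, the sorting code can read the option without repeating the membership test.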

airflow/configuration.py (outdated, resolved)
airflow/utils/dag_processing.py (outdated, resolved)
@kaxil kaxil requested a review from ashb March 28, 2021 21:13
parse different DAG files.
* ``alphabetical``: Sort by filename

version_added: 2.0.2
Member

Shouldn't this be 2.1? It feels like a new feature, not a bug fix to me

Member Author

fixed in e394feb

if list_mode not in file_parser_modes:
    raise AirflowConfigException(
        "`[scheduler] file_parsing_sort_mode` should not be "
        + list_mode
Member

Suggested change
-        + list_mode
+        + repr(list_mode)

Member

Maybe use an f-string for this whole lot too?

Member Author

fixed in e394feb
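
To illustrate the two suggestions in this thread, here is a small sketch of why `repr()` (spelled `!r` inside an f-string) helps in the message. The helper name `build_error_message` is made up for the example, and the message covers only the part of the string visible in the hunk above, which is truncated; the text actually committed in e394feb may differ.

```python
def build_error_message(list_mode: str) -> str:
    """Hypothetical helper: format the invalid-mode message as one f-string."""
    # !r applies repr(), so empty or whitespace-padded values stay visible.
    return f"`[scheduler] file_parsing_sort_mode` should not be {list_mode!r}"


print(build_error_message(""))
# `[scheduler] file_parsing_sort_mode` should not be ''
print(build_error_message(" modified_time"))
# `[scheduler] file_parsing_sort_mode` should not be ' modified_time'
```

With plain concatenation, both of these values would be easy to misread: the first message would simply end after "should not be", and the stray leading space in the second would be invisible.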

@kaxil kaxil requested a review from ashb March 29, 2021 14:49
@ashb ashb added this to the Airflow 2.1 milestone Mar 29, 2021
@kaxil kaxil merged commit 2e3eb42 into apache:master Mar 29, 2021
@kaxil kaxil deleted the add-dag-processing-modes branch March 29, 2021 21:15
kaxil added a commit to astronomer/airflow that referenced this pull request Apr 12, 2021
(cherry picked from commit 2e3eb42)
kaxil added a commit to astronomer/airflow that referenced this pull request Apr 26, 2021
(cherry picked from commit 2e3eb42)