Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for multiple cron expressions in schedule_interval #24733

Closed
wants to merge 11 commits into from

Conversation

BasPH
Copy link
Contributor

@BasPH BasPH commented Jun 29, 2022

This PR extends the behavior of CronDataIntervalTimetable to support a collection (list/set/tuple) of strings to the schedule_interval. For example:

with DAG(dag_id="foobar", schedule_interval=["@daily", "0 3 * * MON,TUE"]):
    ...

^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragement file, named {pr_number}.significant.rst, in newsfragments.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only question is is it worth exposing this via schedule_interval or should we make people use a timetable directly for this?

Do we also need a new timetable class, or could the existing CornDataIntervalTimetable be extended to take multiple patterns, and then all we change is what gets passed when upgrading schedule_interval.?

You also haven't provided a description field (which is IIRC what is shown in the UI) for the new timetable class.

+------------------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------+
| Cron preset (``str``) | Convenience cron expression for readability + ``"@daily"`` |
+------------------------------------------+----------------------------------------------------------------------------------------------------------------------+--------------------------------------+
| List of cron expressions/presets | To run at intervals that cannot be expressed by a single cron expression. + ``["0 3 * * *", "0 0 * * MON,TUE"]`` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say if the mode is "and" or "or" (as implemented it's "or")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can be both "and" and "or". For example, when given ["0 3 * * MON,TUE", "@daily"], each expression is looked up in the cron presets and converted if found. Shall I convert to "and/or"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the example to demonstrate ["@daily", "0 3 * * MON,TUE"] is a valid schedule_interval.

docs/apache-airflow/concepts/dags.rst Show resolved Hide resolved
@BasPH
Copy link
Contributor Author

BasPH commented Jun 29, 2022

My only question is is it worth exposing this via schedule_interval or should we make people use a timetable directly for this?

I believe it's more convenient exposing it via schedule_interval since (1) that's currently how other timetables are exposed and (2) timetables to me feel like an implementation detail with lots of intricate details. I believe multiple crons is a common ask and therefore not worth requiring the user to initialize a timetable.

Do we also need a new timetable class, or could the existing CornDataIntervalTimetable be extended to take multiple patterns, and then all we change is what gets passed when upgrading schedule_interval.?

We could extend the existing implementation to take a list of cron expressions (and convert a single string to a list of strings internally to align the business logic).

You also haven't provided a description field (which is IIRC what is shown in the UI) for the new timetable class.

Added.

@uranusjr
Copy link
Member

Exposing this in schedule_interval makes sense to me; this is probably the only semantic that makes sense anyway.

I would propose we accept any Collection[str] input, instead of just list. Passing in a set makes as much sense here as list, maybe even more?

@BasPH BasPH marked this pull request as draft June 30, 2022 07:45
@BasPH
Copy link
Contributor Author

BasPH commented Jun 30, 2022

Merged the MultiCronDataIntervalTimetable into the CronDataIntervalTimetable, which now accepts both a single string and a list of strings.

Converted the PR to draft for the moment because I had to change the logic on _should_fix_dst and plan to add tests to ensure correct functioning.

@uranusjr Agreed, will add.

@BasPH
Copy link
Contributor Author

BasPH commented Jul 1, 2022

@uranusjr Added support for sets (& tuples). Didn't set the type to Collection though because that also includes dicts, which is not supported. Since there's no special type for Collections expect dicts (as far as I know?) the type is now Union[List[str], Set[str], Tuple[str, ...]], does that work for you?

@BasPH
Copy link
Contributor Author

BasPH commented Jul 20, 2022

@uranusjr @ashb any more thoughts on this?

@@ -105,7 +105,8 @@


DagStateChangeCallback = Callable[[Context], None]
ScheduleInterval = Union[None, str, timedelta, relativedelta]
MultiCron = Union[List[str], Set[str], Tuple[str, ...]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This additional type alias doesn’t seem to be necessary?

(Also this can probably be simply Collection[str])

Comment on lines +172 to +174
if isinstance(interval, str) or (
isinstance(interval, (list, set, tuple)) and all(isinstance(element, str) for element in interval)
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isinstance(interval, str) or (
isinstance(interval, (list, set, tuple)) and all(isinstance(element, str) for element in interval)
):
if isinstance(interval, str) or (
isinstance(interval, Collection) and all(isinstance(element, str) for element in interval)
):

Comment on lines +2506 to +2510
orm_dag.schedule_interval = (
list(dag.schedule_interval)
if isinstance(dag.schedule_interval, set)
else dag.schedule_interval
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to special-case set?

Comment on lines +111 to +113
def __init__(
self, crons: Union[str, List[str], Set[str], Tuple[str, ...]], timezone: Union[str, Timezone]
) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of normalizing here, I wonder if it’s easier to make this only accept List[str] and normalize in DAG.__init__ instead.

@@ -197,6 +210,9 @@ schedule interval put in place, the logical date is going to indicate the time
at which it marks the start of the data interval, where the DAG run's start
date would then be the logical date + scheduled interval.

.. tip::
For more information on ``logical date``, see :ref:`data-interval` and :ref:`faq:what-does-execution-date-mean`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For more information on ``logical date``, see :ref:`data-interval` and :ref:`faq:what-does-execution-date-mean`.
For more information on *logical date*, see :ref:`data-interval` and :ref:`faq:what-does-execution-date-mean`.

nit (this word is not code so let’s not make it look like code)

@github-actions
Copy link

github-actions bot commented Sep 6, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 6, 2022
@github-actions github-actions bot closed this Sep 11, 2022
@Inetov
Copy link

Inetov commented Jul 11, 2023

Hi @BasPH !
what does it take to finish this? very useful feature!

@moreaupascal56
Copy link

Hi all,
+1 for @Inetov, I can try to help on this as well. I am using this timetable for now (https://stackoverflow.com/a/72492370) but a native option would be great 👍

@jprumsey
Copy link

+1, native option here would be great

@uranusjr
Copy link
Member

Please refrain from posting messages such as “+1” since it does not add to the conversation nor move forward the feature. If you are interested in seeing the functionality, please start a pull request yourself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:serialization kind:documentation stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants