Add support for more than 1 cron exp per DAG #8649

Closed
jeffolsi opened this issue Apr 30, 2020 · 15 comments
Labels
AIP-39 Timetables kind:feature Feature Requests

Comments

@jeffolsi

Description
Allow a DAG to accept a list of cron expressions and schedule the DAG according to all of them.
This is similar to what can be done with cron jobs.

Use case / motivation
Some schedules, such as every 10 minutes between 16:30 and 18:10, cannot be expressed with a single cron expression. The idea is that a DAG could be scheduled according to more than one cron expression without duplicating the DAG code or the DAG entry in the UI.

Even a simple schedule that is common for ETL, such as bi-weekly, cannot be done with a single cron expression: https://serverfault.com/questions/404398/how-to-schedule-a-biweekly-cronjob

@jeffolsi jeffolsi added the kind:feature Feature Requests label Apr 30, 2020
@jeffolsi jeffolsi changed the title Add support for more than 1 corn exp per DAG Add support for more than 1 cron exp per DAG Apr 30, 2020
@BasPH
Contributor

BasPH commented Apr 30, 2020

An immediate solution to your last sentence is to use timedelta. This is also supported: schedule_interval=timedelta(weeks=2).

@jeffolsi
Author

An immediate solution to your last sentence is to use timedelta. This is also supported: schedule_interval=timedelta(weeks=2).

It's not the same. When you specify a cron expression, you guarantee that tasks fire when the scheduled time comes. If you use timedelta(weeks=2), you risk that a delay in one run will delay the following runs, because it always looks for a two-week difference from the last run.

To explain, let's use a daily schedule for simplicity.
Starting 2020-04-28 with 0 0 * * *, the DAG runs every day:

2020-04-29 00:00:00
2020-04-30 00:00:00

Now let's say that Airflow was down and the run of 2020-04-29 00:00:00 started at 2020-04-29 04:00:00. The next run will still be at 2020-04-30 00:00:00.

On the other hand, starting 2020-04-28 with timedelta(days=1):
if the run of 2020-04-29 00:00:00 started at 2020-04-29 04:00:00, the next run will be at 2020-04-30 04:00:00. The whole schedule is shifted because of the delay!
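The difference between the two behaviours can be sketched in plain Python. This is only a model of the two ways a "next run" could be computed, using the start date and the four-hour delay from the daily example; it is not Airflow's scheduler code, and both function names are made up for illustration.

```python
from datetime import datetime, timedelta


def grid_anchored_next(start: datetime, interval: timedelta, now: datetime) -> datetime:
    """Next fire time on the fixed grid start, start + interval, ... strictly after `now`."""
    steps = (now - start) // interval + 1  # timedelta // timedelta yields an int
    return start + steps * interval


def last_run_anchored_next(last_actual_run: datetime, interval: timedelta) -> datetime:
    """Next fire time measured from when the previous run actually started."""
    return last_actual_run + interval


start = datetime(2020, 4, 28)
day = timedelta(days=1)
delayed_run = datetime(2020, 4, 29, 4, 0)  # the run due at 00:00 started four hours late

print(grid_anchored_next(start, day, delayed_run))   # 2020-04-30 00:00:00
print(last_run_anchored_next(delayed_run, day))      # 2020-04-30 04:00:00
```

In the first model the delay is absorbed and the schedule stays on the hour; in the second, every later run inherits the four-hour offset.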

@BasPH
Contributor

BasPH commented Apr 30, 2020

Can you provide an example (screenshot/code/whatever) where that happens? As far as I know, the next execution date is always computed with the start_date and schedule_interval, not the execution date of the last DAG run.

@jeffolsi
Author

jeffolsi commented May 3, 2020

@BasPH
This is the DAG definition:

with DAG(
    dag_id=DAG_NAME,
    default_args=default_args,
    schedule_interval=timedelta(minutes=60),
    max_active_runs=1,
    catchup=False
) as dag:

This is an example for the execution times:
[Screenshot: delay]

As you can see, this DAG is hourly via timedelta(minutes=60), but that is not the same as specifying @hourly or 0 * * * *. You can also see the gap in times (marked in red) when Airflow was down. When it came up again, it gave a "new" timestamp to the execution_date.

I'm sure you can understand that there is no business logic behind the timestamp of XX:46:10.998426.

So, as said before, timedelta(minutes=60) is not equivalent to @hourly or a cron expression.

@BasPH
Contributor

BasPH commented May 3, 2020

Thanks for pointing this out @jeffolsi, that indeed makes no sense and seems like a fundamental error which should be fixed. What version are you running on? Let's make a separate issue for it.

Regarding the multiple cron expressions, I've seen the request multiple times and think it would be a good addition. The apscheduler library has something for combining intervals: https://apscheduler.readthedocs.io/en/stable/modules/triggers/combining.html. I think similar behaviour would be nice to integrate in Airflow too.

@jeffolsi
Author

jeffolsi commented May 4, 2020

@BasPH I'm running 1.10.3
I'm not sure what exactly to report in the new issue. I don't consider this a bug, but maybe I'm wrong. I just wanted to explain why the suggestion to use timedelta() does not solve this issue, so Airflow needs to support multiple cron expressions for a single DAG.

I think this is a very important feature for Airflow.

@themantalope

@BasPH @jeffolsi

Came across a simple implementation for combining multiple cron strings and croniter objects here

@mdediana
Contributor

I would like to work on this.

The idea would be to allow a list of cron expressions as a schedule_interval. For example, the scheduling in the description would be defined as schedule_interval = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']. Do you think this is the way to go?
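As a rough check that those three expressions together give "every 10 minutes between 16:30 and 18:10", here is a small pure-Python sketch that expands only the minute and hour fields. The helper names are hypothetical, the day and month fields are ignored, and reading the non-standard `30/10` form as "from 30 to the end of the range, step 10" is an assumption.

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one numeric cron field (lists, ranges, steps) into sorted values in [lo, hi]."""
    values: set[int] = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-")
            start, end = int(a), int(b)
        else:
            start = int(base)
            # Assumption: "N/step" means "from N up to the field maximum".
            end = hi if step else start
        values.update(range(start, end + 1, step_n))
    return sorted(values)


def daily_times(expr: str) -> list[tuple[int, int]]:
    """Return the (hour, minute) pairs at which a cron expression fires within one day."""
    minute, hour, *_ = expr.split()
    return [(h, m) for h in expand_field(hour, 0, 23) for m in expand_field(minute, 0, 59)]


exprs = ["30/10 16 * * *", "*/10 17 * * *", "0,10 18 * * *"]
combined = sorted({t for e in exprs for t in daily_times(e)})
for h, m in combined:
    print(f"{h:02d}:{m:02d}")
```

Under those assumptions the union contains eleven fire times, one every ten minutes from 16:30 through 18:10.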

@mik-laj
Member

mik-laj commented May 27, 2020

@mdediana We had long discussions about whether to support multiple scheduler intervals. Many people think that this can affect the presentation and readability of the collected data. This can also complicate the scheduler logic. Can you describe your idea on the mailing list?

@themantalope

@mik-laj

I would recommend that the user be allowed to supply a list of cron strings or cron strings with comma separation. I would then implement an object that has internal logic like this implementation of scheduling with multiple croniter objects. The object should also have a get_next() function similar to the one currently used by the DAG object (see following implementation). If just one cron string is supplied, then the DAG uses the croniter object as is currently implemented.
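A minimal sketch of what such a combining object might look like, written without croniter so it runs on the standard library alone. `MultiSchedule` and `daily_at` are hypothetical names, and the example times are arbitrary; in a real version each stand-in function would presumably wrap one croniter object instead.

```python
from datetime import datetime, timedelta
from typing import Callable, Sequence

# A schedule is modelled as a function returning its first fire time strictly after the argument.
NextFn = Callable[[datetime], datetime]


def daily_at(hour: int, minute: int) -> NextFn:
    """Hypothetical stand-in for one cron schedule: fires once a day at hour:minute."""
    def next_after(now: datetime) -> datetime:
        candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        return candidate if candidate > now else candidate + timedelta(days=1)
    return next_after


class MultiSchedule:
    """Combines several schedules; get_next() returns the earliest upcoming fire time."""

    def __init__(self, schedules: Sequence[NextFn], start: datetime) -> None:
        if not schedules:
            raise ValueError("at least one schedule is required")
        self._schedules = list(schedules)
        self._current = start

    def get_next(self) -> datetime:
        # Ask every schedule for its next fire time and keep the soonest one.
        self._current = min(fn(self._current) for fn in self._schedules)
        return self._current


sched = MultiSchedule([daily_at(9, 0), daily_at(9, 30), daily_at(17, 45)], datetime(2020, 5, 1))
runs = [sched.get_next() for _ in range(4)]
for run in runs:
    print(run)
```

Because each schedule only reports times strictly after the current one, two schedules that coincide at the same instant produce a single run rather than a duplicate.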

@mdediana
Contributor

@mik-laj Sure, I will do that, thanks.

@tambulkar

Is there any update on this?

@sarit-si

sarit-si commented Aug 3, 2020

@mdediana

I would like to work on this.

The idea would be to allow a list of cron expressions as a schedule_interval. For example, the scheduling in the description would be defined as schedule_interval = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']. Do you think this is the way to go?

This will be of great help. Instead of creating separate DAGs for the same job (which is what I am currently doing), this would reduce it to just 1 DAG taking care of multiple schedules. One workaround right now, if the crons are not strict, is to tweak them so that the minutes field is the same for all, for example "45 0,8,13 * * *", which runs at 0045, 0845 and 1345 Hrs respectively.
Unfortunately, the crons in my case are strict (0100, 0815 and 1330 Hrs), so I have to create 3 separate DAGs.
Enabling the schedule interval to accept a list of crons would be very helpful :) 👍
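One way to see why the relaxed times fit a single cron line while the strict ones do not: with the other fields left as `*`, a cron expression fires at every combination of its minute values and hour values, so the set of daily times it can describe is always a full cross product. The sketch below checks that property; the function name is made up for illustration.

```python
from itertools import product


def fits_one_minute_hour_pattern(times: set[tuple[int, int]]) -> bool:
    """True if the (hour, minute) set equals hours x minutes, so one 'M H * * *' line covers it."""
    hours = {h for h, _ in times}
    minutes = {m for _, m in times}
    return times == set(product(hours, minutes))


relaxed = {(0, 45), (8, 45), (13, 45)}  # "45 0,8,13 * * *"
strict = {(1, 0), (8, 15), (13, 30)}    # 0100, 0815 and 1330 Hrs

print(fits_one_minute_hour_pattern(relaxed))  # True
print(fits_one_minute_hour_pattern(strict))   # False
```

For the strict times, a single line such as "0,15,30 1,8,13 * * *" would fire nine times a day instead of three, which is why separate expressions are needed.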

@ashb
Member

ashb commented Jan 20, 2021

I've started a discussion thread on this on the dev mailing list to scope out what a solution to this will look like https://lists.apache.org/thread.html/rb4e004e68574e5fb77ee5b51f4fd5bfb4b3392d884c178bc767681bf%40%3Cdev.airflow.apache.org%3E

Use cases there would be ace (and feedback once we come up with a design)

@ashb ashb added the AIP-39 Timetables label Apr 19, 2021
@eladkal
Contributor

eladkal commented Jan 18, 2022

I think the request as described here (bi-weekly job) is covered fully by AIP 39 already using Timetables
https://airflow.apache.org/docs/apache-airflow/stable/concepts/timetable.html

Closing as issue solved

@eladkal eladkal closed this as completed Jan 18, 2022