[AIRFLOW-XXX] GSoD: Adding Task re-run documentation #6295

Merged
merged 41 commits into from
Nov 27, 2019
41 commits
62524e4
adding dag re-run documentation
Oct 9, 2019
c4a4002
Apply suggestions from code review
KKcorps Oct 9, 2019
9731217
fixing merge conflicts
Oct 9, 2019
16a9222
adding proper license
Oct 10, 2019
fb01a3d
adding dag re-run documentation
Oct 9, 2019
a169e9d
Apply suggestions from code review
KKcorps Oct 9, 2019
d545282
fixing merge conflicts
Oct 9, 2019
e068528
adding end of line
Oct 10, 2019
bab59ea
Apply formatting suggestions from code review
KKcorps Oct 10, 2019
993d6fe
adding description of dag re-run and backfill
Oct 10, 2019
3bc8513
Apply ash's suggestions from code review
KKcorps Oct 11, 2019
06926af
adding code review changes
Oct 11, 2019
c5c281e
adding code review changes
Oct 11, 2019
2ed9ba2
adding code review changes
Oct 11, 2019
d6773ea
Apply suggestions from code review
KKcorps Oct 11, 2019
99675da
adding code review changes
Oct 11, 2019
9948588
adding note blocks
Oct 21, 2019
3569521
adding dag re-run documentation
Oct 9, 2019
e82d0e6
Apply suggestions from code review
KKcorps Oct 9, 2019
65fea30
fixing merge conflicts
Oct 9, 2019
08be1dc
adding proper license
Oct 10, 2019
3bd602c
adding dag re-run documentation
Oct 9, 2019
fcf03a6
Apply suggestions from code review
KKcorps Oct 9, 2019
cadc260
fixing merge conflicts
Oct 9, 2019
8178d37
adding end of line
Oct 10, 2019
52c8799
Apply formatting suggestions from code review
KKcorps Oct 10, 2019
5c187a2
adding description of dag re-run and backfill
Oct 10, 2019
b6d8297
Apply ash's suggestions from code review
KKcorps Oct 11, 2019
cd33fd1
adding code review changes
Oct 11, 2019
17ccbf7
adding code review changes
Oct 11, 2019
573563d
Apply suggestions from code review
KKcorps Oct 11, 2019
996b620
adding code review changes
Oct 11, 2019
709c38d
adding cron hint and dag run directive
Nov 1, 2019
ab93444
adding more details in scheduler
Nov 4, 2019
2840871
adding dag scheduling note
Nov 21, 2019
1c3d3e8
Apply suggestions from code review
KKcorps Nov 23, 2019
d3f00a3
removing low resources comment
Nov 23, 2019
863afe7
Apply suggestions from code review
KKcorps Nov 25, 2019
7595a42
fixing grammar and typos
Nov 25, 2019
5d85e2e
adding command links
Nov 25, 2019
83afb44
fix grammar
Nov 25, 2019
194 changes: 194 additions & 0 deletions docs/dag-run.rst
@@ -0,0 +1,194 @@
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.

DAG Runs
=========
A DAG Run is an object representing an instantiation of the DAG in time.

Each DAG may or may not have a schedule, which informs how DAG Runs are
created. ``schedule_interval`` is defined as a DAG argument and accepts,
preferably, a
`cron expression <https://en.wikipedia.org/wiki/Cron#CRON_expression>`_ as
a ``str``, or a ``datetime.timedelta`` object.

.. tip::
    You can use an online editor for cron expressions, such as `Crontab guru <https://crontab.guru/>`_.

Alternatively, you can also use one of these cron "presets":

+--------------+----------------------------------------------------------------+---------------+
| preset       | meaning                                                        | cron          |
+==============+================================================================+===============+
| ``None``     | Don't schedule, use for exclusively "externally triggered"     |               |
|              | DAGs                                                           |               |
+--------------+----------------------------------------------------------------+---------------+
| ``@once``    | Schedule once and only once                                    |               |
+--------------+----------------------------------------------------------------+---------------+
| ``@hourly``  | Run once an hour at the beginning of the hour                  | ``0 * * * *`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@daily``   | Run once a day at midnight                                     | ``0 0 * * *`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@weekly``  | Run once a week at midnight on Sunday morning                  | ``0 0 * * 0`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@monthly`` | Run once a month at midnight of the first day of the month     | ``0 0 1 * *`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@yearly``  | Run once a year at midnight of January 1                       | ``0 0 1 1 *`` |
+--------------+----------------------------------------------------------------+---------------+
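
The preset-to-cron column of the table can be read as a simple lookup. The sketch below is only an illustration of that mapping: ``CRON_PRESETS`` and ``to_cron`` are hypothetical names and not Airflow's own implementation. ``None`` and ``@once`` are left out because the table gives no cron equivalent for them.

.. code:: python

    # Illustrative copy of the preset table; not Airflow's internal mapping.
    CRON_PRESETS = {
        "@hourly": "0 * * * *",
        "@daily": "0 0 * * *",
        "@weekly": "0 0 * * 0",
        "@monthly": "0 0 1 * *",
        "@yearly": "0 0 1 1 *",
    }


    def to_cron(schedule):
        """Return the cron expression for a preset, or the input unchanged."""
        return CRON_PRESETS.get(schedule, schedule)


    daily = to_cron("@daily")          # preset resolved through the table
    custom = to_cron("30 6 * * 1")     # plain cron strings pass through as-is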

Your DAG will be instantiated for each schedule along with a corresponding
DAG Run entry in the database backend.

.. note::

    If you run a DAG on a ``schedule_interval`` of one day, the run stamped 2020-01-01
    will be triggered soon after 2020-01-01T23:59. In other words, the job instance is
    started once the period it covers has ended. The ``execution_date`` available in the context
    will also be 2020-01-01.

The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG.
Subsequent DAG Runs are created sequentially by the scheduler process, based on your DAG’s
``schedule_interval``. If your ``start_date`` is 2020-01-01 and ``schedule_interval`` is ``@daily``,
the first run will be created on 2020-01-02, i.e., after your start date has passed.
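
As a rough illustration of this timing rule, the following sketch computes the first run for a daily schedule using plain ``datetime`` arithmetic. The ``first_run`` helper is hypothetical and not part of Airflow; it only assumes that a run is stamped with the start of its interval and becomes eligible once that interval has ended.

.. code:: python

    from datetime import datetime, timedelta


    def first_run(start_date, interval):
        """Return (execution_date, earliest_trigger) for the first DAG Run."""
        # The run is stamped with the start of the interval it covers ...
        execution_date = start_date
        # ... but it can only be created after that interval has ended.
        earliest_trigger = start_date + interval
        return execution_date, earliest_trigger


    execution_date, earliest_trigger = first_run(datetime(2020, 1, 1), timedelta(days=1))

Here ``execution_date`` is 2020-01-01, while ``earliest_trigger`` is 2020-01-02.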

Re-run DAG
''''''''''
There can be cases where you will want to execute your DAG again. One such case is when the scheduled
DAG run fails.

.. _dag-catchup:

Catchup
-------

An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a
series of intervals which the scheduler turns into individual DAG Runs and executes. The scheduler, by default, will
kick off a DAG Run for any interval that has not been run since the last execution date (or has been cleared). This concept is called Catchup.

If your DAG is written to handle its own catchup (i.e., it is not limited to the interval, but instead works up to ``Now``, for instance),
then you will want to turn catchup off. This can be done by setting ``catchup=False`` in the DAG or ``catchup_by_default = False``
in the configuration file. When turned off, the scheduler creates a DAG Run only for the latest interval.

.. code:: python

    """
    Code that goes along with the Airflow tutorial located at:
    https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
    """
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from datetime import datetime, timedelta


    default_args = {
        'owner': 'Airflow',
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5)
    }

    dag = DAG(
        'tutorial',
        default_args=default_args,
        start_date=datetime(2015, 12, 1),
        description='A simple tutorial DAG',
        schedule_interval='@daily',
        catchup=False)

In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM
(or from the command line), a single DAG Run will be created with an ``execution_date`` of 2016-01-01,
and the next one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02.

If the ``dag.catchup`` value had been ``True`` instead, the scheduler would have created a DAG Run
for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02,
as that interval hasn’t completed) and would have executed them sequentially.
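
The difference between the two settings can be sketched with plain ``datetime`` arithmetic. The ``runs_to_create`` helper below is hypothetical and greatly simplifies what the scheduler does; it assumes a fixed ``timedelta`` interval and that a run is created only once its interval has completed.

.. code:: python

    from datetime import datetime, timedelta


    def runs_to_create(start_date, interval, now, catchup):
        """List the execution dates of runs to create (simplified model)."""
        execution_dates = []
        current = start_date
        # Only intervals that have fully elapsed are eligible for a run.
        while current + interval <= now:
            execution_dates.append(current)
            current += interval
        if not catchup:
            # Without catchup, keep only the latest completed interval.
            return execution_dates[-1:]
        return execution_dates


    now = datetime(2016, 1, 2, 6, 0)
    with_catchup = runs_to_create(datetime(2015, 12, 1), timedelta(days=1), now, catchup=True)
    without_catchup = runs_to_create(datetime(2015, 12, 1), timedelta(days=1), now, catchup=False)

With the dates from the example, ``with_catchup`` holds one execution date per day from 2015-12-01 through 2016-01-01, while ``without_catchup`` holds only 2016-01-01.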

Catchup is also triggered when you turn off a DAG for a specified period and then re-enable it.

This behavior is great for atomic datasets that can easily be split into periods. Turning catchup off is great
if your DAG performs catchup internally.


Backfill
---------
There can be cases where you want to run the DAG for a specified historical period. For example,
a data-filling DAG is created with ``start_date`` **2019-11-21**, but another user requires the output data from a month earlier, i.e., **2019-10-21**.
This process is known as Backfill.

You may want to backfill the data even when catchup is disabled. This can be done through the CLI
by running the following command:

.. code:: bash

    airflow dags backfill -s START_DATE -e END_DATE dag_id

The `backfill command <cli-ref.html#backfill>`_ will re-run all the instances of the ``dag_id`` for all the intervals between the start date and the end date.

Re-run Tasks
------------
Some of the tasks can fail during the scheduled run. Once you have fixed
the errors after going through the logs, you can re-run the tasks by clearing them for the
scheduled date. Clearing a task instance doesn't delete the task instance record.
Instead, it updates ``max_tries`` to ``0`` and sets the current task instance state to ``None``, which forces the task to re-run.
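
The effect of clearing can be pictured with a toy model. The ``ToyTaskInstance`` class and ``clear`` function below are hypothetical stand-ins, not Airflow's actual models or API; they only mirror the behaviour described above, where the record is kept and two of its fields are reset.

.. code:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class ToyTaskInstance:
        """Hypothetical stand-in for a task instance record (not Airflow's model)."""
        task_id: str
        state: Optional[str]
        max_tries: int


    def clear(task_instance):
        """Reset the record in place instead of deleting it."""
        task_instance.max_tries = 0
        task_instance.state = None  # a state of None makes the task eligible to run again
        return task_instance


    failed = ToyTaskInstance(task_id="load_data", state="failed", max_tries=3)
    cleared = clear(failed)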

Click on the failed task in the Tree or Graph views and then click on **Clear**.
The executor will re-run it.

There are multiple options you can select when re-running:

* **Past** - All the instances of the task in the runs before the current DAG's execution date
* **Future** - All the instances of the task in the runs after the current DAG's execution date
* **Upstream** - The upstream tasks in the current DAG
* **Downstream** - The downstream tasks in the current DAG
* **Recursive** - All the tasks in the child DAGs and parent DAGs
* **Failed** - Only the failed tasks in the current DAG

You can also clear the task through the CLI using the command:

.. code:: bash

    airflow tasks clear dag_id -t task_regex -s START_DATE -e END_DATE

For the specified ``dag_id`` and time interval, the command clears all instances of the tasks matching the regex.
For more options, you can check the help of the `clear command <cli-ref.html#clear>`_ :

.. code:: bash

    airflow tasks clear -h
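
To make the selection rule concrete, here is a minimal sketch of how task instances might be picked by regex and time interval. ``select_for_clear`` is a hypothetical helper, not Airflow code; it assumes that the regex is applied with search semantics and that both ends of the date range are inclusive.

.. code:: python

    import re
    from datetime import datetime


    def select_for_clear(task_instances, task_regex, start_date, end_date):
        """Return (task_id, execution_date) pairs matching the regex and date range."""
        pattern = re.compile(task_regex)
        return [
            (task_id, execution_date)
            for task_id, execution_date in task_instances
            # Assumption: search semantics and an inclusive date range.
            if pattern.search(task_id) and start_date <= execution_date <= end_date
        ]


    task_instances = [
        ("extract_orders", datetime(2019, 11, 1)),
        ("extract_users", datetime(2019, 11, 2)),
        ("load_orders", datetime(2019, 11, 2)),
        ("extract_orders", datetime(2019, 12, 1)),
    ]
    selected = select_for_clear(
        task_instances, "^extract_", datetime(2019, 11, 1), datetime(2019, 11, 30)
    )

Only the two ``extract_`` instances from November are selected; the ``load_orders`` instance and the December run are left untouched.
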
Member

Link to cli docs here too.

Contributor Author

Should I remove the example and just have the CLI doc link or put CLI docs link below the example?


External Triggers
'''''''''''''''''

Note that DAG Runs can also be created manually through the CLI. Just run the command:

.. code:: bash

    airflow dags trigger -e execution_date dag_id

DAG Runs created externally to the scheduler are associated with the trigger’s timestamp and are displayed
in the UI alongside scheduled DAG Runs. The execution date passed inside the DAG can be specified using the ``-e`` argument.
The default is the current date in the UTC timezone.

In addition, you can manually trigger a DAG Run using the web UI (tab **DAGs** -> column **Links** -> button **Trigger Dag**).

To Keep in Mind
''''''''''''''''
* Marking task instances as failed can be done through the UI. This can be used to stop running task instances.
* Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or
  for instance, when the fix has been applied outside of Airflow.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -84,6 +84,7 @@ Content
concepts
scheduler
executor/index
dag-run
Member

I think this probably makes more sense as a page under Concepts -- it doesn't fit with the rest of the top level items we have.

WDYT?

Contributor Author

@mik-laj and I were having a discussion on the same perspective that the concepts page needs to be broken down and some subpages need to move inside it. Since that would require changes in many pages, can we have that as part of another PR?

Member

We can have a subpage like https://airflow.readthedocs.io/en/stable/howto/index.html and have different topics there

Member

@KKcorps What did we decide about this? Merge this PR as is then you split up concepts in to multiple pages afterwards?

Contributor Author

Yes, splitting up the page is a big effort. I think we should merge this and then split it up later.

plugins
security
timezone