
[AIRFLOW-XXX] GSoD: Adding Task re-run documentation #6295

Merged
merged 41 commits into apache:master on Nov 27, 2019

Conversation

KKcorps
Contributor

@KKcorps KKcorps commented Oct 9, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

docs/index.rst Outdated
@@ -83,7 +83,11 @@ Content
ui
concepts
scheduler
<<<<<<< HEAD
Member

Something is broken here.

@codecov-io

codecov-io commented Oct 10, 2019

Codecov Report

Merging #6295 into master will decrease coverage by 0.07%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6295      +/-   ##
==========================================
- Coverage   80.41%   80.34%   -0.08%     
==========================================
  Files         612      616       +4     
  Lines       35473    35733     +260     
==========================================
+ Hits        28525    28709     +184     
- Misses       6948     7024      +76
Impacted Files Coverage Δ
airflow/executors/kubernetes_executor.py 58.89% <0%> (-6.34%) ⬇️
airflow/jobs/backfill_job.py 89.9% <0%> (-1.53%) ⬇️
airflow/gcp/operators/bigquery.py 86.6% <0%> (ø) ⬆️
...gle/marketing_platform/sensors/campaign_manager.py 100% <0%> (ø)
...oogle/marketing_platform/hooks/campaign_manager.py 100% <0%> (ø)
..._platform/example_dags/example_campaign_manager.py 0% <0%> (ø)
...e/marketing_platform/operators/campaign_manager.py 91.73% <0%> (ø)
airflow/gcp/operators/cloud_sql.py 88.98% <0%> (+0.72%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c9b44d...52b0bb5. Read the comment docs.

@KKcorps KKcorps requested a review from mik-laj October 10, 2019 20:36
docs/dag-run.rst Outdated
An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a
series of intervals which the scheduler turn into individual DAG Runs and execute. A key capability
of Airflow is that these DAG Runs are atomic and idempotent items. The scheduler, by default, will
kick off a DAG Run for any interval that has not been run (or has been cleared). This concept is called Catchup.
Member

This is mostly true, but slightly misleading in a few edge cases. I'm not sure if this is worth mentioning here or not.

Catchup, and the scheduler in general, will not "fill in gaps" - it will only look forward from the most recent dag run.

For example:

  • I have a daily dag with catchup=False running. This is "d0"
  • I pause that dag for 3 days
  • I then start it again.

At this point we have dagruns for d0, d4, d5, ...

  • I edit the dag to set catchup=True

The scheduler will not go and "fill in" d1, d2, d3.
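A minimal sketch of the kind of DAG being described, assuming Airflow 1.10-era imports (the dag_id and dates are illustrative):

.. code:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # Hypothetical daily DAG with catchup disabled. The scheduler only creates
    # a run for the most recent interval, so intervals skipped while the DAG
    # was paused ("d1", "d2", "d3" above) are never filled in, even if catchup
    # is enabled later.
    dag = DAG(
        dag_id='example_no_catchup',
        start_date=datetime(2019, 10, 1),
        schedule_interval='@daily',
        catchup=False,
    )

    run_this = DummyOperator(task_id='do_nothing', dag=dag)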

Contributor Author

Updated with changes.

docs/dag-run.rst Outdated
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2015, 12, 1),
Member

I know this is what we do in most cases in our docs, but can you move start_date out of default_args and into just a DAG arg please?

docs/dag-run.rst Outdated

**Note**: When clearing a set of tasks’ state in hope of getting them to re-run, it is important
to keep in mind the DAG Run’s state too as it defines whether the scheduler should look
into triggering tasks for that run.
Member

I think that if you clear tasks it will also set the dagrun state to running? But I'm not certain of that.

@ashb
Member

ashb commented Oct 11, 2019

Looking good overall though!

@KKcorps KKcorps changed the title [AIRFLOW-XXX] Adding Task re-run documentation [AIRFLOW-XXX] GSoD: Adding Task re-run documentation Oct 16, 2019
@KKcorps
Contributor Author

KKcorps commented Oct 18, 2019

@@ -84,6 +84,7 @@ Content
concepts
scheduler
executor/index
dag-run
Member

I think this probably makes more sense as a page under Concepts -- it doesn't fit with the rest of the top level items we have.

WDYT?

Contributor Author

@mik-laj and I were discussing the same idea: the concepts page needs to be broken down and some subpages moved inside it. Since that would require changes in many pages, can we do that as part of another PR?

Member

We can have a subpage like https://airflow.readthedocs.io/en/stable/howto/index.html and have different topics there

Member

@KKcorps What did we decide about this? Merge this PR as is, then you split up concepts into multiple pages afterwards?

Contributor Author

Yes, splitting up the page is a big effort. I think we should merge this and then split it up later.

@@ -32,161 +30,10 @@ Airflow production environment. To kick it off, all you need to do is
execute ``airflow scheduler``. It will use the configuration specified in
``airflow.cfg``.

Note that if you run a DAG on a ``schedule_interval`` of one day,
Member

This was quite an important point. I feel we should direct people to the new dag run page (wherever we decide it lives) from here too.

Member

This isn't done yet. A "see also" link is not strong enough.

docs/dag-run.rst Outdated
Your DAG will be instantiated for each schedule along with a corresponding
DAG Run entry in backend.

**Note**: If you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01
Member

I think this and the next paragraph should be in a .. note:: block to make them stand out more. What does that look like?
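For context, a minimal sketch of the scheduling behaviour described in the quoted note (the dag_id and dates are illustrative):

.. code:: python

    from datetime import datetime

    from airflow import DAG

    # A daily DAG starting 2020-01-01. The run stamped with execution_date
    # 2020-01-01 covers the interval 2020-01-01 -> 2020-01-02, so the
    # scheduler triggers it soon after 2020-01-01T23:59, once that interval
    # has ended.
    dag = DAG(
        dag_id='example_daily',
        start_date=datetime(2020, 1, 1),
        schedule_interval='@daily',
    )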

@yegeniy
Contributor

yegeniy commented Oct 25, 2019

This would be very helpful as a document to share with folks onboarding to Airflow. Even if it's imperfect, I think it would be great to get this into the codebase.


To start a scheduler, simply run the command:

.. code:: bash

airflow scheduler

You can start executing a DAG once your scheduler has started running successfully.
Member

"executing" a dag doesn't quite make sense. Either we mean enable it and let the scheduler excute it, or we mean "trigger a manual run".

Contributor Author

I'll change it to 'your dags will start executing'

docs/dag-run.rst Outdated
Re-run DAG
''''''''''
There can be cases where you will want to execute your DAG again. One such case is when the scheduled
DAG run fails. Another can be the scheduled DAG run wasn't executed due to low resources or the DAG being turned off.
Member

Low resources shouldn't stop a dag run from being created, so let's remove that bit.

docs/dag-run.rst Outdated

An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a
series of intervals which the scheduler turn into individual DAG Runs and execute. A key capability
of Airflow is that these DAG Runs are atomic and idempotent items. The scheduler, by default, will
Member

Suggested change
of Airflow is that these DAG Runs are atomic and idempotent items. The scheduler, by default, will
of Airflow is that these DAG Runs should be atomic and idempotent items. The scheduler, by default, will

(should be, as Airflow doesn't do this magically, but it depends on how the tasks are written.)

Contributor Author

OK, then I guess I should just remove the line, as it doesn't make much sense.

Contributor Author

Or I can add "The runs should be atomic and idempotent for catchup to function as expected."

docs/dag-run.rst Outdated
Backfill
---------
There can be the case when you may want to run the dag for a specified historical period e.g. a data pipeline
which dumps data in a DFS every day and another pipeline which requires last 1 month of data in DFS.
Member

I'm not sure this is the best example of backfill.

Backfill's primary aim is for running for date periods before the start date of the task. (Otherwise the scheduler would see the date is before start_date and not do anything)

docs/dag-run.rst Outdated

airflow backfill -s START_DATE -e END_DATE dag_id

The above command will re-run all the instances of the dag_id for all the intervals within the start date and end date.

docs/dag-run.rst Outdated

Click on the failed task in the Tree or Graph views and then click on **Clear**.
``
failed to ``None`` and the executor will re-run it.
Member

Odd line break here.


.. code:: bash

airflow tasks clear -h
Member

Link to CLI docs here too.

Contributor Author

Should I remove the example and just have the CLI doc link, or put the CLI docs link below the example?

docs/dag-run.rst Outdated

There are multiple options you can select to re-run -

* Past - All the instances of the task in the runs before the current DAG's execution date
Member

Suggested change
* Past - All the instances of the task in the runs before the current DAG's execution date
* **Past** - All the instances of the task in the runs before the current DAG's execution date

We should probably bold all the options

@ashb ashb merged commit ac2d0be into apache:master Nov 27, 2019
kaxil pushed a commit that referenced this pull request Jul 1, 2020
@kaxil kaxil added the type:doc label Jul 1, 2020
kaxil pushed a commit that referenced this pull request Jul 1, 2020
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021