Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BashOperator to raise AirflowSkipException on exit code 127 #13421

Merged
merged 7 commits into from
Feb 7, 2021

Conversation

dstandish
Copy link
Contributor

Airflow's skipped task status can be useful when building dags for some trickier requirements.

This PR adds to BashOperator the capability to end the task in skipped state.

If the command exits with code 255, an AirflowSkippedException is thrown.

If you have a better suggestion re choice of exit code, suggest. I chose 255 because I think it is pretty unlikely to be found in the wild, which means this would only be a "breaking" change in a negligible sense.

If we don't like exit codes for this, we could come up with some convention for parsing stdout. But exit codes seems simple and good enough.

@kurtqq
Copy link
Contributor

kurtqq commented Jan 1, 2021

What is special about BashOperator that you suggest this modification only for this operator?
I agree that SKIP state can be useful but I think there should be a more generic approach that will fit all operators rather than specific one.

@dstandish
Copy link
Contributor Author

dstandish commented Jan 1, 2021

What is special about BashOperator that you suggest this modification only for this operator?
I agree that SKIP state can be useful but I think there should be a more generic approach that will fit all operators rather than specific one.

Most other operators do their work in python. So they can easily raise AirflowSkipException. The generic approach is simply to raise AirflowSkipException.

But you can't do that from bash. That's what's special about bash operator. You need some way to signal from within the bash process "ok, i want to skip now", so that when it comes back to python, the exception can be raised. And my proposal is to use exit code for this signalling.

The idea came to me from a stack overflow question where the user was using bash operator, and a possible solution would have been to trigger a skip from within bash, but it's not currently possible.

@kurtqq
Copy link
Contributor

kurtqq commented Jan 1, 2021

Most other operators do their work in python. So they can easily raise AirflowSkipException. The generic approach is simply to raise AirflowSkipException.

Many operators are executed on remote machines that don't have airflow installed in so you cant just raise airflow exception.
Can you explain how to do that for SQL & Spark operators?

@dstandish
Copy link
Contributor Author

dstandish commented Jan 1, 2021

I understand what you're saying. I'm not sure it would be practical to implement for sql, especially in a uniform way. There is so much variation in sql databases, and limitless variation with the types of queries people execute and the returns there may or may not be and the methods used.... In a database where you can throw exceptions like MS Sql, probably you could pick one to be "the skip exception". In others (like snowflake) where you cannot, I can't think of a practical way. Not sure about spark.

In any case though, for bash it is pretty straightforward and clear and easy to understand. We're already evaluating exit code. Why not reserve one of them to mean skip. Anyway, it's an idea 🤷.

@potiuk
Copy link
Member

potiuk commented Jan 1, 2021

In any case though, for bash it is pretty straightforward and clear and easy to understand. We're already evaluating exit code. Why not reserve one of them to mean skip. Anyway, it's an idea .

Yep. I think this is a good idea and Spark/SQL opertors can raise their own ways of raising Skip exception on their own terms. Nothing wrong with that.

@potiuk potiuk added this to the Airflow 2.1 milestone Jan 1, 2021
@potiuk
Copy link
Member

potiuk commented Jan 1, 2021

However before relesing 2.0.1 and deciding what to do with branching we should only merge after this happens (as this is new feature and should go to 2.1.

@github-actions
Copy link

github-actions bot commented Jan 1, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jan 1, 2021
@dstandish
Copy link
Contributor Author

deciding what to do with branching

not familiar with what's going on with branching.... but no need to recap it here if it will surface on dev list

in other news, it looks like one of the tests did not like the addition of the task in example_bash_operator.py... i'll look into this tomorrow.

thanks

@dstandish
Copy link
Contributor Author

OK @potiuk I fixed the test.

One problem is that our example dags also serve as test dags, so if you add another example task (as I did) you might break a test.

The other contributing factor was that the test depended on hardcoded counts of the tasks in the dag.

In my resultion I thought about simply excluding my task from the dag (i.e. purge it from dag1.task_dict). But task removal isn't really supported so I abandoned that.

What I ended up doing is a mild refactor of the test such that the "starting states" were controlled in a dictionary and instead of hardcoding state stats, I add a helper function to read this from the config.

@github-actions
Copy link

github-actions bot commented Jan 2, 2021

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@dstandish
Copy link
Contributor Author

Found more tests that need to be updated

I think example dags should not be used in tests: #13453

@potiuk
Copy link
Member

potiuk commented Jan 3, 2021

@dstandish - while I understand your frustrations (been there, done that), I honestly think it is quite the opposite (and no offence but it is a bit selfish thinking).

To be honest, our plans are to turn all example DAGs (in providers) into system tests. It makes perfect sense to use examples as tests. If we did not do it, your change would make examples invalid for example.

I think having example DAGs wrong (which would have happen if they were not part of the tests) is far worse (for the community) than you having to fix the examples during the change. You ask to trade off your time with the time of at least hundreds of people (users) scratching their heads while following the examples.

Imagine the users asking questions: why it does not work if I follow the example ? Repeat x-times. Trade it off with the time of a single person simply iterating on fixing the tests with a clear information (failed tests). While I think it's worse for you as a peraon, it's far better for the community as a whole to fix the examples in this case.

That is at least my way of thinking - when I spend 5x times more time than I initially thought about on fixing something where i also have to fix documentation, tests and related tests that start to break (and when it turns out that my change need to be a bit more generalised).

No hard feeling - it's just just an opinion and a.way of thinking :). I might be wrong, of course, and the examples might be useless as examples. But I would rather spend even a bit more time to add them to documentation and make them useful in this case as well.

It just follows my motto of 'wirh every change leave the (small) world around you behind a little better that before' :).

@dstandish
Copy link
Contributor Author

dstandish commented Jan 3, 2021

hehe no worries :)

i think the example_bash_operator does a good job of illustrating why example dags should not be used in tests.

in general, tests should have only the elements that are absolutely necessary to test the behavior they are examining.

let me give an example.

consider this class:

class MyObj:
    def __init__(self, param1=None, param2=None):
        self.param1 = param1
        self.param2 = param2
        
    def get_param1(self):
        return self.param1

consider these tests

def test_get_param1():
    m = MyObj('a')
    assert m.get_param1() == 'a'
def test_get_param1():
    m = MyObj('a', 'b')
    assert m.get_param1() == 'a'

The first test is better because the value for param2 is totally irrelevant to the test. Including 'b' is just extra noise, and it is in fact a costly distraction because if there is something wrong with the test, it's another thing we have to chase down to see if it is the cause. And if we need to make a change, then it's another thing we need to understand.

Similarly, when we use example_bash_operator as a test dag for many different tests, inevitably we end up with a lot of extra noise in tests. An example of this is in test test_backfill_multi_dates in test_backfill_job.py. It uses example bash operator.

There is a component in there that spells out expected execution order:

        expected_execution_order = [
            ("runme_0", DEFAULT_DATE),
            ("runme_1", DEFAULT_DATE),
            ("runme_2", DEFAULT_DATE),
            ("runme_0", end_date),
            ("runme_1", end_date),
            ("runme_2", end_date),
            ("also_run_this", DEFAULT_DATE),
            ("also_run_this", end_date),
            ("run_after_loop", DEFAULT_DATE),
            ("run_after_loop", end_date),
            ("run_this_last", DEFAULT_DATE),
            ("run_this_last", end_date),
        ]

I'd bet this test could be accomplished with fewer tasks. But it uses all of them because that's what is in example_bash_operator. And probably as people have added to that example, they add to this and other tests test. But by doing so, we are spending effort, and not getting any return on our effort. Extraneous content in a test makes it tougher to understand exactly what it is testing. So adding more tasks to this test makes this test worse, not better. It increases technical debt. And that is not good for the community. If I add another task to this test, merely because I want to demonstrate new feature in an example dag, then I make this test worse.

Keeping test and example combined is not good for users either. In handcuffs us with regard to writing example dags because we are constrained by the tests that reference it, tests which we don't want to modify without good reason.

So in sum there are two main problems with this approach:

  1. test and example dags are the same
    • constrained from designing example dags to best exemplify features because they have mixed purposes
    • can't freely alter an example without dealing with entirely unrelated tests
  2. having too many tests share the same test dag
    • makes tests unnecessarily complicated and noisy
    • can reach a point where the structure required by one test is incompatible, or at least very inconvenient, when trying to meet the needs of another test
    • developer inconvenience -- need to understand, and alter, many different tests, entirely unrelated to your change
      • non-trivial risk that "fixing" a test actually breaks it, when did not really need to change it at all for your PR

So, I do believe there are costs incurred in (1) mixing test and example dags and (2) having many tests share the same test dag. And my question is, what is the benefit? If there is a return, then great. But I don't see it.

What do you think?

@dstandish
Copy link
Contributor Author

dstandish commented Jan 4, 2021

ok @potiuk ..... if only for the sake of discussion, i have copied example_bash_operator.py as it existed before to tests/dags/test_miscellaneous.py and renamed the dag to miscellaneous_test_dag.

i have called it miscellaneous because, although the bash operator is used in it, it's not used in any tests of the bash operator. it's just a generic test dag used in a variety of tests with no unifying theme.

i did not update references in other tests which also use example_bash_operator but which are not so picky about its structure that they fail when you add a task.

I also left in place the refactor of test_mark_tasks.py because even though the refactor would not be needed with the creation of miscellaneous_test_dag, the improvements help readability and are worth keeping.

@dstandish dstandish requested a review from potiuk January 4, 2021 06:14
@github-actions github-actions bot removed the full tests needed We need to run full set of tests for this PR to merge label Jan 4, 2021
@dstandish dstandish requested a review from ashb January 7, 2021 17:32
@github-actions
Copy link

github-actions bot commented Feb 5, 2021

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@dstandish
Copy link
Contributor Author

@ashb @potiuk i think this one's good to go

with ``set -e;``
Example:
Airflow will evaluate the exit code of the bash command. In general, a non-zero exit code will result in
task failure and zero will result in task success. Exit code ``255`` will throw an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
task failure and zero will result in task success. Exit code ``255`` will throw an
task failure and zero will result in task success. Exit code ``127`` will throw an

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

etc. Docs still refer to 255 below too.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the docs still saying 255

@dstandish dstandish changed the title BashOperator to raise AirflowSkipException on exit code 255 BashOperator to raise AirflowSkipException on exit code 127 Feb 6, 2021
@dstandish
Copy link
Contributor Author

dstandish commented Feb 6, 2021

LGTM other than the docs still saying 255

my bad... fixed. just one flakey build error now so i'm letting this sleeping dog lie.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants