
[AIRFLOW-5100] Use safe_mode configuration setting by default #5757

Merged 2 commits into apache:master on Aug 12, 2019

Conversation

@jml (Contributor) commented Aug 8, 2019

The scheduler calls list_py_file_paths to find DAGs to schedule. It does so
without passing any parameters other than the directory. This means that
it won't discover DAGs that are missing the words "airflow" and "DAG" even
if DAG_DISCOVERY_SAFE_MODE is disabled.

Since list_py_file_paths will refer to the configuration if
include_examples is not provided, it makes sense to have the same behaviour
for safe_mode.

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

See above. There are no UI changes.

Tests

I stared at the code for a long time and couldn't figure out a good place to test.

  • list_py_file_paths doesn't have direct tests
    • adding a test for this particular branch might be OK, but feels a bit like a change-detector test
    • If you can point me to a context manager to temporarily override a configuration setting, I could probably write some direct tests that use tests/dags/ as the source of DAGs
  • SchedulerJob is actually exhibiting the bug, and is where the behaviour needs to be correct
    • There's no obvious API to exercise that would let me test this
    • I could theoretically extract a _make_processor_agent from _execute, and assert things about its return value, but I don't want to spend time on that without a core contributor approving it first
    • I would still need a context manager to temporarily override configuration settings
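The kind of helper being asked for above can be sketched as a context manager. This is a generic illustration over a plain dict standing in for the configuration store, not Airflow's actual test utility:

```python
import contextlib

# A plain dict standing in for the global configuration store.
CONFIG = {"dag_discovery_safe_mode": True}

@contextlib.contextmanager
def override_config(key, value):
    # Temporarily set a configuration value, restoring the previous state
    # on exit even if the body raises.
    missing = object()
    old = CONFIG.get(key, missing)
    CONFIG[key] = value
    try:
        yield
    finally:
        if old is missing:
            del CONFIG[key]
        else:
            CONFIG[key] = old
```

A test would then wrap the code under test in `with override_config("dag_discovery_safe_mode", False): ...` and assert on the discovered files.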

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards-incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

@BasPH (Contributor) left a comment:

I'm okay with making this configurable for the user. However, I don't know who came up with the name "safe_mode", but to me it's a very ambiguous name. If we decide to make it configurable, I suggest we change the name to something more meaningful, such as "DAG_DISCOVERY_STRING_CHECK". What do you think?

@jml (Contributor, Author) commented Aug 8, 2019

Hi @BasPH,

Thanks for responding so quickly. I think we might have different understandings as to what this PR is about. Specifically, it's not about exposing a new setting, it's about making an existing setting actually work.

dag_discovery_safe_mode is already exposed to the end user as something they can configure:

Pasting Slack messages for posterity:

@jml OK, we figured out what was going on. We disabled safe mode, and removed some comments that added "airflow" and "DAG" to our dag configs. Those DAGs then stopped running. We are confused, because we thought the whole point of disabling safe mode was so that they would run.
@ashb It Should have.

Unfortunately, the setting isn't respected by the scheduler.

So, I'd rather not rename the setting in this PR, and instead focus on making the existing functionality work.

Thanks!
jml

@@ -288,18 +288,21 @@ def correct_maybe_zipped(fileloc):
 COMMENT_PATTERN = re.compile(r"\s*#.*")


-def list_py_file_paths(directory, safe_mode=True,
+def list_py_file_paths(directory, safe_mode=None,
Member review comment:

Suggested change:
-def list_py_file_paths(directory, safe_mode=None,
+def list_py_file_paths(directory, safe_mode=conf.getboolean('core', 'dag_discovery_safe_mode', fallback=False),

This one-liner should work.

Can you also add some tests so that we can catch this kind of issue in the future?

@jml (Contributor, Author) replied:

Hi @kaxil,

I'm happy to add tests, but as mentioned in the PR description, I need some pointers. Specifically:

  • which of the two testing strategies that I suggest would be most appropriate
  • if the second, I'm going to have to refactor things, and I don't want to do that if it's going to be in any way controversial
  • is there a context manager (or something similar) for temporarily setting configuration variables

Thanks,
jml

@jml (Contributor, Author) commented:

Changed to set safe_mode via default arguments, although I'm falling back to True rather than False, since that's what the behaviour was before the option was introduced.

Note that this means that configuration changes within the lifetime of the process will not be respected by the default argument, which might surprise people.
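The surprise mentioned above is ordinary Python default-argument behaviour: the default expression is evaluated once, when the `def` statement executes, not on each call. A minimal illustration (all names here are made up):

```python
# Demonstrates why a config-reading default argument does not track
# later configuration changes.
CONFIG = {"safe_mode": True}

def read_safe_mode():
    return CONFIG["safe_mode"]

# The default below is computed exactly once, when `def` executes,
# while CONFIG["safe_mode"] is still True.
def discover(safe_mode=read_safe_mode()):
    return safe_mode

CONFIG["safe_mode"] = False
print(discover())                  # True: the default was captured earlier
print(discover(read_safe_mode()))  # False: an explicit argument sees the change
```

So callers that rely on the default see the configuration as it was at import time, which is what the comment above is flagging.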

Member reply:

Other than in tests, there is no way to change the config of a running Airflow process, so that isn't a problem.

We have from tests.test_utils.config import config_vars, which is the decorator you can use to change configs for a test (but this won't change the default value here, as you are aware).

I think the way I would approach this is to not read the config setting here (i.e. no default), but instead pass the value down through the scheduler job, and test via mocking that we pass the value along, which probably means adding it as attributes in a number of places in the call stack. Hmmmm. Maybe leaving this here is fine. (I've been off for a week, brain still isn't firing on all cylinders yet.)

(The behaviour of list_py_file_paths is already tested via tests/models/test_dagbag.py)
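The mocking approach described above might look roughly like this; `run_scheduler` and the surrounding names are hypothetical stand-ins for the scheduler's call path, not Airflow's real API:

```python
from unittest import mock

def run_scheduler(list_py_file_paths, dags_folder, safe_mode):
    # Stand-in for the scheduler: the only thing under test is that the
    # configured safe_mode value is passed through to DAG discovery.
    return list_py_file_paths(dags_folder, safe_mode=safe_mode)

# Replace discovery with a mock and assert the value was forwarded.
fake_discovery = mock.Mock(return_value=[])
run_scheduler(fake_discovery, "/dags", safe_mode=False)
fake_discovery.assert_called_once_with("/dags", safe_mode=False)
```

A test in this style would catch a regression where the scheduler stops forwarding the setting, which is exactly the bug this PR fixes.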

@jml (Contributor, Author) commented:

I agree with your suggested approach: in other projects I have it as a rule that only the outermost layers touch the configuration system, and everything else is passed as parameters.

The reason I chose this approach is that it's the same approach that's already used for include_examples. I thought having two different approaches would be more confusing.

Relatedly, I also think it's a bad idea to have a default value for this parameter, but I'm not sure what the backwards compatibility policies are. I guess I could create a new function with mandatory params.

I'm honestly happy to roll up my sleeves and do a chunk of work, I just don't want to get caught in interminable discussion. We've already collectively spent a lot of words & roundtrips on a one line change (my first Airflow PR!), and I'm wary that a 600 line change would be much worse.

So, what do I need to do to get this merged?

Member reply:

Since this is such an internal method that no one outside of the scheduler is likely to call, I'm not very concerned about breaking changes to the signature of this function.

Member reply:

I've re-run the failing test (we haven't managed to fix all flaky tests yet 😢), if that goes green we're probably good to merge.

@ashb ashb merged commit c4a9d8b into apache:master Aug 12, 2019
@ashb (Member) commented Aug 12, 2019

Thanks for taking the time to dig in to and fix this @jml!

ashb pushed a commit to ashb/airflow that referenced this pull request Aug 20, 2019
…AG files (apache#5757)


(cherry picked from commit c4a9d8b)