[AIRFLOW-5100] Use safe_mode configuration setting by default #5757
Conversation
The scheduler calls `list_py_file_paths` to find DAGs to schedule. It does so without passing any parameters other than the directory. This means that it *won't* discover DAGs that are missing the words "airflow" and "DAG" even if DAG_DISCOVERY_SAFE_MODE is disabled. Since `list_py_file_paths` will refer to the configuration if `include_examples` is not provided, it makes sense to have the same behaviour for `safe_mode`.
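For context, the safe-mode heuristic under discussion can be sketched roughly like this (a simplified illustration, not Airflow's exact code; the real check lives inside `list_py_file_paths`):

```python
import os
import tempfile

def might_contain_dag(file_path, safe_mode=True):
    """Heuristic: with safe mode on, only treat a file as a DAG candidate
    if its contents mention both "airflow" and "dag" (case-insensitive)."""
    if not safe_mode:
        return True  # safe mode disabled: every .py file is a candidate
    with open(file_path, "rb") as f:
        content = f.read().lower()
    return b"dag" in content and b"airflow" in content

# Demo: a file without the magic words is skipped unless safe mode is off.
tmp = tempfile.mkdtemp()
dag_file = os.path.join(tmp, "real_dag.py")
plain_file = os.path.join(tmp, "no_keywords.py")
with open(dag_file, "w") as f:
    f.write("from airflow import DAG\n")
with open(plain_file, "w") as f:
    f.write("print('hello')\n")
```

This is why the bug matters: with safe mode effectively hard-coded on, a DAG file that builds its DAGs dynamically and never spells out those two words is silently ignored by the scheduler.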
I'm okay with making this configurable for the user. However, I don't know who came up with the name "safe_mode", but to me it's a very ambiguous name. If we decide to make it configurable, I suggest we change the name to something more meaningful such as "DAG_DISCOVERY_STRING_CHECK". What do you think?
Hi @BasPH, thanks for responding so quickly. I think we might have different understandings as to what this PR is about. Specifically, it's not about exposing a new setting, it's about making an existing setting actually work. `dag_discovery_safe_mode` is already exposed to the end user as something they can configure:
Pasting Slack messages for posterity:
Unfortunately, the setting isn't respected by the scheduler. So I'd rather not rename the setting in this PR, and instead focus on making the existing functionality work. Thanks!
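For reference, the setting in question lives in the `[core]` section of `airflow.cfg` (section and name as in Airflow 1.10; the value shown is illustrative):

```ini
[core]
# Heuristic DAG discovery: when True, only parse .py files that
# contain both the strings "airflow" and "DAG".
dag_discovery_safe_mode = False
```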
airflow/utils/dag_processing.py (Outdated)
@@ -288,18 +288,21 @@ def correct_maybe_zipped(fileloc):
 COMMENT_PATTERN = re.compile(r"\s*#.*")


-def list_py_file_paths(directory, safe_mode=True,
+def list_py_file_paths(directory, safe_mode=None,
Suggested change:
-def list_py_file_paths(directory, safe_mode=None,
+def list_py_file_paths(directory, safe_mode=conf.getboolean('core', 'dag_discovery_safe_mode', fallback=False),
This one-liner should work.
Can you also add some tests so that we can discover this kind of issue in the future?
Hi @kaxil,
I'm happy to add tests, but as mentioned in the PR description, I need some pointers. Specifically:
- which of the two testing strategies that I suggest would be most appropriate
- if the second, I'm going to have to refactor things, and I don't want to do that if it's going to be in any way controversial
- is there a context manager (or something similar) for temporarily setting configuration variables?
Thanks,
jml
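Regarding the question about temporarily setting configuration variables: a generic sketch of such a context manager could look like the following (hypothetical helper with a dict standing in for the config store; Airflow's real store is `airflow.configuration.conf`):

```python
import contextlib

# Hypothetical stand-in for a config store, keyed by (section, option).
CONFIG = {("core", "dag_discovery_safe_mode"): "True"}

@contextlib.contextmanager
def override_config(overrides):
    """Temporarily apply config overrides, restoring originals on exit."""
    sentinel = object()
    saved = {key: CONFIG.get(key, sentinel) for key in overrides}
    CONFIG.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is sentinel:
                CONFIG.pop(key, None)
            else:
                CONFIG[key] = old
```

Usage: `with override_config({("core", "dag_discovery_safe_mode"): "False"}): ...` restores the original value even if the body raises.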
Changed to set `safe_mode` via default arguments, although I'm falling back to `True` rather than `False`, since that's what the behaviour was before the option was introduced.
Note that this means that configuration changes within the lifetime of the process will not be respected by the default argument, which might surprise people.
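The surprise mentioned here is the usual Python one: default argument expressions are evaluated once, when the `def` statement runs, not on every call. A minimal illustration in generic Python (not Airflow code), including the `None`-sentinel alternative that re-reads the setting on each call:

```python
SETTING = True

def get_setting():
    """Stand-in for conf.getboolean('core', 'dag_discovery_safe_mode')."""
    return SETTING

def check(safe_mode=get_setting()):
    # The default expression ran once, when `def` was executed.
    return safe_mode

def check_lazy(safe_mode=None):
    # None sentinel: the setting is consulted on every call instead.
    if safe_mode is None:
        safe_mode = get_setting()
    return safe_mode

SETTING = False  # simulate the configuration changing after import
```

After the change, `check()` still returns the stale captured default, while `check_lazy()` picks up the new value.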
Other than in tests, there is no way to change the config of a running Airflow process, so that isn't a problem.
We have `from tests.test_utils.config import config_vars`, which is the decorator you use to change configs for a test (but this won't change the default value here, as you are aware).
I think the way I would approach this is to not read the config setting here (i.e. no default), but instead pass the value down the scheduler job, and test via mocking that we pass the value along, which probably means adding it as attributes to a number of places in the call stack. Hmmmm. Maybe leaving this here is fine. (I've been off for a week, brain still isn't firing on all cylinders yet)
(The behaviour of `list_py_file_paths` is already tested via tests/models/test_dagbag.py.)
I agree with your suggested approach: in other projects I have it as a rule that only the very outer-most layers touch the configuration systems, and everything else is passed as parameters.
The reason I chose this approach is that it's the same approach that's already used for `include_examples`. I thought having two different approaches would be more confusing.
Relatedly, I also think it's a bad idea to have a default value for this parameter, but I'm not sure what the backwards compatibility policies are. I guess I could create a new function with mandatory params.
I'm honestly happy to roll up my sleeves and do a chunk of work, I just don't want to get caught in interminable discussion. We've already collectively spent a lot of words & roundtrips on a one line change (my first Airflow PR!), and I'm wary that a 600 line change would be much worse.
So, what do I need to do to get this merged?
Since this is such an internal method that no one outside of the scheduler is likely to call, I'm not very concerned about breaking changes to the signature of this function.
I've re-run the failing test (we haven't managed to fix all flaky tests yet 😢); if that goes green we're probably good to merge.
Thanks for taking the time to dig into and fix this @jml!
…AG files (apache#5757)

(cherry picked from commit c4a9d8b)
Make sure you have checked all steps below.
Jira
Description
See above. There are no UI changes.
Tests
I stared at the code a long time and couldn't figure out a good place to test:
- `list_py_file_paths` doesn't have direct tests
- the existing tests use `tests/dags/` as the source of DAGs
- `SchedulerJob` is actually exhibiting the bug, and is where the behaviour needs to be correct
- we could extract `_make_processor_agent` from `_execute` and assert things about its return value, but I don't want to spend time on that without a core contributor approving it first

Commits
Documentation
Code Quality
flake8