Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't allow defaults other than None in context parameters, and improve error message #38015

Merged
merged 6 commits into from
Mar 22, 2024

Conversation

kevinalh
Copy link
Contributor

This PR does two things for the Python task decorator:

  1. Don't allow a default parameter other than None for context parameters, since this is always replaced by None in 2.8. Keeping this behavior allowed would only bring confusion as it's ambiguous. Suggested by @potiuk in Fix None defaults breaking usage of context arguments in decorated tasks #38007 (review).
  2. Improve the error message for cases where the replacement of default values breaks the function signature. For this I opted for a simple try-catch where we let the Python traceback provide the exact nature of the failure.

Closes: #38006


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@kevinalh kevinalh changed the title Context parameters fix Don't allow defaults other than None in context parameters, and improve error message Mar 10, 2024
airflow/decorators/base.py Outdated Show resolved Hide resolved
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect

@kevinalh kevinalh force-pushed the context_parameters_fix branch 2 times, most recently from 4208ad8 to 9e3664f Compare March 16, 2024 13:03
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good now. @uranusjr ?

@kevinalh kevinalh force-pushed the context_parameters_fix branch 2 times, most recently from c550359 to d2f65f3 Compare March 20, 2024 02:27
@potiuk potiuk merged commit a7d7ef6 into apache:main Mar 22, 2024
46 checks passed
Copy link

boring-cyborg bot commented Mar 22, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@jscheffl
Copy link
Contributor

jscheffl commented Apr 7, 2024

Uh, oh, late to the party. I noticed this change which - after some reading - really looks understandable. But when testing 2.9.0RC2 I had a lot of broken DAGs and it felt to be like a breaking change.

Therefore I raised a PR to add a newsfragment such that we have less support issues and user complaints. Still I see it a bit problematic... in our cases the DAGs just broke because we have strong mypy/pylint checks and want to ensure we have typed code...

I our case we had a DAG and wanted to have the DAG params being typed. Imageine something like

with ... as dag:
    @task
    def get_file_list(params):
        """Prepares the file list to be re-computed based on the passed parameters."""
        return params["files"]

    @task(retries=2)
    def recompute_batch(file) -> None:
        ... # do something with the file

    recompute_batch.expand(file=get_file_list())

... so here the get_file_list(params) is not having types defined which is not nice...

Then changing this to get_file_list(params: ParamsDict) is typed - but to add the task in the DAG we need to call the function get_file_list() in the code for the expansion. Required argument then is missing.

So we had the option to use either get_file_list(params: ParamsDict|None) which then requires a check against None in the code in order to satisfy mypy.

That is why in many DAGs we use the signature like def get_file_list(params: ParamsDict = dag.params) -> List[str]:.

With this change being now on the main line. What is the recommended way for key context variables and parallel type checks and demand for typed context?

The full example we use typed today, broken in 2.9.0rc2

with ... as dag:
    @task
    def get_file_list(params: ParamsDict = dag.params) -> List[str]:
        """Prepares the file list to be re-computed based on the passed parameters."""
        result: List[str] = params["files"]
        return result

    @task(retries=2)
    def recompute_batch(file: str) -> None:
        ...

    recompute_batch.expand(file=get_file_list())

UPDATE: Oh even trying to get away from the default I now get a validation error:
ValueError: The key 'params' in args is a part of kwargs and therefore reserved.
So what is then the recommended best-practice to use kwargs in a typed manner? :-(

@potiuk
Copy link
Member

potiuk commented Apr 7, 2024

I think a little bit more context is needed on that one, becuase this is a follow-up and really fix of another problem introdued in 2.8 in #33430 (where we allowed context params to not have defaults at all).

This change is really codifying what has been explained for quite some time - since 2.7.0. I think we explicitly explained that | None should be used for typing

From https://airflow.apache.org/docs/apache-airflow/2.8.0/core-concepts/taskflow.html#context :

@task
def print_ti_info(task_instance: TaskInstance | None = None, dag_run: DagRun | None = None):
    print(f"Run ID: {task_instance.run_id}")  # Run ID: scheduled__2023-08-09T00:00:00+00:00
    print(f"Duration: {task_instance.duration}")  # Duration: 0.972019
    print(f"DAG Run queued at: {dag_run.queued_at}")  # 2023-08-10 00:00:01+02:20

This is how we explain that we expect None in those parameters used by the context. That's the "canonical" way I think we suggest people doing it since 2.7. In 2.8 #33430 introduced the possibility of "NOT" having the defaults.

Technically speaking - params = dag.params is just a mypy hack, because the value set as default will be anyhow overwritten with the params from context. And it might be misleading because people my expect it to be set. And it's a hack that you can do for only subset of arguments - like params in a reasonable/easy way IMHO.

Say - for TaskInstance or DagRun - how would you provide defaults there? I think limiting only to params being set by dag.params as default is even more hack'y and confusing in this context.

There is a class of difficult to spot errors that is avoided with this fix - for example this will not work as you think it will work (if you don't know that run_id will be injected by task_flow):

@task
def print_ti_info(some_value="x", run_id: str = "my_run_id"):
     print(some_value)
     print(run_id)

print_ti_info():

In example above you might simply miss that the context param has the run_id parameter injected, there are other name clashes possible too - especially for task names overriding the context names accidentally. There is quite some ambiguity today and this change is going to address it.

And there is another case - it's not academic problem either actually (See #38006) the change was not really implemented "because we thought it is a good idea" - it is solving a real issue our user had - when #33430 was added - for some of our users they could see "default value cannot follow non-default one) - introduced by #33430 - where we actually allowed context values not to provide defaults at all. This was when user used a parameter (logical_date) that was also accidentally a context parameter. So it detected a mistake the user could miss before.

However yes - I agree that this change might have much bigger effect than we initially anticipated. Others could have used it for similar purpose to yours. We might want to make some other follow-ups after that if we see that this is a problem for others.

The case you have @jscheffl is really a hack, but one that is likely to be used. I don't think it is strong enough reason to make RC4 - don't think so, while I did not get approval from @uranusjr and @Taragolis - we all thought it's a reasonable one, I think (I'd love to hear your thought here as well).

But answering the typing question - one of the solutions (and a simple one) is:

@task
def print_ti_info(params: ParamsDict | None = None, task_instance: TaskInstance | None = None, dag_run: DagRun | None = None):
    assert params and task_instance and dag_run
    ...

A bit ugly, yes, but will work, and I personally think it's much more "straightforward" than smth like that:

def print_ti_info(params: ParamsDict = dag.param, task_instance: TaskInstance = TaskInstance(), dag_run: DagRun = DagRun()):
     ...

Ideally (and maybe that should be a follow-up) we should be able to have (and mybe even publish) Airflow MyPy plugin - someone even requested one for default args #38796 recently https://github.com/apache/airflow/tree/main/dev/mypy/plugin (the ones we have work with default args and outputs). Maybe it is possible - we could write a plugin that could lilely be used to also allow this case (though @uranusjr in #33430 hinted that it might be impossible).

def print_ti_info(params: ParamsDict, task_instance: TaskInstance, dag_run: DagRun):
    ....

print_ti_info()

And maybe that's the best idea to actually do so for anyone who wants to use MyPy in this case?

@jscheffl
Copy link
Contributor

jscheffl commented Apr 7, 2024

Thanks @potiuk for the detailled background. Convinced.
And believe me, I did not want to revert and call for an RC4 - if we can add the newsfragtment on the docs this woul dbe great for others potentially stumbling into it.
Of course I'll start adjusting our DAGs now...

@potiuk
Copy link
Member

potiuk commented Apr 7, 2024

If we can add the newsfragtment on the docs this woul dbe great for others potentially stumbling into it.

Agree :)

utkarsharma2 pushed a commit to astronomer/airflow that referenced this pull request Apr 22, 2024
…ve error message (apache#38015)

* feat: don't allow defaults other than None for context parameters

* feat: improve error message when replacing context parameters with None breaks signature

* feat: improve error message for unsupported position of context key parameter

* style: make stack trace more readable by putting the error message in a variable

* feat: add details to exceptions and related test

* fix: put task decorated function outside pytest.raises call
@ephraimbuddy ephraimbuddy added this to the Airflow 2.9.2 milestone Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug-fix Changelog: Bug Fixes
Projects
None yet
7 participants