Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for TaskGroup in _PythonDecoratedOperator #12312

Merged
merged 4 commits into from
Nov 15, 2020

Conversation

turbaszek
Copy link
Member

@turbaszek turbaszek commented Nov 12, 2020

Crucial feature of functions decorated by @task is to be able
to invoke them multiple times in single DAG. To do this we are
generating custom task_id for each invocation. However, this didn't
work with TaskGroup as the task_id is already altered by adding group_id
prefix. This PR fixes it.

closes: #12309


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@turbaszek
Copy link
Member Author

This was interesting one @casassg @yuqian90 👌

Copy link
Contributor

@casassg casassg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, indeed. Didn't think of this!

airflow/operators/python.py Outdated Show resolved Hide resolved
Copy link
Member

@XD-DENG XD-DENG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Nov 12, 2020
@github-actions
Copy link

The PR needs to run all tests because it modifies core of Airflow! Please rebase it to latest master or ask committer to re-run it!

airflow/operators/python.py Outdated Show resolved Hide resolved
@turbaszek turbaszek added this to the Airflow 2.0.0-beta3 milestone Nov 12, 2020
@yuqian90
Copy link
Contributor

Thanks for raising and addressing the issue! Change looks good. Just a few minor suggestions.

@@ -165,7 +166,7 @@ def __init__(
multiple_outputs: bool = False,
**kwargs,
) -> None:
kwargs['task_id'] = self._get_unique_task_id(task_id, kwargs.get('dag'))
kwargs['task_id'] = self._get_unique_task_id(task_id, kwargs.get('dag'), kwargs.get('task_group'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing kwargs.get, I suggest having dag and task_group as two keyword arguments with defaults. I.e. like this:

    def __init__(
        self,	        self,
        *,	        *,
        python_callable: Callable,	        python_callable: Callable,
        task_id: str,	        task_id: str,
        op_args: Tuple[Any],	        op_args: Tuple[Any],
        op_kwargs: Dict[str, Any],	        op_kwargs: Dict[str, Any],
        multiple_outputs: bool = False,	        multiple_outputs: bool = False,
        dag=None,
        task_group=None,
        **kwargs,	        **kwargs,
    ) -> None:	    ) -> None:
        kwargs['task_id'] = self._get_unique_task_id(task_id, dag, task_group)
        ...
        super().__init__(dag=dag, task_group=task_group, **kwargs)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was something I wanted suggest when we were doing the @task PR. However, I like that we access the kwargs and then we pass them to super constructor. Additionally they are not visible in signature so users don't have to think about them. But no strong opinion on my side

airflow/operators/python.py Outdated Show resolved Hide resolved
@yuqian90
Copy link
Contributor

That being said, I wasn't super aware of the feature of dynamically generating unique task_id when @task decorator is used. Now looking at the code a bit more, I have some concerns with using the dynamic task_id feature too much in production DAGs which may evolve incrementally.

For example, if there are many @task decorated tasks without explicitly given task_id. Their task_ids will be generated sequentially: task__1, task__2, task__3, etc. After the DAG goes into production, one day someone inserts a new task before task__2. The task_ids after that will all be shifted forward by one place. This is going to produce task__1, task__2, task__3, task__4. But at this point the task__3 is no longer the same task__3 as before. This is going to cause confusion when looking at historical DagRuns. E.g. history, log files, etc.

I think we should make it clear somewhere that the dynamic task_id feature should be used carefully if the DAG is expected to change in the future.

@turbaszek
Copy link
Member Author

turbaszek commented Nov 13, 2020

For example, if there are many @task decorated tasks without explicitly given task_id. Their task_ids will be generated sequentially: task__1, task__2, task__3, etc. After the DAG goes into production, one day someone inserts a new task before task__2. The task_ids after that will all be shifted forward by one place. This is going to produce task__1, task__2, task__3, task__4. But at this point the task__3 is no longer the same task__3 as before. This is going to cause confusion when looking at historical DagRuns. E.g. history, log files, etc.

This is a valid problem. I would be in favor of describing this in docs as I'm not sure if we will be able to solve this problem somehow (disabling id auto generation is rather no go for me).

@casassg @kaxil @dimberman @ashb WDYT?

@casassg
Copy link
Contributor

casassg commented Nov 13, 2020

Agree on adding this on the tutorial and concepts documentation for @task decorator

turbaszek and others added 4 commits November 14, 2020 12:14
Crucial feature of functions decorated by @task is to be able
to invoke them multiple times in single DAG. To do this we are
generating custom task_id for each invocation. However, this didn't
work with TaskGroup as the task_id is already altered by adding group_id
prefix. This PR fixes it.

closes: apache#12309
@turbaszek
Copy link
Member Author

@yuqian90 @casassg I added a note about possible problems, let me know what you think 👍

Copy link
Contributor

@casassg casassg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotta love the Knights of Nii

Copy link
Contributor

@yuqian90 yuqian90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@turbaszek turbaszek merged commit 39ea872 into apache:master Nov 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TaskGroup does not support dynamically generated tasks
7 participants