
Chained .expand() calls leads to DAG failure w/ no task failure #31218

Closed
geoffjentry opened this issue May 11, 2023 · 4 comments
Labels
area:core, kind:bug, needs-triage

Comments

@geoffjentry

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

This was on 2.4.3. I have not yet had a chance to try on 2.5.x or 2.6. Will report further when I get a chance.

We were seeing a weird issue with one of our DAGs. The pattern was a series of tasks using .expand() one after the other, with a set of IDs being passed along the chain. Some of the tasks had the effect of filtering down that set of IDs. If the set of IDs wound up empty, you'd expect the subsequent tasks to be skipped.

Occasionally, when that happened, the DAG run would be marked as failed. At some point in the chain, instead of being skipped, the tasks were left as not yet started, and no task had been marked as failed.
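For illustration, here is a minimal sketch of that kind of chain; the task names and the filtering step are hypothetical stand-ins rather than our production code:

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2022, 1, 1), catchup=False)
def chained_expand_sketch():
    @task
    def fetch_ids():
        # Stand-in for the upstream source of IDs.
        return [1, 2, 3, 4, 5]

    @task
    def keep_interesting(ids):
        # Stand-in for one of the filtering steps; if nothing survives,
        # this returns [] and the downstream mapped tasks expand to zero
        # instances, which should leave them skipped.
        return [i for i in ids if i % 2 == 0]

    @task
    def enrich(id_):
        return id_

    @task
    def publish(id_):
        return id_

    kept = keep_interesting(fetch_ids())
    enriched = enrich.expand(id_=kept)
    publish.expand(id_=enriched)


chained_expand_sketch()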

What you think should happen instead

I'd expect the DAG not to be marked as failed when no task obviously failed.

How to reproduce

I've managed to replicate the behavior I saw. That said, the only way I found to reliably toggle the behavior isn't what we were actually doing, so I might have stumbled on a "same issue, different cause" situation.

Example DAG code:

from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="test",
    schedule=None,
    catchup=False,
    start_date=datetime(2022, 1, 1),
)
def test():
    @task
    def generate_some_data():
        # return [1, 2, 3, 4, 5]
        return []

    @task
    def plus_one(datum):
        return datum + 1

    some_data = generate_some_data()
    # some_data = []

    foo = plus_one.expand(datum=some_data)
    bar = plus_one.override(task_id="bar").expand(datum=foo)
    baz = plus_one.override(task_id="baz").expand(datum=bar)
    qux = plus_one.override(task_id="qux").expand(datum=baz)


test()

Run as is, it works as I'd expect: the DAG succeeds, with all but the first task being skipped.
However, if I swap in the commented-out some_data assignment, I get this behavior: the first two tasks are skipped, the last two are not yet started, and the DAG fails.
With this specific example I can see why the second run is marked as failed; my assumption is that it's because no task succeeded. But I don't understand why it fails the way it does. Why does the not-yet-started behavior begin at that point in the chain?
In the real DAG this happens even with successful tasks earlier in the chain, so even my hand-wavy mental explanation above doesn't hold.

[Screenshots attached: Screenshot 2023-04-12 at 8 39 37 AM, Screenshot 2023-04-12 at 8 40 37 AM]
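To confirm that no task instance actually ended up failed in a run like this, the per-task states can be listed straight from the metadata database. This is a rough sketch using Airflow 2.4-era internals (DagRun, TaskInstance, create_session), run inside the Airflow environment, with the dag_id "test" from the example above:

from airflow.models import DagRun, TaskInstance
from airflow.utils.session import create_session

with create_session() as session:
    # Most recent run of the example DAG.
    run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == "test")
        .order_by(DagRun.execution_date.desc())
        .first()
    )
    tis = (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == "test", TaskInstance.run_id == run.run_id)
        .all()
    )
    for ti in tis:
        # Mapped task instances carry a map_index; a state of None is what
        # the UI shows as "no status" / not yet started.
        print(ti.task_id, ti.map_index, ti.state)

In the failing case described above, this should show the last two tasks with no state and nothing in the failed state.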

Operating System

This is using docker container apache/airflow:2.4.3-python3.10

Versions of Apache Airflow Providers

apache-airflow-providers-common-sql==1.2.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-google==8.4.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-postgres==5.2.2
apache-airflow-providers-sendgrid==3.0.0
apache-airflow-providers-slack==6.0.0
apache-airflow-providers-sqlite==3.2.1

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else

It seemed tied to something specific about our DAG code, but we couldn't identify what was causing it. We could artificially change the order of tasks and the like and the problem would go away, but any configuration that was actually valid for what we were trying to do would trigger it.

Eventually for this and other reasons we changed the whole approach and thus the problem went away.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@geoffjentry added the area:core, kind:bug, and needs-triage labels on May 11, 2023

boring-cyborg bot commented May 11, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@dzhigimont
Contributor

@geoffjentry The bug still reproduces on 2.5.0 but not on 2.6.0, so we can close the issue. FYI @potiuk

@benbuckman

@dzhigimont Do you know which change between 2.5 and 2.6 would have fixed this issue?
Thank you

@ephraimbuddy
Contributor

> @dzhigimont Do you know which change between 2.5 and 2.6 would have fixed this issue? Thank you

Should be this: #27964
