Chained .expand() calls leads to DAG failure w/ no task failure #31218

geoffjentry · 2023-05-11T15:46:03Z

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

This was on 2.4.3. I have not yet had a chance to try 2.5.x or 2.6, and will report back when I do.

We were seeing a weird issue with one of our DAGs. The pattern was a series of tasks using .expand(), one after the other, with a set of IDs passed along the chain. Some of the tasks filtered that set of IDs down. If the set wound up empty, the subsequent tasks would be skipped, as you'd imagine.

Occasionally, when that happened, the DAG run would be marked as failed. At some point in the chain the tasks were left as not yet started instead of skipped, and no task had been marked as failed.

What you think should happen instead

I'd expect the DAG not to be marked as failed when no task has failed.

How to reproduce

I've managed to replicate the behavior I saw. That said, the only reliable way I found to toggle the behavior isn't what we were actually doing, so I might have stumbled on "same issue, different cause".

Example DAG code:

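from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="test",
    schedule=None,
    catchup=False,
    start_date=datetime(2022, 1, 1),
)
def test():
    @task
    def generate_some_data():
        # return [1, 2, 3, 4, 5]
        return []

    @task
    def plus_one(datum):
        return datum + 1

    # As written: the empty list is produced by an upstream task.
    some_data = generate_some_data()
    # Swap in this literal empty list instead to trigger the failure.
    # some_data = []

    # Each stage maps over the output of the previous one.
    foo = plus_one.expand(datum=some_data)
    bar = plus_one.override(task_id="bar").expand(datum=foo)
    baz = plus_one.override(task_id="baz").expand(datum=bar)
    qux = plus_one.override(task_id="qux").expand(datum=baz)


# Instantiate the DAG so the scheduler can discover it.
test()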

Run as is, it works as I'd expect: the DAG succeeds with all but the first task skipped.
However, if I swap in the commented-out some_data = [] assignment, the behavior changes. The first two tasks are skipped, the last two are not yet started, and the DAG fails.
With this specific example I could see why the second variant is marked as failed: my assumption is that it's because no task succeeded. But I don't understand why it fails the way it does. Why does it choose that point to start leaving tasks as not yet started?
In the real DAG this happens even with successful tasks earlier in the run, so even my handwavy explanation above doesn't hold.
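
One possible reading of the "failed run, no failed task" outcome, offered as an assumption rather than a confirmed diagnosis: if some tasks never reach a finished state and nothing is left that can be scheduled, the run may be treated as deadlocked and marked failed. The toy model below is plain Python, not Airflow code. `State`, `derive_run_state`, and the two state dictionaries are hypothetical names made up for illustration, and the rules are a deliberate simplification.

```python
from enum import Enum


class State(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"
    NONE = "none"  # shown in the UI as "not yet started"


# States in which a task is considered done.
FINISHED = {State.SUCCESS, State.FAILED, State.SKIPPED}


def derive_run_state(task_states, leaf_ids, runnable_ids=()):
    """Simplified, assumed rules for turning task states into a run state."""
    unfinished = [t for t, s in task_states.items() if s not in FINISHED]
    leaves = [task_states[t] for t in leaf_ids]
    if not unfinished:
        # Every task is done: the leaf tasks decide the outcome.
        return "failed" if State.FAILED in leaves else "success"
    if not runnable_ids:
        # Unfinished tasks but nothing schedulable: treated as a deadlock.
        return "failed"
    return "running"


# Variant where the empty list comes from generate_some_data().
expected_run = derive_run_state(
    {
        "generate_some_data": State.SUCCESS,
        "plus_one": State.SKIPPED,
        "bar": State.SKIPPED,
        "baz": State.SKIPPED,
        "qux": State.SKIPPED,
    },
    leaf_ids=["qux"],
)

# Variant with the literal empty list: two tasks skipped, two never started.
observed_run = derive_run_state(
    {
        "plus_one": State.SKIPPED,
        "bar": State.SKIPPED,
        "baz": State.NONE,
        "qux": State.NONE,
    },
    leaf_ids=["qux"],
)

print(expected_run, observed_run)
```

Under these assumed rules the first variant ends as a success and the second as a failure, even though no task in it is in the failed state. The model does not explain why the last two tasks are never skipped in the first place.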

Operating System

This is using docker container apache/airflow:2.4.3-python3.10

Versions of Apache Airflow Providers

apache-airflow-providers-common-sql==1.2.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-google==8.4.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-postgres==5.2.2
apache-airflow-providers-sendgrid==3.0.0
apache-airflow-providers-slack==6.0.0
apache-airflow-providers-sqlite==3.2.1

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else

It seemed tied to something specific about our DAG code, but we couldn't identify what was causing it. We could artificially change the order of tasks and things like that, and the problem would go away. But any configuration that was actually valid for what we were trying to do would trigger it.

Eventually for this and other reasons we changed the whole approach and thus the problem went away.

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

boring-cyborg · 2023-05-11T15:46:05Z

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

dzhigimont · 2023-05-11T18:09:22Z

@geoffjentry The bug still reproduces on 2.5.0 but not on 2.6.0, so we can close the issue. FYI @potiuk

benbuckman · 2023-05-11T18:26:02Z

@dzhigimont Do you know which change between 2.5 and 2.6 would have fixed this issue?
Thank you

ephraimbuddy · 2023-05-12T12:20:34Z

> @dzhigimont Do you know which change between 2.5 and 2.6 would have fixed this issue? Thank you

Should be this: #27964

geoffjentry added the area:core, kind:bug, and needs-triage labels May 11, 2023

ephraimbuddy closed this as completed May 12, 2023
