Chained .expand() calls lead to DAG failure w/ no task failure #31218
Labels
area:core
kind:bug
needs-triage
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
This was on 2.4.3. I have not yet had a chance to try on 2.5.x or 2.6. Will report further when I get a chance.
We were seeing a weird issue with one of our DAGs. The pattern was a series of tasks using .expand() one after the other. A set of IDs was being passed along the chain. Some of the tasks had the effect of filtering down that set of IDs. If the set of IDs wound up empty, as you'd imagine, the subsequent tasks would be skipped.
Occasionally what we'd see was that, when that happened, the DAG would be marked as a failure. At some point, instead of "skipped", the tasks were marked as "not yet started", and no task had been marked as failed.
What you think should happen instead
I'd expect to not have a DAG marked as a failure if there was no obvious failure
How to reproduce
I've managed to replicate the behavior I saw. That said, the only way I found to reliably toggle the behavior isn't what we were doing, so I might have stumbled on "same issue, different cause".
Example DAG code:
Run as is, it works as I'd expect, and the DAG succeeds w/ all but the first task being skipped
However, if I swap out the some_data assignment, I get this behavior: the first two tasks are skipped, the next two are not yet started, and the DAG fails.
With this specific example I could see why the 2nd DAG is marked as failed - my assumption is no task succeeded. But I don't understand why it fails the way it does. Why does it choose that point to start the not yet started behavior?
In the real DAG it exhibits this behavior with successful tasks earlier on so even my handwavy mental explanation above doesn't track.
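For readers who want something concrete to experiment with, here is a minimal sketch of the general pattern described above: mapped tasks chained via .expand(), with a filter step that can leave an empty list. It is not the original DAG from this report and is not claimed to reproduce the failure. Every name in it (keep_even, build_dag, seed_ids, the task names, the dag_id) is hypothetical, and it assumes Airflow 2.4+ with the TaskFlow API.

```python
from __future__ import annotations


def keep_even(ids: list[int]) -> list[int]:
    """Hypothetical filter step: keep only even IDs (may return an empty list)."""
    return [i for i in ids if i % 2 == 0]


def build_dag(seed_ids: list[int]):
    """Wire up a chain of mapped tasks. Requires Airflow 2.4+ to be installed."""
    # Imported here so the filter logic above can be exercised without Airflow.
    import pendulum
    from airflow.decorators import dag, task

    @dag(
        dag_id="chained_expand_sketch",  # hypothetical name
        schedule=None,
        start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
        catchup=False,
    )
    def chained_expand_sketch():
        @task
        def get_ids() -> list[int]:
            # Starting set of IDs passed along the chain.
            return seed_ids

        @task
        def first_step(item_id: int) -> int:
            # One mapped task instance per ID.
            return item_id

        @task
        def filter_ids(ids: list[int]) -> list[int]:
            # Receives all outputs of the mapped task and narrows them down.
            return keep_even(list(ids))

        @task
        def second_step(item_id: int) -> int:
            return item_id

        @task
        def third_step(item_id: int) -> int:
            return item_id

        ids = get_ids()
        processed = first_step.expand(item_id=ids)
        remaining = filter_ids(processed)
        after_second = second_step.expand(item_id=remaining)
        third_step.expand(item_id=after_second)

    return chained_expand_sketch()
```

In a DAG file, something like example_dag = build_dag([1, 3, 5]) at module level would register it. With only odd seed IDs the filter returns an empty list, which is the situation where the downstream mapped tasks would be expected to be skipped rather than left unstarted.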
Operating System
This is using the Docker container apache/airflow:2.4.3-python3.10
Versions of Apache Airflow Providers
apache-airflow-providers-common-sql==1.2.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-google==8.4.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-postgres==5.2.2
apache-airflow-providers-sendgrid==3.0.0
apache-airflow-providers-slack==6.0.0
apache-airflow-providers-sqlite==3.2.1
Deployment
Other 3rd-party Helm chart
Deployment details
No response
Anything else
It seemed tied to something specific about our DAG code, but we couldn't identify anything that was causing it. We could artificially change order of tasks & things like that and the problem would go away. But any configuration that was actually valid for what we were trying to do would trigger this.
Eventually for this and other reasons we changed the whole approach and thus the problem went away.