
"task_fail" contains duplicates for FK to "task_instance" table #26627

Closed
1 of 2 tasks
kosteev opened this issue Sep 23, 2022 · 10 comments
Labels
area:core kind:bug This is a clearly a bug

Comments

@kosteev
Contributor

kosteev commented Sep 23, 2022

Apache Airflow version

Other Airflow 2 version

What happened

Airflow 2.3.3.
Task instance failures produce duplicate rows in the "task_fail" table for the (dag_id, task_id, run_id, map_index) constraint.

What you think should happen instead

Recently an FK constraint between the task_fail and task_instance tables was introduced:
#22260
Then a change was added to purge duplicates for this constraint from the task_fail table (on db upgrade):
#22769

Removing duplicates on db upgrade to Airflow 2.3+ before establishing the FK between the tables makes sense; however, these duplicates can also occur in a running Airflow 2.3+ instance (see "How to reproduce").

What is the rationale for removing duplicates once on upgrading to Airflow 2.3+ while keeping the possibility of generating duplicates again? Isn't this going to break the foreign key and the integrity of these two tables?

How to reproduce

Trigger a DAG with a task that fails on multiple tries; each failed try produces a duplicate row in the "task_fail" table.
Example (two tries, with a 5-minute retry interval):

id | task_id |    dag_id    |          start_date           |           end_date            | duration | map_index |                run_id
1  | task    | dag1_failing | 2022-09-23 09:11:44.102894+00 | 2022-09-23 09:11:44.469007+00 |        0 |        -1 | scheduled__2022-09-22T00:00:00+00:00
3  | task    | dag1_failing | 2022-09-23 09:16:44.995269+00 | 2022-09-23 09:16:45.310398+00 |        0 |        -1 | scheduled__2022-09-22T00:00:00+00:00
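The duplicate rows above can be detected with a plain GROUP BY over the constraint columns. A minimal sketch, using an in-memory SQLite stand-in (not Airflow's real metadata DB) with the task_fail columns trimmed to the ones relevant here; the data values mimic the two tries shown above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_fail (
        id INTEGER PRIMARY KEY,
        task_id TEXT, dag_id TEXT, run_id TEXT, map_index INTEGER
    )
""")
# Two failed tries of the same task instance -> same (dag_id, task_id, run_id, map_index)
rows = [
    (1, "task", "dag1_failing", "scheduled__2022-09-22T00:00:00+00:00", -1),
    (3, "task", "dag1_failing", "scheduled__2022-09-22T00:00:00+00:00", -1),
]
conn.executemany("INSERT INTO task_fail VALUES (?, ?, ?, ?, ?)", rows)

# Any group with COUNT(*) > 1 is a duplicate under the FK constraint columns
dupes = conn.execute("""
    SELECT dag_id, task_id, run_id, map_index, COUNT(*) AS n
    FROM task_fail
    GROUP BY dag_id, task_id, run_id, map_index
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)
```

Run against a real Airflow metadata DB, the same GROUP BY query (with the actual table) would list every task instance that has failed more than once.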

Operating System

Linux

Versions of Apache Airflow Providers

No response

Deployment

Composer

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@kosteev kosteev added area:core kind:bug This is a clearly a bug labels Sep 23, 2022
@uranusjr
Member

I tend to believe this is not intentional; feel free to submit fixes for the places where duplicates can be created. (Can you provide examples? You didn't explain when duplicates happen.)

@kosteev
Contributor Author

kosteev commented Sep 26, 2022

Sorry, I didn't state it clearly: multiple failures (retries) of the same task instance produce duplicate rows in the task_fail table.

@uranusjr
Member

Hmm yeah that makes sense. I’m guessing the duplicated entries should be expected, and deleting them on upgrade could be an accident…? Not sure. cc @dstandish

@dstandish
Contributor

dstandish commented Sep 26, 2022

Yes I think that removing the duplicates is not necessary.

@dstandish
Contributor

dstandish commented Sep 26, 2022

I'm taking a look at how we use TaskFail though... if the effort is not super great, it would be best if we can figure out what the actual key of this table is and enforce it. I know it's used in the dags/./duration view... not sure what the consequence of having the duplicates is.

but from a purely DB-level referential integrity perspective, there's no issue because it is TF -> TI and not the other way around
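The point about FK direction can be demonstrated with a minimal schema sketch. This uses an in-memory SQLite stand-in with simplified columns and made-up data (run_id "run1" is hypothetical): because the foreign key points from task_fail to task_instance, many fail rows may reference one task instance without violating referential integrity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        PRIMARY KEY (dag_id, task_id, run_id, map_index)
    )
""")
conn.execute("""
    CREATE TABLE task_fail (
        id INTEGER PRIMARY KEY,
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        FOREIGN KEY (dag_id, task_id, run_id, map_index)
            REFERENCES task_instance (dag_id, task_id, run_id, map_index)
    )
""")
conn.execute("INSERT INTO task_instance VALUES ('dag1_failing', 'task', 'run1', -1)")
# Two failures of the same task instance: both inserts satisfy the FK,
# because TF -> TI is many-to-one, not one-to-one.
conn.execute("INSERT INTO task_fail VALUES (1, 'dag1_failing', 'task', 'run1', -1)")
conn.execute("INSERT INTO task_fail VALUES (3, 'dag1_failing', 'task', 'run1', -1)")
n_fails = conn.execute("SELECT COUNT(*) FROM task_fail").fetchone()[0]
print(n_fails)
```

What the FK forbids is the reverse: a task_fail row whose (dag_id, task_id, run_id, map_index) has no matching task_instance row.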

@uranusjr
Member

From a “recording the history” perspective, I’d expect repeated TaskFail events to all be kept, and if the UI can only show one, it should have logic to pull in only the last failure for a TI.

@dstandish
Contributor

dstandish commented Sep 27, 2022

Yeah, sigh... PR to remove the check: #26714

I think in my head the FK ref direction was flipped so in my mind the TF records had to be deduped before adding the key. My mistake. Fortunately, it seems of little consequence.

I looked at the code in the task duration and gantt views, the two locations where some processing of taskfail records is done. Gantt actually seems to work correctly if you have retries enabled on your task.

But in the task duration view, task fails are only accounted for in the cumulative view (they are ignored in the non-cumulative view), and the cumulative view seems broken because the cumulative duration can go down as time increases.

@Jaxing

Jaxing commented Oct 4, 2022

Maybe this belongs in a new issue, but I think it stems from the same problem.
When upgrading from 2.2.5 to 2.3.0 we hit a similar issue, i.e. duplication in the task_fail table.

During the upgrade the duplicates are moved to _airflow_moved__2_3__duplicates__task_fail. However, the schema of this table is quite different from that of the task_fail table.

Here are the columns of _airflow_moved__2_3__duplicates__task_fail:

id, task_id, dag_id, execution_date, start_date, end_date, duration

And here are the columns of the new (?) task_fail:

id, task_id, dag_id, start_date, end_date, duration, map_index, run_id

So it's not clear how I could revert this deletion if I wanted to, since I'm missing data like map_index and run_id.

@dstandish
Contributor

Something like this should work:

INSERT INTO task_fail (id, task_id, dag_id, start_date, end_date, duration, map_index, run_id)
SELECT
    d.id,
    d.task_id,
    d.dag_id,
    d.start_date,
    d.end_date,
    d.duration,
    -1 AS map_index,
    r.run_id
FROM _airflow_moved__2_3__duplicates__task_fail d
JOIN dag_run r ON r.execution_date = d.execution_date;

But you could also just delete the table. The task fail records are only used in the task duration and gantt views. So, the only way you can notice their impact at all is if you're looking at that specific run in the web UI, which you're probably unlikely to do. The failures show up as additional boxes in the gantt view.
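As a sanity check of the restore query above, here is a self-contained sketch that replays it against an in-memory SQLite stand-in. The table and column names follow the comment; all data values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dag_run (run_id TEXT, execution_date TEXT);
    CREATE TABLE task_fail (
        id INTEGER, task_id TEXT, dag_id TEXT,
        start_date TEXT, end_date TEXT, duration INTEGER,
        map_index INTEGER, run_id TEXT
    );
    CREATE TABLE _airflow_moved__2_3__duplicates__task_fail (
        id INTEGER, task_id TEXT, dag_id TEXT, execution_date TEXT,
        start_date TEXT, end_date TEXT, duration INTEGER
    );
    INSERT INTO dag_run VALUES
        ('scheduled__2022-09-22T00:00:00+00:00', '2022-09-22');
    INSERT INTO _airflow_moved__2_3__duplicates__task_fail VALUES
        (3, 'task', 'dag1_failing', '2022-09-22', 't0', 't1', 0);
""")
# The restore: recover run_id by joining the moved rows to dag_run on
# execution_date; map_index falls back to -1 (the unmapped default).
conn.execute("""
    INSERT INTO task_fail (id, task_id, dag_id, start_date, end_date,
                           duration, map_index, run_id)
    SELECT d.id, d.task_id, d.dag_id, d.start_date, d.end_date, d.duration,
           -1 AS map_index, r.run_id
    FROM _airflow_moved__2_3__duplicates__task_fail d
    JOIN dag_run r ON r.execution_date = d.execution_date
""")
restored = conn.execute("SELECT run_id, map_index FROM task_fail").fetchall()
print(restored)
```

The join works because pre-2.3 task_fail rows carried execution_date, which uniquely identifies the dag_run whose run_id the new schema needs.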

leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Oct 4, 2022
Do not purge duplicates from TaskFail table on db upgrade.

It seems that duplicates in this table are expected and there is no sense in purging them, i.e. it was implemented by mistake.
Purging duplicates violates integrity of this table.
See apache/airflow#26627 for details.

Change-Id: I280789879bc69f3e3e30228db2b263a63954739c
GitOrigin-RevId: 809668b733ce3113ce0e6de7b6a3189bc11bac4d
@dstandish
Contributor

Closing this issue; the check was removed in #26714.
