
on_failure_callback not called when task receives termination signal #11086

Closed
madison-ookla opened this issue Sep 22, 2020 · 10 comments · Fixed by #15537
Labels
area:Scheduler including HA (high availability) scheduler kind:bug This is clearly a bug

Comments

@madison-ookla
Contributor

madison-ookla commented Sep 22, 2020

Apache Airflow version: 1.10.7, 1.10.10, 1.10.12

Environment:

  • Cloud provider or hardware configuration: AWS EC2
  • OS (e.g. from /etc/os-release): Linux
  • Kernel (e.g. uname -a): Debian
  • Install tools:
  • Others:

What happened:

For the last several versions of Airflow, we've noticed that when a task is killed by a signal (currently surfaced as Task exited with return code Negsignal.SIGKILL, previously as Task exited with return code -9), the failure email is sent but the on_failure_callback is not called.

This happened fairly frequently for us in the past: we had tasks that consumed large amounts of memory, and occasionally too many of them would run on the same worker and be OOM-killed. In those instances we would receive failure emails containing detected as zombie, but the on_failure_callback would not be called. We were hoping #7025 would resolve this with the most recent upgrade (and we've also taken steps to reduce our memory footprint), but it just happened again recently.
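For reference, the -9 return code is how Python's subprocess layer reports a child terminated by signal 9 (SIGKILL) on Linux/POSIX. A quick standalone check, independent of Airflow:

```python
import signal
import subprocess
import sys

# Spawn a child process and kill it the way the OOM killer would: SIGKILL.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGKILL)
child.wait()

# A negative return code -N means the child died from signal N.
print(child.returncode)  # -9, i.e. -signal.SIGKILL
```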

What you expected to happen:

If a task fails (even if the cause of the failure is a lack of resources), I would expect the on_failure_callback to still be called.

How to reproduce it:

Example DAG setup:

```python
# -*- coding: utf-8 -*-

"""
# POC: On Failure Callback for SIGKILL
"""

from datetime import datetime

import numpy as np

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator


def on_failure_callback(context):
    # Airflow passes the task context as a single positional dict.
    print("===IN ON FAILURE CALLBACK===")
    print("Triggering another run of the task")
    trigger_dag("OOM_test_follower")


def high_memory_task():
    # Allocate ~800 MB of int64s per iteration until the worker OOM-kills us.
    buffers = []
    iteration = 0
    while True:
        print(f"Iteration: {iteration}")
        buffers.append(np.random.randint(1_000_000, size=(1000, 1000, 100)))
        iteration += 1


def failure_task():
    # Ordinary in-process failure, for comparison with the OOM case.
    raise ValueError("whoops")


def print_context(**context):
    print("This DAG was launched by the failure callback")
    print(context)


default_args = {
    "owner": "madison.bowden",
    "start_date": datetime(year=2019, month=7, day=1),
    "email": "your-email",
}

dag = DAG(
    dag_id="OOM_test",
    schedule_interval=None,
    catchup=False,
    default_args=default_args,
)

with dag:
    PythonOperator(
        task_id="oom_task",
        python_callable=high_memory_task,
        on_failure_callback=on_failure_callback,
    )

failure_dag = DAG(
    dag_id="Failure_test",
    schedule_interval=None,
    catchup=False,
    default_args=default_args,
)

with failure_dag:
    PythonOperator(
        task_id="failure_task",
        python_callable=failure_task,
        on_failure_callback=on_failure_callback,
    )

dag_follower = DAG(
    dag_id="OOM_test_follower",
    schedule_interval=None,
    catchup=False,
    default_args=default_args,
)

with dag_follower:
    PythonOperator(
        task_id="oom_task_failure", python_callable=print_context, provide_context=True
    )
```

With the above example, the Failure_test DAG should trigger a run of the OOM_test_follower DAG when it fails. The OOM_test DAG, when triggered, should quickly run out of memory and then fail to trigger a run of OOM_test_follower, demonstrating the bug.

Anything else we need to know:

@madison-ookla madison-ookla added the kind:bug This is clearly a bug label Sep 22, 2020
@boring-cyborg

boring-cyborg bot commented Sep 22, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@turbaszek
Member

Related to #10917 ?

@turbaszek turbaszek added the area:Scheduler including HA (high availability) scheduler label Sep 22, 2020
@madison-ookla
Contributor Author

Related to #10917 ?

Good find! I think this issue is related, but not quite the same. This particular case is explicitly mentioned in @houqp's response here, and it looks like the change that would rectify this (moving callback handling to local_scheduler_job rather than local_task_job, I think) is out of scope for that particular change.

@houqp
Member

houqp commented Sep 24, 2020

yeah, those are two different issues. @madison-ookla what executor are you using?

If you run into OOM, the raw task process receives a SIGKILL instead of a SIGTERM, which cannot be caught and handled by the process itself. #7025 doesn't solve this problem because it only handles cases where the raw task process did not exit by itself or was killed by SIGKILL.
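As an aside, the reason SIGKILL cannot be handled in-process is that the kernel never delivers it to a user-space handler; attempting to register one fails outright. A quick POSIX check, independent of Airflow (can_trap is a hypothetical helper for illustration):

```python
import signal

def can_trap(sig):
    """Return True if a user-space handler can be installed for this signal."""
    try:
        previous = signal.signal(sig, lambda signum, frame: None)
        signal.signal(sig, previous)  # restore the original handler
        return True
    except (OSError, ValueError):
        # The kernel rejects handlers for SIGKILL (and SIGSTOP).
        return False

print(can_trap(signal.SIGTERM))  # True  -- SIGTERM can be caught
print(can_trap(signal.SIGKILL))  # False -- SIGKILL cannot
```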

I think to properly fix this bug, we will need to move the failure callback invocation into the caller of the raw task process, e.g. local_task_job. That way we can simply check the raw task's return code and always invoke the failure callback if it's nonzero, which covers the SIGKILL case.
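A minimal sketch of that idea outside Airflow (supervise, on_failure, and the context dict are hypothetical names for illustration, not the actual local_task_job API): the supervising process waits on the raw task subprocess and fires the callback on any nonzero return code, which includes death by SIGKILL.

```python
import subprocess
import sys

def supervise(argv, on_failure):
    # Launch the raw task as a child process and wait for it to finish.
    proc = subprocess.Popen(argv)
    returncode = proc.wait()
    # A nonzero return code covers ordinary failures and fatal signals
    # alike: a child killed by signal N reports a return code of -N.
    if returncode != 0:
        on_failure({"return_code": returncode})
    return returncode

def on_failure(context):
    print("failure callback invoked:", context)

# The child kills itself with SIGKILL, simulating the OOM killer.
supervise(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"],
    on_failure,
)  # prints: failure callback invoked: {'return_code': -9}
```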

I will update my #10917 PR to cover this.

@madison-ookla
Contributor Author

@houqp Totally agree with your assessment, and if you can incorporate that in #10917 that would be amazing! We definitely only want the callback issued once, and ideally issued regardless of what happens to the process (SIGKILL vs SIGTERM). FWIW we're using the CeleryExecutor.

Thanks a ton!

@molejnik-mergebit

Hello, I have a similar problem. Any ETA on a fix?

@houqp
Member

houqp commented Apr 16, 2021

@molejnik-mergebit this should already have been fixed in 2.0.1 through #10917. What version are you on?

@molejnik-mergebit

@houqp thanks for the heads up, I'm still on the old 1.10.12. I'll try to update and check.

@houqp
Member

houqp commented Apr 23, 2021

@molejnik-mergebit actually, I take that back: this bug is not fixed, but it is now fixable after #10917. I believe @ephraimbuddy is working on it.

@houqp houqp reopened this Apr 23, 2021
@molejnik-mergebit

Thanks for the update. I hit it as well after updating to 2.0.1, but I don't have steps to reproduce it.
