
Scheduler pod hangs when K8s API call fails #28328

Closed · nguyenmphu opened this issue Dec 13, 2022 · 4 comments · Fixed by #28685
Labels: area:core · good first issue · kind:bug · provider:cncf-kubernetes
Milestone: Airflow 2.5.1

Comments

nguyenmphu commented Dec 13, 2022

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow version: 2.3.4

I have deployed Airflow with the official Helm chart in K8s with KubernetesExecutor. Sometimes the scheduler hangs when calling the K8s API. The log:

ERROR - Exception when executing Executor.end
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 752, in _execute
    self._run_scheduler_loop()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 842, in _run_scheduler_loop
    self.executor.heartbeat()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/base_executor.py", line 171, in heartbeat
    self.sync()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 649, in sync
    next_event = self.event_scheduler.run(blocking=False)
  File "/usr/local/lib/python3.8/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 673, in _check_worker_pods_pending_timeout
    for pod in pending_pods().items:
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 213, in request
    r = self.pool_manager.request(method, url,
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 182, in _exit_gracefully
    sys.exit(os.EX_OK)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 773, in _execute
    self.executor.end()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 823, in end
    self._flush_task_queue()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 776, in _flush_task_queue
    self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
  File "<string>", line 2, in qsize
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
    kind, result = conn.recv()
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Then the executor process was killed while the pod was still running, but the scheduler stopped doing any work.

After restarting, the scheduler worked normally.
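For context on the first traceback: the scheduler's SIGTERM handler (`_exit_gracefully`) calls `sys.exit(os.EX_OK)`, which raises `SystemExit` inside whatever frame happens to be executing when the signal arrives — here, a blocking `sock.connect()` in the middle of the K8s API call. A minimal, self-contained illustration of that mechanism (not Airflow code, just a demonstration; assumes a POSIX host):

```python
import os
import signal
import sys

def _exit_gracefully(signum, frame):
    # Mirrors scheduler_job._exit_gracefully: raise SystemExit from inside
    # whatever frame happens to be executing when the signal arrives.
    sys.exit(os.EX_OK)

signal.signal(signal.SIGTERM, _exit_gracefully)

exit_code = None
try:
    # Stand-in for Kubernetes terminating the pod; in the scheduler this
    # arrives while the executor is blocked inside sock.connect(), so the
    # SystemExit surfaces mid-API-call, exactly as in the traceback above.
    os.kill(os.getpid(), signal.SIGTERM)
except SystemExit as exc:
    exit_code = exc.code

print("caught SystemExit with code", exit_code)
```

The second traceback then occurs during cleanup: `executor.end()` touches the multiprocessing-managed `task_queue` after the manager's connection is already gone, producing the `ConnectionResetError`.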

What you think should happen instead

When this error occurs, the executor should restart automatically, or the scheduler process should be killed so Kubernetes can restart the pod.

How to reproduce

No response

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@nguyenmphu nguyenmphu added area:core kind:bug This is a clearly a bug labels Dec 13, 2022
boring-cyborg bot commented Dec 13, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@potiuk potiuk added this to the Airflow 2.5.1 milestone Dec 13, 2022
@potiuk potiuk added good first issue provider:cncf-kubernetes Kubernetes provider related issues labels Dec 27, 2022

potiuk commented Dec 27, 2022

Looks like the error raised while reading from an already-closed connection could be handled more gracefully.

maxnathaniel (Contributor) commented
@potiuk I would like to take on this good first issue. How should the error be handled in this scenario? Should the executor restart? Presumably, the closed connection refers to the call via self.kube_client.list_namespaced_pod?

potiuk commented Dec 30, 2022

I think no restart is needed. Judging from the stack trace, this error is raised when everything has already finished and the connection is simply reset by the thread that reads it, so it should just be ignored.
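A minimal sketch of the defensive handling being suggested — swallow connection errors from the already-dead queue manager during shutdown. The helper name `flush_task_queue` and the `log` callback are hypothetical; the actual fix landed in #28685:

```python
import queue

def flush_task_queue(task_queue, log):
    """Drain the executor's task queue at shutdown, tolerating a dead manager.

    If the multiprocessing manager process backing the queue has already
    exited, qsize()/get_nowait() raise ConnectionResetError (or EOFError);
    during shutdown these are harmless and can be ignored.
    """
    try:
        log(f"Executor shutting down, task_queue approximate size={task_queue.qsize()}")
        while True:
            task_queue.get_nowait()
    except queue.Empty:
        pass  # queue fully drained
    except (ConnectionResetError, EOFError):
        # The manager process is gone; there is nothing left to flush.
        log("Task queue connection already closed, skipping flush")
```

With handling like this, `executor.end()` completes even when the manager connection was reset, instead of leaving the pod Running with a dead scheduler.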
