
Scheduler pod hangs when K8s API call fails #28328

Closed · nguyenmphu opened this issue Dec 13, 2022 · 4 comments · Fixed by #28685
Labels: area:core · good first issue · kind:bug · provider:cncf-kubernetes
Milestone: Airflow 2.5.1

Comments

nguyenmphu commented Dec 13, 2022

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow version: 2.3.4

I have deployed Airflow with the official Helm chart in K8s with KubernetesExecutor. Sometimes the scheduler hangs when calling the K8s API. The log:

ERROR - Exception when executing Executor.end
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 752, in _execute
    self._run_scheduler_loop()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 842, in _run_scheduler_loop
    self.executor.heartbeat()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/base_executor.py", line 171, in heartbeat
    self.sync()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 649, in sync
    next_event = self.event_scheduler.run(blocking=False)
  File "/usr/local/lib/python3.8/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 673, in _check_worker_pods_pending_timeout
    for pod in pending_pods().items:
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 213, in request
    r = self.pool_manager.request(method, url,
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 182, in _exit_gracefully
    sys.exit(os.EX_OK)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 773, in _execute
    self.executor.end()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 823, in end
    self._flush_task_queue()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 776, in _flush_task_queue
    self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
  File "<string>", line 2, in qsize
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
    kind, result = conn.recv()
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Then the executor process was killed while the pod was still running, but the scheduler stopped doing any work.

After restarting, the scheduler worked normally.
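For context on the first traceback: the scheduler's SIGTERM handler (`_exit_gracefully`) calls `sys.exit(os.EX_OK)`, which raises `SystemExit` inside whatever frame happens to be executing when the signal arrives — here, a blocking `sock.connect()` in the middle of the K8s API call. A minimal, self-contained illustration of that mechanism (not Airflow code, just a demonstration; assumes a POSIX host):

```python
import os
import signal
import sys

def _exit_gracefully(signum, frame):
    # Mirrors scheduler_job._exit_gracefully: raise SystemExit from inside
    # whatever frame happens to be executing when the signal arrives.
    sys.exit(os.EX_OK)

signal.signal(signal.SIGTERM, _exit_gracefully)

exit_code = None
try:
    # Stand-in for Kubernetes terminating the pod; in the scheduler this
    # arrives while the executor is blocked inside sock.connect(), so the
    # SystemExit surfaces mid-API-call, exactly as in the traceback above.
    os.kill(os.getpid(), signal.SIGTERM)
except SystemExit as exc:
    exit_code = exc.code

print("caught SystemExit with code", exit_code)
```

The second traceback then occurs during cleanup: `executor.end()` touches the multiprocessing-managed `task_queue` after the manager's connection is already gone, producing the `ConnectionResetError`.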

What you think should happen instead

When this error occurs, the executor should restart automatically, or the scheduler process should be killed so Kubernetes can restart the pod.

How to reproduce

No response

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@nguyenmphu nguyenmphu added area:core kind:bug This is a clearly a bug labels Dec 13, 2022
boring-cyborg bot commented Dec 13, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@potiuk potiuk added this to the Airflow 2.5.1 milestone Dec 13, 2022
@potiuk potiuk added good first issue provider:cncf-kubernetes Kubernetes provider related issues labels Dec 27, 2022

potiuk commented Dec 27, 2022

Looks like the error raised while reading from an already-closed connection could be handled more gracefully.

maxnathaniel (Contributor) commented
@potiuk I would like to take on this good first issue. How should the error be handled in this scenario? Should the executor restart? Presumably, the closed connection refers to the call via self.kube_client.list_namespaced_pod?

potiuk commented Dec 30, 2022

I think no restart is needed. Judging from the stack trace, this error is raised when everything has already finished and the connection is simply reset by the thread that reads it, so it should just be ignored.
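A minimal sketch of the defensive handling being suggested — swallow connection errors from the already-dead queue manager during shutdown. The helper name `flush_task_queue` and the `log` callback are hypothetical; the actual fix landed in #28685:

```python
import queue

def flush_task_queue(task_queue, log):
    """Drain the executor's task queue at shutdown, tolerating a dead manager.

    If the multiprocessing manager process backing the queue has already
    exited, qsize()/get_nowait() raise ConnectionResetError (or EOFError);
    during shutdown these are harmless and can be ignored.
    """
    try:
        log(f"Executor shutting down, task_queue approximate size={task_queue.qsize()}")
        while True:
            task_queue.get_nowait()
    except queue.Empty:
        pass  # queue fully drained
    except (ConnectionResetError, EOFError):
        # The manager process is gone; there is nothing left to flush.
        log("Task queue connection already closed, skipping flush")
```

With handling like this, `executor.end()` completes even when the manager connection was reset, instead of leaving the pod Running with a dead scheduler.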
