[CPDEV-103824] - caught sftp eof error during reboot #681

Merged
merged 3 commits into main from bugfix/caught-sftp-eof-error
Aug 26, 2024

Conversation

Imadzuma
Collaborator

Description

When a node is rebooted, an EOFError sometimes occurs and fails the kubemarine process. This intermittent issue occurs when kubemarine holds an open SFTP connection to the node (one is opened on put-to-node and get-from-node operations) and the node unexpectedly closes it from its side. When kubemarine then tries to close the connection from its own side, an EOFError can be raised. This behavior was detected on RHEL 9 nodes:

2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 57, in run_flow
2024-07-03 21:48:02.062 +0400     self._run(resources)
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 88, in _run
2024-07-03 21:48:02.062 +0400     run_actions(resources, self._actions)
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 122, in run_actions
2024-07-03 21:48:02.062 +0400     act.run(resources)
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 231, in run
2024-07-03 21:48:02.062 +0400     run_tasks_recursive(self.tasks, final_list, cluster, self.cumulative_points, [])
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 390, in run_tasks_recursive
2024-07-03 21:48:02.062 +0400     run_tasks_recursive(task, final_task_names, cluster, cumulative_points, __task_path)
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 390, in run_tasks_recursive
2024-07-03 21:48:02.062 +0400     run_tasks_recursive(task, final_task_names, cluster, cumulative_points, __task_path)
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 375, in run_tasks_recursive
2024-07-03 21:48:02.062 +0400     proceed_cumulative_point(cluster, cumulative_points, __task_name, force=force_cumulative_point)
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 508, in proceed_cumulative_point
2024-07-03 21:48:02.062 +0400     call_result = point_method(cluster)
2024-07-03 21:48:02.062 +0400                   ^^^^^^^^^^^^^^^^^^^^^
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/system.py", line 342, in reboot_nodes
2024-07-03 21:48:02.062 +0400     cluster.get_new_nodes_or_self().call(reboot_group)
2024-07-03 21:48:02.062 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/group.py", line 475, in call
2024-07-03 21:48:02.063 +0400     result = action(self, **kwargs)
2024-07-03 21:48:02.063 +0400              ^^^^^^^^^^^^^^^^^^^^^^
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/system.py", line 359, in reboot_group
2024-07-03 21:48:02.063 +0400     return perform_group_reboot(group)
2024-07-03 21:48:02.063 +0400            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/system.py", line 396, in perform_group_reboot
2024-07-03 21:48:02.063 +0400     group.wait_for_reboot(initial_boot_history)
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/group.py", line 724, in wait_for_reboot
2024-07-03 21:48:02.063 +0400     results = self._await_rebooted_nodes(timeout, initial_boot_history=initial_boot_history)
2024-07-03 21:48:02.063 +0400               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/group.py", line 738, in _await_rebooted_nodes
2024-07-03 21:48:02.063 +0400     return executor.wait_for_boot(self.get_hosts(), timeout, initial_boot_history)
2024-07-03 21:48:02.063 +0400            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/executor.py", line 721, in wait_for_boot
2024-07-03 21:48:02.063 +0400     return self._wait_for_boot_with_executor(left_nodes, TPE, timeout, initial_boot_history)
2024-07-03 21:48:02.063 +0400            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/executor.py", line 742, in _wait_for_boot_with_executor
2024-07-03 21:48:02.063 +0400     self._disconnect(left_nodes)
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/executor.py", line 849, in _disconnect
2024-07-03 21:48:02.063 +0400     cxn.close()
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/fabric/connection.py", line 722, in close
2024-07-03 21:48:02.063 +0400     self._sftp.close()
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/paramiko/sftp_client.py", line 195, in close
2024-07-03 21:48:02.063 +0400     self.sock.close()
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/paramiko/channel.py", line 669, in close
2024-07-03 21:48:02.063 +0400     self.transport._send_user_message(m)
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/paramiko/transport.py", line 1953, in _send_user_message
2024-07-03 21:48:02.063 +0400     self._send_message(data)
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/paramiko/transport.py", line 1929, in _send_message
2024-07-03 21:48:02.063 +0400     self.packetizer.send_message(data)
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/paramiko/packet.py", line 435, in send_message
2024-07-03 21:48:02.063 +0400     self.write_all(out)
2024-07-03 21:48:02.063 +0400   File "/usr/local/lib/python3.12/site-packages/paramiko/packet.py", line 368, in write_all
2024-07-03 21:48:02.063 +0400     raise EOFError()
2024-07-03 21:48:02.063 +0400 EOFError

Fixes # (issue)

Solution

  • Added catching of the EOFError raised while closing the SFTP connection. If it happens, a warning message is printed in the logs and the process continues as normal (a new SFTP connection is created if one is needed after the reboot).
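
The approach described above can be illustrated with a minimal, self-contained sketch. This is not the code merged in this PR: `close_connections`, `HealthyConnection`, and `DroppedConnection` are hypothetical names used only for illustration, and the stand-in connection classes replace real fabric connections so the example runs without a cluster. It assumes only what the traceback shows, namely that `close()` may raise `EOFError` when the node has already dropped the channel.

```python
import logging
from typing import Dict, List, Protocol


class Closable(Protocol):
    """Anything with a close() method (hypothetical stand-in for a connection)."""

    def close(self) -> None: ...


def close_connections(connections: Dict[str, Closable],
                      logger: logging.Logger) -> List[str]:
    """Close every connection; return the hosts whose close() raised EOFError."""
    hosts_with_eof: List[str] = []
    for host, cxn in connections.items():
        try:
            cxn.close()
        except EOFError:
            # The remote side already dropped the channel (for example, mid-reboot).
            # Log a warning and keep going instead of failing the whole procedure.
            logger.warning(
                "Connection to %s was already closed by the remote side; "
                "ignoring EOFError", host)
            hosts_with_eof.append(host)
    return hosts_with_eof


class HealthyConnection:
    """Illustrative connection that closes cleanly."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class DroppedConnection:
    """Illustrative connection whose channel was closed by the node."""

    def close(self) -> None:
        raise EOFError()


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    demo = {"node-1": HealthyConnection(), "node-2": DroppedConnection()}
    close_connections(demo, logging.getLogger("sketch"))
```

Only `EOFError` is caught in this sketch, so any other failure during close would still propagate and stop the run.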

Test Cases

TestCase 1 (floating issue)

Test Configuration:

  • Hardware:
  • OS: RHEL 9
  • Inventory:

Steps:

  1. Run kubemarine install with reboot cumulative point;

Results:

Before: the reboot node task fails with an EOFError and kubemarine stops working.
After: a warning message is printed and kubemarine continues working.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Integration CI passed
  • Unit tests. If Yes list of new/changed tests with brief description
  • There are no merge conflicts

Unit tests

Indicate new or changed unit tests and what they do, if any.

@koryaga koryaga marked this pull request as draft August 21, 2024 07:58
@koryaga koryaga self-assigned this Aug 21, 2024
@koryaga koryaga added the bug Something isn't working label Aug 21, 2024
@koryaga koryaga marked this pull request as ready for review August 21, 2024 09:29
@koryaga koryaga requested a review from ilia1243 August 21, 2024 09:29
@koryaga koryaga added the improvement New feature or request label Aug 21, 2024
@koryaga koryaga self-requested a review August 21, 2024 09:54
@koryaga koryaga merged commit 6fd290c into main Aug 26, 2024
42 checks passed
@koryaga koryaga deleted the bugfix/caught-sftp-eof-error branch August 26, 2024 07:32