[PSUPCLPL-8487] Prevent Kubemarine from getting stuck (#93)
* PSUPCLPL-8487 Prevent Kubemarine from getting stuck

* Updated error codes support

* Fix review comments

* Update Troubleshooting.md

Co-authored-by: shmo1218 <[email protected]>
iLeonidze and shmo1218 authored Jan 25, 2022
1 parent 3562057 commit 59d8578
Showing 10 changed files with 359 additions and 34 deletions.
208 changes: 205 additions & 3 deletions documentation/Troubleshooting.md
@@ -1,6 +1,12 @@
This section provides troubleshooting information for Kubemarine and Kubernetes solutions.

- [Trobleshooting Tools](#troubleshooting-tools)
- [KubeMarine Errors](#kubemarine-errors)
- [KME0001: Unexpected exception](#kme0001-unexpected-exception)
- [KME0002: Remote group exception](#kme0002-remote-group-exception)
- [KME0003: Action took too long to complete and timed out](#kme0003-action-took-too-long-to-complete-and-timed-out)
- [KME0004: There are no workers defined in the cluster scheme](#kme0004-there-are-no-workers-defined-in-the-cluster-scheme)
- [KME0005: {hostname} is not a sudoer](#kme0005-hostname-is-not-a-sudoer)
- [Troubleshooting Tools](#troubleshooting-tools)
- [etcdctl script](#etcdctl-script)
- [Troubleshooting Kubernetes Generic Issues](#troubleshooting-kubernetes-generic-issues)
- [CoreDNS Responds with High Latency](#coredns-responds-with-high-latency)
@@ -13,7 +19,203 @@ This section provides troubleshooting information for Kubemarine and Kubernetes
- [Numerous generation of auditd system messages ](#numerous-generation-of-auditd-system)
- [Failing during installation on Ubuntu OS](#failing-during-installation-on-ubuntu-os)

# Trobleshooting Tools
# KubeMarine Errors

This section lists all known errors, with explanations and recommendations for fixing them. If an
error occurs during the execution of any procedure, you can look it up here.


## KME0001: Unexpected exception

```
FAILURE - TASK FAILED xxx
Reason: KME001: Unexpected exception
Traceback (most recent call last):
File "/home/centos/repos/kubemarine/kubemarine/src/core/flow.py", line 131, in run_flow
task(cluster)
File "/home/centos/repos/kubemarine/kubemarine/install", line 193, in deploy_kubernetes_init
cluster.nodes["worker"].new_group(apply_filter=lambda node: 'master' not in node['roles']).call(kubernetes.init_workers)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 165, in call
return self.call_batch([action], **{action.__name__: kwargs})
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 179, in call_batch
results[action] = action(self, **action_kwargs)
File "/home/centos/repos/kubemarine/kubemarine/src/kubernetes.py", line 238, in init_workers
reset_installation_env(group)
File "/home/centos/repos/kubemarine/kubemarine/src/kubernetes.py", line 60, in reset_installation_env
group.sudo("systemctl stop kubelet", warn=True)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 34, in sudo
return self.do("sudo", *args, **kwargs)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 106, in do
self.workaround(exception)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 119, in workaround
raise e from None
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 95, in do
return self._do(do_type, args, kwargs)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 141, in _do
with ThreadPoolExecutor(max_workers=len(self.nodes)) as executor:
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 104, in __init__
raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
```

This error occurs when an unexpected exception is raised at runtime that does not yet have a
classifying code.

To fix it, first try checking the nodes and the cluster with
[IAAS checker](Kubecheck.md#iaas-procedure) and [PAAS checker](Kubecheck.md#paas-procedure). If you
see failed tests, try fixing the cause of the failure. If the error persists, inspect the
stacktrace and try to work out a solution yourself as far as possible.

If you still can't resolve this error yourself, start
[a new issue](https://github.com/Netcracker/KubeMarine/issues/new) and attach a description of the
error with its stacktrace. We will try to help as soon as possible.

If you were able to solve the problem yourself, let us know about it and your solution by
[opening a new PR](https://github.com/Netcracker/KubeMarine/pulls). Our team will appreciate it!


## KME0002: Remote group exception

Shell error:

```
FAILURE!
TASK FAILED xxx
KME0002: Remote group exception
10.101.10.1:
Encountered a bad command exit code!
Command: 'apt install bad-package-name'
Exit code: 127
Stdout:
Stderr:
bash: apt: command not found
```

Hierarchical error:

```
FAILURE!
TASK FAILED xxx
KME0002: Remote group exception
10.101.10.1:
KME0003: Action took too long to complete and timed out
```

This error indicates that a bash command exited unexpectedly on a remote cluster host. It occurs
when a command terminates with a non-zero exit code.

The error prints the status of the command execution for each node in the group on which the command
was executed. The status can be a successful result (shell output), a result with an error
(shell error), or a hierarchical KME with its own code.

To fix it, first try checking the nodes and the cluster with
[IAAS checker](Kubecheck.md#iaas-procedure) and [PAAS checker](Kubecheck.md#paas-procedure). If you
see failed tests, try fixing the cause of the failure. Make sure that you do everything according to
the instructions in the correct sequence and correctly fill the inventory and other dependent
files. If the error persists, try to figure out what is causing the command to fail on the remote
nodes and fix it yourself as far as possible.

If you still can't resolve this error yourself, start
[a new issue](https://github.com/Netcracker/KubeMarine/issues/new) and attach a description of the
error with its stacktrace. We will try to help as soon as possible.


## KME0003: Action took too long to complete and timed out

```
FAILURE!
TASK FAILED xxx
KME0002: Remote group exception
10.101.10.1:
KME0003: Action took too long to complete and timed out
```

This error occurs when a command does not complete within the specified timeout.

The error can occur if the remote hypervisor or host hangs, if the executed command hangs, or if
the SSH connection is unexpectedly dropped or there are other network problems between the deployer
node and the cluster.

The longest possible timeout for the command is 2700 seconds (45 minutes).

To resolve this error, check each of the items listed above that may hang and fix the hang
manually: reboot the hypervisor or node, fix the environment or settings of the executable, update
it, repair the network channel, or take any other action that, in your opinion, unblocks the frozen
stage of the procedure. It is also useful to check the cluster with the
[IAAS checker](Kubecheck.md#iaas-procedure) to detect problems with network connectivity.
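
For illustration only, the sketch below is not Kubemarine code: it shows how a hung action surfaces
as this error. This commit maps `concurrent.futures.TimeoutError` to `KME0003` in
`kubemarine/core/errors.py`, so a remote command whose future does not finish within the configured
timeout is reported under this code.

```python
# Illustrative sketch only (not Kubemarine's implementation): a hung action
# raises concurrent.futures.TimeoutError, which is classified as KME0003.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def hung_remote_command():
    time.sleep(5)  # stands in for a remote command that never finishes in time

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(hung_remote_command)
    try:
        future.result(timeout=1)  # analogous to the command_execution timeout
    except TimeoutError:
        print("KME0003: Action took too long to complete and timed out")
```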


## KME0004: There are no workers defined in the cluster scheme

```
FAILURE!
KME0004: There are no workers defined in the cluster scheme
```

This error is caused by the absence of nodes with the `worker` role in the inventory file. It occurs
before the payload is executed on the cluster.

To fix it, you need to either specify new nodes with the `worker` role, or add the `worker` role to
the existing master nodes.

An example of specifying different nodes with separate `master` and `worker` roles is as follows.

```yaml
- address: 10.101.1.1
  internal_address: 192.168.101.1
  name: master-1
  roles:
    - master
- address: 10.101.1.2
  internal_address: 192.168.101.2
  name: worker-1
  roles:
    - worker
```

An example of assigning both the `master` and `worker` roles to a single node is as follows.

```yaml
- address: 10.101.1.1
  internal_address: 192.168.101.1
  name: master-1
  roles:
    - master
    - worker
```

**Note**: Masters with the `worker` role remain control planes; however, application pods are also
scheduled on them.


## KME0005: {hostname} is not a sudoer

```
FAILURE!
TASK FAILED prepare.check.sudoer
KME0005: 10.101.1.1 is not a sudoer
```
The error reports that the specified node does not have superuser rights. The error occurs
before the payload is executed on the cluster when running the `install` or `add_node` procedure.
To fix this, add the connection user to the sudoers on the cluster node.

An example for Ubuntu (reboot required) is given below.

```bash
sudo adduser <username> sudo
```


# Troubleshooting Tools

This section describes the additional tools that Kubemarine provides for convenient troubleshooting of various issues.

@@ -193,7 +395,7 @@ To run defragmentation for all cluster members list all endpoints sequentially
```
`ENDPOINT_IP` is the internal IP address of the etcd endpoint.
> **_Note:_** that defragmentation to a live member blocks the system from reading and writing data while rebuilding its states. It is not recommended to run defragmentation for all etcd members at the same time.
> **Note**: The defragmentation to a live member blocks the system from reading and writing data while rebuilding its states. It is not recommended to run defragmentation for all etcd members at the same time.
## etcdctl defrag return context deadline exceeded
130 changes: 130 additions & 0 deletions kubemarine/core/errors.py
@@ -0,0 +1,130 @@
# Copyright 2021-2022 NetCracker Technology Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
from traceback import print_exc
from typing import Union

from fabric.exceptions import GroupException
from concurrent.futures import TimeoutError

KME_DICTIONARY = {
    "KME0000": {
        "name": "Test exception"
    },
    "KME0003": {
        "instance": TimeoutError,
        "name": "Action took too long to complete and timed out"
    },
    "KME0004": {
        "name": "There are no workers defined in the cluster scheme"
    },
    "KME0005": {
        "name": "{hostname} is not a sudoer"
    }
}


# TODO: support for more complex KME00XX objects with custom constructors
class KME(RuntimeError):
    def __init__(self, code, **kwargs):
        self.code = code
        self.kme = KME_DICTIONARY.get(self.code)
        if self.kme is None:
            raise ValueError('An error was raised with an unknown error code')
        self.message = self.kme.get('name').format(**kwargs)
        super().__init__(self.message)

    def __str__(self):
        return self.code + ": " + self.message


def pretty_print_error(reason: Union[str, Exception], log=None) -> None:
    """
    Parses the passed error and nicely displays its name and structure depending on what was passed.
    The method outputs to stdout by default, but will use the logger if one is specified.
    :param reason: an object containing an exception or other error (must be able to be represented
     as a string)
    :param log: logger object, if you need to write a log there
    :return: None
    """

    if reason == "":
        return

    if isinstance(reason, KME):
        if log:
            log.critical(reason)
        else:
            sys.stderr.write(str(reason))

        return

    for dictionary_code, dictionary_kme in KME_DICTIONARY.items():
        if dictionary_kme.get('instance') and isinstance(reason, type(dictionary_kme['instance'])):
            kme = KME(dictionary_code)

            if log:
                log.critical(kme)
            else:
                sys.stderr.write(str(kme))

            return

    if isinstance(reason, GroupException):
        description = "KME0002: Remote group exception"

        if log:
            log.critical(description)
        else:
            sys.stderr.write(f"{description}\n")

        for connection, result in reason.result.items():
            if log:
                log.critical("%s:" % connection.host)
            else:
                sys.stderr.write("\n%s:" % connection.host)

            found_dictionary_code = None
            for dictionary_code, dictionary_kme in KME_DICTIONARY.items():
                if dictionary_kme.get('instance') \
                        and isinstance(result, dictionary_kme['instance']):
                    found_dictionary_code = dictionary_code
                    break

            if found_dictionary_code:
                kme = KME(found_dictionary_code)
                if log:
                    log.critical("\t" + str(kme))
                else:
                    sys.stderr.write("\n\t%s\n" % str(kme))
            else:
                if log:
                    log.critical("\t" + str(result).replace("\n", "\n\t"))
                else:
                    sys.stderr.write("\n\t%s\n" % str(result).replace("\n", "\n\t"))

        return

    if isinstance(reason, Exception):
        if log:
            log.critical('KME0001: Unexpected exception', exc_info=True)
        else:
            sys.stderr.write("KME0001: Unexpected exception\n\n")
            print_exc()
    else:
        if log:
            log.critical(reason)
        else:
            sys.stderr.write(reason + "\n")
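
Below is a hedged usage sketch, not part of this commit, of how other Kubemarine modules might raise
and render these classified errors; the call site is assumed, while the import path follows the new
file's location.

```python
# Hypothetical usage sketch; not code from this commit.
from kubemarine.core.errors import KME, pretty_print_error

try:
    # KME0005 fills its {hostname} placeholder from the keyword argument.
    raise KME("KME0005", hostname="10.101.1.1")
except KME as e:
    print(e)               # -> "KME0005: 10.101.1.1 is not a sudoer"
    pretty_print_error(e)  # writes the same message to stderr, or to a logger if one is passed
```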
3 changes: 3 additions & 0 deletions kubemarine/core/group.py
@@ -520,6 +520,9 @@ def _do(self, do_type: str, nodes: Connections, is_async, *args, **kwargs) -> _H
if kwargs.get("hide") is None:
kwargs['hide'] = True

if kwargs.get("timeout", None) is None:
kwargs["timeout"] = self.cluster.globals['nodes']['command_execution']['timeout']

execution_timeout = kwargs.get("timeout", None)

results = {}
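
A minimal sketch of the fallback introduced above, simplified under assumptions: when the caller
does not pass an explicit `timeout`, the global `command_execution` timeout is applied, so a hung
remote command can no longer block a procedure indefinitely. The 2700-second value below is only an
example taken from the documented maximum in Troubleshooting.md.

```python
# Simplified illustration of the default-timeout fallback; not the full Kubemarine code.
def resolve_timeout(kwargs: dict, globals_config: dict) -> int:
    if kwargs.get("timeout", None) is None:
        kwargs["timeout"] = globals_config['nodes']['command_execution']['timeout']
    return kwargs["timeout"]

globals_config = {'nodes': {'command_execution': {'timeout': 2700}}}
print(resolve_timeout({}, globals_config))               # 2700 -> global default applied
print(resolve_timeout({'timeout': 60}, globals_config))  # 60 -> explicit value preserved
```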