[PSUPCLPL-8487] Prevent Kubemarine from getting stuck (#93)
* PSUPCLPL-8487 Prevent Kubemarine from getting stuck

* Updated error codes support

* Fix review comments

* Update Troubleshooting.md

Co-authored-by: shmo1218 <[email protected]>
iLeonidze and shmo1218 authored Jan 25, 2022
1 parent 3562057 commit 59d8578
Showing 10 changed files with 359 additions and 34 deletions.
208 changes: 205 additions & 3 deletions documentation/Troubleshooting.md
@@ -1,6 +1,12 @@
This section provides troubleshooting information for Kubemarine and Kubernetes solutions.

- [Trobleshooting Tools](#troubleshooting-tools)
- [KubeMarine Errors](#kubemarine-errors)
- [KME0001: Unexpected exception](#kme0001-unexpected-exception)
- [KME0002: Remote group exception](#kme0002-remote-group-exception)
- [KME0003: Action took too long to complete and timed out](#kme0003-action-took-too-long-to-complete-and-timed-out)
- [KME0004: There are no workers defined in the cluster scheme](#kme0004-there-are-no-workers-defined-in-the-cluster-scheme)
- [KME0005: {hostname} is not a sudoer](#kme0005-hostname-is-not-a-sudoer)
- [Troubleshooting Tools](#troubleshooting-tools)
- [etcdctl script](#etcdctl-script)
- [Troubleshooting Kubernetes Generic Issues](#troubleshooting-kubernetes-generic-issues)
- [CoreDNS Responds with High Latency](#coredns-responds-with-high-latency)
@@ -13,7 +19,203 @@ This section provides troubleshooting information for Kubemarine and Kubernetes
- [Numerous generation of auditd system messages ](#numerous-generation-of-auditd-system)
- [Failing during installation on Ubuntu OS](#failing-during-installation-on-ubuntu-os)

# Trobleshooting Tools
# KubeMarine Errors

This section lists all known errors, with explanations and recommendations for fixing them. If an
error occurs during the execution of any procedure, you can look it up here.


## KME0001: Unexpected exception

```
FAILURE - TASK FAILED xxx
Reason: KME001: Unexpected exception
Traceback (most recent call last):
File "/home/centos/repos/kubemarine/kubemarine/src/core/flow.py", line 131, in run_flow
task(cluster)
File "/home/centos/repos/kubemarine/kubemarine/install", line 193, in deploy_kubernetes_init
cluster.nodes["worker"].new_group(apply_filter=lambda node: 'master' not in node['roles']).call(kubernetes.init_workers)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 165, in call
return self.call_batch([action], **{action.__name__: kwargs})
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 179, in call_batch
results[action] = action(self, **action_kwargs)
File "/home/centos/repos/kubemarine/kubemarine/src/kubernetes.py", line 238, in init_workers
reset_installation_env(group)
File "/home/centos/repos/kubemarine/kubemarine/src/kubernetes.py", line 60, in reset_installation_env
group.sudo("systemctl stop kubelet", warn=True)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 34, in sudo
return self.do("sudo", *args, **kwargs)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 106, in do
self.workaround(exception)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 119, in workaround
raise e from None
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 95, in do
return self._do(do_type, args, kwargs)
File "/home/centos/repos/kubemarine/kubemarine/src/core/group.py", line 141, in _do
with ThreadPoolExecutor(max_workers=len(self.nodes)) as executor:
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 104, in __init__
raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
```

This error occurs when an unexpected exception is raised at runtime that does not yet have a
classifying code.

To fix it, first try checking the nodes and the cluster with
[IAAS checker](Kubecheck.md#iaas-procedure) and [PAAS checker](Kubecheck.md#paas-procedure). If you
see failed tests, try fixing the cause of the failure. If the error persists, inspect the
stacktrace and try to work out a solution yourself as far as possible.

If you still can't resolve this error yourself, start
[a new issue](https://github.com/Netcracker/KubeMarine/issues/new) and attach a description of the
error with its stacktrace. We will try to help as soon as possible.

If you were able to solve the problem yourself, let us know about it and your solution by
[opening a new PR](https://github.com/Netcracker/KubeMarine/pulls). Our team will appreciate it!


## KME0002: Remote group exception

Shell error:

```
FAILURE!
TASK FAILED xxx
KME0002: Remote group exception
10.101.10.1:
Encountered a bad command exit code!
Command: 'apt install bad-package-name'
Exit code: 127
Stdout:
Stderr:
bash: apt: command not found
```

Hierarchical error:

```
FAILURE!
TASK FAILED xxx
KME0002: Remote group exception
10.101.10.1:
KME0003: Action took too long to complete and timed out
```

This error indicates that a bash command exited unexpectedly on a remote cluster host. It occurs
when a command terminates with a non-zero exit code.

The error prints the status of the command execution for each node in the group on which the command
was executed. The status can be a successful result (shell output), a result with an error
(shell error), or a hierarchical KME with its own code.

To fix it, first try checking the nodes and the cluster with
[IAAS checker](Kubecheck.md#iaas-procedure) and [PAAS checker](Kubecheck.md#paas-procedure). If you
see failed tests, try fixing the cause of the failure. Make sure that you do everything according to
the instructions in the correct sequence and correctly fill the inventory and other dependent
files. If the error persists, try to figure out what is causing the command to fail on the remote
nodes and fix it yourself as far as possible.

If you still can't resolve this error yourself, start
[a new issue](https://github.com/Netcracker/KubeMarine/issues/new) and attach a description of the
error with its stacktrace. We will try to help as soon as possible.


## KME0003: Action took too long to complete and timed out

```
FAILURE!
TASK FAILED xxx
KME0002: Remote group exception
10.101.10.1:
KME0003: Action took too long to complete and timed out
```

This error occurs when a command does not complete within the specified timeout.

The error can occur if the remote hypervisor or host hangs, if the executed command hangs, or if
the SSH connection is unexpectedly dropped or there are other network problems between the deployer
node and the cluster.

The longest possible timeout for the command is 2700 seconds (45 minutes).

To resolve this error, check each of the items listed above that may hang and fix the hang
manually: reboot the hypervisor or node, fix the environment or settings of the executable, update
it, repair the network channel, or take any other action that, in your opinion, unblocks the frozen
stage of the procedure. It is also useful to check the cluster with the
[IAAS checker](Kubecheck.md#iaas-procedure) to detect problems with network connectivity.
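
For illustration only, the sketch below is not Kubemarine code: it shows how a hung action surfaces
as this error. This commit maps `concurrent.futures.TimeoutError` to `KME0003` in
`kubemarine/core/errors.py`, so a remote command whose future does not finish within the configured
timeout is reported under this code.

```python
# Illustrative sketch only (not Kubemarine's implementation): a hung action
# raises concurrent.futures.TimeoutError, which is classified as KME0003.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def hung_remote_command():
    time.sleep(5)  # stands in for a remote command that never finishes in time

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(hung_remote_command)
    try:
        future.result(timeout=1)  # analogous to the command_execution timeout
    except TimeoutError:
        print("KME0003: Action took too long to complete and timed out")
```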


## KME0004: There are no workers defined in the cluster scheme

```
FAILURE!
KME0004: There are no workers defined in the cluster scheme
```

This error is caused by the absence of nodes with the `worker` role in the inventory file. It occurs
before the payload is executed on the cluster.

To fix it, you need to either specify new nodes with the `worker` role, or add the `worker` role to
the existing master nodes.

An example of specifying different nodes with separate `master` and `worker` roles is as follows.

```yaml
- address: 10.101.1.1
  internal_address: 192.168.101.1
  name: master-1
  roles:
    - master
- address: 10.101.1.2
  internal_address: 192.168.101.2
  name: worker-1
  roles:
    - worker
```

An example of assigning both the `master` and `worker` roles to a single node is as follows.

```yaml
- address: 10.101.1.1
  internal_address: 192.168.101.1
  name: master-1
  roles:
    - master
    - worker
```

**Note**: Masters with the `worker` role remain control planes; however, application pods are also
scheduled on them.


## KME0005: {hostname} is not a sudoer

```
FAILURE!
TASK FAILED prepare.check.sudoer
KME0005: 10.101.1.1 is not a sudoer
```
The error reports that the specified node does not have superuser rights. The error occurs
before the payload is executed on the cluster when running the `install` or `add_node` procedure.
To fix this, add the connection user to the sudoers on the cluster node.

An example for Ubuntu (reboot required) is given below.

```bash
sudo adduser <username> sudo
```


# Troubleshooting Tools

This section describes the additional tools that Kubemarine provides for convenient troubleshooting of various issues.

@@ -193,7 +395,7 @@ To run defragmentation for all cluster members list all endpoints sequentially
```
`ENDPOINT_IP` is the internal IP address of the etcd endpoint.
> **_Note:_** that defragmentation to a live member blocks the system from reading and writing data while rebuilding its states. It is not recommended to run defragmentation for all etcd members at the same time.
> **Note**: The defragmentation to a live member blocks the system from reading and writing data while rebuilding its states. It is not recommended to run defragmentation for all etcd members at the same time.
## etcdctl defrag return context deadline exceeded
130 changes: 130 additions & 0 deletions kubemarine/core/errors.py
@@ -0,0 +1,130 @@
# Copyright 2021-2022 NetCracker Technology Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
from traceback import print_exc
from typing import Union

from fabric.exceptions import GroupException
from concurrent.futures import TimeoutError

KME_DICTIONARY = {
    "KME0000": {
        "name": "Test exception"
    },
    "KME0003": {
        "instance": TimeoutError,
        "name": "Action took too long to complete and timed out"
    },
    "KME0004": {
        "name": "There are no workers defined in the cluster scheme"
    },
    "KME0005": {
        "name": "{hostname} is not a sudoer"
    }
}


# TODO: support for more complex KME00XX objects with custom constructors
class KME(RuntimeError):
    def __init__(self, code, **kwargs):
        self.code = code
        self.kme = KME_DICTIONARY.get(self.code)
        if self.kme is None:
            raise ValueError('An error was raised with an unknown error code')
        self.message = self.kme.get('name').format(**kwargs)
        super().__init__(self.message)

    def __str__(self):
        return self.code + ": " + self.message


def pretty_print_error(reason: Union[str, Exception], log=None) -> None:
    """
    Parses the passed error and nicely displays its name and structure depending on what was passed.
    The method outputs to stdout by default, but will use the logger if one is specified.
    :param reason: an object containing an exception or other error (must be able to be represented
     as a string)
    :param log: logger object, if you need to write a log there
    :return: None
    """

    if reason == "":
        return

    if isinstance(reason, KME):
        if log:
            log.critical(reason)
        else:
            sys.stderr.write(str(reason))

        return

    for dictionary_code, dictionary_kme in KME_DICTIONARY.items():
        if dictionary_kme.get('instance') and isinstance(reason, type(dictionary_kme['instance'])):
            kme = KME(dictionary_code)

            if log:
                log.critical(kme)
            else:
                sys.stderr.write(str(kme))

            return

    if isinstance(reason, GroupException):
        description = "KME0002: Remote group exception"

        if log:
            log.critical(description)
        else:
            sys.stderr.write(f"{description}\n")

        for connection, result in reason.result.items():
            if log:
                log.critical("%s:" % connection.host)
            else:
                sys.stderr.write("\n%s:" % connection.host)

            found_dictionary_code = None
            for dictionary_code, dictionary_kme in KME_DICTIONARY.items():
                if dictionary_kme.get('instance') \
                        and isinstance(result, dictionary_kme['instance']):
                    found_dictionary_code = dictionary_code
                    break

            if found_dictionary_code:
                kme = KME(found_dictionary_code)
                if log:
                    log.critical("\t" + str(kme))
                else:
                    sys.stderr.write("\n\t%s\n" % str(kme))
            else:
                if log:
                    log.critical("\t" + str(result).replace("\n", "\n\t"))
                else:
                    sys.stderr.write("\n\t%s\n" % str(result).replace("\n", "\n\t"))

        return

    if isinstance(reason, Exception):
        if log:
            log.critical('KME0001: Unexpected exception', exc_info=True)
        else:
            sys.stderr.write("KME0001: Unexpected exception\n\n")
            print_exc()
    else:
        if log:
            log.critical(reason)
        else:
            sys.stderr.write(reason + "\n")
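
Below is a hedged usage sketch, not part of this commit, of how other Kubemarine modules might raise
and render these classified errors; the call site is assumed, while the import path follows the new
file's location.

```python
# Hypothetical usage sketch; not code from this commit.
from kubemarine.core.errors import KME, pretty_print_error

try:
    # KME0005 fills its {hostname} placeholder from the keyword argument.
    raise KME("KME0005", hostname="10.101.1.1")
except KME as e:
    print(e)               # -> "KME0005: 10.101.1.1 is not a sudoer"
    pretty_print_error(e)  # writes the same message to stderr, or to a logger if one is passed
```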
3 changes: 3 additions & 0 deletions kubemarine/core/group.py
@@ -520,6 +520,9 @@ def _do(self, do_type: str, nodes: Connections, is_async, *args, **kwargs) -> _H
if kwargs.get("hide") is None:
kwargs['hide'] = True

if kwargs.get("timeout", None) is None:
kwargs["timeout"] = self.cluster.globals['nodes']['command_execution']['timeout']

execution_timeout = kwargs.get("timeout", None)

results = {}
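
A minimal sketch of the fallback introduced above, simplified under assumptions: when the caller
does not pass an explicit `timeout`, the global `command_execution` timeout is applied, so a hung
remote command can no longer block a procedure indefinitely. The 2700-second value below is only an
example taken from the documented maximum in Troubleshooting.md.

```python
# Simplified illustration of the default-timeout fallback; not the full Kubemarine code.
def resolve_timeout(kwargs: dict, globals_config: dict) -> int:
    if kwargs.get("timeout", None) is None:
        kwargs["timeout"] = globals_config['nodes']['command_execution']['timeout']
    return kwargs["timeout"]

globals_config = {'nodes': {'command_execution': {'timeout': 2700}}}
print(resolve_timeout({}, globals_config))               # 2700 -> global default applied
print(resolve_timeout({'timeout': 60}, globals_config))  # 60 -> explicit value preserved
```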