Nagios: Add remaining machines to Nagios #1716

Closed
Willsparker opened this issue Nov 27, 2020 · 16 comments

@Willsparker
Contributor

ref: #1229

I've written a script that compares the machines on Jenkins with the config files on the Nagios server and outputs the list below of machines that still need to be added to Nagios.
Note: An exceptions list filters out machines whose names contain any of the following strings: win sxa gdams will EC2 azurebuildagent aahlenst master infra.

build-inspira-solaris10u11-sparcv9-1
build-inspira-solaris10u11-sparcv9-2
build-linaro-centos74-armv8-1
build-linaro-centos74-armv8-2
build-linaro-centos76-armv8-2
build-macstadium-macos1010-x64-1
build-osuosl-aix71-ppc64-1
build-osuosl-aix71-ppc64-2
build-packet-ubuntu1804-armv8l-1
build-packet_esxi-solaris10u11-x64-1
build-packet_esxi-solaris10u11-x64-2
docker-aws-ubuntu1604-x64-0
docker-aws-ubuntu1604-x64-1
docker-aws-ubuntu1604-x64-2
docker-godaddy-ubuntu1604-x64-1
docker-marist-ubuntu1604-s390x-1
docker-scaleway-ubuntu1604-armv7l-1
test-ibm-aix71-ppc64-1
test-ibm-aix71-ppc64-2
test-ibmcloud-ubuntu1604-x64-1
test-inspira-solaris10u11-sparcv9-1
test-inspira-solaris11-sparcv9-1
test-macincloud-macos1010-x64-1
test-macincloud-macos1010-x64-2
test-macstadium-macos11-arm64-1
test-macstadium-macos11-arm64-2
test-osuosl-aix71-ppc64-1
test-osuosl-aix72-ppc64-1
test-osuosl-aix72-ppc64-2
test-osuosl-centos74-ppc64le-2
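
The comparison described above could be sketched roughly as follows. This is a minimal illustration and not the actual script: the function name `machines_missing_from_nagios`, the case-sensitive substring matching, and the sample names `build-azure-win2012r2-x64-1` and `infra-example-1` are all assumptions. Only the exception substrings come from the note above.

```python
# Hypothetical sketch; the real script is not shown in this issue.
# The exception substrings are taken from the note in this comment.
EXCEPTIONS = ["win", "sxa", "gdams", "will", "EC2", "azurebuildagent",
              "aahlenst", "master", "infra"]


def machines_missing_from_nagios(jenkins_names, nagios_names, exceptions=EXCEPTIONS):
    """Return the Jenkins machine names that have no Nagios config, sorted."""
    known = set(nagios_names)
    missing = []
    for name in jenkins_names:
        # Skip any machine whose name contains an exception substring
        # (assumed to be a case-sensitive match).
        if any(token in name for token in exceptions):
            continue
        if name not in known:
            missing.append(name)
    return sorted(missing)


# Illustrative data only: two real names from the list above, two made-up ones.
jenkins = ["build-osuosl-aix71-ppc64-1", "test-ibm-aix71-ppc64-1",
           "build-azure-win2012r2-x64-1", "infra-example-1"]
nagios = ["test-ibm-aix71-ppc64-1"]
print(machines_missing_from_nagios(jenkins, nagios))
```

In practice the two name lists would presumably come from the Jenkins node list and the Nagios config file names, which this sketch leaves out.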
@Willsparker
Contributor Author

Note: Removed perf machines as per #1710

@Willsparker
Contributor Author

Willsparker commented Jan 6, 2021

As per #1717 (comment) , I've currently got a list of 15 machines that need adding:

| Inventory List | In Jenkins? | In Nagios? | Do I have access? |
| --- | --- | --- | --- |
| build-godaddy-ubuntu1604-x64-1 | YES | NO | YES |
| build-linaro-centos76-armv8-2 | YES | NO | NO |
| build-macstadium-macos1010-x64-1 | YES | NO | YES |
| build-packet-ubuntu1804-armv8-1 | NO | NO | NO |
| docker-aws-ubuntu1604-x64-1 | YES | NO | NO |
| docker-aws-ubuntu1604-x64-2 | YES | NO | NO |
| docker-godaddy-ubuntu1604-x64-1 | YES | NO | NO |
| docker-marist-ubuntu1603-s390x-1 | NO | NO | YES |
| docker-scaleway-ubuntu1604-armv7-1 | NO | NO | NO |
| test-osuosl-centos74-ppc64le-2 | YES | NO | YES |
| test-macincloud-macos1010-x64-1 | YES | NO | YES |
| test-macincloud-macos1010-x64-2 | YES | NO | YES |
| test-macstadium-macos11-arm64-1 | YES | NO | NO |
| test-macstadium-macos11-arm64-2 | YES | NO | NO |
| test-ibmcloud-ubuntu1604-x64-1 | YES | NO | NO |

A few machines are not even in Jenkins at the moment, so I'll raise a PR to remove them from the inventory once it's confirmed that they aren't our machines anymore.

@Willsparker
Contributor Author

As per the above PR, the remaining arm machines that aren't recognized in Jenkins need to be renamed in Jenkins to remain consistent with the rest of the arm machines.

@Willsparker
Contributor Author

build-macstadium-macos1010-x64-1 was already in Nagios as build-macstadium-macos1010-1; fixed.

@Willsparker
Contributor Author

test-osuosl-centos74-ppc64le-2 - Unable to install nagios-plugins-common due to the following error:

[root@test-centos74-2 ~]# yum install nagios-plugins-common
error: rpmdb: BDB0113 Thread/process 4106/70366702127008 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
CRITICAL:yum.main:

Error: rpmdb open failed

Skipping for now

@Willsparker
Contributor Author

The two test-macincloud-macos1010 machines are also already in Nagios, just under incorrect names. Fixed.

@Willsparker
Contributor Author

Willsparker commented Jan 15, 2021

Due to #1781 and #1843 (and https://adoptopenjdk.slack.com/archives/C53GHCXL4/p1610362672472600), there will soon be quite a few machines to remove from Nagios. I'm therefore going to write a check that does the reverse of the one implemented in #1717 (i.e. it checks the Nagios server list against the inventory) to determine whether any servers are no longer needed.
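
As a rough idea of what such a reverse check might look like, here is a minimal sketch written in the style of a Nagios plugin. The function name `check_stale_hosts`, the choice of WARNING rather than CRITICAL, and the message wording are assumptions and not the check that was actually deployed. The exit codes 0 (OK) and 1 (WARNING) follow the standard Nagios plugin convention.

```python
# Hypothetical sketch of a "hosts in Nagios but not in the inventory" check.
# Standard Nagios plugin exit codes: 0 = OK, 1 = WARNING.
OK, WARNING = 0, 1


def check_stale_hosts(nagios_names, inventory_names):
    """Return (exit_code, message) for hosts Nagios knows but the inventory does not."""
    stale = sorted(set(nagios_names) - set(inventory_names))
    if stale:
        # Assumption: a stale host is worth a WARNING, not a CRITICAL.
        return WARNING, "WARNING - not in inventory: " + ", ".join(stale)
    return OK, "OK - every Nagios host is in the inventory"


# Illustrative data only; a real plugin would load both lists and then
# exit with the returned code (e.g. via sys.exit(code)).
code, message = check_stale_hosts(
    ["build-osuosl-ppc64-aix-71-1", "test-ibm-aix71-ppc64-1"],
    ["build-osuosl-aix71-ppc64-1", "test-ibm-aix71-ppc64-1"],
)
print(code, message)
```

The set difference is simply the mirror image of the "needs adding" check, with the roles of the two lists swapped.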

@Willsparker
Contributor Author

Willsparker commented Jan 18, 2021

The above check has been put into Nagios ( https://nagios.adoptopenjdk.net/nagios/cgi-bin/extinfo.cgi?type=2&host=Nagios_Server&service=Test+if+machines+need+removing+from+Nagios ). I also quickly went through and fixed the machines that were on Nagios, but not in the Inventory:

build-digitocean-centos69-x64-1     --> Removed (as per the Slack conversation above)
build-packet-ubuntu1604-armv8-1     --> Removed (turns out this machine is a docker machine now - must be leftover from before Nagios was overhauled)
build-osuosl-ppc64-aix-71-1         --> Renamed to build-osuosl-aix71-ppc64-1, as per the inventory
build-osuosl-ppc64-aix-71-2         --> Renamed to build-osuosl-aix71-ppc64-2, as per the inventory

That check is green now :-)
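
Mismatches like the renames above could in principle be caught by validating names against the naming scheme. The sketch below assumes the scheme is `<role>-<provider>-<os>-<arch>-<index>` with roles `build`, `test` and `docker`; that pattern is inferred from the names in this issue and is not a documented rule, and `follows_convention` is a hypothetical helper.

```python
import re

# Assumed naming scheme, inferred from the machine names in this issue:
#   <role>-<provider>-<os and version>-<arch>-<index>
NAME_PATTERN = re.compile(
    r"^(?P<role>build|test|docker)-"
    r"(?P<provider>[a-z0-9_]+)-"
    r"(?P<os>[a-z0-9]+)-"
    r"(?P<arch>[a-z0-9]+)-"
    r"(?P<index>\d+)$"
)


def follows_convention(name):
    """Return True if the name has exactly the five expected segments."""
    return NAME_PATTERN.match(name) is not None


for name in ["build-osuosl-aix71-ppc64-1",      # inventory name
             "build-osuosl-ppc64-aix-71-1",     # old Nagios name, six segments
             "build-macstadium-macos1010-1"]:   # old Nagios name, arch missing
    print(name, follows_convention(name))
```

A pattern check like this only flags malformed names; it cannot tell whether a well-formed name still corresponds to a real machine, which is what the inventory comparison is for.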

@Willsparker
Contributor Author

Willsparker commented Jan 18, 2021

Updated list:

| Inventory List | In Jenkins? | In Nagios? | Additional Notes |
| --- | --- | --- | --- |
| build-linaro-centos76-armv8-2 | YES | NO | Not adding as Linaro machines are likely to be changed soon (ref: #1849 (comment)) |
| build-packet-ubuntu1804-armv8-1 | NO | NO | Blocked by: #1837 |
| docker-aws-ubuntu1604-x64-1 | YES | NO | Being removed as per #1843 |
| docker-scaleway-ubuntu1604-armv7-1 | NO | NO | Blocked by: #1837 |
| test-macstadium-macos11-arm64-1 | YES | NO | Blocked by: #1855 |

While the test-aws-rhel76-* machines are still in Nagios, their service checks and notifications have been disabled; I'll remove them once they're no longer in the inventory. Update: now removed.

@Willsparker
Contributor Author

docker-marist-ubuntu1604-s390x-1 is having an issue with installing nagios-plugins with apt:

root@localhost:~# apt install nagios-plugins
Reading package lists... Done
Building dependency tree       
Reading state information... Done
You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
 linux-image-4.4.0-171-generic : Depends: linux-modules-4.4.0-171-generic but it is not going to be installed
 nagios-plugins : Depends: monitoring-plugins but it is not going to be installed
E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).

Due to it being release week, I'll wait until after release to try and fix it.

@Willsparker
Contributor Author

The remaining macOS machines are proving difficult to run the playbook on. Running this task on the macOS machines produces Homebrew's "Running Homebrew as root is extremely dangerous and no longer supported" message, even though the task does not use become: yes and the playbook is not run with the -u root option.
Even when running it manually on the machines (i.e. brew install nagios-plugins), the following error is shown:
Error: Cannot install in Homebrew on ARM processor in Intel default prefix (/usr/local)!

@aahlenst
Contributor

See Homebrew/brew#9117 for why Homebrew emits the warning on ARM.

@Willsparker
Contributor Author

Willsparker commented Jan 22, 2021

Hmm, okay, thank you! - looks like the way to fix the HOMEBREW_PREFIX is to reinstall brew on the machine - I'm a bit hesitant to do that in release week, so I'll wait until after.

@Willsparker
Contributor Author

Willsparker commented Jan 27, 2021

As it's after release week, I'm now going to look into the various machine issues:

  • docker-marist-ubuntu1604-s390x-1
  • test-osuosl-centos74-ppc64le-2
  • test-macstadium-macos11-arm64-2

As expected, the way to fix the docker-marist machine issue was to run apt-get -f install 👍

The test-osuosl machine was fixed by rebuilding the yum databases (ref: #1868). However, once I added the machine to Nagios via the playbooks, the checks that used check_by_ssh were returning:

Remote command execution failed: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

This was due to the ECDSA key being different to what the Nagios Server thought it should be, and was fixed with ssh-keygen -R 140.211.168.217, and confirmed fixed via:

nagios@nagios:~$ /usr/local/nagios/libexec/check_by_ssh -H 140.211.168.217 -C "uname"
Linux
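
For anyone hitting the same symptom, the diagnosis above can be sketched as a small helper. The run of `@` characters is the first line of OpenSSH's host-key warning banner, which is all that check_by_ssh surfaced here. The helper name `suggest_known_hosts_fix` and the 20-character threshold are assumptions made for illustration.

```python
import re

# A long run of "@" is the first line of OpenSSH's host-key warning banner.
# The minimum length of 20 is an arbitrary threshold chosen for this sketch.
BANNER = re.compile(r"@{20,}")


def suggest_known_hosts_fix(check_output, host):
    """Return an ssh-keygen argv list if the output looks like a host-key warning."""
    if BANNER.search(check_output):
        # ssh-keygen -R removes every key for the host from known_hosts.
        return ["ssh-keygen", "-R", host]
    return None


output = "Remote command execution failed: " + "@" * 59
print(suggest_known_hosts_fix(output, "140.211.168.217"))
```

The suggested command should only be run as the user that performs the checks (nagios in the session above), and only once it is clear why the key changed, for example after a machine rebuild, since the same warning also appears when a host is being impersonated.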

On test-macstadium-macos11-arm64-2, installing Homebrew was as easy as copying the command from the Homebrew page, after which I ran /opt/homebrew/bin/brew install nagios-plugins. 👍 However, the old installation is still on the machine and is still the default one used when running brew.
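
Since both installations now coexist, anything that calls brew on these machines may need to pick the binary explicitly. A minimal sketch, assuming the two default prefixes (/opt/homebrew on Apple Silicon, /usr/local on Intel) and a hypothetical helper named `brew_binary`:

```python
import platform


def brew_binary(machine=None):
    """Return the brew path for the default Homebrew prefix of the given architecture."""
    # platform.machine() reports "arm64" for a native process on Apple Silicon.
    machine = machine or platform.machine()
    if machine == "arm64":
        return "/opt/homebrew/bin/brew"
    # Intel Macs, and processes running under Rosetta, use the /usr/local prefix.
    return "/usr/local/bin/brew"


print(brew_binary("arm64"))
print(brew_binary("x86_64"))
```

Note that a process running under Rosetta reports x86_64, so this sketch would select the old /usr/local installation in that case.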

@sxa
Member

sxa commented Jul 5, 2021

Moving to icebox as we don't have anyone actively working on this just now, although the macOS playbooks are being worked on under #1910.

@steelhead31
Contributor

Closing, as this was completed as part of the Nagios rework.
