Move Marist machines to the self-service provisioning #2673

Closed
sxa opened this issue Jul 15, 2022 · 29 comments

@sxa
Member

sxa commented Jul 15, 2022

So that we no longer have to go through support for every request on our Marist systems, Marist have been trialling a self-service interface for their machines, and it is now ready to be used as the primary method for provisioning our machines. #2267 has machines which have been provisioned through the new interface, and we should start migrating our existing systems across to it too.

The first step will be to ensure we have capacity in the system (at the moment the account I'm using only has 4 machine slots available), then start duplicating the existing machines in it, followed by decommissioning the existing ones. We will likely look at having at least one dockerhost system in order to test a wider range of distributions for Linux/s390x (subject to availability...)

Systems ready for installation:

@sxa
Member Author

sxa commented Aug 10, 2022

I'm going to use this as a conclusive verification of a number of other infrastructure PRs that we have in flight just now, so I won't run the playbooks until after they are merged:

@Haroon-Khel
Contributor

For future reference: before syncing inventories in AWX, you have to update the project source first so that AWX has the latest inventory file. I had assumed that syncing the inventory automatically pulled the latest inventory file.

Running https://awx2.adoptopenjdk.net/#/jobs/playbook/137?job_search=page_size:20;order_by:-finished;not__launch_type:sync on test-marist-rhel8-s390x-2 as a prelim playbook run

@Haroon-Khel
Contributor

Failed at the installation of systemtap-sdt-devel

I've created a new job in AWX which I can use for debugging/testing. It deploys my own branch, https://github.com/Haroon-Khel/openjdk-infrastructure/tree/awx.debug, in which the only change so far is that systemtap-sdt-devel is commented out.

https://awx2.adoptopenjdk.net/#/jobs/playbook/143?job_search=page_size:20;order_by:-finished;not__launch_type:sync

@Haroon-Khel
Contributor

test-marist-rhel8-s390x-2 is actually a SLES15 machine

test-marist-rhel8-s390x-2:~ # cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"

And test-marist-sles15-s390x-2 is RHEL 8

[root@testrhel8 ~]# cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"
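
A swap like the one above (a host named rhel8 reporting SLES, and a host named sles15 reporting RHEL) could be caught before any playbook run. The following is a minimal sketch, assuming hostnames follow the `<role>-marist-<distro><version>-<arch>-<n>` pattern used in this issue; `check_host_os` and its optional file argument are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: compare the distro token in a hostname with ID= from os-release.
# check_host_os is a hypothetical helper; the second argument defaults to
# /etc/os-release and exists only so the check can be tried on a copy.
check_host_os() {
    host="$1"
    osfile="${2:-/etc/os-release}"

    # ID= may or may not be quoted (ID="sles" vs ID=ubuntu).
    os_id=$(sed -n 's/^ID=//p' "$osfile" | tr -d '"')

    # Third dash-separated field, e.g. "rhel8" from test-marist-rhel8-s390x-2,
    # with the trailing version digits removed.
    distro=$(printf '%s\n' "$host" | cut -d- -f3 | sed 's/[0-9]*$//')

    if [ "$os_id" = "$distro" ]; then
        echo "OK: $host is $os_id"
    else
        echo "MISMATCH: $host is named $distro but reports $os_id"
        return 1
    fi
}
```

On a machine itself this could be run as `check_host_os "$(hostname)"`; the non-zero return on a mismatch makes it usable as a pre-flight gate.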

@Haroon-Khel
Contributor

Failed at downloading Ant

TASK [ant : Download Apache Ant binaries] **************************************
fatal: [test-marist-rhel8-s390x-2]: FAILED! => {"changed": false, "dest": "/tmp/", "elapsed": 0, "gid": 0, "group": "root", "mode": "01777", "msg": "Request failed: <urlopen error unknown url type: https>", "owner": "root", "size": 255, "state": "directory", "uid": 0, "url": "https://archive.apache.org/dist/ant/binaries/apache-ant-1.10.5-bin.zip"}
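
The message `unknown url type: https` is raised by Python's urllib when the interpreter has no usable ssl module, so one thing worth ruling out (not confirmed as the cause in this thread) is the Python that Ansible uses on the target. A minimal sketch of that check follows; `python_has_ssl` is a hypothetical helper name, and the default interpreter `python3` is an assumption.

```shell
#!/bin/sh
# Sketch: check whether a given Python interpreter can import ssl.
# python_has_ssl is a hypothetical helper name.
python_has_ssl() {
    interp="${1:-python3}"
    if "$interp" -c 'import ssl' >/dev/null 2>&1; then
        echo "$interp: ssl module available"
    else
        echo "$interp: ssl module missing or interpreter not found"
        return 1
    fi
}
```

Running it against each candidate, for example `python_has_ssl /usr/bin/python` and `python_has_ssl /usr/bin/python3`, would show which interpreter is safe to use for HTTPS downloads.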

@sxa
Member Author

sxa commented Aug 18, 2022

> Failed at the installation of systemtap-sdt-devel

Presumably that's only on a subset of the OSs?

@sxa
Member Author

sxa commented Aug 18, 2022

Tried deploying to just the RHEL79 build machines - hit #2700
Tried deploying to the test-marist-ubuntu2204 system - failed because the gcc7 PR has not yet been merged
Tried deploying to the RHEL79 build machines, skipping the docker tag
Redeployed to RHEL79 after removing /etc/yum.repos.d/docker.repo, as that was already in place and was preventing yum update - PASSED
Deploying to all test-marist systems (with docker bypassed to be safe for now)

PLAY RECAP *********************************************************************
test-marist-rhel7-s390x-1  : ok=221  changed=100  unreachable=0    failed=0    skipped=307  rescued=0    ignored=1   
test-marist-rhel7-s390x-2  : ok=218  changed=99   unreachable=0    failed=0    skipped=303  rescued=0    ignored=1   
test-marist-rhel8-s390x-1  : ok=18   changed=5    unreachable=0    failed=1    skipped=34   rescued=0    ignored=0   
test-marist-rhel8-s390x-2  : ok=19   changed=1    unreachable=0    failed=1    skipped=27   rescued=0    ignored=0   
test-marist-sles12-s390x-1 : ok=12   changed=2    unreachable=0    failed=1    skipped=25   rescued=0    ignored=0   
test-marist-sles12-s390x-2 : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   
test-marist-sles15-s390x-1 : ok=135  changed=18   unreachable=0    failed=0    skipped=377  rescued=0    ignored=0   
test-marist-sles15-s390x-2 : ok=18   changed=7    unreachable=0    failed=1    skipped=34   rescued=0    ignored=0   
test-marist-ubuntu1604-s390x-1 : ok=162  changed=28   unreachable=0    failed=0    skipped=349  rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-1 : ok=12   changed=1    unreachable=0    failed=1    skipped=24   rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-2 : ok=12   changed=1    unreachable=0    failed=1    skipped=24   rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-3 : ok=111  changed=18   unreachable=0    failed=1    skipped=268  rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-4 : ok=194  changed=85   unreachable=0    failed=0    skipped=317  rescued=0    ignored=0   
test-marist-ubuntu2004-s390x-1 : ok=186  changed=74   unreachable=0    failed=0    skipped=325  rescued=0    ignored=0   
test-marist-ubuntu2204-s390x-1 : ok=22   changed=1    unreachable=0    failed=1    skipped=32   rescued=0    ignored=0   
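
With this many hosts, the recap is easier to act on when reduced to the hosts that need attention. A minimal sketch, assuming the recap text is available on stdin in the format shown above; `recap_problems` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list hosts with failed or unreachable tasks from a PLAY RECAP.
# recap_problems is a hypothetical helper; it reads recap text on stdin.
recap_problems() {
    awk '
        / : ok=/ {
            failed = 0; unreachable = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^failed=/)      { split($i, kv, "="); failed = kv[2] + 0 }
                if ($i ~ /^unreachable=/) { split($i, kv, "="); unreachable = kv[2] + 0 }
            }
            if (failed > 0)           print $1, "failed"
            else if (unreachable > 0) print $1, "unreachable"
        }
    '
}
```

Fed the recap above, this would list the eight hosts with `failed=1` and the one unreachable host (test-marist-sles12-s390x-2).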

Failures on Ubuntu 22.04 (will be gcc-7 - PR ready), Ubuntu 18.04, the new SLES15 and the old SLES12, and RHEL8. Those will need further investigation. I'm pausing for now so someone else can take over, as it's the build machines I really needed :-)
But we haven't hit any problems due to the intrusion prevention on those systems, which is promising.

@sxa
Member Author

sxa commented Aug 23, 2022

> Failed at the installation of systemtap-sdt-devel

This is specific to SLES15. It is installed on the -1 sles15 machine, so it's not entirely clear why this message is appearing on the other machines, unless it was bypassed. libc.so.6 is on the machine:

test-marist-sles15-s390x-2:~ # ls -l /lib64/libc.so.6
lrwxrwxrwx 1 root root 12 Nov  5  2021 /lib64/libc.so.6 -> libc-2.26.so
test-marist-sles15-s390x-2:~ # zypper install systemtap-sdt-devel
Refreshing service 'SMT-http_lxslsmt'.
Loading repository data...
Reading installed packages...
Resolving package dependencies...

Problem: nothing provides 'libc.so.6(GLIBC_2.27)(64bit)' needed by the to be installed systemtap-4.6-151.d_t.3.s390x
 Solution 1: do not install systemtap-sdt-devel-4.6-151.d_t.3.s390x
 Solution 2: break systemtap-4.6-151.d_t.3.s390x by ignoring some of its dependencies

Choose from above solutions by number or cancel [1/2/c/d/?] (c): c
test-marist-sles15-s390x-2:~ # 
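
The transcript shows /lib64/libc.so.6 pointing at libc-2.26.so while the systemtap package asks for GLIBC_2.27 symbols, so the library is present but too old for that package. A minimal sketch of checking this kind of gap up front; `version_ge` and `glibc_version_from_target` are hypothetical helper names, and the comparison relies on the `-V` option of GNU sort.

```shell
#!/bin/sh
# Sketch: is the installed glibc at least the version a package asks for?
# version_ge is a hypothetical helper; it relies on GNU sort's -V option.
version_ge() {
    # True when $1 >= $2 in version order.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Extract "2.26" from a symlink target such as "libc-2.26.so".
glibc_version_from_target() {
    printf '%s\n' "$1" | sed -n 's/^libc-\([0-9.]*\)\.so$/\1/p'
}
```

For example, `version_ge "$(glibc_version_from_target "$(readlink /lib64/libc.so.6)")" 2.27 || echo "glibc too old"` would flag the machine above before zypper is asked to resolve the dependency.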

@sxa
Member Author

sxa commented Aug 23, 2022

test-marist-ubuntu1804-s390x-1 and -2 had these entries in /etc/hosts:

91.189.95.85 ppa.launchpad.net
91.189.88.142 ports.ubuntu.com

This was preventing them from updating themselves. The entries were presumably added to bypass a temporary problem at some point; the date stamp on the file was:

-rw-r--r-- 1 root root 487 Apr 22  2021 /etc/hosts

I've now commented those lines out on both machines, which should avoid this problem:

root@test-marist-ubuntu1804-s390x-2:~# apt-get update
Err:1 http://ports.ubuntu.com/ubuntu-ports bionic InRelease
  Could not connect to ports.ubuntu.com:80 (91.189.88.142), connection timed out
Err:2 http://ports.ubuntu.com/ubuntu-ports bionic-updates InRelease
  Unable to connect to ports.ubuntu.com:http:
Err:3 http://ports.ubuntu.com/ubuntu-ports bionic-backports InRelease
  Unable to connect to ports.ubuntu.com:http:
Err:4 http://ports.ubuntu.com/ubuntu-ports bionic-security InRelease
  Unable to connect to ports.ubuntu.com:http:
Reading package lists... Done                      
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic/InRelease  Could not connect to ports.ubuntu.com:80 (91.189.88.142), connection timed out
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic-updates/InRelease  Unable to connect to ports.ubuntu.com:http:
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic-backports/InRelease  Unable to connect to ports.ubuntu.com:http:
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic-security/InRelease  Unable to connect to ports.ubuntu.com:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.
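
Hard-coded repository addresses like these are easy to miss on other machines. A minimal sketch for spotting them, assuming the hostnames of interest are passed explicitly; `pinned_hosts` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report active (uncommented) hosts-file entries for given hostnames.
# pinned_hosts is a hypothetical helper: pinned_hosts FILE NAME...
pinned_hosts() {
    hostsfile="$1"
    shift
    for name in "$@"; do
        awk -v name="$name" '
            /^[[:space:]]*#/ { next }          # skip commented-out lines
            {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^#/) break       # ignore trailing comments
                    if ($i == name) { print name, "->", $1; break }
                }
            }
        ' "$hostsfile"
    done
}
```

Running `pinned_hosts /etc/hosts ppa.launchpad.net ports.ubuntu.com` on each machine prints any remaining pins; no output means the names are being resolved through DNS again.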

@sxa
Member Author

sxa commented Aug 24, 2022

sles12-2 was missing the AWX ssh key. That is now fixed, so it should work.
RHEL8 looks to be trying to install some of the 31-bit (s390) packages, which we probably don't need.

@steelhead31
Contributor

steelhead31 commented Aug 24, 2022

@sxa want me to pick up the systemtap-sdt-devel issue on test-marist-sles15-s390x-2?

@sxa
Member Author

sxa commented Aug 24, 2022

Sure - please co-ordinate with Haroon in slack.

@Haroon-Khel
Contributor

That would be helpful, @steelhead31. Thanks.

@sxa
Member Author

sxa commented Aug 25, 2022

Ubuntu 22.04 looking happier now that #2691 is merged.

@steelhead31
Contributor

The sles15 playbooks run better using python 3 as the ansible_python_interpreter (which can be specified in the inventory). Also, an issue with the ipv6 configuration on test-marist-sles15-s390x-2 has been resolved by disabling ipv6, as shown below.

1. Edit sysctl.conf by running sudo vi /etc/sysctl.conf
2. Add the following 2 lines to the file:
  net.ipv6.conf.all.disable_ipv6 = 1
  net.ipv6.conf.default.disable_ipv6 = 1
3. Save, then run "sudo sysctl -p". This reloads the settings and disables the ipv6 addresses.
4. Run ip a | grep inet - this should only show ipv4 addresses
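
The manual edit in steps 1 and 2 can also be scripted so that repeated runs do not duplicate the lines. A minimal sketch, assuming the settings go into /etc/sysctl.conf as in the steps above; `ensure_line` and `disable_ipv6` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: add the two settings to a sysctl file only if they are missing.
# ensure_line and disable_ipv6 are hypothetical helper names.
ensure_line() {
    line="$1"
    file="$2"
    # -x matches the whole line, -F treats it as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

disable_ipv6() {
    conf="${1:-/etc/sysctl.conf}"
    ensure_line 'net.ipv6.conf.all.disable_ipv6 = 1' "$conf"
    ensure_line 'net.ipv6.conf.default.disable_ipv6 = 1' "$conf"
}
```

Writing to /etc/sysctl.conf needs root. After calling `disable_ipv6`, the settings still have to be loaded with `sudo sysctl -p` and can be verified with `ip a | grep inet`, as in steps 3 and 4.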

@sxa
Member Author

sxa commented Aug 31, 2022

From Marist: "Let me know when fully migrated and I can remove the old servers as we are targeting end of September to power off the old storage servers."

@sxa
Member Author

sxa commented Sep 2, 2022

@Haroon-Khel Looks like there may be some problems that need addressing: https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_hs_sanity.openjdk_s390x_linux/651

Certainly a subset of them are in the compression code. We've seen issues there elsewhere, at least on Ubuntu 20.04, and that run was on 22.04. If all the failures are related to that, it will be good to confirm which distributions and versions it happens on, as there will be implications elsewhere.

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 2, 2022

Nagios should be working on all of the new marist machines except for test-marist-rhel8-s390x-2, which fails with "No package nagios-plugins-all available". That should have a quick solution. @steelhead31 Can you check if the marist machines appear in that view you showed earlier?

@sxa
Member Author

sxa commented Sep 6, 2022

Added the docker tag to test-marist-ubuntu2204-s390x-1, as openjdk_build_docker_multiarch builds were getting stuck due to a lack of suitable labels. The dockerhost-marist machine is currently unsuitable: despite being in jenkins, it appears that it cannot run docker as the jenkins user (see this log from when I tried to add the tag to that machine).

@sxa
Member Author

sxa commented Sep 8, 2022

Request for Eclipse to set up two machines for Temurin Compliance:

https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/1917

@sxa
Member Author

sxa commented Sep 30, 2022

NOTE: I've brought docker-marist-ubuntu1604-s390x-1 back online in jenkins for now, since that one (why not others?) was causing 'temporarily offline in jenkins' messages to appear in the bot channel, but I've switched the docker label to dockerX.

We'll need to understand as part of #1716 why the other marist machines which we have disabled (marked offline in jenkins) are not giving the same notifications e.g. https://ci.adoptopenjdk.net/computer/build%2Dmarist%2Drhel77%2Ds390x%2D1/ and https://ci.adoptopenjdk.net/computer/test%2Dmarist%2Dubuntu1804%2Ds390x%2D1/ (and all the other "old" ones)

@sxa
Member Author

sxa commented Oct 3, 2022

Temurin Compliance systems still awaiting setup, but otherwise this is complete. Old machines will need to be deprovisioned, but that is due to be done later.

@sxa sxa closed this as completed Oct 3, 2022
@sxa sxa unpinned this issue Oct 14, 2022
@sxa
Member Author

sxa commented Oct 18, 2022

@Haroon-Khel @steelhead31 Can we remove the old machines from Nagios, Jenkins and the inventory files please, as they have now been deprovisioned? Full list as follows (some of these were temporary systems, so if you can't find them, that's not a problem):

  • 148.100.113.46
  • 148.100.113.30
  • 148.100.113.58
  • 148.100.113.25
  • 148.100.113.20
  • 148.100.86.102
  • 148.100.245.197
  • 148.100.84.144

Older early-adopter self-service account:
  • 148.100.84.26
  • 148.100.84.159
  • 148.100.84.52
  • 148.100.84.144
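
To confirm that none of these addresses are still referenced, the list can be checked against each file in turn. A minimal sketch, assuming the file to search is passed as the first argument; `stale_ips` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: show which decommissioned IPs still appear in a file.
# stale_ips is a hypothetical helper: stale_ips FILE IP...
stale_ips() {
    file="$1"
    shift
    for ip in "$@"; do
        # -F: fixed string (dots are literal), -w: do not match 148.100.84.14
        # inside 148.100.84.144.
        if grep -qwF -- "$ip" "$file"; then
            echo "$ip"
        fi
    done
}
```

For example, `stale_ips inventory.yml 148.100.113.46 148.100.113.30` prints only the addresses that are still present, so empty output means the file is clean for those entries.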

@sxa sxa reopened this Oct 18, 2022
@steelhead31
Contributor

Will do. Has the ansible inventory been updated with the new IPs / hostnames? I'm starting work on fixing the discrepancies between nagios and ansible today.

@sxa
Member Author

sxa commented Oct 18, 2022

> Will do, has the ansible inventory been updated with the new ip's / hostnames ?

Yep the new ones have been live for a few weeks: https://github.com/adoptium/infrastructure/pull/2690/files

In theory removing the ones listed above should only leave the s390x ones added in that PR.

@steelhead31
Contributor

All have now been removed from nagios.

@sxa
Member Author

sxa commented Oct 18, 2022

That'll clear up the slack channel a bit then ;-)

@sxa
Member Author

sxa commented Nov 7, 2022

The old machines have all been relieved of their duties and returned to Marist.

Some more work is still required to fix issues that have shown up during this release cycle, but those can be covered under #2807. The old TCK machines will be decommissioned this week too.

@sxa sxa closed this as completed Nov 7, 2022
@Haroon-Khel
Contributor

Haroon-Khel commented Dec 12, 2022

Removing the following machines from inventory.yml and jenkins, as they've been decommissioned:

* https://ci.adoptopenjdk.net/computer/test-marist-sles15-s390x-1/
* https://ci.adoptopenjdk.net/computer/build-marist-rhel77-s390x-1/
* https://ci.adoptopenjdk.net/computer/build-marist-rhel77-s390x-2/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1604-s390x-1/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-1/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-2/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-3/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-4/
* https://ci.adoptopenjdk.net/computer/docker-marist-ubuntu1604-s390x-1/
