AIX test machines at OSUOSL not available #1644

Closed
sxa opened this issue Oct 22, 2020 · 23 comments

Comments

@sxa
Member

sxa commented Oct 22, 2020

test-osuosl-aix72-ppc64-1 is marked as having CPAN allegedly not working on it (need to verify the current issue via Grinder)
test-osuosl-aix72-ppc64-2 is currently offline - raising with OSUOSL.

@sxa
Member Author

sxa commented Oct 22, 2020

-2 is now back up and running. The ssh keys weren't up to date on either of them, but that has now been resolved by refreshing them. It was trying to connect to -1 using a DNS entry which was no longer in place, so that has now been fixed too. We just need to see what the issues are with CPAN and whether either of them can now run test jobs properly.

@Haroon-Khel
Contributor

I'll run a sanity system and openjdk test on both to begin with; I think these would trigger an error if either machine is exhibiting CPAN issues.

@Haroon-Khel
Contributor

I ran both system and openjdk sanity tests on both machines. The test jobs were able to run without error. The following test cases from openjdk sanity failed on both machines:

jdk_lang_j9_0
jdk_math_j9_0
jdk_util_j9_0

None of the system tests failed. Where were you notified that CPAN was not working on either machine?

@Haroon-Khel
Contributor

I'm also running an openjdk sanity test on test-osuosl-aix72-ppc64-1 via Grinder, in case the CPAN issues occur only via Grinder:
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/4288/console

@sxa
Member Author

sxa commented Oct 23, 2020

Those three suites failing is a concern - JDK11/J9 sanity.openjdk appears to pass on the other machines, so we have something that needs to be fixed: https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_ppc64_aix/211

@sxa sxa added the testFail label Oct 26, 2020
@Haroon-Khel
Contributor

@smlambert ref the discussion we had in the team meeting.
On https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/4288/, a sanity openjdk test ran on test-osuosl-aix72-ppc64-1.

In terms of machine dependencies and configuration, would you know why java/lang/ProcessBuilder/Basic.java#id0.Basic_id0 might fail? The machine meets the prereqs.

@smlambert
Contributor

It may be helpful to look at what the test itself does (and whether it's doing anything special on AIX). If it is behaving well on one machine but not another, can you compare what LIBPATH is on the machines you are comparing? (If you search for AIX in the test source, you will see several places where there is AIX-specific handling of args and such, starting with the link below.)

https://github.com/AdoptOpenJDK/openjdk-jdk11u/blob/master/test/jdk/java/lang/ProcessBuilder/Basic.java#L75
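
If it helps with that comparison, here is a throwaway helper (not part of the test suite; the class name is arbitrary) that prints what the JVM actually sees, so the output can be diffed between the two machines:

public class EnvCheck {
    public static void main(String[] args) {
        // Print the values suggested above for comparison across the two AIX machines.
        System.out.println("os.name   = " + System.getProperty("os.name"));
        System.out.println("os.arch   = " + System.getProperty("os.arch"));
        System.out.println("java.home = " + System.getProperty("java.home"));
        System.out.println("LIBPATH   = " + System.getenv("LIBPATH"));
    }
}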

@sxa
Member Author

sxa commented Nov 18, 2020

@Haroon-Khel Have you looked more into this? Would be good to get these two machines live again if possible. We are restricted on AIX testing capacity.

@Haroon-Khel
Contributor

The test failure is caused by https://github.com/ibmruntimes/openj9-openjdk-jdk11/blob/29d8a1d89c10cfd0cf86075b292bb4be6b196e29/test/jdk/java/lang/ProcessBuilder/Basic.java#L1794, and the 3 lines that follow it.
From what I have gathered, the test tries to trigger an expected java.lang.OutOfMemoryError and then looks for certain expected output in stderr. In this case stderr is empty, causing the test to fail.
Continuing to look into this.
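
For anyone following along, a rough self-contained sketch of that pattern (the class name, heap limit and wiring here are made up for illustration and are not the real test harness code): a child JVM is launched, made to run out of heap, and the parent then checks the child's stderr for the OutOfMemoryError. On the failing runs it is that kind of stderr check which sees nothing.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class OomStderrSketch {
    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("child")) {
            // Child mode: allocate until the heap is exhausted; the uncaught
            // OutOfMemoryError should be reported on the child's stderr.
            List<byte[]> hog = new ArrayList<>();
            while (true) hog.add(new byte[1 << 20]);
        }
        String java = System.getProperty("java.home") + "/bin/java";
        Process child = new ProcessBuilder(java, "-Xmx64m",
                "-cp", System.getProperty("java.class.path"),
                "OomStderrSketch", "child").start();
        StringBuilder err = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(child.getErrorStream()))) {
            String line;
            while ((line = r.readLine()) != null) err.append(line).append('\n');
        }
        child.waitFor();
        // This is the kind of check that fails when the child's stderr comes back empty.
        if (!err.toString().contains("java.lang.OutOfMemoryError"))
            throw new AssertionError("expected OutOfMemoryError on stderr, got: <" + err + ">");
    }
}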

@Haroon-Khel
Contributor

Haroon-Khel commented Nov 24, 2020

Weirdly, the test has just passed on test-osuosl-aix72-ppc64-1. It appears to be intermittent, as it failed again on a subsequent run.
When it passed, I captured what the stderr for the test is supposed to look like:

JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2020/11/24 07:39:36 - please wait.
JVMDUMP032I JVM requested Java dump using '/home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/javacore.20201124.073936.24445318.0001.txt' in response to an event
JVMDUMP010I Java dump written to /home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/javacore.20201124.073936.24445318.0001.txt
JVMDUMP032I JVM requested Snap dump using '/home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/Snap.20201124.073936.24445318.0002.trc' in response to an event
JVMDUMP010I Snap dump written to /home/jenkins/jdk-11.0.8+10/bin/JTwork/scratch/Snap.20201124.073936.24445318.0002.trc
JVMDUMP007I JVM Requesting Tool dump using '/home/jenkins/jdk-11.0.8+10/bin/java -version'
JVMDUMP011I Tool dump created process 27132240
openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.8+10)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.21.0, JRE 11 AIX ppc64-64-Bit Compressed References 20200715_695 (JIT enabled, AOT enabled)
OpenJ9   - 34cf4c075
OMR      - 113e54219
JCL      - 95bb504fbb based on jdk-11.0.8+10)
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at Basic$JavaChild.main(Basic.java:368)

@aixtools
Contributor

Looking at ojdk01 and ojdk02 (the AIX 7.1 systems) and ojdk03 and ojdk04 (the AIX 7.2 pair):

Since this issue is about CPAN: the default perl used on the AIX 7.1 machines is the AIX-supplied perl, which is ancient (5.10).

ojdk03 does not have all the ssh keys it is supposed to have to allow automated login from OSUNIM; ojdk04, for at least the 3rd time, no longer has either the OSUNIM or my admin authorized keys.

IMHO there are systems outside these machines making unauthorized changes, because my PKI keys keep getting restored and keep getting removed. What else is being modified?

@sxa
Member Author

sxa commented Nov 30, 2020

OSUNIM key added to the set of machines that you have access to, and your key has also been reinstated on 03/04 - hopefully it won't disappear this time as it was deployed properly through our automation.

@Haroon-Khel
Contributor

Haroon-Khel commented Dec 2, 2020

I did some digging.

This same test failure affected aarch64 (eclipse-openj9/openj9#9032). The solution there was to exclude the test case for that platform: https://github.com/AdoptOpenJDK/openjdk-tests/pull/1716/files.

This test used to be excluded on AIX due to adoptium/aqa-tests#1397, but it has since been re-included (adoptium/aqa-tests#1788) due to an upstream fix.

For the sake of adding the ci.role.test label back to test-osuosl-aix72-ppc64-1 and test-osuosl-aix72-ppc64-2, could this test be excluded for AIX? Thoughts @smlambert @sxa?
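
If we go that route, I'd expect it to be a one-line exclusion in aqa-tests along the lines of the aarch64 entry in the PR above. The file path and issue link below are placeholders to illustrate the format, not the exact change:

# openjdk/excludes/ProblemList_openjdk11.txt (assumed location)
java/lang/ProcessBuilder/Basic.java#id0 <link-to-this-issue> aix-ppc64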

@smlambert
Contributor

Yes to re-excluding, but we will want someone to chase down the reason we thought the upstream fix would/did fix the issue.

@Haroon-Khel
Contributor

If I've understood it correctly, I think the upstream fix was for a different issue related to the same test.

@smlambert
Contributor

@adamfarley, given it was your upstream fix, can you check whether the test failure that is happening is different from what was fixed via https://bugs.openjdk.java.net/browse/JDK-8239365?

@adamfarley
Contributor

adamfarley commented Dec 2, 2020

No, I don't think so. My issue wasn't an OOM, and the bug I fixed wasn't checking against the error class. It was checking against an error message supplied by the OS, derived from an error message "set" that could change depending on which sets you'd installed.

If you weren't referring to the OOM, please include a job link, a TRSS link, or a copy of the error output.

@aixtools
Contributor

aixtools commented Dec 4, 2020

OSUNIM key added to the set of machines that you have access to, and your key has also been reinstated on 03/04 - hopefully it won't disappear this time as it was deployed properly through our automation.

The key for 03/04 has been removed - again. 01/02 is working fine.

root@p8-aix2-osunim:[/home/root]ssh [email protected] date
[email protected]'s password:
root@p8-aix2-osunim:[/home/root]ssh [email protected] date
[email protected]'s password:
root@p8-aix2-osunim:[/home/root]ssh [email protected] date
Fri Dec  4 06:03:49 PST 2020
root@p8-aix2-osunim:[/home/root]ssh [email protected] date
Fri Dec  4 06:03:56 PST 2020
root@p8-aix2-osunim:[/home/root]

Using my desktop I can access 01/02, but not 03/04 when using the hostname (yet I can when using the IP address??)

 ssh [email protected] date
Warning: Permanently added 'p8-aix2-ojdk02.osuosl.org' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 06:12:09 PST 2020

++++++
  04/12/2020   15:06.58   /home/mobaxterm  ssh [email protected] date
Warning: Permanently added '140.211.9.28' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 08:07:09 CST 2020
                                                                                                                                  ✔

  04/12/2020   15:07.10   /home/mobaxterm  ssh [email protected] date
Warning: Permanently added 'p8-aix2-ojdk03.osuosl.org' (RSA) to the list of known hosts.
[email protected]'s password:

                                                                                                                                  ✘

  04/12/2020   15:08.19   /home/mobaxterm  nslookup 140.211.9.28

Name:      140.211.9.28
Address 1: 140.211.9.28 p8-aix2-ojdk03.osuosl.org
                                                                                                                                  ✔

  04/12/2020   15:08.36   /home/mobaxterm  ssh [email protected] date
X11 forwarding request failed on channel 0
Fri Dec  4 08:08:56 CST 2020

++++++++++

Strangely enough, a few moments later it works for both IP and hostname addressing:


  04/12/2020   15:10.13   /home/mobaxterm  ssh [email protected] date
Warning: Permanently added '140.211.9.36' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 08:10:26 CST 2020
                                                                                                                                  ✔

  04/12/2020   15:10.26   /home/mobaxterm  ssh [email protected] date
Warning: Permanently added 'p8-aix2-ojdk02.osuosl.org' (RSA) to the list of known hosts.
X11 forwarding request failed on channel 0
Fri Dec  4 06:12:09 PST 2020
                                                                                                                                  ✔

  04/12/2020   15:12.09   /home/mobaxterm  ssh [email protected] date
X11 forwarding request failed on channel 0
Fri Dec  4 08:12:45 CST 2020
                                                                                                                                  ✔

  04/12/2020   15:12.45   /home/mobaxterm  ssh [email protected] date
X11 forwarding request failed on channel 0
Fri Dec  4 08:13:02 CST 2020

No idea what is causing this - but not a warm and cozy feeling.

@aixtools
Contributor

aixtools commented Dec 8, 2020

My idea now is that there is, perhaps, an unknown second agent or program that is updating the authorized_keys file.

Again, I cannot access ojdk04, either as myself or as the NIM admin account; both internal and external IP addresses were attempted.

root@p8-aix2-osunim:[/home/root]ssh [email protected]
[email protected]'s password:

[email protected]'s password:

[email protected]'s password:

This is getting tiresome. Somewhere there is a bug - it should not be on this host, but I have no clue.

When I get access again, I'll try to remember to create an audit record so we can at least see when the authorized_keys file is being updated. Maybe from that we can locate the source.
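
One way to set that up on AIX (a sketch from memory - the object path, event name and stanza syntax below are assumptions and should be checked against the AIX audit documentation before use):

# /etc/security/audit/objects - add a stanza so writes to the key file raise an event
# (path assumes the affected account's authorized_keys lives under /home/root/.ssh)
/home/root/.ssh/authorized_keys:
        w = "AUTHKEYS_WRITE"

# /etc/security/audit/events - formatting entry for the new event, under the auditpr: stanza
* AUTHKEYS_WRITE = printf "%s"

# restart auditing, then later inspect the trail for that event
audit shutdown
audit start
auditpr -v < /audit/trail | grep AUTHKEYS_WRITE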

@sxa
Member Author

sxa commented Dec 8, 2020

My idea now is that there is, perhaps, an unknown second agent or program that is updating the authorized_keys file.

Nothing unknown about it - we use Bastillion to manage access. That machine (and 9.28) had duplicate entries in the system, so it was updating the keys file twice: once for the full admin user set and once for the AIX set. I've removed the duplicate so it won't happen again.

@sxa
Member Author

sxa commented Dec 9, 2020

On the basis that the problematic tests have been excluded, I'm going to re-enable those two test machines as we have a significant backlog of AIX testing just now.

Added ci.role.test back onto:

FYI @andrew-m-leonard both are now running test jobs starting with these two:

@Haroon-Khel
Contributor

Seeing as the failing test was excluded, can this issue be closed?

@sxa
Member Author

sxa commented Jan 14, 2021

Yep, the machines are running the tests on a regular basis now, so this can be closed :-)
