Multiple JTReg JVM Failures on large heap builds #8637

Closed
M-Davies opened this issue Feb 24, 2020 · 30 comments · Fixed by eclipse/omr#4926
Labels
comp:jit, segfault, test failure

Comments

@M-Davies

M-Davies commented Feb 24, 2020

Failure link

https://github.com/ibmruntimes/openj9-openjdk-jdk14/blob/2baf17d18e36af5750446aff79407efdc3eb97be/test/jdk/java/nio/channels/SocketChannel/CloseDuringConnect.java#L1 fails on JDK14-j9

  • test category: openjdk
  • OS/architecture: multiple

Optional info

  • intermittent failure (yes): Intermittent Failures List adoptium/aqa-tests#1596 (comment)
  • New test: Seen to fail on
    java/nio/channels/SocketChannel/Hangup.java for x86-64_mac_xl
    java/nio/channels/SocketChannel/CloseDuringConnect.java for x86-64_mac_xl
    java/nio/file/WatchService/UpdateInterference.java for s390x_linux_xl

Failure output (captured from console output)

11:16:59  [2020-02-21 11:16:57,471] Agent[0]: stderr: Assertion failed at /Users/jenkins/workspace/build-scripts/jobs/jdk14/jdk14-mac-x64-openj9-macosXL/workspace/build/src/build/macosx-x86_64-server-release/vm/compiler/../omr/compiler/il/OMRBlock.cpp:1041: storeRegNodeInfoEntry != storeNodeInfo->end()
11:16:59  [2020-02-21 11:16:57,471] Agent[0]: stderr: VMState: 0x000561ff
11:16:59  [2020-02-21 11:16:57,471] Agent[0]: stderr: 	We should have a regStore pre split point that can be used to allocate register for replacement node
11:16:59  [2020-02-21 11:16:57,471] Agent[0]: stderr: compiling java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.awaitNanos(J)J at level: very-hot (profiling)
11:16:59  [2020-02-21 11:16:57,471] Agent[0]: stderr: 
11:16:59  [2020-02-21 11:16:57,472] Agent[0]: stderr: JVMDUMP039I Processing dump event "abort", detail "" at 2020/02/21 11:16:57 - please wait.
11:16:59  [2020-02-21 11:16:57,474] Agent[0]: stderr: JVMDUMP032I JVM requested System dump using '/Users/jenkins/workspace/Grinder/openjdk-tests/TKG/test_output_15822837806546/jdk_custom_0/work/scratch/0/core.20200221.111657.72286.0001.dmp' in response to an event
11:17:03  [2020-02-21 11:17:03,075] Agent[0]: stderr: JVMDUMP012E Error in System dump: The core file created by child process with pid = 72300 was not found. Expected to find core file with name "/cores/core.72300"
11:17:03  [2020-02-21 11:17:03,075] Agent[0]: stderr: JVMDUMP032I JVM requested Java dump using '/Users/jenkins/workspace/Grinder/openjdk-tests/TKG/test_output_15822837806546/jdk_custom_0/work/scratch/0/javacore.20200221.111657.72286.0002.txt' in response to an event
11:17:03  [2020-02-21 11:17:03,247] Agent[0]: stderr: JVMDUMP010I Java dump written to /Users/jenkins/workspace/Grinder/openjdk-tests/TKG/test_output_15822837806546/jdk_custom_0/work/scratch/0/javacore.20200221.111657.72286.0002.txt
11:17:03  [2020-02-21 11:17:03,247] Agent[0]: stderr: JVMDUMP032I JVM requested Snap dump using '/Users/jenkins/workspace/Grinder/openjdk-tests/TKG/test_output_15822837806546/jdk_custom_0/work/scratch/0/Snap.20200221.111657.72286.0003.trc' in response to an event
11:17:03  [2020-02-21 11:17:03,349] Agent[0]: stderr: JVMDUMP010I Snap dump written to /Users/jenkins/workspace/Grinder/openjdk-tests/TKG/test_output_15822837806546/jdk_custom_0/work/scratch/0/Snap.20200221.111657.72286.0003.trc
11:17:03  [2020-02-21 11:17:03,350] Agent[0]: stderr: JVMDUMP007I JVM Requesting JIT dump using '/Users/jenkins/workspace/Grinder/openjdk-tests/TKG/test_output_15822837806546/jdk_custom_0/work/scratch/0/jitdump.20200221.111657.72286.0004.dmp'
11:17:03  [2020-02-21 11:17:03,636] Agent[0]: stderr: JVMDUMP010I JIT dump written to /Users/jenkins/workspace/Grinder/openjdk-tests/TKG/test_output_15822837806546/jdk_custom_0/work/scratch/0/jitdump.20200221.111657.72286.0004.dmp
11:17:03  [2020-02-21 11:17:03,636] Agent[0]: stderr: JVMDUMP013I Processed dump event "abort", detail "".

TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

Dumps: javacores.zip (from java/nio/channels/SocketChannel/Hangup.java)

@andrewcraik
Contributor

@r30shah this is an assert out of OMR::Block::splitPostGRA; could you take a look? Note the 0.19 target for a fix.

@M-Davies
Author

Also fails on s390x_linux_xl for java/nio/file/WatchService/UpdateInterference.java

23:06:28  [2020-02-24 23:06:19,468] Agent[3]: stderr: Assertion failed at /home/jenkins/workspace/workspace/build-scripts/jobs/jdk14/jdk14-linux-s390x-openj9-linuxXL/workspace/build/src/build/linux-s390x-server-release/vm/compiler/../omr/compiler/il/OMRBlock.cpp:1041: storeRegNodeInfoEntry != storeNodeInfo->end()
23:06:28  [2020-02-24 23:06:19,468] Agent[3]: stderr: VMState: 0x000561ff
23:06:28  [2020-02-24 23:06:19,468] Agent[3]: stderr: 	We should have a regStore pre split point that can be used to allocate register for replacement node
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: compiling java/util/concurrent/LinkedBlockingDeque.pollFirst(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; at level: very-hot (profiling)
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #0: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x8ef106) [0x3ffab9ef106]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #1: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x8fbd08) [0x3ffab9fbd08]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #2: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x560a40) [0x3ffab660a40]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #3: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x560c76) [0x3ffab660c76]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #4: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x52ba82) [0x3ffab62ba82]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #5: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x5322ba) [0x3ffab6322ba]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #6: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x44e5a6) [0x3ffab54e5a6]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #7: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x44fdee) [0x3ffab54fdee]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #8: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x4500b0) [0x3ffab5500b0]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #9: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x741b8c) [0x3ffab841b8c]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #10: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x741342) [0x3ffab841342]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #11: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x7427e8) [0x3ffab8427e8]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #12: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x4f415c) [0x3ffab5f415c]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #13: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x1703ba) [0x3ffab2703ba]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #14: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x1712e2) [0x3ffab2712e2]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #15: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9prt29.so(+0x29650) [0x3ffb0b29650]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #16: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x172fc8) [0x3ffab272fc8]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #17: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x173606) [0x3ffab273606]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #18: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x16f008) [0x3ffab26f008]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #19: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x16f2d0) [0x3ffab26f2d0]
23:06:28  [2020-02-24 23:06:19,469] Agent[3]: stderr: #20: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x16f3b8) [0x3ffab26f3b8]
23:06:28  [2020-02-24 23:06:19,476] Agent[3]: stderr: #21: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9prt29.so(+0x29650) [0x3ffb0b29650]
23:06:28  [2020-02-24 23:06:19,477] Agent[3]: stderr: #22: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0x16f8da) [0x3ffab26f8da]
23:06:28  [2020-02-24 23:06:19,480] Agent[3]: stderr: #23: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/lib/default/libj9thr29.so(+0x58f0) [0x3ffb0c058f0]
23:06:28  [2020-02-24 23:06:19,480] Agent[3]: stderr: #24: /lib/s390x-linux-gnu/libpthread.so.0(+0x7934) [0x3ffb1b07934]
23:06:28  [2020-02-24 23:06:19,481] Agent[3]: stderr: #25: /lib/s390x-linux-gnu/libc.so.6(+0xedce2) [0x3ffb196dce2]
23:06:28  [2020-02-24 23:06:19,481] Agent[3]: stderr: 
23:06:28  [2020-02-24 23:06:19,482] Agent[3]: stderr: JVMDUMP039I Processing dump event "abort", detail "" at 2020/02/24 23:06:19 - please wait.
23:06:28  [2020-02-24 23:06:19,485] Agent[3]: stderr: JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdk-tests/TKG/test_output_15825823422523/jdk_nio_0/work/scratch/0/core.20200224.230619.5854.0001.dmp' in response to an event
23:06:28  [2020-02-24 23:06:24,680] Agent[3]: stderr: JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport %p %s %c %d %P" specifies that the core dump is to be piped to an external program.  Attempting to rename either core or core.22485.
23:06:28  [2020-02-24 23:06:24,680] Agent[3]: stderr: 
23:06:28  [2020-02-24 23:06:24,782] Agent[3]: stderr: JVMDUMP010I System dump written to /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdk-tests/TKG/test_output_15825823422523/jdk_nio_0/work/scratch/0/core.20200224.230619.5854.0001.dmp
23:06:28  [2020-02-24 23:06:24,782] Agent[3]: stderr: JVMDUMP032I JVM requested Java dump using '/home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdk-tests/TKG/test_output_15825823422523/jdk_nio_0/work/scratch/0/javacore.20200224.230619.5854.0002.txt' in response to an event
23:06:28  [2020-02-24 23:06:25,725] Agent[3]: stderr: JVMDUMP010I Java dump written to /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdk-tests/TKG/test_output_15825823422523/jdk_nio_0/work/scratch/0/javacore.20200224.230619.5854.0002.txt
23:06:28  [2020-02-24 23:06:25,725] Agent[3]: stderr: JVMDUMP032I JVM requested Snap dump using '/home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdk-tests/TKG/test_output_15825823422523/jdk_nio_0/work/scratch/0/Snap.20200224.230619.5854.0003.trc' in response to an event
23:06:28  [2020-02-24 23:06:25,839] Agent[3]: stderr: JVMDUMP010I Snap dump written to /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdk-tests/TKG/test_output_15825823422523/jdk_nio_0/work/scratch/0/Snap.20200224.230619.5854.0003.trc
23:06:28  [2020-02-24 23:06:25,839] Agent[3]: stderr: JVMDUMP007I JVM Requesting JIT dump using '/home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdk-tests/TKG/test_output_15825823422523/jdk_nio_0/work/scratch/0/jitdump.20200224.230619.5854.0004.dmp'
23:06:28  [2020-02-24 23:06:26,014] Agent[3]: stderr: *** stack smashing detected ***: /home/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/openjdkbinary/j2sdk-image/bin/java terminated

Dumps: https://ibm.box.com/s/fwhhfnacqphjcv9aq1g4opnwzcxzf37y

@r30shah
Contributor

r30shah commented Feb 25, 2020

The code where this failure occurs is trying to find a register to use for a node that is commoned across the split point, in order to uncommon it. When it gathers information about the available registers, it scans all the exit edges in the extended basic block to see if it can use any of their information. It throws an assert when it sees a node under a PassThrough under GlRegDeps and cannot find the regStore node for which the PassThrough was created. The JIT dump from #8637 (comment) would not be useful, as we do not get exactly the same trees there and that compilation succeeds. Looking into the core dumps of the Linux on Z failure.

@M-Davies Would you be able to share the job links for these failures? I quickly tried running CloseDuringConnect.java with the JDK13-OpenJ9 driver available on the AdoptOpenJDK website and could not reproduce it on my personal laptop, which runs macOS. It seems this is happening on JDK14-OpenJ9 only; job links would point me to where I can get the drivers and some more information, so I can avoid rebuilding the driver.

@pshipton
Member

pshipton commented Feb 25, 2020

Note jdk14 builds from Adopt are available, just not from the website.
https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk14/

@M-Davies
Author

@r30shah

@M-Davies Would you be able to share the job links for these failures? I quickly tried running CloseDuringConnect.java with the JDK13-OpenJ9 driver available on the AdoptOpenJDK website and could not reproduce it on my personal laptop, which runs macOS. It seems this is happening on JDK14-OpenJ9 only; job links would point me to where I can get the drivers and some more information, so I can avoid rebuilding the driver.

Hangup.java -> https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk14_j9_sanity.openjdk_x86-64_mac_xl/13/
CloseDuringConnect.java -> https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk14_j9_sanity.openjdk_x86-64_mac_xl/10/
UpdateInterference.java -> https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/11/

@r30shah
Contributor

r30shah commented Feb 26, 2020

@M-Davies I have been trying to reproduce this failure locally on both Linux on Z (Ubuntu 18, z13) and macOS 10.15, with no success yet. I do not know if I have access to launch a grinder; would it be possible for you to launch one (30 runs should be sufficient if you are seeing 1-in-4 failures) and collect a log of the method that was being compiled when it failed?
For the macOS run (https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk14_j9_sanity.openjdk_x86-64_mac_xl/10/), additional JVM options needed: -Xjit:{java/util/concurrent/locks/AbstractQueuedSynchronizer\$ConditionObject.awaitNanos(J)J}{very-hot}(traceJProfilingValue,log=awaitNanos.methodLog)
For the Linux on Z run (https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk14_j9_sanity.openjdk_s390x_linux_xl/11/), additional JVM options needed: -Xjit:{java/util/concurrent/LinkedBlockingDeque.pollFirst(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object;}{very-hot}(traceJProfilingValue,log=pollFirst.methodLog)

@M-Davies
Author

M-Davies commented Feb 26, 2020

@r30shah
Contributor

r30shah commented Feb 26, 2020

@M-Davies Thanks a lot for launching the grinder. The Mac one failed, and I downloaded the files; it seems the launcher didn't like the options. I am seeing the following in the javacore.

2CIUSERARG               -Xjit:{java/util/concurrent/locks/AbstractQueuedSynchronizer\onditionObject.awaitNanos(J)J}{very-hot}(traceJProfilingValue,log=awaitNanos.methodLog)

So the signature of the method got mangled. Can I request you to launch another one, with the following option, on macOS again?
-Xjit:{java/util/concurrent/locks/AbstractQueuedSynchronizer*}{very-hot}(traceFull,traceJProfilingValue,log=awaitNanos.methodLog)
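
(For reference, when running this locally in a shell rather than through a grinder, single-quoting the whole option is one way to keep the $ in the inner-class name from being interpreted; this is only a sketch of local usage, since the Grinder job's Jenkins parameter handling may escape things differently:)

# Sketch: single quotes keep '$ConditionObject' literal; shown against
# -version only, to demonstrate the quoting.
java '-Xjit:{java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.awaitNanos(J)J}{very-hot}(traceJProfilingValue,log=awaitNanos.methodLog)' -version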

@M-Davies
Author

So the signature of the method got mangled. Can I request you to launch another one, with the following option, on macOS again?
-Xjit:{java/util/concurrent/locks/AbstractQueuedSynchronizer*}{very-hot}(traceFull,traceJProfilingValue,log=awaitNanos.methodLog)

@r30shah https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2360/

@r30shah
Contributor

r30shah commented Feb 27, 2020

@M-Davies Sorry to bother you again; it seems the grinder did not work. I am seeing the following failure:

05:14:17  + ./openjdk-tests/get.sh -s /Users/jenkins/workspace/Grinder -t /Users/jenkins/workspace/Grinder/openjdk-tests -p x86-64_mac_xl -r customized -c https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk14/job/jdk14-mac-x64-openj9-macosXL/32/artifact/workspace/target/OpenJDK14-jdk_x64_mac_openj9_macosXL_2020-02-25-20-28.tar.gz --openj9_repo https://github.com/eclipse/openj9.git --openj9_branch master --tkg_repo https://github.com/AdoptOpenJDK/TKG.git --tkg_branch master
05:14:17  TESTDIR: /Users/jenkins/workspace/Grinder/openjdk-tests
05:14:17  get jdk binary...
05:14:17  _ENCODE_FILE_NEW=UNTAGGED curl -OLJSks  https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk14/job/jdk14-mac-x64-openj9-macosXL/32/artifact/workspace/target/OpenJDK14-jdk_x64_mac_openj9_macosXL_2020-02-25-20-28.tar.gz
05:14:17  unzip file: OpenJDK14-jdk_x64_mac_openj9_macosXL_2020-02-25-20-28.tar.gz ...
05:14:17  gzip: OpenJDK14-jdk_x64_mac_openj9_macosXL_2020-02-25-20-28.tar.gz: not in gzip format
05:14:17  ls: */: No such file or directory

@M-Davies
Author

M-Davies commented Mar 2, 2020

Apologies for the delay. It was using a binary that had already been deleted, and I couldn't create a new grinder on Friday with Jenkins down. I have a rerun going at https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2366/
That rerun didn't increase the iteration count. This is the correct grinder: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2380/

@r30shah
Contributor

r30shah commented Mar 2, 2020

The grinder in the previous comment didn't fail. I have tried to reproduce this on internal machines with no success. @AdamBrousseau @pshipton Is it possible to get access to one of the macOS machines where this fails fairly often, i.e. the machines @M-Davies has seen this failure on? Given that the fix is targeted at the 0.19 release, this will help speed up the investigation and creating a fix.

@pshipton
Member

pshipton commented Mar 2, 2020

These are machines at AdoptOpenJDK.net; @smlambert, can you help with the request in the previous comment?

@smlambert
Contributor

The process to request access to a machine at AdoptOpenJDK used to be to raise a request in the openjdk-infrastructure repo; I assume that still holds (@sxa555 can correct me if that has changed).

Search for "should have a regStore pre split point" in TRSS in recent jdk14 pipeline, results show that it happens on 3 platforms (3 machines test-godaddy-centos7-x64-1, test-marist-ubuntu1604-s390x-2, test-macincloud-macos1010-x64-2).


I also suggest trying to reproduce on the 2 non-mac platforms internally, if that has not been tried, in case those platforms match the external configuration more closely than the mac platform.

Looking at Deep History in TRSS on xLinux shows that it reproduces most frequently on test-godaddy-centos7-x64-1, but it did also occur on test-scaleway-ubuntu1604-x64-1.


@smlambert
Contributor

I searched the last several jdk14 pipelines and see it occurring intermittently in ppc64le_linux_xl, x86-64_mac_xl, and s390x_linux_xl test runs, and once in a (passing) windows_xl run.


Noting that I have not found it occurring in compressedrefs runs.

Also noting that it has been seen all the way back to the first jdk14 test runs at AdoptOpenJDK, launched from jdk14 pipeline 37 (we do not have history before that), whose java -version info is:

OpenJDK Runtime Environment AdoptOpenJDK (build 14+36-202002192048)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-2d2b3853c, JRE 14 Mac OS X amd64-64-Bit 20200219_26 (JIT enabled, AOT enabled)
OpenJ9 - 2d2b385
OMR - 3070301
JCL - 2baf17d18e based on jdk-14+36)

M-Davies changed the title from "JTReg VM Failure: java/nio/channels/SocketChannel/CloseDuringConnect.java" to "Multiple JTReg JVM Failures on large heap builds" Mar 3, 2020
@M-Davies
Author

M-Davies commented Mar 3, 2020

Thanks @smlambert!
I have rerun 3 grinders for the machines specified; we'll see if we can narrow down which ones are affected.

@M-Davies
Author

M-Davies commented Mar 3, 2020

@r30shah https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2402/ (CloseDuringConnect.java) failed 5 times. You should be able to see any javacore files it created by downloading openjdk_test_output.tar.gz (IIRC, they're at TKG/test_output/jdk_nio_0/work/java/nio/channels/SocketChannel/CloseDuringConnect/).

I've kicked off two more grinders with the test targets swapped over to see if I can reproduce the failure output with them:
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2407/
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2408/

@sxa
Contributor

sxa commented Mar 3, 2020

Process to request access to a machine at AdoptOpenJDK used to be to raise a request in the openjdk-infrastructure repo, assume that still holds (@sxa555 can correct me if that has changed).

Nope, that's generally the right way to do it (although I don't think I have full access to all of the mac machines, so I'd need to engage someone - probably at MSFT - who can grant it).

@lumpfish
Contributor

lumpfish commented Mar 3, 2020

Occurred on last night's nightly build on test-osuosl-ubuntu1604-ppc64le-4: https://ci.adoptopenjdk.net/job/Test_openjdk14_j9_sanity.openjdk_ppc64le_linux_xl/16/console

@r30shah
Contributor

r30shah commented Mar 3, 2020

Thanks @smlambert @M-Davies for all the pointers. I have been able to make progress investigating the failures by analyzing the log file from https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2402/.

The failure happens because of the following trees.

n1020n    BBStart <block_87> (freq 4079) (extension of previous block)
n1019n    istore  savedState<auto slot 3>[#522  Auto] 
n1015n      iloadi  java/util/concurrent/locks/AbstractQueuedSynchronizer.state I[#557  volatile Shadow +48] 
n820n         aloadi  java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.this$0 Ljava/util/concurrent/locks/AbstractQueuedSynchronizer;[#381  final Shadow +32]
n816n           ==>aload
n1378n    aRegStore edi  (X!=0 privatizedInlinerArg sharedMemory )
n825n       aloadi  java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.this$0 Ljava/util/concurrent/locks/AbstractQueuedSynchronizer;[#381  final Shadow +32] [flags 0x20607 0x0 ]
n816n         ==>aload
...
n4257n    treetop (profilingCode )                                  
n4255n      call  unknown[#353  helper Method] [flags 0x400 0x0 ]   
n1045n        ==>aloadi
n4256n        aconst 0x5eb962e0 (X!=0 sharedMemory )                 
n1042n    BBEnd </block_87>                                          
n1049n    BBStart <block_54> (freq 24) (extension of previous block) 
n1359n    treetop                                                    
n1053n      aloadi  java/util/concurrent/locks/AbstractQueuedSynchronizer.head Ljava/util/concurrent/locks/AbstractQueuedSynchronizer$Node;[#563  volatile Shadow +32] 
n820n         ==>aloadi
n1371n    ificmpne --> block_83 BBStart at n1367n (inlineHCRGuard )                    
n1369n      iload  unknown static[#631  Static] [flags 0x10303 0x0 ] (cannotOverflow ) 
n1027n      ==>iconst 0
n4724n      GlRegDeps                                                                  
n4723n        PassThrough rdi                                                          
n820n           ==>aloadi
n1364n    BBEnd </block_54>                                                            

To add profiling trees while lowering the trees for the call n4257n, we split the block using the post-GRA block splitter [1].
Now, while un-commoning nodes at the split point, it encounters node n4723n, which is a PassThrough under the GlRegDeps n4724n, and in analyzing the trees it cannot find a regStore node that stored the child of the PassThrough into the rdi register, so we throw the assert. Tracking this node back, I see that localCSE, which runs after GRA and before the profiling trees are added, replaced the child of n4723n with n820n; before that it was n825n, which is stored into rdi by the regStore node n1378n. I would need a log with local CSE tracing to understand why we are seeing these trees.
@M-Davies Sorry to bother you again, but would you be able to launch a grinder with the following options?
-Xjit:{java/util/concurrent/locks/AbstractQueuedSynchronizer*}{very-hot}(traceFull,traceLocalCSE,traceJProfilingValue,log=awaitNanos.methodLog)

[1]. https://github.com/eclipse/omr/blob/3260b77c3845917c690721fecc8db9b736fbd71d/compiler/il/OMRBlock.cpp#L1264
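
To make the failing check concrete, here is a minimal, self-contained sketch of the lookup that fires the assert at OMRBlock.cpp:1041; the function and map names are hypothetical stand-ins, not the real OMR::Block::splitPostGRA code:

#include <cassert>
#include <map>

struct Node;  // illustrative stand-in for an OMR IL node

// storeNodeInfo maps a value commoned across the split point to the
// regStore that placed it in a global register before the split.
static Node *findRegStoreForPassThrough(std::map<Node *, Node *> *storeNodeInfo,
                                        Node *passThroughChild)
{
    auto storeRegNodeInfoEntry = storeNodeInfo->find(passThroughChild);
    // This models the failing assertion: once localCSE has rewritten the
    // PassThrough's child (n825n -> n820n), no regStore exists for the new
    // child, the lookup comes back empty, and the compilation aborts.
    assert(storeRegNodeInfoEntry != storeNodeInfo->end() &&
           "We should have a regStore pre split point that can be used to "
           "allocate register for replacement node");
    return storeRegNodeInfoEntry->second;
}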

@M-Davies
Author

M-Davies commented Mar 4, 2020

@pshipton
Member

pshipton commented Mar 4, 2020

Moving this forward since I don't see any fix in hand and it's becoming too late to update the 0.19 release. Note the branch for the 0.20 release occurs on March 8.

@lumpfish
Contributor

lumpfish commented Mar 5, 2020

Happening regularly on jdk14 nightly builds:

For builds of 4th Mar 2020, occurred on
https://ci.adoptopenjdk.net/job/Test_openjdk14_j9_sanity.openjdk_ppc64le_linux_xl/18/console

For builds of 3rd Mar 2020, occurred on
https://ci.adoptopenjdk.net/job/Test_openjdk14_j9_sanity.openjdk_x86-64_mac_xl/19/consoleFull (3 occurrences)

For builds of 2nd Mar 2020, occurred on
https://ci.adoptopenjdk.net/job/Test_openjdk14_j9_sanity.openjdk_ppc64le_linux_xl/16/console

@M-Davies
Author

I have kicked off another grinder in case the one above expires before @r30shah has a chance to look at it: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2482/

I have also kicked one off for java/nio/channels/SocketChannel/CloseDuringWrite.java (seen on JDK14-openj9-macXL last night): https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2483/

@r30shah
Contributor

r30shah commented Mar 12, 2020

I have opened a PR with the fix (eclipse/omr#4926) and am waiting for my personal testing to finish before asking for review.

r30shah added a commit to r30shah/omr that referenced this issue Mar 12, 2020
For optimization level hot and above, we run two passes in localCSE:
the first for volatiles only and the second for non-volatiles only.
In the volatiles-only pass, to make sure we common volatiles that are
based on an indirection off non-volatile finals or non-global autos/parms,
it marks such nodes as available for commoning in the volatile pass.
The issue occurs when it encounters the first reference of such a
non-volatile node and just adds it to the list of replaced nodes without
replacing it. This breaks subsequent references of such non-volatiles,
which will be replaced with the available expression because the node is
already in the list of replaced nodes.
This commit fixes the issue so that if it encounters finals or autos or
parms in the volatile pass that are candidates for commoning, it replaces
them with the available expression and then adds them to the replaced list.

Fixes: eclipse-openj9/openj9#8637

Signed-off-by: Rahil Shah <[email protected]>
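
For illustration, a minimal sketch of the bookkeeping change this commit describes, with hypothetical names and types (the actual change is in OMR's localCSE pass):

#include <set>

struct Node { Node *replacedBy = nullptr; };  // illustrative stand-in

// Fixed behaviour: a final/auto/parm seen in the volatiles-only pass is
// recorded on the replaced list only when commoning actually happens.
static void processCandidate(Node *node, Node *availableExpr,
                             std::set<Node *> &replacedNodes)
{
    // Before the fix, 'node' was added to replacedNodes even on its first
    // reference (when availableExpr was null), so a later reference was
    // treated as already commoned and wrongly rewritten.
    if (availableExpr != nullptr)
    {
        node->replacedBy = availableExpr;  // replace with the available expression
        replacedNodes.insert(node);        // then add it to the replaced list
    }
}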
@pshipton
Member

@r30shah is there any outlook for getting the OMR change merged?

@r30shah
Contributor

r30shah commented Mar 24, 2020

I am verifying some of the concerns @andrewcraik had on the PR and am currently testing the changes (perf and functional) to make sure they do not regress performance. I expect the testing and the changes addressing the concerns to be done this week.

@pshipton
Member

OK, thanks. Note the milestone 2 build is coming up this weekend, and we generally don't allow any more changes after that. That said, we set M2 earlier this time, so we can evaluate the risk of the change once it's ready.

pshipton added the segfault label Mar 25, 2020
r30shah added a commit to r30shah/omr that referenced this issue Mar 26, 2020
@pshipton
Member

We're past Milestone 2 for the 0.20.0 release, so this is moving forward to the next release.
If a low-risk fix does become available in the next couple of weeks, please raise it for consideration.

r30shah added a commit to r30shah/omr that referenced this issue Mar 30, 2020
For optimization level hot and above, we run two passes in localCSE:
the first for volatiles only and the second for non-volatiles only.
In the volatiles-only pass, it was also commoning nodes that are final
fields/global autos/parms, on the assumption that such a node can be
relied on not to change.
The issue occurs when it encounters the first reference of such a
non-volatile node and just adds it to the list of replaced nodes without
replacing it. This breaks subsequent references of such non-volatiles,
which will be replaced with the available expression because the node is
already in the list of replaced nodes.
This commit fixes the issue so that it puts a node on the replaced list
only if commoning is done.
Signed-off-by: Rahil Shah <[email protected]>
r30shah added a commit to r30shah/omr that referenced this issue Apr 8, 2020
@M-Davies
Author

Thanks for all the help on this everyone :)
