Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-Xshareclasses by default causes a number of issues #3333

Closed
pshipton opened this issue Oct 17, 2018 · 51 comments
Closed

-Xshareclasses by default causes a number of issues #3333

pshipton opened this issue Oct 17, 2018 · 51 comments

Comments

@pshipton
Copy link
Member

pshipton commented Oct 17, 2018

When -Xshareclasses:bootClassesOnly was enabled by default on xlinux jdk11, a number of issues occurred which I believe are all caused by AOT code in the shared cache.

eclipse/omr#3075 / #3319
eclipse/omr#3076
eclipse/omr#3077
#3316

@pshipton
Copy link
Member Author

@dsouzai

@dsouzai
Copy link
Contributor

dsouzai commented Oct 17, 2018

Is there any way to run some of these with -Xaot:disableSymbolValidationManager or perhaps export TR_OptionsAOT=disableSymbolValidationManager ?

@pshipton
Copy link
Member Author

Note that except for eclipse/omr#3076, the problems occurred when compiling the functional tests, and not an actual test case. The best thing would be set it up on a local machine and try to recreate there manually. Then you'll have the shared cache which caused the problem, and can also try setting environment variables to try different options.

@pshipton
Copy link
Member Author

Note that @hangshao0 was trying to recreate #3319 and only found one internal jenkins machine where it occurred.

@dsouzai
Copy link
Contributor

dsouzai commented Oct 18, 2018

I tried running locally, on ub16hcxrt2, with forceAOT, always using the "new" AOT code, always using the "old" AOT code, using an OpenJ9 build of a SHA right before the reverts, using the build from https://ci.eclipse.org/openj9/job/Test-sanity.functional-JDK11-linux_x86-64_cmprssptrs/100/console (https://ci.eclipse.org/openj9/job/Test-sanity.functional-JDK11-linux_x86-64_cmprssptrs/100/fingerprints/). I can't get it to reproduce at all.

What makes me think it's not the new AOT changes is that this only happens on xLinux, whereas the new AOT changes are enabled on 64-bit x86 which includes Windows.

What makes me think it could be the new AOT changes are 3 minor bugs @jdmpapin found in the code (which he's going to fix as part of this bigger code cleanup work).

I figured if I could reproduce the issue I'd try fixing the bugs to see if it the issue goes away, but I can't reproduce it at all.

@pshipton
Copy link
Member Author

@dsouzai maybe you already have, but if not please check with @hangshao0 as he was able to reproduce it on one machine.

@hangshao0
Copy link
Contributor

hangshao0 commented Oct 19, 2018

ub16hcxrt2 is the machine that I reproduced this failure yesterday (4 out of 5 runs failed). I have slacked @dsouzai the link to one of these failures.

@pshipton
Copy link
Member Author

@hangshao0 thanks. Please help check what @dsouzai is doing to see if it can be repeated again.

@pshipton
Copy link
Member Author

@dchopra001 note @DanHeidinga was also seeing problems on macOS.

@hangshao0
Copy link
Contributor

@dsouzai, you'd better clean up the existing shared cache on the system each time before running.

@dsouzai
Copy link
Contributor

dsouzai commented Oct 19, 2018

you'd better clean up the existing shared cache on the system each time before running

Yup I've been doing that.

The job @hangshao0 linked me used a different JVM than what I was running, so I'll try running with that tomorrow.

@dsouzai
Copy link
Contributor

dsouzai commented Oct 19, 2018

openjdk-tests sha:

~/dsouzai/openjdk-tests$ git status
HEAD detached at 4c18447
nothing to commit, working tree clean

and I used OpenJ9-JDK11-linux_x86-64_cmprssptrs-20181017-120625 as the JDK, as well as

export SPEC=linux_x86-64_cmprssptrs
export JDK_VERSION=11
export JVM_VERSION=openjdk11-openj9
export JDK_IMPL=openj9
export BUILD_LIST=functional

I ran it on ub16hcxrt2, basically running the same thing as https://hyc-runtimes-jenkins.swg-devops.com/view/Test_grinder/job/Grinder/538/console but no luck.

@dsouzai
Copy link
Contributor

dsouzai commented Oct 19, 2018

I did however hit an issue on a test I was running:

0000000002FD1D00: Object neither in heap nor stack-allocated in thread LargeThreadPool-thread-50
0000000002FD1D00:       O-Slot=00007F790F65AB68
0000000002FD1D00:       O-Slot value=0000000001C9A500
0000000002FD1D00:       PC=00007F7977234BC8
0000000002FD1D00:       framesWalked=1
0000000002FD1D00:       arg0EA=0000000003047108
0000000002FD1D00:       walkSP=0000000003046FF0
0000000002FD1D00:       literals=0000000000000000
0000000002FD1D00:       jitInfo=00007F793445CDC8
0000000002FD1D00:       method=00000000031ECE58 (com/ibm/ws/webcontainer/session/impl/HttpSessionContextImpl.isValid(Ljavax/servlet/http/HttpSession;Ljavax/servlet/http/HttpServletRequest;Z)Z) (JIT)
0000000002FD1D00:       stack=0000000003040E88-0000000003047E90
14:51:15.269 0x18ae900    j9mm.479    *   ** ASSERTION FAILED ** at ../../../../gc_glue_java/ScavengerRootScanner.hpp:105: ((MM_StackSlotValidator(MM_StackSlotValidator::NOT_ON_HEAP, *slotPtr, stackLocation, walkState).validate(_env)))

From jdmpview:

> !stackslots 0x0000000002FD1D00
...
<2fd1d00> JIT (hidden) frame: bp = 0x0000000003047018, pc = 0x00007F7977C67209, unwindSP = 0x0000000003047018, cp = 0x0000000001C98260, arg0EA = 0x0000000003047020, jitInfo = 0x00007F7936F99838
<2fd1d00>       Method: com/ibm/wsspi/kernel/service/utils/AtomicServiceReference.getService()Ljava/lang/Object; !j9method 0x0000000001C98988
<2fd1d00>       Bytecode index = -1, inlineDepth = 0, PC offset = 0x0000000000000469
<2fd1d00>       stackMap=0x00007F7936F99A6B, slots=I16(0x0001) parmBaseOffset=I16(0x0008), parmSlots=U16(0x0001), localBaseOffset=I16(0xFFE8)
<2fd1d00>       Described JIT args starting at 0x0000000003047020 for U16(0x0001) slots
<2fd1d00>               I-Slot: : a0[0x0000000003047020] = 0x00000000C00056C8
<2fd1d00>       JIT-RegisterMap = UDATA(0x0000000000000001)
<2fd1d00>       JIT-HighWordRegisterMap = UDATA(0x0000000000000000)
<2fd1d00>               JIT-RegisterMap-O-Slot[0x00007F790F65AB60] = 0x00000000C10C5B00 (jit_rax)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65AB68] = UDATA(0x0000000001C9A500) (jit_rbx)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65AB70] = UDATA(0x0000000000000000) (jit_rcx)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65AB78] = UDATA(0x00000000C345BA58) (jit_rdx)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65AB80] = UDATA(0x0000000000000068) (jit_rdi)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65AB88] = UDATA(0x00000000C43FB008) (jit_rsi)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65AB90] = UDATA(0x00007F77A00008E0) (jit_rbp)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65AB98] = UDATA(0x00007F7996266640) (jit_rsp)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABA0] = UDATA(0x0000000001C98700) (jit_r8)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABA8] = UDATA(0x00000000C2567F68) (jit_r9)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABB0] = UDATA(0x00000000C2567F30) (jit_r10)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABB8] = UDATA(0x00000000C24D86B0) (jit_r11)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABC0] = UDATA(0x00000000C43FB008) (jit_r12)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABC8] = UDATA(0x0000000000000001) (jit_r13)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABD0] = UDATA(0x00000000C2567F68) (jit_r14)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABD8] = UDATA(0x0000000001C9A500) (jit_r15)
<2fd1d00> JIT inline frame: bp = 0x00000000030470E8, pc = 0x00007F7977234BC8, unwindSP = 0x0000000003047020, cp = 0x0000000001E93CA0, arg0EA = 0x0000000000000000, jitInfo = 0x00007F793445CDC8
<2fd1d00>       Method: com/ibm/websphere/security/WSSecurityHelper.isServerSecurityEnabled()Z !j9method 0x0000000001E93EF8
<2fd1d00>       Bytecode index = 3, inlineDepth = 2, PC offset = 0xFFFFFFFFFFF906D0
<2fd1d00> JIT inline frame: bp = 0x00000000030470E8, pc = 0x00007F7977234BC8, unwindSP = 0x0000000003047020, cp = 0x00000000031EB690, arg0EA = 0x0000000000000000, jitInfo = 0x00007F793445CDC8
<2fd1d00>       Method: com/ibm/ws/webcontainer/session/impl/HttpSessionContextImpl.checkSecurity(Lcom/ibm/ws/session/SessionData;Ljavax/servlet/http/HttpServletRequest;)V !j9method 0x00000000031ECFB8
<2fd1d00>       Bytecode index = 50, inlineDepth = 1, PC offset = 0xFFFFFFFFFFFB6340
<2fd1d00> JIT frame: bp = 0x00000000030470E8, pc = 0x00007F7977234BC8, unwindSP = 0x0000000003047020, cp = 0x00000000031EB690, arg0EA = 0x0000000003047108, jitInfo = 0x00007F793445CDC8
<2fd1d00>       Method: com/ibm/ws/webcontainer/session/impl/HttpSessionContextImpl.isValid(Ljavax/servlet/http/HttpSession;Ljavax/servlet/http/HttpServletRequest;Z)Z !j9method 0x00000000031ECE58
<2fd1d00>       Bytecode index = 50, inlineDepth = 0, PC offset = 0x00000000000005C0
<2fd1d00>       stackMap=0x00007F793445DB59, slots=I16(0x0004) parmBaseOffset=I16(0x0010), parmSlots=U16(0x0003), localBaseOffset=I16(0xFF88)
<2fd1d00>       Described JIT args starting at 0x00000000030470F8 for U16(0x0003) slots
<2fd1d00>               O-Slot: : a2[0x00000000030470F8] = 0x00000000C345BA58
<2fd1d00>               O-Slot: : a1[0x0000000003047100] = 0x00000000C43FB008
<2fd1d00>               O-Slot: : a0[0x0000000003047108] = 0x00000000C2567F68
<2fd1d00>       Described JIT temps starting at 0x0000000003047070 for IDATA(0x000000000000000F) slots
<2fd1d00>               I-Slot: : t14[0x0000000003047070] = 0x00000000C24D86B0
<2fd1d00>               I-Slot: : t13[0x0000000003047078] = 0x00000000C2567F68
<2fd1d00>               I-Slot: : t12[0x0000000003047080] = 0x00000000C43FB008
<2fd1d00>               O-Slot: : t11[0x0000000003047088] = 0x00000000C43F6DA8
<2fd1d00>               O-Slot: : t10[0x0000000003047090] = 0x00000000C345BA58
<2fd1d00>               O-Slot: : t9[0x0000000003047098] = 0x00000000C43FB008
<2fd1d00>               O-Slot: : t8[0x00000000030470A0] = 0x00000000C2567F68
<2fd1d00>               I-Slot: : t7[0x00000000030470A8] = 0x0000000000000001
<2fd1d00>               O-Slot: : t6[0x00000000030470B0] = 0x00000000C43FB008
<2fd1d00>               I-Slot: : t5[0x00000000030470B8] = 0x0000000000000000
<2fd1d00>               I-Slot: : t4[0x00000000030470C0] = 0x00000000C345BA58
<2fd1d00>               I-Slot: : t3[0x00000000030470C8] = 0x00000000C2567F68
<2fd1d00>               I-Slot: : t2[0x00000000030470D0] = 0x00000000C2567F68
<2fd1d00>               O-Slot: : t1[0x00000000030470D8] = 0x00000000C24D86B0
<2fd1d00>               I-Slot: : t0[0x00000000030470E0] = 0x00000000C43F6DA8
<2fd1d00>       JIT-RegisterMap = UDATA(0x0000000000000202)
<2fd1d00>       JIT-HighWordRegisterMap = UDATA(0x0000000000000000)
<2fd1d00>               JIT-RegisterMap-O-Slot[0x00007F790F65AB68] = 0x0000000001C9A500 (jit_rbx)
<2fd1d00>               JIT-RegisterMap-O-Slot[0x00007F790F65ABA8] = 0x00000000C2567F68 (jit_r9)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABB0] = UDATA(0x00000000C2567F30) (jit_r10)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABB8] = UDATA(0x00000000C24D86B0) (jit_r11)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABC0] = UDATA(0x00000000C43FB008) (jit_r12)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABC8] = UDATA(0x0000000000000001) (jit_r13)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABD0] = UDATA(0x00000000C2567F68) (jit_r14)
<2fd1d00>               JIT-RegisterMap-I-Slot[0x00007F790F65ABD8] = UDATA(0x0000000001C9A500) (jit_r15)
<2fd1d00>       JIT-Frame-RegisterMap[0x0000000003047058] = UDATA(0x00000000C345BA58) (jit_rbx)
<2fd1d00>       JIT-Frame-RegisterMap[0x0000000003047060] = UDATA(0x00000000C43FB008) (jit_r9)
<2fd1d00>       JIT-Frame-RegisterMap[0x00007F790F65ABB0] = UDATA(0x00000000C2567F30) (jit_r10)
<2fd1d00>       JIT-Frame-RegisterMap[0x00007F790F65ABB8] = UDATA(0x00000000C24D86B0) (jit_r11)
<2fd1d00>       JIT-Frame-RegisterMap[0x00007F790F65ABC0] = UDATA(0x00000000C43FB008) (jit_r12)
<2fd1d00>       JIT-Frame-RegisterMap[0x00007F790F65ABC8] = UDATA(0x0000000000000001) (jit_r13)
<2fd1d00>       JIT-Frame-RegisterMap[0x00007F790F65ABD0] = UDATA(0x00000000C2567F68) (jit_r14)
<2fd1d00>       JIT-Frame-RegisterMap[0x00007F790F65ABD8] = UDATA(0x0000000001C9A500) (jit_r15)
...

JIT-RegisterMap-O-Slot[0x00007F790F65AB68] = 0x0000000001C9A500 (jit_rbx) is a J9Class but for some reason sits in an O-Slot. Looking at the disassembly;

...
0x7f7977234b06 {com/.../HttpSessionContextImpl.isValid} +1278          488b3c3d00000000     mov       rdi, qword ptr [rdi]
0x7f7977234b0e {com/.../HttpSessionContextImpl.isValid} +1286          4c8bc7               mov       r8, rdi
0x7f7977234b11 {com/.../HttpSessionContextImpl.isValid} +1289          4589420c             mov       dword ptr [r10+0x0c], r8d
0x7f7977234b15 {com/.../HttpSessionContextImpl.isValid} +1293          f685aa01000010       test      byte ptr [rbp+0x1aa], 0x10
0x7f7977234b1c {com/.../HttpSessionContextImpl.isValid} +1300          0f852e420000         jne       0x7f7977238d50 C>> +18248
0x7f7977234b22 {com/.../HttpSessionContextImpl.isValid} +1306          4d8bca               mov       r9, r10
0x7f7977234b25 {com/.../HttpSessionContextImpl.isValid} +1309          48bb000000c000000000 mov       rbx, 0x0c0000000 Obj - {java/lang/Class}
0x7f7977234b2f {com/.../HttpSessionContextImpl.isValid} +1319          4c2bcb               sub       r9, rbx
0x7f7977234b32 {com/.../HttpSessionContextImpl.isValid} +1322          4c3b8de8000000       cmp       r9, qword ptr [rbp+0x0e8] J9VMThread.heapSizeForBarrierRange0
0x7f7977234b39 {com/.../HttpSessionContextImpl.isValid} +1329          730a                 jae       0x7f7977234b45 C>> +1341
0x7f7977234b3b {com/.../HttpSessionContextImpl.isValid} +1331          41f602f0             test      byte ptr [r10], 0x0f0
0x7f7977234b3f {com/.../HttpSessionContextImpl.isValid} +1335          0f84f7410000         je        0x7f7977238d3c C>> +18228
0x7f7977234b45 {com/.../HttpSessionContextImpl.isValid} +1341          4885ff               test      rdi, rdi <<< +1329 ^+5471 ^+6259 ^+18243 ^+18263
0x7f7977234b48 {com/.../HttpSessionContextImpl.isValid} +1344          0f84f0000000         je        0x7f7977234c3e C>> +1590
0x7f7977234b4e {com/.../HttpSessionContextImpl.isValid} +1350          0f1f440000           nop       eax, dword ptr [rax+rax]
0x7f7977234b53 {com/.../HttpSessionContextImpl.isValid} +1355          48898c2498000000     mov       qword ptr [rsp+0x98], rcx
0x7f7977234b5b {com/.../HttpSessionContextImpl.isValid} +1363          4889542470           mov       qword ptr [rsp+0x70], rdx
0x7f7977234b60 {com/.../HttpSessionContextImpl.isValid} +1368          4889742478           mov       qword ptr [rsp+0x78], rsi
0x7f7977234b65 {com/.../HttpSessionContextImpl.isValid} +1373          4c8bc8               mov       r9, rax
0x7f7977234b68 {com/.../HttpSessionContextImpl.isValid} +1376          4c899c24b8000000     mov       qword ptr [rsp+0x0b8], r11
0x7f7977234b70 {com/.../HttpSessionContextImpl.isValid} +1384          4c89a42490000000     mov       qword ptr [rsp+0x90], r12
0x7f7977234b78 {com/.../HttpSessionContextImpl.isValid} +1392          4c89ac2488000000     mov       qword ptr [rsp+0x88], r13
0x7f7977234b80 {com/.../HttpSessionContextImpl.isValid} +1400          4c89b42480000000     mov       qword ptr [rsp+0x80], r14
0x7f7977234b88 {com/.../HttpSessionContextImpl.isValid} +1408          498bdf               mov       rbx, r15
0x7f7977234b8b {com/.../HttpSessionContextImpl.isValid} +1411          48bf503ae90100000000 mov       rdi, 0x1e93a50 Ptr Obj - {com/ibm/wsspi/kernel/service/utils/AtomicServiceReference} // Try{
0x7f7977234b95 {com/.../HttpSessionContextImpl.isValid} +1421          4c8b043d00000000     mov       r8, qword ptr [rdi]
0x7f7977234b9d {com/.../HttpSessionContextImpl.isValid} +1429          498bc0               mov       rax, r8
0x7f7977234ba0 GCMap  Bytecode 18:3 Registers: rax rbx r8 r9  StackMap: P0xd8 P0xe0 P0xe8 0x68 0x70 0x78 0x80 0x90 0xb8
0x7f7977234ba0 {com/.../HttpSessionContextImpl.isValid} +1432  18:3    458b00               mov       r8d, dword ptr [r8]
0x7f7977234ba3 {com/.../HttpSessionContextImpl.isValid} +1435          4981e000ffffff       and       r8, 0x0ffffff00
0x7f7977234baa {com/.../HttpSessionContextImpl.isValid} +1442          660f1f440000         nop       ax, word ptr [rax+rax]
0x7f7977234bb0 GCMap  Bytecode 18:3 Registers: rbx r9  StackMap: P0xd8 P0xe0 P0xe8 0x68 0x70 0x78 0x80 0x90 0xb8
0x7f7977234bb0 {com/.../HttpSessionContextImpl.isValid} +1448  18:3    48bf0087c90100000000 mov       rdi, 0x1c98700 J9Class - {com/ibm/wsspi/kernel/service/utils/AtomicServiceReference}
0x7f7977234bba {com/.../HttpSessionContextImpl.isValid} +1458          493bf8               cmp       rdi, r8
0x7f7977234bbd {com/.../HttpSessionContextImpl.isValid} +1461          0f856d410000         jne       0x7f7977238d30 C>> +18216
0x7f7977234bc3 {com/.../HttpSessionContextImpl.isValid} +1467          e8dd21a300           call      0x7f7977c66da5 {com/.../AtomicServiceReference.getService} +5  
...
0x7f7977238d3c {com/.../HttpSessionContextImpl.isValid} +18228          4992                 xchg      rax, r10 <<< +1335
0x7f7977238d3e {com/.../HttpSessionContextImpl.isValid} +18230          4987f0               xchg      rsi, r8
0x7f7977238d41 {com/.../HttpSessionContextImpl.isValid} +18233          e83a47781d           call      0x7f79949bd480 {libj9jit29.so}{jitWriteBarrierStoreGenerational} +0
0x7f7977238d46 {com/.../HttpSessionContextImpl.isValid} +18238          4992                 xchg      rax, r10
0x7f7977238d48 {com/.../HttpSessionContextImpl.isValid} +18240          4987f0               xchg      rsi, r8
0x7f7977238d4b {com/.../HttpSessionContextImpl.isValid} +18243          e9f5bdffff           jmp       0x7f7977234b45 U>> ^+1341
...

rbx does contain an object, but then
0x7f7977234b88 {com/.../HttpSessionContextImpl.isValid} +1408 498bdf mov rbx, r15
moved r15 into rbx, but the GC map didn't get updated.

I talked to @0dvictor, who said that the mov is likely because rbx is dead and so instead of saving r15 to the stack, it gets put into rbx (which is a callee saved register). However, as can be seen by
0x7f7977234ba0 GCMap Bytecode 18:3 Registers: rax rbx r8 r9 StackMap: P0xd8 P0xe0 P0xe8 0x68 0x70 0x78 0x80 0x90 0xb8
it looks like when r15 was moved into rbx, rbx wasn't updated to specify that it no longer contained a collected reference.

I tried getting a log, but I couldn't reproduce the issue; @0dvictor, do you know where the code that assigns a register instead of saving on the stack is?

This particular issue isn't AOT related, but I've only seen it happen with the "new" AOT code heh. It's possible that this is also the cause of the other problems seen.

@dsouzai
Copy link
Contributor

dsouzai commented Oct 19, 2018

@jdmpapin pointed out that in

0x7f7977234b25 {com/.../HttpSessionContextImpl.isValid} +1309 48bb000000c000000000 mov rbx, 0x0c0000000 Obj - {java/lang/Class}

we're not moving an object into rbx but rather the subtrahend. So I'm going to have to look at the disassembly further to see why rbx is considered a collected reference.

@0dvictor
Copy link
Contributor

0x7f7977234b25 {com/.../HttpSessionContextImpl.isValid} +1309 48bb000000c000000000 mov rbx, 0x0c0000000 Obj - {java/lang/Class}

This code seems to be part of a write barrier. 0x0c0000000 should be the heap base, which contains a valid object, and hence Debugger reports the address is an object.

@0dvictor
Copy link
Contributor

eclipse/omr#3125 should fix the above GC map issue.

@pshipton pshipton added this to the Release 0.12.0 milestone Nov 8, 2018
@pshipton
Copy link
Member Author

eclipse/omr#3125 is merged. @hangshao0 can you please create a new PR to enable bootstrap class sharing by default. I can't revert #3337 automatically.

@dsouzai
Copy link
Contributor

dsouzai commented Nov 13, 2018

Btw, eclipse/omr#3125 was the most likely culprit for the AOT issue, but I'm not 100% certain it was the only cause of the issues we were seeing, so it would be worth running a few grinders on misc jobs to make sure.

@pshipton
Copy link
Member Author

Right, I'll be running many sanity tests on the new PR before merging, we can try the grinders as well.

@hangshao0
Copy link
Contributor

can you please create a new PR to enable bootstrap class sharing by default.

There were two PRs originally, I will merge those two PRs into one single PR. Need to run some manual test locally, I will create a PR once it is done.

@hangshao0
Copy link
Contributor

hangshao0 commented Nov 14, 2018

I am still seeing issue eclipse/omr#3076. This time it is on linux ppcle.

@hangshao0
Copy link
Contributor

Change is here: #3681. I will remove WIP from #3681 after this issue is fixed.

@pshipton
Copy link
Member Author

@dsouzai fyi #3333 (comment)

@hangshao0
Copy link
Contributor

hangshao0 commented Nov 16, 2018

OK, I know what is going on. I have manually re-created the failure on one of the internal machines (ub14levm8). The HandleSIGXFSZ failure is not caused by AOT.

The ulimit setting of the failed machine is

ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) 4000

So creating any file > the maximum file size limit will result in signal SIGXFSZ.
What SignalXfszTest does is: It tries writing to a file called test.txt to exceed the maximum file size limit to trigger the signal SIGXFSZ.
However, when shared classes is enabled by default, we will try creating a cache file first (300MB, it is greater than ulimit), which triggers signal SIGXFSZ before any class is loaded. If -XX:-HandleSIGXFSZ is used in the CML, JVM exits immediately with the following message:
File size limit exceeded (core dumped)

Class SignalXfszTest is not loaded yet, so we won't see "Starting SignalXfszTest"

So we should probably increase the maximum file size limit on the testing machines, 4MB is too small.

@dsouzai
Copy link
Contributor

dsouzai commented Nov 16, 2018

Heh interesting; how come the other tests pass though? Wasn't the SCC created in test 1 - "Testing Default" (which didn't have -XX:+HandleSIGXFSZ) whereas the first failure is seen in test 6 - "Testing: -XX:-HandleSIGXFSZ (Linux)"?

@hangshao0
Copy link
Contributor

hangshao0 commented Nov 16, 2018

Wasn't the SCC created in test 1

No. It was not created. When shared classes is enabled by default, nonfatal is also on, which means we will proceed ignoring the SCC creation failure.

@dsouzai
Copy link
Contributor

dsouzai commented Nov 16, 2018

Ohh I see that makes sense; also I realized I was wrong above:

 argIndex = FIND_AND_CONSUME_ARG(EXACT_MATCH, VMOPT_XXNOHANDLESIGXFSZ, NULL); 
 argIndex2 = FIND_AND_CONSUME_ARG(EXACT_MATCH, VMOPT_XXHANDLESIGXFSZ, NULL); 
  
 if (argIndex2 >= argIndex) { 
 	sigOptions |= J9PORT_SIG_OPTIONS_SIGXFSZ; 
 	vm->extendedRuntimeFlags |= J9_EXTENDED_RUNTIME_HANDLE_SIGXFSZ; 
 } 

-XX:+HandleSIGXFSZ is the default behaviour when no options are set. That explains why the other tests all pass until -XX:-HandleSIGXFSZ is specified.

@hangshao0
Copy link
Contributor

hangshao0 commented Nov 16, 2018

So we should probably increase the maximum file size limit on the testing machines

I noticed that the file size limit 4000 is set by the test itself, it is not the default setting on the machine.

https://github.com/eclipse/openj9/blob/master/test/functional/cmdLineTests/sigxfszHandlingTest/playlist.xml#L32

So adding -Xshareclasses:none to the commands in sigxfszHandlingTest should solve this problem. After this change, the test is still testing what it was testing.

@pshipton
Copy link
Member Author

adding -Xshareclasses:none to the commands in sigxfszHandlingTest should solve this issue. The test is still testing what it was testing.

works for me. Please add a comment to explain why -Xshareclasses:none is specified.
Although if there is an AOT problem related to sigxfsz handling, or any other extended flags, we should open another issue(s), and that should be resolved as well @dsouzai

@hangshao0
Copy link
Contributor

PR #3713 open to fix sigxfszHandlingTest

@pshipton
Copy link
Member Author

This should not have been closed by the #3713 commit.

@pshipton pshipton reopened this Nov 20, 2018
@hangshao0
Copy link
Contributor

I am copying the links of OSX failure to this issue.
https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK11-osx_x86-64_cmprssptrs-OpenJ9/34/

11:53:10 + build/macosx-x86_64-normal-server-release/images/jdk/bin/java -version
11:53:11 Exception in thread "main" java/lang/BootstrapMethodError: java.lang.NullPointerException
11:53:11 	at jdk/internal/loader/BuiltinClassLoader.setJimageURL (java.base@9/BuiltinClassLoader.java:253)
11:53:11 	at jdk/internal/loader/BuiltinClassLoader$LoadedModule.convertJrtToFileURL (java.base@9/BuiltinClassLoader.java:296)
11:53:11 	at jdk/internal/loader/BuiltinClassLoader$LoadedModule.<init> (java.base@9/BuiltinClassLoader.java:286)
11:53:11 	at jdk/internal/loader/BuiltinClassLoader.loadModule (java.base@9/BuiltinClassLoader.java:354)
11:53:11 	at jdk/internal/loader/BootLoader.loadModule (java.base@9/BootLoader.java:109)
11:53:11 	at jdk/internal/module/ModuleBootstrap.boot (java.base@9/ModuleBootstrap.java:216)
11:53:11 	at java/lang/ClassLoader.initializeClassLoaders (java.base@9/ClassLoader.java:217)
11:53:11 	at java/lang/Thread.initialize (java.base@9/Thread.java:422)
11:53:11 	at java/lang/Thread.<init> (java.base@9/Thread.java:153)

https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK11-osx_x86-64_cmprssptrs-OpenJ9/35/
https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK11-osx_x86-64_cmprssptrs-OpenJ9/36/

@dsouzai
Copy link
Contributor

dsouzai commented Nov 20, 2018

I got access to osxvm2, and I'm able to reproduce the issue (fairly intermittently); however I'm a bit blocked as I can't generate any cores. I'm following up with that, but this issue will likely take some time. From what I can tell, this issue happens with "old AOT"; why it only happens on osx is a complete mystery...I really hope it isn't the cause of some behaviour due to a different compiler.

@dsouzai
Copy link
Contributor

dsouzai commented Nov 30, 2018

I didn't have time all week to look at this issue. However, when I last left off my investigation, I managed to get core files and some logging; I'm going to have to go through them to see what the problem is. As far as I can tell, the AOT problem isn't a new one (since it happens with "old" AOT); we've just never really tested osx prior to OpenJ9.

@dsouzai
Copy link
Contributor

dsouzai commented Dec 5, 2018

@jdmpapin and I worked on this yesterday, and we think we know what the problem is. It isn't osx specific, but somehow only manifests itself on osx 😕

I had noticed that the problem only occurred when the JIT code cache address started in the 0xB0000000-0xC0000000 range. On passing runs, the JIT code cache would start somewhere in the 0x10000000-0x20000000 range. Initially, the issue would manifest itself as a NPE, but that was because the crash would occur in an exception range. So, after running with -Xrs -Xjit:verbose={compilePerformance},vlog=vlog -Xaot:forceAOT,{jdk/internal/org/objectweb/asm/Item.set(ILjava/lang/String;Ljava/lang/String;Ljava/lang/String;)V}(count=1), we took a look at the core:

(lldb) thread list
Process 0 stopped
...
  thread #5: tid = 0x0004, 0x0000000000000000, stop reason = signal SIGSTOP
  thread #6: tid = 0x0005, 0x00000000b0086d90, stop reason = signal SIGSTOP
  thread #7: tid = 0x0006, 0x00007fff79f08262 libsystem_kernel.dylib`sem_wait + 10, stop reason = signal SIGSTOP
...

Thread 6 looked like it was a PC in the JIT codecache, so we took a look at that:

(lldb) thread select 6
* thread #6, stop reason = signal SIGSTOP
    frame #0: 0x00000000b0086d90
->  0xb0086d90: jmp    qword ptr [8*rax - 0x4ff79330]
    0xb0086d97: nop    dword ptr [rax + rax]
    0xb0086d9c: movabs rax, 0x11a5a828
    0xb0086da6: mov    rax, qword ptr [rax]

The method that PC belongs to, from the vlog, was:

+ (AOT load) java/util/Set.of([Ljava/lang/Object;)Ljava/util/Set; @ 00000000B0086D40-00000000B0087349 Q_SZ=10 Q_SZI=10 QW=11 j9m=00000000119ED1A8 bcsz=66 time=3088us compThread=0

Disassembling that method:

(lldb) x/20i 00000000B0086D40
    0xb0086d40: 48 8b 44 24 08                 mov    rax, qword ptr [rsp + 0x8]
    0xb0086d45: 48 3b 65 50                    cmp    rsp, qword ptr [rbp + 0x50]
    0xb0086d49: 0f 86 25 04 00 00              jbe    0xb0087174
    0xb0086d4f: 48 83 ec 48                    sub    rsp, 0x48
    0xb0086d53: 4c 89 4c 24 28                 mov    qword ptr [rsp + 0x28], r9
    0xb0086d58: 48 89 5c 24 20                 mov    qword ptr [rsp + 0x20], rbx
    0xb0086d5d: 48 89 44 24 50                 mov    qword ptr [rsp + 0x50], rax
    0xb0086d62: 48 b8 e8 82 08 84 c2 7f 00 00  movabs rax, 0x7fc2840882e8
    0xb0086d6c: 83 3c 05 00 00 00 00 01        cmp    dword ptr [rax], 0x1
    0xb0086d74: 0f 84 9d 02 00 00              je     0xb0087017
    0xb0086d7a: 48 8b 44 24 50                 mov    rax, qword ptr [rsp + 0x50]
    0xb0086d7f: 48 85 c0                       test   rax, rax
    0xb0086d82: 0f 84 b8 05 00 00              je     0xb0087340
    0xb0086d88: 8b 40 04                       mov    eax, dword ptr [rax + 0x4]
    0xb0086d8b: 83 f8 03                       cmp    eax, 0x3
    0xb0086d8e: 73 2d                          jae    0xb0086dbd
->  0xb0086d90: ff 24 c5 d0 6c 08 b0           jmp    qword ptr [8*rax - 0x4ff79330]
    0xb0086d97: 0f 1f 44 00 00                 nop    dword ptr [rax + rax]
    0xb0086d9c: 48 b8 28 a8 a5 11 00 00 00 00  movabs rax, 0x11a5a828
    0xb0086da6: 48 8b 04 05 00 00 00 00        mov    rax, qword ptr [rax]

The instruction being executed is implementing a jump table. Looking at the relolog:

TR_MethodAddress 00000000D1477264
        size 14 type 3 flags 0 reloFlags 0
        reloLocation: from 00000000D147726C at 00000000B0086CD0 (offset 10)
                applyRelocation: old method address 000000001C5C4BF7
                applyRelocation: new method address 00000000B0086D97
        reloLocation: from 00000000D147726E at 00000000B0086CD8 (offset 18)
                applyRelocation: old method address 000000001C5C4C77
                applyRelocation: new method address 00000000B0086E17
        reloLocation: from 00000000D1477270 at 00000000B0086CE0 (offset 20)
                applyRelocation: old method address 000000001C5C4D68
                applyRelocation: new method address 00000000B0086F08
        reloLocation: from 00000000D1477272 at 00000000B0086D93 (offset d3)
                applyRelocation: old method address 00441F0F1C5C4B30
                applyRelocation: new method address 00441F0FB0086CD0
        reloLocation: from 00000000D1477274 at 00000000B008705C (offset 39c)
                applyRelocation: old method address 000000001C5C4BD3
                applyRelocation: new method address 00000000B0086D73
        reloLocation: from 00000000D1477276 at 00000000B0087080 (offset 3c0)
                applyRelocation: old method address 000000001C5C4BA0
                applyRelocation: new method address 00000000B0086D40

The address relocated at 00000000B0086D93 (offset d3) is (effectively) B0086CD0. From lldb:

(lldb) x/3g B0086CD0
0xb0086cd0: 0x00000000b0086d97 0x00000000b0086e17
0xb0086ce0: 0x00000000b0086f08

the three addresses in the jump table are exactly the first three relocations that were applied. The fourth relocation, which is applied in the middle of the jmp instruction, is the location of this array of addresses into the current AOT'd method.

The problem occurs because:

  1. the address of the jump table is specified in the "offset" portion of the jmp instruction
  2. the "offset" portion of the jmp instruction is signed 32 bits

Thus, when the address is larger than 0x7FFFFFFF, the offset is interpreted as a negative value; in essence, the address gets sign extended.

A secondary problem is that the relocation is done in the following manner:
https://github.com/eclipse/openj9/blob/36e28d275aa91d4539e9e354b8e54f8a7f36b0b0/runtime/compiler/runtime/RelocationRecord.cpp#L1349-L1353
Therefore, if the distance between the oldMethodAddress and compileMethodCodeStartPC is large enough, the new address will end up being more than 32 bits, which would cause it to, in this case, write over the instructions after the jmp instruction. This happens because the oldMethodAddress is read as a 64 bit value.

@jdmpapin is working on a fix for this; because the jmp table is going to be a known location, the idea is to do something like

lea rTmp, [rip - 0x...]
jmp qword ptr [rTmp + 8*rIndex]

This would remove the need for a relocation.

@jdmpapin
Copy link
Contributor

jdmpapin commented Dec 7, 2018

@jdmpapin is working on a fix for this

eclipse/omr#3303

@DanHeidinga
Copy link
Member

@jdmpapin now that the omr change is merged, is there further work here?

@dsouzai
Copy link
Contributor

dsouzai commented Dec 14, 2018

I don't think so, unless we see more issues when enabling the SCC by default on OSX. I am seeing a bunch of

Test result: FAILED
 [ERR] JVMSHRC659E An error has occurred while opening shared memory
 [ERR] JVMSHRC336E Port layer error code = -393970
 [ERR] JVMSHRC337E Platform error message: shmget : Cannot allocate memory
 [ERR] JVMSHRC029E Not enough memory left on the system
 [ERR] JVMSHRC663I Error recovery: destroyed semaphore set with id=196621 associated with shared class cache.
 [ERR] JVMJ9VM015W Initialization error for library j9shr29(11): JVMJ9VM009E J9VMDllMain failed
 [ERR] Error: Could not create the Java Virtual Machine.
 [ERR] Error: A fatal exception has occurred. Program will exit.

issues on osx sanity PR builds, so maybe this issue needs to stay open to address those?

@hangshao0
Copy link
Contributor

The error above is machine issue on non-persistent cache. The default cache type is persistent on OSX.

@dsouzai
Copy link
Contributor

dsouzai commented Dec 14, 2018

I see, well I suppose a PR enabling the SCC by default on OSX can be opened and tested; if there are no problems (beside the known ones), the PR can be merged and this issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants