Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core dumped when launching a modded minecraft version 1.12.2 with openj9-0.30.0-m1/m2 and -Xshareclasses -Xquickstart/-Xaot:count=0 #14300

Closed
ghost opened this issue Jan 18, 2022 · 11 comments · Fixed by #14421
Assignees
Labels
comp:jit segfault Issues that describe segfaults / JVM crashes userRaised

Comments

@ghost
Copy link

ghost commented Jan 18, 2022

Java -version output

openjdk version "1.8.0_322"
IBM Semeru Runtime Open Edition (build 1.8.0_322-b04)
Eclipse OpenJ9 VM (build openj9-0.30.0-m1, JRE 1.8.0 Windows 10 amd64-64-Bit Compressed References 20211130_265 (JIT enabled, AOT enabled)
OpenJ9 - ea06f74
OMR - eb9cf8525
JCL - a3346d8d7a based on jdk8u322-b04)

Summary of problem

I'm trying to launch minecraft 1.12.2 with rlcraft 2.9 modpack (forge-1.12.2-14.23.5.2860) using openj9-0.30.0-m1 or openj9-0.30.0-m2 with option -Xshareclasses -Xquickstart or -Xshareclasses -Xaot:count=0 and with a 2GB SCC. After destroying the SCC the first launch seems to be ok, but after the first launch it will not start anymore (with -Xaot:count=0 seems need to launch more times) and will crash when or just after initializing minecraft engine.

Diagnostic files

logs when using -Xshareclasses -Xquickstart:
image
image
javacore.20220118.221225.17824.0002.txt

logs when using -Xshareclasses -Xaot:count=0:
image
javacore.20220118.223207.17532.0002.txt

the dumps are with openj9-0.30.0-m1 but the bug still happens in openj9-0.30.0-m2.

@ghost ghost added the userRaised label Jan 18, 2022
@pshipton pshipton added comp:jit segfault Issues that describe segfaults / JVM crashes labels Jan 18, 2022
@pshipton
Copy link
Member

@0xdaryl fyi

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 18, 2022

@dsouzai : would you mind triaging this crash please?

@dsouzai
Copy link
Contributor

dsouzai commented Jan 18, 2022

@siyy123 could you upload the coredump somewhere? Also could you add -XcompilationThreads1 -Xjit:verbose={compilePerformance},vlog=vlog -Xaot:aotrtDebugLevel=30,rtLog=rtLog (in the run where you have -Xaot:count=0, do -Xaot:count=0,aotrtDebugLevel=30,rtLog=rtLog because if you add multiple -Xaot/-Xjit only the last of each is considered). If you can upload the coredump, javacore, vlog, and rtLog from the run that generated the coredump then I can try and use windbg to take a look at the problem.

@ghost
Copy link
Author

ghost commented Jan 19, 2022

@ghost
Copy link
Author

ghost commented Jan 19, 2022

and for -Xquickstart: https://1drv.ms/u/s!AotJ8siZ9lZhij7YTYQ8qgnmCie1?e=TOpNG2

@dsouzai
Copy link
Contributor

dsouzai commented Jan 21, 2022

Sorry, wasn't able to get to this this week; will try to take a look on Monday.

@dsouzai
Copy link
Contributor

dsouzai commented Jan 25, 2022

Not really sure what's going on as this is quite deep in the codegen. There does seem to be something off though. I've been looking at the -Xquickstart run.

The java stack at the time of the crash looks like this:

> !stack 0x010a8100
<10a8100>       !j9method 0x000000000310C630   org/lwjgl/openal/AL10.►  ♠♂?♀*
<10a8100>       !j9method 0x000000000310C610   org/lwjgl/openal/AL10.alGetError()I
<10a8100>       !j9method 0x0000000001883848   com/tmtravlr/soundfilters/ModifiedLWJGLOpenALLibrary.checkALError()Z
<10a8100>       !j9method 0x0000000001883608   com/tmtravlr/soundfilters/ModifiedLWJGLOpenALLibrary.init()V
<10a8100>       !j9method 0x00000000030FF7B0   paulscode/sound/SoundSystem.CommandNewLibrary(Ljava/lang/Class;)V
<10a8100>       !j9method 0x00000000030FFCD0   paulscode/sound/SoundSystem.CommandQueue(Lpaulscode/sound/CommandObject;)Z
<10a8100>       !j9method 0x0000000003108018   paulscode/sound/CommandThread.run()V
<10a8100>                               JNI call-in frame
<10a8100>                               Native method frame

From the stackslots:

Signature: org/lwjgl/openal/AL10.alGetError()I !bytecodes 0x000000000310C610
ROM Method: !j9rommethod 0x00000217C1554278
Next Method: !j9method 0x000000000310C630
> !stackslots 0x010a8100
<10a8100> *** BEGIN STACK WALK, flags = 00400001 walkThread = 0x00000000010A8100 ***
<10a8100>       ITERATE_O_SLOTS
<10a8100>       RECORD_BYTECODE_PC_OFFSET
<10a8100> Initial values: walkSP = 0x000000000314A8F8, PC = 0x0000000000000006, literals = 0x0000000000000000, A0 = 0x000000000314A970, j2iFrame = 0x0000000000000000, ELS = 0x000000EAF96FF500, decomp = 0x0000000
000000000
<10a8100> JIT JNI call-out frame: bp = 0x000000000314A918, sp = 0x000000000314A8F8, pc = 0x0000000000000006, cp = 0x000000000310D190, arg0EA = 0x000000000314A970, flags = 0x0000000020000000
<10a8100>       Method: org/lwjgl/openal/AL10.► ♠♂?♀* !j9method 0x000000000310C630
<10a8100> JIT frame: bp = 0x000000000314A948, pc = 0x00007FF988A9DA00, unwindSP = 0x000000000314A920, cp = 0x000000000310D190, arg0EA = 0x000000000314A948, jitInfo = 0x000002184C0AE068
<10a8100>       Method: org/lwjgl/openal/AL10.alGetError()I !j9method 0x000000000310C610
<10a8100>       Bytecode index = 0, inlineDepth = 0, PC offset = 0x00000000000000A8
<10a8100>       stackMap=0x000002184C0AE161, slots=I16(0x0000) parmBaseOffset=I16(0x0008), parmSlots=U16(0x0000), localBaseOffset=I16(0x0000)
<10a8100>       Described JIT temps starting at 0x000000000314A948 for IDATA(0x0000000000000000) slots
<10a8100>       JIT-RegisterMap = UDATA(0x0000000000000000)
<10a8100>       JIT-Frame-RegisterMap[0x000000000314A930] = UDATA(0x0000000000000000) (jit_rbx)
<10a8100>       JIT-Frame-RegisterMap[0x000000000314A938] = UDATA(0x0000000000000000) (jit_r9)
<10a8100> I2J values: PC = 0x00000217BFE707F0, A0 = 0x000000000314A970, walkSP = 0x000000000314A958, literals = 0x0000000001883848, JIT PC = 0x00007FF99C46AC10, pcAddress = 0x000000EAF96FF528, decomp = 0x0000000
000000000
<10a8100> Bytecode frame: bp = 0x000000000314A968, sp = 0x000000000314A958, pc = 0x00000217BFE707F0, cp = 0x00000000018824D0, arg0EA = 0x000000000314A970, flags = 0x0000000000000000
<10a8100>       Method: com/tmtravlr/soundfilters/ModifiedLWJGLOpenALLibrary.checkALError()Z !j9method 0x0000000001883848
<10a8100>       Bytecode index = 0
<10a8100>       Using local mapper
<10a8100>       Locals starting at 0x000000000314A970 for 0x0000000000000001 slots
<10a8100>               O-Slot: a0[0x000000000314A970] = 0x00000007FE8ECD90
...

The top frame is a JIT JNI call-out frame. The native stack looks like this:

...
[0x11]   ntdll!KiUserExceptionDispatcher + 0x2e   
[0x12]   j9jit29!fast_jitAcquireVMAccess + 0x1c   
[0x13]   j9jit29!jitAcquireVMAccess + 0xaf   

There is a helper call to acquire vmaccess in org/lwjgl/openal/AL10.alGetError()I so looking at that disassembly:

00007ff9`88a9d958 483b6550               cmp     rsp, qword ptr [rbp+50h]
00007ff9`88a9d95c 0f8607010000           jbe     00007ff9`88a9da69
00007ff9`88a9d962 4883ec28               sub     rsp, 28h
00007ff9`88a9d966 4c894c2418             mov     qword ptr [rsp+18h], r9
00007ff9`88a9d96b 48895c2410             mov     qword ptr [rsp+10h], rbx
00007ff9`88a9d970 4c8d542428             lea     r10, [rsp+28h]
00007ff9`88a9d975 48c7453800000000       mov     qword ptr [rbp+38h], 0
00007ff9`88a9d97d 6800000000             push    0
00007ff9`88a9d982 6800000000             push    0
00007ff9`88a9d987 48b800daa988f97f0000   mov     rax, 7FF988A9DA00h
00007ff9`88a9d991 50                     push    rax
00007ff9`88a9d992 6800000020             push    20000000h
00007ff9`88a9d997 48b830c6100300000000   mov     rax, 310C630h
00007ff9`88a9d9a1 50                     push    rax
00007ff9`88a9d9a2 48c7452806000000       mov     qword ptr [rbp+28h], 6
00007ff9`88a9d9aa 48c7453000000000       mov     qword ptr [rbp+30h], 0
00007ff9`88a9d9b2 48896520               mov     qword ptr [rbp+20h], rsp
00007ff9`88a9d9b6 488ba5000a0000         mov     rsp, qword ptr [rbp+0A00h]
00007ff9`88a9d9bd 55                     push    rbp
00007ff9`88a9d9be 4883ec28               sub     rsp, 28h
00007ff9`88a9d9c2 488bcd                 mov     rcx, rbp
00007ff9`88a9d9c5 48ba00208e0200000000   mov     rdx, 28E2000h
00007ff9`88a9d9cf 4883c230               add     rdx, 30h
00007ff9`88a9d9d3 48c785b809000001000000 mov     qword ptr [rbp+9B8h], 1
00007ff9`88a9d9de 83bd9800000020         cmp     dword ptr [rbp+98h], 20h
00007ff9`88a9d9e5 0f8577000000           jne     00007ff9`88a9da62
00007ff9`88a9d9eb 488b6c2458             mov     rbp, qword ptr [rsp+58h]
00007ff9`88a9d9f0 49bb904c904c00800100   mov     r11, 180004C904C90h
00007ff9`88a9d9fa 0000                   add     byte ptr [rax], al
00007ff9`88a9d9fc d3488b                 ror     dword ptr [rax-75h], cl
00007ff9`88a9d9ff c84883c4               enter   8348h, 0C4h
00007ff9`88a9da03 285d48                 sub     byte ptr [rbp+48h], bl
00007ff9`88a9da06 c785b809000000000000   mov     dword ptr [rbp+9B8h], 0
00007ff9`88a9da10 83bd9800000020         cmp     dword ptr [rbp+98h], 20h
00007ff9`88a9da17 7542                   jne     00007ff9`88a9da5b
00007ff9`88a9da19 4889a5000a0000         mov     qword ptr [rbp+0A00h], rsp
00007ff9`88a9da20 488b6520               mov     rsp, qword ptr [rbp+20h]
00007ff9`88a9da24 48036530               add     rsp, qword ptr [rbp+30h]
00007ff9`88a9da28 48f744240800000300     test    qword ptr [rsp+8], 30000h
00007ff9`88a9da31 7521                   jne     00007ff9`88a9da54
00007ff9`88a9da33 4883c428               add     rsp, 28h
00007ff9`88a9da37 48837d4800             cmp     qword ptr [rbp+48h], 0
00007ff9`88a9da3c 753a                   jne     00007ff9`88a9da78
00007ff9`88a9da3e 488bc1                 mov     rax, rcx
00007ff9`88a9da41 89442424               mov     dword ptr [rsp+24h], eax
00007ff9`88a9da45 4c8b4c2418             mov     r9, qword ptr [rsp+18h]
00007ff9`88a9da4a 488b5c2410             mov     rbx, qword ptr [rsp+10h]
00007ff9`88a9da4f 4883c428               add     rsp, 28h
00007ff9`88a9da53 c3                     ret     
00007ff9`88a9da54 e897629c13             call    j9jit29!jitCollapseJNIReferenceFrame (00007ff9`9c463cf0)
00007ff9`88a9da59 ebd8                   jmp     00007ff9`88a9da33
00007ff9`88a9da5b e8f0cc9c13             call    j9jit29!jitAcquireVMAccess (00007ff9`9c46a750)
00007ff9`88a9da60 ebb7                   jmp     00007ff9`88a9da19
00007ff9`88a9da62 e88b3b4113             call    j9jit29!jitReleaseVMAccess (00007ff9`9beb15f2)
00007ff9`88a9da67 eb82                   jmp     00007ff9`88a9d9eb
00007ff9`88a9da69 bf28000000             mov     edi, 28h
00007ff9`88a9da6e e8fd549c13             call    j9jit29!jitStackOverflow (00007ff9`9c462f70)
00007ff9`88a9da73 e9eafeffff             jmp     00007ff9`88a9d962
00007ff9`88a9da78 e8e3a99c13             call    j9jit29!jitThrowCurrentException (00007ff9`9c468460)

this bit looks very wrong:

00007ff9`88a9d9f0 49bb904c904c00800100   mov     r11, 180004C904C90h
00007ff9`88a9d9fa 0000                   add     byte ptr [rax], al
00007ff9`88a9d9fc d3488b                 ror     dword ptr [rax-75h], cl
00007ff9`88a9d9ff c84883c4               enter   8348h, 0C4h
00007ff9`88a9da03 285d48                 sub     byte ptr [rbp+48h], bl
00007ff9`88a9da06 c785b809000000000000   mov     dword ptr [rbp+9B8h], 0

The codegen generates this

TR::X86RegInstruction *patchedInstr=
generateRegImm64Instruction(
TR::InstOpCode::MOV8RegImm64,
callNode,
_JNIDispatchInfo.dispatchTrampolineRegister,
targetAddress,
cg(), reloTypes[reloType]);
TR::X86RegInstruction *instr = generateRegInstruction(
TR::InstOpCode::CALLReg,
callNode,
_JNIDispatchInfo.dispatchTrampolineRegister,
_JNIDispatchInfo.callPostDeps,
cg());

In the rtlog, I see this:

TR_JNITargetAddress 0000021832008A46
        size 22 type 12 flags 0 reloFlags 0
        inlined site index FFFFFFFFFFFFFFFF
        constant pool 0000013DB10E3758
        cpIndex 0000000000000013
        reloLocation: from 0000021832008A66 at 00007FF988A9D9F2 (offset d2)
                computeNewConstantPool: newCP 000000000310D190
        getMethodFromCP: found static method 000000000310C630
        JNI call relocation: found JNI target address 0000000180004C90
                applyRelocation: registered JNI Call redefinition site

where the 0000000180004C90 can be seen in mov r11, 180004C904C90h.

What's confusing though is that according to this log, the relocation was performed at 00007FF988A9D9F2, which is true:
00007FF988A9D9F2 000180004C904C90`

However there's two issues:
1. According to the vlog, the start of the method is 0x00007FF988A9D958 which is after 0x00007FF988A9D9F2.
2. If the relo infra patched 0x00007FF988A9D9F2 then why is 0x00007ff988a9d9f0 fubar.

EDIT: Clearly losing my mind. 0x00007FF988A9D958 is NOT after 0x00007FF988A9D9F2. I guess the relo seems ok in terms of materializing the address and everything; maybe the offset it's supposed to patch is wrong?

I'll have to look into this some more.

@dsouzai
Copy link
Contributor

dsouzai commented Jan 25, 2022

I looked at the AOT code as stored in the SCC, and this is what it's supposed to look like:

00000218`32008d57 488b6c2458             mov     rbp, qword ptr [rsp+58h]
00000218`32008d5c 49bb904c008001000000   mov     r11, offset (00000001`80004c90)
00000218`32008d66 41ffd3                 call    r11
00000218`32008d69 488bc8                 mov     rcx, rax
00000218`32008d6c 4883c428               add     rsp, 28h
00000218`32008d70 5d                     pop     rbp
00000218`32008d71 48c785b809000000000000 mov     qword ptr [rbp+9B8h], 0

Looking at the bad code in the code cache:

00007ff9`88a9d9f0 49bb904c904c00800100   mov     r11, 180004C904C90h
00007ff9`88a9d9fa 0000                   add     byte ptr [rax], al

it looks like the the JNI target address was patched at 00007ff988a9d9f4(offset d4) instead of00007ff988a9d9f2 (offset d2). However, the relo infra claims that it patched it at offset d2. So I guess there's (at least) two possibilities here:

  1. What the relo infra printed out is different from what it actually did
  2. The relo infra patched the right address, but then something else patched d4 (maybe, for example, the TR_PatchJNICallSite runtime assumption.).

I don't know how much I believe in the TR_PatchJNICallSite idea because the address to be patched is the same address supposedly used as the location to relocate.

@dsouzai
Copy link
Contributor

dsouzai commented Jan 26, 2022

Well I figured out what's wrong; I don't know how we didn't hit this sooner...

As can be seen in the disassembly, the address of the mov instruction is 00007ff988a9d9f0. As acquired from the reloLog, the value of reloLocation is 00007ff988a9d9f2 which is the immediate field of the mov instruction. Now, as part of relocation, we register a runtime assumption for this JNI call site:

createJNICallSite((void *)ramMethod, (void *)reloLocation,getMetadataAssumptionList(reloRuntime->exceptionTable()));
RELO_LOG(reloRuntime->reloLogger(), 6, "\t\tapplyRelocation: registered JNI Call redefinition site\n");

If we go and look at what happens when the assumption is compensated:

_patchJNICallSite((J9Method*)getKey(), getPc(), (uint8_t*) newAddress, fe, isSMP);

// The code in directToJNI would look like
// 49 bb ?? ?? ?? ?? ?? ?? ?? ?? mov r11, 0xaddress <--we patch here
// 41 ff d3 call r11
*(uintptr_t*)(callPC+2) = (uintptr_t)newAddress;

callPC here is going to be the value of reloLocation which points to the immediate field. However, _patchJNICallSite assumes that the address to patch is the start of the mov instruction. So the end result is we end up patching at 00007ff988a9d9f2+2 = 00007ff988a9d9f4, which is what we're seeing in the JIT'd code at the time of the crash.

I don't think the fix is trivial though; the relo code is common across all platforms so I can't just update the value of reloLocation in applyRelocation (unless I do something hacky like an ifdef). I'll have to look at the code to see how best to implement the solution.

@dsouzai
Copy link
Contributor

dsouzai commented Jan 26, 2022

@siyy123 in the meantime, I suppose you could run with -Xaot:disableDirectToJNIInline (after destroying any current SCC). If that doesn't work, then you'll have to do -Xaot:disableDirectToJNI (after destroying the SCC).

@ghost
Copy link
Author

ghost commented Jan 27, 2022

-Xaot:disableDirectToJNIInline seems to work for me. It would not core dump anymore for both -Xquickstart and -Xaot:count=0 options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:jit segfault Issues that describe segfaults / JVM crashes userRaised
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants