-Xshareclasses by default causes a number of issues #3333
Comments
Is there any way to run some of these with |
Note that except for eclipse/omr#3076, the problems occurred when compiling the functional tests, and not in an actual test case. The best thing would be to set it up on a local machine and try to recreate it there manually. Then you'll have the shared cache which caused the problem, and can also try setting environment variables to try different options. |
Note that @hangshao0 was trying to recreate #3319 and only found one internal jenkins machine where it occurred. |
I tried running locally, on ub16hcxrt2, with forceAOT, always using the "new" AOT code, always using the "old" AOT code, using an OpenJ9 build of a SHA right before the reverts, using the build from https://ci.eclipse.org/openj9/job/Test-sanity.functional-JDK11-linux_x86-64_cmprssptrs/100/console (https://ci.eclipse.org/openj9/job/Test-sanity.functional-JDK11-linux_x86-64_cmprssptrs/100/fingerprints/). I can't get it to reproduce at all. What makes me think it's not the new AOT changes is that this only happens on xLinux, whereas the new AOT changes are enabled on 64-bit x86, which includes Windows. What makes me think it could be the new AOT changes is the 3 minor bugs @jdmpapin found in the code (which he's going to fix as part of this bigger code cleanup work). I figured if I could reproduce the issue I'd try fixing the bugs to see if the issue goes away, but I can't reproduce it at all. |
@dsouzai maybe you already have, but if not please check with @hangshao0 as he was able to reproduce it on one machine. |
ub16hcxrt2 is the machine on which I reproduced this failure yesterday (4 out of 5 runs failed). I have slacked @dsouzai the link to one of these failures. |
@hangshao0 thanks. Please help check what @dsouzai is doing to see if it can be repeated again. |
@dchopra001 note @DanHeidinga was also seeing problems on macOS. |
@dsouzai, you'd better clean up the existing shared cache on the system each time before running. |
Yup I've been doing that. The job @hangshao0 linked me used a different JVM than what I was running, so I'll try running with that tomorrow. |
openjdk-tests sha:
and I used
I ran it on |
I did however hit an issue on a test I was running:
From jdmpview:
I talked to @0dvictor, who said that the I tried getting a log, but I couldn't reproduce the issue; @0dvictor, do you know where the code that assigns a register instead of saving on the stack is? This particular issue isn't AOT related, but I've only seen it happen with the "new" AOT code heh. It's possible that this is also the cause of the other problems seen. |
@jdmpapin pointed out that in
we're not moving an object into |
This code seems to be part of a write barrier. |
eclipse/omr#3125 should fix the above GC map issue. |
eclipse/omr#3125 is merged. @hangshao0 can you please create a new PR to enable bootstrap class sharing by default. I can't revert #3337 automatically. |
Btw, eclipse/omr#3125 was the most likely culprit for the AOT issue, but I'm not 100% certain it was the only cause of the issues we were seeing, so it would be worth running a few grinders on misc jobs to make sure. |
Right, I'll be running many sanity tests on the new PR before merging, we can try the grinders as well. |
There were two PRs originally; I will merge them into a single PR. I need to run some manual tests locally, and I will create the PR once that is done. |
I am still seeing issue eclipse/omr#3076. This time it is on linux ppcle. |
OK, I know what is going on. I have manually re-created the failure on one of the internal machines (ub14levm8). The HandleSIGXFSZ failure is not caused by AOT. The ulimit setting of the failed machine is
So creating any file larger than the maximum file size limit will result in signal SIGXFSZ. Class SignalXfszTest is not loaded yet, so we won't see "Starting SignalXfszTest". We should probably increase the maximum file size limit on the testing machines; 4MB is too small. |
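The mechanism described above can be sketched outside the JVM. This is a minimal illustration in Python, not the OpenJ9 test itself: the function name `run_writer`, the file name `cache.bin`, and the byte values are all hypothetical, and it assumes a Unix-like system where the `resource` module is available. The child process lowers its own file size limit, restores the default SIGXFSZ action (CPython ignores that signal at startup), and then writes a file.

```python
import os
import subprocess
import sys
import tempfile

# Code run in a child process so the lowered limit never affects the caller.
CHILD = """
import os, resource, signal, sys
path, limit, size = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
# CPython sets SIGXFSZ to "ignore" at startup; restore the default action
# (terminate the process) so the signal is observable.
signal.signal(signal.SIGXFSZ, signal.SIG_DFL)
# Avoid leaving a core file behind when the signal kills the process.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
# Lower the maximum file size for this process only (value in bytes).
resource.setrlimit(resource.RLIMIT_FSIZE, (limit, limit))
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
data = b"x" * size
while data:
    written = os.write(fd, data)  # may be a partial write at the limit
    data = data[written:]
os.close(fd)
"""


def run_writer(limit, size):
    """Write `size` bytes under a file size limit of `limit` bytes.

    Returns the child's return code: 0 on success, or the negated
    signal number if the child was killed by a signal.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cache.bin")  # hypothetical file name
        proc = subprocess.run(
            [sys.executable, "-c", CHILD, path, str(limit), str(size)]
        )
        return proc.returncode
```

A write that stays under the limit exits normally, while one that exceeds it ends with the child being terminated by SIGXFSZ, which `subprocess` reports as a negative return code. This is consistent with the failure pattern described in the thread, where the process is signalled before the test class has a chance to print anything.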
Heh interesting; how come the other tests pass though? Wasn't the SCC created in test 1 - "Testing Default" (which didn't have |
No. It was not created. When shared classes is enabled by default, nonfatal is also on, which means we will proceed ignoring the SCC creation failure. |
Ohh I see that makes sense; also I realized I was wrong above:
|
I noticed that the file size limit of 4000 is set by the test itself; it is not the default setting on the machine. So adding -Xshareclasses:none to the commands in sigxfszHandlingTest should solve this problem. After this change, the test still tests what it was testing before. |
works for me. Please add a comment to explain why -Xshareclasses:none is specified. |
PR #3713 open to fix sigxfszHandlingTest |
This should not have been closed by the #3713 commit. |
I am copying the links to the OSX failures to this issue.
https://ci.eclipse.org/openj9/job/PullRequest-Sanity-JDK11-osx_x86-64_cmprssptrs-OpenJ9/35/ |
I got access to osxvm2, and I'm able to reproduce the issue (fairly intermittently); however I'm a bit blocked as I can't generate any cores. I'm following up on that, but this issue will likely take some time. From what I can tell, this issue happens with "old AOT"; why it only happens on osx is a complete mystery...I really hope it isn't caused by some difference in behaviour due to a different compiler. |
I didn't have time all week to look at this issue. However, when I last left off my investigation, I managed to get core files and some logging; I'm going to have to go through them to see what the problem is. As far as I can tell, the AOT problem isn't a new one (since it happens with "old" AOT); we've just never really tested osx prior to OpenJ9. |
@jdmpapin and I worked on this yesterday, and we think we know what the problem is. It isn't osx specific, but somehow only manifests itself on osx 😕 I had noticed that the problem only occurred when the JIT code cache address started in the
Thread 6 looked like it was a PC in the JIT codecache, so we took a look at that:
The method that PC belongs to, from the vlog, was:
Disassembling that method:
The instruction being executed is implementing a jump table. Looking at the relolog:
The address relocated at
the three addresses in the jump table are exactly the first three relocations that were applied. The fourth relocation, which is applied in the middle of the The problem occurs because:
Thus, when the address is larger than A secondary problem is that the relocation is done in the following manner: @jdmpapin is working on a fix for this; because the jmp table is going to be a known location, the idea is to do something like
This would remove the need for a relocation. |
|
@jdmpapin now that the omr change is merged, is there further work here? |
I don't think so, unless we see more issues when enabling the SCC by default on OSX. I am seeing a bunch of
issues on osx sanity PR builds, so maybe this issue needs to stay open to address those? |
The error above is a machine issue with the non-persistent cache. The default cache type is persistent on OSX. |
I see, well I suppose a PR enabling the SCC by default on OSX can be opened and tested; if there are no problems (besides the known ones), the PR can be merged and this issue can be closed. |
Closes eclipse-openj9#3333 Signed-off-by: hangshao <[email protected]>
When -Xshareclasses:bootClassesOnly was enabled by default on xlinux jdk11, a number of issues occurred which I believe are all caused by AOT code in the shared cache.
eclipse/omr#3075 / #3319
eclipse/omr#3076
eclipse/omr#3077
#3316