Disable SCC/AOT and reduce counts when running with CRIU in containers #17253

Open
mpirvu opened this issue Apr 24, 2023 · 11 comments
Labels
comp:jit, criu (Used to track CRIU snapshot related work)

Comments

@mpirvu
Contributor

mpirvu commented Apr 24, 2023

Typically, when creating a container based on OpenJ9 we embed a shared class cache into the container to improve start-up time.
CRIU does a much better job of improving start-up time than the SCC/AOT. Moreover, it has been observed that using -Xshareclasses:none leads to a small footprint reduction.
This issue proposes to disable the SCC (or just AOT) when creating containers that can take advantage of CRIU. Moreover, so as not to affect application ramp-up, we might need to bring the invocation counts to the same values as those used for AOT (1000/250); see the sketch below.
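
For reference, a rough command-line approximation of this proposal (a sketch only; the jar name is a placeholder and the count values are the AOT-style values mentioned above, not tuned recommendations) would be:

java -Xshareclasses:none -Xjit:count=1000,bcount=250 -jar app.jar

Here -Xshareclasses:none disables the shared class cache, while count and bcount set the invocation thresholds for methods without and with loops, respectively.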

mpirvu added the comp:jit and criu labels on Apr 24, 2023
@mpirvu
Contributor Author

mpirvu commented Apr 24, 2023

Attn: @dsouzai

@dsouzai
Contributor

dsouzai commented Apr 24, 2023

I created a build with the following change:

diff --git a/runtime/compiler/control/CompilationThread.cpp b/runtime/compiler/control/CompilationThread.cpp
index ef609eaa7..7c8296602 100644
--- a/runtime/compiler/control/CompilationThread.cpp
+++ b/runtime/compiler/control/CompilationThread.cpp
@@ -7559,7 +7559,12 @@ TR::CompilationInfoPerThreadBase::preCompilationTasks(J9VMThread * vmThread,
        && !entry->_doNotUseAotCodeFromSharedCache
        && !TR::Options::getAOTCmdLineOptions()->getOption(TR_NoLoadAOT)
        && !(_jitConfig->runtimeFlags & J9JIT_TOSS_CODE)
-       && !_jitConfig->inlineFieldWatches)
+       && !_jitConfig->inlineFieldWatches
+#if defined(J9VM_OPT_CRIU_SUPPORT)
+       && (!_jitConfig->javaVM->internalVMFunctions->isCheckpointAllowed(vmThread)
+           || !_jitConfig->javaVM->internalVMFunctions->isNonPortableRestoreMode(vmThread))
+#endif /* defined(J9VM_OPT_CRIU_SUPPORT) */
+       )
       {
       // Determine whether the compilation filters allows me to relocate
       // Filters should not be applied to out-of-process compilations
@@ -7765,6 +7770,11 @@ TR::CompilationInfoPerThreadBase::preCompilationTasks(J9VMThread * vmThread,
             // method to be compiled
             && (NULL != fe->sharedCache()->rememberClass(J9_CLASS_FROM_METHOD(method)))

+#if defined(J9VM_OPT_CRIU_SUPPORT)
+            && (!_jitConfig->javaVM->internalVMFunctions->isCheckpointAllowed(vmThread)
+                || !_jitConfig->javaVM->internalVMFunctions->isNonPortableRestoreMode(vmThread))
+#endif /* defined(J9VM_OPT_CRIU_SUPPORT) */
+
             // Do not perform AOT compilation if field watch is enabled; there
             // is no benefit to having an AOT body with field watch as it increases
             // the validation complexity, and in case the fields being watched changes,

At first I thought I might need to do something about the counts, but in my container runs with OpenLiberty, because an SCC is specified, the counts are already set to 1000/250. These are the results:

Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_baseline
StartupTime     avg= 809        min= 769        max= 857        stdDev=20.3     maxVar=11.4%    confInt=0.52%   samples= 64
Application     avg= 206        min= 196        max= 220        stdDev= 7.3     maxVar=12.2%    confInt=0.74%   samples= 64
FirstResponse   avg= 909        min= 870        max= 955        stdDev=20.3     maxVar=9.8%     confInt=0.47%   samples= 64
Footprint       avg=1048576     min=1048576     max=1048576     stdDev= 0.0     maxVar=0.0%     confInt=0.00%   samples= 64
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_nojit
StartupTime     avg= 823        min= 769        max= 882        stdDev=24.1     maxVar=14.7%    confInt=0.61%   samples= 64
Application     avg= 218        min= 204        max= 252        stdDev=12.5     maxVar=23.5%    confInt=1.21%   samples= 63
        Outlier values:  256
FirstResponse   avg= 943        min= 893        max=1009        stdDev=26.5     maxVar=13.0%    confInt=0.59%   samples= 64
Footprint       avg=1048576     min=1048576     max=1048576     stdDev= 0.0     maxVar=0.0%     confInt=0.00%   samples= 64

There seems to be a ~1% slowdown in startup and a ~3% slowdown in first response. The likely reason is that although the counts are the same as for AOT compilations, an AOT load occurs at scount=20, so there's likely a lot more code loaded from the SCC than gets compiled at 1000/250.
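
One way to check this (a sketch; the vlog path is arbitrary and the exact log format may differ between releases) is to collect a JIT verbose log from both builds and compare how many methods are AOT-loaded versus JIT-compiled:

java -Xjit:verbose,vlog=/tmp/vlog ...   # same application/launch command as the measured runs
grep -c "AOT load" /tmp/vlog*           # successful compiles are logged with a leading "+"; AOT loads are marked "(AOT load)"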

@mpirvu
Contributor Author

mpirvu commented Apr 24, 2023

Could you run throughput runs with this setup? The footprint savings after load is what we wanted to achieve with these changes.

@dsouzai
Contributor

dsouzai commented Apr 25, 2023

Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_baseline
Throughput      avg=18009       min=17799       max=18283       stdDev=233.9    maxVar=2.7%     confInt=1.24%   samples=  5
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_nojit
Throughput      avg=17863       min=17499       max=18284       stdDev=360.2    maxVar=4.5%     confInt=1.92%   samples=  5

Doesn't look like there's any throughput issue (considering the confidence interval).

FWIW, this throughput run works by running pingperf and driving it with wrk in the following way:

Warmup run 0
Running numactl --physcpubind=16-31,48-63 --membind=1 /home/development/dsouzai/scripts/wrk/wrk -t40 -c40 -d60s http://localhost:9080/pingperf/ping/greeting
Warmup run 1
Running numactl --physcpubind=16-31,48-63 --membind=1 /home/development/dsouzai/scripts/wrk/wrk -t40 -c40 -d60s http://localhost:9080/pingperf/ping/greeting
Warmup run 2
Running numactl --physcpubind=16-31,48-63 --membind=1 /home/development/dsouzai/scripts/wrk/wrk -t40 -c40 -d60s http://localhost:9080/pingperf/ping/greeting
Measured run
Running numactl --physcpubind=16-31,48-63 --membind=1 /home/development/dsouzai/scripts/wrk/wrk -t40 -c40 -d60s http://localhost:9080/pingperf/ping/greeting

@mpirvu
Contributor Author

mpirvu commented Apr 25, 2023

It's good to know that throughput is not affected. How about footprint? Since footprint is the motivation for this whole idea, if the memory consumption is not lower, the appeal of this change decreases considerably.

@dsouzai
Contributor

dsouzai commented Apr 25, 2023

Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_baseline
Throughput                avg=17978     min=17325       max=18522       stdDev= 498     maxVar=6.9%     confInt=2.64%   samples=  5
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples=  5
RSS (KiB)                 avg=182698    min=181340      max=183636      stdDev=1074     maxVar=1.3%     confInt=0.69%   samples=  4
        Outlier values:  187916
Peak RSS (KiB)            avg=256807    min=252072      max=259432      stdDev=3028     maxVar=2.9%     confInt=1.12%   samples=  5
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_nojit
Throughput                avg=17487     min=17131       max=18192       stdDev= 426     maxVar=6.2%     confInt=2.32%   samples=  5
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples=  5
RSS (KiB)                 avg=179361    min=179144      max=179492      stdDev= 189     maxVar=0.2%     confInt=0.18%   samples=  3
        Outlier values:  172968 181880
Peak RSS (KiB)            avg=262873    min=249820      max=281036      stdDev=12881    maxVar=12.5%    confInt=4.67%   samples=  5

The throughput is again within the noise, but given that this is the second time the nojit build has lower throughput, I'll have to do a bigger set of runs. RSS does seem to be almost 2% lower; peak RSS is higher, but the confidence interval isn't good, so I'll see what a bigger set of runs shows.

@mpirvu
Contributor Author

mpirvu commented Apr 25, 2023

It's possible that we'll get bigger footprint advantages if we use higher counts (like those used when there is no SCC), but ramp-up may be affected in those cases.
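
A quick way to try this without another custom build (a sketch; the 3000/3000 values are only illustrative "no-SCC style" counts, not measured defaults) would be to override the counts on the command line:

java -Xjit:count=3000,bcount=3000 ...   # higher invocation thresholds, so fewer methods get compiled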

@dsouzai
Contributor

dsouzai commented Apr 26, 2023

Here's the data from a bigger set of runs:

Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_baseline
Throughput                avg=17948     min=17206       max=18399       stdDev= 298     maxVar=6.9%     confInt=0.64%   samples= 20
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 20
RSS (KiB)                 avg=184895    min=180812      max=190828      stdDev=2650     maxVar=5.5%     confInt=0.55%   samples= 20
Peak RSS (KiB)            avg=257545    min=239456      max=279100      stdDev=9935     maxVar=16.6%    confInt=1.49%   samples= 20
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_nojit
Throughput                avg=17647     min=16658       max=18264       stdDev= 403     maxVar=9.6%     confInt=0.88%   samples= 20
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 20
RSS (KiB)                 avg=176453    min=172116      max=180404      stdDev=2072     maxVar=4.8%     confInt=0.45%   samples= 20
Peak RSS (KiB)            avg=260965    min=234964      max=315116      stdDev=25048    maxVar=34.1%    confInt=3.82%   samples= 19
        Outlier values:  338060

Throughput goes down by ~1%; RSS improves by ~4%; peak RSS goes up by ~1%, but that's within the noise.

@dsouzai
Contributor

dsouzai commented Apr 27, 2023

I did a set of runs with

diff --git a/runtime/compiler/control/J9Options.cpp b/runtime/compiler/control/J9Options.cpp
index 93edd691d..a4b16203c 100644
--- a/runtime/compiler/control/J9Options.cpp
+++ b/runtime/compiler/control/J9Options.cpp
@@ -2593,6 +2593,12 @@ J9::Options::fePreProcess(void * base)
    self()->setOption(TR_EnableSymbolValidationManager);
 #endif

+#if defined(J9VM_OPT_CRIU_SUPPORT)
+   J9VMThread *curThread = vm->internalVMFunctions->currentVMThread(vm);
+   if (vm->internalVMFunctions->isCheckpointAllowed(curThread))
+      self()->setOption(TR_UseHigherMethodCounts);
+#endif
+
    return true;
    }

diff --git a/runtime/compiler/control/rossa.cpp b/runtime/compiler/control/rossa.cpp
index 26fe00d68..275b11c88 100644
--- a/runtime/compiler/control/rossa.cpp
+++ b/runtime/compiler/control/rossa.cpp
@@ -1983,6 +1983,8 @@ aboutToBootstrap(J9JavaVM * javaVM, J9JITConfig * jitConfig)
       }
 #endif

+   bool validateSCC = true;
+
 #if defined(J9VM_OPT_CRIU_SUPPORT)
    /* If the JVM is in CRIU mode and checkpointing is allowed, then the JIT should be
     * limited to the same processor features as those used in Portable AOT mode. This
@@ -1995,13 +1997,19 @@ aboutToBootstrap(J9JavaVM * javaVM, J9JITConfig * jitConfig)
       if (!J9_ARE_ANY_BITS_SET(javaVM->extendedRuntimeFlags2, J9_EXTENDED_RUNTIME2_ENABLE_PORTABLE_SHARED_CACHE))
          TR::Compiler->relocatableTarget.cpu = TR::CPU::detectRelocatable(TR::Compiler->omrPortLib);
       jitConfig->targetProcessor = TR::Compiler->target.cpu.getProcessorDescription();
+
+      static_cast<TR_JitPrivateConfig *>(jitConfig->privateConfig)->aotValidHeader = TR_no;
+      TR::Options::getAOTCmdLineOptions()->setOption(TR_NoLoadAOT);
+      TR::Options::getAOTCmdLineOptions()->setOption(TR_NoStoreAOT);
+      TR::Options::setSharedClassCache(false);
+      TR_J9SharedCache::setSharedCacheDisabledReason(TR_J9SharedCache::AOT_DISABLED);
+      validateSCC = false;
       }
 #endif /* defined(J9VM_OPT_CRIU_SUPPORT) */

 #if defined(J9VM_OPT_SHARED_CLASSES)
    if (isSharedAOT)
       {
-      bool validateSCC = true;

 #if defined(J9VM_OPT_JITSERVER)
       if (persistentInfo->getRemoteCompilationMode() == JITServer::SERVER)

(on top of the change in #17253 (comment)). While the RSS is reduced even further, the peak RSS is much higher and the throughput is much lower:

Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_baseline
Throughput                avg=17867     min=17366       max=18345       stdDev= 229     maxVar=5.6%     confInt=0.50%   samples= 20
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 20
RSS (KiB)                 avg=184178    min=179688      max=189520      stdDev=2412     maxVar=5.5%     confInt=0.51%   samples= 20
Peak RSS (KiB)            avg=252351    min=231484      max=273264      stdDev=10638    maxVar=18.0%    confInt=1.63%   samples= 20
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_nojit
Throughput                avg=17761     min=16727       max=18436       stdDev= 434     maxVar=10.2%    confInt=0.95%   samples= 20
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 20
RSS (KiB)                 avg=175654    min=171460      max=181236      stdDev=3131     maxVar=5.7%     confInt=0.69%   samples= 20
Peak RSS (KiB)            avg=263772    min=235876      max=330392      stdDev=26266    maxVar=40.1%    confInt=3.85%   samples= 20
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnojit_nojithighercounts
Throughput                avg=16545     min=15097       max=17292       stdDev= 619     maxVar=14.5%    confInt=1.53%   samples= 18
        Outlier values:  7303.4 12635.59
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 20
RSS (KiB)                 avg=165654    min=160388      max=170696      stdDev=3018     maxVar=6.4%     confInt=0.70%   samples= 20
Peak RSS (KiB)            avg=343566    min=228136      max=482372      stdDev=69261    maxVar=111.4%   confInt=7.79%   samples= 20

@dsouzai
Contributor

dsouzai commented May 12, 2023

I did a couple of experiments. First, I generated a build, called jdk_testnoaot_nodowngrade, with the change in #17253 (comment) along with

diff --git a/runtime/compiler/control/CompilationThread.cpp b/runtime/compiler/control/CompilationThread.cpp
index 7c8296602..67df318a3 100644
--- a/runtime/compiler/control/CompilationThread.cpp
+++ b/runtime/compiler/control/CompilationThread.cpp
@@ -8711,7 +8711,15 @@ TR::CompilationInfoPerThreadBase::wrappedCompile(J9PortLibrary *portLib, void *
                   aotCompilationReUpgradedToWarm = true;
                   }
                }
-
+#if defined(J9VM_OPT_CRIU_SUPPORT)
+            else if (jitConfig->javaVM->internalVMFunctions->isCheckpointAllowed(vmThread)
+                     && p->_optimizationPlan->isOptLevelDowngraded()
+                     && p->_optimizationPlan->getOptLevel() == cold)
+               {
+               p->_optimizationPlan->setOptLevel(warm);
+               p->_optimizationPlan->setOptLevelDowngraded(false);
+               }
+#endif

             TR_PersistentCHTable *cht = that->_compInfo.getPersistentInfo()->getPersistentCHTable();
             if (cht && !cht->isActive())

This is basically the same logic that upgrades an AOT compilation that was downgraded.

Next I generated a build, called jdk_testnoaot_nodowngrade_noscc, by adding the following change to the previous build:

diff --git a/runtime/compiler/control/rossa.cpp b/runtime/compiler/control/rossa.cpp
index 26fe00d68..275b11c88 100644
--- a/runtime/compiler/control/rossa.cpp
+++ b/runtime/compiler/control/rossa.cpp
@@ -1983,6 +1983,8 @@ aboutToBootstrap(J9JavaVM * javaVM, J9JITConfig * jitConfig)
       }
 #endif

+   bool validateSCC = true;
+
 #if defined(J9VM_OPT_CRIU_SUPPORT)
    /* If the JVM is in CRIU mode and checkpointing is allowed, then the JIT should be
     * limited to the same processor features as those used in Portable AOT mode. This
@@ -1995,13 +1997,19 @@ aboutToBootstrap(J9JavaVM * javaVM, J9JITConfig * jitConfig)
       if (!J9_ARE_ANY_BITS_SET(javaVM->extendedRuntimeFlags2, J9_EXTENDED_RUNTIME2_ENABLE_PORTABLE_SHARED_CACHE))
          TR::Compiler->relocatableTarget.cpu = TR::CPU::detectRelocatable(TR::Compiler->omrPortLib);
       jitConfig->targetProcessor = TR::Compiler->target.cpu.getProcessorDescription();
+
+      static_cast<TR_JitPrivateConfig *>(jitConfig->privateConfig)->aotValidHeader = TR_no;
+      TR::Options::getAOTCmdLineOptions()->setOption(TR_NoLoadAOT);
+      TR::Options::getAOTCmdLineOptions()->setOption(TR_NoStoreAOT);
+      TR::Options::setSharedClassCache(false);
+      TR_J9SharedCache::setSharedCacheDisabledReason(TR_J9SharedCache::AOT_DISABLED);
+      validateSCC = false;
       }
 #endif /* defined(J9VM_OPT_CRIU_SUPPORT) */

 #if defined(J9VM_OPT_SHARED_CLASSES)
    if (isSharedAOT)
       {
-      bool validateSCC = true;

 #if defined(J9VM_OPT_JITSERVER)
       if (persistentInfo->getRemoteCompilationMode() == JITServer::SERVER)

This makes it so that the compiler can't interface with the SCC for any reason. However, it doesn't affect the counts (except, I believe, for the scount: we will never give a method a count of 20 because we can no longer check the SCC to see whether the method exists there).

Also, jdk_testnoaot_noaot is a build with only the change in #17253 (comment).


These are the results:

Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnoaot_baseline
Throughput                avg=17856     min=16298       max=18524       stdDev= 577     maxVar=13.7%    confInt=1.47%   samples= 15
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 15
RSS (KiB)                 avg=184570    min=181636      max=190824      stdDev=2435     maxVar=5.1%     confInt=0.60%   samples= 15
Peak RSS (KiB)            avg=260605    min=248436      max=274632      stdDev=6309     maxVar=10.5%    confInt=1.10%   samples= 15
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnoaot_noaot
Throughput                avg=17679     min=17244       max=18190       stdDev= 283     maxVar=5.5%     confInt=0.73%   samples= 15
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 15
RSS (KiB)                 avg=175781    min=172336      max=179028      stdDev=2075     maxVar=3.9%     confInt=0.54%   samples= 15
Peak RSS (KiB)            avg=248088    min=234736      max=268960      stdDev=9514     maxVar=14.6%    confInt=1.82%   samples= 14
        Outlier values:  322976
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnoaot_nodowngrade
Throughput                avg=17660     min=16967       max=18034       stdDev= 394     maxVar=6.3%     confInt=1.01%   samples= 15
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 15
RSS (KiB)                 avg=184889    min=181648      max=190928      stdDev=3181     maxVar=5.1%     confInt=0.78%   samples= 15
Peak RSS (KiB)            avg=243050    min=228664      max=266324      stdDev=11871    maxVar=16.5%    confInt=2.53%   samples= 12
        Outlier values:  381312 398072 416796

and

Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnoaot_nodowngrade
Throughput                avg=17412     min=16894       max=17892       stdDev= 284     maxVar=5.9%     confInt=0.74%   samples= 15
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 15
RSS (KiB)                 avg=186589    min=181956      max=189364      stdDev=2594     maxVar=4.1%     confInt=0.63%   samples= 15
Peak RSS (KiB)            avg=337442    min=229540      max=441168      stdDev=76258    maxVar=92.2%    confInt=10.28%  samples= 15
Results for ol-instanton-test-pingperf-restore-deployment:jdk_testnoaot_nodowngrade_noscc
Throughput                avg=16752     min=15891       max=17400       stdDev= 339     maxVar=9.5%     confInt=0.96%   samples= 14
        Outlier values:  15748.32
ContainerFootprint (KiB)  avg=1048576   min=1048576     max=1048576     stdDev=   0     maxVar=0.0%     confInt=0.00%   samples= 15
RSS (KiB)                 avg=168518    min=165388      max=171672      stdDev=2209     maxVar=3.8%     confInt=0.60%   samples= 15
Peak RSS (KiB)            avg=389312    min=327544      max=471868      stdDev=37285    maxVar=44.1%    confInt=4.53%   samples= 14
        Outlier values:  241952

Having methods compiled at warm (which also includes GCR trees) doesn't improve the throughput.

In terms of footprint, it appears that the majority of the improvement comes from not reading the SCC for any reason. In jdk_testnoaot_nodowngrade, I suspect we still load IProfiler information from the SCC, as well as perhaps some class chain info; this can't happen in jdk_testnoaot_nodowngrade_noscc.

It's possible that jdk_testnoaot_nodowngrade benefits from the scount=20 that occurs in jitHookInitializeSendTarget for methods that are in the SCC (even though we don't load them). It may also be that three warm-up runs of 60s are not sufficient. My next step will be to look at what methods are compiled and why the AOT methods seem to matter so much for throughput, as sketched below.
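
If verbose logs are collected as in the earlier sketch, one way to do that comparison (file names are placeholders and the field extraction may need adjusting for the exact vlog format) is to diff the sets of compiled methods between a baseline run and a no-SCC run:

# keep only successful compilation lines, strip code addresses, and compare the method sets
grep "^+" vlog.baseline | sed 's/ @ .*//' | sort -u > methods.baseline
grep "^+" vlog.noscc    | sed 's/ @ .*//' | sort -u > methods.noscc
diff methods.baseline methods.noscc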

@vijaysun-omr
Contributor

Is it possible that getting IProfiler info from the SCC resulted in (a) a larger footprint because we inlined more and (b) better throughput because we have more and/or better profiling info in the SCC? I.e., when we remove the SCC altogether, we see a lower footprint but also lower throughput (because of less inlining)?
