Intermittent failures in omrsysinfo_get_process_start_time #7469

Open
tajila opened this issue Sep 23, 2024 · 5 comments

tajila commented Sep 23, 2024

omrsysinfo_get_process_start_time is used in CRIUSupport to determine the start time of the restored process. We are seeing intermittent failures as a result of using this API.

Stack Dump = io.openliberty.checkpoint.internal.criu.CheckpointFailedException: j9sysinfo_get_process_start_time failed with errno=-384
        at io.openliberty.checkpoint.internal.openj9.ExecuteCRIU_OpenJ9.dump(ExecuteCRIU_OpenJ9.java:63)
        at io.openliberty.checkpoint.internal.CheckpointImpl.checkpoint(CheckpointImpl.java:396)
        at io.openliberty.checkpoint.internal.CheckpointImpl.checkpointOrExitOnFailure(CheckpointImpl.java:303)
        at io.openliberty.checkpoint.internal.CheckpointImpl.check(CheckpointImpl.java:297)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
        at com.ibm.ws.kernel.feature.internal.FeatureManager.checkServerReady(FeatureManager.java:868)
        at com.ibm.ws.kernel.feature.internal.FeatureManager.update(FeatureManager.java:829)
        at com.ibm.ws.kernel.feature.internal.FeatureManager.processFeatureChanges(FeatureManager.java:931)
        at com.ibm.ws.kernel.feature.internal.FeatureManager$1.run(FeatureManager.java:714)
        at com.ibm.ws.threading.internal.ExecutorServiceImpl$RunnableWrapper.run(ExecutorServiceImpl.java:298)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1595)
Caused by: org.eclipse.openj9.criu.SystemRestoreException: j9sysinfo_get_process_start_time failed with errno=-384
        at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(CRIUSupport.java:538)
        at io.openliberty.checkpoint.internal.openj9.ExecuteCRIU_OpenJ9.dump(ExecuteCRIU_OpenJ9.java:55)
        ... 12 more
Caused by: openj9.internal.criu.SystemRestoreException: j9sysinfo_get_process_start_time failed with errno=-384
        at java.base/openj9.internal.criu.InternalCRIUSupport.checkpointJVMImpl(Native Method)
        at java.base/openj9.internal.criu.InternalCRIUSupport.checkpointJVM(InternalCRIUSupport.java:997)
        at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(CRIUSupport.java:530)
        ... 13 more

tajila commented Sep 23, 2024

@ThanHenderson Could you please take a look at this?

tajila commented Sep 23, 2024

It's likely that this issue can be recreated without Liberty. I'm still getting more info on the environment where this issue was seen, but I would start with RHEL 9 and Ubuntu 22+.

tajila commented Sep 23, 2024

Just confirming that the issue is seen on Ubuntu 22.

@ThanHenderson

> seeing intermittent failures

How intermittent is it (what is the occurrence rate)? Did we just start seeing this? And is it only happening in this benchmark?

> likely possible that this issue can be recreated without Liberty

We can likely create a simpler test where we see the failure; I haven't seen this through my simple testing of the checkpointJVM APIs. It'd be nice to know the command which was run that resulted in this test failure.

> im still getting more info on the environment

Was any other diagnostic data collected from the runs with this failure?

@ThanHenderson

These failures -- and similar failures in the pipeline e.g. failing a process existence check -- happen (intermittently) only in criu restore invocations that pass in --restore-detached (info found here).

To get the process start time, we retrieve the PID of the parent process (criu restore ...), check that it exists, then call libc's stat(). Either step can fail with --restore-detached because the parent process (criu restore ...) exits immediately after execution.
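
For anyone trying to reproduce this outside the JVM, the two steps above can be sketched roughly as follows. This is not the OMR implementation: the function name, the error codes, the use of kill(pid, 0) as the existence check, and the choice of /proc/<pid> as the stat() target (with its modification time standing in for the start time) are all assumptions made for illustration on Linux.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not the OMR port library values. */
enum {
    START_TIME_OK = 0,
    START_TIME_NO_SUCH_PROCESS = -1,
    START_TIME_STAT_FAILED = -2
};

/*
 * Hypothetical helper: look up an approximate start time for `pid`.
 * Assumption: the modification time of /proc/<pid> approximates the
 * time the process was created.
 */
static int process_start_time_ns(pid_t pid, uint64_t *start_ns)
{
    char proc_dir[64];
    struct stat st;

    /* Step 1: existence check. Signal 0 delivers nothing but still
     * validates the PID. EPERM means the process exists but is owned
     * by another user, so only other errors count as "not there". */
    if ((kill(pid, 0) != 0) && (errno != EPERM)) {
        return START_TIME_NO_SUCH_PROCESS;
    }

    /* Step 2: stat the /proc entry. The process can exit between
     * step 1 and this call, so this step can fail on its own. */
    snprintf(proc_dir, sizeof(proc_dir), "/proc/%ld", (long)pid);
    if (stat(proc_dir, &st) != 0) {
        return START_TIME_STAT_FAILED;
    }

    *start_ns = ((uint64_t)st.st_mtim.tv_sec * 1000000000ULL)
              + (uint64_t)st.st_mtim.tv_nsec;
    return START_TIME_OK;
}
```

Calling this with getppid() from a process whose parent has already exited should hit one of the two failure paths depending on timing, which would be consistent with the failure being intermittent.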

We currently do not have a way of detecting whether the criu restore command has this option passed in. To reliably support this feature for criu, we should have a mechanism that stores some state indicating that this was passed in, and avoid checking for the process start time.

In the meantime, we can restrict getting the restore process start time to CRaC restores that aren't affected.
