Ignore all Java exceptions when looking for Linux musl support #7844

Merged (1 commit) on Apr 28, 2022

Contributor

mallman commented Apr 25, 2022

Ignore all Java exceptions when looking for Linux musl support, not just IOException.

The current (1.6.0) implementation of the musl support detection code catches and ignores exceptions of type java.io.IOException. However, in our case the detection code is throwing an instance of java.io.UncheckedIOException, which is not a subtype of java.io.IOException. As a result, XGBoost fails to load.

To resolve this issue, this PR catches and ignores all exceptions. As our experience shows, I don't think we should or can assume that every failure of this code will be a particular subtype of java.lang.Exception.
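To make the difference concrete, here is a minimal sketch of the two handling styles. It is not the actual `NativeLibLoader` code: the `Probe` interface, the `detectNarrow`/`detectBroad` helpers, and the simplified "name contains musl" check are all hypothetical stand-ins. The only facts it relies on are that `Files.list` returns a stream whose iteration reports I/O failures as `java.io.UncheckedIOException`, and that this class extends `RuntimeException` rather than `IOException`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class MuslDetectionSketch {

    // Hypothetical abstraction over "run the detection and report the result".
    interface Probe {
        boolean run() throws IOException;
    }

    // Simplified stand-in for the real check: look for "musl" in entry names.
    // Opening the directory can throw IOException; iterating the stream can
    // throw UncheckedIOException instead.
    static boolean scan(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(Path::toString).anyMatch(s -> s.contains("musl"));
        }
    }

    // Narrow handler, in the style described for 1.6.0: only IOException is
    // swallowed, so an UncheckedIOException propagates to the caller.
    static boolean detectNarrow(Probe probe) {
        try {
            return probe.run();
        } catch (IOException ignored) {
            return false;
        }
    }

    // Broad handler, in the style of this PR: any failure means "not musl".
    static boolean detectBroad(Probe probe) {
        try {
            return probe.run();
        } catch (Exception ignored) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get("/proc/self/map_files");
        System.out.println("musl detected: " + detectBroad(() -> scan(dir)));
    }
}
```

With the narrow handler, a failure during stream iteration would escape and, in a static initializer, surface as an `ExceptionInInitializerError` like the one in the trace below.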

Here is our complete exception stack trace:

java.lang.ExceptionInInitializerError
	at ml.dmlc.xgboost4j.java.DMatrix.(DMatrix.java:54)
	at ml.dmlc.xgboost4j.scala.DMatrix.(DMatrix.scala:43)
	at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:574)
	at ml.dmlc.xgboost4j.scala.spark.PreXGBoost$.$anonfun$trainForNonRanking$1(PreXGBoost.scala:480)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.UncheckedIOException: java.nio.file.FileSystemException: /proc/self/map_files: Operation not permitted
	at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:37)
	... 32 more
Caused by: java.io.UncheckedIOException: java.nio.file.FileSystemException: /proc/self/map_files: Operation not permitted
	at java.nio.file.Files$2.hasNext(Files.java:3462)
	at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
	at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
	at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464)
	at ml.dmlc.xgboost4j.java.NativeLibLoader$OS.isMuslBased(NativeLibLoader.java:95)
	at ml.dmlc.xgboost4j.java.NativeLibLoader$OS.detectOS(NativeLibLoader.java:76)
	at ml.dmlc.xgboost4j.java.NativeLibLoader.initXGBoost(NativeLibLoader.java:169)
	at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:34)
	... 32 more
Caused by: java.nio.file.FileSystemException: /proc/self/map_files: Operation not permitted
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
	at sun.nio.fs.UnixException.asIOException(UnixException.java:111)
	at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.readNextEntry(UnixDirectoryStream.java:171)
	at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.hasNext(UnixDirectoryStream.java:201)
	at java.nio.file.Files$2.hasNext(Files.java:3460)
	... 44 more

Incidentally, our kernel version is 3.10.0-514.el7.x86_64.

@Craigacp
Contributor

Hmm, that's unfortunate. If the method was refactored to not use streams then it should throw a subclass of IOException for those operations. I'd missed that the stream would convert the exception when I looked over the musl PR.
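As a rough illustration of the stream-free direction, the sketch below lists a directory with `Files.newDirectoryStream` and surfaces every failure as a checked `IOException`. The class and method names are hypothetical, and the "name contains needle" test is a simplification, not the project's detection logic. One caveat worth noting: `DirectoryStream` iteration reports I/O errors through the unchecked `DirectoryIteratorException`, so the sketch unwraps it explicitly; `getCause()` on that exception is declared to return an `IOException`.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class NoStreamScanSketch {

    // Returns true if any entry path in dir contains the given substring.
    // All I/O failures reach the caller as a checked IOException.
    static boolean anyEntryContains(Path dir, String needle) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (entry.toString().contains(needle)) {
                    return true;
                }
            }
            return false;
        } catch (DirectoryIteratorException e) {
            // Iteration errors arrive wrapped; rethrow the underlying IOException.
            throw e.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            Path dir = Paths.get("/proc/self/map_files");
            System.out.println("musl entry found: " + anyEntryContains(dir, "musl"));
        } catch (IOException e) {
            System.out.println("could not scan: " + e);
        }
    }
}
```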

Are you running a hardened configuration of RHEL/CentOS/OL 7?

@wbo4958
Contributor

wbo4958 commented Apr 26, 2022

@trivialfis looks like we should also port this PR to 1.6.1

@mallman
Contributor Author

mallman commented Apr 26, 2022

@Craigacp I don't admin this cluster and really don't even understand your question. I'm pretty sure it is CentOS 7. Otherwise I don't know. Sorry.

@mallman
Contributor Author

mallman commented Apr 26, 2022

@wbo4958 That would be great, thank you!

@Craigacp
Contributor

> @Craigacp I don't admin this cluster and really don't even understand your question. I'm pretty sure it is CentOS 7. Otherwise I don't know. Sorry.

The operation not permitted error makes me wonder if it's hitting SELinux or it's running inside a restricted container. It's separate from actually fixing the issue, I was just wondering what was different about your setup that caused you to hit it, because it doesn't throw in the CI.

@mallman
Contributor Author

mallman commented Apr 26, 2022

> The operation not permitted error makes me wonder if it's hitting SELinux or it's running inside a restricted container. It's separate from actually fixing the issue, I was just wondering what was different about your setup that caused you to hit it, because it doesn't throw in the CI.

@Craigacp Hmmmm... well, if there are a few system commands you want me to run on the machine to gather information, I will do that for you.

@Craigacp
Contributor

> > The operation not permitted error makes me wonder if it's hitting SELinux or it's running inside a restricted container. It's separate from actually fixing the issue, I was just wondering what was different about your setup that caused you to hit it, because it doesn't throw in the CI.
>
> @Craigacp Hmmmm... well, if there are a few system commands you want me to run on the machine to gather information, I will do that for you.

Unfortunately I'm not sure what the set of things is that can cause that part of /proc to not be readable, so I'm not sure what commands would tell me how it happened. As I said, we should still fix the issue, but if we knew the root cause we could note it in the release notes to explain when users should upgrade.
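For anyone who wants to check whether their own environment is affected, a small standalone probe along the following lines might help. It is only a sketch: the class name and output format are made up for illustration, and it does not identify the root cause. It simply tries to list `/proc/self/map_files` from inside a JVM and reports either the entry count or the exception type and message, alongside the `os.name` and `os.version` system properties.

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ProcReadabilityProbe {

    // Tries to list the directory and returns a one-line report instead of
    // throwing, so the result is easy to paste into a bug report.
    static String probe(Path dir) {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            int count = 0;
            for (Path ignored : entries) {
                count++;
            }
            return "OK: listed " + count + " entries in " + dir;
        } catch (Exception e) {
            // Covers both open-time IOExceptions and iteration-time failures.
            return "FAILED: " + e.getClass().getName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("os.name=" + System.getProperty("os.name"));
        System.out.println("os.version=" + System.getProperty("os.version"));
        System.out.println(probe(Paths.get("/proc/self/map_files")));
    }
}
```

Running it under the same user and container settings as the failing Spark executors would be the most informative, since permissions may differ between an interactive shell and the executor process.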

@trivialfis
Member

Thank you for working on the fix, will backport it to 1.6

trivialfis merged commit f7db16a into dmlc:master Apr 28, 2022
trivialfis mentioned this pull request Apr 29, 2022
trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Apr 29, 2022
trivialfis added a commit that referenced this pull request Apr 29, 2022
* [jvm-packages] move the dmatrix building into rabit context (#7823)

This fixes the QuantileDeviceDMatrix in distributed environment.

* [doc] update the jvm tutorial to 1.6.1 [skip ci] (#7834)

* [Breaking][jvm-packages] Use barrier execution mode (#7836)

With the introduction of the barrier execution mode, we don't need to kill SparkContext when some xgboost tasks fail. Instead, Spark will handle the errors for us. So in this PR, the `killSparkContextOnWorkerFailure` parameter is deleted.

* [doc] remove the doc about killing SparkContext [skip ci] (#7840)

* [jvm-package] remove the coalesce in barrier mode (#7846)

* [jvm-packages] Fix model compatibility (#7845)

* Ignore all Java exceptions when looking for Linux musl support (#7844)

Co-authored-by: Bobby Wang <[email protected]>
Co-authored-by: Michael Allman <[email protected]>
4 participants