Ignore all Java exceptions when looking for Linux musl support #7844

mallman · 2022-04-25T21:26:05Z

Ignore all Java exceptions when looking for Linux musl support, not just IOException.

The current (1.6.0) implementation of the musl support detection code catches and ignores exceptions of type java.io.IOException. However, in our case the detection code is throwing an instance of java.io.UncheckedIOException, which is not a subtype of java.io.IOException. As a result, XGBoost fails to load.

To resolve this issue, this PR catches and ignores all exceptions. As our experience shows, I don't think we can or should assume that every failure of this code will be of a particular subtype of java.lang.Exception.
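
To illustrate the difference, here is a minimal sketch rather than the actual XGBoost code. The `Probe` interface and both `detect*` methods are hypothetical names introduced for this example. It shows an UncheckedIOException passing straight through a `catch (IOException ...)` clause, while a broad `catch (Exception ...)` clause absorbs it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class MuslCatchSketch {

    /** Hypothetical stand-in for the real detection routine. */
    public interface Probe {
        boolean run() throws IOException;
    }

    /** Narrow handling: only the checked IOException is ignored. */
    public static boolean detectNarrow(Probe probe) {
        try {
            return probe.run();
        } catch (IOException ignored) {
            return false;
        }
    }

    /** Broad handling: any exception is treated as "not musl". */
    public static boolean detectBroad(Probe probe) {
        try {
            return probe.run();
        } catch (Exception ignored) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulates the failure from the stack trace: an unchecked wrapper
        // around a file system error.
        Probe failing = () -> {
            throw new UncheckedIOException(
                new IOException("/proc/self/map_files: Operation not permitted"));
        };

        System.out.println("broad: " + detectBroad(failing));
        try {
            detectNarrow(failing);
        } catch (UncheckedIOException e) {
            // UncheckedIOException extends RuntimeException, so the narrow
            // catch clause never sees it.
            System.out.println("narrow: escaped as " + e.getClass().getName());
        }
    }
}
```

The trade-off of the broad catch is that it also hides unrelated programming errors in the detection code, which seems acceptable here because a failed probe only means falling back to the non-musl library.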

Here is our complete exception stack trace:

java.lang.ExceptionInInitializerError
	at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
	at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:43)
	at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:574)
	at ml.dmlc.xgboost4j.scala.spark.PreXGBoost$.$anonfun$trainForNonRanking$1(PreXGBoost.scala:480)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.UncheckedIOException: java.nio.file.FileSystemException: /proc/self/map_files: Operation not permitted
	at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:37)
	... 32 more
Caused by: java.io.UncheckedIOException: java.nio.file.FileSystemException: /proc/self/map_files: Operation not permitted
	at java.nio.file.Files$2.hasNext(Files.java:3462)
	at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
	at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
	at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464)
	at ml.dmlc.xgboost4j.java.NativeLibLoader$OS.isMuslBased(NativeLibLoader.java:95)
	at ml.dmlc.xgboost4j.java.NativeLibLoader$OS.detectOS(NativeLibLoader.java:76)
	at ml.dmlc.xgboost4j.java.NativeLibLoader.initXGBoost(NativeLibLoader.java:169)
	at ml.dmlc.xgboost4j.java.XGBoostJNI.<clinit>(XGBoostJNI.java:34)
	... 32 more
Caused by: java.nio.file.FileSystemException: /proc/self/map_files: Operation not permitted
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
	at sun.nio.fs.UnixException.asIOException(UnixException.java:111)
	at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.readNextEntry(UnixDirectoryStream.java:171)
	at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.hasNext(UnixDirectoryStream.java:201)
	at java.nio.file.Files$2.hasNext(Files.java:3460)
	... 44 more

Incidentally, our kernel version is 3.10.0-514.el7.x86_64.


Craigacp · 2022-04-26T02:27:38Z

Hmm, that's unfortunate. If the method was refactored to not use streams then it should throw a subclass of IOException for those operations. I'd missed that the stream would convert the exception when I looked over the musl PR.
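
As a rough sketch of what a loop-based version could look like (not the code in the musl PR; the class and method names are made up for this example), the following uses `Files.newDirectoryStream`. One caveat: the directory iterator reports read failures as the unchecked `DirectoryIteratorException`, so the sketch unwraps it to get back to a checked IOException.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

public class MuslScanSketch {

    /**
     * Hypothetical loop-based scan: returns true if any entry in dir resolves
     * to a path containing "musl". I/O failures surface as checked IOException.
     */
    public static boolean containsMuslMapping(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // Resolve symlinks, as entries under map_files are links.
                String target = entry.toRealPath().toString().toLowerCase(Locale.ROOT);
                if (target.contains("musl")) {
                    return true;
                }
            }
            return false;
        } catch (DirectoryIteratorException e) {
            // Iteration failures arrive unchecked; rethrow the underlying
            // IOException so callers can handle a checked exception.
            throw e.getCause();
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo against a temporary directory instead of /proc/self/map_files.
        Path dir = Files.createTempDirectory("map_files_sketch");
        Files.createFile(dir.resolve("libc.musl-x86_64.so.1"));
        System.out.println(containsMuslMapping(dir));
    }
}
```

With this shape, a caller that only catches IOException would cover both the failure to open the directory and failures while reading entries.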

Are you running a hardened configuration of RHEL/CentOS/OL 7?

wbo4958 · 2022-04-26T04:36:29Z

@trivialfis looks like we should also port this PR to 1.6.1

mallman · 2022-04-26T18:19:55Z

@Craigacp I don't admin this cluster and really don't even understand your question. I'm pretty sure it is Centos 7. Otherwise I don't know. Sorry.

mallman · 2022-04-26T18:20:22Z

@wbo4958 That would be great, thank you!

Craigacp · 2022-04-26T18:27:25Z

> @Craigacp I don't admin this cluster and really don't even understand your question. I'm pretty sure it is Centos 7. Otherwise I don't know. Sorry.

The operation not permitted error makes me wonder if it's hitting SELinux or it's running inside a restricted container. It's separate from actually fixing the issue, I was just wondering what was different about your setup that caused you to hit it, because it doesn't throw in the CI.

mallman · 2022-04-26T19:11:19Z

> The operation not permitted error makes me wonder if it's hitting SELinux or it's running inside a restricted container. It's separate from actually fixing the issue, I was just wondering what was different about your setup that caused you to hit it, because it doesn't throw in the CI.

@Craigacp Hmmmm... well, if there's a few system commands you want me to run on the machine to gather information I will do that for you.

Craigacp · 2022-04-26T20:39:22Z

> > The operation not permitted error makes me wonder if it's hitting SELinux or it's running inside a restricted container. It's separate from actually fixing the issue, I was just wondering what was different about your setup that caused you to hit it, because it doesn't throw in the CI.
>
> @Craigacp Hmmmm... well, if there's a few system commands you want me to run on the machine to gather information I will do that for you.

Unfortunately I'm not sure what the set of things is that can cause that part of /proc to not be readable, so I'm not sure what commands would tell me how it happened. As I said, we should still fix the issue, but if we knew the root cause we could note it in the release notes to explain when users should upgrade.
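
One way to gather at least some information is a small standalone probe, sketched below under the assumption that it runs as the same user and in the same environment as the failing Spark executor. `MapFilesProbe` and `probe` are hypothetical names. It does not identify the root cause; it only reports whether the directory can be listed and, if not, which exception the JVM raises.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MapFilesProbe {

    /** Tries to list dir and describes the outcome as a single line of text. */
    public static String probe(Path dir) {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            int count = 0;
            for (Path ignored : entries) {
                count++;
            }
            return "readable: " + count + " entries";
        } catch (DirectoryIteratorException e) {
            // The directory opened, but reading its entries failed.
            return "failed while iterating: " + e.getCause();
        } catch (IOException e) {
            // The directory could not be opened at all.
            return "failed to open: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println("os.version = " + System.getProperty("os.version"));
        System.out.println("user.name  = " + System.getProperty("user.name"));
        System.out.println(probe(Paths.get("/proc/self/map_files")));
    }
}
```

Comparing the output between an affected machine and the CI environment might narrow down what differs between the two setups.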

trivialfis · 2022-04-27T15:37:06Z

Thank you for working on the fix, will backport it to 1.6


* [jvm-packages] move the dmatrix building into rabit context (#7823)
  This fixes the QuantileDeviceDMatrix in distributed environment.
* [doc] update the jvm tutorial to 1.6.1 [skip ci] (#7834)
* [Breaking][jvm-packages] Use barrier execution mode (#7836)
  With the introduction of the barrier execution mode. we don't need to kill SparkContext when some xgboost tasks failed. Instead, Spark will handle the errors for us. So in this PR, `killSparkContextOnWorkerFailure` parameter is deleted.
* [doc] remove the doc about killing SparkContext [skip ci] (#7840)
* [jvm-package] remove the coalesce in barrier mode (#7846)
* [jvm-packages] Fix model compatibility (#7845)
* Ignore all Java exceptions when looking for Linux musl support (#7844)

Co-authored-by: Bobby Wang <[email protected]>
Co-authored-by: Michael Allman <[email protected]>

Ignore all Java exceptions when looking for Linux musl support, not just IOException

43c129b

trivialfis approved these changes Apr 28, 2022


trivialfis merged commit f7db16a into dmlc:master Apr 28, 2022

trivialfis mentioned this pull request Apr 29, 2022

1.6.1 Patch Release #7841

Closed

7 tasks

trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Apr 29, 2022

Ignore all Java exceptions when looking for Linux musl support (dmlc#…

67fcd35

…7844)

kstock mentioned this pull request Apr 29, 2022

XGBoost 4J spark giving XGBoostError: std::bad_alloc on databricks #7155

Open
