Describe the bug
When using GDS for RAPIDS shuffle spilling, reading back a spilled buffer sometimes fails with a "No such file or directory" exception.
Steps/Code to reproduce bug
Run TPC-DS queries with GDS spilling enabled. The failure appears to happen more often when the GPU is under memory pressure.
Expected behavior
Reading back a spilled buffer should not throw a CuFile exception.
Environment details (please complete the following information)
Environment location: Standalone
Spark configuration settings related to the issue: spark.rapids.memory.gpu.direct.storage.spill.enabled=true
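For context, a minimal sketch of how this configuration can be set when building the Spark session; the application name and the surrounding TPC-DS driver code are placeholders and not part of the original report:

```scala
import org.apache.spark.sql.SparkSession

// Minimal repro sketch (hypothetical app name): enable the RAPIDS SQL plugin
// and GDS-based spill, matching the setting reported in this issue.
val spark = SparkSession.builder()
  .appName("tpcds-gds-spill-repro") // placeholder
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  .config("spark.rapids.memory.gpu.direct.storage.spill.enabled", "true")
  .getOrCreate()

// Run TPC-DS queries with this session; the failure is reported to show up
// more often when the GPU is under memory pressure and buffers spill via GDS.
```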
Additional context
Stack trace:
21/01/29 18:42:33 WARN TaskSetManager: Lost task 58.0 in stage 39.0 (TID 848, 127.0.0.1, executor 1): ai.rapids.cudf.CudfException: cuDF failure at: /data/rou/src/cudf/java/src/main/native/src/CuFileJni.cpp:215: Failed to read file into buffer: No such file or directory
    at ai.rapids.cudf.CuFile.readFromFile(Native Method)
    at ai.rapids.cudf.CuFile.readFileToDeviceBuffer(CuFile.java:115)
    at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.$anonfun$getMemoryBuffer$1(RapidsGdsStore.scala:82)
    at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:67)
    at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:65)
    at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.closeOnExcept(RapidsBufferStore.scala:245)
    at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.getMemoryBuffer(RapidsGdsStore.scala:81)
    at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.getColumnarBatch(RapidsGdsStore.scala:101)
    at org.apache.spark.sql.rapids.RapidsCachingReader.$anonfun$read$9(RapidsCachingReader.scala:146)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at org.apache.spark.sql.rapids.RapidsCachingReader.withResource(RapidsCachingReader.scala:49)
    at org.apache.spark.sql.rapids.RapidsCachingReader.$anonfun$read$8(RapidsCachingReader.scala:145)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:230)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$iterNext$1(GpuCoalesceBatches.scala:180)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.iterNext(GpuCoalesceBatches.scala:179)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1(GpuCoalesceBatches.scala:185)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1$adapted(GpuCoalesceBatches.scala:183)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:183)
    at com.nvidia.spark.rapids.ConcatAndConsumeAll$.getSingleBatchWithVerification(GpuCoalesceBatches.scala:80)
    at com.nvidia.spark.rapids.shims.spark300.GpuShuffledHashJoinExec.$anonfun$doExecuteColumnar$2(GpuShuffledHashJoinExec.scala:138)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)