Describe the bug
When using GDS for RAPIDS shuffle spilling, reading back a spilled buffer sometimes fails with a "No such file or directory" exception.
Steps/Code to reproduce bug
Run TPC-DS queries with GDS spilling enabled. The failure appears to happen more often when the GPU is under memory pressure.
Expected behavior
Reading back a spilled buffer should not throw a CuFile exception.
Environment details (please complete the following information)
Environment location: Standalone
Spark configuration settings related to the issue: spark.rapids.memory.gpu.direct.storage.spill.enabled=true
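For context, a minimal sketch of how this configuration can be set when building the Spark session; the application name and the surrounding TPC-DS driver code are placeholders and not part of the original report:

```scala
import org.apache.spark.sql.SparkSession

// Minimal repro sketch (hypothetical app name): enable the RAPIDS SQL plugin
// and GDS-based spill, matching the setting reported in this issue.
val spark = SparkSession.builder()
  .appName("tpcds-gds-spill-repro") // placeholder
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  .config("spark.rapids.memory.gpu.direct.storage.spill.enabled", "true")
  .getOrCreate()

// Run TPC-DS queries with this session; the failure is reported to show up
// more often when the GPU is under memory pressure and buffers spill via GDS.
```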
Additional context
Stack trace:
21/01/29 18:42:33 WARN TaskSetManager: Lost task 58.0 in stage 39.0 (TID 848, 127.0.0.1, executor 1): ai.rapids.cudf.CudfException: cuDF failure at: /data/rou/src/cudf/java/src/main/native/src/CuFileJni.cpp:215: Failed to read file into buffer: No such file or directory
    at ai.rapids.cudf.CuFile.readFromFile(Native Method)
    at ai.rapids.cudf.CuFile.readFileToDeviceBuffer(CuFile.java:115)
    at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.$anonfun$getMemoryBuffer$1(RapidsGdsStore.scala:82)
    at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:67)
    at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:65)
    at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.closeOnExcept(RapidsBufferStore.scala:245)
    at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.getMemoryBuffer(RapidsGdsStore.scala:81)
    at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.getColumnarBatch(RapidsGdsStore.scala:101)
    at org.apache.spark.sql.rapids.RapidsCachingReader.$anonfun$read$9(RapidsCachingReader.scala:146)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at org.apache.spark.sql.rapids.RapidsCachingReader.withResource(RapidsCachingReader.scala:49)
    at org.apache.spark.sql.rapids.RapidsCachingReader.$anonfun$read$8(RapidsCachingReader.scala:145)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:230)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$iterNext$1(GpuCoalesceBatches.scala:180)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.iterNext(GpuCoalesceBatches.scala:179)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1(GpuCoalesceBatches.scala:185)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1$adapted(GpuCoalesceBatches.scala:183)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
    at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:183)
    at com.nvidia.spark.rapids.ConcatAndConsumeAll$.getSingleBatchWithVerification(GpuCoalesceBatches.scala:80)
    at com.nvidia.spark.rapids.shims.spark300.GpuShuffledHashJoinExec.$anonfun$doExecuteColumnar$2(GpuShuffledHashJoinExec.scala:138)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)