[BUG] GDS exception when restoring spilled buffer #1627

Status: Closed
rongou opened this issue Jan 29, 2021 · 2 comments
Assignees: rongou
Labels: bug (Something isn't working)

Comments

rongou (Collaborator) commented Jan 29, 2021

Describe the bug
When using GDS for RAPIDS shuffle spilling, reading back a spilled buffer sometimes fails with a "No such file or directory" exception.
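For context, here is a minimal sketch of the read-back path that fails, based on the cudf Java CuFile API referenced in the stack trace below. The helper name, its parameters, and the exact CuFile signature are assumptions for illustration, not the plugin's actual code.

```scala
import java.io.File

import ai.rapids.cudf.{CuFile, DeviceMemoryBuffer}

// Hedged sketch only: restore a spilled buffer from disk into GPU memory via GDS.
// Names, parameters, and the CuFile signature are assumed from the stack trace below.
def restoreSpilledBuffer(spillFile: File, fileOffset: Long, size: Long): DeviceMemoryBuffer = {
  val buffer = DeviceMemoryBuffer.allocate(size)
  try {
    // cuFile reads the file contents directly into device memory; this is the call
    // that throws "No such file or directory" when the spill file is missing.
    CuFile.readFileToDeviceBuffer(buffer, spillFile, fileOffset)
    buffer
  } catch {
    case t: Throwable =>
      buffer.close() // don't leak device memory if the read fails
      throw t
  }
}
```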

Steps/Code to reproduce bug
Run TPC-DS queries with GDS spilling enabled. The failure seems to happen more often under GPU memory pressure.

Expected behavior
Reading back a spilled buffer should not throw a CuFile exception.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark configuration settings related to the issue: spark.rapids.memory.gpu.direct.storage.spill.enabled=true
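For reference, the setting above can be passed with --conf on spark-submit or set programmatically. A minimal Scala sketch follows; only the spill flag comes from this report, while the app name is illustrative and spark.plugins=com.nvidia.spark.SQLPlugin is included as the usual way to load the RAPIDS Accelerator, not as part of this report.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: enable GDS spilling for the RAPIDS Accelerator.
val spark = SparkSession.builder()
  .appName("tpcds-gds-spill-repro") // hypothetical app name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.memory.gpu.direct.storage.spill.enabled", "true")
  .getOrCreate()
```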

Additional context
Stack trace:

21/01/29 18:42:33 WARN TaskSetManager: Lost task 58.0 in stage 39.0 (TID 848, 127.0.0.1, executor 1):
ai.rapids.cudf.CudfException: cuDF failure at: /data/rou/src/cudf/java/src/main/native/src/CuFileJni.cpp:215:
 Failed to read file into buffer: No such file or directory
	at ai.rapids.cudf.CuFile.readFromFile(Native Method)
	at ai.rapids.cudf.CuFile.readFileToDeviceBuffer(CuFile.java:115)
	at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.$anonfun$getMemoryBuffer$1(RapidsGdsStore.scala:82)
	at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:67)
	at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:65)
	at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.closeOnExcept(RapidsBufferStore.scala:245)
	at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.getMemoryBuffer(RapidsGdsStore.scala:81)
	at com.nvidia.spark.rapids.RapidsGdsStore$RapidsGdsBuffer.getColumnarBatch(RapidsGdsStore.scala:101)
	at org.apache.spark.sql.rapids.RapidsCachingReader.$anonfun$read$9(RapidsCachingReader.scala:146)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at org.apache.spark.sql.rapids.RapidsCachingReader.withResource(RapidsCachingReader.scala:49)
	at org.apache.spark.sql.rapids.RapidsCachingReader.$anonfun$read$8(RapidsCachingReader.scala:145)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:230)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$iterNext$1(GpuCoalesceBatches.scala:180)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.iterNext(GpuCoalesceBatches.scala:179)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1(GpuCoalesceBatches.scala:185)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1$adapted(GpuCoalesceBatches.scala:183)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:183)
	at com.nvidia.spark.rapids.ConcatAndConsumeAll$.getSingleBatchWithVerification(GpuCoalesceBatches.scala:80)
	at com.nvidia.spark.rapids.shims.spark300.GpuShuffledHashJoinExec.$anonfun$doExecuteColumnar$2(GpuShuffledHashJoinExec.scala:138)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
rongou added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jan 29, 2021
rongou self-assigned this on Jan 29, 2021
rongou mentioned this issue on Jan 29, 2021
sameerz removed the ? - Needs Triage (Need team to review and classify) label on Jan 30, 2021
sameerz added this to the Feb 1 - Feb 12 milestone on Jan 30, 2021
rongou (Collaborator, Author) commented Mar 2, 2021

I haven't seen this happen for a while.

rongou closed this as completed on Mar 2, 2021
rongou (Collaborator, Author) commented Apr 7, 2021

On the DGX-2 we broke the 16-drive RAID 0 array apart into four 4-drive arrays, which helped resolve this issue.
