[FEA] GPU Shuffle coalesce read supports split-and-retry #11584

Open
firestarman opened this issue Oct 10, 2024 · 1 comment · May be fixed by #11598
Assignees: firestarman
Labels: ? - Needs Triage (Need team to review and classify), feature request (New feature or request)

Comments

firestarman (Collaborator) commented Oct 10, 2024

We hit the OOM error below in some customer queries, which caused Spark task retries.

1) Caused by: com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.split(RmmRapidsRetryIterator.scala:386)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:185)
at com.nvidia.spark.rapids.cudf_utils.HostConcatResultUtil$.getColumnarBatch(HostConcatResultUtil.scala:54)
at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.$anonfun$next$4(GpuShuffleCoalesceExec.scala:229)

The step that moves the coalesced buffer to the GPU currently supports only retry without split. We could implement split-and-retry for it to improve its stability.
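To illustrate the idea, here is a minimal, self-contained sketch of split-and-retry. It does not use the plugin's retry framework: `SimulatedOom`, `withSplitAndRetry`, and `copyToDevice` are hypothetical names made up for this example, and a `Vector[Int]` stands in for the coalesced host buffer. The assumption is that the input can be split by rows, so a piece that fails with an OOM is halved and each half is retried, instead of failing the whole task.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for a GPU OOM; NOT the plugin's GpuSplitAndRetryOOM.
final class SimulatedOom(msg: String) extends RuntimeException(msg)

object SplitAndRetrySketch {

  // Runs `process` on the input. When a piece fails with SimulatedOom, it is
  // split in half by rows and the halves are retried in order. A single-row
  // piece that still fails cannot be split, so its error is rethrown.
  def withSplitAndRetry[A, B](input: Vector[A])(process: Vector[A] => B): Vector[B] = {
    @tailrec
    def loop(pending: List[Vector[A]], done: Vector[B]): Vector[B] = pending match {
      case Nil => done
      case head :: tail =>
        Try(process(head)) match {
          case Success(out) => loop(tail, done :+ out)
          case Failure(_: SimulatedOom) if head.size > 1 =>
            val (left, right) = head.splitAt(head.size / 2)
            loop(left :: right :: tail, done)
          case Failure(e) => throw e
        }
    }
    loop(List(input), Vector.empty)
  }

  // Hypothetical "copy to device" step that fails when handed more than 2 rows.
  def copyToDevice(rows: Vector[Int]): Int = {
    if (rows.size > 2) throw new SimulatedOom(s"cannot fit ${rows.size} rows")
    rows.sum
  }
}
```

With this sketch, `SplitAndRetrySketch.withSplitAndRetry(Vector(1, 2, 3, 4, 5))(SplitAndRetrySketch.copyToDevice)` ends up processing the pieces `[1, 2]`, `[3]`, and `[4, 5]` separately rather than failing on the full input. The trade-off is that one input batch can turn into several smaller output batches, so the consumer has to accept more than one result.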

firestarman added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Oct 10, 2024
firestarman self-assigned this Oct 10, 2024
firestarman changed the title from "[BUG] GPU OOM when doing the Shuffle coalesce read" to "[FEA] GPU OOM when doing the Shuffle coalesce read" Oct 10, 2024
firestarman (Collaborator, Author) commented

This is more of a feature request than a bug.

firestarman added the "feature request" (New feature or request) label and removed the "bug" (Something isn't working) label Oct 10, 2024
firestarman changed the title from "[FEA] GPU OOM when doing the Shuffle coalesce read" to "[FEA] GPU Shuffle coalesce read supports split-and-retry" Oct 10, 2024
firestarman added and removed the "? - Needs Triage" (Need team to review and classify) label Oct 11, 2024
firestarman linked a pull request Oct 12, 2024 that will close this issue