
[BUG] Host memory leak in MultiFileCoalescingPartitionReaderBase in UTC time zone #9974

Closed
res-life opened this issue Dec 6, 2023 · 5 comments
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

@res-life (Collaborator) commented Dec 6, 2023

Describe the bug
Host memory leak in MultiFileCoalescingPartitionReaderBase when running in the UTC time zone.

Steps/Code to reproduce bug

On branch-24.02 with Spark311:
1. Compile the plugin.
2. export TZ=
3. export TEST_PARALLEL=5
4. Add `-Dai.rapids.refcount.debug=true` to `PYSP_TEST_spark_driver_extraJavaOptions` in `run_pyspark_from_build.sh`.
5. Run all cases (a consolidated command sketch follows below): ./integration_tests/run_pyspark_from_build.sh -s
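
For convenience, here is one possible consolidation of the steps above into a single shell session; it assumes the branch-24.02 / Spark311 build from step 1 is already in place:

export TZ=
export TEST_PARALLEL=5
# after adding -Dai.rapids.refcount.debug=true to PYSP_TEST_spark_driver_extraJavaOptions
# in run_pyspark_from_build.sh, run the full integration test suite:
./integration_tests/run_pyspark_from_build.sh -s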

Errors:

23/12/06 08:21:23 ERROR MemoryCleaner: Leaked host buffer (ID: 10969477): 

2023-12-06 08:21:23.0129 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:117)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:155)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$1(GpuMultiFileReader.scala:1115)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readPartFiles(GpuMultiFileReader.scala:1102)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:1062)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readBatch(GpuMultiFileReader.scala:1041)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1014)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0129 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.HostMemoryBuffer.slice(HostMemoryBuffer.java:611)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$5(GpuMultiFileReader.scala:1125)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$5$adapted(GpuMultiFileReader.scala:1122)
scala.collection.mutable.LinkedHashMap.foreach(LinkedHashMap.scala:143)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$4(GpuMultiFileReader.scala:1122)
com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:88)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$1(GpuMultiFileReader.scala:1115)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readPartFiles(GpuMultiFileReader.scala:1102)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:1062)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readBatch(GpuMultiFileReader.scala:1041)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1014)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0130 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
ai.rapids.cudf.MemoryBuffer$SlicedBufferCleaner.cleanImpl(MemoryBuffer.java:75)
ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader$ParquetCopyBlocksRunner.call(GpuParquetScan.scala:1905)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader$ParquetCopyBlocksRunner.call(GpuParquetScan.scala:1893)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0131 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.HostMemoryBuffer.slice(HostMemoryBuffer.java:611)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$writeFileFooter$1(GpuParquetScan.scala:2013)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$writeFileFooter$1$adapted(GpuParquetScan.scala:2012)
com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:88)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader.writeFileFooter(GpuParquetScan.scala:2012)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$1(GpuMultiFileReader.scala:1179)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readPartFiles(GpuMultiFileReader.scala:1102)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:1062)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readBatch(GpuMultiFileReader.scala:1041)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1014)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0131 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
ai.rapids.cudf.MemoryBuffer$SlicedBufferCleaner.cleanImpl(MemoryBuffer.java:75)
ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$writeFileFooter$1(GpuParquetScan.scala:2013)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$writeFileFooter$1$adapted(GpuParquetScan.scala:2012)
com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:88)
com.nvidia.spark.rapids.MultiFileParquetPartitionReader.writeFileFooter(GpuParquetScan.scala:2012)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$1(GpuMultiFileReader.scala:1179)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readPartFiles(GpuMultiFileReader.scala:1102)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:1062)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readBatch(GpuMultiFileReader.scala:1041)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1014)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0131 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1071)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0132 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
com.nvidia.spark.rapids.ParquetTableReader.close(GpuParquetScan.scala:2720)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0132 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1071)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0151 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
com.nvidia.spark.rapids.ParquetTableReader.close(GpuParquetScan.scala:2720)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0152 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1071)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0155 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
com.nvidia.spark.rapids.ParquetTableReader.close(GpuParquetScan.scala:2720)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0155 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1071)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0157 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
com.nvidia.spark.rapids.ParquetTableReader.close(GpuParquetScan.scala:2720)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0157 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1071)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0160 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
com.nvidia.spark.rapids.ParquetTableReader.close(GpuParquetScan.scala:2720)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0160 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1071)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2023-12-06 08:21:23.0162 UTC: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:242)
com.nvidia.spark.rapids.ParquetTableReader.close(GpuParquetScan.scala:2720)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:470)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:593)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

Environment details (please complete the following information)
On branch-24.02
Spark311

Additional context
I suspect branch-23.12 also has this issue; this is just a guess, I have not tested it yet.
I also found this related leak issue: #9971
The recently passed CI log (CI sequence number #8653) also shows the following errors; to get the log, click Blue Ocean -> Premerge CI 2 -> Download log:

[2023-12-06T06:38:26.868Z] 23/12/06 06:38:26 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 349187)
[2023-12-06T06:38:26.868Z] 23/12/06 06:38:26 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 349191)
[2023-12-06T06:38:26.868Z] 23/12/06 06:38:26 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 349188)
[2023-12-06T06:38:26.868Z] 23/12/06 06:38:26 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 349189)

res-life added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Dec 6, 2023
mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Dec 12, 2023
tgravescs self-assigned this on Dec 14, 2023
tgravescs removed their assignment on Jan 25, 2024
@gerashegalov (Collaborator)

Good news: the exact stack trace associated with MultiFileCoalescingPartitionReaderBase above does not reproduce:

$ SPARK_HOME=~/dist/spark-3.1.1-bin-hadoop3.2 \
  JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
  TEST_PARALLEL=0 COVERAGE_SUBMIT_FLAGS="-Dai.rapids.refcount.debug=true" \
  TZ= ./integration_tests/run_pyspark_from_build.sh  -s |& tee parquet.log

$ grep -B 20 MultiFileCoalescingPartitionReaderBase parquet.log | grep -c Leaked
0

Bad news: there are memory leaks reported:

$ grep -c Leaked parquet.log 
76

Example:

24/02/07 09:09:30 ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 30950065 7f8a067c9680)
24/02/07 09:09:30 ERROR MemoryCleaner: Leaked vector (ID: 30950065): 2024-02-07 09:09:30.0597 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.ColumnVector.incRefCountInternal(ColumnVector.java:298)
ai.rapids.cudf.ColumnVector.<init>(ColumnVector.java:81)
ai.rapids.cudf.ColumnVector.fromScalar(ColumnVector.java:431)
com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:790)
com.nvidia.spark.rapids.GpuExpressionsUtils$.$anonfun$resolveColumnVector$1(GpuExpressions.scala:76)
com.nvidia.spark.rapids.Arm$.withResourceIfAllowed(Arm.scala:84)
com.nvidia.spark.rapids.GpuExpressionsUtils$.resolveColumnVector(GpuExpressions.scala:74)
com.nvidia.spark.rapids.GpuLiteral.columnarEval(literals.scala:725)
com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:110)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:221)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:218)
scala.collection.immutable.List.foreach(List.scala:392)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:218)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:253)
com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:110)
com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$7(basicPhysicalOperators.scala:160)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(RmmRapidsRetryIterator.scala:272)
com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$6(basicPhysicalOperators.scala:160)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$5(basicPhysicalOperators.scala:157)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.next(RmmRapidsRetryIterator.scala:395)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:185)
com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$3(basicPhysicalOperators.scala:156)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:39)
com.nvidia.spark.rapids.GpuProjectExec$.projectWithRetrySingleBatch(basicPhysicalOperators.scala:154)
com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectAndCloseWithRetrySingleBatch$1(basicPhysicalOperators.scala:125)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
com.nvidia.spark.rapids.GpuProjectExec$.projectAndCloseWithRetrySingleBatch(basicPhysicalOperators.scala:124)
com.nvidia.spark.rapids.GpuGenerateExec.$anonfun$doGenerateAndClose$1(GpuGenerateExec.scala:906)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
com.nvidia.spark.rapids.GpuGenerateExec.doGenerateAndClose(GpuGenerateExec.scala:900)
com.nvidia.spark.rapids.GpuGenerateExec.$anonfun$internalDoExecuteColumnar$3(GpuGenerateExec.scala:888)
scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)

mattahrens added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) label on Feb 7, 2024
@gerashegalov (Collaborator) commented Feb 7, 2024

There is a single test that reliably reproduces the new memory leak:

SPARK_HOME=~/dist/spark-3.1.1-bin-hadoop3.2  \
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
TEST_PARALLEL=0 \
COVERAGE_SUBMIT_FLAGS="-Dai.rapids.refcount.debug=true" \
TZ= ./integration_tests/run_pyspark_from_build.sh -k test_json_tuple_select_non_generator_col -s |& tee build.log

which alone is responsible for 20 leaked buffers. It reproduces on both my laptop and my desktop GPU, so I am fairly sure it shows up in the CI log and would have been caught by #6947 had it already been implemented.

@gerashegalov (Collaborator)

bisect points to #10131
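
For anyone repeating this, here is a hypothetical sketch of how the bisect could be automated, using the single reproducing test above as the predicate. The <bad-sha>/<good-sha> placeholders and the bisect.log name are illustrative, not values from this investigation, and the per-commit plugin rebuild is deliberately left as a comment:

git bisect start <bad-sha> <good-sha>
git bisect run bash -c '
  # rebuild the plugin for the checked-out commit here (build command omitted), then
  # run the single reproducing test with refcount debugging and fail on any reported leak
  SPARK_HOME=~/dist/spark-3.1.1-bin-hadoop3.2 \
  JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
  TEST_PARALLEL=0 TZ= \
  COVERAGE_SUBMIT_FLAGS="-Dai.rapids.refcount.debug=true" \
  ./integration_tests/run_pyspark_from_build.sh -k test_json_tuple_select_non_generator_col -s 2>&1 | tee bisect.log
  ! grep -q Leaked bisect.log
'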

@gerashegalov (Collaborator)

The memory leak from #9974 (comment) was fixed by #10360

@revans2 @sameerz I assume we don't want to release 24.02.0 like this, and that the fix should be cherry-picked to branch-24.02, right?

@gerashegalov (Collaborator)

The original memory leaks reported in this issue's description can no longer be reproduced. Closing.
