Is your feature request related to a problem? Please describe.
The JNI MemoryCleaner holds a set of CleanerWeakReference entries that can grow very large during NDS testing when a full GC is not executed frequently. The set is declared here:
https://github.com/rapidsai/cudf/blob/v23.06.00a/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L148-L149

private static final Set<CleanerWeakReference> all =
    Collections.newSetFromMap(new ConcurrentHashMap()); // We want to be thread safe
I verified that a full GC will shrink the set, but in practice a full GC may not run for a long time. I also verified that there is no leak in the resources the set references.
I am not sure whether this large set can cause an OOM, but we can remove closed items as soon as possible to save memory.
Describe the solution you'd like
MemoryCleaner.Cleaner already has isClean, and because a cleaned object will not be used again, we can safely remove cleaned items from the set. Refer to:
https://github.com/rapidsai/cudf/blob/v23.06.00a/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L144
Update the cleaning thread code to remove cleaned items:
https://github.com/rapidsai/cudf/blob/v23.06.00a/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L182-L207
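The proposed eager removal could be sketched roughly as follows. Note this is a minimal, self-contained sketch: the Cleaner and CleanerWeakReference classes below are simplified stand-ins with hypothetical bodies, not the actual cudf implementation, and removeCleaned is an illustrative name for the logic that would go into the cleaning thread.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CleanerSketch {
    // Simplified stand-in for MemoryCleaner.Cleaner (hypothetical body).
    static class Cleaner {
        private volatile boolean cleaned = false;
        boolean isClean() { return cleaned; }
        void clean() { cleaned = true; }
    }

    // Simplified stand-in for the CleanerWeakReference entries in the set.
    static class CleanerWeakReference extends WeakReference<Object> {
        final Cleaner cleaner;
        CleanerWeakReference(Object referent, Cleaner cleaner) {
            super(referent);
            this.cleaner = cleaner;
        }
    }

    // Mirrors the declaration referenced above: a thread-safe set.
    private static final Set<CleanerWeakReference> all =
        Collections.newSetFromMap(new ConcurrentHashMap<>());

    // The idea of the proposal: besides dropping entries whose referent was
    // garbage collected, also drop entries whose cleaner reports isClean(),
    // so the set shrinks without waiting for a full GC.
    static void removeCleaned() {
        all.removeIf(cwr -> cwr.get() == null || cwr.cleaner.isClean());
    }

    public static void main(String[] args) {
        Object resource = new Object(); // kept strongly reachable here
        Cleaner c1 = new Cleaner();
        Cleaner c2 = new Cleaner();
        all.add(new CleanerWeakReference(resource, c1));
        all.add(new CleanerWeakReference(resource, c2));

        c1.clean();      // simulate one resource being closed
        removeCleaned(); // eagerly drop the cleaned entry

        System.out.println(all.size()); // prints 1
    }
}
```

In the real cleaning thread this check would run alongside the existing reference-queue processing, so cleaned entries are dropped on each pass rather than lingering until a full GC clears their referents.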
Additional context
K8s will kill an executor if the memory of the Pod running it exceeds the defined limit, which is typically spark.executor.memory + spark.executor.memoryOverhead, and report an OOM (a kind of K8s OOM). I am not sure whether the large number of CleanerWeakReference entries is the cause.
Also reported in NVIDIA/spark-rapids#8305