Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Remove cleaned objects in memory cleaner #13412

Closed
res-life opened this issue May 23, 2023 · 0 comments · Fixed by #13378
Closed

[FEA] Remove cleaned objects in memory cleaner #13412

res-life opened this issue May 23, 2023 · 0 comments · Fixed by #13378
Labels
feature request New feature or request Java Affects Java cuDF API.

Comments

@res-life
Copy link
Contributor

Is your feature request related to a problem? Please describe.
JNI MemoryCleaner holds a set of CleanerWeakReference which can go to very large size when doing NDS testing while full GC is not excuted frequently.
For the set, please refer to:

  private static final Set<CleanerWeakReference> all =
      Collections.newSetFromMap(new ConcurrentHashMap()); // We want to be thread safe

https://github.com/rapidsai/cudf/blob/v23.06.00a/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L148-L149
I verified that the full GC will shrink the set. But in practice the full GC may be not executed for a long time.
And I verified there is no leak on the resources that the set references.

Not sure if this large size set can cause OOM, but we can remove the closed items as soon as possible to save memory.

Describe the solution you'd like
The MemoryCleaner.Cleaner already has isClean, because the cleaned object will not be used again.
We can safely remove the cleaned items.

Refer to:
https://github.com/rapidsai/cudf/blob/v23.06.00a/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L144

Update the thread code to remove cleaned items
https://github.com/rapidsai/cudf/blob/v23.06.00a/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L182-L207

Additional context
The K8s will kill executor if the memory of the Pod running this executor exceeds the definded value which is typically spark.executor.memory + spark.executor.memoryOverhead and report OOM which is a kind of K8s OOM.
Not sure if the large size of CleanerWeakReference is the cause.

Also reported in NVIDIA/spark-rapids#8305

@res-life res-life added feature request New feature or request Needs Triage Need team to review and classify Java Affects Java cuDF API. labels May 23, 2023
rapids-bot bot pushed a commit that referenced this issue May 30, 2023
closes #13412
Remove weak references of cleaned resources when a resource is cleaned.
The cleaned objects are never leaked, it's safe to remove the weak references. 
This is to reduce the memory usage.

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - MithunR (https://github.com/mythrocks)

URL: #13378
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Java Affects Java cuDF API.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants