
Avoid allocating GPU memory out of RMM managed pool in test #9985

Merged
1 commit merged into NVIDIA:branch-23.12 on Dec 8, 2023

Conversation

winningsix
Collaborator

This closes #9982.

The root cause is that, in our test, we create a GPU column vector via the Spark-Rapids-JNI API ColumnVector.fromBytes before RMM is initialized, but close it inside the Spark GPU session after RMM has been initialized. RMM is then asked to deallocate an address it never allocated (the buffer was created before RMM was initialized and is not tracked by its pool), which causes the failure.
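For illustration, a minimal sketch of the failing sequence (the withGpuSparkSession helper and the byte values below are illustrative assumptions, not the actual test code):

```scala
import ai.rapids.cudf.ColumnVector

// The column vector is created BEFORE RMM is initialized, so its device buffer
// does not come from the RMM-managed pool.
val cv = ColumnVector.fromBytes(1.toByte, 2.toByte, 3.toByte)

// Assumed test helper that starts the Spark GPU session, which initializes RMM
// (e.g. with the ARENA pool).
withGpuSparkSession { spark =>
  // ... exercise the cached-batch code path using cv ...

  // close() now routes the free through RMM, but RMM has never seen this
  // address, so the deallocation fails.
  cv.close()
}

// The fix is to create the column vector only after RMM is initialized (inside
// the GPU session), so the same allocator performs both the allocation and the free.
```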

@winningsix winningsix marked this pull request as ready for review December 7, 2023 10:35
@winningsix
Collaborator Author

build

@firestarman
Collaborator

LGTM

@tgravescs
Collaborator

Do we know why this just started failing? It seems like this should have been an issue for a while. Were we not using the ARENA allocator before? Or maybe we aren't explicitly setting it and it was always run on machines with newer CUDA?

tgravescs previously approved these changes Dec 7, 2023
@sameerz sameerz added the bug Something isn't working label Dec 7, 2023
@winningsix
Collaborator Author

Were we not using the ARENA allocator before? Or maybe we aren't explicitly setting it and it was always run on machines with newer CUDA?

Yes, I think so. To reproduce this, I hardcoded the memory pool to arena.
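For reference, a sketch of forcing the ARENA pool through Spark configuration instead of hardcoding it (assuming the plugin's spark.rapids.memory.gpu.pool setting):

```scala
import org.apache.spark.SparkConf

// Force the RAPIDS plugin to use the ARENA allocator so the failure reproduces
// deterministically, rather than depending on the default for the local CUDA version.
val conf = new SparkConf()
  .set("spark.rapids.memory.gpu.pool", "ARENA")
```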

@winningsix
Collaborator Author

build

@pxLi
Collaborator

pxLi commented Dec 8, 2023

Do we need this to target 23.12, or since this is test-only, should we just ignore it in 23.12?

@winningsix
Collaborator Author

Do we need this to target 23.12, or since this is test-only, should we just ignore it in 23.12?

My bad. It should target 23.12 and then be merged back to 24.02. Let me re-target this.

@winningsix winningsix changed the base branch from branch-24.02 to branch-23.12 December 8, 2023 01:16
@winningsix winningsix dismissed tgravescs’s stale review December 8, 2023 01:16

The base branch was changed.

@winningsix
Collaborator Author

build

@pxLi pxLi merged commit a5c37fb into NVIDIA:branch-23.12 Dec 8, 2023
39 checks passed
Successfully merging this pull request may close these issues.

[BUG] test "convert large InternalRow iterator to cached batch single col" failed with arena pool