[SPARK-48258][PYTHON][CONNECT][FOLLOW-UP] Bind relation ID to the plan instead of DataFrame #46694
session_holder.dataFrameCache().getOrDefault(cached_remote_relation_id, None)
)

del df
gc.collect()
Unlike on the JVM, this does trigger a full GC.
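A minimal sketch of why an explicit `gc.collect()` makes the test deterministic: in CPython, calling `gc.collect()` with no arguments runs a full collection, so even objects kept alive only by a reference cycle are reclaimed at that point. The `Node` class and variable names here are hypothetical and unrelated to the PR's code.

```python
import gc
import weakref


class Node:
    """Hypothetical object used only to build a reference cycle."""


gc.disable()  # keep the automatic collector from running mid-example

a = Node()
b = Node()
a.other = b
b.other = a  # a <-> b now form a cycle

probe = weakref.ref(a)  # lets us observe whether `a` is still alive
del a, b

# Reference counting alone cannot free a cycle, so the object survives.
alive_before_collect = probe() is not None

gc.collect()  # no argument: full collection across all generations

alive_after_collect = probe() is not None

gc.enable()
```

After the explicit call, the weak reference is dead, which is the property the test relies on before checking the cache.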
def __del__(self) -> None:
    session = self._spark_session
    # If session is already closed, all cached DataFrame should be released.
Is it possible to release those cached DataFrames on the server side?
I don't think so. We can only tell when to release on the client side.
Without this change, we can only know whether the session is disconnected, and we already release everything in that case.
Merged to master.
What changes were proposed in this pull request?
This PR addresses the #46683 (comment) review comment on the Python side by binding the relation ID to the plan instead of the DataFrame itself.
Why are the changes needed?
Because the DataFrame holds the relation ID, if DataFrame B is derived from DataFrame A and DataFrame A is garbage-collected, the cache might no longer exist. See the example below:
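A hypothetical plain-Python sketch of this lifetime problem, not the PR's own example and not real Spark Connect API. `SERVER_CACHE`, `CachedRelation`, `FrameOwnsId`, and `FrameWithPlanId` are illustrative names; the sketch assumes CPython, where dropping the last reference runs `__del__` promptly.

```python
import gc

# Stand-in for the server-side DataFrame cache (hypothetical, not a Spark API).
SERVER_CACHE = {}


class CachedRelation:
    """Owns a cached relation ID and releases the cache entry when collected."""

    def __init__(self, relation_id):
        self.relation_id = relation_id
        SERVER_CACHE[relation_id] = "cached data"

    def __del__(self):
        SERVER_CACHE.pop(self.relation_id, None)


class FrameOwnsId:
    """Before: the frame owns the ID; derived frames copy only the ID string."""

    def __init__(self, relation_id, owner=None):
        self.plan = relation_id
        self._owner = owner

    @classmethod
    def create(cls, relation_id):
        return cls(relation_id, CachedRelation(relation_id))

    def derive(self):
        return FrameOwnsId(self.plan)  # the owner is not carried over


class FrameWithPlanId:
    """After: the plan itself is the owner, and derived frames share the plan."""

    def __init__(self, plan):
        self.plan = plan

    def derive(self):
        return FrameWithPlanId(self.plan)


# Before: collecting A releases the cache even though B still refers to it.
a = FrameOwnsId.create("rel-1")
b = a.derive()
del a
gc.collect()
cache_alive_before_fix = "rel-1" in SERVER_CACHE

# After: the cache entry lives as long as any plan still references it.
a = FrameWithPlanId(CachedRelation("rel-2"))
b = a.derive()
del a
gc.collect()
cache_alive_after_fix = "rel-2" in SERVER_CACHE

del b
gc.collect()
cache_released_at_end = "rel-2" not in SERVER_CACHE
```

In this sketch, tying the release to the plan object means the entry is dropped only once the last frame derived from it is gone, which mirrors the intent described above.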
Does this PR introduce any user-facing change?
No, the main change has not been released yet.
How was this patch tested?
Manually tested, and added a unit test.
Was this patch authored or co-authored using generative AI tooling?
No.