[SPARK-48056][CONNECT][PYTHON] Re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received #46297

Closed
nija-at wants to merge 14 commits

Contributor

@nija-at nija-at commented Apr 30, 2024

What changes were proposed in this pull request?

Similar to the handling of OPERATION_NOT_FOUND, re-attempt execution of
the original Spark Connect plan when a SESSION_NOT_FOUND error is
received from the Spark Connect service and no partial responses
have been received yet.

Why are the changes needed?

This error has been observed during a cluster cold start
and when a request arrives before the Connect service is fully
initialized.

Does this PR introduce any user-facing change?

Previously, Connect-based PySpark APIs would fail with the error code
"INVALID_HANDLE.SESSION_NOT_FOUND" on the very first request to
the service.
With this change, the client now retries automatically.
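
The retry rule described above can be sketched roughly as follows. This is an illustrative model only, not the code from this PR: `SessionNotFoundError`, `execute_with_reexecution`, `make_cold_start_service`, and the `max_attempts` parameter are hypothetical names invented for the example, and the "service" is a local stand-in rather than a real Spark Connect endpoint.

```python
from typing import Callable, Iterator


class SessionNotFoundError(Exception):
    """Hypothetical stand-in for an error carrying INVALID_HANDLE.SESSION_NOT_FOUND."""


def execute_with_reexecution(
    start_execution: Callable[[], Iterator[str]],
    max_attempts: int = 3,  # assumed limit, not taken from the PR
) -> Iterator[str]:
    """Yield responses, re-executing the plan only if nothing was received yet."""
    attempts = 0
    received_any = False
    while True:
        attempts += 1
        try:
            for response in start_execution():
                received_any = True
                yield response
            return
        except SessionNotFoundError:
            # Re-executing after partial results could duplicate output,
            # so retry only while no response has reached the caller.
            if received_any or attempts >= max_attempts:
                raise


def make_cold_start_service(failures_before_ready: int) -> Callable[[], Iterator[str]]:
    """Simulate a service that rejects the first few requests during cold start."""
    state = {"calls": 0}

    def start() -> Iterator[str]:
        state["calls"] += 1
        if state["calls"] <= failures_before_ready:
            raise SessionNotFoundError("INVALID_HANDLE.SESSION_NOT_FOUND")
        yield "batch-0"
        yield "batch-1"

    return start


if __name__ == "__main__":
    # The first request fails, the second succeeds after re-execution.
    print(list(execute_with_reexecution(make_cold_start_service(1))))
    # prints ['batch-0', 'batch-1']
```

The `received_any` guard is the important part of the sketch: once a partial response has been handed to the caller, the error is propagated instead of restarting the plan.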

How was this patch tested?

Attached unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@nija-at nija-at changed the title [WIP][SPARK-48056][CONNECT][PYTHON] Reset original request if a SESSION_NOT_FOUND error is raised and no partial response was received [SPARK-48056][CONNECT][PYTHON] Reset original request if a SESSION_NOT_FOUND error is raised and no partial response was received Apr 30, 2024
@nija-at nija-at marked this pull request as ready for review April 30, 2024 07:25
@nija-at nija-at changed the title [SPARK-48056][CONNECT][PYTHON] Reset original request if a SESSION_NOT_FOUND error is raised and no partial response was received [SPARK-48056][CONNECT][PYTHON] Re-execute original request if a SESSION_NOT_FOUND error is raised and no partial response was received Apr 30, 2024
@nija-at nija-at changed the title [SPARK-48056][CONNECT][PYTHON] Re-execute original request if a SESSION_NOT_FOUND error is raised and no partial response was received [SPARK-48056][CONNECT][PYTHON] Re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received Apr 30, 2024
@github-actions github-actions bot added the INFRA label Apr 30, 2024
Contributor

@juliuszsompolski juliuszsompolski left a comment

Contributor Author

nija-at commented Apr 30, 2024

@juliuszsompolski - this will be followed up in a separate PR

Contributor

@grundprinzip grundprinzip left a comment

I don't have a problem with adding the dependency, but we should make sure that it's only needed for testing and documented in the requirements file as such.

dev/requirements.txt (outdated, resolved)
dev/infra/Dockerfile (outdated, resolved)
@HyukjinKwon
Member

Merged to master.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Member

dongjoon-hyun commented May 4, 2024

This seems to have been missed in another PR too. We should find a way to prevent this.

dongjoon-hyun added a commit that referenced this pull request May 4, 2024
…f `assertEquals` for Python 3.12

### What changes were proposed in this pull request?

This is a follow-up of
- #46297

This PR aims to use `assertEqual` instead of `assertEquals` for Python 3.12.

### Why are the changes needed?

To recover Python CI,
- https://github.com/apache/spark/actions/workflows/build_python.yml

As of Python 3.12, the deprecated `assertEquals` alias no longer exists.

https://docs.python.org/3/library/unittest.html#assert-methods
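
As a small illustration of the spelling that works across Python versions, here is a minimal test case using `assertEqual`. The class name `AdditionTest` and the helper `run_suite` are made up for this example and are not part of the Spark test suite.

```python
import io
import unittest


class AdditionTest(unittest.TestCase):
    def test_sum(self):
        # assertEqual is the supported name; the assertEquals alias was
        # deprecated since Python 3.2 and removed in Python 3.12.
        self.assertEqual(sum([1, 2, 3]), 6)


def run_suite() -> unittest.TestResult:
    """Run the example test case and return its result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AdditionTest)
    # Write the runner's report to a buffer to keep the console quiet.
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


if __name__ == "__main__":
    print(run_suite().wasSuccessful())
```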

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46377 from dongjoon-hyun/SPARK-48056.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@HyukjinKwon
Member

Thanks for fixing

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… error is raised and no partial response was received

Closes apache#46297 from nija-at/session-not-found.

Authored-by: Niranjan Jayakar <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…f `assertEquals` for Python 3.12

Closes apache#46377 from dongjoon-hyun/SPARK-48056.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@changgyoopark-db
Contributor

changgyoopark-db commented Jun 13, 2024

zhengruifeng pushed a commit that referenced this pull request Jun 14, 2024
…ESSION_NOT_FOUND error is raised and no partial response was received

### What changes were proposed in this pull request?

This change lets a Scala Spark Connect client reattempt execution of a plan when it receives a SESSION_NOT_FOUND error from the Spark Connect service if it has not received any partial responses.

This is a Scala version of the previous fix of the same issue - #46297.

### Why are the changes needed?

Spark Connect clients often get a spurious error from the Spark Connect service if the service is busy or the network is congested. This error leads to a situation where the client immediately attempts to reattach without the service being aware of the client; this leads to a query failure.

### Does this PR introduce _any_ user-facing change?

Previously, a Scala Spark Connect client would fail with the error code "INVALID_HANDLE.SESSION_NOT_FOUND" on its very first request to the service. With this change, the client retries automatically.

### How was this patch tested?

Attached unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46971 from changgyoopark-db/SPARK-48056.

Authored-by: Changgyoo Park <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>

6 participants