Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32106][SQL][FOLLOWUP] Fix flaky tests in transform.sql #30896

Closed
wants to merge 5 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Dec 22, 2020

What changes were proposed in this pull request?

This PR intends to fix flaky GitHub Actions (GA) tests below in transform.sql (this flakiness does not seem to happen in the Jenkins tests):

This is because the error message is different between test runs in GA (the error message seems to be truncated indeterministically) ,e.g.,

# https://github.com/apache/spark/runs/1592987501
Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2

# https://github.com/apache/spark/runs/1593196242
Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2

The root cause of this indeterministic behaviour happening only in GA is not clear though, this test throws SparkException consistently even in GA. So, this PR proposes to make the test just check if it will be thrown when running it.

This PR comes from the @dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513

Why are the changes needed?

Bugfix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe what was the root cause of the flakiness, @maropu ?

@AngersZhuuuu
Copy link
Contributor

Seems some subprocess didn't write error message to stderrBuffer in checkFailureAndPropagate

@maropu
Copy link
Member Author

maropu commented Dec 23, 2020

Seems some subprocess didn't write error message to stderrBuffer in checkFailureAndPropagate

hm, that might be so because the error message seems to be truncated indeterministically. But, I've checked test logs in both GA/Jenkins, then it seems this flakiness happens only in GA tests. So, I'm not really sure about why...

@AngersZhuuuu
Copy link
Contributor

Seems some subprocess didn't write error message to stderrBuffer in checkFailureAndPropagate

hm, that might be so because the error message seems to be truncated indeterministically. But, it seems this flakiness happens only in GA tests. So, I'm not really sure about why...

Not sure +1...

@maropu
Copy link
Member Author

maropu commented Dec 23, 2020

Could you describe what was the root cause of the flakiness, @maropu ?

Yea, thanks for the comment, @dongjoon-hyun and I've updated the description. I actually don't know what the root cause is now. So, the current approach of this PR just rewrote the test in a safer way. Any suggestion?

@dongjoon-hyun
Copy link
Member

Then, could you add a few empty commits in this PR to verify the stableness, @maropu ?

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. +1 for @dongjoon-hyun's suggestion to trigger few more times with empty commits.

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37849/

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37849/

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Test build #133253 has finished for PR 30896 at commit b29d6a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (Pending GitHub Action sql - slow tests results)
Thank you for the investigation and this quick fix.

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Test build #133245 has finished for PR 30896 at commit d205a13.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@maropu
Copy link
Member Author

maropu commented Dec 23, 2020

Thanks, all!

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Test build #133251 has finished for PR 30896 at commit b38541f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jan 6, 2021
…condition to fix flaky test

### What changes were proposed in this pull request?
Follow comment and fix. flaky test #30973 (comment).
This flaky test is similar as #30896

Some task's failed with root cause but in driver may return error without root cause , change. UT to check with status exit code since different root cause's exit code is not same.

### Why are the changes needed?
Fix flaky test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #31046 from AngersZhuuuu/SPARK-33934-FOLLOW-UP.

Lead-authored-by: angerszhu <[email protected]>
Co-authored-by: AngersZhuuuu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants