[SPARK-32658][CORE] Fix PartitionWriterStream partition length overflow #29474
Conversation
Test build #127626 has finished for PR 29474 at commit
retest this please
Good catch!! LGTM
good catch!
cc @zhengruifeng FYI. This is a blocker
Test build #127628 has finished for PR 29474 at commit
retest this please
Test build #127641 has finished for PR 29474 at commit
retest this please
Test build #127660 has finished for PR 29474 at commit
thanks, merging to master/3.0!
…flow

### What changes were proposed in this pull request?
The `count` in `PartitionWriterStream` should be a long value, instead of an int. The issue was introduced by apache/spark@abef84a. When the overflow happens, the shuffle index file records a wrong index for a reduceId, leading to a `FetchFailedException: Stream is corrupted` error. Besides the fix, I also added some debug logs, so similar issues will be easier to debug in the future.

### Why are the changes needed?
This is a regression and bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A Spark user reported this issue when migrating their workload to 3.0. One of the jobs fails deterministically on Spark 3.0 without the patch, and the job succeeds after applying the fix.

Closes #29474 from jiangxb1987/fixPartitionWriteStream.
Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f793977)
Signed-off-by: Wenchen Fan <[email protected]>
cc @vanzin @squito @jerryshao who are the major reviewers of the original PR #25007
Nice catch @jiangxb1987!
What changes were proposed in this pull request?
The `count` in `PartitionWriterStream` should be a long value, instead of an int. The issue was introduced by abef84a. When the overflow happens, the shuffle index file records a wrong index for a reduceId, leading to a `FetchFailedException: Stream is corrupted` error.
Besides the fix, I also added some debug logs, so similar issues will be easier to debug in the future.
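The failure mode can be illustrated with a minimal sketch. The class and method names below are hypothetical, not Spark's actual `PartitionWriterStream` API: a byte counter declared as `int` wraps negative in Java's two's-complement arithmetic once more than `Integer.MAX_VALUE` (about 2 GiB) bytes have been recorded, so any index derived from it is corrupt, while the same counter declared as `long` stays correct.

```java
public class OverflowDemo {
    // Buggy sketch: an int counter, mirroring the pre-fix behavior.
    static class IntByteCounter {
        private int count = 0;
        void recordWrite(int numBytes) { count += numBytes; } // wraps past 2^31 - 1
        long getCount() { return count; }
    }

    // Fixed sketch: the same counter as a long.
    static class LongByteCounter {
        private long count = 0L;
        void recordWrite(int numBytes) { count += numBytes; }
        long getCount() { return count; }
    }

    public static void main(String[] args) {
        IntByteCounter buggy = new IntByteCounter();
        LongByteCounter fixed = new LongByteCounter();
        // Simulate writing 3 GiB of partition data in 1 MiB chunks
        // (counting only; no real I/O is performed).
        int chunk = 1 << 20;
        for (int i = 0; i < 3 * 1024; i++) {
            buggy.recordWrite(chunk);
            fixed.recordWrite(chunk);
        }
        System.out.println(buggy.getCount()); // prints -1073741824 (wrapped)
        System.out.println(fixed.getCount()); // prints 3221225472 (3 GiB)
    }
}
```

A negative or wrapped length written into the shuffle index would make reducers seek to the wrong offsets, which matches the `Stream is corrupted` symptom described above.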
Why are the changes needed?
This is a regression and bug fix.
Does this PR introduce any user-facing change?
No
How was this patch tested?
A Spark user reported this issue when migrating their workload to 3.0. One of the jobs fails deterministically on Spark 3.0 without the patch, and the job succeeds after applying the fix.