[SPARK-32658][CORE] Fix PartitionWriterStream partition length overflow #29474
Conversation
Test build #127626 has finished for PR 29474 at commit
retest this please
Good catch!! LGTM
good catch!
cc @zhengruifeng FYI. This is a blocker
Test build #127628 has finished for PR 29474 at commit
retest this please
Test build #127641 has finished for PR 29474 at commit
retest this please
Test build #127660 has finished for PR 29474 at commit
thanks, merging to master/3.0!
…flow

### What changes were proposed in this pull request?
The `count` in `PartitionWriterStream` should be a long value, instead of an int. The issue was introduced by apache/spark@abef84a. When the overflow happens, the shuffle index file records a wrong index for a reduceId, leading to a `FetchFailedException: Stream is corrupted` error. Besides the fix, I also added some debug logs, so similar issues will be easier to debug in the future.

### Why are the changes needed?
This is a regression and bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A Spark user reported this issue when migrating their workload to 3.0. One of the jobs fails deterministically on Spark 3.0 without the patch, and the job succeeds after applying the fix.

Closes #29474 from jiangxb1987/fixPartitionWriteStream.
Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f793977)
Signed-off-by: Wenchen Fan <[email protected]>
cc @vanzin @squito @jerryshao who are the major reviewers of the original PR #25007
Nice catch @jiangxb1987!
What changes were proposed in this pull request?
The `count` in `PartitionWriterStream` should be a long value, instead of an int. The issue was introduced by abef84a. When the overflow happens, the shuffle index file records a wrong index for a reduceId, leading to a `FetchFailedException: Stream is corrupted` error.
Besides the fix, I also added some debug logs, so similar issues will be easier to debug in the future.
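The failure mode can be illustrated with a minimal sketch. The class and method names below are hypothetical, not Spark's actual `PartitionWriterStream` API: a byte counter declared as `int` wraps negative in Java's two's-complement arithmetic once more than `Integer.MAX_VALUE` (about 2 GiB) bytes have been recorded, so any index derived from it is corrupt, while the same counter declared as `long` stays correct.

```java
public class OverflowDemo {
    // Buggy sketch: an int counter, mirroring the pre-fix behavior.
    static class IntByteCounter {
        private int count = 0;
        void recordWrite(int numBytes) { count += numBytes; } // wraps past 2^31 - 1
        long getCount() { return count; }
    }

    // Fixed sketch: the same counter as a long.
    static class LongByteCounter {
        private long count = 0L;
        void recordWrite(int numBytes) { count += numBytes; }
        long getCount() { return count; }
    }

    public static void main(String[] args) {
        IntByteCounter buggy = new IntByteCounter();
        LongByteCounter fixed = new LongByteCounter();
        // Simulate writing 3 GiB of partition data in 1 MiB chunks
        // (counting only; no real I/O is performed).
        int chunk = 1 << 20;
        for (int i = 0; i < 3 * 1024; i++) {
            buggy.recordWrite(chunk);
            fixed.recordWrite(chunk);
        }
        System.out.println(buggy.getCount()); // prints -1073741824 (wrapped)
        System.out.println(fixed.getCount()); // prints 3221225472 (3 GiB)
    }
}
```

A negative or wrapped length written into the shuffle index would make reducers seek to the wrong offsets, which matches the `Stream is corrupted` symptom described above.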
Why are the changes needed?
This is a regression and bug fix.
Does this PR introduce any user-facing change?
No
How was this patch tested?
A Spark user reported this issue when migrating their workload to 3.0. One of the jobs fails deterministically on Spark 3.0 without the patch, and the job succeeds after applying the fix.