
[SPARK-26734] [STREAMING] Fix StackOverflowError with large block queue #23716

Closed
wants to merge 5 commits

Conversation

rlodge
Contributor

@rlodge rlodge commented Jan 31, 2019

What changes were proposed in this pull request?

SPARK-23991 introduced a bug in `ReceivedBlockTracker#allocateBlocksToBatch`: when the queue for a stream contains more than a few thousand blocks, serializing the queue throws a StackOverflowError. This change just adds a `dequeueAll` call after the `clone` operation introduced in SPARK-23991, so that fix is preserved but the serialized data comes from an ArrayBuffer, which doesn't have the serialization problems that `mutable.Queue` has.
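The shape of the fix can be sketched on a plain queue, outside Spark (the object and method names below are illustrative, not Spark's):

```scala
import scala.collection.mutable

object DequeueAllSketch {
  // Drain a clone of the queue; the caller's queue is left untouched.
  // In Scala 2.12, Queue#dequeueAll collects its result into an
  // ArrayBuffer, which Java-serializes iteratively rather than
  // link-by-link, side-stepping the StackOverflowError.
  def drainedCopy[A](queue: mutable.Queue[A]): Seq[A] =
    queue.clone().dequeueAll(_ => true)

  def main(args: Array[String]): Unit = {
    val queue = mutable.Queue(1, 2, 3, 4, 5)
    val toSerialize = drainedCopy(queue)
    println(toSerialize.mkString(","))  // FIFO order: 1,2,3,4,5
    println(queue.size)                 // original queue still holds 5
  }
}
```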

How was this patch tested?

A unit test was added.

@@ -112,7 +112,7 @@ private[streaming] class ReceivedBlockTracker(
   def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
     if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
       val streamIdToBlocks = streamIds.map { streamId =>
-        (streamId, getReceivedBlockQueue(streamId).clone())
+        (streamId, getReceivedBlockQueue(streamId).clone().dequeueAll(x => true))
Member

This basically just copies the content, right? Would it be more robust to construct a new `mutable.Seq` from the queue, to be sure about what the type is? I know it won't be a `Queue` here, but it sounds like the representation does matter.

Was https://issues.apache.org/jira/browse/SPARK-23358 / https://issues.apache.org/jira/browse/SPARK-23391 really the cause or did it mostly uncover this?

Contributor Author

Sure, I could do:

mutable.ArrayBuffer.empty[ReceivedBlockInfo] ++= getReceivedBlockQueue(streamId).clone()

which ensures that we know exactly the sequence type being serialized; would you prefer I did that?

https://issues.apache.org/jira/browse/SPARK-23991 caused it in the sense that it moved from using `dequeueAll` to serializing a clone of the queue directly, and the queue simply doesn't serialize correctly past some number of entries (a Scala bug, in my opinion). You could say it "uncovered" a Scala bug that was already sitting there, but prior to that change the code wouldn't error with large numbers of entries, because `dequeueAll` constructs an ArrayBuffer and that's what was being serialized.

Member

Oh, I was looking at the wrong JIRA; number typo. Yeah, I see it. @gaborgsomogyi does that make sense?

How about `ArrayBuffer(...: _*)`? It might be better, as it could allocate exactly the right size.

Contributor

That's not a Scala limitation but a Java one. Serialization is vulnerable to stack overflow for certain kinds of structures: for example, a long linked list with no special writeObject() method will be serialized by recursively writing each link. If you've got 100k links, you're going to try to use 100k stack frames, and quite likely fail with a StackOverflowError. The main thing here is to use something that is not linked.
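This recursion can be reproduced without Spark. Below is a sketch with a hypothetical hand-rolled linked node (standing in for any linked structure that lacks a custom writeObject), showing default Java serialization succeeding on a short chain and overflowing the stack on a very deep one:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Minimal singly linked node relying on default Java serialization,
// analogous to a linked Queue representation with no custom writeObject().
class Node(val value: Int, val next: Node) extends Serializable

object LinkedSerializationDemo {
  // Build a chain of the given length, iteratively (no recursion here).
  def buildChain(length: Int): Node = {
    var head: Node = null
    var i = 0
    while (i < length) { head = new Node(i, head); i += 1 }
    head
  }

  // True if default serialization succeeds, false on stack overflow.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)   // recurses once per link for linked data
      out.close()
      true
    } catch {
      case _: StackOverflowError => false
    }

  def main(args: Array[String]): Unit = {
    println(serializes(buildChain(100)))      // short chain: fine
    println(serializes(buildChain(1000000)))  // deep chain: overflows the stack
  }
}
```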

Contributor

I would also choose a solution that allocates the buffer in one step, so we avoid the repeated reallocation-and-copy hiding behind amortized O(1) appends.

Additionally, it is a somewhat odd construct that later developers may be tempted to "simplify" away. I would add a comment here describing the issue, so people aren't forced to analyze this again.
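The "one step" point can be illustrated with the candidate constructions side by side (plain Int elements here; the real code uses ReceivedBlockInfo):

```scala
import scala.collection.mutable

object OneStepAllocation {
  def main(args: Array[String]): Unit = {
    val queue = mutable.Queue(1, 2, 3)

    // Incremental: appends are amortized O(1), but the backing array
    // may still be reallocated and copied several times as it grows.
    val incremental = mutable.ArrayBuffer.empty[Int] ++= queue.clone()

    // Pre-sized: capacity is reserved once, up front.
    val presized = new mutable.ArrayBuffer[Int](queue.size) ++= queue.clone()

    // Varargs: lets the builder size the copy from the source in one pass.
    val varargs = mutable.ArrayBuffer(queue.clone(): _*)

    // All three hold the same elements in the same order.
    println(incremental == presized && presized == varargs)  // true
  }
}
```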

Member

Yes, I think that's exactly the point here, but a Queue apparently has a linked-list-like representation, while an ArrayBuffer won't. Are you saying that's the right change, or not? My point was that we're not actually guaranteed what dequeueAll returns.

A comment would be good, sure.

Contributor

I think `.dequeueAll(x => true)` works, but it's not the right change; I would propose something similar to what you've mentioned, for example `ArrayBuffer(...)`. The main point is to allocate the buffer in one step, which is not the case with `dequeueAll`.

Contributor Author

@rlodge rlodge Feb 4, 2019


Sure, I can add a comment. Is

ArrayBuffer(queue.clone(): _*)

sufficient, or do we need

val clonedQueue = getReceivedBlockQueue(streamId).clone()
val bufferToWrite = new mutable.ArrayBuffer[ReceivedBlockInfo](clonedQueue.size)
bufferToWrite ++= clonedQueue
(streamId, bufferToWrite)

Member

The first one looks good to me. It definitely makes an ArrayBuffer! And I can only imagine the ordering produced is the order of dequeueing the elements; it was so when I tried it locally.

Contributor Author

Ok, I pushed those changes.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

Thanks for the efforts! I would write down the root cause of this issue in the description as well; it's not obvious why this solves the problem.

+    val receivedBlockTracker = createTracker(setCheckpointDir = true)
+    val blockInfos = generateBlockInfos(100000)
+    blockInfos.map(receivedBlockTracker.addBlock)
+    receivedBlockTracker.allocateBlocksToBatch(1)
Contributor

I would add assertions here as well; just because it doesn't throw an exception doesn't mean the same blocks were deserialized.

Contributor Author

Sure.
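The point of the assertion is that the data must survive a serialize/deserialize round trip, not merely avoid the overflow. A self-contained sketch of that idea (names mine, not the suite's; the real test asserts against the tracker's state):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

object RoundTripAssertion {
  // Serialize to bytes and read the object back, as a WAL write/recovery would.
  def roundTrip[A <: AnyRef](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[A]
  }

  def main(args: Array[String]): Unit = {
    val blocks = mutable.ArrayBuffer.range(0, 100000)
    val restored = roundTrip(blocks)
    // The deserialized buffer must equal the original, element for element.
    assert(restored == blocks)
    println(restored.size)  // 100000
  }
}
```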

@@ -96,6 +96,13 @@ class ReceivedBlockTrackerSuite
     receivedBlockTracker.getUnallocatedBlocks(streamId) shouldEqual blockInfos
   }
 
+  test("block addition, and block to batch allocation with many blocks") {
+    val receivedBlockTracker = createTracker(setCheckpointDir = true)
Contributor

`setCheckpointDir` is already true by default.

Contributor Author

ok

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

Apart from the minor things, this looks good.

       val streamIdToBlocks = streamIds.map { streamId =>
-        (streamId, getReceivedBlockQueue(streamId).clone())
+        (streamId,
+          mutable.ArrayBuffer(getReceivedBlockQueue(streamId).clone(): _*))
Contributor

Nit: no line break required.

Contributor Author

OK, sure.

@@ -96,6 +96,25 @@ class ReceivedBlockTrackerSuite
     receivedBlockTracker.getUnallocatedBlocks(streamId) shouldEqual blockInfos
   }
 
+  test("block addition, and block to batch allocation with many blocks") {
+    val receivedBlockTracker = createTracker()
+    receivedBlockTracker.isWriteAheadLogEnabled should be (true)
Contributor

It would be good to close the tracker. I know not all tests do this (which is a bug), but it would be good to make it clean here.

Contributor Author

Sure, done.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

LGTM.

@SparkQA

SparkQA commented Feb 6, 2019

Test build #4550 has finished for PR 23716 at commit 9cd6fb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen pushed a commit that referenced this pull request Feb 6, 2019
## What changes were proposed in this pull request?

SPARK-23991 introduced a bug in `ReceivedBlockTracker#allocateBlocksToBatch`: when the queue for a stream contains more than a few thousand blocks, serializing the queue throws a StackOverflowError. This change just adds `dequeueAll` to the new `clone` operation on the queue so that the fix in 23991 is preserved but the serialized data comes from an ArrayBuffer which doesn't have the serialization problems that mutable.Queue has.

## How was this patch tested?

A unit test was added.

Closes #23716 from rlodge/SPARK-26734.

Authored-by: Ross Lodge <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 8427e9b)
Signed-off-by: Sean Owen <[email protected]>
srowen pushed a commit that referenced this pull request Feb 6, 2019
@srowen srowen closed this in 8427e9b Feb 6, 2019
@srowen
Member

srowen commented Feb 6, 2019

Merged to master/2.4/2.3

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019