
[SPARK-12429][Streaming][Doc]Add Accumulator and Broadcast example for Streaming #10385

Closed
wants to merge 4 commits into from

Conversation

zsxwing
Member

@zsxwing zsxwing commented Dec 18, 2015

This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.

@zsxwing
Member Author

zsxwing commented Dec 18, 2015

@tdas could you take a look before I start to add Java and Python examples?

@SparkQA

SparkQA commented Dec 18, 2015

Test build #48018 has finished for PR 10385 at commit 9928ca5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaTwitterHashTagJoinSentiments
    • case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@@ -1415,6 +1415,95 @@ Note that the connections in the pool should be lazily created on demand and tim

***

## Accumulator and Broadcast

Accumulators and Broadcast variables cannot be recovered from checkpoint in Spark Streaming. If you enable checkpointing and use Accumulators or Broadcast variables as well, you'll have to create lazily instantiated singleton instances for them so that they can be re-instantiated after the driver restarts on failure. This is shown in the following example.
Contributor

I'd say: "in Spark Streaming. If you enable checkpointing and use an Accumulator or Broadcast as well, you**'ll** have to create ..."
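The lazily instantiated singleton the doc text calls for can be sketched in plain Python. This is only an illustration of the pattern, not Spark's API: there is no SparkContext here, and all names (`get_instance`, `word_blacklist`) are made up for the example.

```python
# Minimal sketch of the lazily instantiated singleton pattern described
# above (plain Python, no Spark; all names here are illustrative).
_instances = {}

def get_instance(name, factory):
    """Create the named object on first use and cache it.

    After a driver restart the cache starts empty, so the object is
    simply re-created by the factory instead of being recovered from
    a checkpoint.
    """
    if name not in _instances:
        _instances[name] = factory()
    return _instances[name]

# Usage: the same cached object is returned on every later call,
# and the second factory is never invoked.
blacklist = get_instance("word_blacklist", lambda: {"a", "b", "c"})
again = get_instance("word_blacklist", lambda: set())
```

In the real Spark examples added by this PR, the factory call would be `sparkContext.broadcast(...)` or an accumulator creation, guarded the same way.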

@zsxwing zsxwing changed the title [SPARK-12429][Streaming][Doc]Add Accumulator and Broadcast Scala example for Streaming [SPARK-12429][Streaming][Doc][WIP]Add Accumulator and Broadcast example for Streaming Dec 18, 2015
@SparkQA

SparkQA commented Dec 19, 2015

Test build #48038 has finished for PR 10385 at commit 78d15bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing zsxwing changed the title [SPARK-12429][Streaming][Doc][WIP]Add Accumulator and Broadcast example for Streaming [SPARK-12429][Streaming][Doc]Add Accumulator and Broadcast example for Streaming Dec 21, 2015
@SparkQA

SparkQA commented Dec 21, 2015

Test build #48120 has finished for PR 10385 at commit 9e241e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaWordBlacklist
    • class JavaDroppedWordsCounter

@zsxwing
Member Author

zsxwing commented Dec 21, 2015

Added Java and Python examples.

@BenFradet
Contributor

lgtm

@@ -1415,6 +1415,185 @@ Note that the connections in the pool should be lazily created on demand and tim

***

## Accumulator and Broadcast
Contributor

Accumulators and Broadcast variables

@tdas
Contributor

tdas commented Dec 22, 2015

Small comments, otherwise LGTM.

@@ -806,7 +806,7 @@ However, in `cluster` mode, what happens is more complicated, and the above may

What is happening here is that the variables within the closure sent to each executor are now copies and thus, when **counter** is referenced within the `foreach` function, it's no longer the **counter** on the driver node. There is still a **counter** in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of **counter** will still be zero since all operations on **counter** were referencing the value within the serialized closure.

To ensure well-defined behavior in these sorts of scenarios one should use an [`Accumulator`](#AccumLink). Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.
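The copy semantics described above can be imitated without a cluster. The following is a hedged sketch, not Spark code: `run_task` is a hypothetical stand-in for an executor task, and `copy.deepcopy` stands in for the serialize/deserialize round trip that a closure goes through on its way to an executor.

```python
import copy

# "Driver-side" variable captured by the closure.
counter = {"value": 0}

def run_task(captured):
    # Each executor deserializes its own copy of the closure's
    # variables, so the increment lands on the copy, not on the
    # driver's object.
    local = copy.deepcopy(captured)
    local["value"] += 1
    return local["value"]

# Run four "tasks"; every one sees its own fresh copy starting at 0.
results = [run_task(counter) for _ in range(4)]
# The driver's counter is unchanged; only the per-task copies moved.
```

An Accumulator avoids this trap because updates are shipped back to the driver and merged there, rather than being lost in per-task copies.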
Member Author

I also fixed the broken Accumulator link in programming-guide.md

Contributor

nice!

@SparkQA

SparkQA commented Dec 23, 2015

Test build #48222 has finished for PR 10385 at commit 455968a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaWordBlacklist
    • class JavaDroppedWordsCounter

asfgit pushed a commit that referenced this pull request Dec 23, 2015
…or Streaming

This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.

Author: Shixiong Zhu <[email protected]>

Closes #10385 from zsxwing/accumulator-broadcast-example.

(cherry picked from commit 20591af)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit asfgit closed this in 20591af Dec 23, 2015
@zsxwing zsxwing deleted the accumulator-broadcast-example branch December 23, 2015 05:53