
[SPARK-18758][SS] StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query #16186

Closed
wants to merge 6 commits

Conversation

@tdas tdas commented Dec 7, 2016

What changes were proposed in this pull request?

Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession, so only events from queries in the same session as a listener should be posted to that listener. Currently, all events get rerouted through Spark's main listener bus, as follows:

  • StreamingQuery posts events to StreamingQueryListenerBus. Only the queries associated with the same session as the bus post events to it.
  • StreamingQueryListenerBus posts the event to Spark's main LiveListenerBus as a SparkEvent.
  • StreamingQueryListenerBus also subscribes to LiveListenerBus events, thus getting back the posted event in a different thread.
  • The received event is posted to the registered listeners.

The problem is that all StreamingQueryListenerBuses in all sessions get the events and post them to their listeners. This is wrong.

In this PR, I solve it by making StreamingQueryListenerBus track active queries (by their runIds) when a query posts the QueryStarted event to the bus. This allows the rerouted events to be filtered using the tracked queries.

Note that this list needs to be maintained separately from `StreamingQueryManager.activeQueries`, because a terminated query is cleared from `StreamingQueryManager.activeQueries` as soon as it is stopped, but this ListenerBus must clear a query only after that query's termination event has been posted, which happens lazily, well after the query has terminated.
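As a minimal, self-contained sketch of this idea, with hypothetical simplified event and bus types (not Spark's actual StreamingQueryListenerBus API):

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical, simplified sketch: the bus records a query's runId when the
// QueryStarted event is posted directly to it, and later drops any rerouted
// event whose runId it never registered (i.e. a query from another session).
sealed trait Event { def runId: UUID }
case class QueryStartedEvent(runId: UUID) extends Event
case class QueryTerminatedEvent(runId: UUID) extends Event

class SessionListenerBus {
  private val activeQueryRunIds = new mutable.HashSet[UUID]
  val delivered = mutable.Buffer[Event]()

  // Called synchronously by queries belonging to this session.
  def postDirectly(started: QueryStartedEvent): Unit = {
    activeQueryRunIds.synchronized { activeQueryRunIds += started.runId }
  }

  // Called for every event rerouted through the shared LiveListenerBus,
  // including events originating from queries in other sessions.
  def onReroutedEvent(event: Event): Unit = {
    val known = activeQueryRunIds.synchronized { activeQueryRunIds.contains(event.runId) }
    if (known) {
      delivered += event
      event match {
        // Clear the runId only after its termination event has been delivered.
        case _: QueryTerminatedEvent =>
          activeQueryRunIds.synchronized { activeQueryRunIds -= event.runId }
        case _ =>
      }
    }
  }
}

val mine = new SessionListenerBus
val runId = UUID.randomUUID()
val otherRunId = UUID.randomUUID() // a query from a different session

mine.postDirectly(QueryStartedEvent(runId))
// Both queries' events are rerouted to every bus, but only this session's survive.
mine.onReroutedEvent(QueryStartedEvent(runId))
mine.onReroutedEvent(QueryStartedEvent(otherRunId))
mine.onReroutedEvent(QueryTerminatedEvent(runId))

assert(mine.delivered.size == 2)
assert(mine.delivered.forall(_.runId == runId))
```

The key detail the note above calls out is visible here: the runId is removed only when the termination event is actually delivered, not when the query stops.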

Credit goes to @zsxwing for coming up with the initial idea.

How was this patch tested?

Updated test harness code to use the correct session, and added new unit test.

@@ -70,11 +70,11 @@ case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext)

def schema: StructType = encoder.schema

def toDS()(implicit sqlContext: SQLContext): Dataset[A] = {
Contributor Author

Removed this because it is not needed; the sqlContext is in the constructor. Rather, it causes confusion in a multi-session environment.
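The multi-session confusion alluded to here can be sketched as follows; the types and method names are hypothetical simplifications, not Spark's actual MemoryStream:

```scala
// Hypothetical sketch: if toDS() took an implicit context, a stream built
// against session A could silently pick up session B's context at the call
// site. Using the constructor's context removes the ambiguity.
case class SQLContext(name: String)

case class MemoryStream(sqlContext: SQLContext) {
  // Old shape: the implicit parameter can differ from the constructor's context.
  def toDSImplicit()(implicit ctx: SQLContext): SQLContext = ctx
  // New shape: always use the context the stream was constructed with.
  def toDS(): SQLContext = sqlContext
}

val sessionA = SQLContext("A")
val sessionB = SQLContext("B")
val stream = MemoryStream(sessionA)

implicit val current: SQLContext = sessionB // ambient session at the call site
assert(stream.toDSImplicit() == sessionB)   // surprising: B's context wins
assert(stream.toDS() == sessionA)           // unambiguous
```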

var pos = 0
var currentPlan: LogicalPlan = stream.logicalPlan
Contributor Author

not used.

@@ -319,7 +319,6 @@ trait StreamTest extends QueryTest with SharedSQLContext with Timeouts {
""".stripMargin)
}

val testThread = Thread.currentThread()
Contributor Author

not used.

@SparkQA

SparkQA commented Dec 7, 2016

Test build #69769 has finished for PR 16186 at commit 9585ae4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas

tdas commented Dec 7, 2016

@marmbrus @zsxwing @brkyvz Please review.

@zsxwing zsxwing left a comment


Looks good overall. Just a minor style issue.

override protected def doPostEvent(
listener: StreamingQueryListener,
event: StreamingQueryListener.Event): Unit = {
val runIdsToReportTo = activeQueryRunIds.synchronized { activeQueryRunIds.toSet }
Member

Why do you need to clone the set? You can just use `activeQueryRunIds.synchronized { activeQueryRunIds.contains(...) }`, right?

@tdas tdas Dec 7, 2016


Otherwise the code would look like:

      case queryStarted: QueryStartedEvent =>
        if (activeQueryRunIds.synchronized { activeQueryRunIds.contains(queryStarted.runId) }) {
          listener.onQueryStarted(queryStarted)
        }
      case queryProgress: QueryProgressEvent =>
        if (activeQueryRunIds.synchronized { activeQueryRunIds.contains(queryProgress.progress.runId) }) {
          listener.onQueryProgress(queryProgress)
        }
      case queryTerminated: QueryTerminatedEvent =>
        if (activeQueryRunIds.synchronized { activeQueryRunIds.contains(queryTerminated.runId) }) {
          listener.onQueryTerminated(queryTerminated)
          activeQueryRunIds.synchronized { activeQueryRunIds -= queryTerminated.runId }
        }

Member

Looks good? It also reduces the number of lines :)

Contributor Author

i think this looks uglier. repeated code. longer lines? :)

Contributor Author

Actually found a middle ground :D

    def shouldReport(runId: UUID): Boolean = {
      activeQueryRunIds.synchronized { activeQueryRunIds.contains(runId) }
    }

    event match {
      case queryStarted: QueryStartedEvent =>
        if (shouldReport(queryStarted.runId)) {
          listener.onQueryStarted(queryStarted)
        }
      case queryProgress: QueryProgressEvent =>
        if (shouldReport(queryProgress.progress.runId)) {
          listener.onQueryProgress(queryProgress)
        }
      case queryTerminated: QueryTerminatedEvent =>
        if (shouldReport(queryTerminated.runId)) {
          listener.onQueryTerminated(queryTerminated)
          activeQueryRunIds.synchronized { activeQueryRunIds -= queryTerminated.runId }
        }
      case _ =>
    }

* `StreamingQueryManager.activeQueries` as soon as it is stopped, but this ListenerBus must
* clear a query only after the termination event of that query has been posted.
*/
private val activeQueryRunIds = new mutable.HashSet[UUID]
Contributor

just a qq: why do we use runIds instead of the ids of the streams? We already don't want concurrent runs for streams, since the offset log directories would get messed up.

Wait, ok, got it. onStarts are called synchronously whereas onTerminations are asynchronous. Basically we can get a second stream start report before the first run completes. Do you think that's worth adding to the docs? You don't have to if you don't need a second pass.

@zsxwing zsxwing Dec 7, 2016


+1

It's worth documenting. This is different from Spark's other listener buses because of the synchronous events.

Contributor Author

Even if this behavior were not different (that is, all async), this component should not be responsible for preventing concurrent runs. This component should be simple and not deal with such issues. I have added more docs on why runIds are used instead of ids.
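The runId-vs-id distinction discussed above can be sketched as follows; `RunHandle` and these names are hypothetical illustrations, not Spark API:

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical sketch: restarting a query keeps its id but assigns a fresh
// runId. Because termination events arrive asynchronously, the second run can
// start before the first run's QueryTerminated event has been delivered.
case class RunHandle(id: UUID, runId: UUID)

val queryId = UUID.randomUUID()
val firstRun = RunHandle(queryId, UUID.randomUUID())
val secondRun = RunHandle(queryId, UUID.randomUUID()) // restart of the same query

assert(firstRun.id == secondRun.id)     // same query...
assert(firstRun.runId != secondRun.runId) // ...but distinguishable runs

// Keyed by runId, the late termination of the first run cannot evict the
// restarted run; keyed by id, it would.
val active = mutable.HashSet[UUID]()
active += firstRun.runId
active += secondRun.runId // both runs tracked while events are in flight
active -= firstRun.runId  // late QueryTerminated for the first run arrives
assert(active.contains(secondRun.runId))
```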

@zsxwing

zsxwing commented Dec 8, 2016

LGTM pending tests

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69831 has finished for PR 16186 at commit 1e4411f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69835 has finished for PR 16186 at commit 986d7b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 8, 2016
Author: Tathagata Das <[email protected]>

Closes #16186 from tdas/SPARK-18758.

(cherry picked from commit 9ab725e)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit asfgit closed this in 9ab725e Dec 8, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017