[Fault tolerance] Make executor actor restartable and fetch data from blockManager in actor calls #249

kira-lin · 2022-06-29T01:24:54Z

No description provided.

kira-lin · 2022-06-29T01:37:49Z

If we use a normal task that calls the actor internally, the actor method still needs to be resubmitted because it's a dependency. I'm not sure whether using Spark to call the executors would work, but I think a SparkContext would be needed to do so.

carsonwang

Thanks for the work!

carsonwang · 2022-09-19T06:47:26Z

core/src/main/scala/org/apache/spark/deploy/raydp/ApplicationInfo.scala

@@ -138,6 +142,12 @@ private[spark] class ApplicationInfo(
    registeredExecutors
  }

+  def isRemovedExecutor(executorId: String): Boolean = {


Is this used anywhere?

This was used in a previous implementation. I'll remove it.

carsonwang · 2022-09-21T03:26:57Z

core/src/main/scala/org/apache/spark/deploy/raydp/RayAppMaster.scala

+          val memory = appInfo.desc.memoryPerExecutorMB
+          val newExecutorId = s"${appInfo.getNextExecutorId()}"
+          // ray actor will restart using the old ID
+          val handlerOpt = Ray.getActor("raydp-executor-" + executorId)


What if the executor fails twice and restarts? The second time, the executor ID is no longer the original ID, so won't there be a problem getting the actor? Can you add a test for this case?

No matter how many times it has failed, it is restarted with the parameters given when it was first created. The name won't change either.

After the first restart, I saw that the executorId is modified in RayCoarseGrainedExecutorBackend. Where is the original ID stored, and how can it be retrieved?

Yes, it is changed. The original ID is not stored in the executor. It is saved in a mapping in RayAppMaster, restartedExecutors. Ray also stores it in the task's lineage.
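
The bookkeeping described in this thread can be sketched as follows. This is a language-neutral illustration written in Python, not the RayDP implementation: the class `AppMasterSketch`, its methods, and the direction of the mapping (new ID to original ID) are assumptions made for the example. Only the name `restartedExecutors` and the `raydp-executor-` actor name prefix come from the diff and the replies above.

```python
from typing import Dict


class AppMasterSketch:
    """Hypothetical stand-in for the app master's restart bookkeeping."""

    ACTOR_NAME_PREFIX = "raydp-executor-"  # prefix seen in the diff above

    def __init__(self) -> None:
        self._next_id = 0
        # Assumed layout: ID handed out on restart -> ID the actor was created with.
        self.restarted_executors: Dict[str, str] = {}

    def next_executor_id(self) -> str:
        executor_id = str(self._next_id)
        self._next_id += 1
        return executor_id

    def original_id(self, executor_id: str) -> str:
        # An executor that never restarted is its own original.
        return self.restarted_executors.get(executor_id, executor_id)

    def on_restart(self, current_id: str) -> str:
        # The executor gets a fresh ID, but we remember which actor it belongs to.
        new_id = self.next_executor_id()
        self.restarted_executors[new_id] = self.original_id(current_id)
        return new_id

    def actor_name(self, executor_id: str) -> str:
        # The actor name is always derived from the original ID.
        return self.ACTOR_NAME_PREFIX + self.original_id(executor_id)


master = AppMasterSketch()
first = master.next_executor_id()
second = master.on_restart(first)   # first failure
third = master.on_restart(second)   # second failure
```

Under these assumptions, `master.actor_name(third)` still resolves to the name built from the first ID, which is the property the reviewer asked to have covered by a test.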

carsonwang · 2022-09-21T03:29:44Z

core/src/main/scala/org/apache/spark/deploy/raydp/RayAppMaster.scala

+        if (appInfo.remainingUnRegisteredExecutors > 0) {
+          val cores = appInfo.desc.coresPerExecutor.getOrElse(1)
+          val memory = appInfo.desc.memoryPerExecutorMB
+          val newExecutorId = s"${appInfo.getNextExecutorId()}"


Should we move this line into the else {} branch? Otherwise, when we take the if {} code path, we don't need the new executor ID but we still increment the ID counter.
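
A small Python sketch of this concern, with hypothetical names throughout (`IdAllocator`, `schedule_eager`, `schedule_lazy`, and the `reuse_existing` flag are not from the PR, and the actual branch condition is not shown in the hunk): drawing an ID before branching consumes a counter value on every call, while drawing it only in the branch that uses it keeps the sequence gap-free.

```python
import itertools
from typing import Optional


class IdAllocator:
    """Hypothetical monotonically increasing ID source."""

    def __init__(self) -> None:
        self._counter = itertools.count()

    def next_id(self) -> str:
        return str(next(self._counter))


def schedule_eager(alloc: IdAllocator, reuse_existing: bool) -> Optional[str]:
    new_id = alloc.next_id()  # consumed even when the ID is not needed
    return None if reuse_existing else new_id


def schedule_lazy(alloc: IdAllocator, reuse_existing: bool) -> Optional[str]:
    if reuse_existing:
        return None  # no ID drawn on this path
    return alloc.next_id()


eager, lazy = IdAllocator(), IdAllocator()
calls = [True, False, True, False]
eager_ids = [schedule_eager(eager, c) for c in calls]
lazy_ids = [schedule_lazy(lazy, c) for c in calls]
```

Here `eager_ids` skips values whenever the reuse path is taken, whereas `lazy_ids` stays contiguous.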

carsonwang · 2022-09-21T03:47:18Z

core/src/main/scala/org/apache/spark/executor/RayCoarseGrainedExecutorBackend.scala

+  def getRDDPartition(rddId: Int,
+                      partitionId: Int,
+                      schemaStr: String,
+                      driverAgentUrl: String): Array[Byte] = {


Can we follow Spark's code style for indentation? Please also check other places in the code. See https://github.com/databricks/scala-style-guide#indent

carsonwang · 2022-09-21T03:59:46Z

python/raydp/spark/ray_cluster_master.py

-from py4j.java_gateway import JavaGateway, GatewayParameters
+from py4j.java_gateway import java_import, JavaGateway, GatewayParameters
+from py4j.clientserver import ClientServer, JavaParameters, PythonParameters
+from pyspark.find_spark_home import _find_spark_home


It seems this is not used. Can you please remove it?

kira-lin added 16 commits June 29, 2022 09:00

init

a6bb8c6

upd

cf9839e

upd

5212c69

upd

72992ac

Merge remote-tracking branch 'upstream/master' into new-fault-tolerant

d978ba4

restart executor

a2626fd

use old ray

753ce3d

shutdown ray in java

48cb6bc

upd

e6b7df6

add maxTaskRetries

79f4f13

add driver agent

aac30ea

cleanup & format

2b278d0

Signed-off-by: Zhi Lin <[email protected]>

Merge remote-tracking branch 'upstream/master' into new-fault-tolerant

18e020e

Signed-off-by: Zhi Lin <[email protected]>

revert ray_cluster_master to ray actor

696d716

Signed-off-by: Zhi Lin <[email protected]>

fix executorId consistency between spark & ray

9114204

make RayAppMaster an actor

enable cache locations to have better data locality

Signed-off-by: Zhi Lin <[email protected]>

make executor actor parallel and thread-safe

b3b2c6b

Signed-off-by: Zhi Lin <[email protected]>

kira-lin changed the title ~~[WIP] Implement fault tolerance by executor actor calls~~ [Fault tolerance] Make executor actor restartable and fetch data from blockManager in actor calls Sep 16, 2022

kira-lin added 8 commits September 16, 2022 12:59

modify driverAgent to use spark RPC

48eb7c6

Signed-off-by: Zhi Lin <[email protected]>

cache address

01b0bbd

Signed-off-by: Zhi Lin <[email protected]>

Add an experimental API

0006ee3

Signed-off-by: Zhi Lin <[email protected]>

add fault_tolerant_mode

95c5c64

Signed-off-by: Zhi Lin <[email protected]>

fix

0da0685

Signed-off-by: Zhi Lin <[email protected]>

fix added test

d7c33a8

Signed-off-by: Zhi Lin <[email protected]>

format

5ccc7b7

Signed-off-by: Zhi Lin <[email protected]>

add readme & provide storage_level option

c1ca1e3

Signed-off-by: Zhi Lin <[email protected]>

carsonwang reviewed Sep 21, 2022

kira-lin and others added 3 commits September 21, 2022 13:50

address comments & pylint

b408c97

Signed-off-by: Zhi Lin <[email protected]>

Merge remote-tracking branch 'upstream/master' into new-fault-tolerant

7244dc7

Signed-off-by: Zhi Lin <[email protected]>

resolve conflict

8c9b1fc

Signed-off-by: Zhi Lin <[email protected]>

kira-lin added 13 commits November 9, 2022 15:48

use ray 2.1.0 jar

6ffc61c

Signed-off-by: Zhi Lin <[email protected]>

update

ca6923a

Signed-off-by: Zhi Lin <[email protected]>

fix style

b040b7f

Signed-off-by: Zhi Lin <[email protected]>

install ray 2.1.0

fd5a5d4

Signed-off-by: Zhi Lin <[email protected]>

Merge remote-tracking branch 'upstream/master' into new-fault-tolerant

88b2f99

Signed-off-by: Zhi Lin <[email protected]>

add getDummyTaskContext to shim

59c51bd

Signed-off-by: Zhi Lin <[email protected]>

Merge remote-tracking branch 'upstream/master' into new-fault-tolerant

090c682

Signed-off-by: Zhi Lin <[email protected]>

work-around tensorflow issue

0ead1e2

Signed-off-by: Zhi Lin <[email protected]>

use arrow 6 for now as ray stable version does not support arrow 7

ac25819

Signed-off-by: Zhi Lin <[email protected]>

Merge remote-tracking branch 'upstream/master' into new-fault-tolerant

75c9268

Signed-off-by: Zhi Lin <[email protected]>

fix external shuffle service

0a07a05

Signed-off-by: Zhi Lin <[email protected]>

fix

5241ff9

Signed-off-by: Zhi Lin <[email protected]>

Merge remote-tracking branch 'upstream/master' into new-fault-tolerant

544e2e1

kira-lin merged commit 1315d93 into oap-project:master Jan 3, 2023
