
[SPARK-35182][K8S] Support driver-owned on-demand PVC #32288

Closed
dongjoon-hyun wants to merge 1 commit into apache:master from dongjoon-hyun:SPARK-35182


Conversation

@dongjoon-hyun (Member) commented Apr 22, 2021

What changes were proposed in this pull request?

This PR aims to support driver-owned on-demand PVCs (Persistent Volume Claims): dynamically created PVCs will have their `ownerReference` set to the driver pod instead of the executor pod.

Why are the changes needed?

This allows the K8s backend scheduler to reuse these PVCs later: because a PVC is no longer garbage-collected together with its executor pod, it can outlive executor churn.

BEFORE

```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc-exec-1
```

AFTER

```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc
```

Does this PR introduce any user-facing change?

No. (The default is false)
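For reference, a minimal sketch of the flag guarding this behavior. The property name `spark.kubernetes.driver.ownPersistentVolumeClaim` is inferred from the `KUBERNETES_DRIVER_OWN_PVC` constant discussed below, so verify it against the docs of your Spark release:

```
import org.apache.spark.internal.config.ConfigBuilder

// Sketch of the config entry (property name and doc text are assumptions;
// the default of false matches the "no user-facing change" note above).
val KUBERNETES_DRIVER_OWN_PVC =
  ConfigBuilder("spark.kubernetes.driver.ownPersistentVolumeClaim")
    .doc("If true, the driver pod, not the executor pod, becomes the owner " +
      "of on-demand persistent volume claims.")
    .version("3.2.0")
    .booleanConf
    .createWithDefault(false)
```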

How was this patch tested?

Manually checked the above and passed the K8s IT.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 16 minutes, 40 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```


@dongjoon-hyun marked this pull request as ready for review April 22, 2021 21:02
@SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42356/

@SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42356/

@SparkQA commented Apr 22, 2021

Test build #137828 has finished for PR 32288 at commit b350f25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author)

Could you review this please, @viirya?

@viirya (Member) commented Apr 22, 2021

> This allows the K8s backend scheduler to reuse these PVCs later.

So based on my understanding, once an executor is removed, the PVC can still be there and be reused by the driver when another executor is added later?

Comment on lines +137 to +142
```
Utils.tryLogNonFatalError {
  kubernetesClient
    .persistentVolumeClaims()
    .withLabel(SPARK_APP_ID_LABEL, applicationId())
    .delete()
}
```
Member:

So we didn't delete PVCs before?

Member Author:

Previously, the lifecycle was tied to the executor pod. Now the lifecycle is tied to the driver pod, so the PVC will be deleted when the driver pod dies.
This code supports early deletion at application termination. It is the same pattern as spark.kubernetes.driver.service.deleteOnTermination.
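As an aside, a hedged sketch of where this cleanup plausibly lives. The `stop()` placement in the K8s scheduler backend class is an assumption from the diff context; `kubernetesClient` and `SPARK_APP_ID_LABEL` are members of that assumed enclosing class:

```
import org.apache.spark.util.Utils

// Sketch (placement assumed): explicitly delete every PVC labeled with this
// app's ID when the application stops, instead of waiting for K8s to
// garbage-collect them together with the driver pod.
override def stop(): Unit = {
  Utils.tryLogNonFatalError {
    kubernetesClient
      .persistentVolumeClaims()
      .withLabel(SPARK_APP_ID_LABEL, applicationId())
      .delete()
  }
  super.stop()
}
```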

Contributor:

@dongjoon-hyun shouldn't there be a separate property, spark.kubernetes.driver.pvc.deleteOnTermination, that controls whether this is called?

Comment on lines +342 to +344
```
if (conf.get(KUBERNETES_DRIVER_OWN_PVC) && driverPod.nonEmpty) {
  addOwnerReference(driverPod.get, Seq(resource))
}
```
Member:

This overrides the owner reference added at L338, addOwnerReference(createdExecutorPod, resources), right?

Member Author:

Yes, correct!

@viirya (Member) left a comment

I only have a few questions. Otherwise LGTM.

@SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42358/

@SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42358/

@dongjoon-hyun (Member, Author)

Thank you, @viirya. Merged to master for Apache Spark 3.2.0.

@dongjoon-hyun deleted the SPARK-35182 branch April 23, 2021 00:03
```
@@ -85,6 +85,7 @@ private[spark] class MountVolumesFeatureStep(conf: KubernetesConf)
         .withApiVersion("v1")
         .withNewMetadata()
         .withName(claimName)
+        .addToLabels(SPARK_APP_ID_LABEL, conf.sparkConf.getAppId)
```
Contributor:

I am trying to understand this: will sparkConf.getAppId be available at this point?

Member Author:

Yes, the Spark driver pod is already launched in K8s and the driver is building executor pod specs here, so the app ID is already set.

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021

Closes apache#32288 from dongjoon-hyun/SPARK-35182.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>