
[SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods output #31962

Closed

Conversation

attilapiros
Contributor

@attilapiros attilapiros commented Mar 25, 2021

What changes were proposed in this pull request?

Extending "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with kubectl describe pods output for the failed test.

Why are the changes needed?

PR builds frequently fail, as the k8s integration tests are currently very flaky in the Amplab Jenkins environment.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Locally, by temporarily making one of the tests fail. The output is:

21/03/25 16:55:16.722 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: 

===== EXTRA LOGS FOR THE FAILED TEST

21/03/25 16:55:17.167 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver DESCRIBE POD
Name:         spark-test-app-a2b03971b7c049e8a2629f6a3198842b
Namespace:    35bdb17e308743afaec17538f89a7c3e
Priority:     0
Node:         minikube/192.168.64.119
Start Time:   Thu, 25 Mar 2021 16:52:10 +0100
Labels:       spark-app-locator=75f695685ae44314a99ec13bb39332bc
              spark-app-selector=spark-150230742d364a77927a08eed0222065
              spark-role=driver
Annotations:  <none>
Status:       Succeeded
IP:           172.17.0.4
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://d6d27b0551060d9b094f12d1e232dfb5ae78ce38559680c7126c548996da4d95
    Image:         docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5
    Image ID:      docker://sha256:3fc556c73a0d5187b5a14dbdc2f69ef292e60b544b4b4d3715f6749417c20918
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Mar 2021 16:52:11 +0100
      Finished:     Thu, 25 Mar 2021 16:52:20 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  1408Mi
    Requests:
      cpu:     1
      memory:  1408Mi
    Environment:
      SPARK_USER:                 attilazsoltpiros
      SPARK_APPLICATION_ID:       spark-150230742d364a77927a08eed0222065
      SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
      SPARK_LOCAL_DIRS:           /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa
      SPARK_CONF_DIR:             /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume-driver (rw)
      /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nmfwl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  spark-conf-volume-driver:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-drv-c60832786a15ffbe-conf-map
    Optional:  false
  default-token-nmfwl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nmfwl
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  3m7s  default-scheduler  Successfully assigned 35bdb17e308743afaec17538f89a7c3e/spark-test-app-a2b03971b7c049e8a2629f6a3198842b to minikube
  Normal  Pulled     3m7s  kubelet, minikube  Container image "docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5" already present on machine
  Normal  Created    3m7s  kubelet, minikube  Created container spark-kubernetes-driver
  Normal  Started    3m6s  kubelet, minikube  Started container spark-kubernetes-driver
21/03/25 16:55:17.168 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver DESCRIBE POD

@SparkQA

SparkQA commented Mar 25, 2021

Test build #136529 has finished for PR 31962 at commit a92cca8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


If it works and helps with debugging the failure, by all means get this in.

@SparkQA

SparkQA commented Mar 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41113/

@SparkQA

SparkQA commented Mar 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41113/

@attilapiros
Contributor Author

This is a different error:

- Start pod creation from template
- PVs with local storage *** FAILED ***
  io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://192.168.39.195:8443/api/v1/persistentvolumes. Message: object is being deleted: persistentvolumes "test-local-pv" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=persistentvolumes, name=test-local-pv, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=object is being deleted: persistentvolumes "test-local-pv" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:570)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:509)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:474)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:435)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:250)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:881)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:341)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
  at org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.setupLocalStorage(PVTestsSuite.scala:87)
  at org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.$anonfun$$init$$1(PVTestsSuite.scala:137)
  ...
- Launcher client dependencies

@srowen @shaneknapp do you know how Minikube runs are isolated on Jenkins?
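
For reference, the 409 AlreadyExists above means the test-local-pv from an earlier run was still terminating when PVTestsSuite tried to re-create it. A defensive-cleanup sketch with the fabric8 client (illustrative only, not part of this PR; the PV name, helper, and polling interval are assumptions) could look like:

import io.fabric8.kubernetes.client.KubernetesClient

object PvCleanup {
  // Delete a leftover PV from a previous run and wait until it is really gone,
  // so a subsequent create() does not hit a 409 AlreadyExists.
  def deleteIfExists(client: KubernetesClient, pvName: String): Unit = {
    val pv = client.persistentVolumes().withName(pvName)
    if (pv.get() != null) {
      pv.delete()
      // The PV may linger in a Terminating state for a while; poll until it disappears.
      // (No timeout here, for brevity.)
      while (pv.get() != null) {
        Thread.sleep(1000)
      }
    }
  }
}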

@srowen
Member

srowen commented Mar 25, 2021

(No idea about any of that here, sorry)

@attilapiros
Contributor Author

No problem, let's hope the next error will be the one we are looking for.

@attilapiros
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Mar 25, 2021

Test build #136536 has finished for PR 31962 at commit a92cca8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41120/

@SparkQA

SparkQA commented Mar 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41120/

@attilapiros
Contributor Author

===== EXTRA LOGS FOR THE FAILED TEST

21/03/25 13:08:25.283 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver DESCRIBE POD
Name:               spark-test-app-269a3ebb336a45c0b8485bcd9757bb01
Namespace:          048b7ef217b2443aa9ce20c72906872a
Priority:           0
PriorityClassName:  <none>
Node:               minikube/192.168.39.147
Start Time:         Thu, 25 Mar 2021 13:05:23 -0700
Labels:             spark-app-locator=5dad6c34644f4dd2864665239ad40d17
                    spark-app-selector=spark-a78c13817a9641b59801132dfde88a11
                    spark-role=driver
Annotations:        <none>
Status:             Pending
IP:                 172.17.0.4
Containers:
  spark-kubernetes-driver:
    Container ID:  
    Image:         docker.io/kubespark/spark-py:3.2.0-SNAPSHOT_14049b88-ef56-4044-b1b7-fcc98803fe36
    Image ID:      
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.deploy.PythonRunner
      local:///opt/spark/examples/src/main/python/pi.py
      5
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  1433Mi
    Requests:
      cpu:     1
      memory:  1433Mi
    Environment:
      SPARK_USER:                 jenkins
      SPARK_APPLICATION_ID:       spark-a78c13817a9641b59801132dfde88a11
      SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
      SPARK_LOCAL_DIRS:           /var/data/spark-ec6ef64c-722c-4568-b02e-4077498109c5
      SPARK_CONF_DIR:             /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume-driver (rw)
      /var/data/spark-ec6ef64c-722c-4568-b02e-4077498109c5 from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m6lgm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  spark-local-dir-1:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  spark-conf-volume-driver:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-drv-367160786afdd4ab-conf-map
    Optional:  false
  default-token-m6lgm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m6lgm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    3m2s                 default-scheduler  Successfully assigned 048b7ef217b2443aa9ce20c72906872a/spark-test-app-269a3ebb336a45c0b8485bcd9757bb01 to minikube
  Warning  FailedMount  3m1s                 kubelet, minikube  MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-367160786afdd4ab-conf-map" not found
  Normal   Pulling      86s (x4 over 3m)     kubelet, minikube  pulling image "docker.io/kubespark/spark-py:3.2.0-SNAPSHOT_14049b88-ef56-4044-b1b7-fcc98803fe36"
  Warning  Failed       85s (x4 over 2m59s)  kubelet, minikube  Failed to pull image "docker.io/kubespark/spark-py:3.2.0-SNAPSHOT_14049b88-ef56-4044-b1b7-fcc98803fe36": rpc error: code = Unknown desc = Error response from daemon: pull access denied for kubespark/spark-py, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning  Failed       85s (x4 over 2m59s)  kubelet, minikube  Error: ErrImagePull
  Normal   BackOff      74s (x6 over 2m59s)  kubelet, minikube  Back-off pulling image "docker.io/kubespark/spark-py:3.2.0-SNAPSHOT_14049b88-ef56-4044-b1b7-fcc98803fe36"
  Warning  Failed       74s (x6 over 2m59s)  kubelet, minikube  Error: ImagePullBackOff
21/03/25 13:08:25.284 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver DESCRIBE POD

@attilapiros
Contributor Author

@srowen looking at the events, this must be a clue: repository does not exist or may require 'docker login':

  Warning  Failed       85s (x4 over 2m59s)  kubelet, minikube  Failed to pull image "docker.io/kubespark/spark-py:3.2.0-SNAPSHOT_14049b88-ef56-4044-b1b7-fcc98803fe36": rpc error: code = Unknown desc = Error response from daemon: pull access denied for kubespark/spark-py, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

@attilapiros
Contributor Author

attilapiros commented Mar 25, 2021

However, I think the image should not be pulled from docker.io at all but should already be available in Minikube's own local repository.
It would be good to look around:

$ eval $(minikube docker-env)   # point the local docker CLI at Minikube's Docker daemon
$ docker images                 # list the images visible to Minikube

@shaneknapp do you have any idea?

Moreover, this warning is also interesting:
Warning FailedMount 3m1s kubelet, minikube MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-367160786afdd4ab-conf-map" not found.

But I have to go now (it is 22:15 here); still, I hope these events help a bit.

@attilapiros
Contributor Author

cc @dongjoon-hyun, @holdenk

@attilapiros attilapiros changed the title [WIP][SPARK-34869][K8S][TEST] Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods output Mar 25, 2021
@dongjoon-hyun
Member

dongjoon-hyun commented Mar 26, 2021

Thank you for pinging me, @attilapiros. I agree with your analysis of the AS-IS Jenkins failure. Apparently, Amplab Jenkins still seems to have a setup issue. FYI, I have a personal downstream Jenkins machine dedicated to running the K8s integration tests for all Apache branches (master/3.1/3.0/2.4). I usually keep them up-to-date; currently Minikube 1.18.1 and K8s 1.20.2. They have not failed once in the last 7 days on any branch.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thanks. +1 for the idea.

@SparkQA

SparkQA commented Mar 26, 2021

Test build #136552 has finished for PR 31962 at commit 5ca3309.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

Thanks @dongjoon-hyun. I also thought about checking Minikube's bug/issue database for this v1.7.3 version, but as you mentioned this is very likely a Minikube bug, and since the migration to Minikube 1.18.1 is already planned via SPARK-34738, I think we have to wait until that is finished.

@attilapiros
Contributor Author

I quickly went through the issues for this version: https://github.com/kubernetes/minikube/issues?page=3&q=is%3Aissue+v1.7.3.
Although I haven't found a similar issue, I still think this is very likely a Minikube bug.

@SparkQA

SparkQA commented Mar 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41136/

@SparkQA

SparkQA commented Mar 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41136/

@attilapiros
Contributor Author

Error processing tar file(exit status 1): write /opt/spark/jars/kubernetes-model-batch-4.13.2.jar: no space left on device

@attilapiros
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Mar 27, 2021

Test build #136588 has finished for PR 31962 at commit 5ca3309.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41172/

@SparkQA

SparkQA commented Mar 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41172/

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for the update, @attilapiros.
I've observed that the Amplab Jenkins has hit out-of-disk issues this year. There is no workaround in that case. This PR itself is meaningful.

@dongjoon-hyun
Member

Merged to master.

@attilapiros
Contributor Author

Merged to master.

Thanks @dongjoon-hyun !
