Status is missing from job details #1314

Closed
Jeffwan opened this issue Jul 30, 2021 · 4 comments

@Jeffwan
Member

Jeffwan commented Jul 30, 2021

Branch: all-in-one-operator, part of #1299

The status is missing from the job details shown below. Using the following two declarations together might cause a conflict; this needs to be double-checked.

Status common.JobStatus `json:"status,omitempty"`

and

//+kubebuilder:subresource:status
Name:         dist-mnist-for-e2e-test
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-07-30T00:39:33Z
  Self Link:         /apis/kubeflow.org/v1/namespaces/default/tfjobs/dist-mnist-for-e2e-test
  UID:               6103ab9e-026d-4a9f-bfd5-ab86d875f327
Spec:
  Success Policy:
  Tf Replica Specs:
    PS:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  kubeflow/tf-dist-mnist-test:1.0
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
              Protocol:        TCP
            Resources:
    Worker:
      Replicas:        4
      Restart Policy:  Never
      Template:
        Metadata:
        Spec:
          Containers:
            Image:  kubeflow/tf-dist-mnist-test:1.0
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
              Protocol:        TCP
            Resources:
Events:
  Type    Reason                   Age    From         Message
  ----    ------                   ----   ----         -------
  Normal  SuccessfulCreatePod      4m19s  tf-operator  Created pod: dist-mnist-for-e2e-test-ps-0
  Normal  SuccessfulCreatePod      4m19s  tf-operator  Created pod: dist-mnist-for-e2e-test-ps-1
  Normal  SuccessfulCreateService  4m19s  tf-operator  Created service: dist-mnist-for-e2e-test-ps-0
  Normal  SuccessfulCreateService  4m19s  tf-operator  Created service: dist-mnist-for-e2e-test-ps-1
  Normal  SuccessfulCreatePod      4m19s  tf-operator  Created pod: dist-mnist-for-e2e-test-worker-0
  Normal  SuccessfulCreatePod      4m19s  tf-operator  Created pod: dist-mnist-for-e2e-test-worker-1
  Normal  SuccessfulCreatePod      4m19s  tf-operator  Created pod: dist-mnist-for-e2e-test-worker-2
  Normal  SuccessfulCreatePod      4m19s  tf-operator  Created pod: dist-mnist-for-e2e-test-worker-3
  Normal  SuccessfulCreateService  4m18s  tf-operator  Created service: dist-mnist-for-e2e-test-worker-0
  Normal  SuccessfulCreateService  4m18s  tf-operator  Created service: dist-mnist-for-e2e-test-worker-1
  Normal  SuccessfulCreateService  4m18s  tf-operator  Created service: dist-mnist-for-e2e-test-worker-2
  Normal  SuccessfulCreateService  4m18s  tf-operator  Created service: dist-mnist-for-e2e-test-worker-3

/kind bug
/assign

@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

This issue actually blocks the CI jobs (#1315).

This is a high-priority problem that we have to fix.

Update: CI on the all-in-one-operator branch should be testing against the old version, so as I understand it, this issue is not blocking CI.

@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

I figured out the problem.

  1. // +kubebuilder:subresource:status works with common.JobStatus, so the two do not conflict.
  2. Here, we still use the previous approach of updating the entire Job object. However, /status is a subresource, and the reconciler's client provides a built-in way to update the status: client.Status().Update(context.Background(), &obj)

https://github.com/kubeflow/tf-operator/blob/b8d5d3cd6e824865de196388c2c2a4e852b709e7/pkg/controller.v1/tensorflow/tfjob_controller.go#L480

It works fine after the update.

Status:
  Completion Time:  2021-07-31T05:17:27Z
  Conditions:
    Last Transition Time:  2021-07-31T05:17:04Z
    Last Update Time:      2021-07-31T05:17:04Z
    Message:               xgboostJob xgboost-dist-iris-test-train is created.
    Reason:                XGBoostJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-07-31T05:17:04Z
    Last Update Time:      2021-07-31T05:17:04Z
    Message:               XGBoostJob xgboost-dist-iris-test-train is running.
    Reason:                XGBoostJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2021-07-31T05:17:27Z
    Last Update Time:      2021-07-31T05:17:27Z
    Message:               XGBoostJob xgboost-dist-iris-test-train is successfully completed.
    Reason:                XGBoostJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Master:
      Succeeded:  1
    Worker:
      Succeeded:  2
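
The difference between the two update paths described above can be illustrated with a small toy model. This is only a sketch, not the operator's code and not the controller-runtime API: `Job`, `fakeServer`, `Update`, and `UpdateStatus` are hypothetical names invented for illustration. It assumes the documented Kubernetes behavior that, once the status subresource is enabled for a custom resource, writes to the main resource ignore changes to the status stanza, while writes to /status persist only the status.

```go
package main

import "fmt"

// Job is a hypothetical, heavily simplified stand-in for a custom
// resource such as TFJob. It is not the real API type.
type Job struct {
	Name   string
	Spec   string // simplified spec
	Status string // simplified status, e.g. "Succeeded"
}

// fakeServer is a toy model (an assumption for illustration) of how an
// API server stores a resource whose CRD enables the status subresource.
type fakeServer struct {
	stored            Job
	statusSubresource bool
}

// Update models a write to the main resource endpoint. With the status
// subresource enabled, any change to Status is silently dropped.
func (s *fakeServer) Update(obj Job) {
	if s.statusSubresource {
		obj.Status = s.stored.Status
	}
	s.stored = obj
}

// UpdateStatus models a write to the /status endpoint, which is roughly
// what client.Status().Update(ctx, &obj) targets in a real controller.
// Only the Status field is persisted; everything else is ignored.
func (s *fakeServer) UpdateStatus(obj Job) {
	s.stored.Status = obj.Status
}

func main() {
	srv := &fakeServer{
		stored:            Job{Name: "demo", Spec: "v1"},
		statusSubresource: true,
	}

	job := srv.stored
	job.Status = "Succeeded"

	// Updating the whole object: the status change is not stored.
	srv.Update(job)
	fmt.Printf("after Update:       status=%q\n", srv.stored.Status)

	// Updating through the status path: the status is stored.
	srv.UpdateStatus(job)
	fmt.Printf("after UpdateStatus: status=%q\n", srv.stored.Status)
}
```

In this model the whole-object update leaves the stored status empty, which matches the symptom in the original report, where the job details show events but no Status section.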

@Jeffwan
Member Author

Jeffwan commented Aug 2, 2021

This issue has been fixed.

/close

@Jeffwan Jeffwan closed this as completed Aug 2, 2021
@google-oss-robot

@Jeffwan: Closing this issue.

