Phase is wrong unexpected TfJob phase: Done #110

Closed
jlewi opened this issue Nov 1, 2017 · 1 comment

Comments

jlewi commented Nov 1, 2017

In the job below, phase is reported as Failed but state is reported as Succeeded. This is a bug.

apiVersion: mlkube.io/v1beta1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-10-31T22:06:13Z
  generation: 0
  name: cifar10-171031-220613
  namespace: default
  resourceVersion: "58104"
  selfLink: /apis/mlkube.io/v1beta1/namespaces/default/tfjobs/cifar10-171031-220613
  uid: b5c82c47-be87-11e7-823a-42010a8e007e
spec:
  RuntimeId: zcuu
  replicaSpecs:
  - IsDefaultPS: false
    replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /tensorflow_models/tutorials/image/cifar10_estimator/cifar10_main.py
          - --data-dir=gs://cloud-ml-dev_jlewi/cifar10/data
          - --job-dir=gs://cloud-ml-dev_jlewi/cifar10/jobs/cifar10-171031-220613
          - --train-steps=100000
          - --log-device-placement
          - --num-gpus=4
          image: gcr.io/cloud-ml-dev/tf-models-gpu:591ca2e-dirty-b75d293
          name: tensorflow
          resources:
            limits:
              nvidia.com/gpu: "4"
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorboard:
    logDir: gs://cloud-ml-dev_jlewi/cifar10/jobs/cifar10-171031-220613
    serviceType: ""
    volumeMounts: null
    volumes: null
  tfImage: gcr.io/cloud-ml-dev/tf-models-cpu:591ca2e-dirty-b75d293
status:
  conditions: null
  controlPaused: false
  phase: Failed
  reason: 'unexpected TfJob phase: Done'
  replicaStatuses:
  - ReplicasStates:
      Succeeded: 1
    state: Succeeded
    tf_replica_type: MASTER
  state: Succeeded
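
The contradiction in the status above can be flagged mechanically. The following is only a minimal sketch, not the operator's actual logic: `TfJobStatus` and `find_phase_state_mismatch` are hypothetical names made up for illustration, and the rule that a Succeeded state must never be paired with a Failed phase is an assumption drawn from this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TfJobStatus:
    """Hypothetical, minimal view of the status block (not the real API type)."""
    phase: str
    state: str
    reason: str = ""


def find_phase_state_mismatch(status: TfJobStatus) -> Optional[str]:
    """Return a description of the mismatch, or None if none is detected."""
    # Assumption: a job whose state is Succeeded should never report phase Failed.
    if status.state == "Succeeded" and status.phase == "Failed":
        return (
            f"phase={status.phase!r} contradicts state={status.state!r} "
            f"(reason: {status.reason!r})"
        )
    return None


# Values copied from the status block in this issue.
reported = TfJobStatus(
    phase="Failed",
    state="Succeeded",
    reason="unexpected TfJob phase: Done",
)
print(find_phase_state_mismatch(reported))
```

For the status in this issue the function returns a non-empty message. A status where phase and state agree returns `None`.
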
@gaocegege
Member

We no longer use phase, so I will close this issue soon 😄
