Kubeflow 1.0.2 Katib NAS default example fails due to overwritten container #1365

Closed
philwinder opened this issue Oct 22, 2020 · 6 comments

@philwinder

/kind bug

What steps did you take and what happened:

  • Using a vanilla Kubeflow 1.0.2 install.
  • Click on NAS->Submit->Params
  • Run the default NAS example in a namespace.

The trial pods fail due to a change in the underlying mnist dockerfile.

k logs nasrl-example-7fs7n5db-g28rq metrics-logger-and-collector
I1022 10:08:22.860482      48 main.go:85] Trial Name: nasrl-example-7fs7n5db
I1022 10:08:26.253857      48 main.go:79] usage: mnist.py [-h] [--num-classes NUM_CLASSES] [--num-examples NUM_EXAMPLES]
I1022 10:08:26.253903      48 main.go:79]                 [--add_stn] [--image_shape IMAGE_SHAPE] [--network NETWORK]
I1022 10:08:26.253923      48 main.go:79]                 [--num-layers NUM_LAYERS] [--gpus GPUS] [--kv-store KV_STORE]
I1022 10:08:26.253930      48 main.go:79]                 [--num-epochs NUM_EPOCHS] [--lr LR] [--lr-factor LR_FACTOR]
I1022 10:08:26.253944      48 main.go:79]                 [--lr-step-epochs LR_STEP_EPOCHS] [--initializer INITIALIZER]
I1022 10:08:26.253951      48 main.go:79]                 [--optimizer OPTIMIZER] [--mom MOM] [--wd WD]
I1022 10:08:26.253963      48 main.go:79]                 [--batch-size BATCH_SIZE] [--disp-batches DISP_BATCHES]
I1022 10:08:26.253969      48 main.go:79]                 [--model-prefix MODEL_PREFIX] [--save-period SAVE_PERIOD]
I1022 10:08:26.253983      48 main.go:79]                 [--monitor MONITOR] [--load-epoch LOAD_EPOCH] [--top-k TOP_K]
I1022 10:08:26.253991      48 main.go:79]                 [--loss LOSS] [--test-io TEST_IO] [--dtype DTYPE]
I1022 10:08:26.254005      48 main.go:79]                 [--gc-type GC_TYPE] [--gc-threshold GC_THRESHOLD]
I1022 10:08:26.254017      48 main.go:79]                 [--macrobatch-size MACROBATCH_SIZE]
I1022 10:08:26.254031      48 main.go:79]                 [--warmup-epochs WARMUP_EPOCHS]
I1022 10:08:26.254038      48 main.go:79]                 [--warmup-strategy WARMUP_STRATEGY]
I1022 10:08:26.254051      48 main.go:79]                 [--profile-worker-suffix PROFILE_WORKER_SUFFIX]
I1022 10:08:26.254058      48 main.go:79]                 [--profile-server-suffix PROFILE_SERVER_SUFFIX]
I1022 10:08:26.254083      48 main.go:79]                 [--use-imagenet-data-augmentation USE_IMAGENET_DATA_AUGMENTATION]
I1022 10:08:26.254091      48 main.go:79] mnist.py: error: unrecognized arguments: architecture=[[100], [63, 1], [38, 0, 0], [7, 0, 1, 1], [56, 1, 0, 1, 0], [96, 1, 1, 1, 1, 0], [14, 1, 1, 1, 1, 0, 1], [6, 0, 1, 1, 1, 1, 1, 0]] nn_config={num_layers: 8, input_sizes: [32, 32, 3], output_sizes: [10], embedding: {100: {opt_id: 100, opt_type: depthwise_convolution, opt_params: {filter_size: 7, stride: 2, depth_multiplier: 1}}, 63: {opt_id: 63, opt_type: separable_convolution, opt_params: {filter_size: 5, num_filter: 96, stride: 1, depth_multiplier: 2}}, 38: {opt_id: 38, opt_type: separable_convolution, opt_params: {filter_size: 3, num_filter: 64, stride: 1, depth_multiplier: 1}}, 7: {opt_id: 7, opt_type: convolution, opt_params: {filter_size: 3, num_filter: 96, stride: 2}}, 56: {opt_id: 56, opt_type: separable_convolution, opt_params: {filter_size: 5, num_filter: 48, stride: 2, depth_multiplier: 1}}, 96: {opt_id: 96, opt_type: depthwise_convolution, opt_params: {filter_size: 5, stride: 2, depth_multiplier: 1}}, 14: {opt_id: 14, opt_type: convolution, opt_params: {filter_size: 5, num_filter: 64, stride: 1}}, 6: {opt_id: 6, opt_type: convolution, opt_params: {filter_size: 3, num_filter: 96, stride: 1}}}}
F1022 10:08:26.862068      48 main.go:95] Failed to wait for worker container: Process 8 hadn't completed: open /var/log/katib/8.pid: no such file or directory
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc0001e6100, 0xc0002ec000, 0xa0, 0xf5)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb8
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0x129da40, 0xc000000003, 0xc0002c8000, 0x12378d6, 0x7, 0x5f, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2d0
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0x129da40, 0x3, 0xc78f77, 0x27, 0xc0000d1ed8, 0x1, 0x1)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x14b
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
	/go/src/github.com/kubeflow/katib/cmd/metricscollector/v1alpha3/file-metricscollector/main.go:95 +0x279

I looked for an older version of the container at https://hub.docker.com/r/kubeflowkatib/mxnet-mnist/tags, but only one tag exists, and it was updated 3 months ago. That update is likely why the example fails now.
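
For readers unfamiliar with the `unrecognized arguments` line in the log above: this is the message Python's `argparse` produces when a script receives tokens its parser does not define. The sketch below is a minimal reproduction, not the real `mnist.py`. The two-flag parser is a hypothetical stand-in, and the argument values are shortened versions of the ones in the log.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the training script's parser. It defines two
    # of the flags shown in the usage text and nothing NAS-specific.
    parser = argparse.ArgumentParser(prog="mnist.py")
    parser.add_argument("--num-layers", type=int)
    parser.add_argument("--lr", type=float)
    return parser


def split_unrecognized(argv):
    """Return (parsed namespace, list of tokens the parser does not know)."""
    return build_parser().parse_known_args(argv)


# Tokens shaped like the ones the NAS trial passed (shortened for readability).
nas_argv = [
    "architecture=[[100], [63, 1]]",
    "nn_config={num_layers: 8}",
]

known, unrecognized = split_unrecognized(nas_argv)
print(unrecognized)
```

Calling `parse_args` instead of `parse_known_args` on the same list makes `argparse` print the usage text and exit with an `unrecognized arguments` error, which is the shape of the failure in the trial log.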

What did you expect to happen:

The default examples should work out of the box. That's going to be hard to fix now, because the container tag isn't set.

Ideally, in the future, I'd like properly tagged examples that work forever.
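
As a rough illustration of what "properly tagged" could mean in practice, the sketch below flags image references that have no explicit tag or that use `latest`. The helper names and the `example.com/team/trainer:1.0.2` reference are hypothetical, and the parsing is deliberately simplified. It is not a full implementation of the container image reference grammar.

```python
from typing import Optional


def explicit_tag(image: str) -> Optional[str]:
    """Return the tag written in an image reference, or None if there is none."""
    name = image.split("@", 1)[0]            # drop any digest suffix
    last_segment = name.rsplit("/", 1)[-1]   # skip a registry host:port prefix
    if ":" in last_segment:
        return last_segment.split(":", 1)[1]
    return None


def is_pinned(image: str) -> bool:
    """True if the reference names a digest or a tag other than 'latest'."""
    if "@" in image:
        return True
    tag = explicit_tag(image)
    return tag is not None and tag != "latest"


for ref in ["kubeflowkatib/mxnet-mnist", "example.com/team/trainer:1.0.2"]:
    print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```

A check like this could run over the example manifests in CI, so that an untagged training image is caught before a later push to the same name changes the example's behaviour.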

In the meantime, can you suggest the best way of getting a working out-of-the-box NAS example (CPU preferable, for testing), on the 1.0.2 version of Kubeflow?

Thanks,
Phil

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.97


@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@andreyvelich
Member

andreyvelich commented Oct 22, 2020

@philwinder Can you check which Trial template you are using in the UI? I assume you are using the template for the mxnet-mnist HP example. You should use nas-template.

If you want to use NAS out of the box, it's better to use at least Kubeflow 1.1.
It should have stable examples (https://github.com/kubeflow/katib/tree/master/examples/v1alpha3/nas) and an updated UI with pre-defined Trial templates for NAS.

We have an open issue to create tags for the training container images: #1272.

@philwinder
Author

Hi @andreyvelich. Yes, it was the nas-template. The problem is that the container has been overwritten to work with KF 1.1, like you said, which has broken the previous example.

> use at least Kubeflow 1.1.

Yes, that's a valid solution, but this particular cluster is stuck on 1.0.2 for the moment.

Thanks again!

@andreyvelich
Member

> Hi @andreyvelich. Yes, it was the nas-template. The problem is that the container has been overwritten to work with KF 1.1, like you said, which has broken the previous example.

Sorry about that. We have added tags to the training container images in PR #1372 to avoid this problem.

> use at least Kubeflow 1.1.
>
> Yes, that's a valid solution, but this particular cluster is stuck on 1.0.2 for the moment.
>
> Thanks again!

For your information, you can update the Katib version in your Kubeflow cluster without deleting the other Kubeflow components.

  1. Delete all Katib experiments: kubectl delete experiment --all-namespaces --all

  2. Use these manifests: https://github.com/kubeflow/katib/tree/master/manifests/v1beta1 to delete and then deploy the Katib components. The Kubeflow namespace should not be re-created.
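
The two steps above can be sketched as a small dry-run script. This is only an outline under stated assumptions: the local path `katib/manifests/v1beta1` is a hypothetical checkout location, and using `kubectl delete -k` / `kubectl apply -k` on that directory assumes it can be built with kustomize. Check the Katib repository's own deploy instructions before running anything for real.

```python
import subprocess


def katib_upgrade_plan(manifests_dir: str):
    """Build the kubectl commands for the upgrade, in order."""
    return [
        # Step 1 (command taken from the comment above).
        ["kubectl", "delete", "experiment", "--all-namespaces", "--all"],
        # Step 2 (assumption: the manifests directory is a kustomize root).
        # Make sure this does not delete or re-create the kubeflow namespace.
        ["kubectl", "delete", "-k", manifests_dir],
        ["kubectl", "apply", "-k", manifests_dir],
    ]


def run_plan(plan, dry_run: bool = True) -> None:
    """Print each command and execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical path to a local checkout of the manifests.
plan = katib_upgrade_plan("katib/manifests/v1beta1")
run_plan(plan)  # dry run: prints the commands without touching the cluster
```

Keeping the dry run as the default makes it possible to review the exact commands before they are executed against a shared cluster.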

@andreyvelich
Member

@philwinder I'm closing this issue. Feel free to reopen it if you have any other questions.
