
[enh]: validate for bayesian optimization algorithm settings #1600

Merged

Conversation

@anencore94 (Member) commented Jul 31, 2021

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
fixes part of #1126

Checklist:

  • Docs included if any changes are user facing

How I Tested

  • Verified with unit-test code
  • Also, I've checked in my Katib cluster with the new skopt image using this YAML:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example-3
  namespace: kubeflow
spec:
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: "unknown"
        value: "10"
  maxFailedTrialCount: 1
  maxTrialCount: 1
  parallelTrialCount: 1
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.99
    metricStrategies:
    - name: Validation-accuracy
      value: max
    - name: Train-accuracy
      value: max
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "2"
      min: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Number of training model layers
      name: numberLayers
      reference: num-layers
    - description: Training model optimizer (sdg, adam or ftrl)
      name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=256
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
              name: training-container
            restartPolicy: Never
  • And the expected error message was printed (screenshot attached in the PR).

  • However, I'm afraid that even if the Suggestion and Experiment fail, the corresponding deployment/pod stays running. I'm not sure why it happens, but I guess it is a bug that should be handled in another PR.
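The validation exercised by the YAML above can be sketched roughly as follows. The setting names and accepted values here are illustrative stand-ins, not the precise checks from `pkg/suggestion/v1beta1/skopt/service.py`:

```python
# Rough sketch of algorithm-settings validation for the skopt
# (bayesianoptimization) suggestion service. Setting names and accepted
# values are illustrative; see service.py in this PR for the actual checks.

def validate_algorithm_settings(settings):
    """Return (True, "") if all settings are valid, else (False, reason)."""
    for name, value in settings.items():
        if name == "base_estimator":
            if value not in ("GP", "RF", "ET", "GBRT"):
                return False, "base_estimator {} is not supported".format(value)
        elif name in ("n_initial_points", "random_state"):
            # Both must parse as non-negative integers.
            if not (value.lstrip("-").isdigit() and int(value) >= 0):
                return False, "{} should be a non-negative integer, got {}".format(name, value)
        else:
            # Unknown settings (e.g. "unknown" in the YAML above) are rejected.
            return False, "unknown setting {} for algorithm bayesianoptimization".format(name)
    return True, ""


# The two failing cases exercised in this PR:
print(validate_algorithm_settings({"unknown": "10"}))
print(validate_algorithm_settings({"random_state": "-1"}))
```

Returning a reason string lets the webhook surface a concrete error message on the Experiment, which is what the test YAMLs in this thread check for.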

@aws-kf-ci-bot (Contributor)
Hi @anencore94. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@anencore94 anencore94 changed the title [enh]: validate for skopt algorithm settings [enh]: validate for bayesian optimization algorithm settings Jul 31, 2021
@anencore94 (Member Author)

Here is another test case, to be sure, with a different invalid algorithm-setting YAML:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example-4
  namespace: kubeflow
spec:
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: "random_state"
        value: "-1"
  maxFailedTrialCount: 1
  maxTrialCount: 1
  parallelTrialCount: 1
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.99
    metricStrategies:
    - name: Validation-accuracy
      value: max
    - name: Train-accuracy
      value: max
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "2"
      min: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Number of training model layers
      name: numberLayers
      reference: num-layers
    - description: Training model optimizer (sdg, adam or ftrl)
      name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=256
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
              name: training-container
            restartPolicy: Never

(screenshot of the error message attached in the PR)

@gaocegege (Member)

/assign @johnugeorge @andreyvelich

@andreyvelich (Member) left a comment

Thanks a lot for implementing this @anencore94!
I left a few comments.

Review comments (outdated, resolved) on:
  • pkg/suggestion/v1beta1/skopt/service.py
  • test/suggestion/v1beta1/test_skopt_service.py
@andreyvelich (Member)

> However, I'm afraid that even if the Suggestion and Experiment fail, the corresponding deployment/pod stays running. I'm not sure why it happens, but I guess it is a bug that should be handled in another PR.

It's happening because the Experiment's ResumePolicy must be equal to Never or FromVolume for Suggestion resources to be cleaned up after the Experiment is complete.
Also, the Suggestion should not be in Failed status for resources to be cleaned up: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/experiment_controller_util.go#L152-L154.
I think that helps users debug failed Suggestion logs.

@gaocegege @anencore94 @johnugeorge What do you think about this clean-up design ?
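To illustrate the cleanup behavior described above: to have the controller delete the Suggestion deployment once the Experiment completes, the example spec earlier in this thread would change its resumePolicy. A minimal fragment (assuming the same Experiment as above):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example-3
  namespace: kubeflow
spec:
  # Never (or FromVolume) lets the controller clean up the Suggestion
  # deployment/pod after the Experiment is complete; LongRunning keeps it.
  resumePolicy: Never
```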

@andreyvelich (Member)

/ok-to-test

- use staticmethod rather than classmethod
- change convertAlgorithmSpec method name to a snake_case
- use .format() rather than f-string

Signed-off-by: Jaeyeon Kim <[email protected]>
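The three commit notes above can be sketched as a before/after. The class and method shown here are hypothetical stand-ins, not the actual service.py code:

```python
# Hypothetical before/after for the review fixes in the commit message above;
# class and method names are illustrative, not the real service.py code.

class SkoptConverter:
    # Before: a @classmethod named convertAlgorithmSpec returning an f-string.
    # After: a @staticmethod (no cls needed), snake_case name, str.format().
    @staticmethod
    def convert_algorithm_spec(name, value):
        # str.format() instead of an f-string, per the review comment.
        return "{}={}".format(name, value)


print(SkoptConverter.convert_algorithm_spec("random_state", "10"))
```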
@gaocegege (Member)

Also, Suggestion should not be in Failed status to clean-up resources: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/experiment_controller_util.go#L152-L154.
I think that helps users to debug failed Suggestion logs.

SGTM.

@andreyvelich (Member) left a comment

Thank you for this great contribution @anencore94!
/lgtm
cc @gaocegege @johnugeorge

@anencore94 (Member Author)

> I think that helps users to debug failed Suggestion logs.

Yeah, that makes sense. It must be helpful when a user wants to debug; that pod should stay alive. Thanks! @andreyvelich

@gaocegege (Member)

/lgtm

Thanks for your contribution! 🎉 👍 @anencore94

@andreyvelich (Member) left a comment

/approve

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, anencore94

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot google-oss-robot merged commit a57745e into kubeflow:master Aug 3, 2021
@anencore94 anencore94 deleted the enhance/skopt_validation branch August 3, 2021 23:33
6 participants