Reconcile semantics for Suggestion Algorithms #1633

Merged (2 commits) on Aug 24, 2021

Conversation

johnugeorge
Member

Currently, the GetSuggestions call does not follow Kubernetes reconcile semantics. For example, if the suggestion controller cannot update the suggestions returned from a GetSuggestions call (served by the suggestion algorithm service), new suggestions are created again on the next try. This causes some suggestions to be leaked.

In this PR, a new variable is passed in the GetSuggestions call that indicates the total number of suggestions requested to date. If the DB contains trials that are not yet recorded, the service reuses those missed suggestions from the DB and generates only the remaining required number. So GetSuggestions ensures that missed suggestions are reused before any new ones are generated.
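
A minimal sketch of the reuse-before-generate behaviour described above, in plain Python. The names `created_trials`, `request_number` and `total_request_number` come from this PR; the class `ReconcilingSuggester`, the `_generate` helper, the trial dict shape and the slicing approach are hypothetical and only illustrate the idea, they are not Katib's actual implementation.

```python
import itertools


class ReconcilingSuggester:
    """Hypothetical stand-in for a suggestion service (not Katib's class)."""

    def __init__(self):
        # Every trial generated so far, in order (plays the role of the DB).
        self.created_trials = []
        self._counter = itertools.count()

    def _generate(self, n):
        # Placeholder generator; a real algorithm would sample the search space.
        return [{"trial_id": next(self._counter)} for _ in range(n)]

    def get_suggestions(self, request_number, total_request_number):
        # Number of trials the controller has already recorded.
        recorded = total_request_number - request_number
        # Trials generated earlier that the controller never recorded.
        reused = self.created_trials[recorded:recorded + request_number]
        # Generate only what is still missing after reuse.
        new = self._generate(request_number - len(reused))
        self.created_trials.extend(new)
        return reused + new


suggester = ReconcilingSuggester()
first = suggester.get_suggestions(request_number=3, total_request_number=3)
# Suppose the controller failed to record `first` and retries the same request.
retry = suggester.get_suggestions(request_number=3, total_request_number=3)
print(first == retry)                  # True: the retry leaks no new trials
print(len(suggester.created_trials))   # 3
```

Under these assumptions, repeating a request whose result was never recorded returns the same trials, which is the idempotent behaviour a reconcile loop expects.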

Fixes #1534
/hold

@gaocegege
Member

/retest

@johnugeorge
Member Author

/test

@aws-kf-ci-bot
Contributor

@johnugeorge: The /test command needs one or more targets.
The following commands are available to trigger jobs:

  • /test kubeflow-katib-presubmit

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@johnugeorge
Member Author

/test kubeflow-katib-presubmit

@johnugeorge
Member Author

/hold cancel

Comment on lines 116 to 117
logger.Info("Getting suggestions", "endpoint", endpoint, "response", len(responseSuggestion.ParameterAssignments),
"requestNum", requestNum)
Member

Can we make this clearer?

Suggested change
- logger.Info("Getting suggestions", "endpoint", endpoint, "response", len(responseSuggestion.ParameterAssignments),
- 	"requestNum", requestNum)
+ logger.Info("Getting suggestions", "endpoint", endpoint, "number of response parameters", len(responseSuggestion.ParameterAssignments),
+ 	"number of request parameters", requestNum)

  if len(responseSuggestion.ParameterAssignments) != requestNum {
  	err := fmt.Errorf("The response contains unexpected trials")
- 	logger.Error(err, "The response contains unexpected trials", "requestNum", requestNum, "response", responseSuggestion)
+ 	logger.Error(err, "The response contains unexpected trials", "requestNum", requestNum, "response", len(responseSuggestion.ParameterAssignments))
Member

Suggested change
- logger.Error(err, "The response contains unexpected trials", "requestNum", requestNum, "response", len(responseSuggestion.ParameterAssignments))
+ logger.Error(err, "The response contains unexpected trials", "number of request parameters", requestNum, "number of response parameters", len(responseSuggestion.ParameterAssignments))

@@ -6,10 +6,11 @@ package mock

import (
context "context"
reflect "reflect"
Member

We might need to pin a specific mockgen version so that the generated files stay the same.
Do we need these changes in this PR?

Comment on lines +163 to +167
new_actual_requested_no = total_request_number - len(self.created_trials)
prev_generated_no = request_number - new_actual_requested_no
logger.info("In this call, New {} Trials will be generated, {} Trials will be reused from previously generated".format(new_actual_requested_no, prev_generated_no))
Member

In the first call, when request_number = 3, new_actual_requested_no = 0 and prev_generated_no = 3.
Is that correct?

Member Author

It is the other way around.
In the normal case, total_request_number == len(self.created_trials) + request_number, where len(self.created_trials) is the number of previously created trials. So prev_generated_no is 0 in this case.

When there is a difference, it means that some of the suggestions in self.created_trials (the same as in the DB) are not recorded in the K8s Suggestion resource. So prev_generated_no is greater than 0 in this case.
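
To make the arithmetic concrete, here is a small sketch that evaluates the two formulas from the diff for a few inputs. The helper name `split_request` and the `created_count` parameter (standing in for `len(self.created_trials)`) are hypothetical; the two formulas are copied from the lines under review.

```python
def split_request(total_request_number, created_count, request_number):
    # Trials that still have to be generated in this call.
    new_actual_requested_no = total_request_number - created_count
    # Trials generated in an earlier call that can be reused.
    prev_generated_no = request_number - new_actual_requested_no
    return new_actual_requested_no, prev_generated_no


# Normal first call: nothing created yet, 3 requested.
print(split_request(3, 0, 3))  # (3, 0)

# Retry after 3 trials were generated but never recorded in the Suggestion resource.
print(split_request(3, 3, 3))  # (0, 3)

# 3 recorded, 4 in the DB, 3 more requested: reuse 1, generate 2.
print(split_request(6, 4, 3))  # (2, 1)
```

So on a first call with request_number = 3 the result is 3 new trials and 0 reused ones; the (0, 3) split only appears on a retry.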

Comment on lines 166 to 167
if total_request_number != len(self.created_trials) + request_number:
    logger.info("Mismatch in generated trials with k8s suggestions trials")
Member

What does this log mean?

Member Author

See my earlier comment.

@andreyvelich
Member

/hold for the testing

@gaocegege
Member

/lgtm

@gaocegege
Member

/retest

Member

@andreyvelich andreyvelich left a comment

I tested this fix.

/lgtm
/approve
/hold cancel

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [andreyvelich,johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Member

/retest

@andreyvelich
Member

/lgtm

Development

Successfully merging this pull request may close these issues.

Grid Search stuck when parallelTrialCount < maxTrialCount
5 participants