enhance: Enable node assign policy on resource group #36968

Open
wants to merge 1 commit into
base: master

Contributor

@weiliu1031 weiliu1031 commented Oct 17, 2024

issue: #36977
With node_label_filter set on a resource group, a user can label a querynode through the MILVUS_COMPONENT_LABEL environment variable, and the resource group will then prefer to accept nodes that match its node_label_filter.

Querynodes can then be grouped by label, with querynodes that share a label placed in the same resource group.
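
To make the intended behaviour concrete, here is a minimal sketch of label-based assignment. It is not the code in this PR: the `Node` and `ResourceGroup` types and the `GroupByFilter` helper are hypothetical, and the filter is treated as a strict match, whereas the description above only says the group "prefers" matching nodes.

```go
package main

// Node is a hypothetical stand-in for a querynode and its labels.
type Node struct {
	ID     int64
	Labels map[string]string
}

// ResourceGroup is a hypothetical stand-in that carries a node label filter.
// An empty filter accepts every node.
type ResourceGroup struct {
	Name            string
	NodeLabelFilter map[string]string
}

// AcceptNode reports whether every key/value pair in the filter
// is present on the node with the same value.
func (rg ResourceGroup) AcceptNode(n Node) bool {
	for k, want := range rg.NodeLabelFilter {
		if got, ok := n.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// GroupByFilter assigns each node to the first resource group that accepts it.
// Nodes that no group accepts are collected under the empty name.
func GroupByFilter(groups []ResourceGroup, nodes []Node) map[string][]int64 {
	out := make(map[string][]int64)
	for _, n := range nodes {
		name := ""
		for _, rg := range groups {
			if rg.AcceptNode(n) {
				name = rg.Name
				break
			}
		}
		out[name] = append(out[name], n.ID)
	}
	return out
}
```

With two groups filtering on a hypothetical `az` label, nodes labelled for the same zone end up in the same group, and unlabelled nodes stay unassigned.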

@sre-ci-robot sre-ci-robot added the area/dependency Pull requests that update a dependency file label Oct 17, 2024
@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weiliu1031
To complete the pull request process, please assign jiaoew1991 after the PR has been reviewed.
You can assign the PR to them by writing /assign @jiaoew1991 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot added the size/XL Denotes a PR that changes 500-999 lines. label Oct 17, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Oct 17, 2024
Contributor

mergify bot commented Oct 17, 2024

@weiliu1031 Please associate the related issue to the body of your Pull Request. (eg. “issue: #”)

@weiliu1031 weiliu1031 changed the title enhance: Enable node_label_filter on resource group enhance: Enable node assign policy on resource group Oct 18, 2024
Contributor

mergify bot commented Oct 18, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

internal/util/sessionutil/session_util.go (outdated, resolved)
internal/util/sessionutil/session_util.go (outdated, resolved)
internal/querycoordv2/meta/resource_group.go (outdated, resolved)
@weiliu1031 weiliu1031 force-pushed the enable_cross_az branch 2 times, most recently from 78cf4dc to cc5ad70 Compare October 21, 2024 11:17
Contributor

mergify bot commented Oct 21, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031
Contributor Author

/run-cpu-e2e

Contributor

mergify bot commented Oct 22, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

@weiliu1031
Contributor Author

rerun go-sdk

@mergify mergify bot added the ci-passed label Oct 22, 2024
internal/util/sessionutil/session_util.go (outdated, resolved)
internal/querycoordv2/meta/resource_group.go (outdated, resolved)
internal/querycoordv2/meta/resource_manager.go (outdated, resolved)
internal/querycoordv2/meta/resource_manager.go (outdated, resolved)
Contributor

mergify bot commented Oct 23, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Oct 23, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Oct 24, 2024

@weiliu1031 cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Contributor

mergify bot commented Oct 24, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031
Contributor Author

/run-cpu-e2e

@weiliu1031
Contributor Author

/run-cpu-e2e

@weiliu1031
Contributor Author

rerun go-sdk

Contributor

mergify bot commented Oct 31, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Oct 31, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Nov 1, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Nov 1, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Nov 4, 2024

@weiliu1031 cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Contributor

mergify bot commented Nov 4, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031
Contributor Author

rerun cpp-unit-test

@weiliu1031
Contributor Author

/run-cpu-e2e

Contributor

mergify bot commented Nov 4, 2024

@weiliu1031 cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Contributor

mergify bot commented Nov 4, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Nov 5, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Nov 5, 2024

@weiliu1031 cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Contributor

mergify bot commented Nov 5, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031
Contributor Author

/run-cpu-e2e

@weiliu1031
Contributor Author

rerun go-sdk

@weiliu1031
Contributor Author

rerun cpp-unit-test

    ret := make(map[string]string)
    switch role {
    case "querynode":
        supportedLabelPrefix := paramtable.Get().QueryNodeCfg.MilvusServerLabelPrefix.GetValue()
Contributor

It may be better to hard-code MILVUS_SERVER_LABEL_QUERYNODE_ directly.
We could then conveniently add labels such as MILVUS_SERVER_LABEL_STREAMINGNODE_ for the streamingnode, MILVUS_SERVER_LABEL_PROXY for the proxy, and so on.
Environment variables whose names are themselves configurable through another configuration are hard to keep compatible and hard to maintain.
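
For illustration only, a hard-coded prefix could be consumed roughly as below. This is a sketch under assumptions, not the PR's implementation: `labelsFromEnv` is a hypothetical helper, and lower-casing the remainder of the variable name to form the label key is a choice made here for the example.

```go
package main

import "strings"

// queryNodeLabelPrefix is the hard-coded prefix suggested in the comment above.
const queryNodeLabelPrefix = "MILVUS_SERVER_LABEL_QUERYNODE_"

// labelsFromEnv extracts labels from entries shaped like os.Environ() output
// ("KEY=VALUE"). The part of the key after the prefix, lower-cased, becomes
// the label name (an assumption made for this sketch).
func labelsFromEnv(environ []string, prefix string) map[string]string {
	labels := make(map[string]string)
	for _, kv := range environ {
		key, value, ok := strings.Cut(kv, "=")
		if !ok || !strings.HasPrefix(key, prefix) {
			continue
		}
		name := strings.ToLower(strings.TrimPrefix(key, prefix))
		if name == "" {
			continue // the prefix alone carries no label name
		}
		labels[name] = value
	}
	return labels
}
```

In a running process the first argument would be `os.Environ()`; taking the slice as a parameter keeps the helper easy to test, and a second role only needs its own prefix constant.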


    if len(dirtyNodes) > 0 {
        return len(dirtyNodes)
    }
Contributor

Should this be len(dirtyNodes) + min(rg.NodeNum() - int(rg.cfg.Limits.NodeNum), 0)?

        name:    rg.name,
        nodes:   rg.nodes.Clone(),
        cfg:     rg.GetConfigCloned(),
        nodeMgr: rg.nodeMgr,
    }
}

// MeetRequirement return whether resource group meet requirement.
// Return error with reason if not meet requirement.
func (rg *ResourceGroup) MeetRequirement() error {
    // if len(node) is less than requests, new node need to be assigned.
Contributor

Use RedundantNumOfNodes() > 0 || MissingNumOfNodes() > 0 here to stay consistent with the rest of the API.
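
A rough sketch of what that suggestion could look like, on a simplified stand-in type rather than the real ResourceGroup: the `Nodes`, `Requests` and `Limits` fields are assumptions for the example, and the dirty-node handling discussed earlier in this review is deliberately left out.

```go
package main

import "fmt"

// ResourceGroup is a simplified, hypothetical stand-in: it tracks only
// the current node count and the configured requests and limits.
type ResourceGroup struct {
	Name     string
	Nodes    int
	Requests int
	Limits   int
}

// MissingNumOfNodes returns how many nodes are needed to reach Requests.
func (rg ResourceGroup) MissingNumOfNodes() int {
	if d := rg.Requests - rg.Nodes; d > 0 {
		return d
	}
	return 0
}

// RedundantNumOfNodes returns how many nodes exceed Limits.
func (rg ResourceGroup) RedundantNumOfNodes() int {
	if d := rg.Nodes - rg.Limits; d > 0 {
		return d
	}
	return 0
}

// MeetRequirement is expressed purely through the two counters above,
// so the three methods cannot disagree with one another.
func (rg ResourceGroup) MeetRequirement() error {
	if n := rg.MissingNumOfNodes(); n > 0 {
		return fmt.Errorf("resource group %q is missing %d node(s)", rg.Name, n)
	}
	if n := rg.RedundantNumOfNodes(); n > 0 {
		return fmt.Errorf("resource group %q has %d redundant node(s)", rg.Name, n)
	}
	return nil
}
```

Routing the check through the two counters means any later change to how missing or redundant nodes are counted is picked up by MeetRequirement automatically.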

@@ -576,7 +577,7 @@ func (rm *ResourceManager) selectMissingRecoverSourceRG(rg *ResourceGroup) *Reso
     // First, Transfer node from most redundant resource group first. `len(nodes) > limits`
     if redundantRG := rm.findMaxRGWithGivenFilter(
         func(sourceRG *ResourceGroup) bool {
-            return rg.GetName() != sourceRG.GetName() && sourceRG.RedundantNumOfNodes() > 0
+            return rg.GetName() != sourceRG.GetName() && sourceRG.RedundantNumOfNodes() > 0 && rg.AcceptNode(sourceRG.PreferRemovedNode())
Contributor

This logic can fail.
In the current implementation, if rg1's PreferRemovedNode() returns a node that rg2 does not accept, rg2 can never get a node from rg1.
So PreferRemovedNode should take rg1 as an argument and use it to determine the best, stable node to move from rg2 into rg1.

Contributor

I would prefer that selectMissingRecoverSourceRG return the sourceRG and the target node ID together, to avoid unstable node selection.
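
One possible shape for that, sketched on hypothetical types rather than the real ResourceManager: `Group`, `pickNodeFor` and `selectSource` are names invented for this example, and a single label string per node stands in for the real filter.

```go
package main

// Group is a hypothetical, simplified resource group.
type Group struct {
	Name   string
	Limit  int
	Nodes  map[int64]string // node ID -> label
	Accept string           // label this group accepts; "" accepts any
}

// redundant returns how many nodes the group holds above its limit.
func (g Group) redundant() int {
	if d := len(g.Nodes) - g.Limit; d > 0 {
		return d
	}
	return 0
}

// accepts reports whether the group would take a node with this label.
func (g Group) accepts(label string) bool {
	return g.Accept == "" || g.Accept == label
}

// pickNodeFor returns the smallest node ID in src that target accepts.
// Choosing the smallest ID keeps the result stable across calls.
func pickNodeFor(src, target Group) (int64, bool) {
	var best int64
	found := false
	for id, label := range src.Nodes {
		if target.accepts(label) && (!found || id < best) {
			best, found = id, true
		}
	}
	return best, found
}

// selectSource returns the most redundant group that can actually give
// target a node, together with the node to move. Because the node is
// chosen with the target in hand, the caller never has to re-derive it.
func selectSource(target Group, groups []Group) (string, int64, bool) {
	bestName, bestNode, bestRedundant := "", int64(0), 0
	for _, g := range groups {
		if g.Name == target.Name || g.redundant() <= bestRedundant {
			continue
		}
		if id, ok := pickNodeFor(g, target); ok {
			bestName, bestNode, bestRedundant = g.Name, id, g.redundant()
		}
	}
	return bestName, bestNode, bestRedundant > 0
}
```

A group that is more redundant but holds no acceptable node is skipped, which is exactly the case where filtering on a single preferred node could leave the target starved.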

@@ -633,7 +634,7 @@ func (rm *ResourceManager) selectRedundantRecoverTargetRG(rg *ResourceGroup) *Re
     // First, Transfer node to most missing resource group first.
     if missingRG := rm.findMaxRGWithGivenFilter(
         func(targetRG *ResourceGroup) bool {
-            return rg.GetName() != targetRG.GetName() && targetRG.MissingNumOfNodes() > 0
+            return rg.GetName() != targetRG.GetName() && targetRG.MissingNumOfNodes() > 0 && targetRG.AcceptNode(rg.PreferRemovedNode())
Contributor

Same as selectMissingRecoverSourceRG.

Contributor

mergify bot commented Nov 6, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031
Contributor Author

/run-cpu-e2e

@mergify mergify bot added the ci-passed label Nov 7, 2024
Labels
area/dependency Pull requests that update a dependency file ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement size/XL Denotes a PR that changes 500-999 lines.
3 participants