Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: ft slot capacity check for each trial [DET-9897] #8213

Merged
merged 10 commits into from
Nov 7, 2023

Conversation

jgongd
Copy link
Contributor

@jgongd jgongd commented Oct 20, 2023

Description

Test Plan

e2e tests.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

DET-9897

@jgongd jgongd requested a review from a team as a code owner October 20, 2023 15:59
@cla-bot cla-bot bot added the cla-signed label Oct 20, 2023
// The first line of trial logs is printing expconf which has the regex pattern.
// We skip monitoring this line.
regex := "(.*)(\\\"log_pattern_policies\\\":)(.*)"
compiledRegex, err := l.getCompiledRegex(regex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can move this outside the policy loop and under the range logs loop

@@ -112,3 +112,37 @@ func (t TerminateDecision) String() string {
}
return strings.Join(item, ",")
}

// ValidateResourcePoolAvailabilityParam contains the params for ValidateResourcePoolAvailability().
type ValidateResourcePoolAvailabilityParam struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: call this ValidateResourcePoolAvailabilityRequest to better match other parameters like GetDefaultAuxResourcePoolRequest

totalSlots += len(a.slotStates)
if !disallowedNodes.Contains(a.Handler.Address().Local()) {
totalSlots += len(a.slotStates)
}
}
case rp.config.Provider.AWS != nil:
totalSlots = rp.config.Provider.MaxInstances * rp.config.Provider.AWS.SlotsPerInstance()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we somehow subtract the blocklist nodes fromtotalSlots if the agents blocked are in the provided pool and been launched by the provider?

I'm not sure how hard this will be, if it is too hard I think it might be okay to not do


// ValidateResourcePoolAvailabilityParamOption accepts functions that may modify the
// *ValidateResourcePoolAvailabilityParam instance.
type ValidateResourcePoolAvailabilityParamOption func(v *ValidateResourcePoolAvailabilityParam)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non blocking: These helpers seem like a little overkill to me. I'm not against them but I think just the struct would be fine for me too.

// TODO: The first return value is []command.LaunchWarning{command.CurrentSlotsExceeded}.
// How do we want to handle this? In newExperiment() the warning is part of the response back
// to the client.
_, err = t.rm.ValidateResourcePoolAvailability(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best we can do is just log any launch warnings if they have them?

// TODO: The first return value is []command.LaunchWarning{command.CurrentSlotsExceeded}.
// How do we want to handle this? In newExperiment() the warning is part of the response back
// to the client.
_, err = t.rm.ValidateResourcePoolAvailability(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be moved earlier in the function, so we don't

call this prom.AssociateJobExperiment(t.jobID, strconv.Itoa(t.experimentID), t.config.Labels()) or any other functions we might not want to?

// TODO: The first return value is []command.LaunchWarning{command.CurrentSlotsExceeded}.
// How do we want to handle this? In newExperiment() the warning is part of the response back
// to the client.
_, err = t.rm.ValidateResourcePoolAvailability(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't want to run this in the restore case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For understanding, what's the issue if we run this in the restore case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want skip the restore case because I think there is a possibility not all agents will be reconnected when the job is restoring.

In addition, really the only time we want this to fail is when a trial decides to reschedule and finds it can't.

@NicholasBlaskey NicholasBlaskey requested a review from a team as a code owner October 25, 2023 19:36
@NicholasBlaskey NicholasBlaskey requested review from wes-turner and removed request for a team October 25, 2023 19:36
@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch from d9fc86d to dbf1f4c Compare October 27, 2023 16:22
Copy link
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

I think e2e test is pretty much the only non nit comment

switch {
case rp.config.Provider == nil:
rp.agentStatesCache = rp.fetchAgentStates(ctx)

defer func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this defer to outside of the switch so we are consistent about unsetting the cache

continue
}

regex = fmt.Sprintf("(.*)%s(.*)", policy.Pattern())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the (.*), I think this undoes a change that removed it in the original PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a cherry pick on top of your last commit in ft-hw-failure-e2e-cleanup. Forgot to check diff.

@@ -61,9 +61,21 @@ func (l *logPatternPolicies) monitor(ctx context.Context,
if log.AgentID == nil {
return fmt.Errorf("agentID must be non nil to monitor logs")
}
// The first line of trial logs is printing expconf which has the regex pattern.
// We skip monitoring this line.
regex := "(.*)(\\\"log_pattern_policies\\\":)(.*)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this a package level variable and use mustCompile since we will always need this regex

Like we do for latin text

latinText = regexp.MustCompile("[^[:graph:]\\s]")

if err != nil {
return err
}
if compiledRegex.MatchString(log.Log) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check can go outside of the policy loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@@ -459,6 +476,7 @@ func (t *trial) maybeAllocateTask() error {
Debugf("starting new trial allocation")

prom.AssociateJobExperiment(t.jobID, strconv.Itoa(t.experimentID), t.config.Labels())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't add line here

master/internal/trial.go Outdated Show resolved Hide resolved
// TODO: The first return value is []command.LaunchWarning{command.CurrentSlotsExceeded}.
// How do we want to handle this? In newExperiment() the warning is part of the response back
// to the client.
_, err = t.rm.ValidateResourcePoolAvailability(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want skip the restore case because I think there is a possibility not all agents will be reconnected when the job is restoring.

In addition, really the only time we want this to fail is when a trial decides to reschedule and finds it can't.

master/internal/rm/resource_manager_iface.go Show resolved Hide resolved
@cla-bot
Copy link

cla-bot bot commented Oct 31, 2023

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Ubuntu.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla-signed label Oct 31, 2023
@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch from 4a2a6bd to 373bbfa Compare October 31, 2023 12:44
@cla-bot cla-bot bot added the cla-signed label Oct 31, 2023
@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch 2 times, most recently from aa61403 to a649b2a Compare October 31, 2023 15:12
@NicholasBlaskey NicholasBlaskey force-pushed the ft-hw-failure-e2e-cleanup branch 2 times, most recently from 68b9d66 to 5698fe6 Compare October 31, 2023 15:35
@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch from a649b2a to b8a4395 Compare October 31, 2023 16:01
Base automatically changed from ft-hw-failure-e2e-cleanup to main October 31, 2023 16:10
@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch from b8a4395 to 52848a2 Compare October 31, 2023 17:57
Copy link

netlify bot commented Oct 31, 2023

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 74e6ac5
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/654a4d616faa5b00088f23d2
😎 Deploy Preview https://deploy-preview-8213--determined-ui.netlify.app/
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

if msg.TaskID != nil {
blockedNodes, err := logpattern.GetBlockedNodes(context.TODO(), *msg.TaskID)
if err != nil {
panic(err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't panic here instead we can

ctx.Respond(err)
return nil

@@ -381,6 +381,33 @@ func (t *trial) maybeAllocateTask() error {
}

restoredAllocation, err := t.maybeRestoreAllocation()

if restoredAllocation == nil {
launchWarnings, err := t.rm.ValidateResourcePoolAvailability(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we put this inside another function?

we should also call it after we check the err from t.maybeRestoreAllocation


if restoredAllocation == nil {
launchWarnings, err := t.rm.ValidateResourcePoolAvailability(
&sproto.ValidateResourcePoolAvailabilityRequest{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add another check that we only check the capacity if we have at least one blocked nodes from logpattern.GetBlockedNodes(context.TODO(), *msg.TaskID)

I am somewhat worried that someone might lose a bunch of agents and all their jobs would fail because of this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this only apply to the restarted experiment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add the check right before we call ValidateResourcePoolAvailability

Yes so only don't check for the restart case of calling ValidateResourcePoolAvailability.

@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch 2 times, most recently from 6aadb8a to dfa9413 Compare October 31, 2023 20:51
Copy link
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM nice work

Comment on lines 797 to 800
return t.transition(model.StateWithReason{
State: model.ErrorState,
InformationalReason: exitReason,
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just log this and move on, we don't need to error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job stuck in queue without this. Output of test_log_policy_exclude_node_single_agent:

ERRO[2023-11-01T14:13:08-04:00] task ID 57.65b30b87-e141-41fe-8816-7e1ee1a0ff36 slots requested exceeds default resource pool capacity
DEBU[2023-11-01T14:13:08-04:00] starting new trial allocation                 allocation-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36.2 component=trial experiment-id=57 job-id=f54aab51-d245-46f5-a200-c8830908fe46 task-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36 task-type=TRIAL trial-id=57 trial-run-id=2
DEBU[2023-11-01T14:13:08-04:00] requestResources add allocation               experiment-id=57 job-id=f54aab51-d245-46f5-a200-c8830908fe46 task-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36 task-type=TRIAL trial-id=57 trial-run-id=2
INFO[2023-11-01T14:13:08-04:00] resources are requested by Trial 57 (Experiment 57) (Allocation ID: 57.65b30b87-e141-41fe-8816-7e1ee1a0ff36.2)  actor-local-addr=default actor-system=master allocation-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36.2 go-type=resourcePool resource-pool=default restore=false restoring=false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh my bad, I thought the above code handled this check

a.m.config.LaunchError &&

https://docs.determined.ai/latest/reference/deploy/master-config-reference.html#launch-error

I think we should just return an error only if that launch-error config is set?

@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch from f18edd1 to f353484 Compare November 1, 2023 21:12
@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch 2 times, most recently from d8bcd7d to 60b8e9d Compare November 6, 2023 18:05
@jgongd jgongd force-pushed the jgong-ft-capacity-check-rebased branch from 60b8e9d to 74e6ac5 Compare November 7, 2023 14:44
@jgongd jgongd merged commit 8418029 into main Nov 7, 2023
68 of 80 checks passed
@jgongd jgongd deleted the jgong-ft-capacity-check-rebased branch November 7, 2023 15:17
@@ -0,0 +1,728 @@
// Code generated by mockery v2.20.0. DO NOT EDIT.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to .gitgnore mocks or something.

}
if len(blockedNodes) > 0 {
if err := t.checkResourcePoolRemainingCapacity(); err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did we test this? i don't think just returning an error is quite right here. i think like all the code around it, we need to transition to a terminal state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an invariant here we need to have commented so folks don't miss it (cc @erikwilson ). handleAllocationExit must transition to a terminal state or reallocate, otherwise the system is in an invalid state.

InformationalReason: msg,
})
}
logrus.Warn(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use t.syslog here, so we get more info on the structured log

@dannysauer dannysauer added this to the 0.26.4 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants