chore: ft slot capacity check for each trial [DET-9897] #8213
// The first line of trial logs is printing expconf which has the regex pattern.
// We skip monitoring this line.
regex := "(.*)(\\\"log_pattern_policies\\\":)(.*)"
compiledRegex, err := l.getCompiledRegex(regex)
We can move this outside the policy loop and under the range logs loop.
@@ -112,3 +112,37 @@ func (t TerminateDecision) String() string {
}
return strings.Join(item, ",")
}

// ValidateResourcePoolAvailabilityParam contains the params for ValidateResourcePoolAvailability().
type ValidateResourcePoolAvailabilityParam struct {
nit: call this ValidateResourcePoolAvailabilityRequest to better match other parameters like GetDefaultAuxResourcePoolRequest.
totalSlots += len(a.slotStates)
if !disallowedNodes.Contains(a.Handler.Address().Local()) {
totalSlots += len(a.slotStates)
}
}
case rp.config.Provider.AWS != nil:
totalSlots = rp.config.Provider.MaxInstances * rp.config.Provider.AWS.SlotsPerInstance()
Could we somehow subtract the blocklisted nodes from totalSlots if the blocked agents are in the provided pool and were launched by the provider?
I'm not sure how hard this will be; if it is too hard, I think it might be okay not to do it.
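One way the suggestion above could look, as a minimal sketch only. The function name, the agent ID strings, and the idea of passing the launched agents as a slice are all assumptions for illustration; the real resource pool types are not shown in this thread. The sketch also assumes a blocked instance keeps occupying one of the provider's instance slots, which the discussion does not settle.

```go
package main

import "fmt"

// providerCapacity estimates the slots a trial can still use in an
// autoscaled pool. It starts from maxInstances * slotsPerInstance and
// subtracts one instance worth of slots for every launched agent that is
// on the blocklist. Hypothetical helper, not the code in the PR.
func providerCapacity(
	maxInstances, slotsPerInstance int,
	launchedAgents []string,
	blocked map[string]bool,
) int {
	total := maxInstances * slotsPerInstance
	for _, id := range launchedAgents {
		if blocked[id] {
			total -= slotsPerInstance
		}
	}
	if total < 0 {
		total = 0
	}
	return total
}

func main() {
	blocked := map[string]bool{"agent-b": true, "agent-not-in-pool": true}
	fmt.Println(providerCapacity(4, 8, []string{"agent-a", "agent-b"}, blocked))
}
```

Blocked agents that are not in the pool are ignored, which matches the condition in the comment that only agents launched by this provider should reduce the total.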

// ValidateResourcePoolAvailabilityParamOption accepts functions that may modify the
// *ValidateResourcePoolAvailabilityParam instance.
type ValidateResourcePoolAvailabilityParamOption func(v *ValidateResourcePoolAvailabilityParam)
Non-blocking: these helpers seem like a little overkill to me. I'm not against them, but I think just the struct would be fine too.
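For comparison, a sketch of the struct-only style the comment prefers. The type name follows the rename requested earlier in the review; the fields (Name, Slots, TaskID) and the describe helper are assumptions made up for this example, since the real field list is not visible here.

```go
package main

import "fmt"

// ValidateResourcePoolAvailabilityRequest carries the inputs directly.
// Field names are hypothetical; the thread does not show the real ones.
type ValidateResourcePoolAvailabilityRequest struct {
	Name   string  // resource pool name
	Slots  int     // slots requested
	TaskID *string // optional; nil when no task is associated
}

// describe formats a request for logging. Hypothetical helper used only to
// show that callers can build the struct inline with no option functions.
func describe(r *ValidateResourcePoolAvailabilityRequest) string {
	task := "<none>"
	if r.TaskID != nil {
		task = *r.TaskID
	}
	return fmt.Sprintf("pool=%s slots=%d task=%s", r.Name, r.Slots, task)
}

func main() {
	taskID := "57.example"
	fmt.Println(describe(&ValidateResourcePoolAvailabilityRequest{
		Name:   "default",
		Slots:  4,
		TaskID: &taskID,
	}))
}
```

Optional inputs are expressed as zero values or nil pointers, so a separate option type is not required for a request this small.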
master/internal/trial.go
Outdated
// TODO: The first return value is []command.LaunchWarning{command.CurrentSlotsExceeded}.
// How do we want to handle this? In newExperiment() the warning is part of the response back
// to the client.
_, err = t.rm.ValidateResourcePoolAvailability(
I think the best we can do is just log any launch warnings, if there are any.
master/internal/trial.go
Outdated
// TODO: The first return value is []command.LaunchWarning{command.CurrentSlotsExceeded}.
// How do we want to handle this? In newExperiment() the warning is part of the response back
// to the client.
_, err = t.rm.ValidateResourcePoolAvailability(
Can this be moved earlier in the function, so we don't call prom.AssociateJobExperiment(t.jobID, strconv.Itoa(t.experimentID), t.config.Labels()) or any other functions we might not want to?
master/internal/trial.go
Outdated
// TODO: The first return value is []command.LaunchWarning{command.CurrentSlotsExceeded}.
// How do we want to handle this? In newExperiment() the warning is part of the response back
// to the client.
_, err = t.rm.ValidateResourcePoolAvailability(
I think we don't want to run this in the restore case
For understanding, what's the issue if we run this in the restore case?
We want to skip the restore case because I think there is a possibility that not all agents will have reconnected when the job is restoring.
In addition, really the only time we want this to fail is when a trial decides to reschedule and finds it can't.
Looks good to me.
I think the e2e test is pretty much the only non-nit comment.
switch {
case rp.config.Provider == nil:
rp.agentStatesCache = rp.fetchAgentStates(ctx)

defer func() {
Move this defer outside of the switch so we are consistent about unsetting the cache.
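A minimal sketch of the pattern being asked for, with the defer registered before the switch so every branch leaves the cache unset on return. The pool type, its fields, the fetch method, and the placeholder value in the provider branch are hypothetical stand-ins, not the real resource pool code.

```go
package main

import "fmt"

// pool is a hypothetical stand-in for the resource pool in the diff.
type pool struct {
	hasProvider bool
	cache       map[string]int // stand-in for agentStatesCache: agent -> slots
}

// fetch pretends to load agent states.
func (p *pool) fetch() map[string]int {
	return map[string]int{"agent-1": 4, "agent-2": 4}
}

// totalSlots clears the cache on every return path because the defer is
// registered before the switch, not inside a single case.
func (p *pool) totalSlots() int {
	defer func() { p.cache = nil }()

	total := 0
	switch {
	case !p.hasProvider:
		p.cache = p.fetch()
		for _, slots := range p.cache {
			total += slots
		}
	default:
		total = 16 // placeholder for a provider-based estimate
	}
	return total
}

func main() {
	p := &pool{}
	fmt.Println(p.totalSlots(), p.cache == nil)
}
```

Since a deferred function runs when the enclosing function returns regardless of where it was registered, moving it out of the case mainly makes the cleanup unconditional and easier to see.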
continue
}

regex = fmt.Sprintf("(.*)%s(.*)", policy.Pattern())
Remove the (.*); I think this undoes a change that removed it in the original PR.
I did a cherry-pick on top of your last commit in ft-hw-failure-e2e-cleanup and forgot to check the diff.
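Why the (.*) wrappers are unnecessary, as a small sketch: Go's regexp MatchString reports whether the string contains any match, so the pattern is already unanchored. The matchesPolicy helper and the sample log line are made up for the illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesPolicy reports whether a log line contains the policy pattern.
// MatchString is unanchored, so no leading or trailing (.*) is needed.
// Hypothetical helper for illustration.
func matchesPolicy(pattern, line string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(line), nil
}

func main() {
	line := "rank 0: CUDA out of memory on device 3"
	pattern := "CUDA out of memory"

	bare, _ := matchesPolicy(pattern, line)
	wrapped, _ := matchesPolicy(fmt.Sprintf("(.*)%s(.*)", pattern), line)
	fmt.Println(bare, wrapped)
}
```

Both forms agree on whether a line matches; the bare pattern simply avoids two capture groups that are never used.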
@@ -61,9 +61,21 @@ func (l *logPatternPolicies) monitor(ctx context.Context,
if log.AgentID == nil {
return fmt.Errorf("agentID must be non nil to monitor logs")
}
// The first line of trial logs is printing expconf which has the regex pattern.
// We skip monitoring this line.
regex := "(.*)(\\\"log_pattern_policies\\\":)(.*)"
Can we make this a package-level variable and use regexp.MustCompile, since we will always need this regex? Like we do for latin text:
determined/master/internal/api_user.go
Line 36 in a33d2b8
latinText = regexp.MustCompile("[^[:graph:]\\s]")
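A sketch of what that could look like for the expconf check, combined with the earlier suggestion to run the check once per log line before any per-policy work. The names expconfLine and monitorableLogs are hypothetical, and the simplified pattern (without the surrounding groups) is an assumption that it matches the same lines as the one in the diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// expconfLine matches the trial log line that prints the experiment config.
// It is compiled once at package load; MustCompile panics on a bad pattern,
// which is acceptable for a constant expression.
var expconfLine = regexp.MustCompile(`"log_pattern_policies":`)

// monitorableLogs returns the lines that policies should be checked against.
// The expconf line is dropped here, before any per-policy loop runs.
// Hypothetical helper, not the function in the PR.
func monitorableLogs(logs []string) []string {
	var out []string
	for _, line := range logs {
		if expconfLine.MatchString(line) {
			continue
		}
		out = append(out, line)
	}
	return out
}

func main() {
	logs := []string{
		`{"log_pattern_policies": [{"pattern": "CUDA out of memory"}]}`,
		"rank 0: CUDA out of memory",
	}
	fmt.Println(monitorableLogs(logs))
}
```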
if err != nil {
return err
}
if compiledRegex.MatchString(log.Log) {
I think this check can go outside of the policy loop
Good catch!
master/internal/trial.go
Outdated
@@ -459,6 +476,7 @@ func (t *trial) maybeAllocateTask() error {
Debugf("starting new trial allocation")

prom.AssociateJobExperiment(t.jobID, strconv.Itoa(t.experimentID), t.config.Labels())

nit: don't add a blank line here
if msg.TaskID != nil {
blockedNodes, err := logpattern.GetBlockedNodes(context.TODO(), *msg.TaskID)
if err != nil {
panic(err.Error())
Don't panic here; instead we can do:
ctx.Respond(err)
return nil
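The shape of that suggestion, sketched with stand-in types so it runs on its own. The responder interface, recorder, handleValidate, and the two lookup functions are all hypothetical; in the diff the context comes from the actor system and the lookup is logpattern.GetBlockedNodes.

```go
package main

import (
	"errors"
	"fmt"
)

// responder is a hypothetical stand-in for the actor context in the diff.
type responder interface {
	Respond(v interface{})
}

// recorder remembers the last response so the example can inspect it.
type recorder struct{ last interface{} }

func (r *recorder) Respond(v interface{}) { r.last = v }

var errLookupFailed = errors.New("blocked node lookup failed")

// failingLookup and okLookup stand in for logpattern.GetBlockedNodes.
func failingLookup(string) ([]string, error) { return nil, errLookupFailed }
func okLookup(string) ([]string, error)      { return []string{"agent-1"}, nil }

// handleValidate sends a lookup failure back to the caller as the response
// and returns nil, so the message handler itself does not crash.
func handleValidate(ctx responder, taskID *string, lookup func(string) ([]string, error)) error {
	var blocked []string
	if taskID != nil {
		nodes, err := lookup(*taskID)
		if err != nil {
			ctx.Respond(err)
			return nil
		}
		blocked = nodes
	}
	ctx.Respond(len(blocked))
	return nil
}

func main() {
	rec := &recorder{}
	id := "57.example"
	_ = handleValidate(rec, &id, failingLookup)
	fmt.Println(rec.last)
}
```

The caller receives the error as the reply and can decide how to handle it, while a panic would take down the handler for every other request as well.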
master/internal/trial.go
Outdated
@@ -381,6 +381,33 @@ func (t *trial) maybeAllocateTask() error {
}

restoredAllocation, err := t.maybeRestoreAllocation()

if restoredAllocation == nil {
launchWarnings, err := t.rm.ValidateResourcePoolAvailability(
Can we put this inside another function?
We should also call it after we check the err from t.maybeRestoreAllocation.
master/internal/trial.go
Outdated

if restoredAllocation == nil {
launchWarnings, err := t.rm.ValidateResourcePoolAvailability(
&sproto.ValidateResourcePoolAvailabilityRequest{
Can we add another check so that we only check the capacity if we have at least one blocked node from logpattern.GetBlockedNodes(context.TODO(), *msg.TaskID)?
I am somewhat worried that someone might lose a bunch of agents and all their jobs would fail because of this.
So this only applies to the restarted experiment?
I would add the check right before we call ValidateResourcePoolAvailability.
Yes, so only don't check for the restart case of calling ValidateResourcePoolAvailability.
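The two guards discussed in this thread (skip restored allocations, and only check when at least one node is blocked) can be sketched as a single predicate. The helper name shouldCheckCapacity is hypothetical, and this is one reading of the conditions in the diff, not the merged code.

```go
package main

import "fmt"

// shouldCheckCapacity reports whether the resource pool capacity check
// should run before a trial allocation. Hypothetical helper.
//   - Restored allocations are skipped, since agents may not have
//     reconnected yet while the job is restoring.
//   - With no blocked nodes there is nothing that could have shrunk the
//     usable pool for this task, so the check is skipped as well.
func shouldCheckCapacity(restored bool, blockedNodes []string) bool {
	if restored {
		return false
	}
	return len(blockedNodes) > 0
}

func main() {
	fmt.Println(shouldCheckCapacity(false, []string{"agent-1"}))
	fmt.Println(shouldCheckCapacity(true, []string{"agent-1"}))
	fmt.Println(shouldCheckCapacity(false, nil))
}
```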
LGTM nice work
master/internal/trial.go
Outdated
return t.transition(model.StateWithReason{
State: model.ErrorState,
InformationalReason: exitReason,
})
I think we can just log this and move on, we don't need to error
The job gets stuck in the queue without this. Output of test_log_policy_exclude_node_single_agent:
ERRO[2023-11-01T14:13:08-04:00] task ID 57.65b30b87-e141-41fe-8816-7e1ee1a0ff36 slots requested exceeds default resource pool capacity
DEBU[2023-11-01T14:13:08-04:00] starting new trial allocation allocation-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36.2 component=trial experiment-id=57 job-id=f54aab51-d245-46f5-a200-c8830908fe46 task-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36 task-type=TRIAL trial-id=57 trial-run-id=2
DEBU[2023-11-01T14:13:08-04:00] requestResources add allocation experiment-id=57 job-id=f54aab51-d245-46f5-a200-c8830908fe46 task-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36 task-type=TRIAL trial-id=57 trial-run-id=2
INFO[2023-11-01T14:13:08-04:00] resources are requested by Trial 57 (Experiment 57) (Allocation ID: 57.65b30b87-e141-41fe-8816-7e1ee1a0ff36.2) actor-local-addr=default actor-system=master allocation-id=57.65b30b87-e141-41fe-8816-7e1ee1a0ff36.2 go-type=resourcePool resource-pool=default restore=false restoring=false
oh my bad, I thought the above code handled this check
determined/master/internal/api_command.go
Line 122 in 7772fb3
a.m.config.LaunchError &&
https://docs.determined.ai/latest/reference/deploy/master-config-reference.html#launch-error
I think we should return an error only if that launch-error config is set?
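A sketch of that behavior: fail only when the launch-error setting is on, otherwise log a warning and continue. The function name, the string warnings, and the sentinel error are assumptions for the example; the real code works with command.LaunchWarning values and the master config.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var errCapacityExceeded = errors.New("slots requested exceeds resource pool capacity")

// resolveCapacityWarning turns launch warnings into an error only when
// launchError is set; otherwise it logs them and lets the caller proceed.
// Hypothetical helper.
func resolveCapacityWarning(launchError bool, warnings []string) error {
	if len(warnings) == 0 {
		return nil
	}
	if launchError {
		return fmt.Errorf("%w: %v", errCapacityExceeded, warnings)
	}
	log.Printf("launch warnings ignored because launch_error is off: %v", warnings)
	return nil
}

func main() {
	warnings := []string{"CurrentSlotsExceeded"}
	fmt.Println(resolveCapacityWarning(true, warnings))
	fmt.Println(resolveCapacityWarning(false, warnings))
}
```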
@@ -0,0 +1,728 @@
// Code generated by mockery v2.20.0. DO NOT EDIT.
We need to .gitignore mocks or something.
}
if len(blockedNodes) > 0 {
if err := t.checkResourcePoolRemainingCapacity(); err != nil {
return err
How did we test this? I don't think just returning an error is quite right here. I think, like all the code around it, we need to transition to a terminal state.
There is an invariant here that we need to have commented so folks don't miss it (cc @erikwilson): handleAllocationExit must transition to a terminal state or reallocate; otherwise the system is in an invalid state.
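One way to make that invariant hard to miss is to have the decision return a value that can only be "terminal" or "reallocate", so a bare error return is not an option. This is an illustrative sketch; exitOutcome, decideAfterExit, and the sentinel error are hypothetical and simplify away the launch-error setting discussed earlier.

```go
package main

import (
	"errors"
	"fmt"
)

// exitOutcome enumerates the only two valid results of handling an
// allocation exit. Hypothetical type for illustration.
type exitOutcome int

const (
	outcomeReallocate exitOutcome = iota
	outcomeTerminal
)

func (o exitOutcome) String() string {
	if o == outcomeTerminal {
		return "terminal"
	}
	return "reallocate"
}

var errPoolTooSmall = errors.New("remaining capacity is too small")

// decideAfterExit always picks one of the two outcomes. A failed capacity
// check with blocked nodes leads to a terminal state, never to an early
// return that leaves the trial neither finished nor rescheduled.
func decideAfterExit(blockedNodes []string, capacityErr error) exitOutcome {
	if len(blockedNodes) > 0 && capacityErr != nil {
		return outcomeTerminal
	}
	return outcomeReallocate
}

func main() {
	fmt.Println(decideAfterExit([]string{"agent-1"}, errPoolTooSmall))
	fmt.Println(decideAfterExit(nil, nil))
}
```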
InformationalReason: msg,
})
}
logrus.Warn(msg)
Use t.syslog here, so we get more info in the structured log.
Description
Test Plan
e2e tests.
Commentary (optional)
Checklist
docs/release-notes/. See Release Note for details.
Ticket
DET-9897