fix: fix missing task stats for queued tasks #9745
Conversation
This reverts commit 6b15672.
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##             main    #9745      +/-   ##
==========================================
- Coverage   54.12%   54.06%   -0.06%
==========================================
  Files        1257     1257
  Lines      154768   154772       +4
  Branches     3479     3479
==========================================
- Hits        83762    83681      -81
- Misses      70860    70945      +85
  Partials      146      146
```
Monitoring the durations: after fixing the total aggregation, and after setting start_time for restored task_stats.
```
@@ -626,10 +626,14 @@ func (a *allocation) resourcesAllocated(msg *sproto.ResourcesAllocated) error {
	}

	now := time.Now().UTC()
	taskStatStartTime := msg.JobSubmissionTime
	if msg.JobSubmissionTime.IsZero() && a.req.Restore {
```
Even if we had jobSubmissionTime, in the case of restores I think we'd want to use the request time, which would be the master start time when we're going through the restore path.
```python
assert day_stats.periodStart == time.strftime("%Y-%m-%d"), "day stats should be for today"
seconds_since_start = time.time() - start_time
# CM-404: checkpoint GC adds to the aggregation.
checkpoint_gc_factor = 5
```
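The test snippet above bounds the aggregated time by the elapsed wall-clock time scaled by a slack factor, since checkpoint GC tasks also feed the aggregation (CM-404). A minimal restatement of that bound, with illustrative names rather than code from the repo:

```go
package main

import "fmt"

// withinGcBound checks that the aggregated queued/run seconds do not exceed
// the elapsed wall-clock seconds times a slack factor. The factor leaves room
// for checkpoint-GC tasks, which contribute extra entries to the aggregation.
func withinGcBound(aggregatedSeconds, secondsSinceStart float64) bool {
	const checkpointGcFactor = 5 // slack for checkpoint-GC contributions
	return aggregatedSeconds <= secondsSinceStart*checkpointGcFactor
}

func main() {
	fmt.Println(withinGcBound(40, 10)) // true: 40 <= 10*5
	fmt.Println(withinGcBound(60, 10)) // false: 60 > 50
}
```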
With the right checkpoint policy I should be able to prevent this, I think.
LGTM, modulo Python lint and a missing release note.
LGTM
Going to land it since we're cutting a new RC, as the changes make sense on their own even if I haven't figured out why/how we saw a high queued time on a few-day-old aggregation. I'll create a separate ticket for that.
LGTM
Force-pushed from c7124d1 to 8051140.
fix a few issues around null and zero start_time on recorded queued task_stats (cherry picked from commit b8c6773)
Ticket
https://hpe-aiatscale.atlassian.net/browse/CM-34
The hypothesis is that the issue happens when the master bounces and pending tasks get restored. This reproduces reliably locally (at least on the agent RM).
Description
Treat task_stats with nil start time in the total queued aggregation calculation the same way we do in the day-by-day aggregation.

Test Plan
e2e tests for the on-the-fly calculations; manually tested the total aggregation.
Checklist
- Release note added under docs/release-notes/ (see Release Note for details).