fix: fix missing task stats for queued tasks #9745

hamidzr · 2024-07-29T18:10:57Z

Ticket

https://hpe-aiatscale.atlassian.net/browse/CM-34

the hypothesis is that the issue happens when the master bounces and pending tasks get restored which reproduces reliably locally (at least on the agent rm)

Description

use RequestTime for recording restored tasks stats.
ignore task_stats with nil start time for total queued aggregation calculation the same way we do for day by day aggregation.
apply the same for same-day API responses to calculating queued stats.

Test Plan

e2e tests for the on-the-fly calculations. manually tested the total aggregation

     assert (
            day_stats.seconds <= seconds_since_start * checkpoint_gc_factor
        ), "queued seconds should be less than when test started"
E       AssertionError: queued seconds should be less than when test started
E       assert 21285962000.0 <= (27.579550981521606 * 5)
E        +  where 21285962000.0 = <determined.common.api.bindings.v1AggregateQueueStats object at 0x10ea147f0>.seconds

e2e_tests/tests/cluster/test_master_restart.py:53: AssertionError

Checklist

Changes have been manually QA'd
New features have been approved by the corresponding PM
User-facing API changes have the "User-facing API Change" label
Release notes have been added as a separate file under docs/release-notes/
See Release Note for details.
Licenses have been included for new code which was copied and/or modified from any external code

This reverts commit 6b15672.

netlify · 2024-07-29T18:11:16Z

✅ Deploy Preview for determined-ui canceled.

Name	Link
🔨 Latest commit	`8051140`
🔍 Latest deploy log	https://app.netlify.com/sites/determined-ui/deploys/66a9242da9a500000850ed63

codecov · 2024-07-29T18:13:01Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.06%. Comparing base (0df7ad3) to head (8051140).
Report is 14 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9745      +/-   ##
==========================================
- Coverage   54.12%   54.06%   -0.06%     
==========================================
  Files        1257     1257              
  Lines      154768   154772       +4     
  Branches     3479     3479              
==========================================
- Hits        83762    83681      -81     
- Misses      70860    70945      +85     
  Partials      146      146

Flag	Coverage Δ
backend	`44.86% <100.00%> (-0.01%)`	⬇️
harness	`72.35% <ø> (-0.27%)`	⬇️
web	`53.39% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
master/internal/rm/db.go	`81.25% <100.00%> (+0.60%)`	⬆️
master/internal/task/allocation.go	`76.14% <100.00%> (-0.20%)`	⬇️

... and 6 files with indirect coverage changes

hamidzr · 2024-07-29T18:14:34Z

monitoring the durations:

select 
   a.start_time as all_s, 
   a.end_time as all_ee, 
   a.end_time - a.start_time as all_duration, 
   ts.start_time que_start, 
   ts.end_time que_end, 
   ts.end_time - ts.start_time as que_duration,
   a.allocation_id, 
   a.task_id
 from 
   allocations a
 join 
   task_stats ts on a.allocation_id = ts.allocation_id
 where 
   ts.event_type = 'QUEUED'
 order by 
   ts.start_time;

post fixing total aggregation

determined> select start_time, end_time from task_stats; select start_time, end_time from allocations;
+-------------------------------+-------------------------------+
| start_time                    | end_time                      |
|-------------------------------+-------------------------------|
| 2024-07-29 18:06:29.753275+00 | 2024-07-29 18:06:30.026537+00 |
| 2024-07-29 18:06:29.753275+00 | 2024-07-29 18:06:53.072657+00 |
| 2024-07-29 18:07:36.518763+00 | 2024-07-29 18:07:36.655546+00 |
| 0001-01-01 00:00:00+00        | 2024-07-29 18:07:43.935869+00 |
| 0001-01-01 00:00:00+00        | 2024-07-29 18:07:47.991606+00 |
| 2024-07-29 18:07:36.518763+00 | 2024-07-29 18:07:59.991408+00 |
+-------------------------------+-------------------------------+
SELECT 6
+-------------------------+----------------------------+
| start_time              | end_time                   |
|-------------------------+----------------------------|
| 2024-07-29 18:07:36.702 | 2024-07-29 18:07:59.430831 |
| 2024-07-29 18:06:30.042 | 2024-07-29 18:06:52.634289 |
| 2024-07-29 18:08:00.062 | 2024-07-29 18:08:00.062    |
| 2024-07-29 18:06:53.125 | 2024-07-29 18:07:02.723819 |
+-------------------------+----------------------------+
SELECT 4
Time: 0.019s
determined> select * from resource_aggregates;
+------------+------------------+-----------------+-----------+
| date       | aggregation_type | aggregation_key | seconds   |
|------------+------------------+-----------------+-----------|
| 2024-07-29 | username         | admin           | 45.32112  |
| 2024-07-29 | resource_pool    | default         | 45.32112  |
| 2024-07-29 | total            | total           | 45.32112  |
| 2024-07-29 | queued           | default         | 11.800518 |
| 2024-07-29 | queued           | total           | 11.800518 |
+------------+------------------+-----------------+-----------+
SELECT 5
Time: 0.017s
determined>

post setting start_time for restored task_stats

determined> select start_time, end_time, end_time - start_time as duration, allocation_id from task_stats where event_type = 'QUEUED' order by start_time; select start_time, end_time, end_time - start_time as durati
 on, allocation_id from allocations order by start_time;
+-------------------------------+-------------------------------+----------------+------------------------------------------+
| start_time                    | end_time                      | duration       | allocation_id                            |
|-------------------------------+-------------------------------+----------------+------------------------------------------|
| 2024-07-29 19:29:10.873954+00 | 2024-07-29 19:29:14.123556+00 | 0:00:03.249602 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:10.873954+00 | 2024-07-29 19:29:38.295098+00 | 0:00:27.421144 | 1.c0b31654-2ee3-4f8c-9076-6556ef7a097c.1 |
| 2024-07-29 19:29:18.384781+00 | 2024-07-29 19:29:18.387981+00 | 0:00:00.003200 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:23.04933+00  | 2024-07-29 19:29:23.052279+00 | 0:00:00.002949 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:27.844883+00 | 2024-07-29 19:29:27.849668+00 | 0:00:00.004785 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
+-------------------------------+-------------------------------+----------------+------------------------------------------+
SELECT 5
+-------------------------+----------------------------+----------------+------------------------------------------+
| start_time              | end_time                   | duration       | allocation_id                            |
|-------------------------+----------------------------+----------------+------------------------------------------|
| 2024-07-29 19:29:14.167 | 2024-07-29 19:29:38.074801 | 0:00:23.907801 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:38.342 | 2024-07-29 19:29:47.840999 | 0:00:09.498999 | 1.c0b31654-2ee3-4f8c-9076-6556ef7a097c.1 |
+-------------------------+----------------------------+----------------+------------------------------------------+
SELECT 2
Time: 0.019s
determined>

hamidzr · 2024-07-29T19:32:59Z

master/internal/task/allocation.go

@@ -626,10 +626,14 @@ func (a *allocation) resourcesAllocated(msg *sproto.ResourcesAllocated) error {
 	}

 	now := time.Now().UTC()
+	taskStatStartTime := msg.JobSubmissionTime
+	if msg.JobSubmissionTime.IsZero() && a.req.Restore {


even if we had jobSubmissionTime, in case of restores I think we'd want to use the request time? which would be the master start time when we're going through the restore path

hamidzr · 2024-07-29T21:44:17Z

e2e_tests/tests/cluster/test_master_restart.py

+        assert day_stats.periodStart == time.strftime("%Y-%m-%d"), "day stats should be for today"
+        seconds_since_start = time.time() - start_time
+        # CM-404: checkpoint GC adds to the aggregation.
+        checkpoint_gc_factor = 5


with the right checkpoint policy I should be able to prevent this I think

carolinaecalderon

LGTM, modulo python lint + missing a release note

ShreyaLnuHpe

LGTM

hamidzr · 2024-07-30T17:12:58Z

going to land it since we're cutting a new RC as the changes make sense on their own even if I haven't figured out why/how we saw a high queued time on a few-day-old aggregation. I'll create a separate ticket for that.

tara-det-ai

LGTM

fix a few issues around null and zero start_time on recorded queued tasks_stats (cherry picked from commit b8c6773)

hamidzr added 10 commits July 29, 2024 00:00

check for nilDate in today's date

57443bc

exclude it when calculating total aggregate as well

e7ee5d1

tmp: test aggregation

6b15672

Revert "tmp: test aggregation"

84ba20a

This reverts commit 6b15672.

use request time for calculating queued time on restored tasks

8e617b0

comment update

cf4572b

add master restart test

5f78fe4

accounted for checkpoint gc timing

e193cf4

less comments

fa960ca

check the aggregate is not accounting for other experiments

173e1f1

cla-bot bot added the cla-signed label Jul 29, 2024

hamidzr changed the title ~~Hz stats missingstart~~ fix: treat missing task stats for queued tasks Jul 29, 2024

carolinaecalderon added to-cherry-pick Pull requests that need to be cherry-picked into the current release and removed to-cherry-pick Pull requests that need to be cherry-picked into the current release labels Jul 29, 2024

hamidzr commented Jul 29, 2024

View reviewed changes

hamidzr requested a review from NicholasBlaskey July 29, 2024 19:38

hamidzr commented Jul 29, 2024

View reviewed changes

hamidzr changed the title ~~fix: treat missing task stats for queued tasks~~ fix: fix missing task stats for queued tasks Jul 29, 2024

carolinaecalderon added the to-cherry-pick Pull requests that need to be cherry-picked into the current release label Jul 30, 2024

carolinaecalderon approved these changes Jul 30, 2024

View reviewed changes

hamidzr added 2 commits July 30, 2024 00:00

lint

1bc824e

docs: add release notes

8051140

NicholasBlaskey approved these changes Jul 30, 2024

View reviewed changes

hamidzr marked this pull request as ready for review July 30, 2024 14:41

hamidzr requested review from a team as code owners July 30, 2024 14:41

hamidzr requested a review from ShreyaLnuHpe July 30, 2024 14:41

ShreyaLnuHpe approved these changes Jul 30, 2024

View reviewed changes

determined-ci added the documentation Improvements or additions to documentation label Jul 30, 2024

determined-ci requested a review from a team July 30, 2024 17:22

hamidzr enabled auto-merge (squash) July 30, 2024 17:23

hamidzr requested a review from tara-det-ai July 30, 2024 17:23

tara-det-ai approved these changes Jul 30, 2024

View reviewed changes

hamidzr force-pushed the hz-stats-missingstart branch from c7124d1 to 8051140 Compare July 30, 2024 17:34

determined-ci requested a review from a team July 30, 2024 17:34

hamidzr merged commit b8c6773 into main Jul 30, 2024
83 of 97 checks passed

hamidzr deleted the hz-stats-missingstart branch July 30, 2024 17:58

github-actions bot pushed a commit that referenced this pull request Jul 30, 2024

fix: fix missing task_stats start_time on restored allocation (#9745)

5cb9a32

fix a few issues around null and zero start_time on recorded queued tasks_stats (cherry picked from commit b8c6773)

hamidzr mentioned this pull request Aug 5, 2024

fix: alway use restored allocation request time for queue recording #9799

Draft

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: fix missing task stats for queued tasks #9745

fix: fix missing task stats for queued tasks #9745

hamidzr commented Jul 29, 2024 •

edited

Loading

netlify bot commented Jul 29, 2024 •

edited

Loading

codecov bot commented Jul 29, 2024 •

edited

Loading

hamidzr commented Jul 29, 2024 •

edited

Loading

hamidzr Jul 29, 2024 •

edited

Loading

hamidzr Jul 29, 2024

carolinaecalderon left a comment •

edited

Loading

ShreyaLnuHpe left a comment

hamidzr commented Jul 30, 2024

tara-det-ai left a comment

fix: fix missing task stats for queued tasks #9745

fix: fix missing task stats for queued tasks #9745

Conversation

hamidzr commented Jul 29, 2024 • edited Loading

Ticket

Description

Test Plan

Checklist

netlify bot commented Jul 29, 2024 • edited Loading

✅ Deploy Preview for determined-ui canceled.

codecov bot commented Jul 29, 2024 • edited Loading

Codecov Report

hamidzr commented Jul 29, 2024 • edited Loading

hamidzr Jul 29, 2024 • edited Loading

Choose a reason for hiding this comment

hamidzr Jul 29, 2024

Choose a reason for hiding this comment

carolinaecalderon left a comment • edited Loading

Choose a reason for hiding this comment

ShreyaLnuHpe left a comment

Choose a reason for hiding this comment

hamidzr commented Jul 30, 2024

tara-det-ai left a comment

Choose a reason for hiding this comment

hamidzr commented Jul 29, 2024 •

edited

Loading

netlify bot commented Jul 29, 2024 •

edited

Loading

codecov bot commented Jul 29, 2024 •

edited

Loading

hamidzr commented Jul 29, 2024 •

edited

Loading

hamidzr Jul 29, 2024 •

edited

Loading

carolinaecalderon left a comment •

edited

Loading