Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix missing task stats for queued tasks #9745

Merged
merged 12 commits into from
Jul 30, 2024
Merged

Conversation

hamidzr
Copy link
Member

@hamidzr hamidzr commented Jul 29, 2024

Ticket

https://hpe-aiatscale.atlassian.net/browse/CM-34

the hypothesis is that the issue happens when the master bounces and pending tasks get restored which reproduces reliably locally (at least on the agent rm)

Description

  • use RequestTime for recording restored tasks stats.
  • ignore task_stats with nil start time for total queued aggregation calculation the same way we do for day by day aggregation.
  • apply the same for same-day API responses to calculating queued stats.

Test Plan

e2e tests for the on-the-fly calculations. manually tested the total aggregation

     assert (
            day_stats.seconds <= seconds_since_start * checkpoint_gc_factor
        ), "queued seconds should be less than when test started"
E       AssertionError: queued seconds should be less than when test started
E       assert 21285962000.0 <= (27.579550981521606 * 5)
E        +  where 21285962000.0 = <determined.common.api.bindings.v1AggregateQueueStats object at 0x10ea147f0>.seconds

e2e_tests/tests/cluster/test_master_restart.py:53: AssertionError

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Jul 29, 2024
Copy link

netlify bot commented Jul 29, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 8051140
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66a9242da9a500000850ed63

@hamidzr hamidzr changed the title Hz stats missingstart fix: treat missing task stats for queued tasks Jul 29, 2024
Copy link

codecov bot commented Jul 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.06%. Comparing base (0df7ad3) to head (8051140).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9745      +/-   ##
==========================================
- Coverage   54.12%   54.06%   -0.06%     
==========================================
  Files        1257     1257              
  Lines      154768   154772       +4     
  Branches     3479     3479              
==========================================
- Hits        83762    83681      -81     
- Misses      70860    70945      +85     
  Partials      146      146              
Flag Coverage Δ
backend 44.86% <100.00%> (-0.01%) ⬇️
harness 72.35% <ø> (-0.27%) ⬇️
web 53.39% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/rm/db.go 81.25% <100.00%> (+0.60%) ⬆️
master/internal/task/allocation.go 76.14% <100.00%> (-0.20%) ⬇️

... and 6 files with indirect coverage changes

@hamidzr
Copy link
Member Author

hamidzr commented Jul 29, 2024

monitoring the durations:

select 
   a.start_time as all_s, 
   a.end_time as all_ee, 
   a.end_time - a.start_time as all_duration, 
   ts.start_time que_start, 
   ts.end_time que_end, 
   ts.end_time - ts.start_time as que_duration,
   a.allocation_id, 
   a.task_id
 from 
   allocations a
 join 
   task_stats ts on a.allocation_id = ts.allocation_id
 where 
   ts.event_type = 'QUEUED'
 order by 
   ts.start_time;

post fixing total aggregation

determined> select start_time, end_time from task_stats; select start_time, end_time from allocations;
+-------------------------------+-------------------------------+
| start_time                    | end_time                      |
|-------------------------------+-------------------------------|
| 2024-07-29 18:06:29.753275+00 | 2024-07-29 18:06:30.026537+00 |
| 2024-07-29 18:06:29.753275+00 | 2024-07-29 18:06:53.072657+00 |
| 2024-07-29 18:07:36.518763+00 | 2024-07-29 18:07:36.655546+00 |
| 0001-01-01 00:00:00+00        | 2024-07-29 18:07:43.935869+00 |
| 0001-01-01 00:00:00+00        | 2024-07-29 18:07:47.991606+00 |
| 2024-07-29 18:07:36.518763+00 | 2024-07-29 18:07:59.991408+00 |
+-------------------------------+-------------------------------+
SELECT 6
+-------------------------+----------------------------+
| start_time              | end_time                   |
|-------------------------+----------------------------|
| 2024-07-29 18:07:36.702 | 2024-07-29 18:07:59.430831 |
| 2024-07-29 18:06:30.042 | 2024-07-29 18:06:52.634289 |
| 2024-07-29 18:08:00.062 | 2024-07-29 18:08:00.062    |
| 2024-07-29 18:06:53.125 | 2024-07-29 18:07:02.723819 |
+-------------------------+----------------------------+
SELECT 4
Time: 0.019s
determined> select * from resource_aggregates;
+------------+------------------+-----------------+-----------+
| date       | aggregation_type | aggregation_key | seconds   |
|------------+------------------+-----------------+-----------|
| 2024-07-29 | username         | admin           | 45.32112  |
| 2024-07-29 | resource_pool    | default         | 45.32112  |
| 2024-07-29 | total            | total           | 45.32112  |
| 2024-07-29 | queued           | default         | 11.800518 |
| 2024-07-29 | queued           | total           | 11.800518 |
+------------+------------------+-----------------+-----------+
SELECT 5
Time: 0.017s
determined>

post setting start_time for restored task_stats

determined> select start_time, end_time, end_time - start_time as duration, allocation_id from task_stats where event_type = 'QUEUED' order by start_time; select start_time, end_time, end_time - start_time as durati
 on, allocation_id from allocations order by start_time;
+-------------------------------+-------------------------------+----------------+------------------------------------------+
| start_time                    | end_time                      | duration       | allocation_id                            |
|-------------------------------+-------------------------------+----------------+------------------------------------------|
| 2024-07-29 19:29:10.873954+00 | 2024-07-29 19:29:14.123556+00 | 0:00:03.249602 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:10.873954+00 | 2024-07-29 19:29:38.295098+00 | 0:00:27.421144 | 1.c0b31654-2ee3-4f8c-9076-6556ef7a097c.1 |
| 2024-07-29 19:29:18.384781+00 | 2024-07-29 19:29:18.387981+00 | 0:00:00.003200 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:23.04933+00  | 2024-07-29 19:29:23.052279+00 | 0:00:00.002949 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:27.844883+00 | 2024-07-29 19:29:27.849668+00 | 0:00:00.004785 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
+-------------------------------+-------------------------------+----------------+------------------------------------------+
SELECT 5
+-------------------------+----------------------------+----------------+------------------------------------------+
| start_time              | end_time                   | duration       | allocation_id                            |
|-------------------------+----------------------------+----------------+------------------------------------------|
| 2024-07-29 19:29:14.167 | 2024-07-29 19:29:38.074801 | 0:00:23.907801 | 1.c075adcd-5c31-4726-a5b3-0b87e141f1fe.1 |
| 2024-07-29 19:29:38.342 | 2024-07-29 19:29:47.840999 | 0:00:09.498999 | 1.c0b31654-2ee3-4f8c-9076-6556ef7a097c.1 |
+-------------------------+----------------------------+----------------+------------------------------------------+
SELECT 2
Time: 0.019s
determined>

@carolinaecalderon carolinaecalderon added to-cherry-pick Pull requests that need to be cherry-picked into the current release and removed to-cherry-pick Pull requests that need to be cherry-picked into the current release labels Jul 29, 2024
@@ -626,10 +626,14 @@ func (a *allocation) resourcesAllocated(msg *sproto.ResourcesAllocated) error {
}

now := time.Now().UTC()
taskStatStartTime := msg.JobSubmissionTime
if msg.JobSubmissionTime.IsZero() && a.req.Restore {
Copy link
Member Author

@hamidzr hamidzr Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if we had jobSubmissionTime, in case of restores I think we'd want to use the request time? which would be the master start time when we're going through the restore path

assert day_stats.periodStart == time.strftime("%Y-%m-%d"), "day stats should be for today"
seconds_since_start = time.time() - start_time
# CM-404: checkpoint GC adds to the aggregation.
checkpoint_gc_factor = 5
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the right checkpoint policy I should be able to prevent this I think

@hamidzr hamidzr changed the title fix: treat missing task stats for queued tasks fix: fix missing task stats for queued tasks Jul 29, 2024
@carolinaecalderon carolinaecalderon added the to-cherry-pick Pull requests that need to be cherry-picked into the current release label Jul 30, 2024
Copy link
Contributor

@carolinaecalderon carolinaecalderon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo python lint + missing a release note

@hamidzr hamidzr marked this pull request as ready for review July 30, 2024 14:41
@hamidzr hamidzr requested review from a team as code owners July 30, 2024 14:41
Copy link
Contributor

@ShreyaLnuHpe ShreyaLnuHpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hamidzr
Copy link
Member Author

hamidzr commented Jul 30, 2024

going to land it since we're cutting a new RC as the changes make sense on their own even if I haven't figured out why/how we saw a high queued time on a few-day-old aggregation. I'll create a separate ticket for that.

@determined-ci determined-ci added the documentation Improvements or additions to documentation label Jul 30, 2024
@determined-ci determined-ci requested a review from a team July 30, 2024 17:22
@hamidzr hamidzr enabled auto-merge (squash) July 30, 2024 17:23
@hamidzr hamidzr requested a review from tara-det-ai July 30, 2024 17:23
Copy link
Member

@tara-det-ai tara-det-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@determined-ci determined-ci requested a review from a team July 30, 2024 17:34
@hamidzr hamidzr merged commit b8c6773 into main Jul 30, 2024
83 of 97 checks passed
@hamidzr hamidzr deleted the hz-stats-missingstart branch July 30, 2024 17:58
github-actions bot pushed a commit that referenced this pull request Jul 30, 2024
fix a few issues around null and zero start_time on recorded queued tasks_stats

(cherry picked from commit b8c6773)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation to-cherry-pick Pull requests that need to be cherry-picked into the current release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants