feat: Connect trial UI to workloads API; pass sort/filter to API #4407

mapmeld · 2022-06-22T13:53:21Z

Description

Connects the "Workloads" table on the single-trial page to the Workloads API. This code is part of a plan to remove the response of all workloads and metrics from the Trial API (#4300). We're splitting that PR up for easier review.

We already have a paginated workloads API (#3635). For the sortable table to connect to the API, this PR adds a sortKey field (default is batch #, this adds sorting by a metric) and filter field (a dropdown near the table selects All / Checkpoints / Validation / Checkpoints or Validation).
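
A rough sketch of what the four dropdown options mean, assuming a workload can be summarized by two booleans. The `WorkloadFilter` values, the `WorkloadSummary` shape, and `matchesFilter` are hypothetical names for illustration only; in this PR the filter is passed to the API and applied there, not in the browser.

```typescript
// Hypothetical names throughout: this only illustrates the dropdown semantics.
type WorkloadFilter = 'all' | 'checkpoint' | 'validation' | 'checkpointOrValidation';

interface WorkloadSummary {
  totalBatches: number;
  hasCheckpoint: boolean;
  hasValidation: boolean;
}

// Returns true when the workload should be listed under the chosen filter.
function matchesFilter(workload: WorkloadSummary, filter: WorkloadFilter): boolean {
  switch (filter) {
    case 'all':
      return true;
    case 'checkpoint':
      return workload.hasCheckpoint;
    case 'validation':
      return workload.hasValidation;
    case 'checkpointOrValidation':
      return workload.hasCheckpoint || workload.hasValidation;
  }
}

const sampleWorkloads: WorkloadSummary[] = [
  { totalBatches: 100, hasCheckpoint: false, hasValidation: false },
  { totalBatches: 200, hasCheckpoint: false, hasValidation: true },
  { totalBatches: 300, hasCheckpoint: true, hasValidation: false },
];

// Keep only the workloads that carry validation metrics.
const validationOnly = sampleWorkloads.filter((w) => matchesFilter(w, 'validation'));
```

Modelling the filter as a closed set of values (rather than free-form strings) is in the spirit of the later "Workloads filter as an enum" commit, though the real enum lives in the proto definition.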

~~This commit includes the patch from #4406.~~ Done.

Remaining issue with same-name metrics:

- `coalesce(t.metrics, v.metrics)` is odd. Can we confirm that no single workload includes both training and validation metrics?
- When we sort by a metric ("loss"), we sort after the `coalesce(t.metrics, v.metrics)` line distinguishing training and validation metrics.

Exploring the issue:

In this test, training-loss and validation-loss both show a sorted-column icon. The training and validation losses are displayed on the correct columns. The training workload 100 and validation workload 100 were returned as separate workloads by the API, but got combined into one table row by workloadsToSteps, and sorted as the highest value for that metric within that page.

latest-master also combines the two batch-100 workloads, and sorts the row as the highest value of all workloads (i.e. DESC shows it first, ASC shows it last).

Checklist

- User-facing API changes need the "User-facing API Change" label.
- Release notes should be added as a separate file under docs/release-notes/.
- Licenses should be included for new code which was copied and/or modified from any external code.

netlify · 2022-06-22T13:53:29Z

✅ Deploy Preview for determined-ui ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `c2e4879` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/determined-ui/deploys/62e95d232831fd0008419749 |
| 😎 Deploy Preview | https://deploy-preview-4407--determined-ui.netlify.app/ |

mapmeld · 2022-06-22T17:17:01Z

update: I think I have a good resolution on table rows and sorting when there are both training-loss and validation-loss. There would be an issue if typically we have workloads with the same batch number and we do want them merged, but I don't think that's the case?

trentwatt

Looks good. The main caveat is that sorting by loss ASC displays nulls first. That may have been the case previously too, but we might as well address it now.

trentwatt · 2022-07-05T04:53:15Z

master/internal/db/postgres_experiments.go

@@ -396,7 +396,7 @@ FROM trials t
 WHERE t.id=$2
  AND s.state = 'COMPLETED'
  AND total_batches >= $3
-  AND total_batches <= $4
+  AND ($4 <= 0 OR total_batches <= $4)


question: don't see a need for it, but it seems like just for reasons of consistency, end_batches = 0 should be a valid filter. is there a reason we need this?

mapmeld · 2022-07-07

For infinite batches we could use -1; does that seem like a better option?

mapmeld · 2022-07-08

I've now noticed these:

`if endBatches <= 0 { endBatches = math.MaxInt32 }`

So I didn't need to change this query in the first place. Leaving this file unchanged.

trentwatt · 2022-07-05T04:53:45Z

master/internal/db/postgres_experiments.go

@@ -423,7 +423,7 @@ JOIN validations v ON t.id = v.trial_id
 WHERE t.id=$2
  AND v.state = 'COMPLETED'
  AND v.total_batches >= $3
-  AND v.total_batches <= $4
+  AND ($4 <= 0 OR v.total_batches <= $4)


same question

master/internal/api_trials.go

proto/src/determined/api/v1/trial.proto

webui/react/src/pages/TrialDetails/TrialDetailsWorkloads.tsx

trentwatt · 2022-07-05T05:40:27Z

webui/react/src/pages/TrialDetails/TrialDetailsWorkloads.tsx

          rowClassName={defaultRowClassName({ clickable: false })}
-          rowKey="batchNum"
+          rowKey="key"


trentwatt · 2022-07-05T05:42:51Z

master/static/srv/proto_get_trial_workloads.sql

@@ -58,7 +64,7 @@ page_info AS (
 )
 SELECT (
  SELECT jsonb_agg(w) FROM (SELECT validation, training, checkpoint FROM workloads
-    ORDER BY total_batches %s, end_time %s
+    ORDER BY %s %s NULLS LAST, total_batches %s, end_time %s


i think maybe there could be an extra NULLS LAST in here somewhere? sorting by loss asc results in nulls first

mapmeld · 2022-07-05

I don't have access to the same experiment, but as validation_loss would be desc here, I'm thinking that you're sorting by another column? And that column variable exists on both training and validation? I had thought it was OK to sort these together, but I forgot how likely it is that only one of the columns would be displayed in the UI. So I probably want to revisit this and make it possible to treat these differently.

mapmeld · 2022-07-05

On second thought, it is more likely doing the correct sort in the API but sorting nulls weirdly in the UI. I have an example locally where I will work on that issue too.

mapmeld · 2022-07-05

✅ Got the UI table's nulls-last sorting working.
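
For readers following the thread, a nulls-last comparator for a client-side table can be sketched as below. This is an illustration of the general technique, not the code from this PR: `compareNullsLast`, `SortDirection`, and `MetricRow` are hypothetical names, and the sketch does not treat NaN specially.

```typescript
type SortDirection = 'asc' | 'desc';

// Missing values sort after real values in BOTH directions,
// mirroring what NULLS LAST does in the SQL ORDER BY.
function compareNullsLast(
  a: number | null | undefined,
  b: number | null | undefined,
  direction: SortDirection,
): number {
  if (a == null && b == null) return 0;
  if (a == null) return 1; // a goes after b
  if (b == null) return -1; // b goes after a
  return direction === 'asc' ? a - b : b - a;
}

// Hypothetical row shape for the example.
interface MetricRow {
  totalBatches: number;
  loss: number | null;
}

const metricRows: MetricRow[] = [
  { totalBatches: 100, loss: 0.42 },
  { totalBatches: 200, loss: null },
  { totalBatches: 300, loss: 0.17 },
];

// slice() copies the array so the original order is left untouched.
const ascendingRows = metricRows.slice().sort((x, y) => compareNullsLast(x.loss, y.loss, 'asc'));
const descendingRows = metricRows.slice().sort((x, y) => compareNullsLast(x.loss, y.loss, 'desc'));
```

The key design point is that the null checks run before the direction is consulted, so flipping between ASC and DESC reorders the real values while the row without a loss stays at the bottom.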

webui/react/src/pages/TrialDetails/TrialDetailsWorkloads.tsx

hkang1

Took a quick peek, looks like @trentwatt got a lot of coverage already and it's looking good! Added some minor things here and there.

proto/src/determined/api/v1/trial.proto

webui/react/src/services/decoder.ts

trentwatt · 2022-07-05T19:45:51Z

> update: I think I have a good resolution on table rows and sorting when there are both training-loss and validation-loss. There would be an issue if typically we have workloads with the same batch number and we do want them merged, but I don't think that's the case?

What do you think we want to do here, @hkang1? Merge or not? If we don't merge, we end up with a single batch having separate rows for training metrics, validation metrics, and checkpoints, at least under certain circumstances.

Merging would require significantly changing the API structure/query logic, though?


hkang1 · 2022-08-02T16:47:47Z

> > update: I think I have a good resolution on table rows and sorting when there are both training-loss and validation-loss. There would be an issue if typically we have workloads with the same batch number and we do want them merged, but I don't think that's the case?
>
> what do you think we want to do here @hkang1 ? merge or not? if no merge, we end up with a single batch having separate rows for training metrics, validation metrics, and checkpoints, at least under certain circumstances.
>
> would require significantly changing API structure/query logic though?

What's the level of effort to preserve the merged behavior?

We are not quite there yet, but the goal is eventually to have a dedicated checkpoints section for each experiment, along with a dedicated data download capability. This workloads table is really for the user to understand what data is being used to draw the chart and to find out whether there are problems with training (e.g. looking for exploding values of NaN, Inf, or -Inf).

Basically, if it's easy to preserve the merged behavior, let's do it; otherwise the separate rows are fine, as it is a temporary state. Being able to sort and paginate metrics and checkpoints more efficiently is much needed given the performance issues we've had around this, and losing the merged behavior is an acceptable cost in comparison, IMO.


mapmeld requested a review from hkang1 June 22, 2022 13:53

mapmeld requested a review from ioga as a code owner June 22, 2022 13:53

cla-bot bot added the cla-signed label Jun 22, 2022

mapmeld requested a review from trentwatt June 28, 2022 16:14

trentwatt self-assigned this Jun 28, 2022

trentwatt approved these changes Jul 5, 2022


hkang1 reviewed Jul 5, 2022


trentwatt removed their assignment Jul 11, 2022

mapmeld added 14 commits August 2, 2022 11:30

- limit changes so workloads table is paginated but nothing else is changed (`9493faa`)
- return validation on checkpoints (`6caee3c`)
- dont merge workloads into one row when theyre on one batch (`f26bb88`)
- lint (`57b8deb`)
- new split-up workloads (`4f0e9fc`)
- nulls last on frontend table, use of ??, use of nil for end batch (`0503e4a`)
- lint (`707dab7`)
- Workloads filter as an enum (`ba8e00d`)
- keep other trial requests compatible with api_trials (`89a0dc0`)
- use Pagination (`44785e1`)
- better polling with rerunOnNewFn (`77f7687`)
- lint (`4770c6e`)
- allow endBatches = math.MaxInt32 to handle max batches (`8a2bb5d`)
- lint (`489e276`)
- avoid changes to how training/validation/checkpoint workloads are tabled for now (`c2e4879`)

mapmeld merged commit 972954d into determined-ai:master Aug 2, 2022

mapmeld deleted the mvp_downsample_v3 branch August 2, 2022 21:02

mapmeld mentioned this pull request Aug 2, 2022

chore: Remove obsolete workloads from Trials API #4703

Merged

4 tasks

dannysauer modified the milestones: 0.0.102, 0.19.1 Feb 6, 2024
